Understanding how biodiversity changes across space, time and human impact is a feat that requires a vast amount of data: new multimodal information, comprehensive observational data and historical time series collected from across Europe’s coasts.
This data is complex and heterogeneous. We want the data used within BIOcean5D to be FAIR (Findable, Accessible, Interoperable, Reusable) and compatible with both global archives and future work. We need to find a way to bridge the gaps, so we can collaborate effectively and, ultimately, provide policy-makers with the holistic information they need to make knowledge-based decisions to protect our ocean.
That’s why we’ve built the BIOcean5D data hub: an integrated platform to streamline the handling of our multi-method, multi-stream biodiversity data.
Scouring for information
We shared the first version of our data hub in May 2024, alongside our deliverable report, which outlines our strategy for the hub’s internal setup, as well as for our exchange of data across the team and with relevant data archives. Our initial priority was to consolidate, unify and enhance existing datasets to support our wider work packages.
“BIOcean5D is a highly collaborative, multidisciplinary project,” says Ivica Letunic, computational biologist and CEO of BioByte Solutions GmbH – a bioinformatics company and BIOcean5D partner. “The data hub can facilitate collaborative science like this by making it simple to access data generated by the project, as well as that available to us from other projects.”
Ivica has coordinated the development of the data hub’s back end (the operational code that can’t be accessed by users), as well as the interactive user interface at the front end. This includes a dedicated area for uploading and sharing documents, protocols, presentations, spreadsheets and other files related to the project.
Ivica and the team have also developed a basic search and navigation engine, covering most aspects of data currently available in the repository. As project data grows and evolves, they plan to expand the search engine to make the data easier to scour – including by selecting georeferenced datasets on a world map.
Metadata: as important as the data itself
As BIOcean5D evolves – and increasing volumes of new primary observation data join existing archives – so too will our data curation needs.
The key to managing these complex datasets, and to finding meaningful patterns in them, is metadata. Metadata enriches raw data with descriptive, contextual information and captures interactions across datasets, ensuring that our data stays FAIR and keeps its integrity over the long term.
Stéphane Pesant, senior marine biocurator at EMBL-EBI, contributes to the development of metadata standards across BIOcean5D and coordinates the curation of metadata from the research expeditions TREC (Traversing European Coastlines) and Tara Europa. “We’ve established the hub so that it can provide associated metadata to help assemble, integrate and analyse data from different sources – such as environmental, imaging and genomics data,” he explains.
To reach this point, the team has had to find compromises between two approaches: 1) selecting, adapting and using a single metadata standard that already exists but isn’t designed to harmonise the different data sources, or 2) creating a new standard that is tailored to the project’s needs and maps to already existing ones.
“The first approach is readily applicable and satisfies those who work mainly with existing standards, but isn’t tailored to all the project’s needs,” Stéphane describes. “The second approach takes more time to develop, but works better for those who use other data types and require additional metadata standards.” Reaching a compromise that works for the project has been a key part of the first iteration of the data hub.
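To make the second approach more concrete, here is a minimal sketch of how a project-level metadata schema could map to existing standards. It is an illustration only, not the hub's actual implementation: the project-level field names, the `FIELD_MAP` table and the `export_record` function are hypothetical, and Darwin Core and MIxS are chosen here purely as examples of existing standards, not because the project has said it uses them.

```python
# Hypothetical project-level fields mapped to terms from two existing standards.
# The field names and the choice of standards are assumptions for illustration.
# None marks a field with no direct equivalent in that standard.
FIELD_MAP = {
    "collection_date": {"darwin_core": "eventDate", "mixs": "collection_date"},
    "depth_m": {"darwin_core": "minimumDepthInMeters", "mixs": "depth"},
    "imaging_instrument": {"darwin_core": None, "mixs": None},
}


def export_record(record, standard):
    """Rename a project-level record's fields to the target standard's terms.

    Returns (exported, unmapped): the translated fields, and the names of
    fields that have no equivalent in the target standard.
    """
    exported, unmapped = {}, []
    for field, value in record.items():
        term = FIELD_MAP.get(field, {}).get(standard)
        if term is None:
            unmapped.append(field)
        else:
            exported[term] = value
    return exported, unmapped


# Example record with made-up values.
record = {
    "collection_date": "2023-06-14",
    "depth_m": 5.0,
    "imaging_instrument": "example-microscope",
}
exported, unmapped = export_record(record, "darwin_core")
```

In this sketch, fields that an existing standard cannot express end up in `unmapped`, which is the kind of gap a tailored standard would have to cover.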
Looking ahead
The data hub is engineered to provide a scalable, secure environment for data storage and distribution, enabling seamless data sharing, efficient downstream analysis and collaborative research efforts.
“The hub is still in an early phase,” says Ivica. It currently contains mostly archival information and data generated by other projects, such as the EU-funded AtlantECO. “We expect large amounts of data, primarily generated by TREC, to arrive later in 2024 and early 2025 and be shared on the hub.”
While the hub is currently only accessible to the BIOcean5D team, later in the project we will transform it into an open-access platform that shares all our digital assets with the wider community. This is outlined in our data management plan, underlining our dedication to fostering a collaborative, data-driven approach to biodiversity research.