Presented in a previous article following its launch last year, the data hub operates as a central repository that enables all data used within BIOcean5D to be accessed via a dedicated platform. We caught up with the EMBL team responsible for the conceptualisation of the data hub and the data flows into the hub to learn more about how their work makes BIOcean5D’s holistic exploration of marine biodiversity possible.
“The main challenge of the BIOcean5D data hub is the design of an infrastructure and data upload procedure capable of accommodating the highly diverse range of data used within the project,” explains Kerstin Leberecht, data manager for the Traversing European Coastlines (TREC) expedition. “Among others, the data includes sequencing, imaging and historic data, modelling data outputs, acoustic recordings, in addition to a wide range of physical, chemical and environmental context data.”
Development of the BIOcean5D data hub was carried out in parallel to an equivalent data hub dedicated to the TREC – Tara Europa expedition, with the design of both platforms supporting significant data flow between the projects. “The data integrated into the BIOcean5D data hub from TREC – Tara Europa includes aerosol, sediment and shallow water samples collected on land by EMBL’s mobile laboratories, together with water samples collected in parallel from further at sea by the Tara Ocean Foundation’s schooner Tara,” explains Kerstin.
Credit: Kogia | Sumer Verma
In addition to the diversity of data types and formats, the BIOcean5D hub also needs to ensure compatibility of data originating from a variety of sources. The first is the 31 partners across 11 countries involved in the project. The second is the combination of newly generated data with existing historical datasets and archives from European marine stations, recent major ocean biodiversity surveys, and relevant data from a scattered network of previous and ongoing EU, international, national and local projects and citizen science initiatives.
The data hub must also comply with international data management standards, in particular the Open Science (OS) framework, the FAIR (Findable, Accessible, Interoperable, Reusable) principles and data protection regulations. “These requirements mean that each dataset submitted to the data hub must be accompanied and enriched by well-curated and detailed metadata,” explains Matej Trojak, Planetary Biology Biocurator at EMBL. Metadata enriches raw data with descriptive information (such as how and why the data was created) or contextual information (in the case of TREC, for example, where a sample was collected). The difference between data and metadata, however, is subtle and context-dependent. “Temperature, for example, could be considered data or metadata depending on the research context,” explains Matej. The diversity of data and the number of partners further complicated the management of metadata submission and, as a result, the system developed for the BIOcean5D data hub underwent multiple cycles of refinement and optimisation.
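To make the data and metadata distinction concrete, the following is a minimal Python sketch, not the data hub's actual schema or upload procedure. The field names, the `Submission` class, the `missing_metadata` helper and all values are hypothetical and chosen purely for illustration. It shows temperature acting as data in one submission and as metadata in another, along with a simple check that a submission carries the metadata it is expected to have.

```python
from dataclasses import dataclass, field

# Hypothetical required fields: the real BIOcean5D metadata
# requirements are not described in this article.
REQUIRED_METADATA = ("collected_where", "collected_how", "purpose")


@dataclass
class Submission:
    """A dataset together with the metadata that describes it."""
    data: dict
    metadata: dict = field(default_factory=dict)


def missing_metadata(submission: Submission) -> list:
    """Return the required metadata fields that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not submission.metadata.get(key)]


# Here temperature is the data being studied (illustrative values).
temperature_study = Submission(
    data={"temperature_c": [14.2, 14.6, 15.1]},
    metadata={
        "collected_where": "example coastal site",
        "collected_how": "example sensor",
        "purpose": "example temperature survey",
    },
)

# Here temperature is context for a sequencing sample, so it is metadata.
sequencing_sample = Submission(
    data={"reads_file": "example_sample.fastq"},
    metadata={"collected_where": "example coastal site", "temperature_c": 14.2},
)

print(missing_metadata(temperature_study))
print(missing_metadata(sequencing_sample))
```

In this sketch the second submission would be flagged as incomplete because it lacks two of the assumed required fields, which is the kind of gap that curation is meant to catch before a dataset is accepted.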
The overarching ambition of collecting, integrating and harmonising such a diverse range of new and existing biodiversity data is to enable a holistic exploration and understanding of marine biodiversity through multidisciplinary collaboration. “We’ve worked to ensure the data hub is user-friendly, intuitive and as seamless as possible, which has required careful balancing between compliance and usability,” explains Kerstin. Dedicated data hub training sessions have also been organised. “These sessions have proved to be a valuable source of feedback and inspiration for further development, helping us to identify and prioritise improvements.”
The data hub is now up and running, with technical improvements and the integration of ever more data continuing. “The rest is now up to the scientists!” says Anthony Fullam, Senior Bioinformatics Software Engineer at EMBL. “Most data hubs specialise in one particular type of data; there aren’t many that integrate such a range and scale of data. The analyses made possible by connecting and combining the information available in the BIOcean5D data hub could be quite powerful and unique.” The potential of collaborative data sharing has not gone unnoticed. “We’re getting very positive feedback about the data hub as well as requests to help set up similar platforms for other major international research projects,” explains Kerstin.
The plan is for the BIOcean5D data hub to become publicly accessible, with continued maintenance to support ongoing access, usability and the generation of insights to guide the protection and restoration of our Ocean.