Blogpost: From endangered languages to the Science Mesh - petascale FAIR data repositories

05 March 2021

The Science Mesh is a rich ecosystem for frictionless scientific collaboration and access to research services, where data, applications and computation are brought together to enable federated usage within and across scientific domains. The Science Mesh also encourages FAIR data practices and gives users the opportunity both to access new research services and to contribute to the development of the Science Mesh itself.

Marco La Rosa, from the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), The University of Melbourne and UTS, has been rebuilding the core storage component of the PARADISEC archive as a foundation for its next-generation FAIR data archive.

Read the blogpost from Marco below to find out more about the knowledge and tools crossing over from PARADISEC and UTS to the Science Mesh.

PARADISEC has been operating for 18 years and currently holds material in 1,270 languages across Australia and the Pacific. The archive contains over 115TB of content including more than 14,000 hours of audio recordings, 1,600 hours of video and 8,000 transcriptions. It is a facility that acts as an archive of research recordings as well as forming an integral part of the research workflow in which primary data is made citable, is preserved, and is publicised (with licence agreements) for access.

In its current form, the archive is driven by a monolithic Ruby on Rails application. Although the application is showing its age, it doesn’t suffer from issues of scale, thanks to early design decisions about how the data is stored. Specifically, data is stored by collection identifier and item identifier, which distributes files well across the file system and keeps folders from accumulating too many entries. Further, the metadata is exported from the application database to XML every time an item in the catalogue is saved and is stored alongside the data, so each item and collection is portable and a new system can be recovered from the on-disk store.
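As a rough illustration of this kind of layout, the Python sketch below stores each item under a collection folder, writes an XML metadata file beside the data, and rebuilds a catalogue by walking the disk. It is not PARADISEC's actual code (the archive itself is a Ruby on Rails application): the function names, the `-metadata.xml` file naming and the XML element names are all assumptions made for the example.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def item_dir(root: Path, collection_id: str, item_id: str) -> Path:
    """Return the folder for one item: <root>/<collection_id>/<item_id>."""
    return root / collection_id / item_id


def save_item_metadata(
    root: Path, collection_id: str, item_id: str, fields: dict[str, str]
) -> Path:
    """Write the item's catalogue fields as XML next to its data files.

    The element names and the file name are hypothetical, chosen for this sketch.
    Keys in `fields` must be valid XML element names.
    """
    folder = item_dir(root, collection_id, item_id)
    folder.mkdir(parents=True, exist_ok=True)
    item = ET.Element("item", {"collection": collection_id, "id": item_id})
    for name, value in fields.items():
        ET.SubElement(item, name).text = value
    target = folder / f"{collection_id}-{item_id}-metadata.xml"
    ET.ElementTree(item).write(target, encoding="utf-8", xml_declaration=True)
    return target


def recover_catalogue(root: Path) -> list[dict[str, str]]:
    """Rebuild catalogue records using only what is stored on disk."""
    records = []
    for path in sorted(root.glob("*/*/*-metadata.xml")):
        item = ET.parse(path).getroot()
        record = {"collection": item.get("collection"), "id": item.get("id")}
        record.update({child.tag: child.text or "" for child in item})
        records.append(record)
    return records
```

Because every item carries its own description, `recover_catalogue` needs no database: copying a collection folder elsewhere copies its metadata with it.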

This design has a lot in common with the application-independent approach to storage described by the Oxford Common File Layout (OCFL) and the idea of data packaged with metadata that is formalised in the Research Object Crate spec (RO-Crate).
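To make the idea of data packaged with metadata more concrete, here is a minimal sketch of the `ro-crate-metadata.json` file that sits at the root of an RO-Crate, following the general shape described in the RO-Crate 1.1 specification. The helper names and the example values are assumptions for illustration; real crates from an archive would carry far richer descriptions.

```python
import json
from pathlib import Path


def build_ro_crate_metadata(
    name: str, description: str, date_published: str, license_url: str, files: list[str]
) -> dict:
    """Build a minimal RO-Crate 1.1 metadata document (JSON-LD) for a folder of files."""
    graph = [
        {
            # Metadata descriptor: identifies this file and points at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root data entity: describes the crate as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    # One entity per file, referenced from the root dataset via hasPart.
    graph.extend({"@id": f, "@type": "File"} for f in files)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def write_ro_crate(folder: Path, metadata: dict) -> Path:
    """Write the metadata file into the folder that holds the data."""
    target = folder / "ro-crate-metadata.json"
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target
```

The description travels with the data as a plain JSON file, so any tool that can read JSON can read the crate without going through the application that produced it.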

Accordingly, in 2019, with the support of a very small grant from the Australian Research Data Commons (ARDC), Nick Thieberger (director of PARADISEC) and I started working with Peter Sefton and the eResearch team at UTS to develop a proof of concept demonstrator of a next-generation language archive using OCFL to store the content and RO-Crate to describe it. At this point, Peter and his team had significant experience working with OCFL filesystems and RO-Crates in their own applications and their expertise aligned well with our goals.

When thinking about what a next-generation catalogue might look like we were drawn to the guarantees of completeness offered by OCFL (i.e. a repo can be rebuilt from its filestores) in addition to features like data integrity, versioning and diversity of underlying storage. Much of the content housed by PARADISEC can never be collected again so it is crucial that we can verify the integrity of the content whilst also supporting easy movement/replication of objects; capabilities either offered natively or easily supported within OCFL. Describing the data using RO-Crate makes the content even FAIRer than it already is by using an open standard and working in accessible formats (JSON linked data - JSON-LD).
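The integrity guarantee can be sketched in a few lines. An OCFL object keeps an inventory whose manifest maps each content digest to the file paths holding that content, so verifying an object amounts to recomputing digests and comparing them. The sketch below assumes sha512 digests and a manifest already loaded into a Python dict; the function names are hypothetical, and a real validator checks much more than this (inventory structure, versions, sidecar digests).

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_object(object_root: Path, manifest: dict[str, list[str]]) -> list[str]:
    """Return the content paths that are missing or whose digest does not match.

    `manifest` follows the shape of an OCFL inventory manifest:
    {digest: [content path relative to the object root, ...]}.
    """
    problems = []
    for expected, content_paths in manifest.items():
        for rel in content_paths:
            target = object_root / rel
            if not target.is_file() or sha512_of(target) != expected.lower():
                problems.append(rel)
    return sorted(problems)
```

An empty list means every file listed in the manifest is present and unchanged, and the same check can be run on a replica after it has been copied to different storage.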

The demonstrator that we developed to test these ideas currently has more than 70TB of content described as RO-Crates and stored as OCFL objects. It comprises a single-page web application (SPA) served by an nginx server, along with an Elasticsearch service. Deployed as Docker containers, the service is easy to manage and trivial to scale. That said, in the time we’ve been developing it we have not seen the need to scale it in any way. Further, the service remains performant regardless of the amount of data in the underlying OCFL filesystem. Traditional API-based repository technologies can suffer performance degradation as they scale to ever-larger sizes, but we have not seen this in our demonstrator with these technologies.

So how does this relate to the Science Mesh?

As it turns out, the CS3MESH4EOSC project, which is behind the Science Mesh, has been thinking about similar issues. What tools are required to make research data FAIR and how do you apply them at scale to handle the often massive datasets coming from scientific collaborations? The research data management lifecycle consists of a number of phases, typically focussed on collection and analysis, but data description to support long-term preservation is not always well defined; in many cases, it’s not even considered. Not so at PARADISEC, which has had processes and systems in place to ensure appropriate description from the very beginning of the collection/analysis cycle.

By collaborating with the Science Mesh, we (PARADISEC and Peter Sefton) are helping to define and build the tools that form part of a scalable, performant and well-described data ecosystem. Indeed, one of the contributions from our partners at UTS is Describo, a tool for creating and updating RO-Crates. Describo allows researchers to describe their data using an open community standard so that it is ready for preservation, dissemination and reuse. Describo is the tool to make data FAIR.

The Science Mesh, as a partner in the development of Describo, will be adapting it to work with the mesh infrastructure to form a key description tool available to all users of the mesh. Further, conversations about what to do with the soon-to-be well-described data are already occurring. Where does the data go when it needs to be published/archived/shared and how does that process look? What types of new services are enabled by using these open standards and having the data in this form?

Indeed, these questions are being considered by an even wider community of global RO-Crate and OCFL users. So to that end, it’s humbling to think that a demonstrator born from a small Australian research grant has opened up access to an international community of researchers, developers and systems all facing similar challenges and collaborating to develop the next generation of scalable and performant FAIR data services.

A blogpost by Marco La Rosa – PARADISEC / The University of Melbourne / UTS

Contacts: https://www.linkedin.com/in/marcolarosa/ | m@lr.id.au

Learn more about how Science Mesh can serve you & your community from the CS3MESH4EOSC!
