PARADISEC currently holds material in 1,270 languages across Australia and the Pacific. The archive contains over 115TB of content, including more than 14,000 hours of audio recordings, 1,600 hours of video and 8,000 transcriptions. The facility acts both as an archive of research recordings and as an integral part of the research workflow, in which primary data is made citable, preserved, and published (under licence agreements) for access. Because the data itself cannot be searched by keyword, the facility relies heavily on the quality, scalability and searchability of its metadata descriptors.

Field of study

Humanities: endangered languages

The Scientific challenge

Searchability in non-textual (sound/video) archives is crucial for scientific output, but as data volumes grow, manual methods need to be augmented with automation.

The Technical challenge

Maintaining high-quality metadata descriptors in a distributed archive with many researchers depositing data, while keeping costs down and reusing existing scalable infrastructure as much as possible.

The Business challenge

Balancing the cost of bespoke solutions against user training, the reliability of COTS e-infrastructure, and the long-term reliability of research data and metadata quality.

The Solution

Add user-friendly metadata annotation and packaging to COTS EFSS systems, basing this on emerging standards in the eScience space (RO-Crate, OCFL) to guarantee the long-term usefulness of the metadata and its packaging.
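To make the packaging standard concrete, the sketch below builds a minimal RO-Crate descriptor (the `ro-crate-metadata.json` file that sits at the root of a crate) for a hypothetical deposit. The file names, titles and descriptions are illustrative only; the `@context` and `conformsTo` values follow the RO-Crate 1.1 specification.

```python
import json

# Minimal RO-Crate metadata descriptor for one hypothetical deposit.
# All entity names and file names are illustrative, not real PARADISEC records.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # The descriptor entity: points at the crate root and the spec version.
            "@type": "CreativeWork",
            "@id": "ro-crate-metadata.json",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset: the deposit as a whole.
            "@type": "Dataset",
            "@id": "./",
            "name": "Example field recording session",
            "description": "Audio recording with transcription (illustrative).",
            "hasPart": [{"@id": "recording.wav"}, {"@id": "transcript.txt"}],
        },
        {"@type": "File", "@id": "recording.wav", "encodingFormat": "audio/wav"},
        {"@type": "File", "@id": "transcript.txt", "encodingFormat": "text/plain"},
    ],
}

print(json.dumps(crate, indent=2))
```

A crate like this, laid down inside an OCFL object's versioned directory tree, gives both the human-readable packaging and the machine-actionable metadata that the solution relies on.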

The scientific, societal, economic and policy impact

The methods proposed in "the solution" are generic and, as per the CS3MESH4EOSC paradigm, can be universally deployed across mesh nodes. In this way, metadata annotation and packaging on live data will become ubiquitously available -- boosting FAIRness and increasing the survival chances of both the implementation codebase and the chosen standards.
Metadata-aware, community-run live data stores can then play a role in a service mix that is currently dominated by commercial cloud solutions and publishers.


Without CS3MESH4EOSC, the solution would remain local to PARADISEC and AARNet, quite probably raising the maintenance burden to unsustainable levels and thus creating too much long-term insecurity for a data-longevity project.


With CS3MESH4EOSC, the same codebase and choice of standards will be endorsed by many more sites, lowering the per-node cost base and substantially raising the visibility of the solution and the standards it employs.


End-users and research communities

"Open Data Systems" professionals will considerably broaden the userbase for FAIR services -- from niche, in-the-know users to data producers who don't usually involve the library. These users will be able to more easily comply with the rising tide of FAIR requirements, and to more easily articulate the impact of their research in the coming altmetric policy environment.

Institutional operators and services

Operators will have a low-barrier-of-entry option to FAIRify the data held in their stores, delighting policymakers and end-users alike. The burden of having to pick a set of standards for long-term data custodianship will be eased.

Commercial software developers

With a single API and a clear set of standards, connecting a third-party commercial data-handling solution to the mesh nodes will be eased: write once, deploy many.
Examples include commercial archives, publishers, and data-quality checkers (plagiarism detection, pseudonymisation, etc.).

Non-commercial software developers

Research support and e-infrastructure developers will have an easier time writing research workflow code against a standardised API for accessing metadata -- no need to reinvent the wheel, and no interoperability problems between mesh nodes. The resulting code itself becomes more future-proof too.

Policy makers & citizens

FAIR tools on live data will lift the FAIR game out of its present, fairly niche existence (in repositories whose audience is essentially the library and the science support office). This in turn gives policymakers excellent opportunities to make FAIR data relevant in the day-to-day science practice of all researchers.

Services used

Open Data Systems

Technical implementation

DescriboOnline is used for metadata annotation and packaging, recoded to use REVA as the data/identities access layer. Describo, in turn, proffers packaged RO-Crates to ScieboRDS as the data expunge layer, which negotiates with third-party systems (archives, repositories) about the deposit of RO-Crates, rights management, PID minting, etc.
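The three-layer flow above can be sketched as a pipeline. This is a hedged illustration only: the function names and data shapes below are placeholders standing in for the real Describo, REVA and ScieboRDS interfaces, which are not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Crate:
    """A packaged deposit: a data path plus its RO-Crate-style metadata."""
    path: str
    metadata: dict = field(default_factory=dict)

def annotate(path: str, metadata: dict) -> Crate:
    """Describo layer (stub): attach user-entered metadata to live data."""
    return Crate(path=path, metadata=dict(metadata))

def resolve_via_reva(crate: Crate) -> Crate:
    """REVA layer (stub): resolve data location and user identities."""
    crate.metadata.setdefault("resolvedBy", "reva")
    return crate

def deposit_via_rds(crate: Crate) -> dict:
    """ScieboRDS layer (stub): negotiate deposit, rights and PID minting
    with a third-party archive or repository."""
    return {"status": "deposited", "pid": "hdl:0000/placeholder", "path": crate.path}

# One end-to-end pass through the proposed pipeline.
receipt = deposit_via_rds(resolve_via_reva(annotate("session-001/", {"name": "Example"})))
print(receipt)
```

The design point the sketch captures is the strict layering: annotation never talks to the repository directly, so any expunge target that speaks the same deposit protocol can be swapped in behind the last step.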

Contribution to the EOSC

Making CS3MESH4EOSC FAIR-compliant is a significant policy objective that the EC is keen to see addressed.

Future developments

Automated metadata harvesting (using heuristics), automated file classification, and standardisation of the expunge mechanism for crates (probably through SWORDv3).