Metadata Validation
Metadata should help users assess whether a data set is usable for their own purposes and help them understand its quality. Assessing the quality of the metadata itself may, in turn, guide stakeholders in the future governance of the system.
For metadata, the SoilWise Repository aims to balance quantity and quality in its harvesting approach; see the Harvester Component for more information. Catalogues that capture metadata authored by various data communities typically show a wide range of metadata completeness and accuracy. The SoilWise Repository therefore employs metadata validation mechanisms to provide additional information about metadata completeness, conformance and integrity. The information resulting from the validation process is stored together with each metadata record in a relational database and is updated whenever a new metadata version is registered. In the first iteration, these results are not yet displayed in the SoilWise Catalogue.
Two components are available to monitor aspects of metadata:
- Metadata completeness calculates a score based on selected populated metadata properties
- Link liveliness assessment validates the availability of resources described by the record
Metadata completeness
Metadata structure validation
Before validation, metadata is harmonized to a common model. If harmonization fails for any reason, the record is neither ingested nor validated. The error is logged so that an administrator can follow up.
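To make this ingest behaviour concrete, a minimal sketch of such a guard is shown below; the `harmonize_record` stub, the logger name and the record fields are assumptions for illustration, not the actual SWR harvester code.

```python
import logging
from typing import Optional

logger = logging.getLogger("swr.harvester")


def harmonize_record(record: dict) -> dict:
    """Stub harmonization: map a harvested record onto a minimal common model.

    The real SWR model is richer; this stub only illustrates the failure path
    (a missing mandatory field raises KeyError).
    """
    return {
        "identifier": record["identifier"],
        "title": record["title"],
        "type": record.get("type", "dataset"),
    }


def ingest(record: dict) -> Optional[dict]:
    """Harmonize a record; on failure, log the error and skip ingestion and validation."""
    try:
        return harmonize_record(record)
    except Exception as exc:
        logger.error("Harmonization failed for record %r: %s", record.get("identifier"), exc)
        return None  # the record is neither ingested nor validated
```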
Metadata completeness indication
The indication expresses the completeness of a record as a percentage (0–100) over a set of endorsed properties, taking into account that some properties are conditional on values selected in other properties.
Completeness is evaluated against a set of metadata elements for each harmonized metadata record in the SWR platform. Records for which harmonization fails are not evaluated (nor imported).
Label | Description | Score |
---|---|---|
Identification | Unique identification of the dataset (A UUID, URN, or URI, such as DOI) | 10 |
Title | Short meaningful title | 20 |
Abstract | Short description or abstract (1/2 page), can include (multiple) scientific/technical references | 20 |
Author/Organisation | An entity responsible for the resource (e.g. person or organisation) | 20 |
Date | Last update date | 10 |
Type | The nature of the content of the resource | 10 |
Rights | Information about rights and licences | 10 |
Extent (geographic) | Geographical coverage (e.g. EU, EU & Balkan, France, Wallonia, Berlin) | 5 |
Extent (temporal) | Temporal coverage | 5 |
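As an illustration of how such a score can be derived, the sketch below sums the weights of the populated properties from the table above. The property keys and the handling of empty values are assumptions, not the SWR implementation (which additionally handles conditional properties).

```python
# Weights per endorsed property, mirroring the table above (illustrative sketch only).
WEIGHTS = {
    "identification": 10,
    "title": 20,
    "abstract": 20,
    "author_organisation": 20,
    "date": 10,
    "type": 10,
    "rights": 10,
    "extent_geographic": 5,
    "extent_temporal": 5,
}


def completeness_score(record: dict) -> int:
    """Sum the weights of the endorsed properties that are populated in the record."""
    return sum(
        weight
        for prop, weight in WEIGHTS.items()
        if record.get(prop) not in (None, "", [])
    )


# Example: a record with only an identifier, title and abstract scores 50.
print(completeness_score({"identification": "urn:uuid:1234", "title": "Soil map", "abstract": "..."}))
```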
Metadata ETS/ATS checking
The ATS/ETS methodology has been proposed as the basis for developing the validation tests.
Abstract Test Suites (ATS) define a set of abstract test cases or scenarios that describe the expected behaviour of metadata without specifying implementation details. These test suites focus on the logical aspects of metadata validation and provide a high-level view of the validation requirements, enabling stakeholders to understand validation objectives and constraints without getting bogged down in technical details. They serve as a valuable communication and documentation tool, facilitating collaboration between metadata producers, consumers and validators. An ATS is often documented using natural-language descriptions, diagrams or formal specifications, outlining the expected inputs, outputs and behaviours of the metadata under various conditions.
The SWR ATS is under development at https://github.com/soilwise-he/metadata-validator/blob/main/docs/abstract_test_suite.md
Executable Test Suites (ETS) are sets of tests, designed according to an ATS, that perform the actual metadata validation. These tests are typically automated and can be run repeatedly to ensure consistent validation results. An executable test suite consists of scripts, programs or software tools that perform various validation checks on metadata. These checks can include:
- Data Integrity: Checking for inconsistencies or errors within the metadata. This includes identifying missing values, conflicting information, or data that does not align with predefined constraints.
- Standard Compliance: Assessing whether the metadata complies with relevant industry standards, such as Dublin Core, MARC, or specific domain standards like those for scientific data or library cataloguing.
- Interoperability: Evaluating the metadata's ability to interoperate with other systems or datasets. This involves ensuring that metadata elements are mapped correctly to facilitate data exchange and integration across different platforms.
- Versioning and Evolution: Considering the evolution of metadata over time and ensuring that the validation process accommodates versioning requirements. This may involve tracking changes, backward compatibility, and migration strategies.
- Quality Assurance: Assessing the overall quality of the metadata, including its accuracy, consistency, completeness, and relevance to the underlying data or information resources.
- Documentation: Documenting the validation process itself, including any errors encountered, corrective actions taken, and recommendations for improving metadata quality in the future.
The ETS is implemented as a Python module that is triggered from a CI/CD pipeline.
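For illustration, an executable test in such a module could look like the sketch below; the check names, record structure and entry point are assumptions rather than the actual module's API.

```python
# Sketch of an executable test suite entry point; names and record structure are assumptions.

def check_rights(record: dict) -> tuple[bool, str]:
    """Data-integrity / standard-compliance check: a record must carry a rights statement."""
    if not record.get("rights"):
        return False, "Missing rights/licence statement"
    return True, "OK"


def check_title(record: dict) -> tuple[bool, str]:
    """A record must carry a non-empty title."""
    if not record.get("title"):
        return False, "Missing title"
    return True, "OK"


def run_ets(record: dict) -> dict:
    """Run all checks against a harmonized record and collect a validation report."""
    checks = {"rights": check_rights, "title": check_title}
    return {name: dict(zip(("passed", "message"), check(record))) for name, check in checks.items()}


if __name__ == "__main__":
    # Invoked from the CI/CD pipeline, e.g. as `python ets_sketch.py`.
    print(run_ets({"title": "Soil profile data", "rights": "CC-BY-4.0"}))
```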
Technology
- Python: used for the validation module, API development, and database interactions
- PostgreSQL: primary database for storing and managing validation results
- CI/CD: automated pipeline for continuous integration and deployment, with scheduled weekly runs for metadata validation
Results of metadata validation are stored in the PostgreSQL database, in the `validation` table within the `validation` schema.
Identifier | Score | Date |
---|---|---|
abc-123-cba | 60 | 2025-01-20T12:20:10Z |
Validation runs every week as a CI/CD pipeline on records that have not been validated for two weeks. This builds up a history that helps understand validation results over time (note that changes to the records as well as changes to the ETS itself may cause differences in score).
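A sketch of how the weekly pipeline could select records that are due for validation and append a new score to the history table shown above is given below, assuming psycopg2 and the `validation.validation` table; connection parameters and exact column names are assumptions.

```python
# Sketch against the validation.validation table described above;
# connection parameters and column names are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=swr user=swr host=localhost")

with conn, conn.cursor() as cur:
    # Records whose most recent validation is older than two weeks.
    cur.execute(
        """
        SELECT identifier
        FROM validation.validation
        GROUP BY identifier
        HAVING max(date) < now() - interval '2 weeks'
        """
    )
    due = [row[0] for row in cur.fetchall()]

    # After re-running the ETS, append the new score with a timestamp,
    # building up the validation history over time.
    for identifier in due:
        score = 60  # placeholder: score produced by the ETS run for this record
        cur.execute(
            "INSERT INTO validation.validation (identifier, score, date) VALUES (%s, %s, now())",
            (identifier, score),
        )
```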
Future work
- Extend the ATS to cover JRC and user needs
- Extend the ETS based on the ATS
Link liveliness assessment
Metadata (and data and knowledge sources) tend to contain links to other resources. Not all of these URIs are persistent, so over time they can degrade. In practice, many non-persistent knowledge sources and assets exist that could be relevant for SWR, e.g. on project websites, in online databases, on the computers of researchers, etc. Links pointing to such assets might however be part of harvested metadata records or data and content that is stored in the SWR.
The link liveliness assessment subcomponent iterates over the links stored with the SWR assets and checks their status. It is foreseen to run frequently over the URIs in the SWR repository, assessing and storing the status of each link.
While evaluating a link, the assessment tool may derive some contextual metadata, such as file size and file format, which can augment the metadata record. These results are stored in the metadata augmentation table.
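A minimal sketch of such a check using the requests library is shown below: it records the HTTP status and, where the server reports them, the file size and format from the response headers. Function and field names are illustrative only, not the subcomponent's actual code.

```python
# Minimal sketch of a single link check; names and the storage step are assumptions.
from datetime import datetime, timezone

import requests


def check_link(url: str, timeout: float = 10.0) -> dict:
    """Probe a URL and return its status plus any contextual metadata from the headers."""
    result = {"url": url, "checked": datetime.now(timezone.utc).isoformat()}
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        result["status_code"] = response.status_code
        result["alive"] = response.ok
        # Contextual metadata that can augment the record, when the server reports it.
        result["file_size"] = response.headers.get("Content-Length")
        result["file_format"] = response.headers.get("Content-Type")
    except requests.Timeout:
        result["alive"] = False
        result["error"] = "timeout"
    except requests.RequestException as exc:
        result["alive"] = False
        result["error"] = str(exc)
    return result


print(check_link("https://example.org/dataset.csv"))
```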
The link liveliness assessment provides the following functions:
- OGC API Catalogue Integration
- Designed to work specifically with OGC API - Records
- Extracts and evaluates URLs from catalogue items
- Link Validation
- Evaluates the validity of links to external sources and within the repository
- Checks if metadata accurately represents the source
- Support for OGC service links
- Health Status Tracking
- Provides up-to-date status history for every assessed link
- Maintains a history of link health over time
- Flexible Evaluation
- Supports single resource evaluation on demand
- Performs periodic tests to provide availability history
- Broken link management
- Identifies and categorizes broken links based on their status code (`401 Unauthorized`, `404 Not Found`, `500 Server Error`)
- Flags deprecated links after consecutive failed tests and excludes them from future checks
- Timeout management
- Identifies resources exceeding specified timeout thresholds
A JavaScript widget is additionally used to display the link status directly in the SWR Catalogue record.
The API can be used to identify which records have broken links.
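As an illustration of such a lookup, the snippet below queries a hypothetical endpoint; the base URL, route and response fields are placeholders, the actual routes are listed in the service's auto-generated Swagger documentation.

```python
# Illustrative only: the base URL, endpoint path and response fields are placeholders,
# not the service's documented routes.
import requests

BASE_URL = "https://example.org/linkchecker"  # hypothetical deployment of the API

response = requests.get(f"{BASE_URL}/broken-links", timeout=30)
response.raise_for_status()

# Each entry is assumed to carry the record identifier, the failing URL and its last status code.
for entry in response.json():
    print(entry["record_id"], entry["url"], entry["status_code"])
```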
Technology
- Python: used for the linkchecker integration, API development, and database interactions
- PostgreSQL: primary database for storing and managing link information
- FastAPI: employed to create and expose REST API endpoints, utilizing FastAPI's efficiency and auto-generated Swagger documentation
- Docker: used for containerizing the application, ensuring consistent deployment across environments
- CI/CD: automated pipeline for continuous integration and deployment, with scheduled weekly runs for link liveliness assessment