Metadata Augmentation
Introduction
Overview and Scope
This set of components augments metadata statements using various techniques. Augmentations are stored in a dedicated augmentation table, together with an indication of the process that produced them. The statements are combined with the ingested content to offer users an optimal catalogue experience.
At the moment, Metadata Augmentation functionality is covered by the following components:
- Keyword Matcher
- Element Matcher
- Translation Module
- Link Liveliness Assessment
- Spatial Locator
- Metadata Interlinker
Upcoming components are described under Foreseen functionality below.
Intended Audience
Metadata Augmentation is a backend component providing outputs which users can see displayed in the Metadata Catalogue. Therefore the intended audience corresponds to that of the Metadata Catalogue. Additionally, we expect a maintenance role:
- SWC Administrator: monitors the augmentation processes, with access to history, logs and statistics. Administrators can manually start a specific augmentation process.
Keyword Matcher
Info
Current version: 0.2.0
Technology: Python
Release: https://doi.org/10.5281/zenodo.14924181
Projects: Keyword matcher
Overview and Scope
Keywords are an important mechanism for filtering and clustering records. Similar keywords need to be aligned so that records can be matched. This module evaluates the keywords of existing records and harmonises them in case of high similarity.
The module analyses the existing keywords on a metadata record. Two cases can be identified:
- If a keyword with a SKOS identifier has a closeMatch or sameAs relation to a preferred keyword, the preferred keyword is used.
- If an existing keyword without a SKOS identifier matches a preferred keyword by (translated) string or synonym, the matched keyword (including its SKOS identifier) is appended.
To facilitate this use case the SWR contains a knowledge graph of preferred keywords in the soil domain derived from AGROVOC, GEMET and ISO 11074. This knowledge graph is maintained at https://github.com/soilwise-he/soil-health-knowledge-graph. These vocabularies are multilingual, facilitating the translation case.
For metadata records which have not yet been analysed (in that iteration), the module extracts the keywords. Each keyword is analysed to determine whether it matches any of the preferred keywords; if so, the preferred keyword is added to the augmentation results for that record. For string matching a fuzzy match algorithm is used, requiring a 90% match (configurable). Translations are matched using the metadata language as indicated in the record.
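As an illustration, the fuzzy string match could look like the minimal sketch below. It assumes the rapidfuzz library and a simple in-memory dictionary of preferred keywords; the function name, data shapes and the example URI are illustrative, not the actual implementation.

```python
# Illustrative sketch only; the real module reads preferred keywords from
# the soil health knowledge graph and stores matches as augmentations.
from rapidfuzz import fuzz

MATCH_THRESHOLD = 90  # the configurable 90% match mentioned above

def match_keyword(keyword: str, preferred: dict[str, str]) -> str | None:
    """Return the SKOS identifier of the best-matching preferred keyword,
    or None if no candidate reaches the threshold."""
    best_id, best_score = None, 0.0
    for skos_id, label in preferred.items():
        score = fuzz.ratio(keyword.lower(), label.lower())  # 0..100
        if score >= MATCH_THRESHOLD and score > best_score:
            best_id, best_score = skos_id, score
    return best_id

# Example usage with a single (placeholder) preferred keyword entry
preferred = {"https://example.org/skos/soil-erosion": "soil erosion"}
print(match_keyword("Soil Erosion", preferred))
```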
Key Features
Architecture
Technological Stack
| Technology | Description |
|---|---|
| Python | Used for the keyword matching and database interactions. |
| PostgreSQL | Primary database for storing and managing information. |
| Docker | Used for containerizing the application, ensuring consistent deployment across environments. |
| CI/CD | Automated pipeline for continuous integration and deployment, with scheduled daily runs. |
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
- The process runs as a CI/CD pipeline at daily intervals.
Risks & Limitations
Element Matcher
Overview and Scope
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Translation Module
Overview and Scope
Some records arrive in a local language. The SWR translates the main properties of the record, title and abstract, into English to offer a single-language user experience. The translations are used in filtering and display of records.
The translation module builds on the EU translation service (API documentation at https://language-tools.ec.europa.eu/). Translations are stored in a database for reuse by the SWR.
The EU translation service responds asynchronously to translation requests, which means that translations may not yet be available after the initial load of new data. A callback operation populates the database; from that moment a translation is available to the SWR. The translation service uses 2-letter language codes, so a mapping from 3-letter ISO codes (as used in, for example, iso19139:2007) to 2-letter codes is required. The EU translation service supports only a limited set of language pairs and returns an error for unsupported combinations.
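For illustration, the 3-letter to 2-letter mapping could be as simple as a lookup table; this is a minimal sketch with a few example entries, not the module's actual mapping.

```python
# Partial mapping from 3-letter ISO 639-2 codes (as used in iso19139
# records) to the 2-letter codes expected by the EU translation service.
# Illustrative sketch; a production mapping would cover all languages.
ISO639_2_TO_1 = {
    "eng": "en",
    "deu": "de", "ger": "de",  # ISO 639-2/T and /B variants
    "fra": "fr", "fre": "fr",
    "nld": "nl", "dut": "nl",
    "spa": "es",
    "ita": "it",
}

def to_two_letter(code: str) -> str:
    """Normalise a metadata language code to the 2-letter form."""
    code = code.strip().lower()
    if len(code) == 2:
        return code
    if code in ISO639_2_TO_1:
        return ISO639_2_TO_1[code]
    raise ValueError(f"No 2-letter mapping known for {code!r}")
```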
Initial translation is triggered by a running harvester. The translations will then be available once the record is ingested into the triplestore and catalogue database in a follow-up step of the harvester.
Key Features
Architecture
Technological Stack
| Technology | Description |
|---|---|
| Python | Used for the translation module, API development, and database interactions. |
| PostgreSQL | Primary database for storing and managing information. |
| FastAPI | Employed to create and expose REST API endpoints. Utilizes FastAPI's efficiency and auto-generated Swagger documentation. |
| Docker | Used for containerizing the application, ensuring consistent deployment across environments. |
| CI/CD | Automated pipeline for continuous integration and deployment, with scheduled daily runs. |
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Link Liveliness Assessment
Info
Current version: 1.1.4
Technology: Python, FastAPI
Release: https://doi.org/10.5281/zenodo.14923790
Projects: Link liveliness assessment
Overview and Scope
Metadata (and data and knowledge sources) tend to contain links to other resources. Not all of these URIs are persistent, so they can degrade over time. In practice, many non-persistent knowledge sources and assets exist that could be relevant for the SWR, e.g. on project websites, in online databases, or on the computers of researchers. Links pointing to such assets might, however, be part of harvested metadata records or of data and content stored in the SWC.
The Link Liveliness Assessment (LLA) component runs over the available links stored with the SWC assets and checks their status. The function is foreseen to run frequently over the URIs in the SWC, assessing and storing the status of each link.
A link in a metadata record typically points to one of:
- another metadata record
- a downloadable instance (pdf/zip/sqlite/mp4/pptx) of the resource
- the resource itself
- documentation about the resource
- an identifier of the resource (e.g. a DOI)
- a webservice or API (sparql, openapi, graphql, ogc-api)
The linkchecker evaluates, for a set of metadata records, whether:
- the links to external sources are valid
- the links within the repository are valid
- the link metadata accurately represents the resource (MIME type, size, data model, access constraints)
While evaluating the context of a link, the LLA component may derive some contextual metadata, which can augment the metadata record. These results are stored in the metadata augmentation table. The metadata aspects derived are file size and file format.
Key Features
The LLA component provides the following functions:
- Link validation: Returns HTTP status codes for each link, along with other important information such as the parent URL, any warnings, and the date and time of the test. Additionally, the tool enhances link analysis by identifying various metadata attributes, including file format type (e.g., image/jpeg, application/pdf, text/html), file size (in bytes), and last modification date. This provides users with valuable insights about the resource before accessing it.
- Broken link categorization: Identifies and categorizes broken links based on status codes, including Redirection Errors, Client Errors, and Server Errors.
- Deprecated links identification: Flags links as deprecated if they have failed for X consecutive tests; in our case X equals 10. Deprecated links are excluded from future tests to optimize performance (see the sketch at the end of this section).
- Timeout management: Allows the identification of URLs that exceed a timeout threshold which can be set manually as a parameter in linkchecker's properties.
- Availability monitoring: When run periodically, the tool builds a history of availability for each URL, enabling users to view the status of links over time.
- OWS services (WMS, WFS, WCS, CSW) typically return an HTTP 500 error when called without the necessary parameters. Handling for these services has been applied in order to detect and include the necessary parameters before the link is checked (see the sketch below).
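A hedged sketch of that OWS handling is shown below; the detection heuristic and parameter handling are illustrative assumptions, not the dedicated script itself.

```python
# Illustrative: OGC OWS endpoints need service/request parameters to avoid
# an HTTP 500, so a GetCapabilities request is composed before checking.
from urllib.parse import parse_qs, urlencode, urlparse

OWS_SERVICES = ("WMS", "WFS", "WCS", "CSW")

def prepare_ows_url(url: str) -> str:
    """Append minimal GetCapabilities parameters to suspected OWS URLs."""
    parsed = urlparse(url)
    params = {k.lower() for k in parse_qs(parsed.query)}
    for service in OWS_SERVICES:
        if service.lower() in url.lower() and "request" not in params:
            extra = urlencode({"service": service, "request": "GetCapabilities"})
            separator = "&" if parsed.query else "?"
            return f"{url}{separator}{extra}"
    return url

# Example: a bare WMS endpoint gets the parameters it needs
print(prepare_ows_url("https://example.org/geoserver/wms"))
```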
A JavaScript widget is further used to display the link status directly in the SoilWise Metadata Catalogue record.
The API can be used to identify which records have broken links.
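The deprecation rule described above can be sketched as a simple counter; the field names follow the database design shown later, everything else is illustrative.

```python
# Illustrative counter for the deprecation rule (X = 10 consecutive failures).
MAX_CONSECUTIVE_FAILURES = 10

def update_link_after_test(link: dict, test_passed: bool) -> dict:
    """Reset or increment the failure counter and flag deprecated links."""
    if test_passed:
        link["consecutive_failures"] = 0
    else:
        link["consecutive_failures"] = link.get("consecutive_failures", 0) + 1
        if link["consecutive_failures"] >= MAX_CONSECUTIVE_FAILURES:
            link["deprecated"] = True  # skipped by future test runs
    return link
```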
Architecture
Technological Stack
| Technology | Description |
|---|---|
| Python | Used for the linkchecker integration, API development, and database interactions. |
| PostgreSQL | Primary database for storing and managing link information. |
| FastAPI | Employed to create and expose REST API endpoints. Utilizes FastAPI's efficiency and auto-generated Swagger documentation. |
| Docker | Used for containerizing the application, ensuring consistent deployment across environments. |
| CI/CD | Automated pipeline for continuous integration and deployment, with scheduled weekly runs for link liveliness assessment. |
Main Component Diagram
flowchart LR
H["Harvester"]-- "writes" -->MR[("Record Table")]
MR-- "reads" -->LAA["Link Liveliness Assessment"]
MR-- "reads" -->CA["Catalogue"]
LAA-- "writes" -->LLAL[("Links Table")]
LAA-- "writes" -->LLAVH[("Validation History Table")]
CA-- "reads" -->API["**API**"]
LLAL-- "writes" -->API
LLAVH-- "writes" -->API
Main Sequence Diagram
sequenceDiagram
participant Linkchecker
participant DB
participant Catalogue
Linkchecker->>DB: Establish Database Connection
Linkchecker->>Catalogue: Extract Relevant URLs
loop URL Processing
Linkchecker->DB: Check URL Existence
Linkchecker->DB: Check Deprecation Status
alt URL Not Deprecated
Linkchecker-->DB: Insert/Update Records
Linkchecker-->DB: Insert/Update Links with file format type, size, last_modified
Linkchecker-->DB: Update Validation History
else URL Deprecated
Linkchecker-->DB: Skip Processing
end
end
Linkchecker->>DB: Close Database Connection
Database Design
classDiagram
Links <|-- Validation_history
Links <|-- Records
Links : +Int ID
Links : +Int fk_records
Links : +String Urlname
Links : +String deprecated
Links : +String link_type
Links : +Int link_size
Links : +DateTime last_modified
Links : +String Consecutive_failures
class Records{
+Int ID
+String Records
}
class Validation_history{
+Int ID
+Int fk_link
+String Statuscode
+String isRedirect
+String Errormessage
+Date Timestamp
}
Integrations & Interfaces
- Visualisation of the evaluation in the Metadata Catalogue; the assessment report is retrieved via AJAX from each record page.
- The FastAPI service now incorporates additional metadata for links, including file format type, size, and last modified date.
Key Architectural Decisions
- Initially we started with the linkchecker library, but performance was really slow because it tested the same links for each page again and again.
- We decided to only test the links section of ogc-api:records; this means that links within, for example, the metadata abstract are no longer tested.
- OGC OWS services make up a substantial portion of links; these services return error 500 if called without parameters. For this scenario we created a dedicated script.
- If tests for a resource fail a number of times, the resource is no longer tested and is tagged as deprecated.
- Links via a facade, such as DOI, are followed to the page they refer to. This means the LLA tool can understand the relation between a DOI and the page it refers to.
- For each link it is known on which record(s) it is mentioned, so if a broken link occurs, we can find a contact to notify in the record.
For the second release we have enhanced the Link Liveliness Assessment tool to collect more information about the resources:
- File type format (media type) to help users understand what format they'll be accessing (e.g., image/jpeg, application/pdf, text/html)
- File size to inform users about download expectations
- Last modification date to indicate how recent the resource is
API Updates: The API has been extended to include the newly tracked metadata fields:
- link_type: Shows the file format type of the resource (e.g., image/jpeg, application/pdf)
- link_size: Indicates the size of the resource in bytes
- last_modified: Provides the timestamp when the resource was modified
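As a hypothetical usage example (the base URL and endpoint path below are placeholders; only the three field names are documented above), a client could read these fields as follows:

```python
# Hypothetical client; endpoint path and base URL are placeholders.
import requests

BASE_URL = "https://example.org/lla"  # not the real deployment URL

response = requests.get(f"{BASE_URL}/links", timeout=10)
response.raise_for_status()
for link in response.json():
    # link_type, link_size and last_modified are the documented fields
    print(link.get("link_type"), link.get("link_size"), link.get("last_modified"))
```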
Risks & Limitations
Spatial Locator
Overview and Scope
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Metadata Interlinker
Overview and Scope
To be able to provide interlinked data and knowledge assets (e.g. a dataset, the project in which it was generated, and the operating procedure used), links between metadata must be identified and registered, ideally as part of the SWR Triple Store.
We distinguish between explicit and implicit links:
- Explicit links can be directly derived from the data and/or metadata. E.g. projects in CORDIS are explicitly linked to documents and datasets.
- Implicit links cannot be directly derived from the (meta)data. They may be derived from spatial or temporal extent, keyword usage, or a shared author/publisher.
The SWC implements the interlinking of data and knowledge assets based on explicit links found in the harvested metadata. The harvesting processes implemented in the SWC have been extended with this function to detect such linkages, store them in the repository and add them to the SWC knowledge graph. This allows, for example, exposing this additional information to the UI for display and linking, as well as to other functions.
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Foreseen functionality
In the next iterations, the Metadata Augmentation component is foreseen to include the following additional functions:
Keyword extraction
The value of relevant keywords is often underestimated by data producers. This proof-of-concept module evaluates the metadata title/abstract to identify relevant keywords using NLP/NER technology. Integration with the catalogue is foreseen.
Spatial scope analyser
A module that is foreseen to analyse the spatial scope of a resource. The bounding box will be matched to country or continental bounding boxes using a gazetteer, to understand whether the dataset has a global, continental, national or regional scope.
- Retrieves all datasets (as iso19139 XML) from the database (records table joined with augmentations) which:
  - have a bounding box
  - have no spatial scope yet
  - are in iso19139 format
- For each record it compares the bounding box to country bounding boxes:
  - if bigger than the continents > global
  - if it matches a continent > continental
  - if it matches a country > national
  - if smaller > regional
- The result is written as an augmentation to a dedicated table (a minimal sketch of this classification follows below).
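A minimal sketch of this foreseen classification, assuming bounding boxes as (min_lon, min_lat, max_lon, max_lat) tuples and gazetteer boxes held in dictionaries (all names and the tolerance value are illustrative):

```python
# Illustrative classification of a record's bounding box into
# global / continental / national / regional scope.
def area(b):
    """Area of a (min_lon, min_lat, max_lon, max_lat) box in degrees^2."""
    return max(b[2] - b[0], 0) * max(b[3] - b[1], 0)

def contains(outer, inner):
    """True if outer fully contains inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def matches(b1, b2, tolerance=0.8):
    """Approximate match: the intersection covers most of both boxes."""
    ix = (max(b1[0], b2[0]), max(b1[1], b2[1]),
          min(b1[2], b2[2]), min(b1[3], b2[3]))
    inter = area(ix)
    return inter >= tolerance * area(b1) and inter >= tolerance * area(b2)

def classify_scope(bbox, continents, countries):
    """Apply the decision rules from the list above."""
    if all(contains(bbox, c) for c in continents.values()):
        return "global"        # bigger than the continents
    if any(matches(bbox, c) for c in continents.values()):
        return "continental"   # matches a continent
    if any(matches(bbox, c) for c in countries.values()):
        return "national"      # matches a country
    return "regional"          # smaller than a country
```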
EUSO-high-value dataset tagging
The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes the concept of metadata-based identification and tagging of soil degradation indicators. Each dataset (possibly only those with a supra-national spatial scope; under discussion) will be annotated with the potential soil degradation indicator for which it might be utilised. Users can then filter these datasets according to their specific needs.
The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status, see also the Table below. These methodologies will also be considered, as they may have an impact on the defined thresholds. This issue will be examined in greater detail in the future.
| Soil Degradation | Soil Indicator | Type of methodology for threshold |
|---|---|---|
| Soil erosion | Water erosion | RUSLE2015 |
| | Wind erosion | GIS-RWEQ |
| | Tillage erosion | SEDEM |
| | Harvest erosion | Textural index |
| | Post-fire recovery | USLE (type of RUSLE) |
| Soil pollution | Arsenic excess | GAMLSS-RF |
| | Copper excess | GLM and GPR |
| | Mercury excess | LUCAS topsoil database |
| | Zinc excess | LUCAS topsoil database |
| | Cadmium excess | GEMAS |
| Soil nutrients | Nitrogen surplus | NNB |
| | Phosphorus deficiency | LUCAS topsoil database |
| | Phosphorus excess | LUCAS topsoil database |
| Loss of soil organic carbon | Distance to maximum SOC level | qGAM |
| Loss of soil biodiversity | Potential threat to biological functions | Expert polling, questionnaire, data collection, normalization and analysis |
| Soil compaction | Packing density | Calculation of packing density (PD) |
| Salinization | Secondary salinization | - |
| Loss of organic soils | Peatland degradation | - |
| Soil consumption | Soil sealing | Raster remote sensing data |
Technically, we foresee the metadata tagging process as illustrated below. At first, a metadata record's title, abstract and keywords will be checked for the occurrence of specific values from the Soil Indicator and Soil Degradation codelists, such as Water erosion or Soil erosion (see the Table above). If found, the Soil Degradation Indicator Tag (the corresponding value from the Soil Degradation Codelist) will be displayed to indicate the suitability of the given dataset for soil indicator related analyses. Additionally, a search for the corresponding methodology will be conducted to see if the dataset is compliant with the EUSO Soil Health indicators presented in the EUSO Dashboard. If found, the tag EUSO High-value dataset will be added. In a later phase we plan to search for references to scientific methodology papers in the metadata record's links. The possibility of involving a more complex search using soil thesauri will also be explored. A minimal sketch of the codelist term search follows the diagram.
flowchart TD
subgraph ic[Indicators Search]
ti([Title Check]) ~~~ ai([Abstract Check])
ai ~~~ ki([Keywords Check])
end
subgraph Codelists
sd ~~~ si
end
subgraph M[Methodologies Search]
tiM([Title Check]) ~~~ aiM([Abstract Check])
kl([Links check]) ~~~ kM([Keywords Check])
end
m[(Metadata Record)] --> ic
m --> M
ic-- + ---M
sd[(Soil Degradation Codelist)] --> ic
si[(Soil Indicator Codelist)] --> ic
em[(EUSO Soil Methodologies list)] --> M
M --> et{{EUSO High-Value Dataset Tag}}
et --> m
ic --> es{{Soil Degradation Indicator Tag}}
es --> m
th[(Thesauri)]-- synonyms ---Codelists
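As a hedged illustration of the codelist term search (the record shape, codelist values and function name are assumptions for this sketch, not the planned implementation):

```python
# Illustrative scan of a record's text fields for codelist terms.
def find_tags(record: dict, codelist: set[str]) -> set[str]:
    """Return all codelist terms that occur in title, abstract or keywords."""
    text = " ".join([
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("keywords", [])),
    ]).lower()
    return {term for term in codelist if term.lower() in text}

soil_degradation = {"Soil erosion", "Soil pollution", "Soil compaction"}
record = {"title": "Pan-European soil erosion risk assessment",
          "abstract": "Modelled with RUSLE2015.", "keywords": []}
print(find_tags(record, soil_degradation))  # {'Soil erosion'}
```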
