Skip to content

Metadata Augmentation

Info

Current version: 0.1.0

Project: Metadata augmentation

Functionality

In this component scripting / NLP / LLM are used on a metadata record to augment metadata statements about the resource. Augmentations are stored on a dedicated augmentation table, indicating the process which produced it.

metadata-uri metadata-element source value proces date
https://geo.fi/data/ee44-aa22-33 spatial-scope 16.7,62.2,18,81.5 https://inspire.ec.europa.eu/metadata-codelist/SpatialScope/national spatial-scope-analyser 2024-07-04
https://geo.fi/data/abc1-ba27-67 soil-thread This dataset is used to evaluate Soil Compaction in Nuohous Sundström http://aims.fao.org/aos/agrovoc/c_7163 keyword-analyser 2024-06-28

For the first SoilWise prototype, the functionality of the Metadata Augmentation component comprises:

Automatic metadata generation

To generate metadata (data set and service metadata), activate the corresponding button(s) when setting up the theme for the transformation process. The steps are described here

Translation module

Many records arrive in a local language, SWR translates the main properties for the record: title and abstract into English, to offer a single language user experience. The translations are used in filtering and display of records.

The translation module builds on the EU translation service (API documentation at https://language-tools.ec.europa.eu/). Translations are stored in a database for reuse by the SWR. The EU translation returns asynchronous responses to translation requests, this means that translations may not yet be available after initial load of new data. A callback operation populates the database, from that moment a translation is available to SWR. The translation service uses 2-letter language codes, it means a translation from a 3-letter iso code (as used in for example iso19139:2007) to 2-letter code is required. The EU translation service has a limited set of translations from a certain to alternative language available, else returns an error.

Initial translation is triggered by a running harvester. The translations will then be available once the record is ingested to the triplestore and catalogue database in a followup step of the harvester.

Foreseen functionality

In the next iterations, Metadata augmentation component is foreseen to include the following additional functions:

Keyword matcher

Keywords are an important mechanism to filter and cluster records. But similar keywords need to be equal to be able to match them. This module evaluates keywords of existing records to make them equal in case of high similarity.

Analyses existing keywords on a metadata record. Two cases can be identified:

  • If a keyword, having a skos identifier, has a closeMatch or sameAs relation to a prefered keyword, the prefered keyword is used.
  • If an existing keyword, without skos identifier, matches a prefered keyword by (translated) string or synonym, then append the matched keyword (including skos identifier). Consider the risk of false positives.

To facilitate this use case the SWR contains a knowledge graph of prefered keywords in the soil domain with relations to alternative keywords, such as agrovoc, gemet, dpedia, iso. This knowledge graph is maintained at https://github.com/soilwise-he/soil-health-knowledge-graph. Agrovoc is multilingual, facilitating the translation case.

For metadata records which have not been analysed yet (in that iteration), the module extracts the records, for each keyword an analyses is made if it maches any of the prefered keywords, if so, the prefered keyword is added to the record.

Spatial Locator

Analyses existing keywords to find a relevant geography for the record, it then uses the GeoNames API to find spatial coordinates for the geography, which are inserted into the metadata record.

Spatial scope analyser

A script that analyses the spatial scope of a resource

The bounding box is matched to country bounding boxes

To understand if the dataset has a global, continental, national or regional scope

  • Retrieves all datasets (as iso19139 xml) from database (records table joined with augmentations) which:
    • have a bounding box
    • no spatial scope
    • in iso19139 format
  • For each record it compares the boundingbox to country bounding boxes:
    • if bigger then continents > global
    • If matches a continent > continental
    • if matches a country > national
    • if smaller > regional
  • result is written to as an augmentation in a dedicated table

EUSO-high-value dataset tagging

The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes the concept of soil degradation indicator metadata-based identification and tagging. Each dataset (possibly only those with the supra-national spatial scope - under discussion) will be annotated with a potential soil degradation indicator for which it might be utilised. Users can then filter these datasets according to their specific needs.

The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status, see also the Table below. These methodologies will also be considered, as they may have an impact on the defined thresholds. This issue will be examined in greater detail in the future.

Soil Degradation Soil Indicator Type of methodic for threshold
Soil erosion Water erosion RUSLE2015
Wind erosion GIS-RWEQ
Tillage erosion SEDEM
Harvest erosion Textural index
Post-fire recovery USLE (Type of RUSLE)
Soil pollution Arsenic excess GAMLSS-RF
Copper excess GLM and GPR
Mercury excess LUCAS topsoil database
Zinc Excess LUCAS topsoil database
Cadmium Excess GEMAS
Soil nutrients Nitrogen surplus NNB
Phosphorus deficiency LUCAS topsoil database
Phosphorus excess LUCAS topsoil database
Loss of soil organic carbon Distance to maximum SOC level qGAM
Loss of soil biodiversity Potential threat to biological functions Expert Polling, Questionnaire, Data Collection, Normalization and Analysis
Soil compaction Packing density Calculation of Packing Density (PD)
Salinization Secondary salinization -
Loss of organic soils Peatland degradation -
Soil consumption Soil sealing Raster remote sense data

Technically, we forsee the metadata tagging process as illustrated below. At first, metadata record's title, abstract and keywords will be checked for the occurence of specific values from the Soil Indicator and Soil Degradation Codelists, such as Water erosion or Soil erosion (see the Table above). If found, the Soil Degradation Indicator Tag (corresponding value from the Soil Degradation Codelist) will be displayed to indicate suitability of given dataset for soil indicator related analyses. Additionally, a search for corresponding methodology will be conducted to see if the dataset is compliant with the EUSO Soil Health indicators presented in the EUSO Dashboard. If found, the tag EUSO High-value dataset will be added. In later phase we assume search for references to Scientific Methodology papers in metadata record's links. Next, the possibility of involving a more complex search using soil thesauri will also be explored.

flowchart TD
    subgraph ic[Indicators Search]
        ti([Title Check]) ~~~ ai([Abstract Check])
        ai ~~~ ki([Keywords Check])
    end
    subgraph Codelists
        sd ~~~ si
    end
    subgraph M[Methodologies Search]
        tiM([Title Check]) ~~~ aiM([Abstract Check])
        kl([Links check]) ~~~ kM([Keywords Check])
    end
    m[(Metadata Record)] --> ic
    m --> M
    ic-- + ---M
    sd[(Soil Degradation Codelist)] --> ic
    si[(Soil Indicator Codelist)] --> ic
    em[(EUSO Soil Methodologies list)] --> M
    M --> et{{EUSO High-Value Dataset Tag}}
    et --> m
    ic --> es{{Soil Degradation Indicator Tag}}
    es --> m
    th[(Thesauri)]-- synonyms ---Codelists