Metadata Augmentation
Functionality
This component applies scripting, NLP, and LLM techniques to a metadata record to augment the metadata statements about the resource. Augmentations are stored in a dedicated augmentation table, each indicating the process that produced it.
metadata-uri | metadata-element | source | value | process | date |
---|---|---|---|---|---|
https://geo.fi/data/ee44-aa22-33 | spatial-scope | 16.7,62.2,18,81.5 | https://inspire.ec.europa.eu/metadata-codelist/SpatialScope/national | spatial-scope-analyser | 2024-07-04 |
https://geo.fi/data/abc1-ba27-67 | soil-thread | This dataset is used to evaluate Soil Compaction in Nuohous Sundström | http://aims.fao.org/aos/agrovoc/c_7163 | keyword-analyser | 2024-06-28 |
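As a minimal sketch of how such a table could be laid out and filled, the following uses Python's built-in `sqlite3` module. The table name `augmentations`, the snake_case column names, and the helper `add_augmentation` are assumptions made for illustration; the actual SWR schema may differ.

```python
import sqlite3

# Hypothetical schema mirroring the columns of the table above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS augmentations (
    metadata_uri     TEXT NOT NULL,
    metadata_element TEXT NOT NULL,
    source           TEXT,
    value            TEXT NOT NULL,
    process          TEXT NOT NULL,
    date             TEXT NOT NULL
)
"""


def add_augmentation(conn, metadata_uri, element, source, value, process, date):
    """Store one augmentation together with the process that produced it."""
    conn.execute(
        "INSERT INTO augmentations VALUES (?, ?, ?, ?, ?, ?)",
        (metadata_uri, element, source, value, process, date),
    )
    conn.commit()


# In-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
add_augmentation(
    conn,
    "https://geo.fi/data/ee44-aa22-33",
    "spatial-scope",
    "16.7,62.2,18,81.5",
    "https://inspire.ec.europa.eu/metadata-codelist/SpatialScope/national",
    "spatial-scope-analyser",
    "2024-07-04",
)

# Look up which elements were augmented for a record, and by which process.
rows = conn.execute(
    "SELECT metadata_element, process FROM augmentations WHERE metadata_uri = ?",
    ("https://geo.fi/data/ee44-aa22-33",),
).fetchall()
```

Keeping the producing process in each row makes it possible to re-run or roll back a single analyser without touching augmentations from other processes.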
For the first SoilWise prototype, the functionality of the Metadata Augmentation component comprises:
Automatic metadata generation
To generate metadata (data set and service metadata), activate the corresponding button(s) when setting up the theme for the transformation process. The steps are described here
Translation module
Many records arrive in a local language. SWR translates the main properties of a record, title and abstract, into English to offer a single-language user experience. The translations are used for filtering and displaying records.
The translation module builds on the EU translation service (API documentation at https://language-tools.ec.europa.eu/). Translations are stored in a database for reuse by the SWR. The EU translation service responds to translation requests asynchronously, so translations may not yet be available right after the initial load of new data. A callback operation populates the database; from that moment on, the translation is available to SWR. The translation service uses 2-letter language codes, so 3-letter ISO codes (as used in, for example, iso19139:2007) must first be converted to 2-letter codes. The EU translation service supports only a limited set of source and target language pairs and returns an error for any other combination.
The initial translation is triggered by a running harvester. The translations become available once the record is ingested into the triplestore and catalogue database in a follow-up step of the harvester.
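The language code conversion mentioned above can be sketched as follows. The mapping is a deliberately partial excerpt of ISO 639-2 to ISO 639-1 codes, and the function names and the request payload fields (`id`, `text`, `source`, `target`) are hypothetical; they do not reflect the actual request format of the EU translation service.

```python
# Partial, illustrative mapping from 3-letter ISO 639-2 codes (bibliographic
# and terminologic variants) to 2-letter ISO 639-1 codes. Extend as needed.
ISO3_TO_ISO2 = {
    "eng": "en",
    "fre": "fr", "fra": "fr",
    "ger": "de", "deu": "de",
    "dut": "nl", "nld": "nl",
    "fin": "fi",
    "spa": "es",
    "ita": "it",
}


def to_two_letter(code):
    """Return the 2-letter code for a language code, or None if unknown."""
    code = code.strip().lower()
    if len(code) == 2:
        return code  # already a 2-letter code
    return ISO3_TO_ISO2.get(code)


def translation_request(record_id, text, source_lang, target_lang="en"):
    """Build a hypothetical request payload, or None if nothing to translate."""
    source = to_two_letter(source_lang)
    if source is None or source == target_lang:
        return None  # unknown language, or text is already in the target language
    return {"id": record_id, "text": text, "source": source, "target": target_lang}
```

Returning `None` for unknown codes lets the harvester skip a record instead of sending a request that the service would reject.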
Foreseen functionality
In the next iterations, the Metadata Augmentation component is foreseen to include the following additional functions:
Keyword matcher
Keywords are an important mechanism to filter and cluster records, but similar keywords can only be matched when they are identical. This module evaluates the keywords of existing records and harmonises those that are highly similar.
The module analyses the existing keywords on a metadata record. Two cases can be identified:
- If a keyword with a SKOS identifier has a closeMatch or sameAs relation to a preferred keyword, the preferred keyword is used.
- If an existing keyword without a SKOS identifier matches a preferred keyword by (translated) string or synonym, the matched keyword (including its SKOS identifier) is appended. Consider the risk of false positives.
To facilitate this use case, the SWR contains a knowledge graph of preferred keywords in the soil domain with relations to alternative keywords from sources such as AGROVOC, GEMET, DBpedia, and ISO. This knowledge graph is maintained at https://github.com/soilwise-he/soil-health-knowledge-graph. AGROVOC is multilingual, which facilitates the translation case.
For metadata records that have not yet been analysed (in that iteration), the module extracts the records and checks, for each keyword, whether it matches any of the preferred keywords. If so, the preferred keyword is added to the record.
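The two matching cases could look roughly like the sketch below. The in-memory `PREFERRED` list stands in for the knowledge graph; its structure, the `example.org` identifier, the synonyms, and the label attached to the AGROVOC URI (taken from the example table at the top of this page) are assumptions for illustration only.

```python
# Hypothetical excerpt of a preferred-keyword graph: each preferred keyword has
# a SKOS identifier, a set of labels/synonyms, and identifiers of alternative
# keywords that are related to it via closeMatch or sameAs.
PREFERRED = [
    {
        "uri": "http://aims.fao.org/aos/agrovoc/c_7163",
        "label": "soil compaction",  # label assumed for this sketch
        "synonyms": {"soil compaction", "compaction of soil"},
        "close_match": {"https://example.org/thesaurus/soil-compaction"},
    },
]


def match_keyword(keyword):
    """Return the preferred keyword for a record keyword, or None.

    `keyword` is a dict with a "label" and an optional "uri".
    """
    uri = keyword.get("uri")
    label = keyword.get("label", "").strip().lower()
    for pref in PREFERRED:
        # Case 1: the keyword has an identifier related to a preferred keyword.
        if uri and (uri == pref["uri"] or uri in pref["close_match"]):
            return pref
        # Case 2: no identifier, so fall back to string or synonym matching.
        if not uri and label in pref["synonyms"]:
            return pref
    return None


def augment_keywords(keywords):
    """Return the preferred keywords (uri -> label) to add to a record."""
    matched = {}
    for kw in keywords:
        pref = match_keyword(kw)
        # Skip keywords that already are the preferred keyword.
        if pref and pref["uri"] != kw.get("uri"):
            matched[pref["uri"]] = pref["label"]
    return matched
```

Exact, case-insensitive string comparison keeps the false-positive risk low; fuzzy matching would find more candidates but would need a similarity threshold and manual review.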
Spatial Locator
The Spatial Locator analyses existing keywords to find a relevant geography for the record. It then uses the GeoNames API to find spatial coordinates for that geography, which are inserted into the metadata record.
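A lookup against the GeoNames search web service might be sketched as below. It assumes the `searchJSON` endpoint with the `q`, `maxRows` and `username` parameters, and a response containing a `geonames` list whose entries carry `lat` and `lng` values; verify both against the GeoNames documentation before relying on them. The function names are hypothetical, and a registered GeoNames username is required for real requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_search_url(place, username):
    """Build a GeoNames search URL asking for the single best match."""
    return GEONAMES_SEARCH + "?" + urlencode(
        {"q": place, "maxRows": 1, "username": username}
    )


def parse_coordinates(payload):
    """Extract (lat, lon) of the first hit from a search response, or None."""
    hits = payload.get("geonames", [])
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lng"])


def locate(place, username):
    """Query GeoNames for a place name (requires network access)."""
    with urlopen(build_search_url(place, username), timeout=10) as response:
        return parse_coordinates(json.load(response))
```

Separating URL building and response parsing from the network call makes the logic testable without contacting the service.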
Spatial scope analyser
A script that analyses the spatial scope of a resource. The bounding box is matched against country bounding boxes to understand whether the dataset has a global, continental, national or regional scope.
- Retrieves all datasets (as iso19139 XML) from the database (records table joined with augmentations) which:
    - have a bounding box
    - have no spatial scope
    - are in iso19139 format
- For each record, it compares the bounding box to country bounding boxes:
    - if bigger than continents > global
    - if it matches a continent > continental
    - if it matches a country > national
    - if smaller > regional
- The result is written as an augmentation to a dedicated table
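The comparison step can be sketched as follows. This is one possible reading of the rules above, not the actual script: "matches" is interpreted as an intersection-over-union (IoU) of at least 0.8, the reference bounding boxes are rough illustrative values, and boxes are given as `(west, south, east, north)` in degrees. Boxes crossing the antimeridian are not handled.

```python
# Rough, illustrative reference boxes as (west, south, east, north).
CONTINENTS = {"Europe": (-25.0, 34.0, 45.0, 72.0)}
COUNTRIES = {"Finland": (20.5, 59.8, 31.6, 70.1)}


def area(box):
    """Area of a bounding box in square degrees."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two bounding boxes (0.0 to 1.0)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    intersection = max(0.0, width) * max(0.0, height)
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


def classify(bbox, threshold=0.8):
    """Classify a bounding box as global, continental, national or regional."""
    if any(iou(bbox, ref) >= threshold for ref in CONTINENTS.values()):
        return "continental"
    if area(bbox) > max(area(ref) for ref in CONTINENTS.values()):
        return "global"
    if any(iou(bbox, ref) >= threshold for ref in COUNTRIES.values()):
        return "national"
    # Everything else, including boxes spanning several countries, falls
    # through to "regional" in this sketch.
    return "regional"
```

The continental check runs before the global check so that a box only slightly larger than a continent is not classified as global.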
EUSO-high-value dataset tagging
The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes the concept of soil degradation indicator metadata-based identification and tagging. Each dataset (possibly only those with the supra-national spatial scope - under discussion) will be annotated with a potential soil degradation indicator for which it might be utilised. Users can then filter these datasets according to their specific needs.
The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status, see also the Table below. These methodologies will also be considered, as they may have an impact on the defined thresholds. This issue will be examined in greater detail in the future.
Soil Degradation | Soil Indicator | Type of methodology for threshold |
---|---|---|
Soil erosion | Water erosion | RUSLE2015 |
Soil erosion | Wind erosion | GIS-RWEQ |
Soil erosion | Tillage erosion | SEDEM |
Soil erosion | Harvest erosion | Textural index |
Soil erosion | Post-fire recovery | USLE (Type of RUSLE) |
Soil pollution | Arsenic excess | GAMLSS-RF |
Soil pollution | Copper excess | GLM and GPR |
Soil pollution | Mercury excess | LUCAS topsoil database |
Soil pollution | Zinc excess | LUCAS topsoil database |
Soil pollution | Cadmium excess | GEMAS |
Soil nutrients | Nitrogen surplus | NNB |
Soil nutrients | Phosphorus deficiency | LUCAS topsoil database |
Soil nutrients | Phosphorus excess | LUCAS topsoil database |
Loss of soil organic carbon | Distance to maximum SOC level | qGAM |
Loss of soil biodiversity | Potential threat to biological functions | Expert Polling, Questionnaire, Data Collection, Normalization and Analysis |
Soil compaction | Packing density | Calculation of Packing Density (PD) |
Salinization | Secondary salinization | - |
Loss of organic soils | Peatland degradation | - |
Soil consumption | Soil sealing | Raster remote sensing data |
Technically, we foresee the metadata tagging process as illustrated below. First, the metadata record's title, abstract and keywords will be checked for the occurrence of specific values from the Soil Indicator and Soil Degradation Codelists, such as "Water erosion" or "Soil erosion" (see the Table above). If found, the "Soil Degradation Indicator Tag" (the corresponding value from the Soil Degradation Codelist) will be displayed to indicate the suitability of the given dataset for soil indicator related analyses. Additionally, a search for the corresponding methodology will be conducted to see whether the dataset is compliant with the EUSO Soil Health indicators presented in the EUSO Dashboard. If found, the tag "EUSO High-value dataset" will be added. In a later phase, we expect to search for references to scientific methodology papers in the metadata record's links. The possibility of a more complex search using soil thesauri will also be explored.
flowchart TD
subgraph ic[Indicators Search]
ti([Title Check]) ~~~ ai([Abstract Check])
ai ~~~ ki([Keywords Check])
end
subgraph Codelists
sd ~~~ si
end
subgraph M[Methodologies Search]
tiM([Title Check]) ~~~ aiM([Abstract Check])
kl([Links check]) ~~~ kM([Keywords Check])
end
m[(Metadata Record)] --> ic
m --> M
ic-- + ---M
sd[(Soil Degradation Codelist)] --> ic
si[(Soil Indicator Codelist)] --> ic
em[(EUSO Soil Methodologies list)] --> M
M --> et{{EUSO High-Value Dataset Tag}}
et --> m
ic --> es{{Soil Degradation Indicator Tag}}
es --> m
th[(Thesauri)]-- synonyms ---Codelists
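The first phase of the tagging process (codelist and methodology search in title, abstract and keywords) could be sketched as below. The dictionaries are small excerpts of the table above; the record structure, the function names, and the whole-phrase matching rule are assumptions for illustration. The links check and thesaurus-based synonyms are left out.

```python
import re

# Excerpts from the table above: indicator -> degradation, indicator -> methodology.
INDICATORS = {
    "Water erosion": "Soil erosion",
    "Wind erosion": "Soil erosion",
    "Packing density": "Soil compaction",
    "Soil sealing": "Soil consumption",
}
METHODOLOGIES = {"Water erosion": "RUSLE2015", "Wind erosion": "GIS-RWEQ"}


def contains(text, term):
    """Case-insensitive whole-phrase search for a term in a text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def tag_record(record):
    """Return (soil degradation tags, EUSO high-value flag) for a record."""
    text = " ".join(
        [record.get("title", ""), record.get("abstract", ""), *record.get("keywords", [])]
    )
    tags = set()
    high_value = False
    for indicator, degradation in INDICATORS.items():
        if contains(text, indicator) or contains(text, degradation):
            tags.add(degradation)
            # Only look for a methodology once an indicator has been found.
            method = METHODOLOGIES.get(indicator)
            if method and contains(text, method):
                high_value = True
    return sorted(tags), high_value


example = {
    "title": "Water erosion in Europe",
    "abstract": "Soil loss estimated with the RUSLE2015 model.",
    "keywords": ["soil"],
}
result = tag_record(example)
```

Plain phrase matching is easy to audit but misses synonyms and translations, which is where the thesauri shown in the flowchart would come in.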