Metadata Augmentation
Introduction
Overview and Scope
This set of components augments metadata statements using various techniques. Augmentations are stored in a dedicated augmentation table, together with an indication of the process that produced them. The statements are combined with the ingested content to offer users an optimal catalogue experience.
At the moment, Metadata Augmentation functionality is covered by the following components:
- Keyword Matcher
- Element Matcher
- Translation Module
- Link Liveliness Assessment
- Spatial Locator
- Metadata Interlinker
Upcoming components are described under Foreseen functionality below.
Intended Audience
Metadata Augmentation is a backend component providing outputs which users can see displayed in the Metadata Catalogue. Therefore the intended audience corresponds to that of the Metadata Catalogue. Additionally, we expect a maintenance role:
- SWC Administrator: monitors the augmentation processes, with access to history, logs and statistics. Administrators can manually start a specific augmentation process.
Keyword Matcher
Info
Current version: 0.2.0
Technology: Python
Release: https://doi.org/10.5281/zenodo.14924181
Projects: Keyword matcher
Overview and Scope
Keywords are an important mechanism for filtering and clustering records. Similar keywords need to be aligned so that records can be matched. This module evaluates the keywords of existing records and harmonises them in case of high similarity.
The module analyses the existing keywords on a metadata record. Two cases can be identified:
- If a keyword with a SKOS identifier has a closeMatch or sameAs relation to a preferred keyword, the preferred keyword is used.
- If an existing keyword without a SKOS identifier matches a preferred keyword by (translated) string or synonym, the matched keyword (including its SKOS identifier) is appended.
To facilitate this use case the SWR contains a knowledge graph of preferred keywords in the soil domain derived from AGROVOC, GEMET and ISO 11074. This knowledge graph is maintained at https://github.com/soilwise-he/soil-health-knowledge-graph. These vocabularies are multilingual, facilitating the translation case.
For metadata records which have not yet been analysed (in that iteration), the module extracts the keywords. Each keyword is analysed to determine whether it matches any of the preferred keywords; if so, the preferred keyword is added to the augmentation results for that record. For string matching a fuzzy match algorithm is used, requiring a 90% match (configurable). Translations are matched using the metadata language as indicated in the record.
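As an illustration, the fuzzy string match could look like the minimal sketch below. It assumes the rapidfuzz library and a simple in-memory dictionary of preferred keywords; the function name, data shapes and the example URI are illustrative, not the actual implementation.

```python
# Illustrative sketch only; the real module reads preferred keywords from
# the soil health knowledge graph and stores matches as augmentations.
from rapidfuzz import fuzz

MATCH_THRESHOLD = 90  # the configurable 90% match mentioned above

def match_keyword(keyword: str, preferred: dict[str, str]) -> str | None:
    """Return the SKOS identifier of the best-matching preferred keyword,
    or None if no candidate reaches the threshold."""
    best_id, best_score = None, 0.0
    for skos_id, label in preferred.items():
        score = fuzz.ratio(keyword.lower(), label.lower())  # 0..100
        if score >= MATCH_THRESHOLD and score > best_score:
            best_id, best_score = skos_id, score
    return best_id

# Example usage with a single (placeholder) preferred keyword entry
preferred = {"https://example.org/skos/soil-erosion": "soil erosion"}
print(match_keyword("Soil Erosion", preferred))
```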
Key Features
Architecture
Technological Stack
| Technology | Description |
|---|---|
| Python | Used for the keyword matching and database interactions. |
| PostgreSQL | Primary database for storing and managing information. |
| Docker | Used for containerizing the application, ensuring consistent deployment across environments. |
| CI/CD | Automated pipeline for continuous integration and deployment, with scheduled daily runs. |
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
- The process runs as a CI/CD pipeline at daily intervals.
Risks & Limitations
Element Matcher
Overview and Scope
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Translation Module
Overview and Scope
Some records arrive in a local language. The SWR translates the main properties of the record, title and abstract, into English to offer a single-language user experience. The translations are used in filtering and display of records.
The translation module builds on the EU translation service (API documentation at https://language-tools.ec.europa.eu/). Translations are stored in a database for reuse by the SWR.
The EU translation service responds asynchronously to translation requests, which means that translations may not yet be available after the initial load of new data. A callback operation populates the database; from that moment a translation is available to the SWR. The translation service uses 2-letter language codes, so a mapping from 3-letter ISO codes (as used in, for example, iso19139:2007) to 2-letter codes is required. The EU translation service supports only a limited set of language pairs and returns an error for unsupported combinations.
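For illustration, the 3-letter to 2-letter mapping could be as simple as a lookup table; this is a minimal sketch with a few example entries, not the module's actual mapping.

```python
# Partial mapping from 3-letter ISO 639-2 codes (as used in iso19139
# records) to the 2-letter codes expected by the EU translation service.
# Illustrative sketch; a production mapping would cover all languages.
ISO639_2_TO_1 = {
    "eng": "en",
    "deu": "de", "ger": "de",  # ISO 639-2/T and /B variants
    "fra": "fr", "fre": "fr",
    "nld": "nl", "dut": "nl",
    "spa": "es",
    "ita": "it",
}

def to_two_letter(code: str) -> str:
    """Normalise a metadata language code to the 2-letter form."""
    code = code.strip().lower()
    if len(code) == 2:
        return code
    if code in ISO639_2_TO_1:
        return ISO639_2_TO_1[code]
    raise ValueError(f"No 2-letter mapping known for {code!r}")
```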
Initial translation is triggered by a running harvester. The translations will then be available once the record is ingested into the triplestore and catalogue database in a follow-up step of the harvester.
Key Features
Architecture
Technological Stack
| Technology | Description |
|---|---|
| Python | Used for the translation module, API development, and database interactions. |
| PostgreSQL | Primary database for storing and managing information. |
| FastAPI | Employed to create and expose REST API endpoints. Utilizes FastAPI's efficiency and auto-generated Swagger documentation. |
| Docker | Used for containerizing the application, ensuring consistent deployment across environments. |
| CI/CD | Automated pipeline for continuous integration and deployment, with scheduled daily runs. |
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Link Liveliness Assessment
Info
Current version: 1.1.4
Technology: Python, FastAPI
Release: https://doi.org/10.5281/zenodo.14923790
Projects: Link liveliness assessment
Overview and Scope
Metadata (and data and knowledge sources) tend to contain links to other resources. Not all of these URIs are persistent, so they can degrade over time. In practice, many non-persistent knowledge sources and assets exist that could be relevant for the SWR, e.g. on project websites, in online databases, or on the computers of researchers. Links pointing to such assets might, however, be part of harvested metadata records or of data and content stored in the SWC.
The Link Liveliness Assessment (LLA) component runs over the available links stored with the SWC assets and checks their status. The function is foreseen to run frequently over the URIs in the SWC, assessing and storing the status of each link.
A link in a metadata record typically points to one of:
- another metadata record
- a downloadable instance (pdf/zip/sqlite/mp4/pptx) of the resource
- the resource itself
- documentation about the resource
- an identifier of the resource (e.g. a DOI)
- a webservice or API (sparql, openapi, graphql, ogc-api)
The linkchecker evaluates, for a set of metadata records, whether:
- the links to external sources are valid
- the links within the repository are valid
- the link metadata accurately represents the resource (MIME type, size, data model, access constraints)
While evaluating the context of a link, the LLA component may derive some contextual metadata, which can augment the metadata record. These results are stored in the metadata augmentation table. The metadata aspects derived are file size and file format.
Key Features
The LLA component provides the following functions:
- Link validation: Returns HTTP status codes for each link, along with other important information such as the parent URL, any warnings, and the date and time of the test. Additionally, the tool enhances link analysis by identifying various metadata attributes, including file format type (e.g., image/jpeg, application/pdf, text/html), file size (in bytes), and last modification date. This provides users with valuable insights about the resource before accessing it.
- Broken link categorization: Identifies and categorizes broken links based on status codes, including Redirection Errors, Client Errors, and Server Errors.
- Deprecated links identification: Flags links as deprecated if they have failed for X consecutive tests; in our case X equals 10. Deprecated links are excluded from future tests to optimize performance (see the sketch at the end of this section).
- Timeout management: Allows the identification of URLs that exceed a timeout threshold which can be set manually as a parameter in linkchecker's properties.
- Availability monitoring: When run periodically, the tool builds a history of availability for each URL, enabling users to view the status of links over time.
- OWS services (WMS, WFS, WCS, CSW) typically return an HTTP 500 error when called without the necessary parameters. Handling for these services has been applied in order to detect and include the necessary parameters before the link is checked (see the sketch below).
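A hedged sketch of that OWS handling is shown below; the detection heuristic and parameter handling are illustrative assumptions, not the dedicated script itself.

```python
# Illustrative: OGC OWS endpoints need service/request parameters to avoid
# an HTTP 500, so a GetCapabilities request is composed before checking.
from urllib.parse import parse_qs, urlencode, urlparse

OWS_SERVICES = ("WMS", "WFS", "WCS", "CSW")

def prepare_ows_url(url: str) -> str:
    """Append minimal GetCapabilities parameters to suspected OWS URLs."""
    parsed = urlparse(url)
    params = {k.lower() for k in parse_qs(parsed.query)}
    for service in OWS_SERVICES:
        if service.lower() in url.lower() and "request" not in params:
            extra = urlencode({"service": service, "request": "GetCapabilities"})
            separator = "&" if parsed.query else "?"
            return f"{url}{separator}{extra}"
    return url

# Example: a bare WMS endpoint gets the parameters it needs
print(prepare_ows_url("https://example.org/geoserver/wms"))
```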
A JavaScript widget is further used to display the link status directly in the SoilWise Metadata Catalogue record.
The API can be used to identify which records have broken links.
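The deprecation rule described above can be sketched as a simple counter; the field names follow the database design shown later, everything else is illustrative.

```python
# Illustrative counter for the deprecation rule (X = 10 consecutive failures).
MAX_CONSECUTIVE_FAILURES = 10

def update_link_after_test(link: dict, test_passed: bool) -> dict:
    """Reset or increment the failure counter and flag deprecated links."""
    if test_passed:
        link["consecutive_failures"] = 0
    else:
        link["consecutive_failures"] = link.get("consecutive_failures", 0) + 1
        if link["consecutive_failures"] >= MAX_CONSECUTIVE_FAILURES:
            link["deprecated"] = True  # skipped by future test runs
    return link
```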
Architecture
Technological Stack
| Technology | Description |
|---|---|
| Python | Used for the linkchecker integration, API development, and database interactions. |
| PostgreSQL | Primary database for storing and managing link information. |
| FastAPI | Employed to create and expose REST API endpoints. Utilizes FastAPI's efficiency and auto-generated Swagger documentation. |
| Docker | Used for containerizing the application, ensuring consistent deployment across environments. |
| CI/CD | Automated pipeline for continuous integration and deployment, with scheduled weekly runs for link liveliness assessment. |
Main Component Diagram
flowchart LR
H["Harvester"]-- "writes" -->MR[("Record Table")]
MR-- "reads" -->LAA["Link Liveliness Assessment"]
MR-- "reads" -->CA["Catalogue"]
LAA-- "writes" -->LLAL[("Links Table")]
LAA-- "writes" -->LLAVH[("Validation History Table")]
CA-- "reads" -->API["**API**"]
LLAL-- "writes" -->API
LLAVH-- "writes" -->API
Main Sequence Diagram
sequenceDiagram
participant Linkchecker
participant DB
participant Catalogue
Linkchecker->>DB: Establish Database Connection
Linkchecker->>Catalogue: Extract Relevant URLs
loop URL Processing
Linkchecker->DB: Check URL Existence
Linkchecker->DB: Check Deprecation Status
alt URL Not Deprecated
Linkchecker-->DB: Insert/Update Records
Linkchecker-->DB: Insert/Update Links with file format type, size, last_modified
Linkchecker-->DB: Update Validation History
else URL Deprecated
Linkchecker-->DB: Skip Processing
end
end
Linkchecker->>DB: Close Database Connection
Database Design
classDiagram
Links <|-- Validation_history
Links <|-- Records
Links : +Int ID
Links : +Int fk_records
Links : +String Urlname
Links : +String deprecated
Links : +String link_type
Links : +Int link_size
Links : +DateTime last_modified
Links : +String Consecutive_failures
class Records{
+Int ID
+String Records
}
class Validation_history{
+Int ID
+Int fk_link
+String Statuscode
+String isRedirect
+String Errormessage
+Date Timestamp
}
Integrations & Interfaces
- Visualisation of the evaluation in the Metadata Catalogue; the assessment report is retrieved via AJAX from each record page.
- The FastAPI service now incorporates additional metadata for links, including file format type, size, and last modified date.
Key Architectural Decisions
- Initially we started with the linkchecker library, but performance was really slow because it tested the same links for each page again and again.
- We decided to only test the links section of ogc-api:records; this means that links within, for example, the metadata abstract are no longer tested.
- OGC OWS services make up a substantial portion of links; these services return error 500 if called without parameters. For this scenario we created a dedicated script.
- If tests for a resource fail a number of times, the resource is no longer tested and is tagged as deprecated.
- Links via a facade, such as DOI, are followed to the page they refer to. This means the LLA tool can understand the relation between a DOI and the page it refers to.
- For each link it is known on which record(s) it is mentioned, so if a broken link occurs, we can find a contact to notify in the record.
For the second release we have enhanced the Link Liveliness Assessment tool to collect more information about the resources:
- File type format (media type) to help users understand what format they'll be accessing (e.g., image/jpeg, application/pdf, text/html)
- File size to inform users about download expectations
- Last modification date to indicate how recent the resource is
API Updates: The API has been extended to include the newly tracked metadata fields:
- link_type: Shows the file format type of the resource (e.g., image/jpeg, application/pdf)
- link_size: Indicates the size of the resource in bytes
- last_modified: Provides the timestamp when the resource was modified
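As a hypothetical usage example (the base URL and endpoint path below are placeholders; only the three field names are documented above), a client could read these fields as follows:

```python
# Hypothetical client; endpoint path and base URL are placeholders.
import requests

BASE_URL = "https://example.org/lla"  # not the real deployment URL

response = requests.get(f"{BASE_URL}/links", timeout=10)
response.raise_for_status()
for link in response.json():
    # link_type, link_size and last_modified are the documented fields
    print(link.get("link_type"), link.get("link_size"), link.get("last_modified"))
```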
Risks & Limitations
Spatial Locator
Overview and Scope
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Metadata Interlinker
Overview and Scope
To be able to provide interlinked data and knowledge assets (e.g. a dataset, the project in which it was generated, and the operating procedure used), links between metadata must be identified and registered, ideally as part of the SWR Triple Store.
We distinguish between explicit and implicit links:
- Explicit links can be directly derived from the data and/or metadata. E.g. projects in CORDIS are explicitly linked to documents and datasets.
- Implicit links cannot be directly derived from the (meta)data. They may be derived from spatial or temporal extent, keyword usage, or a shared author/publisher.
The SWC implements the interlinking of data and knowledge assets based on explicit links found in the harvested metadata. The harvesting processes implemented in the SWC have been extended with this function to detect such linkages, store them in the repository and add them to the SWC knowledge graph. This allows, for example, exposing this additional information to the UI for display and linking, as well as to other functions.
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Foreseen functionality
In the next iterations, the Metadata Augmentation component is foreseen to include the following additional functions:
Keyword extraction
The value of relevant keywords is often underestimated by data producers. This proof-of-concept module evaluates the metadata title/abstract to identify relevant keywords using NLP/NER technology. Integration with the catalogue is foreseen.
Spatial scope analyser
A module that is foreseen to analyse the spatial scope of a resource. The bounding box will be matched to country or continental bounding boxes using a gazetteer, to understand whether the dataset has a global, continental, national or regional scope.
- Retrieves all datasets (as iso19139 XML) from the database (records table joined with augmentations) which:
  - have a bounding box
  - have no spatial scope yet
  - are in iso19139 format
- For each record it compares the bounding box to country bounding boxes:
  - if bigger than the continents > global
  - if it matches a continent > continental
  - if it matches a country > national
  - if smaller > regional
- The result is written as an augmentation to a dedicated table (a minimal sketch of this classification follows below).
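A minimal sketch of this foreseen classification, assuming bounding boxes as (min_lon, min_lat, max_lon, max_lat) tuples and gazetteer boxes held in dictionaries (all names and the tolerance value are illustrative):

```python
# Illustrative classification of a record's bounding box into
# global / continental / national / regional scope.
def area(b):
    """Area of a (min_lon, min_lat, max_lon, max_lat) box in degrees^2."""
    return max(b[2] - b[0], 0) * max(b[3] - b[1], 0)

def contains(outer, inner):
    """True if outer fully contains inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def matches(b1, b2, tolerance=0.8):
    """Approximate match: the intersection covers most of both boxes."""
    ix = (max(b1[0], b2[0]), max(b1[1], b2[1]),
          min(b1[2], b2[2]), min(b1[3], b2[3]))
    inter = area(ix)
    return inter >= tolerance * area(b1) and inter >= tolerance * area(b2)

def classify_scope(bbox, continents, countries):
    """Apply the decision rules from the list above."""
    if all(contains(bbox, c) for c in continents.values()):
        return "global"        # bigger than the continents
    if any(matches(bbox, c) for c in continents.values()):
        return "continental"   # matches a continent
    if any(matches(bbox, c) for c in countries.values()):
        return "national"      # matches a country
    return "regional"          # smaller than a country
```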
EUSO-high-value dataset tagging
The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes the concept of metadata-based identification and tagging of soil degradation indicators. Each dataset (possibly only those with a supra-national spatial scope; under discussion) will be annotated with the potential soil degradation indicator for which it might be utilised. Users can then filter these datasets according to their specific needs.
The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status, see also the Table below. These methodologies will also be considered, as they may have an impact on the defined thresholds. This issue will be examined in greater detail in the future.
| Soil Degradation | Soil Indicator | Type of methodology for threshold |
|---|---|---|
| Soil erosion | Water erosion | RUSLE2015 |
| | Wind erosion | GIS-RWEQ |
| | Tillage erosion | SEDEM |
| | Harvest erosion | Textural index |
| | Post-fire recovery | USLE (type of RUSLE) |
| Soil pollution | Arsenic excess | GAMLSS-RF |
| | Copper excess | GLM and GPR |
| | Mercury excess | LUCAS topsoil database |
| | Zinc excess | LUCAS topsoil database |
| | Cadmium excess | GEMAS |
| Soil nutrients | Nitrogen surplus | NNB |
| | Phosphorus deficiency | LUCAS topsoil database |
| | Phosphorus excess | LUCAS topsoil database |
| Loss of soil organic carbon | Distance to maximum SOC level | qGAM |
| Loss of soil biodiversity | Potential threat to biological functions | Expert polling, questionnaire, data collection, normalization and analysis |
| Soil compaction | Packing density | Calculation of packing density (PD) |
| Salinization | Secondary salinization | - |
| Loss of organic soils | Peatland degradation | - |
| Soil consumption | Soil sealing | Raster remote sensing data |
Technically, we foresee the metadata tagging process as illustrated below. At first, a metadata record's title, abstract and keywords will be checked for the occurrence of specific values from the Soil Indicator and Soil Degradation codelists, such as Water erosion or Soil erosion (see the Table above). If found, the Soil Degradation Indicator Tag (the corresponding value from the Soil Degradation Codelist) will be displayed to indicate the suitability of the given dataset for soil indicator related analyses. Additionally, a search for the corresponding methodology will be conducted to see if the dataset is compliant with the EUSO Soil Health indicators presented in the EUSO Dashboard. If found, the tag EUSO High-value dataset will be added. In a later phase we plan to search for references to scientific methodology papers in the metadata record's links. The possibility of involving a more complex search using soil thesauri will also be explored. A minimal sketch of the codelist term search follows the diagram.
flowchart TD
subgraph ic[Indicators Search]
ti([Title Check]) ~~~ ai([Abstract Check])
ai ~~~ ki([Keywords Check])
end
subgraph Codelists
sd ~~~ si
end
subgraph M[Methodologies Search]
tiM([Title Check]) ~~~ aiM([Abstract Check])
kl([Links check]) ~~~ kM([Keywords Check])
end
m[(Metadata Record)] --> ic
m --> M
ic-- + ---M
sd[(Soil Degradation Codelist)] --> ic
si[(Soil Indicator Codelist)] --> ic
em[(EUSO Soil Methodologies list)] --> M
M --> et{{EUSO High-Value Dataset Tag}}
et --> m
ic --> es{{Soil Degradation Indicator Tag}}
es --> m
th[(Thesauri)]-- synonyms ---Codelists
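As a hedged illustration of the codelist term search (the record shape, codelist values and function name are assumptions for this sketch, not the planned implementation):

```python
# Illustrative scan of a record's text fields for codelist terms.
def find_tags(record: dict, codelist: set[str]) -> set[str]:
    """Return all codelist terms that occur in title, abstract or keywords."""
    text = " ".join([
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("keywords", [])),
    ]).lower()
    return {term for term in codelist if term.lower() in text}

soil_degradation = {"Soil erosion", "Soil pollution", "Soil compaction"}
record = {"title": "Pan-European soil erosion risk assessment",
          "abstract": "Modelled with RUSLE2015.", "keywords": []}
print(find_tags(record, soil_degradation))  # {'Soil erosion'}
```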
