Harvester
Info
Current version: 0.3.0
Technology: Git pipelines
Release: https://doi.org/10.5281/zenodo.14923562
Project repository: Harvesters
The Harvester component automatically harvests remote sources to populate the SWR with metadata on datasets and knowledge sources.
Introduction
Overview and Scope
Metadata harvesting is the process of ingesting metadata, i.e. evidence on data and knowledge, from remote sources and storing it locally in the catalogue for fast searching. It is a scheduled process, so the local copy and the remote metadata are kept aligned. Various components exist which are able to harvest metadata from various (standardised) APIs. SoilWise aims to use existing components where available.
The harvesting mechanism relies on the concept of a universally unique identifier (UUID) or unique resource identifier (URI), commonly assigned by the metadata creator or publisher. Another important concept behind harvesting is the last change date: every time a metadata record is changed, its last change date is updated. Storing this parameter and comparing it with the remote value allows any system to find out whether a metadata record has been modified since the last update. An exception is metadata that is removed remotely; the SoilWise Catalogue can only detect that fact by harvesting the full remote content. Discussion is needed to understand whether SWR should keep a copy of the remote source anyway, for archiving purposes.
A harvesting task therefore typically extracts only records with an update date later than the last successful harvester run, provided the remote system supports such a filter; otherwise the full set is harvested.
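As a sketch, this incremental selection could look as follows (the record structure and the `select_changed` helper are illustrative, not actual SWR code):

```python
from datetime import datetime

def select_changed(records, last_successful_run):
    """Keep only records changed since the last successful harvester run.
    When no previous run is known (first run), the full set is harvested."""
    if last_successful_run is None:
        return list(records)
    return [r for r in records if r["last_changed"] > last_successful_run]

# Hypothetical remote records with their last change dates
records = [
    {"id": "a", "last_changed": datetime(2024, 1, 10)},
    {"id": "b", "last_changed": datetime(2024, 3, 5)},
]
changed = select_changed(records, last_successful_run=datetime(2024, 2, 1))
# only record "b" was modified after the last run
```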
Local improvements to metadata records should be stored separately from the harvested content for the following reasons:
- Harvesting is periodic, so any local change to harvested metadata will be lost during the next run.
- The change date is used to keep track of changes, so if the metadata is changed locally, the harvesting mechanism may be compromised.
Potentially, if inconsistencies in imported metadata are identified, we can add a statement about them to the graph. We can also notify the author, so they can fix the inconsistency on their side.
On top of unique identification, SWR also captures a calculated string (a hash) of the harvested content. This allows changes to be identified even if the update date has not changed.
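A minimal sketch of such hash-based change detection, assuming the harvested content is available as a string:

```python
import hashlib

def content_hash(record: str) -> str:
    """Stable hash of the harvested content; comparing hashes reveals
    changes even when the remote update date was not bumped."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

stored = content_hash("<metadata><title>Soil map</title></metadata>")
incoming = content_hash("<metadata><title>Soil map v2</title></metadata>")
changed = stored != incoming  # True: the content differs
```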
Typical tasks of a harvester:
- Define a harvester job
- Schedule (on request, weekly, daily, hourly)
- Endpoint / Endpoint type (example.com/csw -> OGC:CSW)
- Apply a filter (only records with keyword='soil-mission')
- Understand success of a harvest job
- overview of harvested content (120 records)
- which runs failed, and why? (today failed -> log, yesterday successful -> log)
- Monitor running harvesters (20% done -> cancel)
- Define behaviours on harvested content
- skip records with low quality (if test xxx fails)
- mint identifier if missing ( https://example.com/data/{uuid} )
- a model transformation before ingestion ( example-transform.xsl / do-something.py )
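These tasks could translate into a job definition along the following lines (a sketch only; every field name is illustrative rather than an actual SWR schema):

```python
# Hypothetical harvester job definition; every field name is
# illustrative, not an actual SWR configuration schema.
harvest_job = {
    "name": "example-csw",
    "endpoint": "https://example.com/csw",
    "endpoint_type": "OGC:CSW",
    "schedule": "weekly",                    # on request / weekly / daily / hourly
    "filter": {"keyword": "soil-mission"},   # only records with this keyword
    "behaviours": {
        "skip_if_test_fails": "xxx",                           # quality gate
        "mint_identifier": "https://example.com/data/{uuid}",  # if missing
        "transform": "example-transform.xsl",                  # before ingestion
    },
}
```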
Intended Audience
Harvester is a backend component, therefore we only expect a maintenance role:
- SWC Administrator monitoring the health status, logs, etc. Administrators can manually start specific harvesting pipelines.
Key features
The Harvester component currently comprises the following functions:
- Harvest records from metadata and knowledge resources
- Metadata RDF Turtle Serialization
- RDF to Triple Store
- Duplication Identification
Resource Types
Metadata for the following resource types is foreseen to be harvested:
- Data & Knowledge Resources (datasets, services, software, documents, articles, videos)
- Organisations, Projects, LTE, Living labs initiatives
- News items from relevant websites
These entities relate to each other as:
```mermaid
flowchart LR
    people -->|memberOf| o[organisations]
    o -->|partnerIn| p[projects]
    p -->|produce| d[data & knowledge resources]
    o -->|publish| d
    d -->|describedIn| c[catalogues]
    p -->|part-of| fs[funding scheme]
```
Harvesting Strategy
Overarching Philosophy
The core of the SoilWise harvesting strategy is to harvest preferably from secondary sources (aggregators of multiple primary repositories such as OpenAIRE, CORDIS, INSPIRE geoportal, and data.europa.eu). This approach allows SoilWise to delegate harmonization and aggregation to secondary parties, avoid duplication of effort, and minimize direct primary harvesting. Primary sources (national, institutional, or thematic repositories) will only be harvested directly if there are clear benefits, such as missing resources in the aggregators, missing relevant metadata, missing lineage/provenance, or if better filtering options are locally available.
Currently SoilWise harvests the following repositories:
Major aggregators
- OpenAIRE For resources discovered via CORDIS/ESDAC and identified by a DOI, a harvester fetches additional metadata from OpenAIRE. OpenAIRE is a catalogue initiative which harvests metadata from popular scientific repositories, such as Zenodo, Dataverse, etc. Not all DOIs registered in CORDIS are available in OpenAIRE, as OpenAIRE only lists resources with an open access license. Other DOIs can be fetched from the DOI registry directly or via Crossref.org; this work is still in preparation. Records in OpenAIRE are stored in the OpenAIRE Research Graph (OAF) format, which is transformed to a metadata set based on Dublin Core.
- CORDIS European Research projects typically advertise their research outputs via CORDIS. This makes CORDIS a likely candidate for discovering research outputs, such as reports, articles and datasets. CORDIS does not capture many metadata properties. In cases where a resource is identified by a DOI, additional metadata can be found in OpenAIRE via the DOI. The scope of projects from which to include project deliverables is still under discussion.
Which projects to include is derived from 2 sources:
- ESDAC maintains a list of historic EU funded research projects
- Mission soil platform maintains a list of current Mission soil projects
A script fetches the content from these two sources and prepares relevant content for the CORDIS and OpenAIRE harvesting. The content of these pages is unstructured HTML, which is scraped using a Python library. This is not optimal: the scraper expects a specific HTML structure, which makes it fragile.
Results of the scrape activity are stored in the table harvest.projects. For each project a Record Control Number (RCN) is retrieved from the CORDIS knowledge graph. This RCN could be used to filter OpenAIRE; however, OpenAIRE can also be filtered using the project grant number. At this moment the CORDIS knowledge graph does not yet contain the Mission Soil projects. Currently we do not harvest resources from CORDIS which do not have a DOI; this mainly concerns progress reports of the projects.
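As an illustration of that fragility, the following stdlib-only sketch extracts project names from HTML, assuming a hypothetical `<li class="project">` structure; any change to that structure silently breaks the scraper:

```python
from html.parser import HTMLParser

class ProjectListParser(HTMLParser):
    """Collect text from <li class="project"> elements (hypothetical
    markup). A change in the page layout silently breaks this parser,
    which is exactly why scraping unstructured HTML is fragile."""

    def __init__(self):
        super().__init__()
        self._in_project = False
        self.projects = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "project") in attrs:
            self._in_project = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_project = False

    def handle_data(self, data):
        if self._in_project and data.strip():
            self.projects.append(data.strip())

page = '<ul><li class="project">EJP SOIL</li><li class="project">PREPSOIL</li></ul>'
parser = ProjectListParser()
parser.feed(page)
# parser.projects == ["EJP SOIL", "PREPSOIL"]
```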
- data.europa.eu Harvesting all datasets with keyword = soil.
- INSPIRE Although the INSPIRE Geoportal does offer a CSW endpoint, for technical reasons we have not been able to harvest from it. Instead we have developed a dedicated harvester via the Elastic Search API endpoint of the Geoportal. If at some point the technical issue is resolved, use of the CSW harvest endpoint is favourable. Harvesting covers all records fulfilling the criterion inspire theme = soil/lpis.
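The dedicated Elastic Search harvester sends query payloads of roughly this shape (a sketch; the field name `inspire_theme` is hypothetical, not the Geoportal's actual index mapping):

```python
import json

# Illustrative Elastic Search query body; "inspire_theme" is a
# hypothetical field name, not the Geoportal's actual index mapping.
query = {
    "query": {"terms": {"inspire_theme": ["soil", "lpis"]}},
    "size": 100,   # page size; increase "from" to paginate
    "from": 0,
}
payload = json.dumps(query)
```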
Directly harvested Mission Soil Projects
The following repositories are harvested without filters.
- EJPSoil
- Impact4Soil
- Islandr
- Prepsoil is built on a headless CMS. The CMS provides an API to retrieve datasets, knowledge items, living labs, lighthouses and communities of practice. The API provides minimal metadata; occasionally a DOI is included, which is used to capture additional metadata from OpenAIRE.
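The DOI enrichment step can be sketched as follows, using an inlined, hypothetical CMS payload instead of a live API call:

```python
import json
import re

# Regular expression for DOIs appearing anywhere in a text field
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def collect_dois(payload: str):
    """Extract DOIs from a CMS API response so that additional
    metadata can later be requested from OpenAIRE."""
    dois = []
    for item in json.loads(payload):
        match = DOI_PATTERN.search(item.get("description", ""))
        if match:
            dois.append(match.group())
    return dois

sample = json.dumps([
    {"title": "Dataset A", "description": "Available at 10.5281/zenodo.1234567"},
    {"title": "Living lab B", "description": "no identifier here"},
])
# collect_dois(sample) -> ["10.5281/zenodo.1234567"]
```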
Edge of scope (selective coverage)
The following repositories are harvested with minimal or soil-keyword-based filtering.
- ISRIC World Soil Information
- FAO
- EEA Geoportal
Many (spatial) catalogues advertise their metadata via the Catalogue Service for the Web (CSW) standard, such as the INSPIRE Geoportal, BonaRes and ISRIC. The OWSLib library is used to query records from CSW endpoints. A filter can be configured to retrieve subsets of the catalogue.
Occasionally, records advertised via CSW also include a DOI reference (BonaRes/ISRIC). Additional metadata for these DOIs is extracted from OpenAIRE/Crossref.
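Under the hood, a filtered CSW query such as the one OWSLib issues corresponds to a GetRecords POST body along these lines (composed here with plain string formatting for illustration):

```python
# Roughly the GetRecords POST body behind a filtered CSW query;
# the filter restricts results to records mentioning "soil".
GETRECORDS_TEMPLATE = """<?xml version="1.0"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                xmlns:ogc="http://www.opengis.net/ogc"
                service="CSW" version="2.0.2"
                resultType="results" maxRecords="{max_records}">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:PropertyIsLike wildCard="%" singleChar="_" escapeChar="\\">
          <ogc:PropertyName>csw:AnyText</ogc:PropertyName>
          <ogc:Literal>%{keyword}%</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>"""

body = GETRECORDS_TEMPLATE.format(max_records=50, keyword="soil")
```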
News feeds
From the project websites listed at https://mission-soil-platform.ec.europa.eu/project-hub/funded-projects-under-mission-soil a harvester algorithm fetches the contents of the RSS feed, if the website provides one. The harvested entries are stored in a database.
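A minimal sketch of such RSS harvesting, with the feed content inlined instead of fetched over HTTP:

```python
import xml.etree.ElementTree as ET

# Example feed content, inlined instead of fetched over HTTP
RSS = """<rss version="2.0"><channel>
  <title>Example project news</title>
  <item>
    <title>New soil dataset released</title>
    <link>https://example.com/news/1</link>
  </item>
</channel></rss>"""

def parse_feed(xml_text):
    """Return (title, link) pairs for every feed item, ready for storage."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

entries = parse_feed(RSS)
# entries == [("New soil dataset released", "https://example.com/news/1")]
```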
Harvesting governance
To implement the harvesting strategy, SoilWise is evaluating and potentially combining multiple governance scenarios.
- Broad Crawling and Ranking: Ingesting a large volume of resources from many sources (e.g., downloading a monthly snapshot of OpenAIRE) to maximize coverage, though this carries the risk of including lower-quality materials if robust filters are not applied.
- Remote Search: Delegating searches entirely to known aggregators (OpenAIRE, Crossref, data.europa.eu) without maintaining a searchable index locally within SoilWise.
- Citations and Meta Studies: Utilizing existing knowledge portals and standardized citations (e.g., via Crossref) to identify domain-relevant content.
- Authoritative Data Pipelines: Quality-controlled ingestion where only resources meeting predefined criteria are ingested (e.g., funded by Horizon Europe, published as part of the INSPIRE regulation, peer-reviewed, or High Value Datasets).
- Curated Content: Utilizing human experts or community moderation to select, annotate, and categorize resources to ensure high relevance and trustworthiness, despite the manual labor required.
Based on the February 2026 screening of 30 Mission Soil projects (analyzing 259 outputs), the strategic targeting of endpoints has been refined.
- Zenodo: Zenodo communities are a key harvesting target for datasets and knowledge sources. Zenodo is the most cited data endpoint among surveyed projects, accounting for roughly 36% of direct mentions and housing a vast majority of the open/CC-BY licensed outputs.
- CORDIS: While previously considered a primary hub, CORDIS is now treated as a complementary source. It is primarily utilized for hosting deliverables and reports but has limited dataset-level metadata.
- Domain-Specific Repositories: For specific data types like biodiversity, specialized repositories (e.g., GBIF, iNaturalist, NCBI/EBI, DataDryad) are emerging as vital targets.
- Project Websites: Project websites account for roughly 20% of endpoints. While important for immediate visibility, they are not guaranteed for long-term sustainability and are considered secondary. Active outreach is required for projects that rely solely on websites or internal systems rather than established repositories like Zenodo.
- Spatial and Thematic Catalogues: SoilWise continues to target standard spatial catalogues like the INSPIRE GeoPortal, BonaRes, and ISRIC, primarily via OGC-CSW endpoints or dedicated Elastic Search API harvesters where technical limitations arise.
Adoption of standards
With respect to harvesting, it is important to note that repositories adopt standards to widely varying degrees, for metadata models, identification, and access protocols alike. This will, in some cases, make it necessary to develop customized harvesting and metadata extraction processes. It also means that informed decisions need to be made on which resources to include, based on priority, required effort and available capacity.
Architecture
Technological Stack
| Technology | Description |
|---|---|
| Git actions/pipelines | Automated processes which run at intervals or events. Git platforms typically offer this functionality including extended logging, queueing, and manual job monitoring and interaction (start/stop). |
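As an illustration, a scheduled harvester job in a GitLab-style CI configuration might look like this (job name, image and script are placeholders, not SoilWise's actual pipeline):

```yaml
# Illustrative .gitlab-ci.yml fragment; job name, image and script
# are placeholders, not SoilWise's actual pipeline.
harvest-example:
  image: python:3.12                            # dedicated container
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run on the configured schedule
  script:
    - python harvest.py --endpoint https://example.com/csw --keyword soil
```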
Main Sequence Diagram
Each harvester runs in a dedicated container. The result of the harvester is ingested into a (temporary) storage. Follow up processes (harmonisation, augmentation, validation) pick up the results from the temporary storage.
```mermaid
flowchart LR
    c[CI-CD] -->|task| q[/Queue\]
    r[Runner] --> q
    r -->|deploys| hc[Harvest container]
    hc -->|harvests| db[(temporary storage)]
    hc -->|data cleaning| db
```
Harvester tasks are triggered from Git CI-CD. Git provides options to cancel and trigger tasks and to review CI-CD logs for errors.
Integrations & Interfaces
The automatic metadata harvesting component will show its full potential when tightly connected within the SWR to (1) the SWR Catalogue, (2) Metadata authoring and (3) ETS/ATS, i.e. test suites.
Key Architectural Decisions - Harvesting Strategy
OGC-CSW
Many (spatial) catalogues advertise their metadata via the Catalogue Service for the Web (CSW) standard, such as the INSPIRE Geoportal, BonaRes and ISRIC.
CORDIS - OpenAire
CORDIS does not capture many metadata properties. We harvest the title of a project publication and, if available, the DOI. Where a resource is identified by a DOI, a harvester fetches additional metadata from OpenAIRE via the DOI.
A second mechanism is available to link from CORDIS to OpenAIRE: the RCN number. The OpenAIRE catalogue can be queried using an RCN filter to retrieve only resources relevant to a project. This work is still in preparation.
Not all DOIs registered in CORDIS are available in OpenAIRE; OpenAIRE only lists resources with an open access license. Other DOIs can be fetched from the DOI registry directly or via Crossref.org. This work is also still in preparation. Detailed technical information can be found in the technical description.
OpenAIRE and other sources
The software used to query OpenAIRE by DOI or by RCN is not limited to DOIs or RCNs that come from CORDIS; any list of DOIs or RCNs can be handled.
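A sketch of how such a DOI query could be composed (the endpoint and parameter names follow the public OpenAIRE search API; the request itself is not executed here):

```python
from urllib.parse import urlencode

# Public OpenAIRE search endpoint (per the OpenAIRE HTTP search API)
OPENAIRE_SEARCH = "https://api.openaire.eu/search/publications"

def openaire_query_url(doi: str, fmt: str = "json") -> str:
    """Compose the URL used to look up additional metadata for a DOI."""
    return f"{OPENAIRE_SEARCH}?{urlencode({'doi': doi, 'format': fmt})}"

url = openaire_query_url("10.5281/zenodo.1234567")
# the DOI is percent-encoded into the query string
```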