Metadata Catalogue

Info

Current version:

Technology: Apache Solr, React, pycsw

Project: Catalogue UI; Solr; pycsw

Access point: https://client.soilwise-he.containers.wur.nl/catalogue/search

Introduction

Overview and Scope

The Metadata Catalogue is a central component of the architecture, giving access to individual metadata records. Metadata catalogues are typically built around standards issued by the OGC (the Catalogue Service for the Web (CSW) and OGC API - Records), the Open Archives Initiative (OAI-PMH), W3C (DCAT), the FAIR science community (DataCite) and the search engine community (schema.org). For the first project iteration we selected the pycsw software, which supports most of these standards. In the second iteration pycsw continues to provide the standardised APIs; to improve search performance and user experience, it is supplemented by Apache Solr and a React frontend.

Intended Audience

The SoilWise Metadata Catalogue targets the following user groups:

  • Soil scientists and researchers working with European soil health data and seeking catalogued knowledge, publications, and datasets.
  • Living Labs' data scientists working with European soil health data and seeking catalogued knowledge, publications, and datasets.
  • Mission Soil Project Data Managers searching for datasets published by other Mission Soil Projects, or verifying whether their published data are recognised by SoilWise (EUSO).
  • Policy Makers working with European soil health data and seeking catalogued knowledge, publications, and datasets.

Key Features

User interface

The SoilWise Metadata Catalogue adopts a React frontend, focusing on:

  1. Paginated search results - Search results are displayed page by page in ranked order, as an overview table showing each resource's type, title, abstract, date and preview.
  2. Full-text search - Free-text search with autocomplete.
  3. Resource type filter - Filters results by resource type, e.g. journal articles, datasets, reports, software.
  4. Term filter - Filters results by keyword, publishing project, etc.
  5. Date filter - Filters results by creation or revision date.
  6. Spatial filter - Filters results by spatial extent, using countries or regions, a drawn bounding box, the vicinity of the user's location, or a search for geographical names.
  7. Record detail view - Clicking an item in the results table opens the record's detail at a unique URL to facilitate sharing. The detail currently comprises: the record's type tag, full title, full abstract, keyword tags, a preview of the record's geographical extent, the record's preview image (if available), information about the relevant HE funding project, a list of source repositories, an indication of link availability (see Link liveliness assessment), the last update date, and all other items of the record.
  8. Resource preview - Three types of preview are currently supported: (1) the resource's geographical extent, shown in the record's detail as well as in the search results list; (2) a graphic preview (thumbnail), if one is advertised in the metadata; (3) a map preview of OGC:WMS services advertised in the metadata, with standard simple user interaction (zooming, changing layers).
  9. Display results of metadata augmentation - Results of metadata augmentation are stored in a dedicated database table and merged into the harvested content during publication to the catalogue. At the moment it is not possible to differentiate between original and augmented content; in the next iterations we aim to make this distinction clearer.
  10. Display links of related information - Download of data and knowledge sources "as is" is currently supported through the links section from the harvested repository. Note that "interoperable data download" was only a proof of concept in the first iteration phase, i.e. it is not integrated into the SoilWise Catalogue.
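
As an illustration of how the search box, filters and pagination described above could map onto a Solr request, the following is a minimal sketch and not the catalogue's actual implementation. The field names (`type`, `date`, `geom`) and the function `build_solr_params` are hypothetical; only the Solr parameters `q`, `fq`, `start` and `rows` and the `Intersects(ENVELOPE(...))` spatial syntax are standard Solr.

```python
from urllib.parse import urlencode


def build_solr_params(text="", resource_types=(), date_from=None,
                      date_to=None, bbox=None, page=1, rows=10):
    """Translate UI search state into a list of Solr query parameters.

    Field names (type, date, geom) are assumptions for illustration only.
    """
    params = [
        ("q", text.strip() or "*:*"),           # full-text query, or match all
        ("start", str((page - 1) * rows)),       # offset of the requested page
        ("rows", str(rows)),                     # page size
    ]
    if resource_types:
        quoted = " OR ".join(f'"{t}"' for t in resource_types)
        params.append(("fq", f"type:({quoted})"))
    if date_from or date_to:
        lower = f"{date_from}T00:00:00Z" if date_from else "*"
        upper = f"{date_to}T23:59:59Z" if date_to else "*"
        params.append(("fq", f"date:[{lower} TO {upper}]"))
    if bbox:
        min_x, min_y, max_x, max_y = bbox
        # Solr's ENVELOPE order is (minX, maxX, maxY, minY)
        params.append(("fq", f"{{!field f=geom}}Intersects("
                             f"ENVELOPE({min_x}, {max_x}, {max_y}, {min_y}))"))
    return params


# Example: page 3 of datasets about soil carbon, from 2020 onwards, in a bounding box
query_string = urlencode(build_solr_params(
    "soil carbon", ["dataset"], "2020-01-01", None, (5.0, 50.0, 7.0, 52.0),
    page=3, rows=20))
```

Each filter becomes its own `fq` parameter, so Solr can cache the filters independently of the ranked full-text query in `q`.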

Index and search strategies

The SoilWise Metadata Catalogue implements back-end index and search functions based on Apache Solr, focusing on:

  1. Denormalising metadata - Solr is set up as a document indexing infrastructure, working on rather "flat" textual formats instead of normalised database models. The first step is therefore a conversion to a denormalised structure, currently implemented as a (single) database view.
  2. Composing Solr documents - From the denormalised view, the individual metadata records are processed into Solr documents.
  3. Transforming/Indexing - Solr uses transformers to process and index Solr documents. This is a combination of sequential sub-processes (e.g. tokenizers) and configurations that determine how the documents are indexed and how they can be searched, ranked, faceted, etc.
  4. Search API - The Solr search API allows query access to the Solr index, so the UI (and other clients) can search the metadata through the index.
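
The denormalisation and document composition steps above can be sketched as follows. This is only an illustration under assumptions: the input record structure, the output field names and the function `to_solr_document` are hypothetical, and in SoilWise the denormalisation is done by a database view rather than in Python.

```python
def to_solr_document(record):
    """Flatten a nested metadata record into a single-level Solr document.

    Both the input structure and the output field names are assumptions.
    """
    doc = {
        "id": record["identifier"],
        "title": record.get("title", ""),
        "abstract": record.get("abstract", ""),
        "type": record.get("type", ""),
    }
    # Nested keyword objects become a flat, de-duplicated list of strings
    doc["keywords"] = sorted({kw["label"] for kw in record.get("keywords", [])})
    # Related projects are reduced to their acronyms
    doc["projects"] = [p["acronym"] for p in record.get("projects", [])]
    # A spatial extent is written as a WKT-style envelope string
    extent = record.get("extent")
    if extent:
        doc["geom"] = "ENVELOPE({west}, {east}, {north}, {south})".format(**extent)
    return doc


example = to_solr_document({
    "identifier": "rec-1",
    "title": "Soil organic carbon map",
    "type": "dataset",
    "keywords": [{"label": "soil"}, {"label": "carbon"}, {"label": "soil"}],
    "projects": [{"acronym": "SoilWise"}],
    "extent": {"west": 5.0, "east": 7.0, "north": 52.0, "south": 50.0},
})
```

A list of such dictionaries could then be serialised to JSON and posted to Solr's update handler; tokenising and other analysis happen on the Solr side according to the schema configuration.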

Supported standards

To interact with the many relevant data communities, SoilWise aims to support a range of catalogue standards through the pycsw backend. For more information, see Integrations & Interfaces.

Architecture

Technological Stack

Backend

| Technology | Description |
|---|---|
| pycsw v3.0 | pycsw, written in Python, allows for the publishing and discovery of geospatial metadata via numerous APIs (CSW 2/CSW 3, OAI-PMH), providing a standards-based metadata and catalogue component of spatial data infrastructures. pycsw is open source, released under an MIT license, and runs on all major platforms (Windows, Linux, Mac OS X). |
| Apache Lucene v11.x | Apache Lucene is an open-source, high-performance, Java-based search engine library. |
| Apache Solr v9.7.0 | Open-source full-text, vector and geospatial search framework on top of the Apache Lucene index. |
| Java vx.x | |
| OpenStreetMap API | |

Frontend

| Technology | Description |
|---|---|
| React | JavaScript framework that implements the search interface and access to the Solr API. |
| pycsw v3.0 (deprecated) | pycsw also offers its own user interface, which was used as the default in the previous SoilWise prototype. |

Infrastructure

| Component | Technology |
|---|---|
| Container | Docker (multi-stage build, Eclipse Temurin JDK 21) |
| CI/CD | GitLab CI with semantic release (conventional commits) |
| Orchestration | Kubernetes (liveness/readiness probes) |

Main Component Diagram

Main Sequence Diagram

Integrations & Interfaces

| Service | Auth | Endpoint | Purpose |
|---|---|---|---|
| Catalogue Service for the Web (CSW) | | https://repository.soilwise-he.eu/cat/csw | Catalogue Service for the Web (CSW) is a standardised pattern to interact with (spatial) catalogues, maintained by OGC. |
| OGC API - Records | | https://repository.soilwise-he.eu/cat/openapi | OGC is currently in the process of adopting a revised edition of its catalogue standards. The new standard is called OGC API - Records. It is closely related to the Spatio Temporal Asset Catalog (STAC), a community standard in the Earth Observation community. |
| Protocol for metadata harvesting (OAI-PMH) | | https://repository.soilwise-he.eu/cat/oaipmh | The Open Archives Initiative has defined a common protocol for metadata harvesting (OAI-PMH), which is adopted by many catalogue solutions, such as Zenodo, OpenAIRE and CKAN. The OAI-PMH endpoint of SoilWise can be harvested by these repositories. |
| Spatio Temporal Asset Catalog (STAC) | | https://repository.soilwise-he.eu/cat/stac/openapi | |
| OpenSearch | | https://repository.soilwise-he.eu/cat/opensearch | |
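
To give an idea of how a client might address the CSW and OAI-PMH endpoints listed above, here is a minimal sketch that only builds request URLs and performs no network calls. The query parameters follow the CSW and OAI-PMH specifications; that the SoilWise endpoints accept exactly these values (for example CSW version `2.0.2` or the `oai_dc` metadata prefix) is an assumption, and the helper function names are hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the endpoint table
BASE = "https://repository.soilwise-he.eu/cat"


def csw_get_capabilities_url(base=BASE, version="2.0.2"):
    """Build a CSW GetCapabilities request URL (key-value-pair encoding)."""
    query = {"service": "CSW", "version": version, "request": "GetCapabilities"}
    return f"{base}/csw?{urlencode(query)}"


def oai_pmh_list_records_url(base=BASE, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL.

    oai_dc is the metadata format every OAI-PMH repository must support.
    """
    query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base}/oaipmh?{urlencode(query)}"
```

The resulting URLs can be opened in a browser or fetched with any HTTP client; both services respond with XML documents.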

Key Architectural Decisions

Risks & Limitations

| Risk / Limitation | Description | Mitigation |
|---|---|---|
| Transferability | Differences in technology stack between the implementing consortium and the final owner (JRC) might lead to transferability and integration issues. | Use of broadly adopted open-source products. Alignment with the JRC technical team. |
| Metadata quality | The performance of the search functionality is highly dependent on the completeness and quality of the harvested metadata, which is out of scope for SoilWise. | Metadata augmentation will partly mitigate this. |
| Transparency and explainability | The dependency on metadata completeness and quality, combined with the large number of interdependent options for (fuzzy) search strategies and the different combinations of UI search features, will make it hard to understand the logic behind search results. | Documentation of metadata augmentation, search strategies, etc. |
| Usability | The diversity of user groups and their requirements and expectations makes it difficult to find a balance between functionality, complexity and user-friendliness. | Iterative approach and validation/testing with user groups to align. |