Repository Storage

Info

Current version: Postgres release 12.2; Virtuoso release 07.20.3239

Technology: Postgres, Virtuoso

Access point: Triple Store (SWR SPARQL endpoint) https://repository.soilwise-he.eu/sparql

The SoilWise repository aims to merge and seamlessly provide different types of content. To host this content, drive internal processes efficiently and offer performant end user functionality, several storage options are implemented:

A relational database management system for the storage of the core metadata of both data and knowledge assets.
A Triple Store to store the metadata of data and knowledge assets as a graph, linked to soil health and related knowledge as a linked data graph.
Git for storage of user-enhanced metadata.

Functionality

PostgreSQL RDBMS: storage of raw and augmented metadata

A "conventional" RDBMS is used to store the (augmented) metadata of data and knowledge assets. The harvester process uses it to store the raw results of the metadata harvesting of the different resources that are currently connected. Various metadata augmentation jobs use it as input and write their output to this data store. The catalogue also queries the Postgres database.

There are several reasons for choosing an RDBMS as the main source for metadata storage and metadata querying:

An RDBMS provides good options to efficiently structure and index its contents, thus allowing performant access for both internal processes and end user interface querying.
An RDBMS easily allows implementing constraints and checks to keep data and relations consistent and valid.
Various extensions, e.g. search engines, are available to make querying and aggregation even more performant and better suited to end users.
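The points above about indexing and constraints can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server; SWR itself uses PostgreSQL. The table name, column names and sample values are hypothetical and do not reflect the actual SWR schema.

```python
import sqlite3

# In-memory SQLite is only a stand-in so the sketch runs anywhere;
# SWR itself uses PostgreSQL. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        identifier TEXT PRIMARY KEY,
        title      TEXT NOT NULL,
        source     TEXT NOT NULL CHECK (source <> '')
    )
    """
)
# An index on a frequently filtered column speeds up lookups.
conn.execute("CREATE INDEX idx_records_source ON records (source)")


def insert_record(identifier, title, source):
    """Insert one record; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO records VALUES (?, ?, ?)",
                (identifier, title, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = insert_record("rec-1", "Soil organic carbon map", "example-catalogue")
duplicate = insert_record("rec-1", "Another title", "example-catalogue")
missing_title = insert_record("rec-2", None, "example-catalogue")
```

In this sketch the duplicate identifier and the missing title are rejected by the database itself, so consistency does not depend on every harvesting or augmentation job validating its own writes.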

Virtuoso Triple Store: storage of SWR knowledge graph

A Triple Store is implemented as part of the SWR infrastructure to allow a more flexible linkage between the knowledge captured as metadata and various internal and external knowledge sources, particularly taxonomies, vocabularies and ontologies that are implemented as RDF graphs. Results of the harvesting and metadata augmentation that are stored in the RDBMS are converted to RDF and stored in the Triple Store.
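As an illustration of what converting a metadata record to RDF can look like, the following is a minimal sketch that serialises a dictionary to N-Triples using Dublin Core terms. It is not the SWR conversion process: the record layout, the base URI `https://example.org/record/` and the choice of predicates are assumptions made for the example.

```python
DCT = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(record, base="https://example.org/record/"):
    """Serialise a hypothetical metadata record as N-Triples lines."""
    subject = f"<{base}{record['identifier']}>"
    lines = []
    for key in ("title", "description"):
        value = record.get(key)
        if value:
            lines.append(f'{subject} <{DCT}{key}> "{escape_literal(value)}" .')
    for keyword in record.get("keywords", []):
        lines.append(f'{subject} <{DCT}subject> "{escape_literal(keyword)}" .')
    return lines


triples = to_ntriples(
    {"identifier": "rec-1", "title": 'A "soil" map', "keywords": ["soil health"]}
)
```

Each resulting line is one triple (subject, predicate, object), which is the unit a triple store loads and links to other graphs.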

A Triple Store is selected as a parallel storage because it offers several capabilities:

It allows the linking of different knowledge models, e.g. to connect the SWR metadata model with existing and new knowledge structures on soil health and related domains.
It allows reasoning over the relations in the stored graph, and thus allows connecting and smartly combining knowledge from those domains.
Through the SPARQL interface, it allows users and processes to use such reasoning and exploit previously unconnected sets of knowledge.
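A process could address the SPARQL interface roughly as sketched below. The sketch only builds the HTTP request following the standard SPARQL protocol (a GET request with a `query` parameter) and does not send it. The endpoint URL is the access point listed on this page; the query is a generic placeholder, and whether the endpoint honours the requested JSON result format is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://repository.soilwise-he.eu/sparql"

# Generic placeholder query: returns any five triples from the store.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format (assumed to be supported).
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
# To execute: urllib.request.urlopen(request), then json.load() the response.
```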

Git: User enhanced metadata

The current setup of SWR, using the pycsw infrastructure, allows users to propose metadata enhancements. Such enhancements are managed in Git at: https://github.com/soilwise-he/soilinfohub/discussions.

Ongoing Developments

In the next iteration of the SWR development, the currently deployed storage options will be extended to support new features and functions. Such extensions can improve performance and usability. Moreover, we expect that the integration of AI/ML based functions will require additional types of storage and better integration to exploit their combined power. Exploratory work that was performed in the first development cycle, but is not yet integrated into the deployment of iteration 2, includes:

Apache Lucene for Lexical Search

A search engine, ingesting data from the RDBMS, will increase the performance of end user queries. It will also offer better usability, e.g. by offering aggregation functions for faceted search and ranking of search results. Search engines also support the indexing of unstructured content, and are therefore a good starting point (or alternative?) for offering smart searches on unstructured text, using more conventional and broadly adopted software. This will support SoilWise in extending indexing from (meta)data to knowledge, e.g. unstructured content from documents, websites etc.

As part of the first development cycle of SWR, SoilWise has deployed an experimental setup that uses the Solr search platform. Apache Lucene is the search library underlying Solr; it stores the indexed SWR content.
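A faceted query against Solr could be assembled as in the following minimal sketch. It uses Solr's standard `q`, `rows`, `facet` and `facet.field` request parameters; the host, the collection name `records` and the field names are hypothetical and do not describe the experimental SWR setup.

```python
from urllib.parse import urlencode


def build_facet_query(text, facet_fields, base="http://localhost:8983/solr/records"):
    """Build a Solr select URL that returns hits plus facet counts."""
    params = [("q", text), ("rows", "10"), ("facet", "true")]
    # One facet.field parameter per field to aggregate on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base}/select?{urlencode(params)}"


url = build_facet_query("soil erosion", ["keyword", "type"])
```

The response to such a request would contain both the ranked hits and, per facet field, the number of matching documents for each value, which a user interface can show as filters.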

Apache Lucene for Semantic Search

In addition to lexical search, Apache Lucene can also be used for semantic search. The former tries to match the literal words or their variants; the latter focuses on the intent or meaning of the data. To that end, the data (usually text) is translated by a model into a multi-dimensional vector representation (called an embedding), which is then used with a proximity search algorithm. Typically, deep learning models are used to create the embeddings, and they are trained so that the embeddings of semantically similar pieces of data are close to one another. Semantic search capabilities can be used for many applications, among others LLM-driven systems such as chatbots or RAG systems, to provide them with content (pieces of text data) relevant to a question.
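The proximity search step can be sketched with a brute-force cosine similarity ranking. The three-dimensional vectors below are hand-made toy values chosen for illustration; in practice the embeddings would come from a trained model and have hundreds of dimensions, and a search engine would use an approximate nearest-neighbour index rather than comparing against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def nearest(query_vec, corpus, k=2):
    """Return the k texts whose embeddings are most similar to query_vec."""
    ranked = sorted(
        corpus,
        key=lambda text: cosine_similarity(query_vec, corpus[text]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (hypothetical values, not produced by a real model).
EMBEDDINGS = {
    "soil erosion by water": [0.9, 0.1, 0.0],
    "rainfall-driven soil loss": [0.8, 0.2, 0.1],
    "earthworm biodiversity": [0.1, 0.9, 0.2],
}

results = nearest([1.0, 0.0, 0.0], EMBEDDINGS, k=2)
```

Here the two erosion-related texts rank first although they share few literal words, which is the behaviour that distinguishes semantic from lexical matching.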

Although dedicated vector stores are available, SoilWise foresees the use of the Solr extension for storage of text embeddings. There are several advantages in using Solr to implement the SWR vector database. First, it is an open source product. Second, as it is an extension to the Solr search engine platform, it allows adding vector embeddings without introducing dependencies on additional components. Third, although part of the Solr platform, it allows maintaining a modular setup, which, for a final deployment at EC-JRC, keeps the option open to include or exclude the foreseen SWR NLQ components.
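With embeddings stored in a Solr dense vector field, a nearest-neighbour lookup is expressed through Solr's `knn` query parser. The sketch below only builds the request parameters; the field name `embedding`, the `topK` value and the example vector are hypothetical, and the vector length must match the dimension configured for the field.

```python
def build_knn_params(vector, field="embedding", top_k=10):
    """Build Solr request parameters for a k-nearest-neighbour vector query."""
    vec = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    return {
        # {!knn ...} selects Solr's KNN query parser; f names the vector field.
        "q": f"{{!knn f={field} topK={top_k}}}{vec}",
        "fl": "id,score",
    }


params = build_knn_params([0.1, 0.2, 0.3], top_k=3)
```

Because this is an ordinary Solr query, it can be sent to the same select handler as lexical queries, which is what keeps the vector search free of additional components.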

Technology & Integration

Components used:

Virtuoso (version 07.20.3239)
PostgreSQL (release 12.22)
Solr (release 9.8.0)