Data publication support
Introduction
Overview and Scope
This suite of tools is designed to support soil data and knowledge publishers, targeting selected challenges and ensuring that published data adhere to the FAIR principles (Findable, Accessible, Interoperable, Reusable).
The design of these components reflects an understanding of the practical limits and needs of FAIRification of soil data. We recognise the specific challenges in soil data findability and interoperability, including open formats, standardisation, and annotation of soil properties in metadata.
At the moment, SoilWise supports data publishers with the following tools:
- DOI Resolution Widget
- Tabular Soil Data Annotation to help users create semantic metadata for tabular datasets.
- INSPIRE Geopackage Transformation
- Soil Vocabulary Browser, part of the Knowledge Graph component, visualizes and links different soil-domain vocabularies and terms.
Intended Audience
The Data Publication Support tools and documentation are designed for the following user groups:
- Soil Data Providers & Stewards publishing soil data in line with FAIR principles, annotating datasets with metadata in the repositories, and supporting the findability of their resources through SoilWise.
DOI Resolution Widget
Info
Current version:
Technology:
Project:
Access Point:
Overview and Scope
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Tabular Soil Data Annotation
Info
Current version:
Technology: Streamlit, Python, OpenAI API
Project: Tabular Data Annotator
Access Point: https://dataannotator-swr.streamlit.app/
Overview and Scope
DataAnnotator is a Streamlit-based web application designed to help users create semantic metadata for tabular datasets. It combines optional Large Language Model (LLM) assistance with semantic embeddings to annotate data columns with machine-readable descriptions, element definitions, units, methods, and vocabulary mappings.
The tool addresses the metadata annotation workflow by:
- Enabling manual annotation: Users directly enter descriptions for data columns
- Automating description generation (optional): If users have context documentation, LLMs can help extract and structure descriptions automatically
- Linking to vocabularies: Semantic embeddings match descriptions to controlled vocabularies for standardization
The LLM layer is optional—users can skip automated generation and manually provide descriptions, which the system will then semantically match to existing vocabulary terms.
Key Features
| Feature | Implementation | Purpose |
|---|---|---|
| Auto Type Detection | Statistical sampling | Identify data patterns |
| Manual Description Entry | Streamlit text inputs | Direct user annotation |
| Optional LLM Assistance | OpenAI/Apertus integration | Auto-extract descriptions from docs |
| Semantic Vocabulary Matching | FAISS vector search | Link descriptions to standard vocabularies |
| Context Awareness | PDF/DOCX import + prompting | Extract domain-specific info when available |
| Multi-format Export | flat csv/TableSchema/CSVW | Integration with downstream tools |
Architecture
Technological Stack
| Component | Technology |
|---|---|
| Frontend | Streamlit 1.51+ |
| Backend Logic | Python 3.12+ |
| LLM Integration | OpenAI API, Apertus HTTP |
| Embeddings | Sentence Transformers 5.1+ |
| Vector Search | FAISS (CPU) 1.12+ |
| File Parsing | PyPDF2, python-docx, openpyxl |
| ML Libraries | scikit-learn, NumPy |
Dependencies & Models
- Python Packages
  - `streamlit>=1.51.0` - Web UI framework
  - `openai>=2.7.2` - LLM API client
  - `sentence-transformers>=5.1.2` - Semantic embedding
  - `faiss-cpu>=1.12.0` - Vector similarity search
  - `pandas>=2.0` - Data manipulation
  - `openpyxl>=3.1.5` - Excel handling
  - `python-docx>=1.2.0` - Word document parsing
  - `pypdf2>=3.0.1` - PDF text extraction
- Pre-trained Models/Data
  - Embedding Model: `all-MiniLM-L6-v2` (384 dimensions, 22M parameters)
  - FAISS Index: Pre-computed vocabulary embeddings (stored in the `data/` directory)
Main Components
1. Data Input Module
- Supported Formats - Tabular Data: CSV, Excel (XLSX)
- Supported Formats - Context Documents: Free-form text input, PDF documents, DOCX files
- Processing Functions:
  - `read_csv_with_sniffer()`: Auto-detects CSV delimiters
  - `import_metadata_from_file()`: Reads existing metadata if provided
  - `read_context_file()`: Extracts context from PDFs/DOCX for LLM-assisted annotation
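The delimiter auto-detection can be sketched with Python's standard `csv.Sniffer`. The function name mirrors the app's `read_csv_with_sniffer()`, but the body here is an illustrative stand-in (the app's version returns a pandas DataFrame rather than plain lists):

```python
import csv
import io

def read_csv_with_sniffer(text: str):
    """Guess the delimiter of raw CSV text and return rows as lists.

    Illustrative stand-in for the app's function of the same name.
    """
    # csv.Sniffer inspects a sample of the text and infers the dialect,
    # restricted here to the delimiters commonly seen in soil data exports
    dialect = csv.Sniffer().sniff(text[:2048], delimiters=",;\t|")
    return list(csv.reader(io.StringIO(text), dialect))

rows = read_csv_with_sniffer("site_id;ph_h2o;sampling_date\nS1;6.4;2021-05-03\n")
```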
2. Data Analysis & Type Detection
- Function: `detect_column_type_from_series()`
- Detects:
- String: Text data
- Numeric: Integers and floats
- Date: Temporal values
- Approach: Statistical sampling of column values (up to 200 non-null entries)
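A minimal stdlib sketch of this sampling approach; the app's implementation works on pandas Series, and the 90% threshold below is an assumption for illustration:

```python
from datetime import datetime

def detect_column_type_from_series(values, sample_size: int = 200) -> str:
    """Classify a column as 'numeric', 'date', or 'string' from a sample
    of up to `sample_size` non-null entries. Illustrative sketch only."""
    sample = [str(v) for v in values if v is not None][:sample_size]
    if not sample:
        return "string"

    def parse_ratio(parser):
        # Fraction of sampled values that the given parser accepts
        ok = 0
        for v in sample:
            try:
                parser(v)
                ok += 1
            except ValueError:
                pass
        return ok / len(sample)

    if parse_ratio(float) > 0.9:                  # mostly parseable as numbers
        return "numeric"
    if parse_ratio(datetime.fromisoformat) > 0.9:  # mostly ISO-format dates
        return "date"
    return "string"
```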
3. Metadata Framework
- Function: `build_metadata_df_from_df()`
- Template Fields:
  - `name`: Column identifier
  - `datatype`: Type classification (string/numeric/date)
  - `element`: Semantic element definition
  - `unit`: Measurement unit
  - `method`: Collection/calculation method
  - `description`: Human-readable description
  - `element_uri`: Link to external vocabulary
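The template construction can be sketched as follows: one record per data column, with only `name` and the detected `datatype` pre-filled. Plain dicts stand in here for the pandas DataFrame the app actually builds:

```python
TEMPLATE_FIELDS = ["name", "datatype", "element", "unit",
                   "method", "description", "element_uri"]

def build_metadata_df_from_df(columns, datatypes):
    """Build one metadata record per data column; the remaining fields
    are filled in later by the user or the LLM layer. Sketch only."""
    return [
        {field: "" for field in TEMPLATE_FIELDS} | {"name": col, "datatype": dt}
        for col, dt in zip(columns, datatypes)
    ]

meta = build_metadata_df_from_df(["ph_h2o", "sampling_date"], ["numeric", "date"])
```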
4. LLM Integration Layer (Optional)
Purpose: Automate the extraction and structuring of descriptions from existing documentation when users have context materials.
When to Use:
- User has documentation (PDFs, Word docs, etc.) describing variables
- Manual annotation is time-consuming for large datasets
- Descriptions need to be extracted from unstructured text
Supported Providers:
- OpenAI (Recommended)
  - Uses GPT models for high-quality response generation
  - `get_response_OpenAI()`: Direct API calls
  - Best for complex, domain-specific text extraction
- Apertus (Alternative)
  - Self-hosted LLM option
  - `get_response_Apertus()`: HTTP-based endpoint
  - Swiss-based open-source model
Functionality:
- Function: `generate_descriptions_with_LLM()`
- Inputs:
- Variable names to describe
- Context information from documents or text input
- Output Format: Structured JSON with descriptions for each variable
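The two halves of this step, prompt assembly and JSON extraction from the model's reply, can be sketched as below. The prompt wording is illustrative, not the app's actual prompt, and the real `generate_descriptions_with_LLM()` sends the prompt through one of the providers above:

```python
import json

def build_description_prompt(variables, context):
    """Assemble the instruction sent to the LLM (illustrative wording)."""
    return (
        "Using the context below, write a one-sentence description for each "
        "variable. Respond with a JSON object mapping variable name to "
        "description.\n\n"
        f"Variables: {', '.join(variables)}\n\nContext:\n{context}"
    )

def parse_llm_response(raw: str) -> dict:
    """Extract the JSON object from the reply, tolerating surrounding
    prose (a common failure mode of LLM output)."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    return json.loads(raw[start:end])

reply = 'Here you go:\n{"ph_h2o": "Soil pH measured in water suspension."}'
descriptions = parse_llm_response(reply)
```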
5. Semantic Embedding & Vocabulary Matching
- Model: Sentence Transformers (default: `all-MiniLM-L6-v2`)
- Functions:
  - `load_sentence_model()`: Load embedding model
  - `load_vocab_indexes()`: Load pre-computed FAISS indexes
  - `embed_vocab()`: Generate embeddings with optional definition weighting
- Purpose: Match generated or manually-entered descriptions to controlled vocabularies
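The matching mechanics can be illustrated with a toy bag-of-words embedding and cosine similarity. The vocabulary entries and URIs below are hypothetical; the app instead embeds with `all-MiniLM-L6-v2` and performs the nearest-neighbour step over dense vectors in FAISS:

```python
import math

# Hypothetical vocabulary entries, for illustration only
VOCAB = [
    {"uri": "https://example.org/vocab/soil-ph", "label": "soil pH"},
    {"uri": "https://example.org/vocab/organic-carbon", "label": "organic carbon"},
]

def toy_embed(text: str) -> dict:
    """Toy word-count vector; the app uses dense sentence embeddings."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_description(description: str):
    """Return the vocabulary entry most similar to the description;
    FAISS performs this same nearest-neighbour search at scale."""
    query = toy_embed(description)
    return max(VOCAB, key=lambda term: cosine(query, toy_embed(term["label"])))

best = match_description("pH of the soil sample")
```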
Pre-computed Vocabulary Sources
The FAISS vectorstore was pre-computed by embedding terms from four major public vocabularies:
| Vocabulary | Domain | Source |
|---|---|---|
| Agrovoc | Agricultural and food sciences | FAO - Food and Agriculture Organization |
| GEMET | Environmental terminology | European Environment Agency (EEA) |
| GLOSIS | Soil science and properties | FAO Global Soil Information System |
| ISO 11074:2005 | Soil quality terminology | International Organization for Standardization |
This multi-vocabulary approach enables annotation of diverse datasets including agricultural, environmental, and soil-related data.
FAISS Index Structure:
- Index File: `vocabCombined-{modelname}.index`
- Metadata File: `vocabCombined-{modelname}-meta.npz`

Metadata Dictionary Format:

```
{
  index_id: {
    "uri": "vocabulary_uri",
    "label": "preferred_label",
    "definition": "term_definition",
    "QC_label": "prefLabel|altLabel"
  },
  ...
}
```
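Because FAISS returns integer index ids, a lookup through this dictionary turns search hits back into vocabulary terms. A sketch with hypothetical entries (the `QC_label` marker lets the UI rank preferred labels ahead of alternative labels):

```python
# Hypothetical metadata, as would be loaded from the -meta.npz file
METADATA = {
    0: {"uri": "https://example.org/vocab/soil-ph", "label": "soil pH",
        "definition": "Acidity or alkalinity of soil.", "QC_label": "prefLabel"},
    1: {"uri": "https://example.org/vocab/organic-carbon", "label": "organic carbon",
        "definition": "Carbon bound in soil organic matter.", "QC_label": "altLabel"},
}

def resolve_hits(index_ids):
    """Translate FAISS result ids into vocabulary terms, listing
    prefLabel matches before altLabel matches."""
    hits = [METADATA[i] for i in index_ids if i in METADATA]
    return sorted(hits, key=lambda h: h["QC_label"] != "prefLabel")

top = resolve_hits([1, 0])
```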
6. Export & Download Module
- Function: `download_bytes()`
- Supported Formats:
- Excel (XLSX) - for human review
- JSON - for machine processing
- CSV - for spreadsheet tools
- Implementation: Streamlit session-based download management
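A stdlib sketch of the CSV and JSON export paths (the Excel path, which the app handles via openpyxl, is omitted; the real `download_bytes()` may differ in signature):

```python
import csv
import io
import json

def download_bytes(records, fmt: str) -> bytes:
    """Serialize metadata records for download as CSV or JSON bytes.
    Illustrative sketch of the export step."""
    if fmt == "json":
        return json.dumps(records, indent=2).encode("utf-8")
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue().encode("utf-8")
    raise ValueError(f"unsupported format: {fmt}")

payload = download_bytes([{"name": "ph_h2o", "datatype": "numeric"}], "csv")
```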
Main Sequence Diagram
```mermaid
graph TB
    A["User Interface: Streamlit App"] -->|Upload Data| B["Data Input Handler: CSV/Excel/PDF"]
    B -->|Parse Data| C["Data Analysis: Column Type Detection"]
    C -->|Analyze Structure| D["Metadata Framework: Build Template"]
    D --> E{"Description Source?"}
    E -->|Manual Entry| F["User Provides Descriptions"]
    E -->|Optional: Auto-generate| I[/"LLM Provider (Optional Tool)"\]
    I -->|OpenAI API| J["OpenAI Client: GPT Models"]
    I -->|Apertus API| K["Apertus Client: Local LLM"]
    J -->|JSON Descriptions| L["Response Parser: JSON Extraction"]
    K -->|JSON Descriptions| L
    L --> M["LLM-generated Descriptions"]
    F --> N["Semantic Embedding: Sentence Transformers"]
    M --> N
    N -->|Vector Query| O["Proposal for generalized ObservedProperty"]
    V1["Agrovoc: Agricultural"] -.->|Pre-embedded| G["FAISS vector store"]
    V2["GEMET: Environmental"] -.->|Pre-embedded| G
    V3["GLOSIS: Soil Science"] -.->|Pre-embedded| G
    V4["ISO 11074:2005: Soil Quality"] -.->|Pre-embedded| G
    G --> O
    O --> Q["Export Handler: flat csv/TableSchema/CSVW"]
    Q -->|Downloaded metadata| A
```
Key Architectural Decisions
Optimization Strategies:
- Model Caching: Streamlit `@st.cache_resource` for persistent model loading
- API Caching: JSON-based result memoization to avoid redundant API calls
- FAISS Optimization: Pre-computed indexes for O(log n) vector search
- Batch Processing: Process multiple columns in a single LLM call
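The API-caching strategy can be sketched as a JSON-file memoization decorator keyed by a hash of the call arguments, so a repeated LLM call with identical inputs is served from disk. Function and file names here are illustrative, not the app's actual implementation:

```python
import functools
import hashlib
import json
from pathlib import Path

def json_memoize(cache_file: str):
    """Cache a function's results in a JSON file keyed by a stable hash
    of its arguments. Sketch of the JSON-based memoization strategy."""
    path = Path(cache_file)

    def decorator(func):
        # Load any previously cached results from disk
        cache = json.loads(path.read_text()) if path.exists() else {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                json.dumps([args, kwargs], sort_keys=True, default=str).encode()
            ).hexdigest()
            if key not in cache:
                cache[key] = func(*args, **kwargs)  # only call on a cache miss
                path.write_text(json.dumps(cache))
            return cache[key]
        return wrapper
    return decorator
```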
SoilWise GeoPackage
Overview and Scope
The SoilWise GeoPackage is a relational (SQLite-based) container that enables exchange, storage, and GIS-native use of soil data, with the explicit goal of making these data FAIR and reusable across European policy, research, and land-management workflows.
GeoPackage is an open, portable, self-contained OGC standard for geospatial data. As an SQLite container, it allows direct use of vector features, rasters/tiles, and attribute data in a single file, without intermediate format translations. This makes it well suited to GIS environments and to constrained-connectivity scenarios.
Continuity with INSPIRE

This GeoPackage implements a relational schema that is a faithful transposition of the INSPIRE Soil conceptual model (UML) and its classes/associations, as described in the INSPIRE Soil Technical Guidelines and Feature Catalogue. It also integrates the OGC SensorThings API 2.0 (STA2) model for the management and exposure of observations (time-series and observation metadata).
The SoilWise database builds upon, and updates, the work carried out around INSPIRE, including the EJP SOIL GeoPackage template for the Soil (SO) theme. That template focused on semantic harmonisation, code-list management, and repeatable transformations, and is a relevant baseline for SoilWise's relational modelling approach. This direction aligns with community guidance on publishing INSPIRE data as a relational database (GeoPackage as a specialisation of SQLite), including recipes and patterns for harmonisation and publication.
Key Features
-
Unified Data Storage: Acts as a relational, SQLite-based container that allows for the direct, single-file storage of vector features, rasters, and attribute data without the need for intermediate format translations.
-
Standardized Conceptual Mapping: Translates the INSPIRE Soil conceptual model (UML) into a functional relational schema, mapping features like SoilSite, SoilPlot, SoilProfile, and ProfileElement into dedicated database tables.
-
Semantic Harmonization & Interoperability: Uses reference tables to manage controlled vocabularies and code-lists (keeping URI, notation, label, authority, and version), ensuring that data is functionally interoperable and semantically harmonized.
-
Relational Integrity Management: Enforces data relationships using foreign keys and link tables, and manages cascade behaviors to automatically handle the consequences of data updates or deletions across linked parent/child tables.
-
Time-Series and Sensor Integration: Integrates the OGC SensorThings API 2.0 (STA2) to manage and expose sensor metadata and observational time-series data via HTTP and MQTT, serving as a "data-in-motion" layer alongside static geographic data.
-
Native GIS Support: Natively integrates with QGIS for immediate editing, styling, and map production.
-
Guided Data Entry via Custom Forms: Provides pre-configured, custom QGIS attribute forms featuring drop-down menus, default values, tooltips, and automated validation checks to ensure data entry is fast, standardized, and error-free.
-
Structured Data Loading: Includes a defined workflow for data loading that enforces necessary dependencies, loading orders, and pre/post-loading verification constraints.
More detailed technical documentation, including tutorials for using the SoilWise GeoPackage in QGIS, can be found at: https://soilwise-he.github.io/Geopackage-so/.
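The relational-integrity pattern described above can be demonstrated with Python's built-in `sqlite3`, since a GeoPackage is an SQLite database: a foreign key with `ON DELETE CASCADE` removes child `profileelement` rows when their parent `soilprofile` is deleted. The table and column names below are simplified illustrations, not the full INSPIRE schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Simplified parent/child tables illustrating the SoilProfile -> ProfileElement link
conn.execute("""
    CREATE TABLE soilprofile (
        id INTEGER PRIMARY KEY,
        localidentifier TEXT
    )""")
conn.execute("""
    CREATE TABLE profileelement (
        id INTEGER PRIMARY KEY,
        profile_id INTEGER NOT NULL
            REFERENCES soilprofile(id) ON DELETE CASCADE,
        depth_top_cm REAL,
        depth_bottom_cm REAL
    )""")

conn.execute("INSERT INTO soilprofile VALUES (1, 'P-001')")
conn.execute("INSERT INTO profileelement VALUES (1, 1, 0, 30)")
conn.execute("INSERT INTO profileelement VALUES (2, 1, 30, 60)")

# Deleting the profile cascades to its elements
conn.execute("DELETE FROM soilprofile WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM profileelement").fetchone()[0]
```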
Other recommended tools acknowledged by SoilWise community
The following components are not products of the SoilWise project and are not an integral part of the SoilWise Catalogue, but they are recommended by the SoilWise community.
Hale Studio
A proven ETL tool optimised for working with complex structured data, such as XML, relational databases, or a wide range of tabular formats. It supports all required procedures for semantic and structural transformation, and it can also handle reprojection. While Hale Studio exists as a multi-platform interactive application, its capabilities can also be provided through a web service with an OpenAPI interface.
User Manual
A comprehensive tutorial video on soil data harmonisation with hale studio can be found here.
Setting up a transformation process in hale»connect
Complete the following steps to set up soil data transformation, validation and publication processes:
- Log into hale»connect.
- Create a new transformation project (or upload it).
- Specify source and target schemas.
- Create a theme (this is a process that describes what should happen with the data).
- Add a new transformation configuration. Note: Metadata generation can be configured in this step.
- A validation process can be set up to check against conformance classes.
Executing a transformation process
- Create a new dataset, select the theme matching the current source data, and provide the source data file.
- Execute the transformation process. ETF validation processes are also performed. If successful, a target dataset and the validation reports will be created.
- View and download services will be created if required.
To create metadata (data set and service metadata), activate the corresponding button(s) when setting up the theme for the transformation process.