Soil Companion (chatbot)
Info
Current version: 1.2.x
Technology: Retrieval Augmented Generative Artificial Intelligence
Release: https://soil-companion.containers.wur.nl/app/index.html
Project: Soil Companion
Introduction
Component Overview and Scope
The Soil Companion is an AI chatbot developed in the SoilWise project. It provides an intelligent conversational interface through which users can explore European soil metadata, query global and country-specific soil data services, and receive answers grounded in the SoilWise knowledge repository.
The chatbot uses an agentic tool-calling approach: a large language model (LLM) autonomously decides which external data sources to consult for each question, executes the relevant tool calls, and synthesizes the results into a coherent response. Answers are enriched with auto-generated links to SoilWise vocabulary terms and Wikipedia articles. A sidebar Insight panel displays related SKOS vocabulary concepts and clickable chips that allow users to explore connected topics.
Users
The Soil Companion targets the following user groups:
- Soil scientists and researchers working with European soil health data and seeking catalogued knowledge, publications, and datasets from the SoilWise repository.
- Agricultural experts and extension officers looking for soil property data, field-level KPIs, and crop information to inform land management decisions.
- Students and educators exploring soil science concepts through a conversational interface that provides definitions, vocabulary hierarchies, and links to authoritative sources.
- Farmers and land managers (in selected regions) who want accessible field-level agricultural data such as crop history, soil physical properties, and greenness indices.
Authentication is currently limited to a configurable demo mode (single account) suitable for development and demonstration purposes.
Requirements
Functional Requirements
The Soil Companion currently provides the following capabilities:
- SoilWise Catalog Search — Searches datasets and knowledge/publications in the SoilWise catalog via Solr, using boosted field relevance ranking. Optionally includes full-text snippets from harvested PDF content. Creates verified back-links to catalog items by identifier.
- ISRIC SoilGrids — Queries the ISRIC SoilGrids v2.0 REST API for indicative soil property estimates (bulk density, organic carbon, clay, sand, silt, pH) at a given location, at multiple depth intervals.
- AgroDataCube (Netherlands) — Integrates with the WUR AgroDataCube v2 API for Dutch agricultural field parcels, crop history, and soil/crop KPIs. Maintains session memory of the last looked-up field for follow-up questions.
- Vocabulary Lookup — Queries the SoilWise SPARQL endpoint to retrieve SKOS concept hierarchies (broader, narrower, related, exact match terms with definitions) for soil science terminology.
- Wikipedia — Searches and retrieves Wikipedia articles in multiple languages (English, Dutch, French, Spanish, Italian, Czech) for general concepts and background information.
- Auto-Linking — Automatically links soil vocabulary terms and Wikipedia articles appearing in responses, making terms navigable to the SoilWise vocabulary browser and Wikipedia.
- Insight Panel — A sidebar panel that extracts SoilWise and Wikipedia links from the latest response and displays related vocabulary concepts with definitions, organized by broader/narrower/related relationships. Clickable vocabulary chips allow quick exploration of connected topics.
- Location Context — An interactive Leaflet map in the sidebar allows users to set a geographic context (with country search autocomplete), which is then used by location-aware tools.
- File Upload — Users can upload text or Markdown files (max 200 KB) as temporary context for the conversation.
- RAG (Retrieval-Augmented Generation) — Local core SoilWise knowledge documents (PDF, text, Markdown) are embedded on startup and used as supplementary context for answers, retrieved by semantic similarity.
- Real-Time Streaming — Responses are streamed token-by-token over WebSocket for low-latency feedback.
- Feedback Collection — Thumbs up/down feedback per response, logged in daily JSONL files for quality analysis. Evaluation tools can merge logs with feedback and compute quality metrics.
- Initial Question via URL — An initial question can be passed as a URL parameter, enabling deep-linking into the chatbot with a pre-filled query.
- Health Endpoints — Kubernetes-compatible liveness (
/healthz) and readiness (/readyz) probes, including version and uptime reporting. - Auto-Update Detection — The frontend polls the health endpoint to detect server version changes and notifies the user when an update is available.
Non-functional Requirements
| Aspect | Constraint |
|---|---|
| Chat memory | 50 messages per session (configurable) |
| Max sequential tool calls | 10 per query (configurable) |
| Max LLM retries | 3 (configurable) |
| Session idle timeout | 10 minutes (configurable) |
| Max prompt size | 120,000 characters (safety limit) |
| Max upload size | 200,000 characters |
| RAG chunk size | 500 characters with 100-character overlap |
| RAG retrieval | Top 5 results, minimum similarity score 0.6 |
| Catalog search results | Max 10 per query |
| External API timeout | 15,000 ms (Solr, SoilGrids, AgroDataCube, Wikipedia) |
| SPARQL timeout | 5,000 ms connection + 10,000 ms read |
| WebSocket heartbeat | Every 15 seconds |
| Log retention | 30 days (daily rotation, gzip compressed) |
| HTML sanitization | DOMPurify on all rendered markdown output |
| Link safety | All external links open with noopener noreferrer |
| Accessibility | ARIA labels, roles, live regions; keyboard navigation |
Architecture
Technological Stack
Backend (JVM)
| Component | Technology |
|---|---|
| Language | Scala 3.8.1 on JDK 17+ (tested 17–25) |
| Build | SBT 1.11.x / 1.12.x (cross-build JS/JVM) |
| LLM Framework | LangChain4j 1.10.x (OpenAI integration, agentic tool calling, embeddings, RAG) |
| LLM Provider | OpenAI (gpt-4o-mini for chat, gpt-4o for reasoning, text-embedding-3-small for embeddings) |
| Local Embeddings | AllMiniLmL6V2 (offline, ~33 MB model for RAG document retrieval) |
| Vector Store | In-memory embedding store (with experimental Chroma support) |
| Logging | SLF4J 3.0.x + Logback 1.5.x (daily rotation, 30-day retention) |
| Document Parsing | Apache Tika (PDF, text, Markdown) |
Frontend (Browser)
| Component | Technology |
|---|---|
| Language | Scala.js (compiled to JavaScript) |
| Sanitization | DOMPurify 3.x |
| Maps | Leaflet 1.9.x |
| Communication | WebSocket (real-time streaming) |
Infrastructure
| Component | Technology |
|---|---|
| Container | Docker (multi-stage build, Eclipse Temurin JDK 21) |
| CI/CD | GitLab CI with semantic release (conventional commits) |
| Orchestration | Kubernetes (liveness/readiness probes) |
Overview of Key Features
The chatbot combines agentic LLM tool calling with retrieval-augmented generation and post-response enrichment to deliver grounded, linked answers. The key features are:
- Agentic tool calling — The LLM autonomously decides which of the available tool integrations to invoke (catalog search, SoilGrids, AgroDataCube, Wikipedia, vocabulary SPARQL), executing up to 10 sequential tool-call iterations per query.
- RAG from local core knowledge — Documents (PDF, text, Markdown) are split into chunks, embedded with a local model (AllMiniLmL6V2), and stored in memory. Relevant chunks are retrieved by cosine similarity and injected into the prompt.
- Response enrichment — After the LLM generates a response, auto-linkers scan for vocabulary terms and Wikipedia article titles, inserting navigable links into the rendered output.
- Insight panel — The frontend extracts SoilWise and Wikipedia links from responses and displays broader/narrower/related vocabulary concepts with definitions in a sidebar panel.
- Token streaming — Responses are streamed token-by-token over WebSocket, giving users immediate visual feedback.
- Feedback loop — Thumbs up/down ratings are logged to daily JSONL files; evaluation tools compute quality metrics (like rate, NSAT, Wilson lower bound).
Component Diagrams
High-level component overview:
┌─────────────────────────────────────────────────────────────────────┐
│ Browser (Scala.js) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ SoilCompanionApp │ │
│ │ - WebSocket client (real-time chat streaming) │ │
│ │ - Authentication & session management │ │
│ │ - Location picker (Leaflet map) │ │
│ │ - Insight panel (vocabulary concepts, Wikipedia links) │ │
│ │ - File upload, feedback, theme toggle │ │
│ └───────────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
│ WebSocket + HTTP
┌──────────────────────────────▼──────────────────────────────────────┐
│ SoilCompanionServer (Cask, JVM) │
│ │
│ Routes: /healthz, /readyz, /login, /logout, /session, │
│ /subscribe/:id (WS), /query, /clear, /upload, │
│ /feedback, /location, /vocab, /app/* │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────────────────┐ │
│ │ Config │ │ Feedback │ │ Session Management │ │
│ │ (PureConfig) │ │ Logger │ │ (ConcurrentHashMaps) │ │
│ └──────────────┘ └──────────────┘ └───────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Assistant (per session) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ LangChain4j AiServices │ │ │
│ │ │ - StreamingChatModel (OpenAI) │ │ │
│ │ │ - ChatMemory (50 messages) │ │ │
│ │ │ - RAG ContentRetriever (embeddings + local docs) │ │ │
│ │ │ - Tool methods (5 integrations) │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────┐ ┌────────────────┐ │
│ │ VocabLinker │ │WikipediaLinker │ (post-response linking) │
│ └───────────────┘ └────────────────┘ │
└──────────────────────────────┬──────────────────────────────────────┘
│ HTTP calls
┌──────────────────────────────▼──────────────────────────────────────┐
│ External Services │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────────┐ │
│ │ OpenAI API │ │ Solr │ │ ISRIC SoilGrids v2.0 │ │
│ │ (LLM, embed) │ │ (catalog) │ │ (global soil properties) │ │
│ └──────────────┘ └──────────────┘ └────────────────────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────────┐ │
│ │ SoilWise │ │ Wikipedia │ │ WUR AgroDataCube v2 │ │
│ │ SPARQL │ │ (6 langs) │ │ (NL field data) │ │
│ └──────────────┘ └──────────────┘ └────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Sequence Diagram
User query to response flow:
Client (Browser) Server (JVM) External APIs
───────────────── ──────────── ──────────────
│ │ │
│──── GET /session ───────────────▶│ │
│◀─── { sessionId: UUID } ─────────│ │
│ │ │
│──── WS /subscribe/:sessionId ───▶│ (store connection) │
│◀─── connection established ──────│ │
│ │ │
│──── POST /login ────────────────▶│ (validate credentials) │
│◀─── { ok: true } ────────────────│ │
│ │ │
│──── POST /query ────────────────▶│ │
│ { sessionId, content } │ │
│ │ │
│◀─── QueryEvent("received") ──────│ (generate questionId) │
│◀─── QueryEvent("thinking") ──────│ │
│◀─── QueryEvent("retrieving_ │ │
│ context") ──────────────────│ │
│ │──── RAG: embed query ───────────▶│
│ │◀─── top-5 document chunks ───────│
│ │ │
│ │──── LLM: evaluate tools ────────▶│ OpenAI
│ │◀─── tool call decision ──────────│
│ │ │
│ │──── Tool: e.g. Solr search ─────▶│ Solr
│ │◀─── search results ──────────────│
│ │ │
│ │──── LLM: synthesize answer ─────▶│ OpenAI
│ │◀─── token stream begins ─────────│
│ │ │
│◀─ QueryPartialResponse(token) ───│ (repeated per token) │
│◀─ QueryPartialResponse(token) ───│ │
│◀─ ... │ │
│ │ │
│ │ (stream complete) │
│ │ Apply VocabLinker │
│ │ Apply WikipediaLinker │
│ │ │
│◀─ QueryEvent("links_added", │ │
│ linkedResponse) ──────────────│ │
│◀─ QueryEvent("done") ────────────│ │
│ │ │
│ (render markdown, show │ │
│ feedback buttons) │ │
│ │ │
│──── POST /feedback ─────────────▶│ (log to JSONL) │
│ { questionId, vote } │ │
WebSocket event types:
| Event | Direction | Purpose |
|---|---|---|
received |
Server → Client | Query acknowledged, questionId assigned |
thinking |
Server → Client | LLM is analysing the question |
retrieving_context |
Server → Client | RAG retrieval in progress |
generating |
Server → Client | LLM is generating the answer |
links_added |
Server → Client | Auto-linked response replacing the streamed version |
done |
Server → Client | Response complete |
error |
Server → Client | An error occurred |
heartbeat |
Server → Client | Keep-alive (every 15 seconds) |
session_expired |
Server → Client | Session timed out due to inactivity |
prompt_truncated |
Server → Client | Input was truncated to stay within limits |
QueryPartialResponse |
Server → Client | Single streamed token |
Database Design
The Soil Companion does not use a traditional database. All runtime state is held in-memory; only feedback and application logs are persisted to disk.
In-memory state (per server process):
| Store | Type | Purpose |
|---|---|---|
wsConnections |
ConcurrentHashMap[String, WsChannelActor] |
Active WebSocket connections |
assistants |
ConcurrentHashMap[String, Assistant] |
LLM chat state per session |
uploadedTexts |
ConcurrentHashMap[String, String] |
Temporary uploaded file content |
uploadedFilenames |
ConcurrentHashMap[String, String] |
Original filenames of uploads |
locationContexts |
ConcurrentHashMap[String, String] |
Location JSON per session |
authenticatedSessions |
ConcurrentHashMap[String, Boolean] |
Authentication status |
lastActivity |
ConcurrentHashMap[String, Long] |
Session inactivity tracking |
In-memory vector store:
| Store | Type | Purpose |
|---|---|---|
embeddingStore |
InMemoryEmbeddingStore[TextSegment] |
Embedded knowledge document chunks |
Documents from the data/knowledge/ directory are loaded, split into 500-character chunks (100-character overlap), embedded using AllMiniLmL6V2, and stored at startup.
Persistent file storage:
| Data | Location | Format |
|---|---|---|
| Feedback | data/feedback-logs/feedback-YYYY-MM-DD.jsonl |
Daily JSONL, auto-rotated, gzip compressed |
| Application logs | data/logs/soil-companion.log |
Logback rolling file (30-day retention, gzip) |
| Knowledge documents | data/knowledge/ |
PDF, text, Markdown (read-only at startup) |
| Vocabulary | data/vocab/soilvoc_concepts_*.csv |
CSV (loaded at startup for auto-linking) |
Shared domain models (Scala, cross-compiled for JVM and JS):
QueryRequest(sessionId: String, content: String)— client-to-server query messageQueryPartialResponse(content: String)— streamed token from serverQueryEvent(event: String, detail: Option[String], questionId: Option[String])— lifecycle event
Integrations & Interfaces
| Service | Auth | Endpoint | Purpose |
|---|---|---|---|
| OpenAI API | Bearer token (OPENAI_API_KEY) |
via LangChain4j | Chat completion (gpt-4o-mini), reasoning (gpt-4o), embeddings (text-embedding-3-small) |
| Solr (SoilWise Catalog) | Basic Auth (SOLR_USERNAME / SOLR_PASSWORD) |
SOLR_BASE_URL |
Search datasets and publications; full-text content retrieval |
| ISRIC SoilGrids v2.0 | None (public) | SOILGRIDS_BASE_URL |
Soil property estimates at lat/lon (~250 m resolution) |
| SoilWise SPARQL | None | VOCAB_SPARQL_ENDPOINT |
SKOS concept hierarchies (broader, narrower, related terms) |
| Wikipedia | None (public) | WIKIPEDIA_BASE_URL (per language) |
Article search and content retrieval (6 languages) |
| WUR AgroDataCube v2 | Token header (AGRODATACUBE_ACCESS_TOKEN) |
AGRODATACUBE_BASE_URL |
NL field parcels, crop history, soil/crop KPIs |
All external service credentials and endpoints are configured through HOCON (application.conf) with environment variable overrides.
HTTP endpoints exposed by the server:
| Method | Path | Purpose |
|---|---|---|
GET |
/healthz |
Liveness probe (version, uptime) |
GET |
/readyz |
Readiness probe (config + API key checks) |
POST |
/login |
Demo authentication |
POST |
/logout |
Session teardown |
GET |
/session |
New session ID |
WS |
/subscribe/:sessionId |
WebSocket for streaming chat |
POST |
/query |
Submit a question |
POST |
/clear |
Clear chat history |
POST |
/upload |
Upload text/Markdown context |
POST |
/feedback |
Submit thumbs up/down |
POST |
/location |
Set geographic context |
POST |
/vocab |
Batch vocabulary concept lookup |
GET |
/app/* |
Static frontend assets |
Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Scala 3 + Scala.js cross-build | Enables shared domain models (QueryRequest, QueryEvent, QueryPartialResponse) between backend and frontend, eliminating serialization mismatches and reducing code duplication. |
| Cask HTTP micro-framework | Lightweight, Scala-native server with built-in WebSocket support. Suitable for a single-service chatbot without the overhead of a full application framework. |
| LangChain4j for LLM integration | Provides a mature JVM-native abstraction for tool calling, RAG, streaming, and chat memory — avoiding the need to call OpenAI APIs directly. The @Tool annotation enables declarative tool registration. |
| In-memory embedding store | Simplifies deployment (no external vector database required). Sufficient for the current knowledge base (~5 documents). Trade-off: state is lost on restart and capacity is limited by server memory. |
| AllMiniLmL6V2 for local embeddings | Runs offline without API calls, keeping RAG retrieval fast and cost-free. The ~33 MB model is small enough to bundle in the Docker image. |
| Per-session Assistant instances | Each session gets its own Assistant with isolated chat memory, tool state (e.g. AgroDataCube field context), and uploaded file context. Prevents cross-session contamination. |
| Post-response auto-linking | VocabLinker and WikipediaLinker run after the LLM completes, replacing the streamed response with an enriched version. This avoids asking the LLM to generate links (which is unreliable) while still providing navigable references. |
| WebSocket token streaming | Provides immediate visual feedback during LLM generation, reducing perceived latency. A 15-second heartbeat prevents proxy/ingress idle timeouts. |
| Environment variable configuration | All credentials and endpoints are overridable via environment variables, following 12-factor app principles for containerized deployment. |
| Demo authentication | A simple single-user mode enables local development and demonstrations without requiring an external identity provider. Production deployment would integrate with an external auth layer. |
Risks & Limitations
| Risk / Limitation | Description | Mitigation |
|---|---|---|
| SoilGrids accuracy | Returned values are modelled estimates at ~250 m grid resolution, not field measurements. | Tool responses include explicit disclaimers advising users to verify with local data. |
| Single-user demo auth | The demo authentication mode uses a single configurable account with no roles or authorization. Not suitable for production multi-user scenarios. | Designed for development/testing; production deployment requires integration with an external authentication and authorization layer. |
| In-memory state loss | All session state, chat memory, uploaded context, and the embedding store are lost on server restart. | Acceptable for a demo/chatbot use case. Persistent vector store (Chroma) support exists experimentally for future use. |
| No horizontal scaling | All sessions are held in a single JVM process with no shared session store. | Sufficient for current usage levels. Horizontal scaling would require an external session store and load balancer. |
| External API availability | The chatbot depends on multiple external APIs (Solr, SoilGrids, AgroDataCube, OpenAgroKPI, Wikipedia). Downtime or rate limits on any service degrades functionality. | Tool methods handle errors gracefully, returning informative messages. The LLM can fall back to other tools if one fails. LangChain4j provides configurable retry logic (max 3 retries). |
| Geographic coverage | AgroDataCube and OpenAgroKPI are Netherlands-only. SoilGrids is global but at coarse resolution. | Tool descriptions inform the LLM of geographic scope so it can communicate limitations to users. |
| Knowledge base is static | Local documents are loaded and embedded only at startup. No hot-reload mechanism exists. | A server restart picks up new documents. This is acceptable for infrequently changing knowledge resources. |
| LLM hallucination | Despite RAG grounding and tool results, the LLM may still generate inaccurate statements. | System prompts instruct the model to include disclaimers and prefer tool-grounded answers. User feedback collection enables ongoing quality monitoring. |
| Prompt injection via uploads | Uploaded files and location contexts could contain adversarial content. | Input sanitization is applied; prompt size is capped at 120,000 characters; file uploads are limited to 200 KB. |
| CORS policy | The file upload endpoint uses a permissive Access-Control-Allow-Origin: * header. |
Acceptable for demo deployment; should be tightened for production. |