System & Usage Monitoring
Info
Technology: Grafana / Prometheus / Loki
Project repository: Usage statistics
Access point: https://grafana.soilwise.wetransform.eu/
Introduction
Overview and Scope
All components and services of the SWC are monitored at different levels to ensure robust operations and security of the system. A central monitoring service based on Prometheus and Grafana is deployed to the SoilWise Kubernetes cluster alongside the SWC services.
Monitoring covers:
- Node and container resource utilisation (RAM, CPU, volumes, network transfer, uptime)
- Service health checks (liveness and readiness probes)
- Operational alerting via Slack and PagerDuty
- Availability statistics and trend analysis
- Usage statistics for SWC services and the public website
- Log aggregation and filtering by HTTP status code to identify errors (4xx, 5xx)
Architecture
System Health Monitoring
Infrastructure and application monitoring is implemented using a Prometheus + Grafana stack, deployed via the kube-prometheus-stack Helm chart. Log aggregation is deployed via the loki and promtail Helm charts.
Prometheus scrapes metrics from all services, the Kubernetes cluster, and host nodes via Node Exporter and kube-state-metrics. Retention is configured to support trend analysis and availability reporting.
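As an illustration of how scraped metrics can support availability reporting, the following is a minimal sketch rather than the deployed configuration. It relies on the standard Prometheus `up` metric (1 for a successful scrape, 0 for a failed one) and the `/api/v1/query` HTTP endpoint. The base URL, the job name `ows`, and the 30-day window are assumptions chosen for the example.

```python
from urllib.parse import urlencode


def availability_query_url(base_url: str, job: str, window: str = "30d") -> str:
    """Build an instant-query URL for the average of `up` over a window.

    `base_url`, `job` and `window` are hypothetical example values, not
    settings taken from the SoilWise deployment.
    """
    # `up` is 1 when a scrape succeeds and 0 when it fails, so its average
    # over a window approximates the availability of the scrape target.
    promql = f'avg_over_time(up{{job="{job}"}}[{window}])'
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def availability_percent(samples: list[int]) -> float:
    """Compute the same ratio locally from a list of 0/1 `up` samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return 100.0 * sum(samples) / len(samples)


if __name__ == "__main__":
    print(availability_query_url("http://prometheus.example:9090", "ows"))
    print(availability_percent([1, 1, 0, 1]))
```

The ratio only reflects whether Prometheus could scrape the target; a service that answers scrapes but serves errors to users would need request-level metrics as well.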
Grafana provides the dashboarding layer. User sign-up is disabled, dashboards are provisioned via code (no UI edits), and the admin password is stored in vault. A total of 29 dashboards are provisioned, covering:
- Kubernetes cluster health (nodes, volumes, autoscaler)
- Service performance and request metrics (OWS, nginx, Java/Vert.x services)
- Container and pod resource consumption
- Log analysis via Loki (HTTP status codes, error rates, OWS access patterns)
- AWS infrastructure (billing, EFS, S3, ELB)
- Long-term OWS availability statistics
Alertmanager is configured with Slack notifications for operational alerts. PagerDuty integration is available for critical environments.
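The routing behaviour described above can be modelled in a few lines. This is an illustrative sketch only: the actual routing lives in the Alertmanager configuration, and the `severity` label values, the receiver names, and the `pagerduty_enabled` flag are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # assumed label values: "warning" or "critical"


def receivers_for(alert: Alert, pagerduty_enabled: bool) -> list[str]:
    """Return the receivers an alert would be sent to in this simplified model."""
    # Every operational alert goes to Slack.
    receivers = ["slack"]
    # Critical alerts additionally page on-call staff, but only in
    # environments where the PagerDuty integration is switched on.
    if pagerduty_enabled and alert.severity == "critical":
        receivers.append("pagerduty")
    return receivers


if __name__ == "__main__":
    disk_full = Alert(name="VolumeAlmostFull", severity="critical")
    print(receivers_for(disk_full, pagerduty_enabled=True))
```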
Loki with Promtail handles log aggregation. It is enabled optionally per environment and feeds structured log dashboards in Grafana, including filtering by HTTP status code to identify 4xx/5xx errors.
Usage Monitoring
Usage of the SWC services is tracked at two levels:
Infrastructure-level usage is captured through nginx access logs, which record per-request timing, upstream response times, HTTP status codes, and user agent information. These logs are ingested into Loki and visualised in Grafana through dedicated OWS log analysis dashboards. Detailed per-publication, per-service-type OWS metrics are enabled for the SoilWise deployment (nginx_detailed_ows: true).
Website-level usage of the SoilWise public website is tracked via Hotjar and Google Analytics. These are managed separately from the infrastructure monitoring stack and are configured at the website/CMS level.
In addition to these two levels, the data-portal platform includes built-in usage reporting features (CSV usage reports per organisation, WMS/WFS usage statistics) that can be enabled through feature toggles.
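To make the infrastructure-level log analysis more concrete, the sketch below groups access-log entries by HTTP status class and averages the request time, which is roughly what the Grafana log dashboards present. It assumes, purely for illustration, that each log line is a JSON object with `status` and `request_time` fields; the actual nginx log format used in the deployment is not specified here.

```python
import json
from collections import Counter
from typing import Iterable


def status_class(status: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> "4xx"."""
    return f"{status // 100}xx"


def summarise(lines: Iterable[str]) -> tuple[dict[str, int], float]:
    """Count requests per status class and compute the mean request time.

    Assumes one JSON object per line with `status` and `request_time`
    (seconds); both field names are hypothetical.
    """
    counts: Counter[str] = Counter()
    total_time = 0.0
    n = 0
    for line in lines:
        entry = json.loads(line)
        counts[status_class(int(entry["status"]))] += 1
        total_time += float(entry["request_time"])
        n += 1
    mean_time = total_time / n if n else 0.0
    return dict(counts), mean_time


if __name__ == "__main__":
    sample = [
        '{"status": 200, "request_time": 0.1}',
        '{"status": 404, "request_time": 0.2}',
        '{"status": 502, "request_time": 0.6}',
    ]
    print(summarise(sample))
```

Grouping by status class rather than by individual code keeps the summary small while still separating client errors (4xx) from server errors (5xx).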
Technological Stack
| Component | Purpose |
|---|---|
| Grafana | Dashboards and visualisation |
| Prometheus | Metrics collection and retention |
| Alertmanager | Alert routing (Slack, PagerDuty) |
| Loki | Log aggregation |
| Promtail | Log shipping |
| Node Exporter | Host metrics |
| kube-state-metrics | Kubernetes cluster metrics |
| Hotjar | Website usage analytics (frontend) |
| Google Analytics | Website traffic statistics (frontend) |
| nginx | Access logs for usage and error analysis (shipped to Loki) |
Deployed via the kube-prometheus-stack, loki, and promtail Helm charts.
Grafana data sources configured: Prometheus (default), Loki, CloudWatch.
Integrations & Interfaces
- Slack — Operational alerts via Alertmanager (configured for the SoilWise setup)
- PagerDuty — Critical alerts (available, optional per environment)
- Loki — Structured log queries accessible from Grafana dashboards
- nginx — Access logs feed into Loki for OWS log analysis and HTTP status code filtering
- Kubernetes — Liveness probes configured on service deployments (e.g. Solr API) for automated health checking; resource limits/requests (CPU, memory) defined per deployment feed into cluster-level monitoring
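The liveness-probe behaviour mentioned in the last item can be sketched as follows. For HTTP probes, Kubernetes treats any status code from 200 up to but not including 400 as success, and restarts the container once the configured number of consecutive failures (`failureThreshold`, default 3) is reached. The function names and the sample status codes are hypothetical, and the probe settings of the actual deployments may differ.

```python
from typing import Iterable


def http_probe_ok(status_code: int) -> bool:
    """An HTTP probe succeeds for status codes in the range 200-399."""
    return 200 <= status_code < 400


def should_restart(status_codes: Iterable[int], failure_threshold: int = 3) -> bool:
    """Return True once `failure_threshold` consecutive probes have failed."""
    consecutive_failures = 0
    for code in status_codes:
        if http_probe_ok(code):
            # Any success resets the failure streak.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            return True
    return False


if __name__ == "__main__":
    print(should_restart([200, 503, 503, 503]))
    print(should_restart([503, 503, 200, 503]))
```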