
System & Usage Monitoring

Info

Current version: Grafana 12.1.1 / Prometheus v2.55.1 / Prometheus LTS v2.18.1

Project: Usage statistics

Introduction

Overview and Scope

All components and services of the SWC are monitored at several levels to ensure robust operation and security of the system. A central monitoring service based on Prometheus and Grafana is deployed to the SoilWise Kubernetes cluster alongside the SWC services.

Monitoring covers:

  • Node and container resource utilisation (RAM, CPU, volumes, transfer, uptime)
  • Service health checks (liveness and readiness probes)
  • Operational alerting via Slack and PagerDuty
  • Availability statistics and trend analysis
  • Usage statistics for SWC services and the public website
  • Log aggregation and filtering by HTTP status code to identify errors (4xx, 5xx)

Architecture

System Health Monitoring

Infrastructure and application monitoring is implemented using a Prometheus + Grafana stack.

Prometheus (v2.55.1) scrapes metrics from all services, the Kubernetes cluster, and host nodes via Node Exporter (v1.9.1) and kube-state-metrics. A second Prometheus LTS instance (v2.18.1) retains metrics for longer periods (currently configured at 4 GB / ~30 days) to support trend analysis and availability reporting.
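The relationship between the LTS instance's 4 GB budget and its ~30-day window can be sanity-checked with the sizing formula from the Prometheus storage documentation (needed disk ≈ retention time × ingested samples per second × bytes per sample, with compressed samples averaging roughly 1-2 bytes). The ingest rate below is an illustrative assumption, not a measured SoilWise value:

```python
# Back-of-envelope check of the LTS retention budget (4 GB / ~30 days),
# using the sizing rule of thumb from the Prometheus storage docs.
BYTES_PER_SAMPLE = 1.5        # compressed samples average roughly 1-2 bytes
RETENTION_DAYS = 30
SAMPLES_PER_SECOND = 1_000    # assumed ingest rate (active series / scrape interval)

retention_seconds = RETENTION_DAYS * 24 * 3600
needed_bytes = retention_seconds * SAMPLES_PER_SECOND * BYTES_PER_SAMPLE
print(f"~{needed_bytes / 1024**3:.1f} GiB for {RETENTION_DAYS} days")
# roughly 3.6 GiB at this rate, i.e. close to the configured 4 GB budget
```

At a higher ingest rate the same 4 GB would cover proportionally fewer days, which is why retention is expressed as a size limit with an approximate time window.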

Grafana (v12.1.1) provides the dashboarding layer. User sign-up is disabled, dashboards are provisioned as code (no UI edits), and the admin password is stored in the vault. Sixty-seven dashboards are provisioned, covering:

  • Kubernetes cluster health (nodes, volumes, autoscaler)
  • Service performance and request metrics (OWS, nginx, Java/Vert.x services)
  • Container and pod resource consumption
  • Log analysis via Loki (HTTP status codes, error rates, OWS access patterns)
  • AWS infrastructure (billing, EFS, S3, ELB)
  • Long-term OWS availability statistics
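The long-term availability figures in these dashboards reduce to averaging the Prometheus `up` metric over the LTS retention window (what PromQL's `avg_over_time(up{job="..."}[30d])` returns). A minimal sketch of that calculation, with a synthetic sample series for illustration:

```python
# Availability over a window as the mean of Prometheus `up` samples (1 = up, 0 = down);
# equivalent to PromQL `avg_over_time(up[30d])`. The sample data is synthetic.
def availability(samples: list[int]) -> float:
    """Fraction of scrapes in which the target was up."""
    return sum(samples) / len(samples) if samples else 0.0

scrapes = [1] * 2870 + [0] * 10   # 2880 scrapes ~= 30 days at 15-minute intervals
print(f"{availability(scrapes):.3%}")   # 10 failed scrapes -> 99.653%
```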

Alertmanager (v0.28.1) is configured with Slack notifications for operational alerts. PagerDuty integration is available for critical environments.
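A minimal Alertmanager route for the Slack integration might look like the following sketch; the channel name and webhook URL are placeholders, not the SoilWise values:

```yaml
# Sketch of an Alertmanager Slack receiver; channel and api_url are placeholders.
route:
  receiver: slack-ops
  group_by: [alertname, namespace]
receivers:
  - name: slack-ops
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # kept in the vault in practice
        channel: '#ops-alerts'
        send_resolved: true
```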

Loki (v3.4.1) with Promtail (v3.0.0) handles log aggregation. It can be enabled per environment and feeds structured log dashboards in Grafana, including filtering by HTTP status code to surface 4xx/5xx errors.
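The status-code filtering in those dashboards corresponds to LogQL queries along these lines; the stream labels are assumptions about how Promtail tags the nginx logs, not the actual SoilWise label set:

```logql
# Count of nginx 5xx responses per minute, grouped by status code
# ({app="nginx"} and the json-parsed `status` label are assumed for illustration)
sum by (status) (
  count_over_time({app="nginx"} | json | status >= 500 [1m])
)
```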

Usage Monitoring

Usage of the SWC services is tracked at two levels:

Infrastructure-level usage is captured through nginx access logs, which record per-request timing, upstream response times, HTTP status codes, and user agent information. These logs are ingested into Loki and visualised in Grafana through dedicated OWS log analysis dashboards. Detailed per-publication, per-service-type OWS metrics are enabled for the SoilWise deployment (nginx_detailed_ows: true).

Website-level usage of the SoilWise public website is tracked via Hotjar and Google Analytics. These are managed separately from the infrastructure monitoring stack and are configured at the website/CMS level.

The hale-connect platform includes built-in usage reporting features (CSV usage reports per organisation, WMS/WFS usage statistics) that can be enabled as feature toggles.

Technological Stack

| Component | Version | Purpose |
| --- | --- | --- |
| Grafana | 12.1.1 | Dashboards and visualisation |
| Prometheus | v2.55.1 | Metrics collection |
| Prometheus LTS | v2.18.1 | Long-term metric retention (~30 days) |
| Alertmanager | v0.28.1 | Alert routing (Slack, PagerDuty) |
| Loki | 3.4.1 | Log aggregation |
| Promtail | 3.0.0 | Log shipping |
| Node Exporter | v1.9.1 | Host metrics |
| kube-state-metrics | | Kubernetes cluster metrics |
| Hotjar | | Website usage analytics (frontend) |
| Google Analytics | | Website traffic statistics (frontend) |

Grafana data sources configured: Prometheus (default), Prometheus LTS, Loki, CloudWatch.
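Since dashboards and data sources are provisioned as code, the data-source list above corresponds to a Grafana provisioning file along these lines; the in-cluster service URLs are assumptions, not the SoilWise endpoints:

```yaml
# Sketch of Grafana data-source provisioning (apiVersion 1 file format);
# URLs are illustrative placeholders.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Prometheus LTS
    type: prometheus
    url: http://prometheus-lts:9090
  - name: Loki
    type: loki
    url: http://loki:3100
```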

Integrations & Interfaces

  • Slack — Operational alerts via Alertmanager (configured for the SoilWise setup)
  • PagerDuty — Critical alerts (available, optional per environment)
  • Loki — Structured log queries accessible from Grafana dashboards
  • nginx — Access logs feed into Loki for OWS log analysis and HTTP status code filtering
  • Kubernetes — Liveness probes on service deployments (e.g. the Solr API) provide automated health checking; per-deployment resource requests and limits (CPU, memory) feed into cluster-level monitoring
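A deployment wired into this monitoring might carry a probe and resource spec like the following sketch; the container name, port, path, and values are illustrative, not the actual Solr API manifest:

```yaml
# Sketch of a liveness probe plus resource requests/limits on a container spec;
# all names and values are illustrative placeholders.
containers:
  - name: solr-api
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 15
    resources:
      requests: {cpu: 250m, memory: 512Mi}
      limits: {cpu: "1", memory: 1Gi}
```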