Configuration: Observability
This page covers the Actuator endpoints, Prometheus and Helm chart integration, the Grafana dashboard, and the metrics that Kairos exports.
Default Actuator Endpoints
| Path | Description |
|---|---|
| /actuator/health | Health endpoint |
| /actuator/prometheus | Prometheus scrape endpoint |
| /actuator/info | Build/runtime info |
All of the above are exposed by default.
To customize exposure, set:
management.endpoints.web.exposure.include=health,info,prometheus
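The same setting can also be written in YAML form. This is a minimal sketch that assumes Kairos is configured through a standard Spring Boot application.yml file; the property path is the one shown above, only the notation changes.

```yaml
# Equivalent of management.endpoints.web.exposure.include=health,info,prometheus
# (assumes configuration is supplied via application.yml)
management:
  endpoints:
    web:
      exposure:
        # Comma-separated list of endpoint IDs to expose over HTTP
        include: health,info,prometheus
```

Removing an ID from the list stops that endpoint from being served under /actuator, so keep prometheus in the list if Prometheus scrapes the instance.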
Prometheus Scrape Example
scrape_configs:
  - job_name: kairos
    static_configs:
      - targets: ["kairos:8080"]
    metrics_path: /actuator/prometheus
Helm Chart Integration
The Helm chart annotates the Kairos Service for Prometheus scraping by default:
prometheus.io/scrape: "true"
prometheus.io/path: /actuator/prometheus
prometheus.io/port: "8080"
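These annotations are a convention rather than built-in Prometheus behavior: they only take effect when your Prometheus installation is configured to discover and scrape annotated services.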
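To illustrate what annotation-based scraping usually involves on the Prometheus side, the sketch below shows a commonly used relabeling pattern that honors the three annotations above. The job name is hypothetical, and the example assumes Prometheus runs inside the cluster with permission to discover endpoints. It is not part of the Kairos chart, and your installation may already ship an equivalent job.

```yaml
scrape_configs:
  - job_name: kubernetes-service-endpoints   # hypothetical job name
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only targets whose Service carries prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: $1:$2
        target_label: __address__
```

Without a job of this kind, the annotations are ignored and the Kairos Service is never scraped.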
If you use Prometheus Operator, enable the chart's ServiceMonitor:
metrics:
  serviceMonitor:
    enabled: true
Optional chart values:
metrics:
  path: /actuator/prometheus
  serviceAnnotations:
    enabled: true
  podAnnotations:
    enabled: false
  serviceMonitor:
    enabled: false
    namespace: ""
    interval: 30s
    scrapeTimeout: 10s
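As an example of combining these values, a values override for a cluster that runs Prometheus Operator might look like the sketch below. It uses only the keys listed above. The monitoring namespace and the 15s interval are placeholder values, and disabling the service annotations is an assumption made here to avoid scraping the same target twice when an annotation-based job also exists.

```yaml
# Hypothetical override file, passed to Helm with -f
metrics:
  path: /actuator/prometheus
  serviceAnnotations:
    enabled: false          # assumption: rely on the ServiceMonitor only
  serviceMonitor:
    enabled: true
    namespace: monitoring   # placeholder: namespace your Operator watches
    interval: 15s           # placeholder: scrape interval
    scrapeTimeout: 10s      # kept below the interval
```

Keeping scrapeTimeout no larger than interval matters because Prometheus rejects a scrape timeout that exceeds the scrape interval.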
Grafana Dashboard
A ready-to-import Grafana dashboard is available at:
docs/assets/kairos-grafana-dashboard.json
Import it in Grafana through Dashboards -> New -> Import, upload the JSON file, and select the Prometheus datasource that scrapes Kairos.
The dashboard includes:
- Resource availability, available/down/unknown totals, and current status mix
- Resource status timeline and most volatile resources over the selected range
- Resource type breakdown for HTTP, Docker, TCP, and any future resource types
- Check latency overview, p95/p99 latency, and latest DNS/connect/TLS phase latency
- Check outcome rates and failure reasons by normalized error code
- Active outages, active outage duration, outage events, and resolved outage duration
- Prometheus scrape health and scrape duration
- Spring Boot/Micrometer runtime panels for process uptime, HTTP traffic, JVM memory, CPU, threads, and GC pause pressure
The dashboard uses Kairos-specific metrics for resource health, check latency, and outages, plus standard Spring Boot actuator metrics for runtime panels.
Kairos Metrics
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| kairos_resource_status | Gauge | resource_name, resource_type | Current resource status: 1 available, 0 not available, -1 unknown |
| kairos_resource_last_check_timestamp_seconds | Gauge | resource_name, resource_type | Unix timestamp of the latest persisted check |
| kairos_resource_last_check_latency_seconds | Gauge | resource_name, resource_type, phase | Latest latency for total, dns, connect, or tls; optional phases appear only when measured |
| kairos_resource_checks_total | Counter | resource_name, resource_type, status, error_code | Persisted check results by status and normalized error code |
| kairos_resource_check_duration_seconds | Timer / histogram | resource_name, resource_type, status | Total check duration distribution |
| kairos_resource_check_phase_duration_seconds | Timer / histogram | resource_name, resource_type, phase | DNS, connect, and TLS phase duration distribution |
| kairos_active_outages | Gauge | none | Total active outages |
| kairos_resource_outage_active | Gauge | resource_name, resource_type | 1 when the resource has an active outage, otherwise 0 |
| kairos_resource_active_outage_duration_seconds | Gauge | resource_name, resource_type | Duration of the active outage, or 0 when inactive |
| kairos_resource_outage_started_total | Counter | resource_name, resource_type | Outage start events |
| kairos_resource_outage_resolved_total | Counter | resource_name, resource_type | Outage resolution events |
| kairos_resource_outage_duration_seconds | Timer / histogram | resource_name, resource_type | Resolved outage duration distribution |
All resource metrics use resource_type values such as HTTP, DOCKER, and TCP.
Timer metrics expose Prometheus _count, _sum, and _bucket series so Grafana can calculate averages and quantiles with histogram_quantile(...).
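As an illustration of how the _sum and _count series yield an average, the sketch below defines a Prometheus recording rule that precomputes the mean check duration per resource. The group name, the recorded series name, and the 5m window are hypothetical choices, not something Kairos ships.

```yaml
groups:
  - name: kairos-recording          # hypothetical group name
    rules:
      # Mean check duration over 5 minutes: total seconds spent in checks
      # divided by the number of checks, per resource.
      - record: kairos:resource_check_duration_seconds:mean5m   # hypothetical name
        expr: |
          sum by (resource_name) (rate(kairos_resource_check_duration_seconds_sum[5m]))
          /
          sum by (resource_name) (rate(kairos_resource_check_duration_seconds_count[5m]))
```

A resource with no checks in the window has a zero denominator and produces NaN, which Grafana typically renders as a gap rather than a value.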
PromQL Examples
Current active outages:
kairos_active_outages
p95 check latency by resource:
histogram_quantile(
  0.95,
  sum by (le, resource_name) (
    rate(kairos_resource_check_duration_seconds_bucket[5m])
  )
)
Failure rate by resource and error code:
sum by (resource_name, error_code) (
  rate(kairos_resource_checks_total{status="NOT_AVAILABLE"}[5m])
)
Latest phase latency:
kairos_resource_last_check_latency_seconds{phase=~"dns|connect|tls|total"}
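Queries like the ones above can also drive alerting. The following is a minimal sketch of a Prometheus rule file built on the documented metrics. The alert names, the severity labels, the 5m hold time, and the 300-second staleness threshold are all hypothetical and should be tuned to your check intervals.

```yaml
groups:
  - name: kairos-alerts                      # hypothetical group name
    rules:
      # Fires when a resource has had an active outage for 5 minutes
      - alert: KairosResourceOutage          # hypothetical alert name
        expr: kairos_resource_outage_active == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Outage active for {{ $labels.resource_name }} ({{ $labels.resource_type }})"
      # Fires when no check has been persisted for a resource recently
      - alert: KairosCheckStale              # hypothetical alert name
        expr: time() - kairos_resource_last_check_timestamp_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No recent check persisted for {{ $labels.resource_name }}"
```

The first rule alerts per resource because kairos_resource_outage_active carries the resource_name label; alerting on kairos_active_outages instead would give a single alert without resource detail.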