Configuration: Observability

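This page covers the Spring Boot Actuator endpoints, Prometheus scraping, Helm chart integration, the bundled Grafana dashboard, and the metrics Kairos exports.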

Default Actuator Endpoints

Path	Description
`/actuator/health`	Health endpoint
`/actuator/prometheus`	Prometheus scrape endpoint
`/actuator/info`	Build/runtime info

All of the above are exposed by default.

To customize exposure, set:

management.endpoints.web.exposure.include=health,info,prometheus

Prometheus Scrape Example

scrape_configs:
  - job_name: kairos
    static_configs:
      - targets: ["kairos:8080"]
    metrics_path: /actuator/prometheus

Helm Chart Integration

The Helm chart annotates the Kairos Service for Prometheus scraping by default:

prometheus.io/scrape: "true"
prometheus.io/path: /actuator/prometheus
prometheus.io/port: "8080"

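These annotations are a convention rather than a built-in Prometheus feature. They take effect only when your Prometheus scrape configuration uses Kubernetes service discovery with relabeling rules that honor annotated services.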

If you use Prometheus Operator, enable the chart's ServiceMonitor:

metrics:
  serviceMonitor:
    enabled: true

Optional chart values:

metrics:
  path: /actuator/prometheus
  serviceAnnotations:
    enabled: true
  podAnnotations:
    enabled: false
  serviceMonitor:
    enabled: false
    namespace: ""
    interval: 30s
    scrapeTimeout: 10s

Grafana Dashboard

A ready-to-import Grafana dashboard is available at:

docs/assets/kairos-grafana-dashboard.json

Import it in Grafana through Dashboards -> New -> Import, upload the JSON file, and select the Prometheus datasource that scrapes Kairos.

The dashboard includes:

Resource availability, available/down/unknown totals, and current status mix
Resource status timeline and most volatile resources over the selected range
Resource type breakdown for HTTP, Docker, TCP, and any future resource types
Check latency overview, p95/p99 latency, and latest DNS/connect/TLS phase latency
Check outcome rates and failure reasons by normalized error code
Active outages, active outage duration, outage events, and resolved outage duration
Prometheus scrape health and scrape duration
Spring Boot/Micrometer runtime panels for process uptime, HTTP traffic, JVM memory, CPU, threads, and GC pause pressure

The dashboard uses Kairos-specific metrics for resource health, check latency, and outages, plus standard Spring Boot actuator metrics for runtime panels.

Kairos Metrics

Metric	Type	Labels	Meaning
`kairos_resource_status`	Gauge	`resource_name`, `resource_type`	Current resource status: `1` available, `0` not available, `-1` unknown
`kairos_resource_last_check_timestamp_seconds`	Gauge	`resource_name`, `resource_type`	Unix timestamp of the latest persisted check
`kairos_resource_last_check_latency_seconds`	Gauge	`resource_name`, `resource_type`, `phase`	Latest latency for `total`, `dns`, `connect`, or `tls`; optional phases appear only when measured
`kairos_resource_checks_total`	Counter	`resource_name`, `resource_type`, `status`, `error_code`	Persisted check results by status and normalized error code
`kairos_resource_check_duration_seconds`	Timer / histogram	`resource_name`, `resource_type`, `status`	Total check duration distribution
`kairos_resource_check_phase_duration_seconds`	Timer / histogram	`resource_name`, `resource_type`, `phase`	DNS, connect, and TLS phase duration distribution
`kairos_active_outages`	Gauge	none	Total active outages
`kairos_resource_outage_active`	Gauge	`resource_name`, `resource_type`	`1` when the resource has an active outage, otherwise `0`
`kairos_resource_active_outage_duration_seconds`	Gauge	`resource_name`, `resource_type`	Duration of the active outage, or `0` when inactive
`kairos_resource_outage_started_total`	Counter	`resource_name`, `resource_type`	Outage start events
`kairos_resource_outage_resolved_total`	Counter	`resource_name`, `resource_type`	Outage resolution events
`kairos_resource_outage_duration_seconds`	Timer / histogram	`resource_name`, `resource_type`	Resolved outage duration distribution

All resource metrics use resource_type values such as HTTP, DOCKER, and TCP.
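
To make the status encoding in the table concrete, the following is a minimal sketch that reads Prometheus text-format output and maps `kairos_resource_status` values to readable names. The sample payload, the resource names (`api-gateway`, `postgres`, `cache`), and the helper `parse_resource_status` are hypothetical, and the label parser is deliberately simplified: it assumes label values contain no escaped quotes or `}` characters.

```python
import re

# Status encoding taken from the metrics table: 1 available, 0 not available, -1 unknown.
STATUS_NAMES = {1.0: "available", 0.0: "not available", -1.0: "unknown"}

# Simplified patterns: they assume label values contain no escaped quotes or "}".
SAMPLE = re.compile(r'^kairos_resource_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_resource_status(text):
    """Return {(resource_name, resource_type): status_name} from exposition text."""
    statuses = {}
    for line in text.splitlines():
        match = SAMPLE.match(line.strip())
        if match is None:
            continue  # comment lines and other metrics are skipped
        labels = dict(LABEL.findall(match.group("labels")))
        key = (labels.get("resource_name"), labels.get("resource_type"))
        value = float(match.group("value"))
        statuses[key] = STATUS_NAMES.get(value, "unrecognized")
    return statuses


# Hypothetical scrape output, for illustration only.
EXAMPLE = """\
# TYPE kairos_resource_status gauge
kairos_resource_status{resource_name="api-gateway",resource_type="HTTP"} 1.0
kairos_resource_status{resource_name="postgres",resource_type="TCP"} 0.0
kairos_resource_status{resource_name="cache",resource_type="DOCKER"} -1.0
kairos_active_outages 1.0
"""

for (name, rtype), status in parse_resource_status(EXAMPLE).items():
    print(f"{name} ({rtype}): {status}")
```

For anything beyond a quick check, a real exposition-format parser is a better choice than these regular expressions.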

Timer metrics expose Prometheus _count, _sum, and _bucket series so Grafana can calculate averages and quantiles with histogram_quantile(...).
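
As an illustration of what `histogram_quantile(...)` does with the `_bucket` series, here is a minimal Python sketch of the linear interpolation it applies to cumulative buckets. It is a simplified approximation of the PromQL function, not its implementation: it assumes non-negative bucket bounds, accepts only `0 < q <= 1`, and the bucket values are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    `buckets` corresponds to the `le` label and value of the _bucket series
    and must include the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if math.isinf(upper):
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    # The lowest bucket is assumed to start at 0.
    lower, previous = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Invented cumulative counts for a check duration histogram (seconds).
example = [(0.1, 10), (0.5, 60), (1.0, 95), (math.inf, 100)]
print(histogram_quantile(0.5, example))  # roughly 0.42
```

Because the result is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.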

PromQL Examples

Current active outages:

kairos_active_outages

p95 check latency by resource:

histogram_quantile(
  0.95,
  sum by (le, resource_name) (
    rate(kairos_resource_check_duration_seconds_bucket[5m])
  )
)

Failure rate by resource and error code:

sum by (resource_name, error_code) (
  rate(kairos_resource_checks_total{status="NOT_AVAILABLE"}[5m])
)

Latest phase latency:

kairos_resource_last_check_latency_seconds{phase=~"dns|connect|tls|total"}
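
The queries above can also be run outside Grafana through the Prometheus HTTP API. The following is a minimal sketch using only the Python standard library and the `/api/v1/query` instant-query endpoint. The base URL `http://prometheus:9090` and the helper names `build_query_url` and `instant_query` are assumptions for illustration; adjust them to your environment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instant_query(base_url, promql, timeout=5.0):
    """Run an instant query and return the list of result samples."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    return body["data"]["result"]


# Hypothetical Prometheus address; no request is sent until instant_query is called.
PROMETHEUS_URL = "http://prometheus:9090"
print(build_query_url(PROMETHEUS_URL, "kairos_active_outages"))
```

Calling `instant_query(PROMETHEUS_URL, "kairos_active_outages")` against a reachable Prometheus server returns the current samples for that metric.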