
Metrics and Alerting Reference

Comprehensive reference for metrics collection, visualization, alerting, SLOs, and uptime monitoring.


Prometheus

Architecture Overview

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│ Application │────▶│  Prometheus  │────▶│  Alertmanager   │
│  /metrics   │pull │    (TSDB)    │push │ (routing/notif) │
└─────────────┘     └──────┬───────┘     └─────────────────┘
                           │query
                    ┌──────▼───────┐
                    │   Grafana    │
                    │ (dashboards) │
                    └──────────────┘

Key characteristics:

  • Pull-based model (Prometheus scrapes targets)
  • Local time-series database (TSDB)
  • PromQL query language
  • Built-in alerting rules evaluated by Prometheus, routed by Alertmanager
  • Service discovery (Kubernetes, Consul, DNS, file-based, EC2)

Prometheus Configuration (prometheus.yml)

global:
  scrape_interval: 15s          # Default scrape interval
  evaluation_interval: 15s      # Rule evaluation interval
  scrape_timeout: 10s           # Per-scrape timeout

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "rules/*.yml"

# Scrape targets
scrape_configs:
  # Self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application with static targets
  - job_name: "api-server"
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ["api1:8080", "api2:8080"]
        labels:
          environment: production

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Node exporter
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

PromQL Basics

Rate and Increase

# Per-second rate over 5 minutes (use for counters)
rate(http_requests_total[5m])

# Per-second rate for specific status codes
rate(http_requests_total{status_code=~"5.."}[5m])

# Total increase over 1 hour (use for counters)
increase(http_requests_total[1h])

# irate: instant rate using last two data points (more volatile)
irate(http_requests_total[5m])

Rule: Always use rate() or increase() with counters. Never display raw counter values.
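
To see why raw counter values mislead, the sketch below computes a per-second rate from raw samples while treating any drop in value as a counter reset. It is a simplified illustration, not Prometheus's actual algorithm: the real `rate()` also extrapolates to the edges of the range window. The function name `simple_rate` and the sample data are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: Sequence[Sample]) -> float:
    """Per-second increase of a counter across the given samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as new increase instead of
    producing a negative delta.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return increase / elapsed


# The process restarts between t=15 and t=30, so the raw value falls
# from 130 to 10 even though requests kept arriving.
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(samples))
```

The raw series ends lower than it started (40 versus 100), yet the computed rate stays positive because the reset is accounted for.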

Aggregation Operators

# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum by specific label
sum by (method, path) (rate(http_requests_total[5m]))

# Average idle CPU fraction across all cores and instances
avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Maximum memory in use across all instances
max(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)

# Count targets that are currently up
count(up == 1)

# Top 5 by value
topk(5, rate(http_requests_total[5m]))

# Bottom 5 by value
bottomk(5, rate(http_requests_total[5m]))

Histogram Quantiles

# 99th percentile latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# 95th percentile latency by service
histogram_quantile(0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# 50th percentile (median)
histogram_quantile(0.50,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Average latency from histogram
sum(rate(http_request_duration_seconds_sum[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
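
The quantile queries above estimate a percentile from cumulative bucket counts rather than from individual observations. The sketch below shows the idea for classic (`le`-labelled) histograms: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified approximation of what Prometheus does, and the function and bucket data are hypothetical.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to have an upper bound of +Inf. Values are
    assumed to be spread evenly inside each bucket, and the lowest bucket
    is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: the best available
                # answer is the highest finite bucket boundary.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(estimate_quantile(0.95, buckets))
print(estimate_quantile(0.999, buckets))
```

Because of the interpolation, the estimate can never be more precise than the bucket boundaries allow, so it helps to place boundaries near the latency thresholds you alert on.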

Useful Functions

# Detect a missing series (the job has no registered targets; a failed scrape reports up == 0 instead)
absent(up{job="api-server"})

# Seconds since the process started (uptime)
time() - process_start_time_seconds

# Predict value in 4 hours using linear regression
predict_linear(node_filesystem_avail_bytes[6h], 4*3600)

# Compare to 1 week ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 7d)

# Clamping values
clamp_min(free_disk_percentage, 0)
clamp_max(cpu_usage_percentage, 100)

# Label manipulation
label_replace(up, "short_instance", "$1", "instance", "(.*):.*")
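
The `predict_linear` query above fits a straight line through the samples in the range and extends it into the future. A minimal sketch of that least-squares fit follows. It assumes the last sample's timestamp stands in for "now" (Prometheus uses the query evaluation time), and the function name and data are hypothetical.

```python
from typing import Sequence, Tuple


def predict_value(samples: Sequence[Tuple[float, float]], horizon_seconds: float) -> float:
    """Fit value = intercept + slope * t by least squares and extrapolate.

    Timestamps are shifted so the last sample is t = 0, which makes the
    intercept the fitted value "now".
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    now = samples[-1][0]
    xs = [t - now for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must span more than one timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon_seconds


# Free space shrinking by 2 bytes per second, sampled once a minute.
samples = [(float(t), 1000.0 - 2.0 * t) for t in range(0, 301, 60)]
print(predict_value(samples, 100))  # still positive
print(predict_value(samples, 300))  # negative: the disk would be full by then
```

A negative prediction is what a disk-fill alert keys on: the fitted line crosses zero before the horizon.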

Recording Rules

Pre-compute expensive queries for dashboards and alerts:

# rules/recording-rules.yml
groups:
  - name: http_request_rules
    interval: 15s
    rules:
      # Pre-compute request rate by service and status
      - record: job:http_requests:rate5m
        expr: sum by (job, status_code) (rate(http_requests_total[5m]))

      # Pre-compute error ratio (0-1; rendered as a percentage in alert annotations)
      - record: job:http_request_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute p99 latency
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Pre-compute availability
      - record: job:availability:ratio5m
        expr: |
          1 - (
            sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          )

Alerting Rules

# rules/alerting-rules.yml
groups:
  - name: service_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: job:http_request_errors:ratio5m > 0.01
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://runbooks.example.com/high-error-rate"
          dashboard_url: "https://grafana.example.com/d/service-overview?var-service={{ $labels.job }}"

      # Critical error rate
      - alert: CriticalErrorRate
        expr: job:http_request_errors:ratio5m > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Critical error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook_url: "https://runbooks.example.com/critical-error-rate"

      # High latency
      - alert: HighLatencyP99
        expr: job:http_request_duration_seconds:p99_5m > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s on {{ $labels.job }}"
          description: "P99 latency is {{ $value | humanizeDuration }}"

      # Target down
      - alert: TargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
          description: "Prometheus cannot scrape {{ $labels.job }}/{{ $labels.instance }}"

  - name: infrastructure_alerts
    rules:
      # Disk space prediction
      - alert: DiskWillFillIn24Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will fill within 24 hours"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% on {{ $labels.instance }}"

Grafana

Dashboard JSON Structure

{
  "dashboard": {
    "title": "Service Overview",
    "uid": "service-overview",
    "tags": ["production", "services"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(up, job)",
          "refresh": 2,
          "multi": true,
          "includeAll": true
        },
        {
          "name": "interval",
          "type": "interval",
          "options": [
            {"text": "1m", "value": "1m"},
            {"text": "5m", "value": "5m"},
            {"text": "15m", "value": "15m"}
          ],
          "current": {"text": "5m", "value": "5m"}
        }
      ]
    },
    "panels": []
  }
}

Panel Types

Time Series Panel

{
  "type": "timeseries",
  "title": "Request Rate",
  "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
  "targets": [
    {
      "expr": "sum by (status_code) (rate(http_requests_total{job=~\"$service\"}[$interval]))",
      "legendFormat": "{{status_code}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "custom": {
        "drawStyle": "line",
        "fillOpacity": 10,
        "stacking": {"mode": "none"}
      }
    }
  }
}

Stat Panel

{
  "type": "stat",
  "title": "Current Error Rate",
  "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{job=~\"$service\",status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=~\"$service\"}[5m]))",
      "instant": true
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.001},
          {"color": "red", "value": 0.01}
        ]
      }
    }
  }
}

Gauge Panel

{
  "type": "gauge",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "1 - avg(rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\"}[5m]))",
      "instant": true
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "min": 0,
      "max": 1,
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.7},
          {"color": "red", "value": 0.9}
        ]
      }
    }
  }
}

Table Panel

{
  "type": "table",
  "title": "Top Endpoints by Error Rate",
  "targets": [
    {
      "expr": "topk(10, sum by (method, path) (rate(http_requests_total{status_code=~\"5..\"}[5m])))",
      "instant": true,
      "format": "table"
    }
  ],
  "transformations": [
    {"id": "organize", "options": {"excludeByName": {"Time": true}}}
  ]
}

Grafana Variables

| Type | Use Case | Example |
|------|----------|---------|
| Query | Dynamic from datasource | label_values(up, job) |
| Custom | Fixed list of values | production,staging,development |
| Interval | Time range intervals | 1m,5m,15m,1h |
| Datasource | Multiple Prometheus instances | Type: datasource, Query: Prometheus |
| Text box | Free-form input | Filter by custom string |

Annotations

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "enable": true,
        "expr": "changes(process_start_time_seconds{job=\"api-server\"}[1m]) > 0",
        "tagKeys": "job",
        "titleFormat": "Deployment: {{job}}"
      },
      {
        "name": "Alerts",
        "datasource": "-- Grafana --",
        "enable": true,
        "type": "alert"
      }
    ]
  }
}

OpenTelemetry Metrics

Go SDK Setup

package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/prometheus"
    "go.opentelemetry.io/otel/metric"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMeterProvider() (*sdkmetric.MeterProvider, error) {
    exporter, err := prometheus.New()
    if err != nil {
        return nil, err
    }

    mp := sdkmetric.NewMeterProvider(
        sdkmetric.WithReader(exporter),
    )
    otel.SetMeterProvider(mp)
    return mp, nil
}

func main() {
    mp, err := initMeterProvider()
    if err != nil {
        log.Fatal(err)
    }
    defer mp.Shutdown(context.Background())

    meter := otel.Meter("myapp")

    // Counter
    requestCounter, _ := meter.Int64Counter(
        "http.server.request.total",
        metric.WithDescription("Total HTTP requests"),
        metric.WithUnit("{request}"),
    )

    // Histogram
    latencyHistogram, _ := meter.Float64Histogram(
        "http.server.request.duration",
        metric.WithDescription("HTTP request latency"),
        metric.WithUnit("s"),
        metric.WithExplicitBucketBoundaries(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
    )

    // UpDownCounter (gauge-like)
    activeConnections, _ := meter.Int64UpDownCounter(
        "http.server.active_connections",
        metric.WithDescription("Active HTTP connections"),
    )

    // Usage
    ctx := context.Background()
    requestCounter.Add(ctx, 1, metric.WithAttributes(
        attribute.String("method", "GET"),
        attribute.String("path", "/api/users"),
        attribute.Int("status_code", 200),
    ))

    start := time.Now()
    // ... handle request ...
    latencyHistogram.Record(ctx, time.Since(start).Seconds())

    activeConnections.Add(ctx, 1)   // connection opened
    activeConnections.Add(ctx, -1)  // connection closed
}

Python SDK Setup

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Prometheus exporter
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

# Start Prometheus HTTP server on port 8000
start_http_server(8000)

meter = metrics.get_meter("myapp")

# Counter
request_counter = meter.create_counter(
    name="http.server.request.total",
    description="Total HTTP requests",
    unit="{request}",
)

# Histogram
latency_histogram = meter.create_histogram(
    name="http.server.request.duration",
    description="HTTP request latency",
    unit="s",
)

# UpDownCounter
active_connections = meter.create_up_down_counter(
    name="http.server.active_connections",
    description="Active HTTP connections",
)

# Usage
request_counter.add(1, {"method": "GET", "path": "/api/users", "status_code": 200})
latency_histogram.record(0.045, {"method": "GET", "path": "/api/users"})
active_connections.add(1)

Node.js SDK Setup

const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { metrics } = require('@opentelemetry/api');

const exporter = new PrometheusExporter({ port: 9464 });
const meterProvider = new MeterProvider({
  readers: [exporter],
});
metrics.setGlobalMeterProvider(meterProvider);

const meter = metrics.getMeter('myapp');

// Counter
const requestCounter = meter.createCounter('http.server.request.total', {
  description: 'Total HTTP requests',
  unit: '{request}',
});

// Histogram
const latencyHistogram = meter.createHistogram('http.server.request.duration', {
  description: 'HTTP request latency',
  unit: 's',
  advice: {
    explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  },
});

// UpDownCounter
const activeConnections = meter.createUpDownCounter('http.server.active_connections', {
  description: 'Active HTTP connections',
});

// Usage
requestCounter.add(1, { method: 'GET', path: '/api/users', status_code: 200 });
latencyHistogram.record(0.045, { method: 'GET', path: '/api/users' });
activeConnections.add(1);

StatsD

Protocol Format

<metric_name>:<value>|<type>|@<sample_rate>|#<tags>

| Type | Code | Example |
|------|------|---------|
| Counter | c | page.views:1\|c |
| Gauge | g | fuel.level:0.5\|g |
| Timer | ms | request.duration:320\|ms |
| Set | s | users.uniques:user123\|s |
| Histogram | h | request.size:512\|h (DogStatsD) |
| Distribution | d | request.duration:320\|d (DogStatsD) |

DogStatsD Extensions (Datadog)

# Counter with tags
http.requests:1|c|#method:GET,path:/api/users,status:200

# Histogram with sample rate
http.request.duration:45.2|h|@0.5|#service:api

# Gauge
system.cpu.usage:72.5|g|#host:web01

# Service check
_sc|myservice.health|0|#env:production|m:Service is healthy
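
The wire format above is plain text, so a client can be very small. The sketch below assembles a line in the DogStatsD-style layout shown in these examples and sends it over UDP. The helper names are hypothetical, and the default address assumes an agent listening on the conventional StatsD port 8125 on localhost.

```python
import socket
from typing import Mapping, Optional, Union

Number = Union[int, float]


def format_statsd(
    name: str,
    value: Number,
    metric_type: str,
    sample_rate: Optional[float] = None,
    tags: Optional[Mapping[str, object]] = None,
) -> str:
    """Build one metric line: <name>:<value>|<type>[|@<rate>][|#k:v,...]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    if tags:
        line += "|#" + ",".join(f"{key}:{val}" for key, val in tags.items())
    return line


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; delivery is not acknowledged."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


print(format_statsd("http.requests", 1, "c",
                    tags={"method": "GET", "path": "/api/users", "status": 200}))
print(format_statsd("http.request.duration", 45.2, "h", 0.5, {"service": "api"}))
```

Because UDP is connectionless, a client like this never blocks the request path, which is part of why the push model suits short-lived jobs.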

When to use StatsD over Prometheus:

  • Existing StatsD infrastructure
  • Simple counter/gauge/timer needs without complex queries
  • Push model required (ephemeral jobs, serverless)
  • Language/framework has StatsD client but no Prometheus client

Custom Metrics Design

Naming Conventions

Follow OpenMetrics/Prometheus naming:

<namespace>_<subsystem>_<name>_<unit>_<suffix>

| Component | Rules | Examples |
|-----------|-------|----------|
| Namespace | Application or domain | myapp, payment, auth |
| Subsystem | Component within app | http, db, cache, queue |
| Name | What is measured | request, connection, query |
| Unit | SI unit (base, not milli/micro) | seconds, bytes, ratio |
| Suffix | Metric type | _total (counter), _info (metadata), _bucket (histogram) |

Good names:

http_server_request_duration_seconds          # histogram
http_server_requests_total                    # counter
db_connection_pool_active_connections         # gauge
cache_hit_ratio                               # gauge (0-1)
queue_messages_total                          # counter
payment_processing_duration_seconds           # histogram

Bad names:

requestCount          # No namespace, no suffix, camelCase
latency_ms            # Milliseconds (use seconds), no namespace
errors                # Vague, no namespace, no suffix
HttpRequests          # PascalCase
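
Conventions like these are easy to check mechanically, for instance in a unit test over your registered metrics. The sketch below is a hypothetical linter covering only three of the rules above (snake_case, the `_total` suffix on counters, base units). It is not part of any Prometheus client library, and the unit list is an illustrative assumption.

```python
import re
from typing import List

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Illustrative, non-exhaustive list of scaled units to flag.
NON_BASE_UNITS = ("_ms", "_milliseconds", "_us", "_microseconds", "_kb", "_mb")


def lint_metric_name(name: str, metric_type: str) -> List[str]:
    """Return a list of convention violations (empty means the name passes)."""
    problems: List[str] = []
    if not SNAKE_CASE.match(name):
        problems.append("not lowercase snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total")
    if any(name.endswith(unit) or f"{unit}_" in name for unit in NON_BASE_UNITS):
        problems.append("use base units (seconds, bytes)")
    return problems


for name, kind in [
    ("http_server_requests_total", "counter"),
    ("requestCount", "counter"),
    ("latency_ms", "histogram"),
]:
    print(name, lint_metric_name(name, kind))
```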

Label Best Practices

Do:

  • Use labels for dimensions you will filter/aggregate by
  • Keep label cardinality bounded (< 100 unique values per label)
  • Use consistent label names across metrics (method, not http_method in some and request_method in others)

Don't:

  • Use user IDs, email addresses, or request IDs as labels (unbounded cardinality)
  • Use full URL paths as labels (use route templates: /api/users/{id}, not /api/users/12345)
  • Use error messages as labels (unbounded text)
  • Create more than 5-7 labels per metric

Avoiding Cardinality Bombs

# BAD: unbounded path label
http_requests_total{path="/api/users/12345"}    # Millions of unique series
http_requests_total{path="/api/users/67890"}

# GOOD: use route template
http_requests_total{route="/api/users/{id}"}    # One series per route

# BAD: error message as label
errors_total{message="connection refused to 10.0.0.5:5432"}

# GOOD: error category as label
errors_total{type="connection_refused", target="postgres"}
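
When the web framework exposes the matched route, use that value for the label. Where it does not, a fallback is to collapse identifier-like path segments before recording the metric. The sketch below is one such heuristic. It assumes identifiers are either all digits or UUIDs, and the function name is hypothetical.

```python
import re

# Segments that look like identifiers: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def to_route_template(path: str) -> str:
    """Replace identifier-like path segments with the placeholder {id}."""
    return "/".join(
        "{id}" if _ID_SEGMENT.match(segment) else segment
        for segment in path.split("/")
    )


print(to_route_template("/api/users/12345"))
print(to_route_template("/api/users/67890"))
print(to_route_template("/api/users"))
```

Both user paths collapse to the same label value, so the number of series grows with the number of routes rather than the number of users.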

Cardinality check query:

# Find high-cardinality metrics
topk(10, count by (__name__) ({__name__=~".+"}))

# Check specific metric cardinality
count(http_requests_total)

SLI / SLO / SLA

Definitions

| Term | Definition | Example |
|------|------------|---------|
| SLI (Service Level Indicator) | Quantitative measure of service behavior | 99.2% of requests complete in < 500ms |
| SLO (Service Level Objective) | Target value for an SLI | 99.5% of requests should complete in < 500ms |
| SLA (Service Level Agreement) | Business contract with consequences | 99.9% availability or credit issued |

Relationship: SLI measures reality → SLO sets the target → SLA defines business consequences.

Error Budget Calculation

Error budget = 1 - SLO target

Example:
  SLO = 99.9% availability
  Error budget = 0.1% = 43.2 minutes/month

  In a 30-day month:
  - Total minutes: 43,200
  - Allowed downtime: 43.2 minutes
  - Allowed error requests: 0.1% of total
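
The arithmetic above generalizes to any target and window. The sketch below reproduces the 43.2-minute figure and adds a request-based view of how much budget is left. Both function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Returns a negative number once the budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no error budget: SLO is 100% or there were no requests")
    return 1.0 - failed_requests / allowed_failures


print(round(error_budget_minutes(0.999), 1))               # 43.2
print(round(budget_remaining(0.999, 1_000_000, 250), 2))   # 0.75
```

In the second call, 1,000,000 requests at a 99.9% target allow 1,000 failures, so 250 failures leave three quarters of the budget.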

Burn Rate Alerting

Burn rate = rate at which error budget is being consumed relative to the budget period.

burn_rate = error_rate / (1 - SLO_target)

| Burn Rate | Budget Exhaustion | Alert? |
|-----------|-------------------|--------|
| 1x | 30 days (full period) | No |
| 2x | 15 days | No |
| 6x | 5 days | Ticket (warning) |
| 14.4x | 2 days | Page (critical) |
| 36x | 20 hours | Page immediately |
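
The exhaustion times in the table follow directly from the formula: a burn rate of N spends a 30-day budget in 30/N days. The sketch below recomputes them and also shows how much budget a sustained burn consumes within an alert window, which is where thresholds such as 14.4x over 1 hour (2% of the budget) come from. Function names are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    return window_days * 24 / rate


def budget_consumed(rate: float, alert_window_hours: float, window_days: int = 30) -> float:
    """Fraction of the budget spent if the burn rate holds for the alert window."""
    return rate * alert_window_hours / (window_days * 24)


for rate in (1, 2, 6, 14.4, 36):
    print(f"{rate}x -> {hours_to_exhaustion(rate):.1f} hours")

print(f"{budget_consumed(14.4, 1):.0%} of budget in 1 hour at 14.4x")
print(f"{budget_consumed(6, 6):.0%} of budget in 6 hours at 6x")
```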

Multi-window burn rate alert (recommended):

# Fast burn: 14.4x burn rate over 1-hour window, confirmed by 5-minute window
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"

# Slow burn: 6x burn rate over 6-hour window, confirmed by 30-minute window
- alert: SLOSlowBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)
    and
    (
      sum(rate(http_requests_total{status_code=~"5.."}[30m]))
      /
      sum(rate(http_requests_total[30m]))
    ) > (6 * 0.001)
  labels:
    severity: warning

SLO Document Template

# SLO: [Service Name] - [SLO Name]

## Overview
- **Service:** payment-api
- **Owner:** payments-team
- **Last reviewed:** 2026-03-01

## SLI Definition
- **Type:** Availability (success rate)
- **Good events:** HTTP responses with status < 500
- **Total events:** All HTTP responses
- **Measurement:** `sum(rate(http_requests_total{status_code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`

## SLO Target
- **Target:** 99.9%
- **Window:** 30 days (rolling)
- **Error budget:** 0.1% = ~43 minutes of downtime

## Alerting
- **Fast burn (page):** 14.4x burn rate for 1 hour
- **Slow burn (ticket):** 6x burn rate for 6 hours

## Consequences of Missing SLO
- Freeze non-critical deployments
- Allocate sprint capacity to reliability
- Review in next SLO review meeting

Alert Routing

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/XXX"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: "default-slack"
  group_by: ["alertname", "job"]
  group_wait: 30s        # Wait before sending first notification
  group_interval: 5m     # Wait before sending updates
  repeat_interval: 4h    # Resend if not resolved

  routes:
    # Team-specific routing. Listed first because the first matching sibling
    # route wins: placed after the severity routes, database alerts that carry
    # a severity label would be caught by those routes instead.
    - match:
        team: database
      receiver: "pagerduty-database"
      routes:
        - match:
            severity: critical
          receiver: "pagerduty-database"
        - match:
            severity: warning
          receiver: "slack-database"

    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts → Slack
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 4h

    # Info alerts → Slack info channel
    - match:
        severity: info
      receiver: "slack-info"
      repeat_interval: 24h

receivers:
  - name: "default-slack"
    slack_configs:
      - channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "<integration-key>"   # Events API v2 key (service_key is for the v1 API)
        severity: critical
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          description: '{{ .CommonAnnotations.description }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts-warning"
        title: ':warning: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

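  - name: "slack-info"
    slack_configs:
      - channel: "#alerts-info"

  # Receivers referenced by the team-specific database route
  # (the key and channel below are placeholders)
  - name: "pagerduty-database"
    pagerduty_configs:
      - routing_key: "<database-integration-key>"

  - name: "slack-database"
    slack_configs:
      - channel: "#alerts-database"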

inhibit_rules:
  # Suppress warning if critical is already firing
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "job"]
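
Alertmanager resolves a receiver by walking the route tree depth-first: among sibling routes the first match wins unless it sets `continue: true`, and a node with no matching children handles the alert itself. The sketch below models only that matching logic, with exact-equality matchers and hypothetical class and function names, to show why sibling order matters.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)
    continue_matching: bool = False  # mirrors `continue: true`


def _matches(route: Route, labels: Dict[str, str]) -> bool:
    return all(labels.get(key) == value for key, value in route.match.items())


def resolve_receivers(route: Route, labels: Dict[str, str]) -> List[str]:
    """Return the receivers that would be notified for an alert's labels."""
    if not _matches(route, labels):
        return []
    found: List[str] = []
    for child in route.routes:
        child_receivers = resolve_receivers(child, labels)
        found.extend(child_receivers)
        if child_receivers and not child.continue_matching:
            break  # first matching sibling wins
    return found or [route.receiver]


severity_route = Route("pagerduty-critical", {"severity": "critical"})
database_route = Route("pagerduty-database", {"team": "database"})
alert = {"alertname": "TargetDown", "team": "database", "severity": "critical"}

severity_first = Route("default-slack", routes=[severity_route, database_route])
team_first = Route("default-slack", routes=[database_route, severity_route])

print(resolve_receivers(severity_first, alert))  # ['pagerduty-critical']
print(resolve_receivers(team_first, alert))      # ['pagerduty-database']
```

The same alert reaches a different team depending only on the order of the two sibling routes, so more specific routes belong above generic severity routes.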

Runbook Template

# Runbook: [Alert Name]

## Alert Details
- **Alert:** HighErrorRate
- **Severity:** Warning / Critical
- **Team:** backend

## Symptom
What the user/system is experiencing when this alert fires.

## Investigation Steps
1. Check the Grafana dashboard: [link]
2. Check recent deployments: `kubectl rollout history deployment/api`
3. Check error logs: `kubectl logs -l app=api --tail=100 | jq 'select(.level=="ERROR")'`
4. Check downstream dependencies: [dashboard link]

## Mitigation
Immediate actions to reduce impact:
1. If caused by recent deploy: `kubectl rollout undo deployment/api`
2. If caused by downstream: Enable circuit breaker / failover
3. If caused by traffic spike: Scale horizontally

## Resolution
Steps to fully resolve:
1. Identify root cause from logs/traces
2. Create fix PR
3. Deploy fix through normal pipeline
4. Verify error rate returns to baseline

## Escalation
- Level 1: On-call engineer (this runbook)
- Level 2: Team lead (@team-lead)
- Level 3: VP Engineering (for customer-impacting incidents)

Uptime Monitoring

Uptime Kuma (Self-hosted)

# docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma-data:/app/data

volumes:
  uptime-kuma-data:

Features:

  • HTTP(s), TCP, DNS, Docker, gRPC, MQTT monitors
  • Status pages (public-facing)
  • Notifications: Slack, Discord, Telegram, PagerDuty, email, webhooks
  • Certificate expiry monitoring
  • Multi-language support

Synthetic Monitoring

Run scripted checks from multiple regions to verify end-to-end functionality:

// Example: Grafana synthetic monitoring check
import { check } from 'k6';
import http from 'k6/http';

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
    'body contains ok': (r) => r.body.includes('"status":"ok"'),
  });
}

Status Pages

Communicate service health to users:

| Tool | Type | Features |
|------|------|----------|
| Uptime Kuma | Self-hosted | Free, built-in status page |
| Better Uptime | SaaS | Incident management + status page |
| Cachet | Self-hosted | PHP-based, mature |
| Instatus | SaaS | Modern, integrations |
| Statuspage (Atlassian) | SaaS | Enterprise, expensive |

Status page best practices:

  • Show individual component status (API, database, CDN, auth)
  • Include historical uptime percentage (30/90 day)
  • Post incident updates promptly (investigating → identified → monitoring → resolved)
  • Subscribe option for email/SMS/RSS notifications