
Infrastructure Monitoring Reference

Comprehensive reference for health checks, infrastructure metrics, APM, cost optimization, capacity planning, and incident response.


Health Checks

Types of Health Checks

| Type | Question It Answers | Failure Action |
|------|---------------------|----------------|
| Liveness | Is the process alive and not deadlocked? | Restart the process |
| Readiness | Can this instance serve traffic? | Remove from load balancer |
| Startup | Has the process finished initializing? | Wait (don't restart yet) |

Implementation Patterns

Basic Health Check Endpoint

// Go
type HealthStatus struct {
    Status    string            `json:"status"`
    Timestamp string            `json:"timestamp"`
    Version   string            `json:"version"`
    Checks    map[string]Check  `json:"checks"`
}

type Check struct {
    Status  string `json:"status"`
    Message string `json:"message,omitempty"`
    Latency string `json:"latency,omitempty"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
    health := HealthStatus{
        Status:    "ok",
        Timestamp: time.Now().UTC().Format(time.RFC3339),
        Version:   version,
        Checks:    make(map[string]Check),
    }

    // Check database
    start := time.Now()
    if err := db.PingContext(r.Context()); err != nil {
        health.Status = "degraded"
        health.Checks["database"] = Check{
            Status:  "fail",
            Message: err.Error(),
        }
    } else {
        health.Checks["database"] = Check{
            Status:  "ok",
            Latency: time.Since(start).String(),
        }
    }

    // Check Redis
    start = time.Now()
    if err := redis.Ping(r.Context()).Err(); err != nil {
        health.Status = "degraded"
        health.Checks["redis"] = Check{
            Status:  "fail",
            Message: err.Error(),
        }
    } else {
        health.Checks["redis"] = Check{
            Status:  "ok",
            Latency: time.Since(start).String(),
        }
    }

    statusCode := http.StatusOK
    if health.Status != "ok" {
        statusCode = http.StatusServiceUnavailable
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(statusCode)
    json.NewEncoder(w).Encode(health)
}

# Python (FastAPI)
import json

from fastapi import FastAPI, Response
from datetime import datetime, timezone

app = FastAPI()

@app.get("/health")
async def health_check():
    checks = {}
    status = "ok"

    # Database check
    try:
        start = datetime.now(timezone.utc)
        await db.execute("SELECT 1")
        checks["database"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["database"] = {"status": "fail", "message": str(e)}

    # Redis check
    try:
        start = datetime.now(timezone.utc)
        await redis.ping()
        checks["redis"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["redis"] = {"status": "fail", "message": str(e)}

    response_code = 200 if status == "ok" else 503
    return Response(
        content=json.dumps({
            "status": status,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "checks": checks,
        }),
        status_code=response_code,
        media_type="application/json",
    )

@app.get("/ready")
async def readiness_check():
    """Readiness: can we serve traffic?"""
    try:
        await db.execute("SELECT 1")
        return {"status": "ready"}
    except Exception:
        return Response(
            content='{"status": "not_ready"}',
            status_code=503,
            media_type="application/json",
        )

@app.get("/live")
async def liveness_check():
    """Liveness: is the process alive?"""
    return {"status": "alive"}

Health Check Response Format

{
  "status": "ok",
  "timestamp": "2026-03-09T14:32:01Z",
  "version": "1.4.2",
  "checks": {
    "database": {
      "status": "ok",
      "latency_ms": 2.3
    },
    "redis": {
      "status": "ok",
      "latency_ms": 0.8
    },
    "external_api": {
      "status": "degraded",
      "message": "Elevated latency",
      "latency_ms": 850
    }
  }
}

Liveness vs Readiness Decision Guide

Is the process able to make progress?
├─ No (deadlocked, OOM, infinite loop)
│  └─ Liveness check should FAIL → container gets restarted
│
└─ Yes, but...
   ├─ Database is temporarily unreachable
   │  └─ Readiness FAIL, Liveness PASS → stop sending traffic, don't restart
   │
   ├─ Still loading initial data/cache
   │  └─ Startup FAIL → don't check liveness yet, wait
   │
   └─ Everything is fine
      └─ All checks PASS → serve traffic normally

Common mistake: Making liveness depend on external dependencies (database, Redis). If the database is down, restarting the application won't help — it will cause a restart storm.


Kubernetes Probes

Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api-server:1.4.2
          ports:
            - containerPort: 8080

          # Startup probe: runs first, disables liveness/readiness until passing
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30     # 30 * 5s = 150s max startup time
            successThreshold: 1

          # Liveness probe: is the process alive?
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 0    # Starts after startup probe passes
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3       # 3 consecutive failures → restart
            successThreshold: 1

          # Readiness probe: can it serve traffic?
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3       # 3 failures → remove from Service
            successThreshold: 1

          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

Probe Types

HTTP GET

livenessProbe:
  httpGet:
    path: /health
    port: 8080
    httpHeaders:
      - name: Authorization
        value: Bearer internal-token

TCP Socket

# For services that don't have HTTP (databases, message brokers)
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10

Exec Command

# Run a command inside the container
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -U postgres
  periodSeconds: 10

gRPC Health Check

# gRPC health checking protocol
livenessProbe:
  grpc:
    port: 50051
    service: ""   # Empty string checks overall server health
  periodSeconds: 10

Probe Configuration Guidelines

| Parameter | Liveness | Readiness | Startup |
|-----------|----------|-----------|---------|
| initialDelaySeconds | 0 (use startup probe) | 0 | 5-10 |
| periodSeconds | 10-15 | 5-10 | 5 |
| timeoutSeconds | 3-5 | 3-5 | 3-5 |
| failureThreshold | 3 | 3 | 30 (generous) |
| successThreshold | 1 | 1-2 | 1 |
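These settings translate directly into detection latency. A quick sketch of the worst-case time from first failure to probe action, using the guideline values above (the function name is illustrative):

```python
# Worst-case probe timing sketch: each failed attempt can take up to
# timeout_s, and attempts are period_s apart, so the Nth consecutive
# failure is observed roughly (N - 1) * period_s + timeout_s after the
# first failed attempt.
def time_to_action(period_s: float, timeout_s: float, failure_threshold: int) -> float:
    """Approximate worst-case seconds from first failure to probe action
    (container restart for liveness, Service removal for readiness)."""
    return (failure_threshold - 1) * period_s + timeout_s

# Liveness with periodSeconds=10, timeoutSeconds=3, failureThreshold=3:
time_to_action(10, 3, 3)  # → 23.0 seconds until restart
# Readiness with periodSeconds=5, timeoutSeconds=3, failureThreshold=3:
time_to_action(5, 3, 3)   # → 13.0 seconds until removed from the Service
```

This is why readiness probes use a shorter period: removing a broken instance from rotation should happen faster than restarting it.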

Docker HEALTHCHECK

# Dockerfile
FROM node:20-slim

HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1

# Or with wget (no curl in alpine)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

docker-compose Health Check

services:
  api:
    image: api-server:1.4.2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s

  worker:
    image: worker:1.2.0
    depends_on:
      api:
        condition: service_healthy
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

Health Check Parameters

| Parameter | Description | Default | Recommendation |
|-----------|-------------|---------|----------------|
| interval | Time between checks | 30s | 15-30s for critical services |
| timeout | Max time for check | 30s | 3-5s (fail fast) |
| retries | Failures before unhealthy | 3 | 3 (avoid flapping) |
| start_period | Grace period for startup | 0s | Set to max startup time |

Uptime Monitoring

Uptime Kuma Setup

# docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"

volumes:
  uptime-kuma-data:

Monitor types supported:

  • HTTP(s) — status code, keyword, response time
  • TCP — port open check
  • DNS — resolution check
  • Docker container — running status
  • gRPC — health check protocol
  • MQTT — broker connectivity
  • Ping (ICMP) — network reachability
  • Push — heartbeat endpoint (service pushes to Uptime Kuma)
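For push monitors, the direction is inverted: the service calls Uptime Kuma on a per-monitor URL, and an alert fires if the heartbeat stops arriving. A minimal cron-driven sketch, where the host and token are placeholders for your own instance's values:

```shell
#!/bin/sh
# Hypothetical values — substitute your Uptime Kuma host and push token.
KUMA_URL="https://status.example.com/api/push/yourPushToken"

# Send a heartbeat; Uptime Kuma alerts if none arrives within the
# monitor's grace period. Schedule from cron, e.g.:
#   * * * * * /usr/local/bin/heartbeat.sh
curl -fsS "${KUMA_URL}?status=up&msg=OK" > /dev/null || echo "heartbeat push failed" >&2
```

Push monitors are useful for batch jobs and workers behind NAT, which can't be probed from outside.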

Synthetic Monitoring

Scripted checks that simulate real user behavior from multiple regions:

// k6 script for synthetic monitoring
import { check, sleep } from 'k6';
import http from 'k6/http';

export const options = {
  scenarios: {
    synthetic: {
      executor: 'constant-vus',
      vus: 1,
      duration: '24h',
      gracefulStop: '0s',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'],    // 95% under 500ms
    http_req_failed: ['rate<0.01'],       // < 1% failure rate
    checks: ['rate>0.99'],                // 99% checks pass
  },
};

export default function () {
  // Check homepage
  let res = http.get('https://www.example.com');
  check(res, {
    'homepage status 200': (r) => r.status === 200,
    'homepage loads fast': (r) => r.timings.duration < 500,
    'homepage has title': (r) => r.body.includes('<title>'),
  });

  // Check API health
  res = http.get('https://api.example.com/health');
  check(res, {
    'api health 200': (r) => r.status === 200,
    'api reports ok': (r) => JSON.parse(r.body).status === 'ok',
  });

  // Check login flow
  res = http.post('https://api.example.com/auth/login', JSON.stringify({
    email: 'synthetic-user@example.com',
    password: process.env.SYNTHETIC_PASSWORD,
  }), { headers: { 'Content-Type': 'application/json' } });
  check(res, {
    'login succeeds': (r) => r.status === 200,
    'login returns token': (r) => JSON.parse(r.body).token !== undefined,
  });

  sleep(60); // Check every 60 seconds
}

Multi-Region Monitoring

| Provider | Regions | Free Tier | Notes |
|----------|---------|-----------|-------|
| Uptime Kuma | Self-hosted (1 region) | Free | Deploy in multiple regions yourself |
| Betteruptime | 10+ regions | 5 monitors | Status page included |
| Grafana Synthetic | 20+ regions | Part of Grafana Cloud | k6-based scripts |
| Datadog Synthetic | 100+ locations | 100 API tests/month | Full browser testing |
| AWS CloudWatch Synthetics | All AWS regions | Pay per run | Canary scripts |

Infrastructure Metrics

CPU Metrics

| Metric | Source | What It Shows |
|--------|--------|---------------|
| node_cpu_seconds_total{mode="user"} | node_exporter | Time in user space |
| node_cpu_seconds_total{mode="system"} | node_exporter | Time in kernel space |
| node_cpu_seconds_total{mode="iowait"} | node_exporter | Time waiting for I/O |
| node_cpu_seconds_total{mode="idle"} | node_exporter | Idle time |
| node_load1 / node_load5 / node_load15 | node_exporter | Load average (1/5/15 min) |

Common queries:

# CPU usage percentage (all modes except idle)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# CPU usage by mode
sum by (mode) (rate(node_cpu_seconds_total{instance="web01:9100"}[5m]))

# IO wait percentage (high = disk bottleneck)
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))

# Load average relative to CPU count (sustained > 1 per CPU = saturation)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

Memory Metrics

| Metric | What It Shows |
|--------|---------------|
| node_memory_MemTotal_bytes | Total physical memory |
| node_memory_MemAvailable_bytes | Memory available for applications |
| node_memory_Cached_bytes | Page cache (reclaimable) |
| node_memory_Buffers_bytes | Buffer cache |
| node_memory_SwapTotal_bytes | Total swap |
| node_memory_SwapFree_bytes | Free swap |

# Memory usage percentage
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Memory breakdown
node_memory_MemTotal_bytes
  - node_memory_MemAvailable_bytes
  - node_memory_Cached_bytes
  - node_memory_Buffers_bytes

# Swap usage (any swap usage may indicate memory pressure)
1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)

Disk Metrics

# Disk usage percentage
1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)

# Disk I/O utilization (percentage of time doing I/O)
rate(node_disk_io_time_seconds_total[5m])

# Read/write throughput
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])

# Average I/O latency
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])

Network Metrics

# Bandwidth (bytes/sec)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab         # Current established connections
rate(node_netstat_Tcp_ActiveOpens[5m])  # New outbound connections/sec
rate(node_netstat_Tcp_PassiveOpens[5m]) # New inbound connections/sec

Container Metrics

cAdvisor Metrics

| Metric | Description |
|--------|-------------|
| container_cpu_usage_seconds_total | Total CPU time consumed |
| container_cpu_cfs_throttled_periods_total | CPU throttling events |
| container_memory_working_set_bytes | Current memory (excludes cache) |
| container_memory_usage_bytes | Total memory (includes cache) |
| container_network_receive_bytes_total | Network inbound bytes |
| container_network_transmit_bytes_total | Network outbound bytes |
| container_fs_usage_bytes | Container filesystem usage |
| container_spec_memory_limit_bytes | Memory limit |
| container_spec_cpu_quota | CPU quota |

# Container CPU usage percentage (of limit)
sum by (container, pod) (
  rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
) / sum by (container, pod) (
  container_spec_cpu_quota / container_spec_cpu_period
)

# Container memory usage percentage (of limit)
container_memory_working_set_bytes{container!="POD",container!=""}
/
container_spec_memory_limit_bytes{container!="POD",container!=""} > 0

# CPU throttling percentage
sum by (container, pod) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
) / sum by (container, pod) (
  rate(container_cpu_cfs_periods_total[5m])
)

# OOMKill detection
increase(kube_pod_container_status_restarts_total[1h]) > 0
and
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

Kubernetes Metrics (kube-state-metrics)

# Pod status
kube_pod_status_phase{phase="Running"}
kube_pod_status_phase{phase="Pending"}
kube_pod_status_phase{phase="Failed"}

# Deployment replicas
kube_deployment_status_replicas_available
kube_deployment_spec_replicas

# HPA status
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas
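These series can back a simple availability alert. A sketch of a Prometheus rule that fires when a Deployment runs below its desired replica count; the group name, duration, and severity label are assumptions to adapt:

```yaml
groups:
  - name: deployment-health
    rules:
      - alert: DeploymentReplicasMismatch
        expr: |
          kube_deployment_status_replicas_available
            < kube_deployment_spec_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.deployment }} has fewer available replicas than desired"
```

The `for: 10m` delay avoids paging on routine rolling updates, which briefly reduce available replicas by design.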

Node Exporter

Setup

# docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:v1.7.0
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

Kubernetes DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      tolerations:
        - effect: NoSchedule
          operator: Exists

APM Tools

Comparison

| Feature | Datadog APM | New Relic | Elastic APM | Sentry |
|---------|-------------|-----------|-------------|--------|
| Type | Full APM | Full APM | Full APM | Error tracking + perf |
| Pricing | Per host ($31+/mo) | Per user + data | Free (self-host) or Cloud | Per event volume |
| Traces | Yes | Yes | Yes | Transaction traces |
| Error tracking | Yes | Yes | Yes | Excellent |
| Profiling | Yes (continuous) | Yes | No | No |
| Log correlation | Yes | Yes | Yes | Breadcrumbs |
| Dashboards | Built-in | Built-in | Kibana | Limited |
| Setup | Agent-based | Agent-based | Agent or OTel | SDK-based |
| Best for | Enterprise, full stack | Full observability | Self-hosted, ELK users | Error-focused teams |

Sentry Error Tracking

# Python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    traces_sample_rate=0.1,  # 10% of transactions
    profiles_sample_rate=0.1,
    environment="production",
    release="1.4.2",
    integrations=[FastApiIntegration()],
)

// Node.js
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: 'https://key@sentry.io/project',
  tracesSampleRate: 0.1,
  environment: 'production',
  release: '1.4.2',
});

// Go
import "github.com/getsentry/sentry-go"

sentry.Init(sentry.ClientOptions{
    Dsn:              "https://key@sentry.io/project",
    TracesSampleRate: 0.1,
    Environment:      "production",
    Release:          "1.4.2",
})
defer sentry.Flush(2 * time.Second)

Cost Optimization

Metric Cardinality Review

High cardinality is the most common cost driver in metrics systems:

# Find metrics with the most time series
topk(20, count by (__name__) ({__name__=~".+"}))

# Find labels with high cardinality
count(group by (path) (http_requests_total))   # How many unique paths?
count(group by (user_id) (api_calls_total))    # Unbounded!

Reduction strategies:

  1. Remove unused metrics (if nobody dashboards/alerts on it, drop it)
  2. Replace high-cardinality labels with bounded categories
  3. Use recording rules to pre-aggregate, drop raw metrics
  4. Use metric relabeling in Prometheus to drop at scrape time

# Drop unused metrics at scrape time
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_.*"           # Drop Go runtime metrics if unused
    action: drop
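Strategy 2 can also happen in application code, before the label ever reaches Prometheus. A hypothetical path normalizer that collapses unbounded IDs into bounded route templates; the patterns are examples, not a complete set:

```python
import re

# Collapse unbounded path segments into bounded route templates before
# using the path as a metric label. Each raw path value would otherwise
# create its own time series.
_PATTERNS = [
    (re.compile(r"/users/\d+"), "/users/:id"),
    (re.compile(r"/orders/[0-9a-f-]{36}"), "/orders/:uuid"),
]

def normalize_path(path: str) -> str:
    """Return a bounded route template for a raw request path."""
    for pattern, template in _PATTERNS:
        path = pattern.sub(template, path)
    return path

normalize_path("/users/12345")  # → "/users/:id"
```

Frameworks that expose the matched route template (rather than the raw URL) give you this for free; use the template as the label whenever it's available.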

Log Volume Reduction

| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Set production to INFO | 50-80% | Logger config |
| Sample health check logs | 90% for /health | Middleware filter |
| Truncate large payloads | 20-40% | Body size limit (4KB) |
| Drop duplicate errors | 30-50% | Rate-limit per error type |
| Compress in transit | 60-80% bandwidth | Enable gzip on log shipper |
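The health-check sampling row can be sketched as a logging filter. A hypothetical Python `logging.Filter` that keeps roughly 1 in 10 probe-endpoint access lines and passes everything else through; the 10% rate and the substring match are assumptions:

```python
import logging
import random

class HealthCheckSampler(logging.Filter):
    """Drop most access-log lines for probe endpoints, keep the rest."""

    def __init__(self, keep_rate: float = 0.1):
        super().__init__()
        self.keep_rate = keep_rate

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        # Probe endpoints are high-volume and low-information: sample them.
        if "/health" in message or "/live" in message or "/ready" in message:
            return random.random() < self.keep_rate
        return True  # everything else passes through unchanged

access_logger = logging.getLogger("access")
access_logger.addFilter(HealthCheckSampler(keep_rate=0.1))
```

Keeping a sample (rather than dropping everything) preserves evidence that the probes were actually running when you investigate an incident later.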

Trace Sampling

| Sampling Rate | Monthly Cost (est.) | Suitability |
|---------------|---------------------|-------------|
| 100% | $$$$ | Development, < 100 req/s |
| 10% | $$$ | Staging, medium traffic |
| 1% | $$ | Production, high traffic |
| Tail-based (errors + slow) | $$ | Production (recommended) |
| 0.1% | $ | Very high traffic (> 100k req/s) |
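Tail-based sampling needs a component that sees the whole trace before deciding. A sketch of an OpenTelemetry Collector `tail_sampling` processor that keeps all errors, all slow traces, and a 1% baseline; the thresholds and policy names are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```

Policies are OR-ed: a trace is kept if any policy matches, so the interesting traces survive while the healthy fast majority is sampled down.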

Retention Tiers

| Tier | Metrics | Logs | Traces |
|------|---------|------|--------|
| Hot (0-14 days) | 15s resolution | Full fidelity | All sampled traces |
| Warm (14-90 days) | 1m resolution | Full fidelity | Error + slow traces only |
| Cold (90 days - 1 year) | 5m resolution | Compressed | None (rely on metrics) |
| Archive (1-7 years) | 1h resolution | Compliance logs only | None |

Capacity Planning

Load Testing Correlation

Run load tests while monitoring infrastructure metrics to establish scaling thresholds:

Load Test Results:
┌─────────┬──────────┬────────┬─────────┬──────────────┐
│ RPS     │ p99 (ms) │ CPU %  │ Mem %   │ Error Rate   │
├─────────┼──────────┼────────┼─────────┼──────────────┤
│ 100     │ 45       │ 15     │ 30      │ 0%           │
│ 500     │ 85       │ 35     │ 45      │ 0%           │
│ 1000    │ 150      │ 55     │ 55      │ 0%           │
│ 2000    │ 320      │ 75     │ 65      │ 0.1%         │
│ 3000    │ 850      │ 90     │ 72      │ 1.5%         │  ← degradation
│ 4000    │ 2500     │ 98     │ 78      │ 12%          │  ← failure
└─────────┴──────────┴────────┴─────────┴──────────────┘

Scaling trigger: 75% CPU → add instance
Target capacity: 2x expected peak traffic
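The table above implies a safe per-instance capacity of about 2000 RPS (the 75% CPU point, before degradation). A small sketch of turning that into an instance count with the 2x headroom target; the function name and defaults are assumptions:

```python
import math

def instances_needed(expected_peak_rps: float,
                     safe_rps_per_instance: float = 2000,
                     headroom: float = 2.0) -> int:
    """Instances required to serve `headroom` x the expected peak traffic,
    where each instance is safe up to `safe_rps_per_instance`."""
    return math.ceil(expected_peak_rps * headroom / safe_rps_per_instance)

instances_needed(3000)  # → 3 (6000 RPS target / 2000 RPS per instance)
```

Re-derive `safe_rps_per_instance` after any significant code or hardware change; yesterday's load test is the only number this math is as good as.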

Scaling Triggers

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Scale up at 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"       # Scale at 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50                  # Max 50% increase per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120

Resource Forecasting

# Predict disk full in N hours
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0
# "Disk will be full within 30 days"

# Predict memory usage trend
predict_linear(
  avg_over_time(container_memory_working_set_bytes[7d]),
  30*24*3600
)

# Growth rate of database size (a gauge, so use deriv rather than rate)
deriv(pg_database_size_bytes[7d])
# Convert to "GB per month"
deriv(pg_database_size_bytes[7d]) * 86400 * 30 / 1e9

Incident Response

Incident Lifecycle

Detection → Triage → Mitigate → Resolve → Postmortem
    │          │         │          │          │
    │          │         │          │          └─ Blameless review
    │          │         │          └─ Root cause fix deployed
    │          │         └─ User impact reduced/eliminated
    │          └─ Severity assigned, team engaged
    └─ Alert fires or user reports issue

Severity Classification

| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|----------|
| SEV1 (Critical) | Service down, data loss, security breach | < 15 minutes | Complete outage, payment processing failure |
| SEV2 (Major) | Significant degradation, partial outage | < 30 minutes | One region down, 50%+ error rate |
| SEV3 (Minor) | Limited impact, workaround exists | < 4 hours | Single feature broken, elevated latency |
| SEV4 (Low) | Minimal impact, cosmetic | Next business day | UI glitch, non-critical alert firing |

Incident Commander Checklist

## Initial Response (first 15 minutes)
- [ ] Acknowledge the alert / report
- [ ] Assess severity (SEV1-4)
- [ ] Open incident channel (#inc-YYYYMMDD-description)
- [ ] Page relevant team members
- [ ] Post initial status update

## Triage (15-30 minutes)
- [ ] Identify affected services and scope
- [ ] Check recent deployments: any changes in last 2 hours?
- [ ] Check dashboards for anomalies
- [ ] Check external dependencies (status pages)
- [ ] Determine if rollback is feasible

## Mitigation
- [ ] Implement immediate fix (rollback, feature flag, scaling)
- [ ] Verify user impact is reduced
- [ ] Update status page
- [ ] Communicate ETA for full resolution

## Resolution
- [ ] Confirm root cause
- [ ] Deploy fix
- [ ] Verify metrics return to baseline
- [ ] Clear incident status
- [ ] Schedule postmortem within 48 hours

Postmortem Template

# Incident Postmortem: [TITLE]

**Date:** 2026-03-09
**Duration:** 45 minutes (14:15 - 15:00 UTC)
**Severity:** SEV2
**Author:** [Name]
**Status:** Complete

## Summary
One-paragraph description of what happened and impact.

## Impact
- Users affected: ~5,000
- Revenue impact: ~$2,500
- Error budget consumed: 45 minutes against the monthly 43-minute (99.9%) budget; budget exhausted for the month

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:12 | Deploy v1.4.3 to production |
| 14:15 | Error rate alert fires (5% → 15%) |
| 14:17 | On-call acknowledges, starts investigation |
| 14:22 | Root cause identified: new query missing index |
| 14:25 | Decision: rollback v1.4.3 |
| 14:30 | Rollback complete |
| 14:35 | Error rate returns to baseline |
| 15:00 | All-clear declared |

## Root Cause
The v1.4.3 deployment added a new API endpoint that queried the orders
table without an index on `user_id + created_at`. Under load, this caused
connection pool exhaustion, which cascaded to other endpoints.

## Detection
Alert fired 3 minutes after deploy. Detection was effective.

## Contributing Factors
1. No load test for the new endpoint
2. Missing index not caught in code review
3. No query performance checks in CI

## Action Items
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Add index on orders(user_id, created_at) | @backend | 2026-03-10 | Done |
| Add slow query detection to CI pipeline | @platform | 2026-03-15 | TODO |
| Add load test for new endpoints to deploy checklist | @backend | 2026-03-12 | TODO |
| Set up query performance alerting (> 100ms avg) | @sre | 2026-03-14 | TODO |

## Lessons Learned
- What went well: Fast detection (3 min), fast rollback (8 min)
- What went poorly: No pre-production load test caught the issue
- Where we got lucky: Happened during business hours, not at 3 AM

Communication During Incidents

| Audience | Channel | Frequency | Content |
|----------|---------|-----------|---------|
| Engineering | Slack #incident | Real-time | Technical details, commands run |
| Management | Slack #incidents-summary | Every 15-30 min | Impact, ETA, escalation needs |
| Customers | Status page | Every 15-30 min | User-facing impact, workarounds |
| Support | Slack #support-escalation | On status change | Scripted responses, known workarounds |