
Infrastructure Monitoring Reference

Comprehensive reference for health checks, infrastructure metrics, APM, cost optimization, capacity planning, and incident response.


Health Checks

Types of Health Checks

| Type | Question It Answers | Failure Action |
|------|---------------------|----------------|
| Liveness | Is the process alive and not deadlocked? | Restart the process |
| Readiness | Can this instance serve traffic? | Remove from load balancer |
| Startup | Has the process finished initializing? | Wait (don't restart yet) |

Implementation Patterns

Basic Health Check Endpoint

// Go
type HealthStatus struct {
    Status    string            `json:"status"`
    Timestamp string            `json:"timestamp"`
    Version   string            `json:"version"`
    Checks    map[string]Check  `json:"checks"`
}

type Check struct {
    Status  string `json:"status"`
    Message string `json:"message,omitempty"`
    Latency string `json:"latency,omitempty"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
    health := HealthStatus{
        Status:    "ok",
        Timestamp: time.Now().UTC().Format(time.RFC3339),
        Version:   version,
        Checks:    make(map[string]Check),
    }

    // Check database
    start := time.Now()
    if err := db.PingContext(r.Context()); err != nil {
        health.Status = "degraded"
        health.Checks["database"] = Check{
            Status:  "fail",
            Message: err.Error(),
        }
    } else {
        health.Checks["database"] = Check{
            Status:  "ok",
            Latency: time.Since(start).String(),
        }
    }

    // Check Redis
    start = time.Now()
    if err := redis.Ping(r.Context()).Err(); err != nil {
        health.Status = "degraded"
        health.Checks["redis"] = Check{
            Status:  "fail",
            Message: err.Error(),
        }
    } else {
        health.Checks["redis"] = Check{
            Status:  "ok",
            Latency: time.Since(start).String(),
        }
    }

    statusCode := http.StatusOK
    if health.Status != "ok" {
        statusCode = http.StatusServiceUnavailable
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(statusCode)
    json.NewEncoder(w).Encode(health)
}

# Python (FastAPI)
import json

from fastapi import FastAPI, Response
from datetime import datetime, timezone

app = FastAPI()

@app.get("/health")
async def health_check():
    checks = {}
    status = "ok"

    # Database check
    try:
        start = datetime.now(timezone.utc)
        await db.execute("SELECT 1")
        checks["database"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["database"] = {"status": "fail", "message": str(e)}

    # Redis check
    try:
        start = datetime.now(timezone.utc)
        await redis.ping()
        checks["redis"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["redis"] = {"status": "fail", "message": str(e)}

    response_code = 200 if status == "ok" else 503
    return Response(
        content=json.dumps({
            "status": status,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "checks": checks,
        }),
        status_code=response_code,
        media_type="application/json",
    )

@app.get("/ready")
async def readiness_check():
    """Readiness: can we serve traffic?"""
    try:
        await db.execute("SELECT 1")
        return {"status": "ready"}
    except Exception:
        return Response(
            content='{"status": "not_ready"}',
            status_code=503,
            media_type="application/json",
        )

@app.get("/live")
async def liveness_check():
    """Liveness: is the process alive?"""
    return {"status": "alive"}

Health Check Response Format

{
  "status": "ok",
  "timestamp": "2026-03-09T14:32:01Z",
  "version": "1.4.2",
  "checks": {
    "database": {
      "status": "ok",
      "latency_ms": 2.3
    },
    "redis": {
      "status": "ok",
      "latency_ms": 0.8
    },
    "external_api": {
      "status": "degraded",
      "message": "Elevated latency",
      "latency_ms": 850
    }
  }
}

Liveness vs Readiness Decision Guide

Is the process able to make progress?
├─ No (deadlocked, OOM, infinite loop)
│  └─ Liveness check should FAIL → container gets restarted
│
└─ Yes, but...
   ├─ Database is temporarily unreachable
   │  └─ Readiness FAIL, Liveness PASS → stop sending traffic, don't restart
   │
   ├─ Still loading initial data/cache
   │  └─ Startup FAIL → don't check liveness yet, wait
   │
   └─ Everything is fine
      └─ All checks PASS → serve traffic normally

Common mistake: Making liveness depend on external dependencies (database, Redis). If the database is down, restarting the application won't help — it will cause a restart storm.


Kubernetes Probes

Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api-server:1.4.2
          ports:
            - containerPort: 8080

          # Startup probe: runs first, disables liveness/readiness until passing
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30     # 30 * 5s = 150s max startup time
            successThreshold: 1

          # Liveness probe: is the process alive?
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 0    # Starts after startup probe passes
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3       # 3 consecutive failures → restart
            successThreshold: 1

          # Readiness probe: can it serve traffic?
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3       # 3 failures → remove from Service
            successThreshold: 1

          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

Probe Types

HTTP GET

livenessProbe:
  httpGet:
    path: /health
    port: 8080
    httpHeaders:
      - name: Authorization
        value: Bearer internal-token

TCP Socket

# For services that don't have HTTP (databases, message brokers)
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10

Exec Command

# Run a command inside the container
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -U postgres
  periodSeconds: 10

gRPC Health Check

# gRPC health checking protocol
livenessProbe:
  grpc:
    port: 50051
    service: ""   # Empty string checks overall server health
  periodSeconds: 10

Probe Configuration Guidelines

| Parameter | Liveness | Readiness | Startup |
|-----------|----------|-----------|---------|
| initialDelaySeconds | 0 (use startup probe) | 0 | 5-10 |
| periodSeconds | 10-15 | 5-10 | 5 |
| timeoutSeconds | 3-5 | 3-5 | 3-5 |
| failureThreshold | 3 | 3 | 30 (generous) |
| successThreshold | 1 | 1-2 | 1 |
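These settings translate directly into detection latency. A quick sketch of the worst-case time from first failure to probe action, using the guideline values above (the function name is illustrative):

```python
# Worst-case probe timing sketch: each failed attempt can take up to
# timeout_s, and attempts are period_s apart, so the Nth consecutive
# failure is observed roughly (N - 1) * period_s + timeout_s after the
# first failed attempt.
def time_to_action(period_s: float, timeout_s: float, failure_threshold: int) -> float:
    """Approximate worst-case seconds from first failure to probe action
    (container restart for liveness, Service removal for readiness)."""
    return (failure_threshold - 1) * period_s + timeout_s

# Liveness with periodSeconds=10, timeoutSeconds=3, failureThreshold=3:
time_to_action(10, 3, 3)  # → 23.0 seconds until restart
# Readiness with periodSeconds=5, timeoutSeconds=3, failureThreshold=3:
time_to_action(5, 3, 3)   # → 13.0 seconds until removed from the Service
```

This is why readiness probes use a shorter period: removing a broken instance from rotation should happen faster than restarting it.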

Docker HEALTHCHECK

# Dockerfile
FROM node:20-slim

HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1

# Or with wget (no curl in alpine)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

docker-compose Health Check

services:
  api:
    image: api-server:1.4.2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s

  worker:
    image: worker:1.2.0
    depends_on:
      api:
        condition: service_healthy
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

Health Check Parameters

| Parameter | Description | Default | Recommendation |
|-----------|-------------|---------|----------------|
| interval | Time between checks | 30s | 15-30s for critical services |
| timeout | Max time for check | 30s | 3-5s (fail fast) |
| retries | Failures before unhealthy | 3 | 3 (avoid flapping) |
| start_period | Grace period for startup | 0s | Set to max startup time |

Uptime Monitoring

Uptime Kuma Setup

# docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"

volumes:
  uptime-kuma-data:

Monitor types supported:

  • HTTP(s) — status code, keyword, response time
  • TCP — port open check
  • DNS — resolution check
  • Docker container — running status
  • gRPC — health check protocol
  • MQTT — broker connectivity
  • Ping (ICMP) — network reachability
  • Push — heartbeat endpoint (service pushes to Uptime Kuma)
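For push monitors, the direction is inverted: the service calls Uptime Kuma on a per-monitor URL, and an alert fires if the heartbeat stops arriving. A minimal cron-driven sketch, where the host and token are placeholders for your own instance's values:

```shell
#!/bin/sh
# Hypothetical values — substitute your Uptime Kuma host and push token.
KUMA_URL="https://status.example.com/api/push/yourPushToken"

# Send a heartbeat; Uptime Kuma alerts if none arrives within the
# monitor's grace period. Schedule from cron, e.g.:
#   * * * * * /usr/local/bin/heartbeat.sh
curl -fsS "${KUMA_URL}?status=up&msg=OK" > /dev/null || echo "heartbeat push failed" >&2
```

Push monitors are useful for batch jobs and workers behind NAT, which can't be probed from outside.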

Synthetic Monitoring

Scripted checks that simulate real user behavior from multiple regions:

// k6 script for synthetic monitoring
import { check, sleep } from 'k6';
import http from 'k6/http';

export const options = {
  scenarios: {
    synthetic: {
      executor: 'constant-vus',
      vus: 1,
      duration: '24h',
      gracefulStop: '0s',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'],    // 95% under 500ms
    http_req_failed: ['rate<0.01'],       // < 1% failure rate
    checks: ['rate>0.99'],                // 99% checks pass
  },
};

export default function () {
  // Check homepage
  let res = http.get('https://www.example.com');
  check(res, {
    'homepage status 200': (r) => r.status === 200,
    'homepage loads fast': (r) => r.timings.duration < 500,
    'homepage has title': (r) => r.body.includes('<title>'),
  });

  // Check API health
  res = http.get('https://api.example.com/health');
  check(res, {
    'api health 200': (r) => r.status === 200,
    'api reports ok': (r) => JSON.parse(r.body).status === 'ok',
  });

  // Check login flow
  res = http.post('https://api.example.com/auth/login', JSON.stringify({
    email: 'synthetic-user@example.com',
    password: process.env.SYNTHETIC_PASSWORD,
  }), { headers: { 'Content-Type': 'application/json' } });
  check(res, {
    'login succeeds': (r) => r.status === 200,
    'login returns token': (r) => JSON.parse(r.body).token !== undefined,
  });

  sleep(60); // Check every 60 seconds
}

Multi-Region Monitoring

| Provider | Regions | Free Tier | Notes |
|----------|---------|-----------|-------|
| Uptime Kuma | Self-hosted (1 region) | Free | Deploy in multiple regions yourself |
| Betteruptime | 10+ regions | 5 monitors | Status page included |
| Grafana Synthetic | 20+ regions | Part of Grafana Cloud | k6-based scripts |
| Datadog Synthetic | 100+ locations | 100 API tests/month | Full browser testing |
| AWS CloudWatch Synthetics | All AWS regions | Pay per run | Canary scripts |

Infrastructure Metrics

CPU Metrics

| Metric | Source | What It Shows |
|--------|--------|---------------|
| node_cpu_seconds_total{mode="user"} | node_exporter | Time in user space |
| node_cpu_seconds_total{mode="system"} | node_exporter | Time in kernel space |
| node_cpu_seconds_total{mode="iowait"} | node_exporter | Time waiting for I/O |
| node_cpu_seconds_total{mode="idle"} | node_exporter | Idle time |
| node_load1 / node_load5 / node_load15 | node_exporter | Load average (1/5/15 min) |

Common queries:

# CPU usage percentage (all modes except idle)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# CPU usage by mode
sum by (mode) (rate(node_cpu_seconds_total{instance="web01:9100"}[5m]))

# IO wait percentage (high = disk bottleneck)
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))

# Load average relative to CPU count (sustained > 1 per CPU = saturation)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

Memory Metrics

| Metric | What It Shows |
|--------|---------------|
| node_memory_MemTotal_bytes | Total physical memory |
| node_memory_MemAvailable_bytes | Memory available for applications |
| node_memory_Cached_bytes | Page cache (reclaimable) |
| node_memory_Buffers_bytes | Buffer cache |
| node_memory_SwapTotal_bytes | Total swap |
| node_memory_SwapFree_bytes | Free swap |

# Memory usage percentage
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Memory breakdown
node_memory_MemTotal_bytes
  - node_memory_MemAvailable_bytes
  - node_memory_Cached_bytes
  - node_memory_Buffers_bytes

# Swap usage (any swap usage may indicate memory pressure)
1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)

Disk Metrics

# Disk usage percentage
1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)

# Disk I/O utilization (percentage of time doing I/O)
rate(node_disk_io_time_seconds_total[5m])

# Read/write throughput
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])

# Average I/O latency
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])

Network Metrics

# Bandwidth (bytes/sec)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab         # Current established connections
rate(node_netstat_Tcp_ActiveOpens[5m])  # New outbound connections/sec
rate(node_netstat_Tcp_PassiveOpens[5m]) # New inbound connections/sec

Container Metrics

cAdvisor Metrics

| Metric | Description |
|--------|-------------|
| container_cpu_usage_seconds_total | Total CPU time consumed |
| container_cpu_cfs_throttled_periods_total | CPU throttling events |
| container_memory_working_set_bytes | Current memory (excludes cache) |
| container_memory_usage_bytes | Total memory (includes cache) |
| container_network_receive_bytes_total | Network inbound bytes |
| container_network_transmit_bytes_total | Network outbound bytes |
| container_fs_usage_bytes | Container filesystem usage |
| container_spec_memory_limit_bytes | Memory limit |
| container_spec_cpu_quota | CPU quota |

# Container CPU usage percentage (of limit)
sum by (container, pod) (
  rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
) / sum by (container, pod) (
  container_spec_cpu_quota / container_spec_cpu_period
)

# Container memory usage percentage (of limit)
container_memory_working_set_bytes{container!="POD",container!=""}
/
container_spec_memory_limit_bytes{container!="POD",container!=""} > 0

# CPU throttling percentage
sum by (container, pod) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
) / sum by (container, pod) (
  rate(container_cpu_cfs_periods_total[5m])
)

# OOMKill detection
increase(kube_pod_container_status_restarts_total[1h]) > 0
and
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

Kubernetes Metrics (kube-state-metrics)

# Pod status
kube_pod_status_phase{phase="Running"}
kube_pod_status_phase{phase="Pending"}
kube_pod_status_phase{phase="Failed"}

# Deployment replicas
kube_deployment_status_replicas_available
kube_deployment_spec_replicas

# HPA status
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas
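These series can back a simple availability alert. A sketch of a Prometheus rule that fires when a Deployment runs below its desired replica count; the group name, duration, and severity label are assumptions to adapt:

```yaml
groups:
  - name: deployment-health
    rules:
      - alert: DeploymentReplicasMismatch
        expr: |
          kube_deployment_status_replicas_available
            < kube_deployment_spec_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.deployment }} has fewer available replicas than desired"
```

The `for: 10m` delay avoids paging on routine rolling updates, which briefly reduce available replicas by design.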

Node Exporter

Setup

# docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:v1.7.0
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

Kubernetes DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      tolerations:
        - effect: NoSchedule
          operator: Exists

APM Tools

Comparison

| Feature | Datadog APM | New Relic | Elastic APM | Sentry |
|---------|-------------|-----------|-------------|--------|
| Type | Full APM | Full APM | Full APM | Error tracking + perf |
| Pricing | Per host ($31+/mo) | Per user + data | Free (self-host) or Cloud | Per event volume |
| Traces | Yes | Yes | Yes | Transaction traces |
| Error tracking | Yes | Yes | Yes | Excellent |
| Profiling | Yes (continuous) | Yes | No | No |
| Log correlation | Yes | Yes | Yes | Breadcrumbs |
| Dashboards | Built-in | Built-in | Kibana | Limited |
| Setup | Agent-based | Agent-based | Agent or OTel | SDK-based |
| Best for | Enterprise, full stack | Full observability | Self-hosted, ELK users | Error-focused teams |

Sentry Error Tracking

# Python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    traces_sample_rate=0.1,  # 10% of transactions
    profiles_sample_rate=0.1,
    environment="production",
    release="1.4.2",
    integrations=[FastApiIntegration()],
)

// Node.js
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: 'https://key@sentry.io/project',
  tracesSampleRate: 0.1,
  environment: 'production',
  release: '1.4.2',
});

// Go
import "github.com/getsentry/sentry-go"

sentry.Init(sentry.ClientOptions{
    Dsn:              "https://key@sentry.io/project",
    TracesSampleRate: 0.1,
    Environment:      "production",
    Release:          "1.4.2",
})
defer sentry.Flush(2 * time.Second)

Cost Optimization

Metric Cardinality Review

High cardinality is the most common cost driver in metrics systems:

# Find metrics with the most time series
topk(20, count by (__name__) ({__name__=~".+"}))

# Find labels with high cardinality
count(group by (path) (http_requests_total))   # How many unique paths?
count(group by (user_id) (api_calls_total))    # Unbounded!

Reduction strategies:

  1. Remove unused metrics (if nobody dashboards/alerts on it, drop it)
  2. Replace high-cardinality labels with bounded categories
  3. Use recording rules to pre-aggregate, drop raw metrics
  4. Use metric relabeling in Prometheus to drop at scrape time

# Drop unused metrics at scrape time
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_.*"           # Drop Go runtime metrics if unused
    action: drop
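Strategy 2 can also happen in application code, before the label ever reaches Prometheus. A hypothetical path normalizer that collapses unbounded IDs into bounded route templates; the patterns are examples, not a complete set:

```python
import re

# Collapse unbounded path segments into bounded route templates before
# using the path as a metric label. Each raw path value would otherwise
# create its own time series.
_PATTERNS = [
    (re.compile(r"/users/\d+"), "/users/:id"),
    (re.compile(r"/orders/[0-9a-f-]{36}"), "/orders/:uuid"),
]

def normalize_path(path: str) -> str:
    """Return a bounded route template for a raw request path."""
    for pattern, template in _PATTERNS:
        path = pattern.sub(template, path)
    return path

normalize_path("/users/12345")  # → "/users/:id"
```

Frameworks that expose the matched route template (rather than the raw URL) give you this for free; use the template as the label whenever it's available.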

Log Volume Reduction

| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Set production to INFO | 50-80% | Logger config |
| Sample health check logs | 90% for /health | Middleware filter |
| Truncate large payloads | 20-40% | Body size limit (4KB) |
| Drop duplicate errors | 30-50% | Rate-limit per error type |
| Compress in transit | 60-80% bandwidth | Enable gzip on log shipper |
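The health-check sampling row can be sketched as a logging filter. A hypothetical Python `logging.Filter` that keeps roughly 1 in 10 probe-endpoint access lines and passes everything else through; the 10% rate and the substring match are assumptions:

```python
import logging
import random

class HealthCheckSampler(logging.Filter):
    """Drop most access-log lines for probe endpoints, keep the rest."""

    def __init__(self, keep_rate: float = 0.1):
        super().__init__()
        self.keep_rate = keep_rate

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        # Probe endpoints are high-volume and low-information: sample them.
        if "/health" in message or "/live" in message or "/ready" in message:
            return random.random() < self.keep_rate
        return True  # everything else passes through unchanged

access_logger = logging.getLogger("access")
access_logger.addFilter(HealthCheckSampler(keep_rate=0.1))
```

Keeping a sample (rather than dropping everything) preserves evidence that the probes were actually running when you investigate an incident later.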

Trace Sampling

| Sampling Rate | Monthly Cost (est.) | Suitability |
|---------------|---------------------|-------------|
| 100% | $$$$ | Development, < 100 req/s |
| 10% | $$$ | Staging, medium traffic |
| 1% | $$ | Production, high traffic |
| Tail-based (errors + slow) | $$ | Production (recommended) |
| 0.1% | $ | Very high traffic (> 100k req/s) |
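Tail-based sampling needs a component that sees the whole trace before deciding. A sketch of an OpenTelemetry Collector `tail_sampling` processor that keeps all errors, all slow traces, and a 1% baseline; the thresholds and policy names are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```

Policies are OR-ed: a trace is kept if any policy matches, so the interesting traces survive while the healthy fast majority is sampled down.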

Retention Tiers

| Tier | Metrics | Logs | Traces |
|------|---------|------|--------|
| Hot (0-14 days) | 15s resolution | Full fidelity | All sampled traces |
| Warm (14-90 days) | 1m resolution | Full fidelity | Error + slow traces only |
| Cold (90 days - 1 year) | 5m resolution | Compressed | None (rely on metrics) |
| Archive (1-7 years) | 1h resolution | Compliance logs only | None |

Capacity Planning

Load Testing Correlation

Run load tests while monitoring infrastructure metrics to establish scaling thresholds:

Load Test Results:
┌─────────┬──────────┬────────┬─────────┬──────────────┐
│ RPS     │ p99 (ms) │ CPU %  │ Mem %   │ Error Rate   │
├─────────┼──────────┼────────┼─────────┼──────────────┤
│ 100     │ 45       │ 15     │ 30      │ 0%           │
│ 500     │ 85       │ 35     │ 45      │ 0%           │
│ 1000    │ 150      │ 55     │ 55      │ 0%           │
│ 2000    │ 320      │ 75     │ 65      │ 0.1%         │
│ 3000    │ 850      │ 90     │ 72      │ 1.5%         │  ← degradation
│ 4000    │ 2500     │ 98     │ 78      │ 12%          │  ← failure
└─────────┴──────────┴────────┴─────────┴──────────────┘

Scaling trigger: 75% CPU → add instance
Target capacity: 2x expected peak traffic
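The table above implies a safe per-instance capacity of about 2000 RPS (the 75% CPU point, before degradation). A small sketch of turning that into an instance count with the 2x headroom target; the function name and defaults are assumptions:

```python
import math

def instances_needed(expected_peak_rps: float,
                     safe_rps_per_instance: float = 2000,
                     headroom: float = 2.0) -> int:
    """Instances required to serve `headroom` x the expected peak traffic,
    where each instance is safe up to `safe_rps_per_instance`."""
    return math.ceil(expected_peak_rps * headroom / safe_rps_per_instance)

instances_needed(3000)  # → 3 (6000 RPS target / 2000 RPS per instance)
```

Re-derive `safe_rps_per_instance` after any significant code or hardware change; yesterday's load test is the only number this math is as good as.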

Scaling Triggers

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Scale up at 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"       # Scale at 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50                  # Max 50% increase per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120

Resource Forecasting

# Predict disk full in N hours
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0
# "Disk will be full within 30 days"

# Predict memory usage trend
predict_linear(
  avg_over_time(container_memory_working_set_bytes[7d]),
  30*24*3600
)

# Growth rate of database size (a gauge, so use deriv rather than rate)
deriv(pg_database_size_bytes[7d])
# Convert to "GB per month"
deriv(pg_database_size_bytes[7d]) * 86400 * 30 / 1e9

Incident Response

Incident Lifecycle

Detection → Triage → Mitigate → Resolve → Postmortem
    │          │         │          │          │
    │          │         │          │          └─ Blameless review
    │          │         │          └─ Root cause fix deployed
    │          │         └─ User impact reduced/eliminated
    │          └─ Severity assigned, team engaged
    └─ Alert fires or user reports issue

Severity Classification

| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|----------|
| SEV1 (Critical) | Service down, data loss, security breach | < 15 minutes | Complete outage, payment processing failure |
| SEV2 (Major) | Significant degradation, partial outage | < 30 minutes | One region down, 50%+ error rate |
| SEV3 (Minor) | Limited impact, workaround exists | < 4 hours | Single feature broken, elevated latency |
| SEV4 (Low) | Minimal impact, cosmetic | Next business day | UI glitch, non-critical alert firing |

Incident Commander Checklist

## Initial Response (first 15 minutes)
- [ ] Acknowledge the alert / report
- [ ] Assess severity (SEV1-4)
- [ ] Open incident channel (#inc-YYYYMMDD-description)
- [ ] Page relevant team members
- [ ] Post initial status update

## Triage (15-30 minutes)
- [ ] Identify affected services and scope
- [ ] Check recent deployments: any changes in last 2 hours?
- [ ] Check dashboards for anomalies
- [ ] Check external dependencies (status pages)
- [ ] Determine if rollback is feasible

## Mitigation
- [ ] Implement immediate fix (rollback, feature flag, scaling)
- [ ] Verify user impact is reduced
- [ ] Update status page
- [ ] Communicate ETA for full resolution

## Resolution
- [ ] Confirm root cause
- [ ] Deploy fix
- [ ] Verify metrics return to baseline
- [ ] Clear incident status
- [ ] Schedule postmortem within 48 hours

Postmortem Template

# Incident Postmortem: [TITLE]

**Date:** 2026-03-09
**Duration:** 45 minutes (14:15 - 15:00 UTC)
**Severity:** SEV2
**Author:** [Name]
**Status:** Complete

## Summary
One-paragraph description of what happened and impact.

## Impact
- Users affected: ~5,000
- Revenue impact: ~$2,500
- Error budget consumed: 45 minutes against the monthly 43-minute (99.9%) budget; budget exhausted for the month

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:12 | Deploy v1.4.3 to production |
| 14:15 | Error rate alert fires (5% → 15%) |
| 14:17 | On-call acknowledges, starts investigation |
| 14:22 | Root cause identified: new query missing index |
| 14:25 | Decision: rollback v1.4.3 |
| 14:30 | Rollback complete |
| 14:35 | Error rate returns to baseline |
| 15:00 | All-clear declared |

## Root Cause
The v1.4.3 deployment added a new API endpoint that queried the orders
table without an index on `user_id + created_at`. Under load, this caused
connection pool exhaustion, which cascaded to other endpoints.

## Detection
Alert fired 3 minutes after deploy. Detection was effective.

## Contributing Factors
1. No load test for the new endpoint
2. Missing index not caught in code review
3. No query performance checks in CI

## Action Items
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Add index on orders(user_id, created_at) | @backend | 2026-03-10 | Done |
| Add slow query detection to CI pipeline | @platform | 2026-03-15 | TODO |
| Add load test for new endpoints to deploy checklist | @backend | 2026-03-12 | TODO |
| Set up query performance alerting (> 100ms avg) | @sre | 2026-03-14 | TODO |

## Lessons Learned
- What went well: Fast detection (3 min), fast rollback (8 min)
- What went poorly: No pre-production load test caught the issue
- Where we got lucky: Happened during business hours, not at 3 AM

Communication During Incidents

| Audience | Channel | Frequency | Content |
|----------|---------|-----------|---------|
| Engineering | Slack #incident | Real-time | Technical details, commands run |
| Management | Slack #incidents-summary | Every 15-30 min | Impact, ETA, escalation needs |
| Customers | Status page | Every 15-30 min | User-facing impact, workarounds |
| Support | Slack #support-escalation | On status change | Scripted responses, known workarounds |