# Infrastructure Monitoring Reference

Comprehensive reference for health checks, infrastructure metrics, APM, cost optimization, capacity planning, and incident response.

---

## Health Checks

### Types of Health Checks

| Type | Question It Answers | Failure Action |
|------|---------------------|----------------|
| **Liveness** | Is the process alive and not deadlocked? | Restart the process |
| **Readiness** | Can this instance serve traffic? | Remove from load balancer |
| **Startup** | Has the process finished initializing? | Wait (don't restart yet) |

### Implementation Patterns

#### Basic Health Check Endpoint

```go
// Go
type HealthStatus struct {
	Status    string           `json:"status"`
	Timestamp string           `json:"timestamp"`
	Version   string           `json:"version"`
	Checks    map[string]Check `json:"checks"`
}

type Check struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	Latency string `json:"latency,omitempty"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	health := HealthStatus{
		Status:    "ok",
		Timestamp: time.Now().UTC().Format(time.RFC3339),
		Version:   version,
		Checks:    make(map[string]Check),
	}

	// Check database
	start := time.Now()
	if err := db.PingContext(r.Context()); err != nil {
		health.Status = "degraded"
		health.Checks["database"] = Check{
			Status:  "fail",
			Message: err.Error(),
		}
	} else {
		health.Checks["database"] = Check{
			Status:  "ok",
			Latency: time.Since(start).String(),
		}
	}

	// Check Redis
	start = time.Now()
	if err := redis.Ping(r.Context()).Err(); err != nil {
		health.Status = "degraded"
		health.Checks["redis"] = Check{
			Status:  "fail",
			Message: err.Error(),
		}
	} else {
		health.Checks["redis"] = Check{
			Status:  "ok",
			Latency: time.Since(start).String(),
		}
	}

	statusCode := http.StatusOK
	if health.Status != "ok" {
		statusCode = http.StatusServiceUnavailable
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(statusCode)
	json.NewEncoder(w).Encode(health)
}
```

```python
# Python (FastAPI)
from fastapi import FastAPI, Response
from datetime import datetime, timezone
import json

app = FastAPI()

# `db` and `redis` are assumed to be already-initialized async clients.


@app.get("/health")
async def health_check():
    checks = {}
    status = "ok"

    # Database check
    try:
        start = datetime.now(timezone.utc)
        await db.execute("SELECT 1")
        checks["database"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["database"] = {"status": "fail", "message": str(e)}

    # Redis check
    try:
        start = datetime.now(timezone.utc)
        await redis.ping()
        checks["redis"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["redis"] = {"status": "fail", "message": str(e)}

    response_code = 200 if status == "ok" else 503
    return Response(
        content=json.dumps({
            "status": status,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "checks": checks,
        }),
        status_code=response_code,
        media_type="application/json",
    )


@app.get("/ready")
async def readiness_check():
    """Readiness: can we serve traffic?"""
    try:
        await db.execute("SELECT 1")
        return {"status": "ready"}
    except Exception:
        return Response(
            content='{"status": "not_ready"}',
            status_code=503,
            media_type="application/json",
        )


@app.get("/live")
async def liveness_check():
    """Liveness: is the process alive?"""
    return {"status": "alive"}
```

#### Health Check Response Format

```json
{
  "status": "ok",
  "timestamp": "2026-03-09T14:32:01Z",
  "version": "1.4.2",
  "checks": {
    "database": {
      "status": "ok",
      "latency_ms": 2.3
    },
    "redis": {
      "status": "ok",
      "latency_ms": 0.8
    },
    "external_api": {
      "status": "degraded",
      "message": "Elevated latency",
      "latency_ms": 850
    }
  }
}
```

### Liveness vs Readiness Decision Guide

```
Is the process able to make progress?
├─ No (deadlocked, OOM, infinite loop)
│   └─ Liveness check should FAIL → container gets restarted
│
└─ Yes, but...
    ├─ Database is temporarily unreachable
    │   └─ Readiness FAIL, Liveness PASS → stop sending traffic, don't restart
    │
    ├─ Still loading initial data/cache
    │   └─ Startup FAIL → don't check liveness yet, wait
    │
    └─ Everything is fine
        └─ All checks PASS → serve traffic normally
```

**Common mistake:** Making liveness depend on external dependencies (database, Redis). If the database is down, restarting the application won't help; it will cause a restart storm.

---

## Kubernetes Probes

### Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api-server:1.4.2
          ports:
            - containerPort: 8080

          # Startup probe: runs first, disables liveness/readiness until passing
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30    # 30 * 5s = 150s max startup time
            successThreshold: 1

          # Liveness probe: is the process alive?
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 0  # Starts after startup probe passes
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3     # 3 consecutive failures → restart
            successThreshold: 1

          # Readiness probe: can it serve traffic?
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3     # 3 failures → remove from Service
            successThreshold: 1

          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
```

### Probe Types

#### HTTP GET

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
    httpHeaders:
      - name: Authorization
        value: Bearer internal-token
```

#### TCP Socket

```yaml
# For services that don't have HTTP (databases, message brokers)
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
```

#### Exec Command

```yaml
# Run a command inside the container
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -U postgres
  periodSeconds: 10
```

#### gRPC Health Check

```yaml
# gRPC health checking protocol
livenessProbe:
  grpc:
    port: 50051
    service: ""    # Empty string checks overall server health
  periodSeconds: 10
```

### Probe Configuration Guidelines

| Parameter | Liveness | Readiness | Startup |
|-----------|----------|-----------|---------|
| `initialDelaySeconds` | 0 (use startup probe) | 0 | 5-10 |
| `periodSeconds` | 10-15 | 5-10 | 5 |
| `timeoutSeconds` | 3-5 | 3-5 | 3-5 |
| `failureThreshold` | 3 | 3 | 30 (generous) |
| `successThreshold` | 1 | 1-2 | 1 |

---

## Docker HEALTHCHECK

```dockerfile
# Dockerfile
FROM node:20-slim

HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1

# Alternative with wget, for images that ship without curl (e.g. Alpine).
# Only the last HEALTHCHECK in a Dockerfile takes effect, so keep just one.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
```

### docker-compose Health Check

```yaml
services:
  api:
    image: api-server:1.4.2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s

  worker:
    image: worker:1.2.0
    depends_on:
      api:
        condition: service_healthy
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
```

### Health Check Parameters

| Parameter | Description | Default | Recommendation |
|-----------|-------------|---------|----------------|
| `interval` | Time between checks | 30s | 15-30s for critical services |
| `timeout` | Max time for check | 30s | 3-5s (fail fast) |
| `retries` | Failures before unhealthy | 3 | 3 (avoid flapping) |
| `start_period` | Grace period for startup | 0s | Set to max startup time |

---

## Uptime Monitoring

### Uptime Kuma Setup

```yaml
# docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"

volumes:
  uptime-kuma-data:
```

**Monitor types supported:**
- HTTP(s): status code, keyword, response time
- TCP: port open check
- DNS: resolution check
- Docker container: running status
- gRPC: health check protocol
- MQTT: broker connectivity
- Ping (ICMP): network reachability
- Push: heartbeat endpoint (service pushes to Uptime Kuma)

### Synthetic Monitoring

Scripted checks that simulate real user behavior from multiple regions:

```javascript
// k6 script for synthetic monitoring
import { check, sleep } from 'k6';
import http from 'k6/http';

export const options = {
  scenarios: {
    synthetic: {
      executor: 'constant-vus',
      vus: 1,
      duration: '24h',
      gracefulStop: '0s',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% under 500ms
    http_req_failed: ['rate<0.01'],    // < 1% failure rate
    checks: ['rate>0.99'],             // 99% checks pass
  },
};

export default function () {
  // Check homepage
  let res = http.get('https://www.example.com');
  check(res, {
    'homepage status 200': (r) => r.status === 200,
    'homepage loads fast': (r) => r.timings.duration < 500,
    'homepage has title': (r) => r.body.includes('<title>'),
  });

  // Check API health
  res = http.get('https://api.example.com/health');
  check(res, {
    'api health 200': (r) => r.status === 200,
    'api reports ok': (r) => JSON.parse(r.body).status === 'ok',
  });

  // Check login flow
  res = http.post('https://api.example.com/auth/login', JSON.stringify({
    email: 'synthetic-user@example.com',
    password: __ENV.SYNTHETIC_PASSWORD,  // k6 exposes environment variables via __ENV
  }), { headers: { 'Content-Type': 'application/json' } });
  check(res, {
    'login succeeds': (r) => r.status === 200,
    'login returns token': (r) => JSON.parse(r.body).token !== undefined,
  });

  sleep(60);  // Check every 60 seconds
}
```

### Multi-Region Monitoring

| Provider | Regions | Free Tier | Notes |
|----------|---------|-----------|-------|
| **Uptime Kuma** | Self-hosted (1 region) | Free | Deploy in multiple regions yourself |
| **Betteruptime** | 10+ regions | 5 monitors | Status page included |
| **Grafana Synthetic** | 20+ regions | Part of Grafana Cloud | k6-based scripts |
| **Datadog Synthetic** | 100+ locations | 100 API tests/month | Full browser testing |
| **AWS CloudWatch Synthetics** | All AWS regions | Pay per run | Canary scripts |

---

## Infrastructure Metrics

### CPU Metrics

| Metric | Source | What It Shows |
|--------|--------|---------------|
| `node_cpu_seconds_total{mode="user"}` | node_exporter | Time in user space |
| `node_cpu_seconds_total{mode="system"}` | node_exporter | Time in kernel space |
| `node_cpu_seconds_total{mode="iowait"}` | node_exporter | Time waiting for I/O |
| `node_cpu_seconds_total{mode="idle"}` | node_exporter | Idle time |
| `node_load1` / `node_load5` / `node_load15` | node_exporter | Load average (1/5/15 min) |

**Common queries:**

```promql
# CPU usage as a 0-1 ratio (all modes except idle)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# CPU usage by mode
sum by (mode) (rate(node_cpu_seconds_total{instance="web01:9100"}[5m]))

# IO wait ratio (high = disk bottleneck)
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))

# Load average vs CPU count
# (drop both cpu and mode so the labels match node_load1)
node_load1 / count without (cpu, mode)
(node_cpu_seconds_total{mode="idle"})
```

### Memory Metrics

| Metric | What It Shows |
|--------|---------------|
| `node_memory_MemTotal_bytes` | Total physical memory |
| `node_memory_MemAvailable_bytes` | Memory available for applications |
| `node_memory_Cached_bytes` | Page cache (reclaimable) |
| `node_memory_Buffers_bytes` | Buffer cache |
| `node_memory_SwapTotal_bytes` | Total swap |
| `node_memory_SwapFree_bytes` | Free swap |

```promql
# Memory usage as a 0-1 ratio
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Memory breakdown: used by applications (total minus free, page cache, and buffers)
node_memory_MemTotal_bytes
  - node_memory_MemFree_bytes
  - node_memory_Cached_bytes
  - node_memory_Buffers_bytes

# Swap usage (any swap usage may indicate memory pressure)
1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)
```

### Disk Metrics

```promql
# Disk usage as a 0-1 ratio
1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)

# Disk I/O utilization (fraction of time doing I/O)
rate(node_disk_io_time_seconds_total[5m])

# Read/write throughput
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])

# Average read latency (seconds per completed read)
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
```

### Network Metrics

```promql
# Bandwidth (bytes/sec)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab                 # Current established connections
rate(node_netstat_Tcp_ActiveOpens[5m])     # New outbound connections/sec
rate(node_netstat_Tcp_PassiveOpens[5m])    # New inbound connections/sec
```

---

## Container Metrics

### cAdvisor Metrics

| Metric | Description |
|--------|-------------|
| `container_cpu_usage_seconds_total` | Total CPU time consumed |
| `container_cpu_cfs_throttled_periods_total` | CPU throttling events |
| `container_memory_working_set_bytes` | Current memory (excludes cache) |
| `container_memory_usage_bytes` | Total memory (includes cache) |
| `container_network_receive_bytes_total` | Network inbound bytes |
| `container_network_transmit_bytes_total` | Network outbound bytes |
| `container_fs_usage_bytes` | Container filesystem usage |
| `container_spec_memory_limit_bytes` | Memory limit |
| `container_spec_cpu_quota` | CPU quota |

```promql
# Container CPU usage as a ratio of its limit
sum by (container, pod) (
  rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
)
/
sum by (container, pod) (
  container_spec_cpu_quota / container_spec_cpu_period
)

# Container memory usage as a ratio of its limit
# (the "> 0" filter is parenthesized so containers without a limit are skipped)
container_memory_working_set_bytes{container!="POD",container!=""}
/
(container_spec_memory_limit_bytes{container!="POD",container!=""} > 0)

# CPU throttling ratio
sum by (container, pod) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (container, pod) (
  rate(container_cpu_cfs_periods_total[5m])
)

# OOMKill detection
# (match on shared labels only; the restart counter has no "reason" label)
increase(kube_pod_container_status_restarts_total[1h]) > 0
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```

### Kubernetes Metrics (kube-state-metrics)

```promql
# Pod status
kube_pod_status_phase{phase="Running"}
kube_pod_status_phase{phase="Pending"}
kube_pod_status_phase{phase="Failed"}

# Deployment replicas
kube_deployment_status_replicas_available
kube_deployment_spec_replicas

# HPA status
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas
```

---

## Node Exporter

### Setup

```yaml
# docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:v1.7.0
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
```

### Kubernetes DaemonSet

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          # Point the exporter at the host mounts below, as in the compose setup
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      tolerations:
        - effect: NoSchedule
          operator: Exists
```

---

## APM Tools

### Comparison

| Feature | Datadog APM | New Relic | Elastic APM | Sentry |
|---------|-------------|-----------|-------------|--------|
| **Type** | Full APM | Full APM | Full APM | Error tracking + perf |
| **Pricing** | Per host ($31+/mo) | Per user + data | Free (self-host) or Cloud | Per event volume |
| **Traces** | Yes | Yes | Yes | Transaction traces |
| **Error tracking** | Yes | Yes | Yes | Excellent |
| **Profiling** | Yes (continuous) | Yes | No | No |
| **Log correlation** | Yes | Yes | Yes | Breadcrumbs |
| **Dashboards** | Built-in | Built-in | Kibana | Limited |
| **Setup** | Agent-based | Agent-based | Agent or OTel | SDK-based |
| **Best for** | Enterprise, full stack | Full observability | Self-hosted, ELK users | Error-focused teams |

### Sentry Error Tracking

```python
# Python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    traces_sample_rate=0.1,  # 10% of transactions
    profiles_sample_rate=0.1,
    environment="production",
    release="1.4.2",
    integrations=[FastApiIntegration()],
)
```

```javascript
// Node.js
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: 'https://key@sentry.io/project',
  tracesSampleRate: 0.1,
  environment: 'production',
  release: '1.4.2',
});
```

```go
// Go
import "github.com/getsentry/sentry-go"

sentry.Init(sentry.ClientOptions{
	Dsn:              "https://key@sentry.io/project",
	TracesSampleRate: 0.1,
	Environment:      "production",
	Release:          "1.4.2",
})
defer sentry.Flush(2 * time.Second)
```

---

## Cost Optimization

### Metric Cardinality Review

High cardinality is the most common cost driver in metrics systems:

```promql
# Find metrics with the most time series
topk(20, count by (__name__) ({__name__=~".+"}))

# Find labels with high cardinality
count(group by (path) (http_requests_total))     # How many unique paths?
count(group by (user_id) (api_calls_total))      # Unbounded!
```

**Reduction strategies:**
1. Remove unused metrics (if no dashboard or alert uses a metric, drop it)
2. Replace high-cardinality labels with bounded categories
3. Use recording rules to pre-aggregate, then drop the raw metrics
4. Use metric relabeling in Prometheus to drop at scrape time

```yaml
# Drop unused metrics at scrape time
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_.*"    # Drop Go runtime metrics if unused
    action: drop
```

### Log Volume Reduction

| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Set production to INFO | 50-80% | Logger config |
| Sample health check logs | 90% for /health | Middleware filter |
| Truncate large payloads | 20-40% | Body size limit (4KB) |
| Drop duplicate errors | 30-50% | Rate-limit per error type |
| Compress in transit | 60-80% bandwidth | Enable gzip on log shipper |

### Trace Sampling

| Sampling Rate | Monthly Cost (est.) | Suitability |
|---------------|---------------------|-------------|
| 100% | $$$$ | Development, < 100 req/s |
| 10% | $$$ | Staging, medium traffic |
| 1% | $$ | Production, high traffic |
| Tail-based (errors + slow) | $$ | Production (recommended) |
| 0.1% | $ | Very high traffic (> 100k req/s) |

### Retention Tiers

| Tier | Metrics | Logs | Traces |
|------|---------|------|--------|
| Hot (0-14 days) | 15s resolution | Full fidelity | All sampled traces |
| Warm (14-90 days) | 1m resolution | Full fidelity | Error + slow traces only |
| Cold (90 days - 1 year) | 5m resolution | Compressed | None (rely on metrics) |
| Archive (1-7 years) | 1h resolution | Compliance logs only | None |

---

## Capacity Planning

### Load Testing Correlation

Run load tests while monitoring infrastructure metrics to establish scaling thresholds:

```
Load Test Results:
┌─────────┬──────────┬────────┬─────────┬──────────────┐
│ RPS     │ p99 (ms) │ CPU %  │ Mem %   │ Error Rate   │
├─────────┼──────────┼────────┼─────────┼──────────────┤
│ 100     │ 45       │ 15     │ 30      │ 0%           │
│ 500     │ 85       │ 35     │ 45      │ 0%           │
│ 1000    │ 150      │ 55     │ 55      │ 0%           │
│ 2000    │ 320      │ 75     │ 65      │ 0.1%         │
│ 3000    │ 850      │ 90     │ 72      │ 1.5%         │ ← degradation
│ 4000    │ 2500     │ 98     │ 78      │ 12%          │ ← failure
└─────────┴──────────┴────────┴─────────┴──────────────┘

Scaling trigger: 75% CPU → add instance
Target capacity: 2x expected peak traffic
```

### Scaling Triggers

```yaml
# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Scale up at 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"      # Scale at 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50                 # Max 50% increase per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
```

### Resource Forecasting

```promql
# Predict whether the disk will be full within 30 days
# (extrapolates the last 7 days of free space 30 days ahead)
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0

# Predict memory usage 30 days out
# (predict_linear needs a range vector, so hourly averages are fed in via a subquery)
predict_linear(
  avg_over_time(container_memory_working_set_bytes[1h])[7d:1h],
  30*24*3600
)

# Growth rate of database size in bytes/sec
# (deriv rather than rate, because the size is a gauge, not a counter)
deriv(pg_database_size_bytes[7d])

# Convert to "GB per month"
deriv(pg_database_size_bytes[7d]) * 86400 * 30 / 1e9
```

---

## Incident Response

### Incident Lifecycle

```
Detection → Triage → Mitigate → Resolve → Postmortem
    │          │         │         │          │
    │          │         │         │          └─ Blameless review
    │          │         │         └─ Root cause fix deployed
    │          │         └─ User impact reduced/eliminated
    │          └─ Severity assigned, team engaged
    └─ Alert fires or user reports issue
```

### Severity Classification

| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|---------|
| **SEV1 (Critical)** | Service down, data loss, security breach | < 15 minutes | Complete outage, payment processing failure |
| **SEV2 (Major)** | Significant degradation, partial outage | < 30 minutes | One region down, 50%+ error rate |
| **SEV3 (Minor)** | Limited impact, workaround exists | < 4 hours | Single feature broken, elevated latency |
| **SEV4 (Low)** | Minimal impact, cosmetic | Next business day | UI glitch, non-critical alert firing |

### Incident Commander Checklist

```markdown
## Initial Response (first 15 minutes)
- [ ] Acknowledge the alert / report
- [ ] Assess severity (SEV1-4)
- [ ] Open incident channel (#inc-YYYYMMDD-description)
- [ ] Page relevant team members
- [ ] Post initial status update

## Triage (15-30 minutes)
- [ ] Identify affected services and scope
- [ ] Check recent deployments: any changes in last 2 hours?
- [ ] Check dashboards for anomalies
- [ ] Check external dependencies (status pages)
- [ ] Determine if rollback is feasible

## Mitigation
- [ ] Implement immediate fix (rollback, feature flag, scaling)
- [ ] Verify user impact is reduced
- [ ] Update status page
- [ ] Communicate ETA for full resolution

## Resolution
- [ ] Confirm root cause
- [ ] Deploy fix
- [ ] Verify metrics return to baseline
- [ ] Clear incident status
- [ ] Schedule postmortem within 48 hours
```

### Postmortem Template

```markdown
# Incident Postmortem: [TITLE]

**Date:** 2026-03-09
**Duration:** 45 minutes (14:15 - 15:00 UTC)
**Severity:** SEV2
**Author:** [Name]
**Status:** Complete

## Summary
One-paragraph description of what happened and impact.

## Impact
- Users affected: ~5,000
- Revenue impact: ~$2,500
- SLO budget consumed: 3.2 minutes of the monthly 43-minute budget

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:12 | Deploy v1.4.3 to production |
| 14:15 | Error rate alert fires (5% → 15%) |
| 14:17 | On-call acknowledges, starts investigation |
| 14:22 | Root cause identified: new query missing index |
| 14:25 | Decision: rollback v1.4.3 |
| 14:30 | Rollback complete |
| 14:35 | Error rate returns to baseline |
| 15:00 | All-clear declared |

## Root Cause
The v1.4.3 deployment added a new API endpoint that queried the orders
table without an index on `user_id + created_at`. Under load, this caused
connection pool exhaustion, which cascaded to other endpoints.

## Detection
Alert fired 3 minutes after deploy. Detection was effective.

## Contributing Factors
1. No load test for the new endpoint
2. Missing index not caught in code review
3. No query performance checks in CI

## Action Items
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Add index on orders(user_id, created_at) | @backend | 2026-03-10 | Done |
| Add slow query detection to CI pipeline | @platform | 2026-03-15 | TODO |
| Add load test for new endpoints to deploy checklist | @backend | 2026-03-12 | TODO |
| Set up query performance alerting (> 100ms avg) | @sre | 2026-03-14 | TODO |

## Lessons Learned
- What went well: Fast detection (3 min), fast rollback (8 min)
- What went poorly: No pre-production load test caught the issue
- Where we got lucky: Happened during business hours, not at 3 AM
```

### Communication During Incidents

| Audience | Channel | Frequency | Content |
|----------|---------|-----------|---------|
| Engineering | Slack #incident | Real-time | Technical details, commands run |
| Management | Slack #incidents-summary | Every 15-30 min | Impact, ETA, escalation needs |
| Customers | Status page | Every 15-30 min | User-facing impact, workarounds |
| Support | Slack #support-escalation | On status change | Scripted responses, known workarounds |
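
Impact and status updates are easier to write when the SLO budget figure can be recomputed on the spot. The sketch below shows the arithmetic behind a "monthly 43-minute budget" like the one in the postmortem template. It assumes a 99.9% availability target over a 30-day window and evenly distributed traffic; neither is stated in this reference, and the function names are hypothetical.

```python
# Error-budget arithmetic (illustrative sketch; the 99.9% target, the 30-day
# window, and the uniform-traffic assumption are not taken from this reference).


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO allows within the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed_minutes(error_rate: float, duration_minutes: float) -> float:
    """Budget burned by a partial outage: failed-request share times duration."""
    return error_rate * duration_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)         # 99.9% over 30 days
    burned = budget_consumed_minutes(0.15, 20)   # 15% errors for 20 minutes
    print(f"budget: {budget:.1f} min")
    print(f"burned: {burned:.1f} min ({burned / budget:.0%} of budget)")
```

Weighting by error rate means a 20-minute incident at 15% errors burns far less budget than 20 minutes of full outage; teams that measure SLOs per request rather than per minute would substitute actual failed and total request counts.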