Infrastructure Monitoring Reference
Comprehensive reference for health checks, infrastructure metrics, APM, cost optimization, capacity planning, and incident response.
Health Checks
Types of Health Checks
| Type | Question It Answers | Failure Action |
|------|---------------------|----------------|
| Liveness | Is the process alive and not deadlocked? | Restart the process |
| Readiness | Can this instance serve traffic? | Remove from load balancer |
| Startup | Has the process finished initializing? | Wait (don't restart yet) |
Implementation Patterns
Basic Health Check Endpoint
// Go
// db (*sql.DB), redis (a go-redis client), and version are assumed to be
// initialized elsewhere in the package.
import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type HealthStatus struct {
	Status    string           `json:"status"`
	Timestamp string           `json:"timestamp"`
	Version   string           `json:"version"`
	Checks    map[string]Check `json:"checks"`
}

type Check struct {
	Status    string  `json:"status"`
	Message   string  `json:"message,omitempty"`
	LatencyMs float64 `json:"latency_ms,omitempty"`
}

// millisSince reports the elapsed time since start in milliseconds.
func millisSince(start time.Time) float64 {
	return float64(time.Since(start).Microseconds()) / 1000
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	health := HealthStatus{
		Status:    "ok",
		Timestamp: time.Now().UTC().Format(time.RFC3339),
		Version:   version,
		Checks:    make(map[string]Check),
	}

	// Check database
	start := time.Now()
	if err := db.PingContext(r.Context()); err != nil {
		health.Status = "degraded"
		health.Checks["database"] = Check{Status: "fail", Message: err.Error()}
	} else {
		health.Checks["database"] = Check{Status: "ok", LatencyMs: millisSince(start)}
	}

	// Check Redis
	start = time.Now()
	if err := redis.Ping(r.Context()).Err(); err != nil {
		health.Status = "degraded"
		health.Checks["redis"] = Check{Status: "fail", Message: err.Error()}
	} else {
		health.Checks["redis"] = Check{Status: "ok", LatencyMs: millisSince(start)}
	}

	// Any failed dependency turns the response into a 503
	statusCode := http.StatusOK
	if health.Status != "ok" {
		statusCode = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(statusCode)
	if err := json.NewEncoder(w).Encode(health); err != nil {
		log.Printf("health: failed to write response: %v", err)
	}
}
# Python (FastAPI)
# db and redis are assumed to be async clients initialized elsewhere.
import json
from datetime import datetime, timezone

from fastapi import FastAPI, Response

app = FastAPI()


@app.get("/health")
async def health_check():
    checks = {}
    status = "ok"

    # Database check
    try:
        start = datetime.now(timezone.utc)
        await db.execute("SELECT 1")
        checks["database"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["database"] = {"status": "fail", "message": str(e)}

    # Redis check
    try:
        start = datetime.now(timezone.utc)
        await redis.ping()
        checks["redis"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["redis"] = {"status": "fail", "message": str(e)}

    # Any failed dependency turns the response into a 503
    response_code = 200 if status == "ok" else 503
    return Response(
        content=json.dumps({
            "status": status,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "checks": checks,
        }),
        status_code=response_code,
        media_type="application/json",
    )


@app.get("/ready")
async def readiness_check():
    """Readiness: can we serve traffic?"""
    try:
        await db.execute("SELECT 1")
        return {"status": "ready"}
    except Exception:
        return Response(
            content='{"status": "not_ready"}',
            status_code=503,
            media_type="application/json",
        )


@app.get("/live")
async def liveness_check():
    """Liveness: is the process alive? Deliberately checks no dependencies."""
    return {"status": "alive"}
Health Check Response Format
{
"status": "ok",
"timestamp": "2026-03-09T14:32:01Z",
"version": "1.4.2",
"checks": {
"database": {
"status": "ok",
"latency_ms": 2.3
},
"redis": {
"status": "ok",
"latency_ms": 0.8
},
"external_api": {
"status": "degraded",
"message": "Elevated latency",
"latency_ms": 850
}
}
}
Liveness vs Readiness Decision Guide
Is the process able to make progress?
├─ No (deadlocked, OOM, infinite loop)
│ └─ Liveness check should FAIL → container gets restarted
│
└─ Yes, but...
├─ Database is temporarily unreachable
│ └─ Readiness FAIL, Liveness PASS → stop sending traffic, don't restart
│
├─ Still loading initial data/cache
│ └─ Startup FAIL → don't check liveness yet, wait
│
└─ Everything is fine
└─ All checks PASS → serve traffic normally
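Common mistake: Making liveness depend on external dependencies (database, Redis). If the database is down, restarting the application won't help; every replica fails its liveness probe at the same time, and the resulting restart storm delays recovery once the database comes back.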
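The decision guide can be read as a small truth table. The sketch below is only an illustration of that mapping; `ProbeResult` and `evaluate_probes` are hypothetical names, and the three boolean inputs are assumed to be determined by the application itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    startup: bool
    liveness: bool
    readiness: bool


def evaluate_probes(making_progress: bool, initialized: bool, dependencies_ok: bool) -> ProbeResult:
    """Map process state to probe outcomes, following the decision guide."""
    return ProbeResult(
        # Startup passes once initial data/cache loading has finished.
        startup=initialized,
        # Liveness looks only at the process itself, never at dependencies.
        liveness=making_progress,
        # Readiness additionally requires initialization and healthy dependencies.
        readiness=making_progress and initialized and dependencies_ok,
    )


# Database temporarily unreachable: stop sending traffic, but do not restart.
print(evaluate_probes(making_progress=True, initialized=True, dependencies_ok=False))
```

Keeping `dependencies_ok` out of the liveness expression is what prevents a database outage from triggering restarts.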
Kubernetes Probes
Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api-server:1.4.2
          ports:
            - containerPort: 8080
          # Startup probe: runs first, disables liveness/readiness until passing
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30  # 30 * 5s = 150s max startup time
            successThreshold: 1
          # Liveness probe: is the process alive?
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 0  # Starts after startup probe passes
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3  # 3 consecutive failures → restart
            successThreshold: 1
          # Readiness probe: can it serve traffic?
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3  # 3 failures → remove from Service
            successThreshold: 1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
Probe Types
HTTP GET
# Liveness points at /live, which checks no external dependencies
livenessProbe:
  httpGet:
    path: /live
    port: 8080
    httpHeaders:
      - name: Authorization
        value: Bearer internal-token
TCP Socket
# For services that don't have HTTP (databases, message brokers)
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
Exec Command
# Run a command inside the container
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -U postgres
  periodSeconds: 10
gRPC Health Check
# gRPC health checking protocol
livenessProbe:
  grpc:
    port: 50051
    service: ""  # Empty string checks overall server health
  periodSeconds: 10
Probe Configuration Guidelines
| Parameter | Liveness | Readiness | Startup |
|-----------|----------|-----------|---------|
| initialDelaySeconds | 0 (use startup probe) | 0 | 5-10 |
| periodSeconds | 10-15 | 5-10 | 5 |
| timeoutSeconds | 3-5 | 3-5 | 3-5 |
| failureThreshold | 3 | 3 | 30 (generous) |
| successThreshold | 1 | 1-2 | 1 |
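To see what a combination of these parameters means in wall-clock time, a rough estimate can help. The helpers below are a sketch with hypothetical names; they ignore kubelet scheduling jitter, so the results should be read as approximate upper bounds rather than exact figures.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate upper bound on the time a startup probe allows for initialization."""
    return initial_delay + period * failure_threshold


def worst_case_detection_seconds(period: int, failure_threshold: int, timeout: int) -> int:
    """Approximate worst-case time from a process hanging to the probe's failure action.

    Assumes the hang starts right after a successful probe and that every
    failing probe runs until its timeout.
    """
    return period * failure_threshold + timeout


# Values taken from the Deployment example in this section.
print(startup_budget_seconds(initial_delay=5, period=5, failure_threshold=30))
print(worst_case_detection_seconds(period=10, failure_threshold=3, timeout=3))  # liveness
print(worst_case_detection_seconds(period=5, failure_threshold=3, timeout=3))   # readiness
```

Shorter periods detect failures sooner but probe more often, so the readiness values in the table are tighter than the liveness ones: removing a pod from rotation is cheaper to undo than a restart.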
Docker HEALTHCHECK
# Dockerfile
FROM node:20-slim
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1
# Or with wget, for images that do not ship curl (keep only one HEALTHCHECK; Docker uses the last one)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
docker-compose Health Check
services:
  api:
    image: api-server:1.4.2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s
  worker:
    image: worker:1.2.0
    depends_on:
      api:
        condition: service_healthy
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
Health Check Parameters
| Parameter | Description | Default | Recommendation |
|-----------|-------------|---------|----------------|
| interval | Time between checks | 30s | 15-30s for critical services |
| timeout | Max time for check | 30s | 3-5s (fail fast) |
| retries | Failures before unhealthy | 3 | 3 (avoid flapping) |
| start_period | Grace period for startup | 0s | Set to max startup time |
Uptime Monitoring
Uptime Kuma Setup
# docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"
volumes:
  uptime-kuma-data:
Monitor types supported:
- HTTP(s) — status code, keyword, response time
- TCP — port open check
- DNS — resolution check
- Docker container — running status
- gRPC — health check protocol
- MQTT — broker connectivity
- Ping (ICMP) — network reachability
- Push — heartbeat endpoint (service pushes to Uptime Kuma)
Synthetic Monitoring
Scripted checks that simulate real user behavior from multiple regions:
// k6 script for synthetic monitoring
import { check, sleep } from 'k6';
import http from 'k6/http';

export const options = {
  scenarios: {
    synthetic: {
      executor: 'constant-vus',
      vus: 1,
      duration: '24h',
      gracefulStop: '0s',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% under 500ms
    http_req_failed: ['rate<0.01'],   // < 1% failure rate
    checks: ['rate>0.99'],            // 99% checks pass
  },
};

export default function () {
  // Check homepage
  let res = http.get('https://www.example.com');
  check(res, {
    'homepage status 200': (r) => r.status === 200,
    'homepage loads fast': (r) => r.timings.duration < 500,
    'homepage has title': (r) => r.body.includes('<title>'),
  });

  // Check API health
  res = http.get('https://api.example.com/health');
  check(res, {
    'api health 200': (r) => r.status === 200,
    'api reports ok': (r) => JSON.parse(r.body).status === 'ok',
  });

  // Check login flow
  res = http.post('https://api.example.com/auth/login', JSON.stringify({
    email: 'synthetic-user@example.com',
    // k6 exposes environment variables through __ENV, not process.env
    password: __ENV.SYNTHETIC_PASSWORD,
  }), { headers: { 'Content-Type': 'application/json' } });
  check(res, {
    'login succeeds': (r) => r.status === 200,
    'login returns token': (r) => JSON.parse(r.body).token !== undefined,
  });

  sleep(60); // Check every 60 seconds
}
Multi-Region Monitoring
| Provider | Regions | Free Tier | Notes |
|----------|---------|-----------|-------|
| Uptime Kuma | Self-hosted (1 region) | Free | Deploy in multiple regions yourself |
| Betteruptime | 10+ regions | 5 monitors | Status page included |
| Grafana Synthetic | 20+ regions | Part of Grafana Cloud | k6-based scripts |
| Datadog Synthetic | 100+ locations | 100 API tests/month | Full browser testing |
| AWS CloudWatch Synthetics | All AWS regions | Pay per run | Canary scripts |
Infrastructure Metrics
CPU Metrics
| Metric | Source | What It Shows |
|--------|--------|---------------|
| node_cpu_seconds_total{mode="user"} | node_exporter | Time in user space |
| node_cpu_seconds_total{mode="system"} | node_exporter | Time in kernel space |
| node_cpu_seconds_total{mode="iowait"} | node_exporter | Time waiting for I/O |
| node_cpu_seconds_total{mode="idle"} | node_exporter | Idle time |
| node_load1 / node_load5 / node_load15 | node_exporter | Load average (1/5/15 min) |
Common queries:
# CPU usage percentage (all modes except idle)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# CPU usage by mode
sum by (mode) (rate(node_cpu_seconds_total{instance="web01:9100"}[5m]))
# IO wait percentage (high = disk bottleneck)
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))
# Load average vs CPU count
node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
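The CPU usage query works on counters of idle seconds, so it may help to see the arithmetic spelled out. The sketch below mirrors `1 - avg(rate(idle))` for two scrapes of the per-core idle counters; `cpu_usage` is a hypothetical helper, and counter resets (which `rate()` handles) are ignored.

```python
def cpu_usage(idle_before: list[float], idle_after: list[float], elapsed_seconds: float) -> float:
    """Return CPU usage as a fraction (0.0 to 1.0) across all cores.

    idle_before / idle_after hold one idle-seconds counter value per core,
    sampled elapsed_seconds apart.
    """
    # Per-core idle rate: idle seconds accumulated per second of wall time.
    idle_rates = [
        (after - before) / elapsed_seconds
        for before, after in zip(idle_before, idle_after)
    ]
    # Average across cores, then invert to get the busy fraction.
    return 1.0 - sum(idle_rates) / len(idle_rates)


# Two cores sampled 5 minutes apart: one idle for 240 s, the other for 180 s.
usage = cpu_usage([1000.0, 2000.0], [1240.0, 2180.0], elapsed_seconds=300)
print(f"{usage:.1%}")
```

Like the PromQL expression, the result is a ratio; multiply by 100 if a dashboard expects a percentage.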
Memory Metrics
| Metric | What It Shows |
|--------|---------------|
| node_memory_MemTotal_bytes | Total physical memory |
| node_memory_MemAvailable_bytes | Memory available for applications |
| node_memory_Cached_bytes | Page cache (reclaimable) |
| node_memory_Buffers_bytes | Buffer cache |
| node_memory_SwapTotal_bytes | Total swap |
| node_memory_SwapFree_bytes | Free swap |
# Memory usage percentage
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
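# Memory used by applications (total minus free memory, page cache, and buffers).
# MemFree is used here because MemAvailable already counts reclaimable cache.
node_memory_MemTotal_bytes
  - node_memory_MemFree_bytes
  - node_memory_Cached_bytes
  - node_memory_Buffers_bytes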
# Swap usage (any swap usage may indicate memory pressure)
1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)
Disk Metrics
# Disk usage percentage
1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)
# Disk I/O utilization (percentage of time doing I/O)
rate(node_disk_io_time_seconds_total[5m])
# Read/write throughput
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
# Average I/O latency
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
Network Metrics
# Bandwidth (bytes/sec)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# TCP connections
node_netstat_Tcp_CurrEstab # Current established connections
rate(node_netstat_Tcp_ActiveOpens[5m]) # New outbound connections/sec
rate(node_netstat_Tcp_PassiveOpens[5m]) # New inbound connections/sec
Container Metrics
cAdvisor Metrics
| Metric | Description |
|--------|-------------|
| container_cpu_usage_seconds_total | Total CPU time consumed |
| container_cpu_cfs_throttled_periods_total | CPU throttling events |
| container_memory_working_set_bytes | Current memory (excludes cache) |
| container_memory_usage_bytes | Total memory (includes cache) |
| container_network_receive_bytes_total | Network inbound bytes |
| container_network_transmit_bytes_total | Network outbound bytes |
| container_fs_usage_bytes | Container filesystem usage |
| container_spec_memory_limit_bytes | Memory limit |
| container_spec_cpu_quota | CPU quota |
# Container CPU usage percentage (of limit)
sum by (container, pod) (
rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
) / sum by (container, pod) (
container_spec_cpu_quota / container_spec_cpu_period
)
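# Container memory usage percentage (of limit).
# The parentheses matter: they drop containers without a limit (limit = 0)
# before dividing; without them the comparison applies to the ratio instead.
container_memory_working_set_bytes{container!="POD",container!=""}
/
(container_spec_memory_limit_bytes{container!="POD",container!=""} > 0)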
# CPU throttling percentage
sum by (container, pod) (
rate(container_cpu_cfs_throttled_periods_total[5m])
) / sum by (container, pod) (
rate(container_cpu_cfs_periods_total[5m])
)
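# OOMKill detection: restarted in the last hour and last termination was an OOM kill.
# "on (...)" is required because the two metrics carry different label sets.
increase(kube_pod_container_status_restarts_total[1h]) > 0
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1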
Kubernetes Metrics (kube-state-metrics)
# Pod status
kube_pod_status_phase{phase="Running"}
kube_pod_status_phase{phase="Pending"}
kube_pod_status_phase{phase="Failed"}
# Deployment replicas
kube_deployment_status_replicas_available
kube_deployment_spec_replicas
# HPA status
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas
Node Exporter
Setup
# docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:v1.7.0
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
Kubernetes DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          # Point the exporter at the host mounts below; otherwise they are unused
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      tolerations:
        - effect: NoSchedule
          operator: Exists
APM Tools
Comparison
| Feature | Datadog APM | New Relic | Elastic APM | Sentry |
|---------|-------------|-----------|-------------|--------|
| Type | Full APM | Full APM | Full APM | Error tracking + perf |
| Pricing | Per host ($31+/mo) | Per user + data | Free (self-host) or Cloud | Per event volume |
| Traces | Yes | Yes | Yes | Transaction traces |
| Error tracking | Yes | Yes | Yes | Excellent |
| Profiling | Yes (continuous) | Yes | No | Yes (sampled) |
| Log correlation | Yes | Yes | Yes | Breadcrumbs |
| Dashboards | Built-in | Built-in | Kibana | Limited |
| Setup | Agent-based | Agent-based | Agent or OTel | SDK-based |
| Best for | Enterprise, full stack | Full observability | Self-hosted, ELK users | Error-focused teams |
Sentry Error Tracking
# Python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    traces_sample_rate=0.1,  # 10% of transactions
    profiles_sample_rate=0.1,  # applied on top of traces_sample_rate
    environment="production",
    release="1.4.2",
    integrations=[FastApiIntegration()],
)
// Node.js
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: 'https://key@sentry.io/project',
  tracesSampleRate: 0.1,
  environment: 'production',
  release: '1.4.2',
});
// Go
package main

import (
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	err := sentry.Init(sentry.ClientOptions{
		Dsn:              "https://key@sentry.io/project",
		TracesSampleRate: 0.1,
		Environment:      "production",
		Release:          "1.4.2",
	})
	if err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// Flush buffered events before the program exits
	defer sentry.Flush(2 * time.Second)
}
Cost Optimization
Metric Cardinality Review
High cardinality is the most common cost driver in metrics systems:
# Find metrics with the most time series
topk(20, count by (__name__) ({__name__=~".+"}))
# Find labels with high cardinality
count(group by (path) (http_requests_total)) # How many unique paths?
count(group by (user_id) (api_calls_total)) # Unbounded!
Reduction strategies:
- Remove unused metrics (if nobody dashboards/alerts on it, drop it)
- Replace high-cardinality labels with bounded categories
- Use recording rules to pre-aggregate, drop raw metrics
- Use metric relabeling in Prometheus to drop at scrape time
# Drop unused metrics at scrape time
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_.*"  # Drop Go runtime metrics if unused
    action: drop
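Replacing a high-cardinality label with a bounded category usually happens in application code, before the metric is recorded. The sketch below shows one possible approach for a `path` label: collapsing numeric and UUID segments into a placeholder. `normalize_path` and the `:id` placeholder are illustrative choices, not part of any library.

```python
import re

# A path segment counts as an identifier if it is all digits or a UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Collapse identifier-like segments so the label has bounded cardinality."""
    # Drop the query string; it should never end up in a metric label.
    path_only = path.split("?", 1)[0]
    segments = path_only.split("/")
    return "/".join(":id" if _ID_SEGMENT.match(s) else s for s in segments)


print(normalize_path("/users/12345/orders/987"))
print(normalize_path("/items/3f2b8c1e-9a4d-4c6b-8f1a-2d7e5b9c0a11?expand=true"))
```

If the web framework exposes the matched route template, using that as the label value is generally more robust than pattern matching on raw paths.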
Log Volume Reduction
| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Set production to INFO | 50-80% | Logger config |
| Sample health check logs | 90% for /health | Middleware filter |
| Truncate large payloads | 20-40% | Body size limit (4KB) |
| Drop duplicate errors | 30-50% | Rate-limit per error type |
| Compress in transit | 60-80% bandwidth | Enable gzip on log shipper |
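Sampling health check logs can be done with a standard `logging.Filter`. The following is a minimal sketch, assuming access-log messages contain the request path; `HealthCheckSampler`, `keep_ratio`, and `marker` are hypothetical names chosen for this example.

```python
import logging
import random


class HealthCheckSampler(logging.Filter):
    """Keep only a fraction of log records that mention the health endpoint."""

    def __init__(self, keep_ratio: float = 0.1, marker: str = "/health", rng=None):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.marker = marker
        self.rng = rng if rng is not None else random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        # Records unrelated to the health endpoint always pass through.
        if self.marker not in record.getMessage():
            return True
        # Health check records pass with probability keep_ratio.
        return self.rng.random() < self.keep_ratio


logger = logging.getLogger("access")
logger.addFilter(HealthCheckSampler(keep_ratio=0.1))  # drops about 90% of /health lines
```

Attaching the filter to the logger that emits access logs (its name depends on the server in use) leaves application logs untouched.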
Trace Sampling
| Sampling Rate | Monthly Cost (est.) | Suitability |
|---------------|---------------------|-------------|
| 100% | $$$$ | Development, < 100 req/s |
| 10% | $$$ | Staging, medium traffic |
| 1% | $$ | Production, high traffic |
| Tail-based (errors + slow) | $$ | Production (recommended) |
| 0.1% | $ | Very high traffic (> 100k req/s) |
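Tail-based sampling decides after a trace has finished, which is what lets it keep every error and slow request. The sketch below shows only the decision rule; `TraceSummary`, `keep_trace`, the 500 ms threshold, and the 1% baseline are assumptions for illustration, and a real collector would also need to buffer spans until the trace completes.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class TraceSummary:
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: TraceSummary,
    slow_threshold_ms: float = 500.0,
    baseline_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep all error and slow traces, plus a small random share of the rest."""
    if trace.has_error or trace.duration_ms >= slow_threshold_ms:
        return True
    # Healthy, fast traces are sampled at the baseline rate for context.
    return rng() < baseline_rate


print(keep_trace(TraceSummary(duration_ms=42.0, has_error=True)))    # error: kept
print(keep_trace(TraceSummary(duration_ms=1200.0, has_error=False)))  # slow: kept
```

The baseline share matters: without some ordinary traces there is nothing to compare the slow ones against.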
Retention Tiers
| Tier | Metrics | Logs | Traces |
|------|---------|------|--------|
| Hot (0-14 days) | 15s resolution | Full fidelity | All sampled traces |
| Warm (14-90 days) | 1m resolution | Full fidelity | Error + slow traces only |
| Cold (90 days - 1 year) | 5m resolution | Compressed | None (rely on metrics) |
| Archive (1-7 years) | 1h resolution | Compliance logs only | None |
Capacity Planning
Load Testing Correlation
Run load tests while monitoring infrastructure metrics to establish scaling thresholds:
Load Test Results:
┌─────────┬──────────┬────────┬─────────┬──────────────┐
│ RPS │ p99 (ms) │ CPU % │ Mem % │ Error Rate │
├─────────┼──────────┼────────┼─────────┼──────────────┤
│ 100 │ 45 │ 15 │ 30 │ 0% │
│ 500 │ 85 │ 35 │ 45 │ 0% │
│ 1000 │ 150 │ 55 │ 55 │ 0% │
│ 2000 │ 320 │ 75 │ 65 │ 0.1% │
│ 3000 │ 850 │ 90 │ 72 │ 1.5% │ ← degradation
│ 4000 │ 2500 │ 98 │ 78 │ 12% │ ← failure
└─────────┴──────────┴────────┴─────────┴──────────────┘
Scaling trigger: 75% CPU → add instance
Target capacity: 2x expected peak traffic
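The results table can be turned into a sizing estimate. The sketch below assumes the load test was run against a single instance and that throughput scales roughly linearly with instance count; both are assumptions, and `max_safe_rps` and `instances_needed` are hypothetical helper names.

```python
import math

# (rps, p99_ms, cpu_percent, error_percent), copied from the table above.
LOAD_TEST = [
    (100, 45, 15, 0.0),
    (500, 85, 35, 0.0),
    (1000, 150, 55, 0.0),
    (2000, 320, 75, 0.1),
    (3000, 850, 90, 1.5),
    (4000, 2500, 98, 12.0),
]


def max_safe_rps(results, cpu_limit: float = 75.0, error_limit: float = 0.1) -> int:
    """Highest tested RPS that stays within the CPU and error-rate limits."""
    safe = [rps for rps, _p99, cpu, err in results if cpu <= cpu_limit and err <= error_limit]
    return max(safe)


def instances_needed(expected_peak_rps: float, per_instance_rps: float, headroom: float = 2.0) -> int:
    """Instances required to serve headroom x expected peak traffic."""
    return math.ceil(expected_peak_rps * headroom / per_instance_rps)


per_instance = max_safe_rps(LOAD_TEST)
print(per_instance)
print(instances_needed(expected_peak_rps=3000, per_instance_rps=per_instance))
```

Re-running the load test after significant code or infrastructure changes keeps the per-instance figure honest.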
Scaling Triggers
# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale up at 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"  # Scale at 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50  # Max 50% increase per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
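For a single metric, the HPA's core calculation is `ceil(currentReplicas * currentValue / targetValue)`, skipped when the ratio is within a tolerance (0.1 by default) and clamped to the replica bounds. The sketch below reproduces only that step with a hypothetical `desired_replicas` helper; it does not model the `behavior` policies or stabilization windows, and with several metrics the HPA takes the largest result.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 3,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA formula suggests for one metric."""
    ratio = current_value / target_value
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against the 70% target from the manifest above.
print(desired_replicas(current_replicas=4, current_value=90, target_value=70))  # 6
```

Note that the `scaleUp` policy above would still cap how fast the Deployment reaches that number.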
Resource Forecasting
# Fire when the filesystem is predicted to run out of space within 30 days
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0
# Predicted memory usage 30 days from now.
# predict_linear needs a range vector, so the hourly averages go through a subquery.
predict_linear(
  avg_over_time(container_memory_working_set_bytes[1h])[7d:1h],
  30*24*3600
)
# Growth rate of database size in bytes/sec (deriv, because the size is a gauge, not a counter)
deriv(pg_database_size_bytes[7d])
# Convert to "GB per month"
deriv(pg_database_size_bytes[7d]) * 86400 * 30 / 1e9
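`predict_linear` fits a least-squares line through the samples in the range and extrapolates it. The sketch below shows that idea in plain Python; it is an approximation for illustration (it extrapolates from the last sample's timestamp, whereas Prometheus uses the evaluation time), and the sample data is invented.

```python
def predict_linear(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Extrapolate a least-squares line through (timestamp, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope = covariance(t, v) / variance(t)
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


DAY = 86400
# Invented data: 100 GB free, shrinking by 10 GB per day over one week.
samples = [(d * DAY, 100e9 - 10e9 * d) for d in range(7)]

predicted = predict_linear(samples, 30 * DAY)
print(predicted < 0)  # True: the disk is predicted to fill within 30 days
```

Linear extrapolation works best for steady growth; bursty or seasonal usage calls for a longer range or a different model.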
Incident Response
Incident Lifecycle
Detection → Triage → Mitigate → Resolve → Postmortem
│ │ │ │ │
│ │ │ │ └─ Blameless review
│ │ │ └─ Root cause fix deployed
│ │ └─ User impact reduced/eliminated
│ └─ Severity assigned, team engaged
└─ Alert fires or user reports issue
Severity Classification
| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|----------|
| SEV1 (Critical) | Service down, data loss, security breach | < 15 minutes | Complete outage, payment processing failure |
| SEV2 (Major) | Significant degradation, partial outage | < 30 minutes | One region down, 50%+ error rate |
| SEV3 (Minor) | Limited impact, workaround exists | < 4 hours | Single feature broken, elevated latency |
| SEV4 (Low) | Minimal impact, cosmetic | Next business day | UI glitch, non-critical alert firing |
Incident Commander Checklist
## Initial Response (first 15 minutes)
- [ ] Acknowledge the alert / report
- [ ] Assess severity (SEV1-4)
- [ ] Open incident channel (#inc-YYYYMMDD-description)
- [ ] Page relevant team members
- [ ] Post initial status update
## Triage (15-30 minutes)
- [ ] Identify affected services and scope
- [ ] Check recent deployments: any changes in last 2 hours?
- [ ] Check dashboards for anomalies
- [ ] Check external dependencies (status pages)
- [ ] Determine if rollback is feasible
## Mitigation
- [ ] Implement immediate fix (rollback, feature flag, scaling)
- [ ] Verify user impact is reduced
- [ ] Update status page
- [ ] Communicate ETA for full resolution
## Resolution
- [ ] Confirm root cause
- [ ] Deploy fix
- [ ] Verify metrics return to baseline
- [ ] Clear incident status
- [ ] Schedule postmortem within 48 hours
Postmortem Template
# Incident Postmortem: [TITLE]
**Date:** 2026-03-09
**Duration:** 45 minutes (14:15 - 15:00 UTC)
**Severity:** SEV2
**Author:** [Name]
**Status:** Complete
## Summary
One-paragraph description of what happened and impact.
## Impact
- Users affected: ~5,000
- Revenue impact: ~$2,500
- SLO budget consumed: 45 minutes against the monthly 43-minute budget (budget exceeded)
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:12 | Deploy v1.4.3 to production |
| 14:15 | Error rate alert fires (5% → 15%) |
| 14:17 | On-call acknowledges, starts investigation |
| 14:22 | Root cause identified: new query missing index |
| 14:25 | Decision: rollback v1.4.3 |
| 14:30 | Rollback complete |
| 14:35 | Error rate returns to baseline |
| 15:00 | All-clear declared |
## Root Cause
The v1.4.3 deployment added a new API endpoint that queried the orders
table without an index on `user_id + created_at`. Under load, this caused
connection pool exhaustion, which cascaded to other endpoints.
## Detection
Alert fired 3 minutes after deploy. Detection was effective.
## Contributing Factors
1. No load test for the new endpoint
2. Missing index not caught in code review
3. No query performance checks in CI
## Action Items
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Add index on orders(user_id, created_at) | @backend | 2026-03-10 | Done |
| Add slow query detection to CI pipeline | @platform | 2026-03-15 | TODO |
| Add load test for new endpoints to deploy checklist | @backend | 2026-03-12 | TODO |
| Set up query performance alerting (> 100ms avg) | @sre | 2026-03-14 | TODO |
## Lessons Learned
- What went well: Fast detection (3 min), fast rollback (8 min)
- What went poorly: No pre-production load test caught the issue
- Where we got lucky: Happened during business hours, not at 3 AM
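The 43-minute monthly budget in the template corresponds to a 99.9% availability SLO over a 30-day window. A short calculation makes the "SLO budget consumed" line easy to fill in; the helper names below are hypothetical, and the sketch treats the whole incident as full downtime, which overstates impact for partial outages.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return window_days * 24 * 60 * (1.0 - slo)


def budget_consumed(downtime_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget used by an incident (1.0 = fully spent)."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


budget = error_budget_minutes(0.999)
print(f"{budget:.1f} min budget")
print(f"{budget_consumed(45, 0.999):.0%} of budget consumed by a 45-minute outage")
```

For partial outages, weighting the duration by the share of failed requests gives a fairer figure.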
Communication During Incidents
| Audience | Channel | Frequency | Content |
|----------|---------|-----------|---------|
| Engineering | Slack #incident | Real-time | Technical details, commands run |
| Management | Slack #incidents-summary | Every 15-30 min | Impact, ETA, escalation needs |
| Customers | Status page | Every 15-30 min | User-facing impact, workarounds |
| Support | Slack #support-escalation | On status change | Scripted responses, known workarounds |