# Infrastructure Monitoring Reference
Comprehensive reference for health checks, infrastructure metrics, APM, cost optimization, capacity planning, and incident response.
---
## Health Checks
### Types of Health Checks
| Type | Question It Answers | Failure Action |
|------|---------------------|----------------|
| **Liveness** | Is the process alive and not deadlocked? | Restart the process |
| **Readiness** | Can this instance serve traffic? | Remove from load balancer |
| **Startup** | Has the process finished initializing? | Wait (don't restart yet) |
### Implementation Patterns
#### Basic Health Check Endpoint
```go
// Go
type HealthStatus struct {
    Status    string           `json:"status"`
    Timestamp string           `json:"timestamp"`
    Version   string           `json:"version"`
    Checks    map[string]Check `json:"checks"`
}

type Check struct {
    Status  string `json:"status"`
    Message string `json:"message,omitempty"`
    Latency string `json:"latency,omitempty"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
    health := HealthStatus{
        Status:    "ok",
        Timestamp: time.Now().UTC().Format(time.RFC3339),
        Version:   version,
        Checks:    make(map[string]Check),
    }

    // Check database
    start := time.Now()
    if err := db.PingContext(r.Context()); err != nil {
        health.Status = "degraded"
        health.Checks["database"] = Check{
            Status:  "fail",
            Message: err.Error(),
        }
    } else {
        health.Checks["database"] = Check{
            Status:  "ok",
            Latency: time.Since(start).String(),
        }
    }

    // Check Redis
    start = time.Now()
    if err := redis.Ping(r.Context()).Err(); err != nil {
        health.Status = "degraded"
        health.Checks["redis"] = Check{
            Status:  "fail",
            Message: err.Error(),
        }
    } else {
        health.Checks["redis"] = Check{
            Status:  "ok",
            Latency: time.Since(start).String(),
        }
    }

    statusCode := http.StatusOK
    if health.Status != "ok" {
        statusCode = http.StatusServiceUnavailable
    }
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(statusCode)
    json.NewEncoder(w).Encode(health)
}
```
```python
# Python (FastAPI)
import json
from datetime import datetime, timezone

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/health")
async def health_check():
    checks = {}
    status = "ok"

    # Database check
    try:
        start = datetime.now(timezone.utc)
        await db.execute("SELECT 1")
        checks["database"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["database"] = {"status": "fail", "message": str(e)}

    # Redis check
    try:
        start = datetime.now(timezone.utc)
        await redis.ping()
        checks["redis"] = {
            "status": "ok",
            "latency_ms": (datetime.now(timezone.utc) - start).total_seconds() * 1000,
        }
    except Exception as e:
        status = "degraded"
        checks["redis"] = {"status": "fail", "message": str(e)}

    response_code = 200 if status == "ok" else 503
    return Response(
        content=json.dumps({
            "status": status,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "checks": checks,
        }),
        status_code=response_code,
        media_type="application/json",
    )

@app.get("/ready")
async def readiness_check():
    """Readiness: can we serve traffic?"""
    try:
        await db.execute("SELECT 1")
        return {"status": "ready"}
    except Exception:
        return Response(
            content='{"status": "not_ready"}',
            status_code=503,
            media_type="application/json",
        )

@app.get("/live")
async def liveness_check():
    """Liveness: is the process alive?"""
    return {"status": "alive"}
```
#### Health Check Response Format
```json
{
  "status": "ok",
  "timestamp": "2026-03-09T14:32:01Z",
  "version": "1.4.2",
  "checks": {
    "database": {
      "status": "ok",
      "latency_ms": 2.3
    },
    "redis": {
      "status": "ok",
      "latency_ms": 0.8
    },
    "external_api": {
      "status": "degraded",
      "message": "Elevated latency",
      "latency_ms": 850
    }
  }
}
```
### Liveness vs Readiness Decision Guide
```
Is the process able to make progress?
├─ No (deadlocked, OOM, infinite loop)
│   └─ Liveness check should FAIL → container gets restarted
│
└─ Yes, but...
    ├─ Database is temporarily unreachable
    │   └─ Readiness FAIL, Liveness PASS → stop sending traffic, don't restart
    │
    ├─ Still loading initial data/cache
    │   └─ Startup FAIL → don't check liveness yet, wait
    │
    └─ Everything is fine
        └─ All checks PASS → serve traffic normally
```
**Common mistake:** Making liveness depend on external dependencies (database, Redis). If the database is down, restarting the application won't help — it will cause a restart storm.
---
## Kubernetes Probes
### Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api-server:1.4.2
          ports:
            - containerPort: 8080
          # Startup probe: runs first, disables liveness/readiness until passing
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30   # 30 * 5s = 150s max startup time
            successThreshold: 1
          # Liveness probe: is the process alive?
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 0  # Starts after startup probe passes
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3     # 3 consecutive failures → restart
            successThreshold: 1
          # Readiness probe: can it serve traffic?
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3     # 3 failures → remove from Service
            successThreshold: 1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
### Probe Types
#### HTTP GET
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
    httpHeaders:
      - name: Authorization
        value: Bearer internal-token
```
#### TCP Socket
```yaml
# For services that don't speak HTTP (databases, message brokers)
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
```
#### Exec Command
```yaml
# Run a command inside the container
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -U postgres
  periodSeconds: 10
```
#### gRPC Health Check
```yaml
# gRPC health checking protocol
livenessProbe:
  grpc:
    port: 50051
    service: ""   # Empty string checks overall server health
  periodSeconds: 10
```
### Probe Configuration Guidelines
| Parameter | Liveness | Readiness | Startup |
|-----------|----------|-----------|---------|
| `initialDelaySeconds` | 0 (use startup probe) | 0 | 5-10 |
| `periodSeconds` | 10-15 | 5-10 | 5 |
| `timeoutSeconds` | 3-5 | 3-5 | 3-5 |
| `failureThreshold` | 3 | 3 | 30 (generous) |
| `successThreshold` | 1 | 1-2 | 1 |
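A useful cross-check on these values: the worst-case time from first failure to the probe's action is roughly `periodSeconds × failureThreshold`, plus up to one `timeoutSeconds` for the final failing check. A quick sketch (the function name is ours, not a Kubernetes API):

```python
def worst_case_detection_s(period_s: int, failure_threshold: int, timeout_s: int) -> int:
    """Approximate worst-case seconds from first failure until the probe's
    failure action (restart for liveness, traffic removal for readiness)."""
    return period_s * failure_threshold + timeout_s

# Liveness with period=10, threshold=3, timeout=3:
print(worst_case_detection_s(10, 3, 3))   # 33
# Startup budget with period=5, threshold=30, timeout=3:
print(worst_case_detection_s(5, 30, 3))   # 153
```

If 33 seconds of undetected deadlock is too long, tighten `periodSeconds` before touching `failureThreshold` (a threshold of 1 invites restart flapping).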
---
## Docker HEALTHCHECK
```dockerfile
# Dockerfile
FROM node:20-slim

HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1

# Or with wget (for images without curl, e.g. alpine)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
```
### docker-compose Health Check
```yaml
services:
  api:
    image: api-server:1.4.2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s

  worker:
    image: worker:1.2.0
    depends_on:
      api:
        condition: service_healthy
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
```
### Health Check Parameters
| Parameter | Description | Default | Recommendation |
|-----------|-------------|---------|----------------|
| `interval` | Time between checks | 30s | 15-30s for critical services |
| `timeout` | Max time for check | 30s | 3-5s (fail fast) |
| `retries` | Failures before unhealthy | 3 | 3 (avoid flapping) |
| `start_period` | Grace period for startup | 0s | Set to max startup time |
---
## Uptime Monitoring
### Uptime Kuma Setup
```yaml
# docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"

volumes:
  uptime-kuma-data:
```
**Monitor types supported:**
- HTTP(s) — status code, keyword, response time
- TCP — port open check
- DNS — resolution check
- Docker container — running status
- gRPC — health check protocol
- MQTT — broker connectivity
- Ping (ICMP) — network reachability
- Push — heartbeat endpoint (service pushes to Uptime Kuma)
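For the push type, the monitor alerts when a heartbeat *stops arriving*, which catches crashed processes and broken cron jobs that an inbound check would miss. A minimal sender sketch, assuming Uptime Kuma's per-monitor push URL scheme; the base URL and token below are placeholders:

```python
import time
import urllib.request

def push_url(base: str, token: str, status: str = "up", msg: str = "OK") -> str:
    """Build the per-monitor heartbeat URL (token is generated by Uptime Kuma)."""
    return f"{base}/api/push/{token}?status={status}&msg={msg}"

def heartbeat_loop(base: str, token: str, interval_s: int = 60) -> None:
    """Hit the push URL on a fixed interval; a missed beat triggers the alert."""
    while True:
        try:
            urllib.request.urlopen(push_url(base, token), timeout=5)
        except OSError:
            pass  # failures simply show up as missed beats on the monitor
        time.sleep(interval_s)
```

Set the monitor's expected heartbeat interval slightly above `interval_s` so one slow request doesn't flap the alert.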
### Synthetic Monitoring
Scripted checks that simulate real user behavior from multiple regions:
```javascript
// k6 script for synthetic monitoring
import { check, sleep } from 'k6';
import http from 'k6/http';

export const options = {
  scenarios: {
    synthetic: {
      executor: 'constant-vus',
      vus: 1,
      duration: '24h',
      gracefulStop: '0s',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% under 500ms
    http_req_failed: ['rate<0.01'],   // < 1% failure rate
    checks: ['rate>0.99'],            // 99% of checks pass
  },
};

export default function () {
  // Check homepage
  let res = http.get('https://www.example.com');
  check(res, {
    'homepage status 200': (r) => r.status === 200,
    'homepage loads fast': (r) => r.timings.duration < 500,
    'homepage has title': (r) => r.body.includes('<title>'),
  });

  // Check API health
  res = http.get('https://api.example.com/health');
  check(res, {
    'api health 200': (r) => r.status === 200,
    'api reports ok': (r) => JSON.parse(r.body).status === 'ok',
  });

  // Check login flow
  res = http.post('https://api.example.com/auth/login', JSON.stringify({
    email: 'synthetic-user@example.com',
    password: __ENV.SYNTHETIC_PASSWORD, // k6 exposes env vars via __ENV
  }), { headers: { 'Content-Type': 'application/json' } });
  check(res, {
    'login succeeds': (r) => r.status === 200,
    'login returns token': (r) => JSON.parse(r.body).token !== undefined,
  });

  sleep(60); // Run the whole sequence once per minute
}
```
### Multi-Region Monitoring
| Provider | Regions | Free Tier | Notes |
|----------|---------|-----------|-------|
| **Uptime Kuma** | Self-hosted (1 region) | Free | Deploy in multiple regions yourself |
| **Betteruptime** | 10+ regions | 5 monitors | Status page included |
| **Grafana Synthetic** | 20+ regions | Part of Grafana Cloud | k6-based scripts |
| **Datadog Synthetic** | 100+ locations | 100 API tests/month | Full browser testing |
| **AWS CloudWatch Synthetics** | All AWS regions | Pay per run | Canary scripts |
---
## Infrastructure Metrics
### CPU Metrics
| Metric | Source | What It Shows |
|--------|--------|---------------|
| `node_cpu_seconds_total{mode="user"}` | node_exporter | Time in user space |
| `node_cpu_seconds_total{mode="system"}` | node_exporter | Time in kernel space |
| `node_cpu_seconds_total{mode="iowait"}` | node_exporter | Time waiting for I/O |
| `node_cpu_seconds_total{mode="idle"}` | node_exporter | Idle time |
| `node_load1` / `node_load5` / `node_load15` | node_exporter | Load average (1/5/15 min) |
**Common queries:**
```promql
# CPU usage percentage (all modes except idle)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# CPU usage by mode
sum by (mode) (rate(node_cpu_seconds_total{instance="web01:9100"}[5m]))
# IO wait percentage (high = disk bottleneck)
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))
# Load average vs CPU count
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
```
### Memory Metrics
| Metric | What It Shows |
|--------|---------------|
| `node_memory_MemTotal_bytes` | Total physical memory |
| `node_memory_MemAvailable_bytes` | Memory available for applications |
| `node_memory_Cached_bytes` | Page cache (reclaimable) |
| `node_memory_Buffers_bytes` | Buffer cache |
| `node_memory_SwapTotal_bytes` | Total swap |
| `node_memory_SwapFree_bytes` | Free swap |
```promql
# Memory usage percentage
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Memory breakdown
node_memory_MemTotal_bytes
- node_memory_MemAvailable_bytes
- node_memory_Cached_bytes
- node_memory_Buffers_bytes
# Swap usage (any swap usage may indicate memory pressure)
1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)
```
### Disk Metrics
```promql
# Disk usage percentage
1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)
# Disk I/O utilization (percentage of time doing I/O)
rate(node_disk_io_time_seconds_total[5m])
# Read/write throughput
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
# Average I/O latency
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
```
### Network Metrics
```promql
# Bandwidth (bytes/sec)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# TCP connections
node_netstat_Tcp_CurrEstab # Current established connections
rate(node_netstat_Tcp_ActiveOpens[5m]) # New outbound connections/sec
rate(node_netstat_Tcp_PassiveOpens[5m]) # New inbound connections/sec
```
---
## Container Metrics
### cAdvisor Metrics
| Metric | Description |
|--------|-------------|
| `container_cpu_usage_seconds_total` | Total CPU time consumed |
| `container_cpu_cfs_throttled_periods_total` | CPU throttling events |
| `container_memory_working_set_bytes` | Current memory (excludes cache) |
| `container_memory_usage_bytes` | Total memory (includes cache) |
| `container_network_receive_bytes_total` | Network inbound bytes |
| `container_network_transmit_bytes_total` | Network outbound bytes |
| `container_fs_usage_bytes` | Container filesystem usage |
| `container_spec_memory_limit_bytes` | Memory limit |
| `container_spec_cpu_quota` | CPU quota |
```promql
# Container CPU usage percentage (of limit)
sum by (container, pod) (
rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
) / sum by (container, pod) (
container_spec_cpu_quota / container_spec_cpu_period
)
# Container memory usage percentage (of limit)
container_memory_working_set_bytes{container!="POD",container!=""}
/
container_spec_memory_limit_bytes{container!="POD",container!=""} > 0
# CPU throttling percentage
sum by (container, pod) (
rate(container_cpu_cfs_throttled_periods_total[5m])
) / sum by (container, pod) (
rate(container_cpu_cfs_periods_total[5m])
)
# OOMKill detection
increase(kube_pod_container_status_restarts_total[1h]) > 0
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
```
### Kubernetes Metrics (kube-state-metrics)
```promql
# Pod status
kube_pod_status_phase{phase="Running"}
kube_pod_status_phase{phase="Pending"}
kube_pod_status_phase{phase="Failed"}
# Deployment replicas
kube_deployment_status_replicas_available
kube_deployment_spec_replicas
# HPA status
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas
```
---
## Node Exporter
### Setup
```yaml
# docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:v1.7.0
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
```
### Kubernetes DaemonSet
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      tolerations:
        - effect: NoSchedule
          operator: Exists
```
---
## APM Tools
### Comparison
| Feature | Datadog APM | New Relic | Elastic APM | Sentry |
|---------|-------------|-----------|-------------|--------|
| **Type** | Full APM | Full APM | Full APM | Error tracking + perf |
| **Pricing** | Per host ($31+/mo) | Per user + data | Free (self-host) or Cloud | Per event volume |
| **Traces** | Yes | Yes | Yes | Transaction traces |
| **Error tracking** | Yes | Yes | Yes | Excellent |
| **Profiling** | Yes (continuous) | Yes | No | No |
| **Log correlation** | Yes | Yes | Yes | Breadcrumbs |
| **Dashboards** | Built-in | Built-in | Kibana | Limited |
| **Setup** | Agent-based | Agent-based | Agent or OTel | SDK-based |
| **Best for** | Enterprise, full stack | Full observability | Self-hosted, ELK users | Error-focused teams |
### Sentry Error Tracking
```python
# Python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    traces_sample_rate=0.1,  # 10% of transactions
    profiles_sample_rate=0.1,
    environment="production",
    release="1.4.2",
    integrations=[FastApiIntegration()],
)
```
```javascript
// Node.js
const Sentry = require('@sentry/node');
Sentry.init({
  dsn: 'https://key@sentry.io/project',
  tracesSampleRate: 0.1,
  environment: 'production',
  release: '1.4.2',
});
```
```go
// Go
import "github.com/getsentry/sentry-go"
sentry.Init(sentry.ClientOptions{
    Dsn:              "https://key@sentry.io/project",
    TracesSampleRate: 0.1,
    Environment:      "production",
    Release:          "1.4.2",
})
defer sentry.Flush(2 * time.Second)
```
---
## Cost Optimization
### Metric Cardinality Review
High cardinality is the most common cost driver in metrics systems:
```promql
# Find metrics with the most time series
topk(20, count by (__name__) ({__name__=~".+"}))
# Find labels with high cardinality
count(group by (path) (http_requests_total)) # How many unique paths?
count(group by (user_id) (api_calls_total)) # Unbounded!
```
**Reduction strategies:**
1. Remove unused metrics (if nobody dashboards/alerts on it, drop it)
2. Replace high-cardinality labels with bounded categories
3. Use recording rules to pre-aggregate, drop raw metrics
4. Use metric relabeling in Prometheus to drop at scrape time
```yaml
# Drop unused metrics at scrape time
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_.*"   # Drop Go runtime metrics if unused
    action: drop
```
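Strategy 2 in practice usually means templating request paths before they become label values. An illustrative normalizer (the regexes cover only numeric IDs and UUIDs; extend them for your own route shapes):

```python
import re

# Collapse unbounded URL paths into bounded route templates before using
# them as a metric label value (e.g. /users/123 → /users/:id).
UUID_SEG = re.compile(r"/[0-9a-f]{8}-[0-9a-f-]{27}")
NUM_SEG = re.compile(r"/\d+")

def normalize_path(path: str) -> str:
    path = UUID_SEG.sub("/:uuid", path)  # UUIDs first (they contain digits)
    return NUM_SEG.sub("/:id", path)

print(normalize_path("/users/123/orders/456"))  # /users/:id/orders/:id
```

With this in place, the `path` label's cardinality is bounded by the number of routes, not the number of users.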
### Log Volume Reduction
| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Set production to INFO | 50-80% | Logger config |
| Sample health check logs | 90% for /health | Middleware filter |
| Truncate large payloads | 20-40% | Body size limit (4KB) |
| Drop duplicate errors | 30-50% | Rate-limit per error type |
| Compress in transit | 60-80% bandwidth | Enable gzip on log shipper |
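The "sample health check logs" row can be implemented as a stdlib `logging` filter. A sketch assuming a uvicorn-style access logger; the logger name and match-on-message approach are simplifications for illustration:

```python
import logging
import random

class HealthSampleFilter(logging.Filter):
    """Keep ~`keep` fraction of access-log records mentioning `path`;
    pass all other records through untouched."""

    def __init__(self, path: str = "/health", keep: float = 0.1):
        super().__init__()
        self.path, self.keep = path, keep

    def filter(self, record: logging.LogRecord) -> bool:
        if self.path in record.getMessage():
            return random.random() < self.keep  # drop ~90% of /health lines
        return True

# Attach to whatever logger emits access lines (name assumed here):
logging.getLogger("uvicorn.access").addFilter(HealthSampleFilter())
```

Sampling at the application beats filtering at the log shipper: the dropped lines never consume bandwidth or ingestion quota.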
### Trace Sampling
| Sampling Rate | Monthly Cost (est.) | Suitability |
|---------------|---------------------|-------------|
| 100% | $$$$ | Development, < 100 req/s |
| 10% | $$$ | Staging, medium traffic |
| 1% | $$ | Production, high traffic |
| Tail-based (errors + slow) | $$ | Production (recommended) |
| 0.1% | $ | Very high traffic (> 100k req/s) |
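The recommended row can be approximated at the SDK level with a rule like "always keep errors and slow requests, sample the rest". The thresholds and function below are assumptions, not any vendor's API; true tail-based sampling runs in the collector after the whole trace completes:

```python
import random

def keep_trace(status_code: int, duration_ms: float, base_rate: float = 0.01) -> bool:
    """Decide whether to keep a trace: errors and slow requests always,
    everything else at base_rate (1% by default)."""
    if status_code >= 500:   # always keep errors
        return True
    if duration_ms > 1000:   # always keep slow requests (> 1s)
        return True
    return random.random() < base_rate
```

This preserves nearly all debugging value (the interesting traces) at roughly the cost of the 1% tier.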
### Retention Tiers
| Tier | Metrics | Logs | Traces |
|------|---------|------|--------|
| Hot (0-14 days) | 15s resolution | Full fidelity | All sampled traces |
| Warm (14-90 days) | 1m resolution | Full fidelity | Error + slow traces only |
| Cold (90 days - 1 year) | 5m resolution | Compressed | None (rely on metrics) |
| Archive (1-7 years) | 1h resolution | Compliance logs only | None |
---
## Capacity Planning
### Load Testing Correlation
Run load tests while monitoring infrastructure metrics to establish scaling thresholds:
```
Load Test Results:
┌──────┬──────────┬───────┬───────┬────────────┐
│ RPS  │ p99 (ms) │ CPU % │ Mem % │ Error Rate │
├──────┼──────────┼───────┼───────┼────────────┤
│  100 │       45 │    15 │    30 │ 0%         │
│  500 │       85 │    35 │    45 │ 0%         │
│ 1000 │      150 │    55 │    55 │ 0%         │
│ 2000 │      320 │    75 │    65 │ 0.1%       │
│ 3000 │      850 │    90 │    72 │ 1.5%       │ ← degradation
│ 4000 │     2500 │    98 │    78 │ 12%        │ ← failure
└──────┴──────────┴───────┴───────┴────────────┘
Scaling trigger: 75% CPU → add instance
Target capacity: 2x expected peak traffic
```
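These thresholds translate into a simple sizing formula: provision enough instances that `headroom × peak RPS` stays below each instance's degradation point. A rough sketch using the table's numbers (2,000 RPS per instance, reached at 75% CPU); the function name and defaults are illustrative:

```python
import math

def required_instances(peak_rps: float, safe_rps_per_instance: float,
                       headroom: float = 2.0, min_instances: int = 2) -> int:
    """Instances needed to serve headroom × expected peak while each
    instance stays below its measured degradation point."""
    return max(min_instances, math.ceil(peak_rps * headroom / safe_rps_per_instance))

# Expected peak of 3000 RPS, 2000 RPS safe per instance, 2x headroom:
print(required_instances(3000, 2000))  # 3
```

The `min_instances` floor guards against scaling to a single instance at low traffic, which would remove redundancy.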
### Scaling Triggers
```yaml
# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale up at 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"     # Scale at 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50                # Max 50% increase per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
```
### Resource Forecasting
```promql
# Predict disk exhaustion: fires if the 7-day trend hits zero within 30 days
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0
# Predict memory usage trend
predict_linear(
avg_over_time(container_memory_working_set_bytes[7d]),
30*24*3600
)
# Growth rate of database size
rate(pg_database_size_bytes[7d])
# Convert to "GB per month"
rate(pg_database_size_bytes[7d]) * 86400 * 30 / 1e9
```
---
## Incident Response
### Incident Lifecycle
```
Detection → Triage → Mitigate → Resolve → Postmortem
    │          │          │          │          │
    │          │          │          │          └─ Blameless review
    │          │          │          └─ Root cause fix deployed
    │          │          └─ User impact reduced/eliminated
    │          └─ Severity assigned, team engaged
    └─ Alert fires or user reports issue
```
### Severity Classification
| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|---------|
| **SEV1 (Critical)** | Service down, data loss, security breach | < 15 minutes | Complete outage, payment processing failure |
| **SEV2 (Major)** | Significant degradation, partial outage | < 30 minutes | One region down, 50%+ error rate |
| **SEV3 (Minor)** | Limited impact, workaround exists | < 4 hours | Single feature broken, elevated latency |
| **SEV4 (Low)** | Minimal impact, cosmetic | Next business day | UI glitch, non-critical alert firing |
### Incident Commander Checklist
```markdown
## Initial Response (first 15 minutes)
- [ ] Acknowledge the alert / report
- [ ] Assess severity (SEV1-4)
- [ ] Open incident channel (#inc-YYYYMMDD-description)
- [ ] Page relevant team members
- [ ] Post initial status update
## Triage (15-30 minutes)
- [ ] Identify affected services and scope
- [ ] Check recent deployments: any changes in last 2 hours?
- [ ] Check dashboards for anomalies
- [ ] Check external dependencies (status pages)
- [ ] Determine if rollback is feasible
## Mitigation
- [ ] Implement immediate fix (rollback, feature flag, scaling)
- [ ] Verify user impact is reduced
- [ ] Update status page
- [ ] Communicate ETA for full resolution
## Resolution
- [ ] Confirm root cause
- [ ] Deploy fix
- [ ] Verify metrics return to baseline
- [ ] Clear incident status
- [ ] Schedule postmortem within 48 hours
```
### Postmortem Template
```markdown
# Incident Postmortem: [TITLE]
**Date:** 2026-03-09
**Duration:** 45 minutes (14:15 - 15:00 UTC)
**Severity:** SEV2
**Author:** [Name]
**Status:** Complete
## Summary
One-paragraph description of what happened and impact.
## Impact
- Users affected: ~5,000
- Revenue impact: ~$2,500
- SLO budget consumed: 45 minutes, exhausting the monthly 43-minute error budget
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:12 | Deploy v1.4.3 to production |
| 14:15 | Error rate alert fires (5% → 15%) |
| 14:17 | On-call acknowledges, starts investigation |
| 14:22 | Root cause identified: new query missing index |
| 14:25 | Decision: rollback v1.4.3 |
| 14:30 | Rollback complete |
| 14:35 | Error rate returns to baseline |
| 15:00 | All-clear declared |
## Root Cause
The v1.4.3 deployment added a new API endpoint that queried the orders
table without an index on `user_id + created_at`. Under load, this caused
connection pool exhaustion, which cascaded to other endpoints.
## Detection
Alert fired 3 minutes after deploy. Detection was effective.
## Contributing Factors
1. No load test for the new endpoint
2. Missing index not caught in code review
3. No query performance checks in CI
## Action Items
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Add index on orders(user_id, created_at) | @backend | 2026-03-10 | Done |
| Add slow query detection to CI pipeline | @platform | 2026-03-15 | TODO |
| Add load test for new endpoints to deploy checklist | @backend | 2026-03-12 | TODO |
| Set up query performance alerting (> 100ms avg) | @sre | 2026-03-14 | TODO |
## Lessons Learned
- What went well: Fast detection (3 min), fast rollback (8 min)
- What went poorly: No pre-production load test caught the issue
- Where we got lucky: Happened during business hours, not at 3 AM
```
### Communication During Incidents
| Audience | Channel | Frequency | Content |
|----------|---------|-----------|---------|
| Engineering | Slack #incident | Real-time | Technical details, commands run |
| Management | Slack #incidents-summary | Every 15-30 min | Impact, ETA, escalation needs |
| Customers | Status page | Every 15-30 min | User-facing impact, workarounds |
| Support | Slack #support-escalation | On status change | Scripted responses, known workarounds |