---
name: monitoring-ops
description: "Observability patterns - metrics, logging, tracing, alerting, and infrastructure monitoring. Use for: monitoring, observability, prometheus, grafana, metrics, alerting, structured logging, distributed tracing, opentelemetry, SLO, SLI, dashboard, health check, loki, jaeger, datadog, pagerduty."
license: MIT
allowed-tools: "Read Write Bash"
metadata:
  author: claude-mods
  related-skills: python-observability-ops, docker-ops, ci-cd-ops, nginx-ops
---

# Monitoring Operations

Comprehensive observability patterns covering the three pillars (metrics, logging, tracing), alerting strategies, dashboard design, and infrastructure monitoring for production systems.

---

## Three Pillars Quick Reference

Use this table to decide which observability signal fits your need:

| Pillar | Best For | Tools | Data Type |
|--------|----------|-------|-----------|
| **Metrics** | Aggregated numeric measurements, trends, alerting on thresholds | Prometheus, Datadog, CloudWatch, StatsD | Time-series (numeric) |
| **Logs** | Discrete events, error details, audit trails, debugging context | Loki, ELK, CloudWatch Logs, Fluentd | Unstructured/structured text |
| **Traces** | Request flow across services, latency breakdown, dependency mapping | Jaeger, Tempo, Zipkin, Datadog APM | Span trees (structured) |

**When to use which:**

- **"How many requests per second?"** → Metrics (counter + rate)
- **"Why did this specific request fail?"** → Logs (error message + stack trace)
- **"Where is the latency in this request?"** → Traces (span waterfall)
- **"Is the system healthy right now?"** → Metrics (gauges + alerts)
- **"What happened at 3:42 AM?"** → Logs (timestamped event search)
- **"Which downstream service caused the timeout?"** → Traces (span analysis)

**Correlation is key:** Connect all three by embedding `trace_id` in log entries, recording exemplars in metrics, and linking trace spans to log queries.

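
As a rough illustration of the first correlation step (embedding `trace_id` in log entries), here is a minimal sketch using only the Python standard library. The names `current_trace_id`, `JsonTraceFormatter`, and `build_logger` are hypothetical, and the hard-coded service name is an assumption. A real service would normally read the trace ID from its tracing library rather than from a hand-rolled context variable.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the active trace ID (assumption:
# a real service would obtain this from its tracing library instead).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "payment-api",  # assumed service name for the demo
            "trace_id": current_trace_id.get(),
        }
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("correlation-demo")
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep demo output out of the root logger
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger.addHandler(handler)
    return logger


# Usage: set the trace ID once per request, then log as usual.
buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("Failed to process payment")
logged = json.loads(buffer.getvalue())
```

Because the formatter reads the ID from a context variable, application code does not have to pass it to every logging call. The same idea extends to `span_id` and `request_id`.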

---

## Metrics Type Decision Tree

Use this tree to select the correct metric type:

```
What are you measuring?
│
├─ A count of events that only goes up?
│  └─ COUNTER
│     Examples: http_requests_total, errors_total, bytes_sent_total
│     Use rate() or increase() to get per-second or per-interval values
│     Never use a counter's raw value; it resets on restart
│
├─ A current value that goes up AND down?
│  └─ GAUGE
│     Examples: temperature_celsius, active_connections, queue_depth
│     Use for snapshots of current state
│     Can use avg_over_time(), max_over_time() for trends
│
├─ A distribution of values (latency, size)?
│  │
│  ├─ Need aggregatable quantiles across instances?
│  │  └─ HISTOGRAM
│  │     Examples: http_request_duration_seconds, response_size_bytes
│  │     Define buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
│  │     Use histogram_quantile() for percentiles (p50, p95, p99)
│  │     Aggregatable across instances (histograms can be summed)
│  │
│  └─ Need pre-calculated quantiles on a single instance?
│     └─ SUMMARY
│        Examples: go_gc_duration_seconds
│        Pre-calculates quantiles client-side
│        NOT aggregatable across instances
│        Prefer histogram unless you have a specific reason
│
└─ None of the above?
   └─ INFO metric (labels only, value=1)
      Examples: build_info{version="1.2.3", commit="abc123"}
      Use for metadata exposed as metrics
```

**Rule of thumb:** Start with counters and histograms. Add gauges for current state. Avoid summaries unless you have a compelling reason.

---

## Alerting Decision Tree

```
What type of alert do you need?
│
├─ Known threshold with a fixed boundary?
│  └─ THRESHOLD-BASED
│     Example: CPU > 90% for 5 minutes
│     Pros: Simple, predictable, easy to understand
│     Cons: Requires manual tuning, doesn't adapt to patterns
│     Best for: Resource limits, error rate spikes, queue depth
│
├─ Normal behavior varies by time/season?
│  └─ ANOMALY-BASED
│     Example: Traffic 3 standard deviations below normal for this hour
│     Pros: Adapts to patterns, catches novel failures
│     Cons: Noisy during transitions, requires training data
│     Best for: Traffic patterns, business metrics, gradual degradation
│
└─ Defined reliability targets?
   └─ SLO-BASED (PREFERRED)
      Example: Error budget burn rate > 14.4x for 1 hour
      Pros: Aligned with user impact, reduces noise, principled
      Cons: Requires SLI/SLO definition, more complex setup
      Best for: User-facing services, platform reliability
```

### Severity Levels

| Severity | Response | Examples | Routing |
|----------|----------|----------|---------|
| **Critical (P1)** | Page on-call immediately | Service down, data loss risk, security breach | PagerDuty high-urgency, phone call |
| **Warning (P2)** | Investigate within hours | Elevated error rate, disk 80% full, SLO burn rate elevated | PagerDuty low-urgency, Slack alert channel |
| **Info (P3)** | Review next business day | Deployment completed, certificate expiring in 30 days | Slack info channel, ticket auto-created |

### When to Page vs When to Ticket

**Page (wake someone up) when:**

- Users are currently impacted
- Data loss is occurring or imminent
- Security incident is active
- Error budget will be exhausted within hours

**Create ticket (don't page) when:**

- Issue is not user-facing yet
- Automated remediation is possible
- Degradation is slow and has runway
- Issue is during business hours and can be triaged normally

---

## Structured Logging Quick Reference

### Standard JSON Log Format

```json
{
  "timestamp": "2026-03-09T14:32:01.123Z",
  "level": "ERROR",
  "message": "Failed to process payment",
  "service": "payment-api",
  "version": "1.4.2",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req-abc123",
  "user_id": "usr-789",
  "error": {
    "type": "PaymentGatewayTimeout",
    "message": "Gateway response timeout after 30s",
    "stack": "..."
  },
  "duration_ms": 30042,
  "http": {
    "method": "POST",
    "path": "/api/v1/payments",
    "status_code": 504
  }
}
```

### Log Level Decision Guide

| Level | When to Use | Examples |
|-------|-------------|---------|
| **DEBUG** | Development only, verbose internal state | Variable values, SQL queries, cache hits/misses |
| **INFO** | Normal operations worth recording | Request completed, job started/finished, config loaded |
| **WARN** | Degraded but still functioning | Retry succeeded, fallback used, approaching limit |
| **ERROR** | Operation failed, needs attention | Payment failed, API call error, constraint violation |
| **FATAL** | Process cannot continue, must exit | Database unreachable at startup, invalid config, OOM |

**Rules:**

- Never log at ERROR for expected conditions (user input validation → WARN)
- Every ERROR should be actionable; if no one will act on it, use WARN
- DEBUG should be off in production by default
- INFO should not be noisy: 1-5 log lines per request, not 50

### Correlation IDs

- Generate a `request_id` (UUID v4 or ULID) at the edge/gateway
- Propagate through all internal services via headers (`X-Request-ID`)
- Include `trace_id` and `span_id` from distributed tracing
- Log all three IDs in every log entry for cross-referencing

---

## Distributed Tracing Quick Reference

### Core Concepts

- **Trace:** End-to-end journey of a request across all services
- **Span:** A single unit of work (HTTP call, DB query, function execution)
- **Context propagation:** Passing trace/span IDs between services via headers

### W3C TraceContext Header

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  │                                │                └─ flags (01=sampled)
             │  │                                └─ parent span ID (16 hex)
             │  └─ trace ID (32 hex)
             └─ version (00)
```

### Sampling Strategies

| Strategy | How It Works | Use When |
|----------|--------------|----------|
| **Head-based (ratio)** | Decide at trace start, propagate decision | Low traffic, need predictable volume |
| **Always-on** | Sample everything | Development, low-traffic services |
| **Parent-based** | Follow parent's sampling decision | Default for most services |
| **Tail-based** | Decide after trace completes (at Collector) | Need error/slow traces, high traffic |

**Recommendation:** Use parent-based + tail-based at the Collector. This captures all error traces and slow traces while controlling volume.

### Trace ID in Logs

Always include `trace_id` in structured log entries. This enables jumping from a log line to the full trace view:

```
Log entry → trace_id → Jaeger/Tempo → full request waterfall
```

---

## Tool Selection Matrix

| Feature | Prometheus + Grafana | Datadog | Grafana Cloud | CloudWatch |
|---------|---------------------|---------|---------------|------------|
| **Cost** | Free (infra costs) | $$$$ (per host/metric) | $$ (usage-based) | $$ (AWS-native) |
| **Setup complexity** | High (self-managed) | Low (SaaS agent) | Medium (managed) | Low (AWS-native) |
| **Metrics** | Prometheus (excellent) | Built-in (excellent) | Mimir (excellent) | Built-in (good) |
| **Logs** | Loki (good) | Built-in (excellent) | Loki (good) | CloudWatch Logs (good) |
| **Traces** | Jaeger/Tempo (good) | APM (excellent) | Tempo (good) | X-Ray (adequate) |
| **Alerting** | Alertmanager (good) | Built-in (excellent) | Grafana Alerting (good) | CloudWatch Alarms (adequate) |
| **Dashboards** | Grafana (excellent) | Built-in (excellent) | Grafana (excellent) | Dashboards (adequate) |
| **Retention** | Configurable (unlimited) | 15 months default | Configurable | Up to 15 months |
| **Multi-cloud** | Yes | Yes | Yes | AWS only |
| **Best for** | Cost-conscious, control | Full-featured, enterprise | Open-source + managed | AWS-native shops |

**Recommendation path:**

- **Starting out / budget-conscious:** Prometheus + Grafana + Loki + Tempo (all free, self-hosted)
- **Small team, want managed:** Grafana Cloud free tier (10k metrics, 50GB logs, 50GB traces)
- **Enterprise, need everything:** Datadog
  (expensive but comprehensive)
- **AWS-only shop:** CloudWatch + X-Ray (simplest if already on AWS)

---

## Dashboard Design

### USE Method (Infrastructure)

For every resource (CPU, memory, disk, network):

| Signal | Question | Metric Example |
|--------|----------|----------------|
| **Utilization** | How busy is it? | `node_cpu_seconds_total` (% busy) |
| **Saturation** | How overloaded is it? | `node_load1` (run queue length) |
| **Errors** | Are there error events? | `node_network_receive_errs_total` |

### RED Method (Services)

For every service endpoint:

| Signal | Question | Metric Example |
|--------|----------|----------------|
| **Rate** | How many requests per second? | `rate(http_requests_total[5m])` |
| **Errors** | How many are failing? | `rate(http_requests_total{status=~"5.."}[5m])` |
| **Duration** | How long do they take? | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |

### Four Golden Signals (Google SRE)

| Signal | What to Measure | Alert Threshold Guidance |
|--------|-----------------|--------------------------|
| **Latency** | Time to serve a request (distinguish success vs error latency) | p99 > 2x baseline |
| **Traffic** | Demand on the system (requests/sec, sessions, transactions) | Anomaly detection |
| **Errors** | Rate of failed requests (explicit 5xx, implicit policy violations) | > 0.1% of traffic |
| **Saturation** | How "full" the service is (CPU, memory, queue depth) | > 80% capacity |

### Dashboard Layout Best Practices

1. **Top row:** Key health indicators (error rate, latency p99, availability %)
2. **Second row:** Traffic and throughput (requests/sec, active users)
3. **Third row:** Resource utilization (CPU, memory, disk, network)
4. **Bottom rows:** Detailed breakdowns (by endpoint, by status code, by region)
5. **Use variables:** Service, environment, time range as dropdown selectors
6. **Include annotations:** Deployments, incidents, config changes as vertical markers

---

## Common Gotchas

| Gotcha | Why It Happens | Fix |
|--------|----------------|-----|
| **Cardinality explosion** | Using unbounded label values (user ID, request path, query string) | Use bounded labels only; aggregate high-cardinality data in logs, not metrics |
| **Alert fatigue** | Too many alerts, too sensitive thresholds, alerts on non-actionable symptoms | Require runbook for every alert; tune thresholds; use SLO-based alerting |
| **Missing correlation IDs** | Logs, metrics, and traces not linked together | Include trace_id in all log entries; use exemplars in metrics |
| **Sampling bias** | Head-based sampling decides before the outcome is known, so low sampling ratios discard most error/slow traces | Use tail-based sampling at the Collector to always capture errors and slow traces |
| **Log volume costs** | DEBUG or verbose INFO in production, logging full request/response bodies | Set production to INFO minimum; truncate large payloads; use sampling for verbose paths |
| **Metric naming inconsistency** | Different teams use different naming conventions | Adopt Prometheus/OpenMetrics naming: `namespace_subsystem_name_unit`, plus a type suffix such as `_total` for counters (e.g., `http_server_request_duration_seconds`) |
| **Dashboard sprawl** | Everyone creates dashboards, nobody maintains them | Standardize with USE/RED templates; review quarterly; delete unused dashboards |
| **SLO too aggressive** | Setting 99.99% availability without the budget or architecture for it | Start with 99.5% or 99.9%; tighten only when consistently meeting targets with margin |
| **Missing baseline** | Alerting on absolute thresholds without understanding normal behavior | Collect 2-4 weeks of baseline data before setting alert thresholds |
| **Over-instrumentation** | Instrumenting every function, creating too many spans/metrics | Instrument at service boundaries; use auto-instrumentation for HTTP/DB/gRPC; add manual spans selectively |
| **Ignoring metric staleness** | Assuming a metric that stops reporting means zero | Use `absent()` or `up == 0` to detect missing series and failed scrapes; distinguish "zero" from "not reporting" |
| **Alerting on cause not symptom** | Alerting on CPU usage instead of user-facing error rate | Alert on symptoms (error rate, latency); use cause metrics (CPU, memory) for investigation |
| **No retention policy** | Storing all metrics/logs at full resolution forever | Define retention tiers: 15s resolution for 2 weeks, 1m for 3 months, 5m for 1 year |
| **Dashboard without context** | Graphs with no units, no description, no threshold lines | Add units to Y-axis, threshold lines for SLOs, panel descriptions explaining what "good" looks like |

---

## Reference Files

| File | Contents | Lines |
|------|----------|-------|
| [metrics-alerting.md](references/metrics-alerting.md) | Prometheus, Grafana, OpenTelemetry metrics, SLI/SLO/SLA, alert routing, runbooks, uptime monitoring | ~650 |
| [logging.md](references/logging.md) | Structured logging, log levels, correlation IDs, aggregation (Loki, ELK), retention, PII masking, language-specific | ~550 |
| [tracing.md](references/tracing.md) | OpenTelemetry, spans, context propagation, sampling, Jaeger, async tracing, DB/HTTP/gRPC instrumentation | ~600 |
| [infrastructure.md](references/infrastructure.md) | Health checks, K8s probes, Docker HEALTHCHECK, infra metrics, APM, cost optimization, incident response | ~550 |

---

## See Also

- **docker-ops**: Container monitoring with cAdvisor, Docker stats, and health checks
- **ci-cd-ops**: Pipeline observability, deployment tracking, build metrics
- **nginx-ops**: Nginx access/error log parsing, request metrics, upstream monitoring
- **python-observability-ops**: Python-specific instrumentation with structlog, opentelemetry-python
- [OpenTelemetry documentation](https://opentelemetry.io/docs/)
- [Prometheus best practices](https://prometheus.io/docs/practices/)
- [Google SRE Book: Monitoring chapter](https://sre.google/sre-book/monitoring-distributed-systems/)
- [Grafana dashboards library](https://grafana.com/grafana/dashboards/)