SKILL.md 12 KB


name: debug-ops description: "Systematic debugging methodology, language-specific debuggers, and common scenario playbooks. Use for: debug, debugging, bug, crash, hang, memory leak, race condition, deadlock, bisect, reproduce, root cause, breakpoint, profiling, performance issue, segfault, stack trace, core dump." allowed-tools: "Read Write Bash"

related-skills: [testing-ops, security-ops, monitoring-ops, code-stats]

Debug Operations

Systematic debugging methodology with language-specific tooling and common scenario playbooks.

Bug Classification Decision Tree

Bug Report / Symptom
│
├─ Crash
│  ├─ Segfault / Access Violation
│  │  └─ Check: null pointer, buffer overflow, use-after-free, stack overflow
│  ├─ Panic / Fatal Error
│  │  └─ Check: assertion failure, unrecoverable state, out-of-memory
│  └─ Unhandled Exception
│     └─ Check: missing error handler, unexpected input type, network failure
│
├─ Hang
│  ├─ Deadlock
│  │  └─ Check: lock ordering, mutex contention, channel blocking
│  ├─ Infinite Loop
│  │  └─ Check: loop termination condition, counter overflow, recursive call
│  └─ Blocked I/O
│     └─ Check: network timeout, DNS resolution, disk full, file lock
│
├─ Wrong Output
│  ├─ Logic Error
│  │  └─ Check: operator precedence, boundary conditions, boolean logic
│  ├─ Data Corruption
│  │  └─ Check: concurrent mutation, encoding mismatch, truncation
│  └─ Off-by-One
│     └─ Check: loop bounds, array indexing, fence-post errors
│
├─ Performance
│  ├─ Slow Queries
│  │  └─ Check: missing index, N+1 queries, full table scan, lock wait
│  ├─ Memory Bloat
│  │  └─ Check: cache without eviction, leaked references, large allocations
│  └─ CPU Spikes
│     └─ Check: hot loops, regex backtracking, excessive GC, busy-wait
│
└─ Intermittent
   ├─ Race Condition
   │  └─ Check: shared mutable state, read-modify-write, check-then-act
   ├─ Timing-Dependent
   │  └─ Check: timeout values, clock skew, event ordering assumptions
   └─ Environment-Specific
      └─ Check: OS differences, locale, timezone, file system case sensitivity

Systematic Debugging Workflow

Six-step process from symptom to prevention:

Step 1: Reproduce

Confirm the bug exists and create a reliable reproduction. A bug you cannot reproduce is a bug you cannot confidently fix. Capture exact inputs, environment, and sequence of operations.

Step 2: Isolate

Narrow the fault to the smallest possible scope. Use binary search (git bisect, commenting out code halves), stubs, feature flags, and environment isolation to eliminate innocent code.

Step 3: Identify

Find the root cause, not just the proximate trigger. Use the 5 Whys technique, trace execution, inspect state at key points. Distinguish between the symptom and the underlying defect.

Step 4: Fix

Apply the minimal correct change that addresses the root cause. Avoid shotgun debugging (changing multiple things at once). Understand why the fix works, not just that it works.

Step 5: Verify

Confirm the fix resolves the original issue without introducing regressions. Re-run the original reproduction case. Run the full test suite. Test edge cases related to the fix.

Step 6: Prevent

Add a regression test. Update documentation or runbooks if applicable. Consider whether the same class of bug could exist elsewhere. Share findings with the team.

Reproduction Checklist

[ ] Minimal reproduction steps documented (numbered, unambiguous)
[ ] Environment captured (OS, runtime version, dependencies, config)
[ ] Exact inputs recorded (request payload, CLI args, file contents)
[ ] Timing sensitivity assessed (does it fail only under load? after delay?)
[ ] Single-threaded reproduction attempted (eliminates concurrency noise)
[ ] Reproduction automated as script or test case
[ ] Confirmed reproduction is deterministic (fails N/N attempts)
[ ] Identified whether reproduction requires specific data/state

Isolation Techniques Quick Reference

Technique Method Best For
Binary search (git) git bisect start BAD GOOD then git bisect run ./test.sh Finding which commit introduced the bug
Binary search (code) Comment out half the code, test, repeat Narrowing fault location in unfamiliar code
Stubs/Mocks Replace dependencies with known-good fakes Isolating from external services
Feature flags Toggle features off one by one Finding which feature causes the issue
Environment isolation Docker container, fresh VM, clean install Eliminating environment contamination
Network interception mitmproxy, Charles Proxy, mock server Isolating client vs server issues
Input reduction Remove input fields/data until bug disappears Finding minimal trigger
Dependency pinning Lock all deps, update one at a time Finding breaking dependency update

Root Cause Analysis Template

5 Whys Example

Problem: API returns 500 error on user login

1. Why? → The database query throws a timeout exception
2. Why? → The users table scan takes >30 seconds
3. Why? → There is no index on the email column
4. Why? → The migration that adds the index was never run in production
5. Why? → The deployment script skips migrations when the --fast flag is used

Root cause: Deployment script's --fast flag bypasses migrations
Fix: Remove --fast flag behavior that skips migrations, add migration check to health endpoint
Prevention: CI check that verifies all migrations are applied after deployment

Fault Tree Basics

                    [System Failure]
                    /              \
            [Hardware]          [Software]
            /       \           /        \
       [Disk]    [Memory]  [Config]   [Code Bug]
                              |          |
                         [Missing    [Race in
                          env var]    worker pool]

Work from the top (observed failure) down to leaves (root causes). Each branch is an AND/OR gate -- AND means all children must be true, OR means any one child suffices.

Language-Specific Debugger Quick Reference

Language Tool Launch Command Key Commands
Node.js Chrome DevTools node --inspect-brk app.js Open chrome://inspect, set breakpoints in Sources
Node.js ndb npx ndb app.js Enhanced DevTools with blackboxing
Python pdb python -m pdb script.py n next, s step, c continue, p expr print, bt backtrace
Python debugpy python -m debugpy --listen 5678 --wait-for-client script.py VS Code "Attach" launch config
Python breakpoint() Insert breakpoint() in code Drops into pdb at that line (Python 3.7+)
Go Delve dlv debug ./cmd/server b main.go:42 break, c continue, n next, p var print
Go Delve (test) dlv test ./pkg/... Debug test functions directly
Go Delve (attach) dlv attach PID Debug running process
Rust rust-gdb rust-gdb target/debug/myapp b main, r, n, p variable, bt
Rust rust-lldb rust-lldb target/debug/myapp b s main, r, n, p variable, bt
Rust CodeLLDB VS Code extension GUI breakpoints, variable inspection
Browser DevTools F12 or Ctrl+Shift+I Elements, Console, Network, Sources, Performance, Memory

Quick Debug Snippets

// Node.js: drop into debugger at this point
debugger;

// Node.js: conditional breakpoint
if (user.id === 'problem-user') debugger;
# Python: drop into debugger at this point
breakpoint()

# Python: conditional breakpoint
if user_id == 'problem-user':
    breakpoint()
// Go: print goroutine stacks (send SIGQUIT or SIGABRT)
// kill -QUIT <pid>
// Or in code:
import "runtime/debug"
debug.PrintStack()
// Rust: enable full backtraces
// RUST_BACKTRACE=1 cargo run
// RUST_BACKTRACE=full cargo run

Log-Based Debugging Patterns

Strategic Logging

Place logs at decision points, not just error paths:

[ENTRY] function_name(args_summary)     -- entering the function
[STATE] key_variable=value              -- state at critical decision point
[BRANCH] taking path X because Y       -- which branch and why
[EXIT] function_name -> result_summary  -- leaving the function
[ERROR] operation failed: detail        -- error with context

Correlation IDs

Trace a single request across services:

# Generate at entry point, propagate through all calls
X-Request-ID: 550e8400-e29b-41d4-a716-446655440000

# Search across all service logs
rg "550e8400-e29b-41d4-a716-446655440000" /var/log/services/

Timeline Reconstruction

# Merge and sort logs from multiple sources by timestamp
sort -t' ' -k1,2 service-a.log service-b.log service-c.log > timeline.log

# Find gaps in activity (potential hang/block)
awk '{print $1, $2}' timeline.log | uniq -c | sort -rn | head -20

Structured Log Queries

# jq queries on JSON logs
# Find all errors for a specific user
jq 'select(.level == "error" and .user_id == "u123")' app.log

# Get timing distribution for slow requests
jq 'select(.duration_ms > 1000) | .duration_ms' app.log | sort -n

# Count errors by type
jq -r 'select(.level == "error") | .error_type' app.log | sort | uniq -c | sort -rn

Common Gotchas

Gotcha Why It Hurts Fix
Fixing symptoms, not root cause Bug resurfaces in a different form Use 5 Whys to dig deeper
Debugging in production without safety net Risk of data loss or extended outage Use read-only queries, feature flags, canary deploys
Heisenbug (disappears under observation) Adding logging/breakpoints changes timing Use non-invasive tools: strace, sampling profiler, rr
Assumption bias ("it can't be X") Skipping the actual cause because you trust it Test every assumption explicitly, even "obvious" ones
Missing reproduction case Cannot verify fix, cannot prevent regression Invest time upfront in reliable reproduction
Over-relying on print/log debugging Slow iteration, pollutes code, misses concurrency bugs Use proper debugger, profiler, or tracing tool
Not checking recent changes The answer is often in the last few commits git log --oneline -20, git diff HEAD~5
Ignoring warning messages Warnings often predict the error that follows Treat warnings as errors during debugging
Debugging wrong version/branch Wasting time on already-fixed or different code Verify git branch, git log -1, runtime version
Not reading the full stack trace Root cause is often in the middle, not the top Read bottom-up: find your code in the trace first
Changing multiple things at once Cannot tell which change fixed (or broke) it One change per test cycle
Not capturing the "before" state Cannot diff against working baseline Snapshot config, deps, data before debugging

Reference Files

File Contents Lines
references/systematic-methods.md Scientific method, binary search, delta debugging, differential debugging, time-travel debugging, team debugging ~600
references/tool-specific.md Browser DevTools, Node.js, Python, Go, Rust, database, network, Docker debugging tools ~650
references/common-scenarios.md Memory leaks, deadlocks, race conditions, performance regressions, API debugging, deployment issues ~550

See Also

  • testing-ops -- Write tests to prevent bugs from recurring
  • security-ops -- Security-specific debugging (auth failures, injection, CSRF)
  • monitoring-ops -- Production observability, alerting, dashboards
  • code-stats -- Measure code complexity and identify bug-prone areas
  • container-orchestration -- Docker and Kubernetes debugging context
  • git-ops -- Git bisect workflow and history investigation