Structured approaches to finding and fixing bugs, from scientific method to team-based protocols.
The most rigorous approach: treat debugging as an experiment.
Observe → Hypothesize → Predict → Test → Conclude
↑ │
└────────────────────────────────────────┘
(if hypothesis rejected)
Observe: Gather all available evidence without interpretation
Hypothesize: Form a specific, falsifiable explanation
Predict: State what should happen if the hypothesis is true
Test: Run the smallest experiment that distinguishes true from false
\di users_email_idxConclude: Accept or reject the hypothesis based on evidence
OBSERVATION:
API endpoint POST /api/orders returns 500 after deploying v2.3.1
Error log: "TypeError: Cannot read property 'id' of undefined"
Stack trace points to orders.controller.js:47
HYPOTHESIS 1:
"The user object is null because the auth middleware is not
attaching it to the request in the new version"
PREDICTION:
"If I log req.user in the auth middleware, it will be undefined
for the failing requests"
TEST:
Added console.log(req.user) in auth middleware
Result: req.user IS defined and correct
CONCLUSION:
Hypothesis 1 REJECTED. The user object exists.
HYPOTHESIS 2:
"The order.customer field changed from an object to a string ID
in the new schema, so order.customer.id fails"
PREDICTION:
"If I inspect the order document, customer will be a string, not
an object with an .id property"
TEST:
db.orders.findOne({_id: "failing-order-id"})
Result: { customer: "user_123", ... } -- string, not object
CONCLUSION:
Hypothesis 2 CONFIRMED. Schema migration changed customer from
embedded object to reference ID. Fix: update controller to handle
both formats or ensure migration is complete.
Divide the search space in half with each step. Works for both code and history.
Find the exact commit that introduced a bug:
# Start bisect
git bisect start
# Mark current (broken) state as bad
git bisect bad
# Mark a known-good commit
git bisect good v2.2.0
# Automate with a test script (exit 0 = good, exit 1 = bad)
git bisect run ./test-bug.sh
Example test script:
#!/bin/bash
# test-bug.sh - exits 0 if bug is absent, 1 if present
# Build the project (skip if not needed)
npm install --silent 2>/dev/null
npm run build --silent 2>/dev/null
# Run the specific test that catches the bug
npm test -- --grep "order creation" 2>/dev/null
exit $?
# After bisect completes:
# "abc1234 is the first bad commit"
# View the offending commit
git show abc1234
# Clean up
git bisect reset
git bisect start
git bisect bad HEAD
git bisect good v2.2.0
# Git checks out a middle commit
# Test manually, then mark:
git bisect good # if this commit works
git bisect bad # if this commit is broken
# Repeat until the first bad commit is found
# ~10 steps for 1000 commits (log2(1000) ≈ 10)
When the bug is in a single file or function:
1. Comment out the bottom half of the suspect function
2. Test → still broken? Bug is in the top half
3. Uncomment bottom half, comment out top half of remaining suspect code
4. Repeat until you isolate the exact line(s)
This is particularly effective for:
Named after the strategy of placing a fence across a territory to determine which side the wolf is on.
Place a "probe" (assertion, log, breakpoint) at the midpoint of execution. Check if the state is correct at that point. If correct, the bug is downstream. If incorrect, it is upstream. Repeat.
def process_order(order):
validated = validate(order)
# PROBE 1: is validated correct here?
assert validated.total > 0, f"Probe 1 failed: total={validated.total}"
enriched = enrich_with_inventory(validated)
# PROBE 2: is enriched correct here?
assert enriched.items_available, f"Probe 2 failed: items={enriched.items}"
charged = charge_payment(enriched)
# PROBE 3: is charged correct here?
assert charged.payment_id, f"Probe 3 failed: payment={charged.payment_id}"
return finalize(charged)
Place probes at:
├─ Function entry/exit boundaries
├─ Before/after external calls (DB, API, filesystem)
├─ Before/after data transformations
├─ At conditional branches (which path was taken?)
└─ At loop boundaries (iteration count, accumulator value)
Explaining the problem forces you to examine your assumptions.
1. WHAT I EXPECT TO HAPPEN:
[describe the correct behavior in detail]
2. WHAT ACTUALLY HAPPENS:
[describe the observed behavior precisely]
3. THE GAP:
[what is different between expected and actual?]
4. MY CODE DOES THIS:
[walk through the relevant code line by line]
- Line 1: "First, we fetch the user by ID..."
- Line 2: "Then we check if the user has permission..."
- Line 3: "Wait... we check user.role but role could be
undefined if the user was created before we
added roles... THAT'S THE BUG"
5. ASSUMPTIONS I AM MAKING:
[list every assumption, then question each one]
- "The user always has a role" ← IS THIS TRUE?
- "The database returns results in insertion order" ← IS THIS TRUE?
- "The config file is loaded before this function runs" ← IS THIS TRUE?
Compare what works against what does not.
Working State vs. Broken State
───────────── ────────────
Environment A Environment B
Input set X Input set Y
Version N-1 Version N
Config A Config B
# Compare environment variables
diff <(env | sort) <(ssh prod 'env | sort')
# Compare installed packages
diff <(pip list --format=freeze | sort) <(ssh prod 'pip list --format=freeze | sort')
# Compare config files
diff local.env production.env
difft config-working.yaml config-broken.yaml # semantic diff
# Compare database schemas
diff <(pg_dump --schema-only dbA) <(pg_dump --schema-only dbB)
# Compare API responses
diff <(curl -s localhost:3000/api/users | jq .) <(curl -s prod:3000/api/users | jq .)
# Compare directory structures
diff <(fd -t f . ./working/ | sort) <(fd -t f . ./broken/ | sort)
[ ] OS and version
[ ] Runtime version (node --version, python --version, go version)
[ ] Dependency versions (package-lock.json, requirements.txt, go.sum)
[ ] Environment variables (especially secrets, API keys, feature flags)
[ ] Config files (compare byte-for-byte)
[ ] Database schema and seed data
[ ] Network configuration (firewall, DNS, proxy)
[ ] File system (permissions, case sensitivity, available disk)
[ ] System resources (memory, CPU, file descriptors)
[ ] Time and timezone
Systematically reduce the input or code to find the minimal failing case.
Given: A failing input of N elements
Goal: Find the smallest subset that still triggers the failure
1. Split input into 2 halves
2. Test each half:
- If one half fails alone → recurse on that half
- If neither half fails alone → the bug requires elements from both
→ try removing each quarter, then each eighth, etc.
3. Stop when no single element can be removed without fixing the bug
# Reduce a failing test input file
# Start: 1000-line input.json that causes a crash
# Test: does the first half crash?
head -500 input.json > test.json && ./program test.json
# If yes: recurse on first 500 lines
# If no: test second half
tail -500 input.json > test.json && ./program test.json
# Continue halving until minimal input found
1. Start with the full failing program
2. Remove half the code (e.g., half the imports, half the functions)
3. Does it still fail?
- Yes → keep removing from what remains
- No → restore that half, remove the other half
4. Result: minimal code that reproduces the failure
creduce): Automated C/C++ test case reductionFollow the execution path through the system.
import functools
import time
import json
def trace(func):
"""Decorator that traces function entry, exit, and timing."""
@functools.wraps(func)
def wrapper(*args, **kwargs):
call_id = id(args) % 10000
arg_summary = json.dumps(args[:3], default=str)[:100]
print(f"[TRACE {call_id}] ENTER {func.__name__}({arg_summary})")
start = time.perf_counter()
try:
result = func(*args, **kwargs)
elapsed = (time.perf_counter() - start) * 1000
result_summary = str(result)[:100]
print(f"[TRACE {call_id}] EXIT {func.__name__} -> {result_summary} ({elapsed:.1f}ms)")
return result
except Exception as e:
elapsed = (time.perf_counter() - start) * 1000
print(f"[TRACE {call_id}] ERROR {func.__name__}: {e} ({elapsed:.1f}ms)")
raise
return wrapper
# Linux: trace system calls
strace -f -e trace=network,file -p PID
# Trace a command from start
strace -f -o trace.log ./my-program
# Count system calls (find the hot path)
strace -c ./my-program
# macOS: dtrace/dtruss
sudo dtruss -p PID
# Trace file access only
strace -e trace=open,openat,read,write -p PID
Timestamp | Component | Event | Detail
------------|---------------|----------|---------------------------
10:23:01.001| auth-service | ENTRY | validateToken(tok_abc...)
10:23:01.003| auth-service | CALL | redis.get("session:abc")
10:23:01.015| auth-service | RETURN | redis -> {user: "u123"}
10:23:01.016| auth-service | EXIT | validateToken -> valid
10:23:01.017| order-service | ENTRY | createOrder(user=u123)
10:23:01.018| order-service | CALL | db.query("SELECT ...")
10:23:01.250| order-service | RETURN | db -> timeout after 232ms ← PROBLEM
Record execution and replay it, stepping forwards AND backwards.
# Record the execution
rr record ./my-program arg1 arg2
# Replay (starts at the end, you can go backwards)
rr replay
# Inside the rr session (gdb-like interface):
(rr) continue # run forward
(rr) reverse-continue # run backward to previous breakpoint
(rr) reverse-next # step backward one line
(rr) reverse-step # step backward into function calls
(rr) watch -l var # break when var changes (works in reverse too)
# Set a breakpoint and reverse-continue to find what set a value
(rr) break my_function
(rr) continue # hit the breakpoint going forward
(rr) watch -l result # watch the variable
(rr) reverse-continue # go back to where result was last set
| Language | Tool | Notes |
|---|---|---|
| C/C++ | rr | Best-in-class, Linux only |
| C/C++ | UDB (UndoDB) | Commercial, cross-platform |
| JavaScript | Chrome DevTools | "Step backward" in Sources panel (limited) |
| Python | epdb / pdb++ |
Post-mortem with history, not true time-travel |
| Java | IntelliJ IDEA | Limited reverse debugging |
| .NET | Visual Studio | IntelliTrace (Enterprise edition) |
Maintain a structured log of hypotheses to avoid going in circles.
## Bug: [Brief description]
### Evidence Collected
- [ ] Error message: "..."
- [ ] Stack trace captured
- [ ] Logs reviewed for timeframe: X to Y
- [ ] Reproduction rate: N/M attempts
### Hypotheses
| # | Hypothesis | Prediction | Test | Result | Status |
|---|-----------|------------|------|--------|--------|
| 1 | Missing index on users.email | EXPLAIN shows seq scan | Run EXPLAIN ANALYZE | Shows index scan | REJECTED |
| 2 | Connection pool exhausted | Active connections = max | Check pg_stat_activity | 47/50 connections | INVESTIGATING |
| 3 | | | | | |
### Current Best Hypothesis: #2
### What I Tried That Didn't Work
- Restarting the service (symptom returned after 5 min)
- Increasing query timeout (different error, same root cause)
[ ] Read the full error message and stack trace
[ ] Check if this is a known issue (search issue tracker, logs)
[ ] Identify when it started (correlate with recent changes)
[ ] Verify you are on the correct branch/version
[ ] Check recent commits: git log --oneline -10
[ ] Check recent deployments or config changes
[ ] Reproduce the bug locally
[ ] Set a time limit (30 min before seeking help)
Time | Action Taken | Result | Next Step
--------|---------------------------|------------------|----------
10:00 | Reproduced locally | Fails 3/3 times | Check logs
10:05 | Checked error logs | Found stack trace | Form hypothesis
10:10 | Hypothesis: null user obj | Test: add assert | Test
10:15 | Assert passed (user OK) | Rejected H1 | New hypothesis
10:20 | Hypothesis: stale cache | Test: clear cache | Test
10:22 | Cache cleared, bug gone | Confirmed H2 | Find root cause
10:30 | Cache TTL was 0 (never expires) | Root cause found | Fix
[ ] Root cause documented
[ ] Fix applied and verified
[ ] Regression test added
[ ] Could this bug class exist elsewhere? (search for similar patterns)
[ ] Was the debugging process efficient? What would I do differently?
[ ] Knowledge shared with team (if applicable)
[ ] Monitoring/alerting added to catch recurrence
Can you reproduce it?
├─ No → Workaround + monitoring + move on
└─ Yes
├─ Is the impact high? (data loss, security, outage)
│ └─ Yes → Keep debugging, escalate if needed
└─ Is the impact low? (cosmetic, edge case, rare)
└─ Yes → Workaround + backlog ticket + move on
Two people, one screen. One drives (types), one navigates (thinks strategically).
When stuck, explain the problem to a colleague who has NOT been debugging it:
The fresh person often spots an assumption the original debugger has gone blind to.
Share with the team:
├─ What the bug was (root cause, not just symptom)
├─ How it was found (which technique worked)
├─ Why existing tests/monitoring did not catch it
├─ What was added to prevent recurrence
└─ Any broader lessons (design patterns, common pitfalls)