# Systematic Debugging Methods

Structured approaches to finding and fixing bugs, from scientific method to team-based protocols.

## Scientific Debugging Method

The most rigorous approach: treat debugging as an experiment.

### The Cycle

```
Observe → Hypothesize → Predict → Test → Conclude
   ↑                                        │
   └────────────────────────────────────────┘
           (if hypothesis rejected)
```

### Step-by-Step

1. **Observe**: Gather all available evidence without interpretation
   - Error messages, stack traces, logs
   - System state (memory, CPU, disk, network)
   - User-reported behavior vs expected behavior
   - When it started (correlate with deployments, config changes)

2. **Hypothesize**: Form a specific, falsifiable explanation
   - "The query times out because the users table lacks an index on email"
   - NOT "something is wrong with the database" (too vague)

3. **Predict**: State what should happen if the hypothesis is true
   - "If I add an index on email, the query should complete in <100ms"
   - "If I run EXPLAIN on the query, it should show a sequential scan"

4. **Test**: Run the smallest experiment that distinguishes true from false
   - Run EXPLAIN ANALYZE on the query
   - Check if the index exists: `\di users_email_idx`

5. **Conclude**: Accept or reject the hypothesis based on evidence
   - If confirmed: proceed to fix
   - If rejected: return to step 1 with new information

### Example Walkthrough

```
OBSERVATION:
  API endpoint POST /api/orders returns 500 after deploying v2.3.1
  Error log: "TypeError: Cannot read property 'id' of undefined"
  Stack trace points to orders.controller.js:47

HYPOTHESIS 1:
  "The user object is null because the auth middleware is not
   attaching it to the request in the new version"

PREDICTION:
  "If I log req.user in the auth middleware, it will be undefined
   for the failing requests"

TEST:
  Added console.log(req.user) in auth middleware
  Result: req.user IS defined and correct

CONCLUSION:
  Hypothesis 1 REJECTED. The user object exists.

HYPOTHESIS 2:
  "The order.customer field changed from an object to a string ID
   in the new schema, so order.customer.id fails"

PREDICTION:
  "If I inspect the order document, customer will be a string, not
   an object with an .id property"

TEST:
  db.orders.findOne({_id: "failing-order-id"})
  Result: { customer: "user_123", ... }  -- string, not object

CONCLUSION:
  Hypothesis 2 CONFIRMED. Schema migration changed customer from
  embedded object to reference ID. Fix: update controller to handle
  both formats or ensure migration is complete.
```

## Binary Search Debugging

Divide the search space in half with each step. Works for both code and history.

### Git Bisect (Automated)

Find the exact commit that introduced a bug:

```bash
# Start bisect
git bisect start

# Mark current (broken) state as bad
git bisect bad

# Mark a known-good commit
git bisect good v2.2.0

# Automate with a test script (exit 0 = good, exit 1 = bad)
git bisect run ./test-bug.sh
```

Example test script:

```bash
#!/bin/bash
# test-bug.sh - exits 0 if bug is absent, 1 if present

# Build the project (skip if not needed)
npm install --silent 2>/dev/null
npm run build --silent 2>/dev/null

# Run the specific test that catches the bug
npm test -- --grep "order creation" 2>/dev/null
exit $?
```

```bash
# After bisect completes:
# "abc1234 is the first bad commit"

# View the offending commit
git show abc1234

# Clean up
git bisect reset
```

### Git Bisect (Manual)

```bash
git bisect start
git bisect bad HEAD
git bisect good v2.2.0

# Git checks out a middle commit
# Test manually, then mark:
git bisect good   # if this commit works
git bisect bad    # if this commit is broken

# Repeat until the first bad commit is found
# ~10 steps for 1000 commits (log2(1000) ≈ 10)
```

### Manual Code Bisection

When the bug is in a single file or function:

```
1. Comment out the bottom half of the suspect function
2. Test → still broken? Bug is in the top half
3. Uncomment bottom half, comment out top half of remaining suspect code
4. Repeat until you isolate the exact line(s)
```

This is particularly effective for:
- Long functions with no clear fault location
- Template/config files where errors are positional
- CSS debugging (comment out rule blocks)

## Wolf Fence Algorithm

Named after the strategy of placing a fence across a territory to determine which side the wolf is on.

### Concept

Place a "probe" (assertion, log, breakpoint) at the midpoint of execution. Check if the state is correct at that point. If correct, the bug is downstream. If incorrect, it is upstream. Repeat.

### Implementation

```python
def process_order(order):
    validated = validate(order)
    # PROBE 1: is validated correct here?
    assert validated.total > 0, f"Probe 1 failed: total={validated.total}"

    enriched = enrich_with_inventory(validated)
    # PROBE 2: is enriched correct here?
    assert enriched.items_available, f"Probe 2 failed: items={enriched.items}"

    charged = charge_payment(enriched)
    # PROBE 3: is charged correct here?
    assert charged.payment_id, f"Probe 3 failed: payment={charged.payment_id}"

    return finalize(charged)
```

### Strategic Probe Placement

```
Place probes at:
├─ Function entry/exit boundaries
├─ Before/after external calls (DB, API, filesystem)
├─ Before/after data transformations
├─ At conditional branches (which path was taken?)
└─ At loop boundaries (iteration count, accumulator value)
```

## Rubber Duck Debugging

Explaining the problem forces you to examine your assumptions.

### Structured Rubber Duck Template

```
1. WHAT I EXPECT TO HAPPEN:
   [describe the correct behavior in detail]

2. WHAT ACTUALLY HAPPENS:
   [describe the observed behavior precisely]

3. THE GAP:
   [what is different between expected and actual?]

4. MY CODE DOES THIS:
   [walk through the relevant code line by line]
   - Line 1: "First, we fetch the user by ID..."
   - Line 2: "Then we check if the user has permission..."
   - Line 3: "Wait... we check user.role but role could be
              undefined if the user was created before we
              added roles... THAT'S THE BUG"

5. ASSUMPTIONS I AM MAKING:
   [list every assumption, then question each one]
   - "The user always has a role" ← IS THIS TRUE?
   - "The database returns results in insertion order" ← IS THIS TRUE?
   - "The config file is loaded before this function runs" ← IS THIS TRUE?
```

### Why It Works

- Forces sequential reasoning instead of pattern-matching
- Exposes implicit assumptions
- Catches "it obviously works" blind spots
- The bug is usually found in step 4 or 5

## Differential Debugging

Compare what works against what does not.

### Method

```
Working State          vs.          Broken State
─────────────                      ────────────
Environment A                      Environment B
Input set X                        Input set Y
Version N-1                        Version N
Config A                           Config B
```

### Practical Comparison Commands

```bash
# Compare environment variables
diff <(env | sort) <(ssh prod 'env | sort')

# Compare installed packages
diff <(pip list --format=freeze | sort) <(ssh prod 'pip list --format=freeze | sort')

# Compare config files
diff local.env production.env
difft config-working.yaml config-broken.yaml  # semantic diff

# Compare database schemas
diff <(pg_dump --schema-only dbA) <(pg_dump --schema-only dbB)

# Compare API responses
diff <(curl -s localhost:3000/api/users | jq .) <(curl -s prod:3000/api/users | jq .)

# Compare directory structures
diff <(fd -t f . ./working/ | sort) <(fd -t f . ./broken/ | sort)
```

### Environment Diff Checklist

```
[ ] OS and version
[ ] Runtime version (node --version, python --version, go version)
[ ] Dependency versions (package-lock.json, requirements.txt, go.sum)
[ ] Environment variables (especially secrets, API keys, feature flags)
[ ] Config files (compare byte-for-byte)
[ ] Database schema and seed data
[ ] Network configuration (firewall, DNS, proxy)
[ ] File system (permissions, case sensitivity, available disk)
[ ] System resources (memory, CPU, file descriptors)
[ ] Time and timezone
```

## Delta Debugging

Systematically reduce the input or code to find the minimal failing case.

### Concept (ddmin Algorithm)

```
Given: A failing input of N elements
Goal:  Find the smallest subset that still triggers the failure

1. Split input into 2 halves
2. Test each half:
   - If one half fails alone → recurse on that half
   - If neither half fails alone → the bug requires elements from both
     → try removing each quarter, then each eighth, etc.
3. Stop when no single element can be removed without fixing the bug
```

### Practical Application

```bash
# Reduce a failing test input file
# Start: 1000-line input.json that causes a crash

# Test: does the first half crash?
head -500 input.json > test.json && ./program test.json

# If yes: recurse on first 500 lines
# If no: test second half
tail -500 input.json > test.json && ./program test.json

# Continue halving until minimal input found
```

### Code Reduction

```
1. Start with the full failing program
2. Remove half the code (e.g., half the imports, half the functions)
3. Does it still fail?
   - Yes → keep removing from what remains
   - No → restore that half, remove the other half
4. Result: minimal code that reproduces the failure
```

### Tools

- **C-Reduce** (`creduce`): Automated C/C++ test case reduction
- **Perses**: Language-agnostic program reducer
- **picireny**: Python-based delta debugging framework
- Manual reduction is often fastest for small programs

## Trace-Based Debugging

Follow the execution path through the system.

### Strategic Trace Points

```python
import functools
import time
import json

def trace(func):
    """Decorator that traces function entry, exit, and timing."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_id = id(args) % 10000
        arg_summary = json.dumps(args[:3], default=str)[:100]
        print(f"[TRACE {call_id}] ENTER {func.__name__}({arg_summary})")
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            elapsed = (time.perf_counter() - start) * 1000
            result_summary = str(result)[:100]
            print(f"[TRACE {call_id}] EXIT  {func.__name__} -> {result_summary} ({elapsed:.1f}ms)")
            return result
        except Exception as e:
            elapsed = (time.perf_counter() - start) * 1000
            print(f"[TRACE {call_id}] ERROR {func.__name__}: {e} ({elapsed:.1f}ms)")
            raise
    return wrapper
```

### System Call Tracing

```bash
# Linux: trace system calls
strace -f -e trace=network,file -p PID

# Trace a command from start
strace -f -o trace.log ./my-program

# Count system calls (find the hot path)
strace -c ./my-program

# macOS: dtrace/dtruss
sudo dtruss -p PID

# Trace file access only
strace -e trace=open,openat,read,write -p PID
```

### Structured Trace Output

```
Timestamp   | Component     | Event    | Detail
------------|---------------|----------|---------------------------
10:23:01.001| auth-service  | ENTRY    | validateToken(tok_abc...)
10:23:01.003| auth-service  | CALL     | redis.get("session:abc")
10:23:01.015| auth-service  | RETURN   | redis -> {user: "u123"}
10:23:01.016| auth-service  | EXIT     | validateToken -> valid
10:23:01.017| order-service | ENTRY    | createOrder(user=u123)
10:23:01.018| order-service | CALL     | db.query("SELECT ...")
10:23:01.250| order-service | RETURN   | db -> timeout after 232ms  ← PROBLEM
```

## Time-Travel Debugging

Record execution and replay it, stepping forwards AND backwards.

### rr (Linux only)

```bash
# Record the execution
rr record ./my-program arg1 arg2

# Replay (starts at the end, you can go backwards)
rr replay

# Inside the rr session (gdb-like interface):
(rr) continue           # run forward
(rr) reverse-continue   # run backward to previous breakpoint
(rr) reverse-next       # step backward one line
(rr) reverse-step       # step backward into function calls
(rr) watch -l var       # break when var changes (works in reverse too)

# Set a breakpoint and reverse-continue to find what set a value
(rr) break my_function
(rr) continue           # hit the breakpoint going forward
(rr) watch -l result    # watch the variable
(rr) reverse-continue   # go back to where result was last set
```

### When to Use Time-Travel Debugging

- Bug manifests late but root cause is early in execution
- You need to find "what set this variable to the wrong value?"
- Intermittent bugs that are hard to reproduce (record once, replay many times)
- Complex multi-step state corruption

### Alternatives by Language

| Language | Tool | Notes |
|----------|------|-------|
| C/C++ | rr | Best-in-class, Linux only |
| C/C++ | UDB (UndoDB) | Commercial, cross-platform |
| JavaScript | Chrome DevTools | "Step backward" in Sources panel (limited) |
| Python | `epdb` / `pdb++` | Post-mortem with history, not true time-travel |
| Java | IntelliJ IDEA | Limited reverse debugging |
| .NET | Visual Studio | IntelliTrace (Enterprise edition) |

## Hypothesis-Driven Debugging

Maintain a structured log of hypotheses to avoid going in circles.

### Tracking Template

```markdown
## Bug: [Brief description]

### Evidence Collected
- [ ] Error message: "..."
- [ ] Stack trace captured
- [ ] Logs reviewed for timeframe: X to Y
- [ ] Reproduction rate: N/M attempts

### Hypotheses

| # | Hypothesis | Prediction | Test | Result | Status |
|---|-----------|------------|------|--------|--------|
| 1 | Missing index on users.email | EXPLAIN shows seq scan | Run EXPLAIN ANALYZE | Shows index scan | REJECTED |
| 2 | Connection pool exhausted | Active connections = max | Check pg_stat_activity | 47/50 connections | INVESTIGATING |
| 3 | | | | | |

### Current Best Hypothesis: #2

### What I Tried That Didn't Work
- Restarting the service (symptom returned after 5 min)
- Increasing query timeout (different error, same root cause)
```

### Rules

1. Write hypotheses down before testing them
2. Define the prediction before running the test
3. Record results even for rejected hypotheses (prevents retesting)
4. If you have tested 5+ hypotheses without progress, step back and re-examine assumptions

## Debugging Checklists

### Pre-Debug Checklist

```
[ ] Read the full error message and stack trace
[ ] Check if this is a known issue (search issue tracker, logs)
[ ] Identify when it started (correlate with recent changes)
[ ] Verify you are on the correct branch/version
[ ] Check recent commits: git log --oneline -10
[ ] Check recent deployments or config changes
[ ] Reproduce the bug locally
[ ] Set a time limit (30 min before seeking help)
```

### During-Debug Log

```
Time    | Action Taken              | Result           | Next Step
--------|---------------------------|------------------|----------
10:00   | Reproduced locally        | Fails 3/3 times  | Check logs
10:05   | Checked error logs        | Found stack trace | Form hypothesis
10:10   | Hypothesis: null user obj | Test: add assert  | Test
10:15   | Assert passed (user OK)   | Rejected H1      | New hypothesis
10:20   | Hypothesis: stale cache   | Test: clear cache | Test
10:22   | Cache cleared, bug gone   | Confirmed H2     | Find root cause
10:30   | Cache TTL was 0 (never expires) | Root cause found | Fix
```

### Post-Debug Retrospective

```
[ ] Root cause documented
[ ] Fix applied and verified
[ ] Regression test added
[ ] Could this bug class exist elsewhere? (search for similar patterns)
[ ] Was the debugging process efficient? What would I do differently?
[ ] Knowledge shared with team (if applicable)
[ ] Monitoring/alerting added to catch recurrence
```

## When to Stop Debugging

### Diminishing Returns Signals

- You have been debugging for 2+ hours without new information
- You are retesting hypotheses you already rejected
- You are making changes "just to see what happens"
- You are frustrated and making mistakes

### Decision Framework

```
Can you reproduce it?
├─ No → Workaround + monitoring + move on
└─ Yes
   ├─ Is the impact high? (data loss, security, outage)
   │  └─ Yes → Keep debugging, escalate if needed
   └─ Is the impact low? (cosmetic, edge case, rare)
      └─ Yes → Workaround + backlog ticket + move on
```

### Escalation Criteria

- You have spent 2x your initial time estimate
- You need access to systems/data you do not have
- The bug crosses team boundaries (your service + another team's service)
- You suspect a bug in a third-party library or runtime

## Debugging in Teams

### Pair Debugging

Two people, one screen. One drives (types), one navigates (thinks strategically).

- Driver focuses on tactical execution
- Navigator watches for wrong turns, suggests hypotheses
- Switch roles every 20-30 minutes
- Navigator should resist the urge to grab the keyboard

### Fresh Eyes Protocol

When stuck, explain the problem to a colleague who has NOT been debugging it:

1. Describe the expected behavior
2. Describe the actual behavior
3. List hypotheses tested and their results
4. Ask: "What am I missing?"

The fresh person often spots an assumption the original debugger has gone blind to.

### Knowledge Transfer After Debugging

```
Share with the team:
├─ What the bug was (root cause, not just symptom)
├─ How it was found (which technique worked)
├─ Why existing tests/monitoring did not catch it
├─ What was added to prevent recurrence
└─ Any broader lessons (design patterns, common pitfalls)
```