Common Debugging Scenarios

Playbooks for the most frequently encountered bug categories.

Memory Leaks

Symptoms

├─ RSS (Resident Set Size) grows continuously over time
├─ OOM (Out of Memory) kills after hours/days of uptime
├─ Increasing GC time / GC pauses getting longer
├─ Application slows down gradually
└─ Swap usage increases

Browser / Frontend

Three-snapshot technique:

1. Take heap snapshot (baseline after page load)
2. Perform the suspected leaking action (e.g., open/close modal 10 times)
3. Force garbage collection (Performance panel → trash can icon)
4. Take heap snapshot 2
5. Repeat step 2 (10 more times)
6. Force GC again
7. Take heap snapshot 3
8. In snapshot 3, select "Objects allocated between snapshot 1 and 2"
9. Sort by "Retained Size" descending
10. Look for objects that should have been GC'd

Detached DOM nodes:

// Find detached DOM nodes in DevTools Console
// Take heap snapshot → search for "Detached" in class filter

// Common cause: event listener on removed element
const handler = () => { /* ... */ };
element.addEventListener('click', handler);
element.remove(); // Detached: stays in memory while anything (this variable, a closure) still references it

// Fix: remove listener before removing element
element.removeEventListener('click', handler);
element.remove();

// Or use AbortController (modern approach)
const controller = new AbortController();
element.addEventListener('click', handler, { signal: controller.signal });
// Later: clean up all listeners at once
controller.abort();

Node.js

# Method 1: Chrome DevTools
node --inspect app.js
# Open chrome://inspect → Take heap snapshots

# Method 2: heapdump module
# In code: require('heapdump');
# Send SIGUSR2 to take snapshot: kill -USR2 PID
# Compare .heapsnapshot files in Chrome DevTools

# Method 3: clinic.js doctor
clinic doctor -- node app.js
# Generates report identifying likely memory leak

# Method 4: Process memory monitoring
# In code (inside the app being monitored, not a separate node process):
#   setInterval(() => console.log(process.memoryUsage()), 5000).unref();
# Watch rss, heapUsed, heapTotal, external, arrayBuffers

Python

# objgraph: find reference chains keeping objects alive
import objgraph

# Show object count growth between two points
objgraph.show_growth(limit=10)
# ... run suspect code ...
objgraph.show_growth(limit=10)  # Shows what increased

# Find what holds a reference to an object
objgraph.show_backrefs(
    objgraph.by_type('MyClass')[0],
    max_depth=5,
    filename='refs.png'
)

# tracemalloc: track where allocations happen
import tracemalloc
tracemalloc.start(25)  # Store 25 frames of traceback

# ... run suspect code ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('traceback')[:5]:
    print(stat)
    for line in stat.traceback.format():
        print(f"  {line}")

# gc: inspect garbage collector
import gc
gc.set_debug(gc.DEBUG_LEAK)  # Includes DEBUG_SAVEALL: unreachable objects are kept, not freed
gc.collect()  # Force collection
print(gc.garbage)  # Objects the collector found unreachable (members of reference cycles)

# Find circular references
gc.collect()
for obj in gc.garbage:
    print(type(obj), gc.get_referrers(obj))
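
The snapshot-comparison idea used in the browser section can also be applied with tracemalloc. The following is a minimal sketch, assuming a hypothetical module-level `leaky_cache` list and a hypothetical `handle_request` function that stand in for the suspect code:

```python
import tracemalloc

leaky_cache = []  # hypothetical module-level list that is never cleared


def handle_request(i):
    # Hypothetical suspect code: every call retains about 10 KB.
    leaky_cache.append(bytearray(10_000))


tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for i in range(100):
    handle_request(i)

after = tracemalloc.take_snapshot()

# compare_to() sorts by absolute size difference, largest first,
# so the source lines whose live allocations changed most come out on top.
top_growth = after.compare_to(baseline, "lineno")[:5]
for stat in top_growth:
    print(stat)

tracemalloc.stop()
```

In a real service the two snapshots would typically be taken minutes apart under steady load; a line that keeps growing between every pair of snapshots is the first place to look.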

Go

# Enable pprof endpoint (add to your app)
# import _ "net/http/pprof"

# Take heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Compare two heap profiles (before and after)
go tool pprof -diff_base=heap1.prof heap2.prof

# Inside pprof:
(pprof) top             # Top allocators
(pprof) top -cum        # Top by cumulative allocations
(pprof) list funcName   # Annotated source showing allocations per line
(pprof) web             # Graphical view in browser

# Quick check: runtime memory stats
import "runtime"

var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("Alloc: %d MiB\n", m.Alloc / 1024 / 1024)
fmt.Printf("TotalAlloc: %d MiB\n", m.TotalAlloc / 1024 / 1024)
fmt.Printf("Sys: %d MiB\n", m.Sys / 1024 / 1024)
fmt.Printf("NumGC: %d\n", m.NumGC)

Common Causes

| Cause | Language | Detection |
|---|---|---|
| Event listener accumulation | JS | Heap snapshot → EventListener count growing |
| Cache without eviction | All | Memory grows linearly with unique inputs |
| Closure capturing large scope | JS/Python | Heap snapshot → large retained size in closures |
| Circular references | Python | gc.garbage (with gc.DEBUG_SAVEALL) lists objects found in cycles |
| Goroutine leak | Go | pprof/goroutine count grows over time |
| Global/static collections | All | Check module-level lists, dicts, maps |
| Unreleased database connections | All | Connection pool stats show exhaustion |
| String concatenation in loops | Go/Java | Allocation profile dominated by strings; use strings.Builder / StringBuilder instead |
| Forgotten timers/intervals | JS | setInterval without corresponding clearInterval |
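
For the "cache without eviction" cause, the usual remedy is to bound the cache. The sketch below is one possible shape, not a prescribed implementation; the `BoundedCache` class and its `max_entries` parameter are hypothetical names introduced for illustration:

```python
from collections import OrderedDict


class BoundedCache:
    """LRU cache that evicts the least recently used entry once full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_entries=1000)
for user_id in range(100_000):  # many unique inputs
    cache.put(user_id, {"id": user_id})
print(len(cache))
```

With the bound in place, memory stays flat no matter how many unique keys arrive. For pure functions, `functools.lru_cache(maxsize=...)` gives the same effect with less code.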

Deadlocks

Symptoms

├─ Process hangs (0% CPU, still alive)
├─ All worker threads blocked
├─ No new log output
├─ Health check timeouts
└─ Incoming requests queue up, never complete

Detection by Language

# Go: dump all goroutine stacks (SIGQUIT also terminates the process)
kill -QUIT PID
# Or, without stopping it: curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'

# Java: thread dump
jstack PID
kill -3 PID  # SIGQUIT also works for JVM

# Python: faulthandler (dumps all thread stacks on a signal)
# In code: import faulthandler, signal
#          faulthandler.register(signal.SIGUSR1, all_threads=True)
kill -USR1 PID
# Or, without code changes: py-spy dump --pid PID

# Node.js: list active handles/requests (undocumented internals; run inside the process)
process._getActiveHandles()
process._getActiveRequests()

# Linux: check what threads are waiting on
cat /proc/PID/stack           # Kernel stack of main thread
ls /proc/PID/task/            # List all threads
cat /proc/PID/task/TID/stack  # Kernel stack of specific thread

# GDB: attach to stuck process
gdb -p PID
(gdb) info threads
(gdb) thread apply all bt    # Backtrace for all threads

Classic Deadlock Pattern

Thread 1: lock(A) → lock(B)
Thread 2: lock(B) → lock(A)

Timeline:
  T1: acquires A         T2: acquires B
  T1: waits for B        T2: waits for A
  → DEADLOCK (both waiting forever)

Prevention

1. Consistent lock ordering:
   Always acquire locks in the same order (e.g., alphabetical by resource name)

2. Timeout on lock acquisition:
   mutex.tryLock(timeout: 5.seconds)
   If timeout → release all locks, backoff, retry

3. Lock-free data structures:
   Use atomic operations, channels (Go), or concurrent collections

4. Detect and break:
   Deadlock detection thread that monitors lock wait times
   Go: the runtime reports a deadlock only when every goroutine is blocked
       (fatal error: all goroutines are asleep - deadlock!); partial deadlocks go unnoticed
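
Prevention points 1 and 2 can be combined in a small helper. The sketch below is illustrative only: `acquire_in_order` is a hypothetical name, and ordering locks by `id()` is one arbitrary but consistent choice of global order:

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks, timeout=5.0):
    """Acquire locks in one global order, waiting at most `timeout` seconds for each."""
    ordered = sorted(locks, key=id)
    held = []
    try:
        for lock in ordered:
            if not lock.acquire(timeout=timeout):
                raise TimeoutError("could not acquire lock; possible deadlock")
            held.append(lock)
        yield
    finally:
        for lock in reversed(held):  # release whatever was acquired
            lock.release()


lock_a = threading.Lock()
lock_b = threading.Lock()
completed = []


def worker(name, first, second):
    # Callers pass the locks in opposite orders; the helper normalizes the order.
    for _ in range(1000):
        with acquire_in_order(first, second):
            completed.append(name)


t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()
print(len(completed))
```

Without the sorting step, the two workers would take the locks in opposite orders, which is exactly the classic pattern shown above.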

Go-Specific: Channel Deadlocks

// Deadlock: unbuffered channel with no receiver
ch := make(chan int)
ch <- 1  // blocks forever, no goroutine reading

// Deadlock: channel in select without default
select {
case msg := <-ch:
    process(msg)
// no default → blocks forever if ch has no sender
}

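// Fix 1: bound the wait with a timeout
select {
case msg := <-ch:
    process(msg)
case <-time.After(5 * time.Second):
    log.Println("timeout waiting for message")
}

// Fix 2: do not wait at all (default runs immediately if ch is not ready).
// Do not combine default with a timeout case: the timeout would never fire.
select {
case msg := <-ch:
    process(msg)
default:
    // non-blocking
}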

// Goroutine leak detection
// If goroutine count grows over time, goroutines are stuck
import "runtime"
fmt.Println("Goroutines:", runtime.NumGoroutine())

Go-Specific: Mutex Deadlock Detection

// Swap sync.Mutex for the go-deadlock detector during development
// go get github.com/sasha-s/go-deadlock
import "github.com/sasha-s/go-deadlock"

var mu deadlock.Mutex  // Drop-in replacement for sync.Mutex
// Prints a potential-deadlock warning with stack traces when
// lock order is inconsistent or a goroutine waits too long for a lock

Race Conditions

Symptoms

├─ Intermittent test failures ("flaky tests")
├─ Different results on different runs with same input
├─ Bug disappears when adding print/log statements (Heisenbug)
├─ Works with 1 user, fails with 10 concurrent users
├─ Works in debugger, fails in production
└─ Data corruption that "should be impossible"

Detection Tools

# Go: built-in race detector
go test -race ./...
go run -race ./cmd/server

# C/C++/Rust: ThreadSanitizer
# Compile with: -fsanitize=thread
gcc -fsanitize=thread -g program.c -o program
./program

# Rust: Miri (for unsafe code)
cargo miri test

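# Java: mainline JDK builds ship no dynamic race detector
# (ThreadSanitizer support exists only as an experimental OpenJDK project);
# use static analysis such as SpotBugs (FindBugs' successor) with its multithreading detectors

# Python: the GIL does not prevent races; compound operations such as x += 1 are not atomic,
# and races also occur with multiprocessing, asyncio, or C extensions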

Reproduction Techniques

# Technique 1: Add strategic delays to widen the race window
import time

def transfer(from_account, to_account, amount):
    balance = from_account.balance
    time.sleep(0.001)  # ← Widens the race window
    from_account.balance = balance - amount
    time.sleep(0.001)  # ← Makes race more likely
    to_account.balance += amount

# Technique 2: Increase concurrency
# Run the same test with 100 concurrent workers
for i in $(seq 1 100); do
    curl -s http://localhost:3000/api/transfer &
done
wait

# Technique 3: Stress test with loop
for i in $(seq 1 1000); do
    go test -race -count=1 ./pkg/... || echo "FAILED on iteration $i"
done

// Technique 4: Go - use t.Parallel() in tests
func TestConcurrentAccess(t *testing.T) {
    for i := 0; i < 100; i++ {
        t.Run(fmt.Sprintf("case_%d", i), func(t *testing.T) {
            t.Parallel() // Run sub-tests concurrently
            // ... test code that exercises shared state
        })
    }
}

Common Race Condition Patterns

Read-Modify-Write (most common):

Thread 1: read counter (= 5)
Thread 2: read counter (= 5)
Thread 1: write counter (= 6)
Thread 2: write counter (= 6)  ← Should be 7!

Fix: atomic operations or mutex
  Go:    atomic.AddInt64(&counter, 1)
  Rust:  counter.fetch_add(1, Ordering::SeqCst)
  JS:    N/A (single-threaded, but async read-modify-write exists)
  Python: threading.Lock()
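
The lost update and its mutex fix can be reproduced in a few lines. This is a sketch under assumptions: the `Counter` class and `run` helper are hypothetical, and the short sleep is there only to widen the race window so the bug shows up reliably:

```python
import threading
import time


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        current = self.value      # read
        time.sleep(0.0001)        # widen the race window
        self.value = current + 1  # write (may overwrite another thread's update)

    def safe_increment(self):
        with self._lock:          # read-modify-write is now one critical section
            current = self.value
            time.sleep(0.0001)
            self.value = current + 1


def run(method_name, threads=8, per_thread=50):
    """Run the named method from several threads and return the final count."""
    counter = Counter()

    def work():
        for _ in range(per_thread):
            getattr(counter, method_name)()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value


print("unsafe:", run("unsafe_increment"))  # typically well below 400
print("safe:  ", run("safe_increment"))
```

The unsafe total varies from run to run, which is the signature of this pattern; the locked version always reaches 8 × 50 increments.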

Check-Then-Act (TOCTOU):

Thread 1: if file.exists()        (yes)
Thread 2:                         delete file
Thread 1:     file.read()         ← CRASH: file no longer exists

Fix: make check and act a single atomic step
  OS-level: skip the exists() check; open/read directly and handle "not found"
            (for create-if-absent, use O_CREAT|O_EXCL)
  DB-level: use transactions with proper isolation
  App-level: lock around check+act
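
At the OS level, the check-then-act gap disappears when the operation itself reports the outcome. A minimal sketch, with hypothetical helper names `read_if_present` and `create_exclusive` and a hypothetical file name:

```python
import os
import tempfile


def read_if_present(path):
    """Read a file without a separate exists() check (no check-then-act gap)."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return None


def create_exclusive(path, content):
    """Create a file only if it does not exist yet (mode "x" = exclusive creation)."""
    try:
        with open(path, "x", encoding="utf-8") as f:
            f.write(content)
        return True
    except FileExistsError:
        return False


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "job.lock")  # hypothetical file name
print(create_exclusive(target, "owner=1"))
print(create_exclusive(target, "owner=2"))
print(read_if_present(target))
print(read_if_present(os.path.join(workdir, "missing.txt")))
```

The second `create_exclusive` call loses cleanly instead of overwriting the first writer, and the missing file yields `None` rather than a crash.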

Publication Without Synchronization:

// BAD: other goroutines may see partially initialized Config
config = &Config{Host: "example.com", Port: 8080}

// GOOD: use atomic.Value or sync.Once
var configValue atomic.Value
configValue.Store(&Config{Host: "example.com", Port: 8080})

Performance Regressions

Detection

# Git bisect with benchmark
git bisect start
git bisect bad HEAD
git bisect good v1.0.0

# Automated bisect using benchmark threshold
cat > /tmp/bench-test.sh << 'SCRIPT'
#!/bin/bash
# Exit 0 = good, 1 = bad, 125 = skip this commit (e.g., it does not build)
go test -bench=BenchmarkCriticalPath -count=5 ./pkg/... |
  awk '/ns\/op/ {s += $3; n++} END {
    if (n == 0) exit 125;      # No benchmark output: cannot judge
    if (s / n > 1000) exit 1;  # Bad if average > 1000 ns/op
    exit 0;                    # Good otherwise
  }'
SCRIPT
chmod +x /tmp/bench-test.sh
git bisect run /tmp/bench-test.sh
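
The threshold logic can also live in a script that is easier to test than an awk one-liner. The sketch below is an assumption-laden alternative, not the method above: it uses the median instead of the average to resist outliers, `bisect_exit_code` is a hypothetical name, and the sample benchmark numbers are made up:

```python
import re
import statistics

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125

BENCH_LINE = re.compile(r"^Benchmark\S+\s+\d+\s+([\d.]+) ns/op", re.MULTILINE)


def bisect_exit_code(bench_output, threshold_ns):
    """Map `go test -bench` output to a git-bisect exit code."""
    samples = [float(m) for m in BENCH_LINE.findall(bench_output)]
    if not samples:
        return SKIP  # build failure or no benchmark ran: cannot judge this commit
    return BAD if statistics.median(samples) > threshold_ns else GOOD


# Made-up output in the shape printed by `go test -bench`.
sample_output = """\
BenchmarkCriticalPath-8   1000000   950 ns/op
BenchmarkCriticalPath-8   1000000   1010 ns/op
BenchmarkCriticalPath-8   1000000   980 ns/op
"""
print(bisect_exit_code(sample_output, threshold_ns=1000))
print(bisect_exit_code("FAIL: build failed", threshold_ns=1000))
```

A wrapper would pipe the benchmark output into this function and pass the result to `sys.exit`, so that commits which fail to build are skipped rather than marked bad.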

CPU Profiling

# Go: CPU profile (profile flags accept one package at a time; yourpkg is a placeholder)
go test -cpuprofile=cpu.prof -bench=. ./pkg/yourpkg
go tool pprof cpu.prof
(pprof) top 20
(pprof) list HotFunction  # Annotated source with time per line
(pprof) web               # Visual graph

# Node.js: clinic flame
clinic flame -- node app.js
# Or: 0x app.js

# Python: cProfile
python -m cProfile -s cumulative script.py
# Or: py-spy for live profiling
py-spy top --pid PID

# Rust: cargo flamegraph
cargo flamegraph --bin myapp

Flame Graph Interpretation

Reading flame graphs:
├─ X-axis: share of samples, not chronological order (wider = more time)
├─ Y-axis: call stack depth (bottom = entry point, top = leaf)
├─ Each bar: a function in the stack
├─ Color: usually random (not meaningful) unless semantic coloring
│
├─ Look for: "plateaus" (wide flat bars at the top of a stack) = hot functions
├─ Look for: unexpected depth = unnecessary call chains
├─ Look for: multiple thin towers = function called many times
└─ Ignore: narrow bars (insignificant time)

Common findings:
├─ Wide JSON.parse bar → large payload parsing
├─ Wide sort bar → inefficient sorting algorithm or large dataset
├─ Wide GC bar → too many allocations (reduce object creation)
├─ Deep regex bar → regex backtracking (simplify pattern)
└─ Wide I/O bar → blocking I/O on critical path
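
Bar width is simply an aggregated sample count, which can be computed directly from the "folded stacks" text that many flame graph tools consume. A minimal sketch, with made-up sample data and a hypothetical `frame_widths` helper:

```python
from collections import Counter

# Hypothetical samples in folded format: frames joined by ";" then a sample count.
folded = """\
main;handle_request;parse_json 40
main;handle_request;query_db;serialize 25
main;handle_request;render 10
main;gc 25
"""


def frame_widths(folded_text):
    """Return (inclusive, self) sample counts per function name."""
    inclusive, self_time = Counter(), Counter()
    for line in folded_text.splitlines():
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        for name in set(frames):             # count a recursive frame once per sample
            inclusive[name] += int(count)
        self_time[frames[-1]] += int(count)  # leaf frame = where the time was spent
    return inclusive, self_time


inclusive, self_time = frame_widths(folded)
total = inclusive["main"]
for name, n in self_time.most_common(3):
    print(f"{name}: {100 * n / total:.0f}% self time")
```

Inclusive counts correspond to a bar's full width; self counts correspond to the part of the bar with nothing stacked on top, which is what a plateau shows.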

Memory Profiling for Performance

# Go: allocation profiling (one package at a time; yourpkg is a placeholder)
go test -memprofile=mem.prof -bench=. ./pkg/yourpkg
go tool pprof -alloc_objects mem.prof  # Count of allocations
go tool pprof -alloc_space mem.prof    # Size of allocations

# Node.js: allocation timeline in DevTools
# Memory panel → Allocation instrumentation on timeline
# Shows objects allocated over time, find what survives GC

# Python: memray for allocation hot spots
memray run -o output.bin --trace-python-allocators script.py
memray flamegraph output.bin

I/O Performance

# Identify slow queries
# PostgreSQL:
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
# Look for:
#   Seq Scan on large table → add index
#   Nested Loop with high row count → consider join strategy
#   Sort with external merge → increase work_mem

# N+1 query detection:
# Count queries per request (log all queries, count):
grep -cE "SELECT|INSERT|UPDATE|DELETE" query.log
# If count scales with data size → N+1 problem

# Connection pool exhaustion:
# PostgreSQL:
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
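# If the total across states ≈ max_connections (SHOW max_connections;) → pool exhaustion
# Many "idle in transaction" rows usually mean connections are not being released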
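
The N+1 pattern is easy to demonstrate by counting statements per operation. The sketch below uses an in-memory SQLite database purely for illustration; the `authors`/`books` schema and both functions are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(i, f"author{i}") for i in range(20)])
conn.executemany("INSERT INTO books VALUES (?, ?, ?)",
                 [(i, i % 20, f"book{i}") for i in range(100)])
conn.commit()

executed = []
conn.set_trace_callback(executed.append)  # record every statement SQLite runs


def titles_n_plus_one():
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in conn.execute("SELECT id, name FROM authors").fetchall():
        rows = conn.execute("SELECT title FROM books WHERE author_id = ?", (author_id,))
        result[name] = [title for (title,) in rows]
    return result


def titles_single_join():
    """The same data fetched with a single JOIN."""
    result = {}
    query = "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
    for name, title in conn.execute(query):
        result.setdefault(name, []).append(title)
    return result


executed.clear()
titles_n_plus_one()
n_plus_one_count = len(executed)

executed.clear()
titles_single_join()
join_count = len(executed)

print(n_plus_one_count, join_count)
```

The first count grows with the number of authors while the second stays constant, which is the scaling behaviour to look for in a query log.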

API Debugging

Request/Response Inspection

# Full request/response with timing
curl -v -w "\n\nTiming:\n  DNS:     %{time_namelookup}s\n  Connect: %{time_connect}s\n  TLS:     %{time_appconnect}s\n  TTFB:    %{time_starttransfer}s\n  Total:   %{time_total}s\n  Size:    %{size_download} bytes\n" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"key": "value"}' \
  https://api.example.com/endpoint

# Compare expected vs actual response
diff <(curl -s expected-endpoint | jq .) <(curl -s actual-endpoint | jq .)

Status Code Debugging

2xx: Success (but check response body for soft errors)
├─ 200: OK
├─ 201: Created (check Location header for new resource URL)
└─ 204: No Content (no response body expected)

3xx: Redirect (follow with curl -L, check redirect chain)
├─ 301: Permanent redirect (cache implications)
├─ 302: Temporary redirect
└─ 304: Not Modified (caching working correctly)

4xx: Client error (fix the request)
├─ 400: Bad Request → check request body against API schema
├─ 401: Unauthorized → check token validity, expiration
├─ 403: Forbidden → check permissions, scopes, IP allowlist
├─ 404: Not Found → check URL path, resource existence
├─ 405: Method Not Allowed → check HTTP method (GET vs POST)
├─ 409: Conflict → check for duplicate/concurrent operations
├─ 413: Payload Too Large → reduce request body size
├─ 422: Unprocessable Content → well-formed body that fails semantic validation
├─ 429: Too Many Requests → check Retry-After header, implement backoff
└─ 431: Request Header Fields Too Large → reduce cookie/header size

5xx: Server error (usually not your fault, but check your request)
├─ 500: Internal Server Error → check server logs
├─ 502: Bad Gateway → upstream service down
├─ 503: Service Unavailable → service overloaded or deploying
└─ 504: Gateway Timeout → upstream too slow, check timeout settings

Header Debugging

# CORS debugging
curl -v -X OPTIONS \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST" \
  -H "Access-Control-Request-Headers: Content-Type,Authorization" \
  https://api.example.com/endpoint

# Check response headers:
# Access-Control-Allow-Origin: must match your origin (or * for requests without credentials)
# Access-Control-Allow-Methods: must include your method
# Access-Control-Allow-Headers: must include your custom headers
# Access-Control-Allow-Credentials: must be true if sending cookies (and the origin must not be *)

# Content-Type debugging
# Sending JSON but getting 400? Check:
curl -H "Content-Type: application/json" ...   # CORRECT
curl -H "Content-Type: text/plain" ...         # WRONG for JSON APIs

# Auth header debugging
# Bearer token:
curl -H "Authorization: Bearer eyJhbG..." ...
# Basic auth:
curl -u username:password ...
# API key:
curl -H "X-API-Key: your-key" ...
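
The CORS checks listed earlier in this section can be automated against a captured preflight response. This is a simplified sketch: `check_preflight` is a hypothetical helper, it does not model the browser's safelisted methods and headers, and the sample response is invented:

```python
def check_preflight(headers, origin, method, request_headers=(), with_credentials=False):
    """Return a list of problems found in a preflight (OPTIONS) response."""
    h = {k.lower(): v for k, v in headers.items()}
    problems = []

    allow_origin = h.get("access-control-allow-origin")
    if allow_origin is None:
        problems.append("missing Access-Control-Allow-Origin")
    elif allow_origin == "*" and with_credentials:
        problems.append("wildcard origin is rejected for credentialed requests")
    elif allow_origin not in ("*", origin):
        problems.append(f"origin {origin} not allowed (got {allow_origin})")

    allowed_methods = {m.strip().upper()
                       for m in h.get("access-control-allow-methods", "").split(",")}
    if method.upper() not in allowed_methods and "*" not in allowed_methods:
        problems.append(f"method {method} not allowed")

    allowed_headers = {x.strip().lower()
                       for x in h.get("access-control-allow-headers", "").split(",")}
    for name in request_headers:
        if name.lower() not in allowed_headers and "*" not in allowed_headers:
            problems.append(f"header {name} not allowed")

    if with_credentials and h.get("access-control-allow-credentials") != "true":
        problems.append("Access-Control-Allow-Credentials must be true")
    return problems


response_headers = {  # hypothetical preflight response
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET, POST",
    "Access-Control-Allow-Headers": "Content-Type",
}
for problem in check_preflight(response_headers, "http://localhost:3000", "POST",
                               ["Content-Type", "Authorization"], with_credentials=True):
    print(problem)
```

An empty list means the response passes these particular checks; the browser console remains the authority on what was actually blocked.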

Payload Debugging

# Validate JSON syntax
echo '{"key": "value"}' | jq .

# Pretty-print API response
curl -s https://api.example.com/endpoint | jq .

# Compare schemas
# Save expected schema and actual response, then diff
curl -s https://api.example.com/endpoint | jq 'keys' > actual-keys.json
diff expected-keys.json actual-keys.json

# Check encoding issues
curl -s https://api.example.com/endpoint | file -
# Should report UTF-8 or ASCII text (exact wording varies by `file` version)
# If "ISO-8859" or "binary" → encoding mismatch

# Large payload debugging
curl -s https://api.example.com/endpoint | jq '. | length'  # Array length
curl -s https://api.example.com/endpoint | wc -c             # Byte count

Timeout and Retry Debugging

# Test with explicit timeout
curl --connect-timeout 5 --max-time 30 https://api.example.com/endpoint

# If timing out, check at each layer:
# 1. DNS resolution
dig api.example.com
nslookup api.example.com

# 2. TCP connectivity
nc -zv api.example.com 443

# 3. TLS handshake
openssl s_client -connect api.example.com:443

# 4. HTTP response time
curl -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" https://api.example.com/endpoint

# Retry with exponential backoff (script)
for i in 1 2 4 8 16; do
    if curl -sf https://api.example.com/health; then
        echo "Service is up"
        break
    fi
    echo "Retry in ${i}s..."
    sleep $i
done
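
In application code the same retry loop usually adds a cap and random jitter so that many clients do not retry in lockstep. A minimal sketch, where `retry_with_backoff` and the flaky health check are hypothetical, and the injectable `sleep` parameter exists only to make the function testable:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=16.0,
                       sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the delay cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # in real code, catch only errors that are worth retrying
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter avoids synchronized retries


# Usage with a hypothetical operation that succeeds on the third call.
calls = {"n": 0}


def flaky_health_check():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service not ready")
    return "up"


delays = []
print(retry_with_backoff(flaky_health_check, sleep=delays.append))
```

When the server sends a `Retry-After` header (as with 429 responses), honouring that value is generally preferable to the computed delay.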

Deployment Issues ("Works on My Machine")

Environment Diff Checklist

# 1. OS and architecture
uname -a                              # Linux/macOS
# Compare: local vs CI vs production

# 2. Runtime versions
node --version                        # Node.js
python --version                      # Python
go version                            # Go
rustc --version                       # Rust

# 3. Dependency versions
# Node.js (lockfileVersion 2+; npm 6 lock files use .dependencies instead of .packages):
diff <(jq -S '.packages | map_values(.version)' package-lock.json) \
     <(ssh prod 'jq -S ".packages | map_values(.version)" /app/package-lock.json')

# Python:
diff <(pip list --format=freeze | sort) \
     <(ssh prod 'pip list --format=freeze | sort')

# 4. Environment variables (filter out likely secrets before printing)
diff <(env | sort | grep -viE 'secret|token|password|key') \
     <(ssh prod "env | sort | grep -viE 'secret|token|password|key'")

# 5. Config files (byte-for-byte comparison)
diff local.env <(ssh prod 'cat /app/.env')

# 6. System resources
free -h                               # Memory
df -h                                 # Disk space
ulimit -n                             # File descriptor limit
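
When the two environments have been captured as text (for example from `env` output), the comparison can be scripted so that secret values never reach the terminal. A sketch under assumptions: `parse_env` and `diff_env` are hypothetical helpers, the sample variables are invented, and multi-line values are not handled:

```python
import re

SENSITIVE = re.compile(r"SECRET|TOKEN|PASSWORD|KEY", re.IGNORECASE)


def parse_env(text):
    """Parse `env` output (NAME=value per line) into a dict."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {name: value for name, value in pairs}


def diff_env(local, remote):
    """Return {name: (local_value, remote_value)} for variables that differ."""
    changes = {}
    for name in sorted(local.keys() | remote.keys()):
        a, b = local.get(name), remote.get(name)
        if a != b:
            if SENSITIVE.search(name):  # report the difference, hide the values
                a, b = a and "<redacted>", b and "<redacted>"
            changes[name] = (a, b)
    return changes


local = parse_env("NODE_ENV=development\nAPI_TOKEN=abc\nTZ=UTC")
prod = parse_env("NODE_ENV=production\nAPI_TOKEN=xyz\nTZ=UTC\nHTTP_PROXY=http://proxy:3128")
for name, (a, b) in diff_env(local, prod).items():
    print(f"{name}: local={a!r} prod={b!r}")
```

A value of `None` on one side means the variable is missing there, which is often the actual cause of a "works on my machine" report.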

Docker Reproducibility

# Ensure same image locally and in production
docker inspect IMAGE --format '{{.Id}}'  # Compare image IDs

# Run locally with production-equivalent constraints
docker run \
  --memory=512m \
  --cpus=1 \
  --env-file production.env \
  --network=host \
  IMAGE

# Debug inside the exact production image
docker run -it --entrypoint /bin/sh PRODUCTION_IMAGE

Dependency Differences

# Check if lock file is fresh
# Node.js: compare node_modules to lock file
npm ls --all 2>&1 | grep "WARN\|ERR"

# Python: check for mismatched requirements
pip check

# Go: verify module checksum
go mod verify

# Common issue: "works locally" because you have a package
# installed globally that is not in the project's dependencies
# Test: run in clean environment (Docker, CI)

File System Differences

Common traps:
├─ Case sensitivity: macOS/Windows are case-insensitive, Linux is case-sensitive
│  import User from './user'  ← works on Mac, fails on Linux if file is User.js
│
├─ Path separators: Windows uses \, Linux/macOS uses /
│  Use path.join() or path.resolve(), never hardcode separators
│
├─ Line endings: Windows CRLF (\r\n) vs Unix LF (\n)
│  Scripts with CRLF fail on Linux: /bin/bash^M: bad interpreter
│  Fix: git config core.autocrlf input
│
├─ File permissions: Linux/macOS have execute bits, Windows does not
│  chmod +x script.sh has no effect on Windows
│
├─ Max path length: Windows has 260 char limit (unless LongPathsEnabled)
│  node_modules paths can exceed this on Windows
│
└─ Symlinks: Windows requires admin privileges or Developer Mode
   npm link / yarn link may fail on Windows
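
The case-sensitivity trap can be caught on a case-insensitive machine by comparing each path component with the names actually stored on disk. A minimal sketch; `exact_case_exists` is a hypothetical helper and the `models/User.js` layout is invented:

```python
import os
import tempfile


def exact_case_exists(root, relative_path):
    """True only if every component under `root` matches the on-disk name exactly."""
    current = root
    for part in relative_path.replace("\\", "/").split("/"):
        if part in ("", "."):
            continue
        try:
            names = os.listdir(current)  # listdir returns names as stored on disk
        except OSError:
            return False
        if part not in names:
            return False
        current = os.path.join(current, part)
    return True


project = tempfile.mkdtemp()
os.mkdir(os.path.join(project, "models"))
with open(os.path.join(project, "models", "User.js"), "w") as f:
    f.write("// placeholder\n")

print(exact_case_exists(project, "models/User.js"))
print(exact_case_exists(project, "models/user.js"))
```

Only the first path matches the on-disk spelling, so the second would break on Linux even though it may resolve on macOS or Windows. Running such a check over import paths in a pre-commit hook is one way to surface the problem before CI does.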

Network Differences

# DNS resolution differences
dig +short api.example.com              # What does DNS resolve to here?
ssh prod 'dig +short api.example.com'   # What about in production?

# Firewall differences
# Can the production server reach the external API?
ssh prod 'curl -sv https://external-api.com/health 2>&1 | head -20'

# Proxy differences
echo $HTTP_PROXY $HTTPS_PROXY $NO_PROXY
ssh prod 'echo $HTTP_PROXY $HTTPS_PROXY $NO_PROXY'

# TLS/certificate differences
openssl s_client -connect api.example.com:443 < /dev/null 2>/dev/null | openssl x509 -noout -dates
# Check if production has different CA bundle
ssh prod 'openssl s_client -connect api.example.com:443 < /dev/null 2>/dev/null | openssl x509 -noout -dates'

# MTU / packet size issues (rare but painful)
ping -M do -s 1472 api.example.com      # Test path MTU

Quick "Works on My Machine" Decision Tree

Does it fail in Docker locally (same image as prod)?
├─ No → Environment difference. Compare: env vars, config, DNS, network
└─ Yes → Does it fail in CI?
   ├─ No → Data or state difference. Compare: database, cache, file system
   └─ Yes → Code bug. Use standard debugging workflow.
       └─ But I swear it works on my machine!
          → Run in clean checkout: git stash -u && npm ci && npm test
          → If it fails: your working tree had uncommitted changes (or stale node_modules) that hid the bug
          → If it still passes: local cache/build artifact masking the bug
             → rm -rf node_modules .next dist build && npm ci && npm test