Comprehensive profiling guide across languages and runtimes.
A flamegraph is a visualization of stack traces collected by a sampling profiler.
┌─────────────────────────────────────────────────────┐
│ expensiveComputation() │ ← Leaf (top): where CPU time is spent
├───────────────────────┬─────────────────────────────┤
│ processItem() │ validateInput() │ ← Callees of handleRequest
├───────────────────────┴─────────────────────────────┤
│ handleRequest() │ ← Called by main
├─────────────────────────────────────────────────────┤
│ main() │ ← Root (bottom): entry point
└─────────────────────────────────────────────────────┘
Reading rules:
- Width = proportion of total samples (wider = more CPU time)
- Height = stack depth (bottom = caller, top = callee)
- Colors = typically random or language-based (not meaningful)
- Left-to-right order = alphabetical (not temporal)
What to look for:
1. WIDE bars at the TOP → functions consuming the most CPU directly
2. WIDE bars in the MIDDLE → functions whose callees consume most CPU
3. Narrow tall towers → deep call stacks but fast (usually fine)
4. Flat plateaus → single function dominating CPU time
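To make the width rule concrete, here is a minimal Python sketch that computes each function's share of samples from folded-stack text (the `frame;frame;frame count` format that flamegraph tooling consumes). The sample data and the name `frame_widths` are made up for illustration, and the sketch aggregates by function name rather than by position in the tree, which a real renderer would not do.

```python
from collections import Counter

# Hypothetical folded stacks: "root;...;leaf sample_count"
FOLDED = """\
main;handleRequest;processItem;expensiveComputation 70
main;handleRequest;validateInput 20
main;handleRequest 10
"""

def frame_widths(folded: str) -> dict:
    """Return {frame: (total_share, self_share)} as fractions of all samples."""
    total, own = Counter(), Counter()
    all_samples = 0
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        frames, n = stack.split(";"), int(count)
        all_samples += n
        for frame in set(frames):   # count a recursive frame once per stack
            total[frame] += n
        own[frames[-1]] += n        # the leaf is where the CPU actually was
    return {f: (total[f] / all_samples, own[f] / all_samples) for f in total}

widths = frame_widths(FOLDED)
for frame, (tot, self_share) in sorted(widths.items(), key=lambda kv: -kv[1][0]):
    print(f"{frame:22s} width={tot:5.0%}  self={self_share:5.0%}")
```

A frame with a large width but near-zero self share (like `main` here) is wide only because of its callees; the optimization target is the frame where both numbers are large.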
Top-Down (Caller → Callee):
- Start from root (main), follow widest paths down
- Good for: understanding call hierarchy, finding which code path is slow
- Question answered: "What is my application doing?"
Bottom-Up (Callee → Caller):
- Start from leaf functions, trace up to callers
- Good for: finding hot functions regardless of who calls them
- Question answered: "Which functions use the most CPU?"
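The difference between the two views can be sketched in a few lines of Python. In this hypothetical profile, `copy` is called from two places, so a top-down view splits its cost across two branches, while the bottom-up view merges it. The stack data and the `bottom_up` helper are illustrative assumptions, not output from any particular profiler.

```python
from collections import defaultdict

# Hypothetical (stack, samples) pairs
STACKS = [
    ("main;handleRequest;parse;copy", 30),
    ("main;handleRequest;render;copy", 50),
    ("main;handleRequest;render", 20),
]

def bottom_up(stacks):
    """Map each leaf function to {immediate caller: samples}."""
    view = defaultdict(lambda: defaultdict(int))
    for stack, samples in stacks:
        frames = stack.split(";")
        leaf = frames[-1]
        caller = frames[-2] if len(frames) > 1 else "(root)"
        view[leaf][caller] += samples
    return view

view = bottom_up(STACKS)
hottest = max(view, key=lambda leaf: sum(view[leaf].values()))
print(hottest, dict(view[hottest]))
```

Neither top-down branch alone shows `copy` as dominant, but merged bottom-up it accounts for 80 of 100 samples.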
Differential Flamegraphs:
- Compare two profiles (before/after change, baseline/regression)
- Red = regression (more samples), Blue = improvement (fewer samples)
- Generate: difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg
  (pass --negate to flamegraph.pl to swap the red/blue hues)
- Tools: speedscope, Firefox Profiler, pprof diff
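The comparison itself is simple arithmetic. The following Python sketch diffs two folded-stack profiles by share of samples, so runs of different length remain comparable. The inputs and the function names are hypothetical, and real tools add coloring and tree layout on top of this.

```python
from collections import Counter

def parse_folded(text: str) -> Counter:
    """Parse 'stack count' lines into a Counter keyed by stack."""
    counts = Counter()
    for line in text.splitlines():
        if line.strip():
            stack, n = line.rsplit(" ", 1)
            counts[stack] += int(n)
    return counts

def diff_profiles(before: str, after: str):
    """Return (stack, change in share of samples), largest regression first."""
    b, a = parse_folded(before), parse_folded(after)
    b_total, a_total = sum(b.values()) or 1, sum(a.values()) or 1
    deltas = {s: a[s] / a_total - b[s] / b_total for s in set(b) | set(a)}
    return sorted(deltas.items(), key=lambda kv: -kv[1])

# Hypothetical baseline and regression profiles
BEFORE = "main;parse 40\nmain;render 60\n"
AFTER = "main;parse 140\nmain;render 60\n"

result = diff_profiles(BEFORE, AFTER)
for stack, delta in result:
    print(f"{delta:+.0%}  {stack}")
```

Note that `main;render` has the same raw sample count in both runs but a smaller share afterwards; whether to compare raw counts or shares depends on whether the two runs did the same amount of work.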
Pattern: GC Pressure
- Look for: wide GC/runtime.gc bars, frequent small allocations
- Fix: reduce allocations, use object pools, pre-allocate buffers
Pattern: Lock Contention
- Look for: wide mutex/lock/wait bars in multiple goroutines/threads
- Fix: reduce critical section size, use lock-free data structures
Pattern: Regex Backtracking
- Look for: wide regex engine bars (re.match, regexp.exec)
- Fix: anchor patterns, use possessive quantifiers, compile once
Pattern: Serialization Overhead
- Look for: wide JSON.parse/encode/marshal bars
- Fix: schema-based serialization (protobuf, msgpack), streaming
Pattern: Syscall Heavy
- Look for: wide read/write/sendto/recvfrom system call bars
- Fix: buffered I/O, batch operations, io_uring (Linux)
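As an illustration of the buffered I/O fix for the syscall-heavy pattern, this Python sketch counts how many low-level writes reach a sink with and without buffering. `CountingSink` is a hypothetical stand-in for a file descriptor; each call to its `write` models one write system call.

```python
import io

class CountingSink(io.RawIOBase):
    """Hypothetical stand-in for a file descriptor: counts write() calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1      # each call here models one write syscall
        return len(data)

records = [b"x" * 10 for _ in range(1000)]

# Unbuffered: one low-level write per record
unbuffered = CountingSink()
for rec in records:
    unbuffered.write(rec)

# Buffered: records are batched into 8 KiB chunks before reaching the sink
sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=8192)
for rec in records:
    buffered.write(rec)
buffered.flush()

print("unbuffered writes:", unbuffered.calls)
print("buffered writes:  ", sink.calls)
```

In a flamegraph, the unbuffered variant would show up as a wide `write` bar under the loop; the buffered variant moves most of that time into cheap in-memory copies.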
# Install clinic.js globally
npm install -g clinic
# Doctor: automated diagnosis (CPU, memory, I/O, event loop)
clinic doctor -- node app.js
# Exercise your application, then Ctrl+C
# Opens HTML report with diagnosis and recommendations
# Flame: CPU flamegraph
clinic flame -- node app.js
# Exercise, Ctrl+C → interactive flamegraph
# BubbleProf: async operation visualization
clinic bubbleprof -- node app.js
# Shows async operations, delays, and dependencies
# Great for finding async bottlenecks invisible to CPU profilers
# Install and profile
npm install -g 0x
0x app.js
# Exercise, Ctrl+C → opens flamegraph in browser
# Profile with specific flags
0x --collect-only app.js # Collect stacks, don't generate graph
0x --visualize-only PID.0x # Generate graph from collected data
0x -o flamegraph.html app.js # Specify output file
# CPU profile (generates .cpuprofile)
node --cpu-prof --cpu-prof-interval=100 app.js
# Load in Chrome DevTools → Performance tab → Load profile
# Heap profile (generates .heapprofile)
node --heap-prof app.js
# Load in Chrome DevTools → Memory tab
# V8 trace optimization decisions
node --trace-opt --trace-deopt app.js 2>&1 | grep -iE "(optimiz|deopt)"
# Shows which functions V8 optimizes and deoptimizes
# GC tracing
node --trace-gc app.js
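# Output: one line per collection with GC type (e.g. Scavenge), heap size before -> after, and pause time
# Verbose GC details (per-space heap statistics after each collection)
node --trace-gc --trace-gc-verbose app.js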
1. Start Node with the inspector (node --inspect app.js, or --inspect-brk to pause on the first line), then open chrome://inspect
2. Click "inspect" on your Node.js target
3. Go to Performance tab
4. Click Record (●)
5. Perform the actions you want to profile
6. Stop recording
7. Analyze:
- Summary: pie chart of activity types
- Bottom-Up: hottest functions first
- Call Tree: top-down call hierarchy
- Event Log: chronological events
Key columns:
- Self Time: time in the function itself (not callees)
- Total Time: time including all callees
- Focus on high Self Time for optimization targets
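The relationship between the two columns is self time = total time minus the total time of direct callees. A small Python sketch with a hypothetical call tree (all names and timings are invented) shows why ranking by self time surfaces a different function than ranking by total time:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    total_ms: float
    children: list = field(default_factory=list)

    @property
    def self_ms(self) -> float:
        # Time spent in this function's own code, excluding callees
        return self.total_ms - sum(c.total_ms for c in self.children)

def by_self_time(root: Node):
    """Flatten the tree and sort functions by self time, highest first."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append((node.name, node.self_ms))
        stack.extend(node.children)
    return sorted(out, key=lambda pair: -pair[1])

# Hypothetical call tree
tree = Node("handleRequest", 100, [
    Node("parseBody", 15),
    Node("queryDb", 80, [Node("serializeRows", 60)]),
])

ranking = by_self_time(tree)
for name, self_ms in ranking:
    print(f"{name:15s} self={self_ms:5.1f} ms")
```

Here `handleRequest` has the highest total time but the lowest self time; the work is actually in `serializeRows`.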
# Heap snapshot via Chrome DevTools
node --inspect app.js
# DevTools → Memory → Take heap snapshot
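# Programmatic heap snapshots
# Built in (Node 11.13+):
# const v8 = require('v8');
# v8.writeHeapSnapshot('/tmp/heap-' + Date.now() + '.heapsnapshot');
# Or with the heapdump package (npm install heapdump):
# const heapdump = require('heapdump');
# heapdump.writeSnapshot('/tmp/heap-' + Date.now() + '.heapsnapshot');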
# Process memory monitoring (put the setInterval block in your app; run standalone, it reports only its own process)
node -e "
setInterval(() => {
  const mem = process.memoryUsage();
  console.log(JSON.stringify({
    rss_mb: (mem.rss / 1024 / 1024).toFixed(1),
    heap_used_mb: (mem.heapUsed / 1024 / 1024).toFixed(1),
    heap_total_mb: (mem.heapTotal / 1024 / 1024).toFixed(1),
    external_mb: (mem.external / 1024 / 1024).toFixed(1)
  }));
}, 5000);
"
# Event loop delay histogram (Node 11.10+; percentiles are reported in nanoseconds)
# const { monitorEventLoopDelay } = require('perf_hooks');
# const h = monitorEventLoopDelay({ resolution: 20 });
# h.enable();
# setInterval(() => console.log('p99:', h.percentile(99) / 1e6, 'ms'), 5000);
# Install
pip install py-spy
# Record flamegraph (no code changes needed)
py-spy record -o profile.svg -- python app.py
py-spy record -o profile.svg --pid PID # Attach to running process
# Live top-like view
py-spy top --pid PID
py-spy top -- python app.py
# Output format options
py-spy record -o profile.json --format speedscope -- python app.py
py-spy record -o profile.txt --format raw -- python app.py
# Profile subprocesses too
py-spy record --subprocesses -o profile.svg -- python app.py
# Sample rate (default 100 Hz)
py-spy record --rate 250 -o profile.svg -- python app.py
# Include native (C extension) frames
py-spy record --native -o profile.svg -- python app.py
# cProfile: function-level profiling (built-in)
import cProfile
import pstats
# Profile a function
cProfile.run('my_function()', 'output.prof')
# Analyze results
stats = pstats.Stats('output.prof')
stats.sort_stats('cumulative') # or 'tottime' for self time
stats.print_stats(20) # top 20 functions
# Command-line usage
# python -m cProfile -s cumulative app.py
# python -m cProfile -o output.prof app.py
# line_profiler: line-by-line profiling
# pip install line_profiler
# Decorate functions with @profile
# kernprof -l -v script.py
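When only one region of a program is of interest, cProfile can be enabled around just that region and the report captured as text. This is a minimal sketch using only the standard library; `slow_sum` and `workload` are hypothetical placeholders for the code under test.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    return sum(i * i for i in range(n))

def workload():
    return [slow_sum(20_000) for _ in range(20)]

profiler = cProfile.Profile()
profiler.enable()          # start collecting only around the region of interest
workload()
profiler.disable()

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
# 'tottime' ranks by self time; use 'cumulative' to include callees
stats.strip_dirs().sort_stats("tottime").print_stats(10)
report = buf.getvalue()
print(report)
```

Capturing the report in a string rather than printing directly makes it easy to log or attach to a bug report.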
# Install
pip install scalene
# Profile (no code changes needed)
scalene script.py
scalene --cpu --memory --gpu script.py
# Output as JSON for programmatic analysis
scalene --json --outfile profile.json script.py
# Profile only files whose path contains the given string
scalene --profile-only my_module script.py
# Web-based UI
scalene --html --outfile profile.html script.py
# What scalene shows:
# - CPU time (Python vs native code)
# - Memory allocation and deallocation per line
# - GPU usage per line
# - Copy volume (data copying overhead)
# Install
pip install memray
# Record memory allocations
memray run script.py
memray run --output output.bin script.py
# Attach to running process
memray attach PID
# Generate reports
memray flamegraph output.bin # Allocation flamegraph
memray table output.bin # Table of allocations
memray tree output.bin # Tree of allocations
memray summary output.bin # High-level summary
memray stats output.bin # Allocation statistics
# Live monitoring
memray run --live script.py # TUI live view
memray run --live-remote --live-port 9001 script.py # Remote live view
# Detect memory leaks
memray flamegraph --leaks output.bin # Show only leaked memory
memray table --leaks output.bin
# Temporal flamegraph (allocation over time)
memray flamegraph --temporal output.bin
import tracemalloc
# Start tracing
tracemalloc.start(25) # Store 25 frames of traceback
# Take snapshots at different points
snapshot1 = tracemalloc.take_snapshot()
# ... run code ...
snapshot2 = tracemalloc.take_snapshot()
# Top allocators
top_stats = snapshot2.statistics('lineno') # or 'traceback', 'filename'
for stat in top_stats[:10]:
    print(stat)
# Compare snapshots (find growth)
diff = snapshot2.compare_to(snapshot1, 'lineno')
for stat in diff[:10]:
    print(stat)
# Current memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024 / 1024:.1f} MB")
print(f"Peak: {peak / 1024 / 1024:.1f} MB")
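Put together, the snapshot comparison can be exercised end to end. The sketch below deliberately retains memory in a module-level list (a hypothetical leak named `retained`) and checks that the largest growth reported by `compare_to` points at the allocating line.

```python
import tracemalloc

retained = []  # hypothetical leak: grows on every request

def handle_request(i):
    retained.append(bytearray(10_000))   # allocation that is never released

tracemalloc.start(25)
before = tracemalloc.take_snapshot()

for i in range(200):
    handle_request(i)

after = tracemalloc.take_snapshot()
growth = after.compare_to(before, "lineno")   # sorted by largest change first
top = growth[0]
print(top)                                    # file:line, size delta, count delta
print("leaking line:", top.traceback[0].lineno)
tracemalloc.stop()
```

The top entry should show roughly 2 MB of growth attributed to the `retained.append` line.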
import objgraph
# Show most common types in memory
objgraph.show_most_common_types(limit=20)
# Show growth between two points
objgraph.show_growth(limit=10)
# ... do something ...
objgraph.show_growth(limit=10) # Shows only types that grew
# Find reference chains keeping objects alive
objgraph.show_backrefs(
    objgraph.by_type('MyClass')[0],
    max_depth=10,
    filename='refs.png'
)
# Count instances of a type
print(objgraph.count('dict'))
print(objgraph.count('MyClass'))
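// Enable pprof HTTP endpoint (add to your main.go)
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)
func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of application
}
// Or for non-HTTP applications, use runtime/pprof directly
// (imports "log", "os", "runtime/pprof"; place at the top of main):
f, err := os.Create("cpu.prof")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil {
	log.Fatal(err)
}
defer pprof.StopCPUProfile()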
# CPU profile (30 seconds)
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
# Interactive mode commands:
# top - top functions by CPU
# top -cum - top by cumulative time
# list funcName - source-level annotation
# web - open graph in browser
# svg - export call graph as SVG
# Web UI (recommended)
go tool pprof -http :8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
# Opens browser with flamegraph, graph, source, top views
# Heap profile (current allocations)
go tool pprof http://localhost:6060/debug/pprof/heap
# Heap profile options:
# -inuse_space (default) currently allocated bytes
# -inuse_objects currently allocated object count
# -alloc_space total bytes allocated (including freed)
# -alloc_objects total objects allocated (including freed)
# Goroutine profile (debug hanging/leaking goroutines)
go tool pprof http://localhost:6060/debug/pprof/goroutine
# Block profile (time spent blocking on sync primitives)
# Must enable: runtime.SetBlockProfileRate(1)
go tool pprof http://localhost:6060/debug/pprof/block
# Mutex profile (mutex contention)
# Must enable: runtime.SetMutexProfileFraction(5)
go tool pprof http://localhost:6060/debug/pprof/mutex
# Compare two profiles (differential)
go tool pprof -diff_base=base.prof current.prof
# Capture execution trace
curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
go tool trace trace.out
# Or in code:
# f, _ := os.Create("trace.out")
# trace.Start(f)
# defer trace.Stop()
# Trace viewer shows:
# - Goroutine execution timeline
# - Network blocking
# - Syscall blocking
# - Scheduler latency
# - GC events
# See what escapes to heap (allocations you may not expect)
go build -gcflags '-m' ./...
# More verbose
go build -gcflags '-m -m' ./...
# Common escape reasons:
# "moved to heap: x" - variable allocated on heap instead of stack
# "leaking param: x" - parameter escapes the function
# "x escapes to heap" - compiler cannot prove x doesn't outlive the stack frame
# Fix: reduce pointer usage, return values instead of pointers for small types,
# use sync.Pool for frequently allocated objects
# GC trace logging
GODEBUG=gctrace=1 ./myapp
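# Output format:
# gc N @Ts G%: A+B+C ms clock, ... ms cpu, H1->H2->H3 MB, S MB goal, P P
# N = GC number, T = seconds since program start, G = percentage of time spent in GC since start
# A+B+C = wall-clock ms for sweep termination, concurrent mark/scan, and mark termination
# H1->H2->H3 = heap size at GC start -> at GC end -> live heap, S = heap goal, P = processors used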
# Set GC target percentage (default 100 = GC when heap doubles)
GOGC=200 ./myapp # Less frequent GC, more memory usage
GOGC=50 ./myapp # More frequent GC, less memory usage
# Memory limit (Go 1.19+)
GOMEMLIMIT=1GiB ./myapp # Soft memory limit (the GC works harder near it, but the runtime can exceed it)
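For collecting gctrace output into metrics, a small parser helps. The Python sketch below extracts the main fields from one trace line and estimates the next heap goal from GOGC. The sample line is invented (only its shape follows the documented format), the regular expression is an assumption that may need adjusting for a given Go version, and the goal formula is a simplification that ignores stacks and globals, which newer Go releases also count.

```python
import re

# Illustrative line in the documented gctrace shape (values are invented)
LINE = ("gc 7 @1.204s 3%: 0.018+1.2+0.004 ms clock, "
        "0.14+0.35/2.1/0.9+0.035 ms cpu, 12->13->6 MB, 13 MB goal, 8 P")

GCTRACE = re.compile(
    r"gc (?P<num>\d+) @(?P<elapsed>[\d.]+)s (?P<gc_pct>\d+)%: "
    r"(?P<clock>[\d.+]+) ms clock, .*? ms cpu, "
    r"(?P<start>\d+)->(?P<end>\d+)->(?P<live>\d+) MB, (?P<goal>\d+) MB goal, "
    r"(?:.*, )?(?P<procs>\d+) P"
)

def parse_gctrace(line: str):
    """Return a dict of fields from one gctrace line, or None if it doesn't match."""
    m = GCTRACE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "gc_num": int(d["num"]),
        "elapsed_s": float(d["elapsed"]),
        "gc_time_percent": int(d["gc_pct"]),
        "clock_phases_ms": [float(x) for x in d["clock"].split("+")],
        "heap_start_mb": int(d["start"]),
        "heap_end_mb": int(d["end"]),
        "heap_live_mb": int(d["live"]),
        "heap_goal_mb": int(d["goal"]),
        "procs": int(d["procs"]),
    }

def next_heap_goal_mb(live_mb: float, gogc: int = 100) -> float:
    """Simplified heap goal: live heap * (1 + GOGC/100)."""
    return live_mb * (1 + gogc / 100)

parsed = parse_gctrace(LINE)
print(parsed)
for gogc in (50, 100, 200):
    print(f"GOGC={gogc}: next goal ~ {next_heap_goal_mb(parsed['heap_live_mb'], gogc)} MB")
```

With a 6 MB live heap, raising GOGC from 100 to 200 moves the approximate goal from 12 MB to 18 MB, which is the memory-for-CPU trade described above.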
# Install
cargo install flamegraph
# Generate flamegraph (release build recommended)
cargo flamegraph --bin myapp
cargo flamegraph --bin myapp -- --arg1 --arg2 # With arguments
cargo flamegraph --bench my_benchmark # Profile benchmarks
# Linux: may need to set perf permissions
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
# Or run with sudo
# Output: flamegraph.svg in current directory
# Install
cargo install samply
# Profile (opens Firefox Profiler UI)
samply record ./target/release/myapp
samply record ./target/release/myapp -- --arg1
# samply advantages:
# - Uses Firefox Profiler UI (excellent visualization)
# - Call tree, flame graph, and stack chart views of CPU samples
# - Per-thread timeline view
# - Source code annotation
# - No code changes needed
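# Requires Valgrind (alternatively, the dhat crate provides in-process heap profiling on stable Rust)
# With Valgrind:
valgrind --tool=dhat ./target/debug/myapp
# Writes dhat.out.<pid>; open it with the dh_view.html viewer shipped with Valgrind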
# DHAT shows:
# - Total bytes allocated
# - Maximum bytes live at any point
# - Total blocks allocated
# - Access patterns (reads/writes per block)
# - Short-lived allocations (allocated and freed quickly)
# - Allocation sites with full backtraces
# Install (Linux)
# sudo apt install heaptrack heaptrack-gui
# Profile
heaptrack ./target/release/myapp
# Analyze
heaptrack_gui heaptrack.myapp.*.zst
# heaptrack shows:
# - Allocation timeline
# - Allocation flamegraph
# - Peak memory consumers
# - Temporary allocation hotspots
# - Potential memory leaks (allocated, never freed)
// Avoid unnecessary allocations
use std::borrow::Cow;

// BAD: allocates a new String every call
fn process(name: &str) -> String {
    format!("Hello, {}!", name)
}
// GOOD: allocate only when needed, return a borrowed literal otherwise
// (renamed so both versions can live in one file)
fn process_cow(name: &str) -> Cow<'_, str> {
    if name.is_empty() {
        Cow::Borrowed("Hello, stranger!")
    } else {
        Cow::Owned(format!("Hello, {}!", name))
    }
}
// Use SmallVec for usually-small collections
// use smallvec::SmallVec;
// let mut v: SmallVec<[i32; 8]> = SmallVec::new(); // stack-allocated up to 8
// Use iterators instead of collecting (items: &[i32] or Vec<i32>)
// BAD: builds an intermediate Vec just to sum it
let filtered: Vec<i32> = items.iter().copied().filter(|&x| x > 5).collect();
let sum: i32 = filtered.iter().sum();
// GOOD: no intermediate allocation
let sum: i32 = items.iter().filter(|&&x| x > 5).sum();
// Pre-allocate when size is known
let mut v = Vec::with_capacity(1000); // One allocation
for i in 0..1000 {
    v.push(i); // No reallocation
}
Recording a performance profile:
1. Open DevTools (F12) → Performance tab
2. Click Record (●) or Ctrl+E
3. Perform the action to profile
4. Stop recording
5. Analyze the timeline
Key areas:
├─ Network: request waterfall (blocking, TTFB, download)
├─ Frames: FPS chart (green = 60fps, red = dropped frames)
├─ Timings: FCP, LCP, DCL markers
├─ Main: flame chart of main thread activity
│ ├─ Yellow = JavaScript execution
│ ├─ Purple = Layout/Rendering
│ ├─ Green = Paint/Composite
│ └─ Gray = System/Idle
├─ Raster: paint operations
└─ GPU: GPU activity
Long Tasks (>50ms):
- Flagged with red triangle in the timeline
- Block the main thread, cause jank
- Fix: break into smaller tasks with requestIdleCallback, setTimeout,
  or scheduler.postTask
Metric   Target     Measures
─────────────────────────────────────────────────────────────
LCP      ≤2.5 s     Largest Contentful Paint (perceived load)
INP      ≤200 ms    Interaction to Next Paint (responsiveness)
CLS      ≤0.1       Cumulative Layout Shift (visual stability)
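When working with raw field data, the targets can be applied programmatically. This Python sketch rates a metric at the 75th percentile of samples, which is how Core Web Vitals are assessed. The "good" bounds come from the table above; the "poor" bounds (4.0 s, 500 ms, 0.25) are the published Core Web Vitals thresholds and are not stated in this guide. The sample values and function names are hypothetical.

```python
import math

# metric: (upper bound for "good", upper bound for "needs improvement")
THRESHOLDS = {
    "LCP": (2500, 4000),   # milliseconds
    "INP": (200, 500),     # milliseconds
    "CLS": (0.1, 0.25),    # unitless score
}

def p75(values):
    """75th percentile by the nearest-rank method."""
    ordered = sorted(values)
    return ordered[math.ceil(0.75 * len(ordered)) - 1]

def rate(metric: str, value: float) -> str:
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"

# Hypothetical LCP samples from real-user monitoring, in milliseconds
lcp_samples_ms = [1800, 2100, 2400, 2600, 5200, 1900, 2200, 3100]
lcp_p75 = p75(lcp_samples_ms)
print("LCP p75:", lcp_p75, "ms ->", rate("LCP", lcp_p75))
```

Rating at p75 rather than the mean keeps one very slow page load (5200 ms here) from dominating the result, while still reflecting the slower quarter of users.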
Measurement tools:
- Lighthouse: npx lighthouse https://example.com --view
- web-vitals library: import { onLCP, onINP, onCLS } from 'web-vitals'
- Chrome DevTools → Performance → Timings row
- PageSpeed Insights: https://pagespeed.web.dev
- CrUX Dashboard (real user data)
React DevTools Profiler:
1. Install React DevTools browser extension
2. Open DevTools → Profiler tab
3. Click Record
4. Interact with your app
5. Stop recording
What it shows:
- Commit-by-commit render timeline
- Which components rendered and why
- Render duration per component
- Ranked chart (slowest components)
Programmatic profiling:
import { Profiler } from 'react';
function onRender(id, phase, actualDuration, baseDuration, startTime, commitTime) {
  console.log({ id, phase, actualDuration, baseDuration });
}
<Profiler id="MyComponent" onRender={onRender}>
  <MyComponent />
</Profiler>
// phase: "mount", "update", or "nested-update"
// actualDuration: time spent rendering (with memoization)
// baseDuration: time without memoization (worst case)
Step 1: Confirm the leak exists
├─ Monitor memory over time (RSS, heap)
├─ Perform repeated action cycles (create/destroy)
├─ Force GC between cycles
└─ If memory grows without bound → leak confirmed
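Step 1 can be automated as a small harness: run the action in rounds, force a collection between rounds, and check whether memory rises every time. The Python sketch below uses `tracemalloc` for the readings; `leaky_cycle`, `clean_cycle`, and the 64 KiB tolerance are hypothetical choices for illustration.

```python
import gc
import tracemalloc

_cache = {}  # hypothetical leak: entries are added and never evicted

def leaky_cycle(i):
    _cache[i] = [0] * 10_000

def clean_cycle(i):
    tmp = [0] * 10_000     # released when the function returns
    return len(tmp)

def grows_without_bound(cycle, rounds=5, per_round=50, tolerance=64 * 1024):
    """Run `cycle` in rounds, force GC between rounds, and report whether
    traced memory rose by more than `tolerance` bytes in every round."""
    tracemalloc.start()
    try:
        readings = []
        for r in range(rounds):
            for i in range(per_round):
                cycle(r * per_round + i)
            gc.collect()   # rule out garbage that is merely awaiting collection
            readings.append(tracemalloc.get_traced_memory()[0])
        return all(b - a > tolerance for a, b in zip(readings, readings[1:]))
    finally:
        tracemalloc.stop()

leaky_result = grows_without_bound(leaky_cycle)
clean_result = grows_without_bound(clean_cycle)
print("leaky_cycle leaks:", leaky_result)
print("clean_cycle leaks:", clean_result)
```

Requiring growth in every round, rather than comparing only the first and last reading, helps separate a real leak from one-time warm-up allocations such as caches that fill once.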
Step 2: Identify the leak type
├─ Growing collections (maps, arrays, caches without eviction)
├─ Event listener accumulation (add without remove)
├─ Closure captures (inner function holds reference to outer scope)
├─ Unreleased resources (file handles, DB connections, sockets)
├─ Circular references (in languages without cycle-collecting GC)
├─ Global state accumulation (module-level variables growing)
└─ Timer/interval not cleared (setInterval without clearInterval)
Step 3: Locate the leak source
├─ Take heap snapshots at different points
├─ Compare snapshots (objects allocated between snap 1 and 2)
├─ Sort by retained size
├─ Follow retainer chains to find root reference
└─ The "GC root → ... → leaked object" chain shows you what to fix
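In Python, a retainer chain can be walked by hand with `gc.get_referrers`, one hop at a time. The sketch below uses a hypothetical `Session` object held by a hypothetical long-lived `registry`; heap snapshot tools in other runtimes present the same idea graphically.

```python
import gc

class Session:          # hypothetical leaked type
    pass

registry = {"sessions": []}   # hypothetical long-lived container (the root)

def direct_holders(obj, kinds=(list, dict, set)):
    """Return container objects that directly reference obj."""
    return [r for r in gc.get_referrers(obj) if isinstance(r, kinds)]

leaked = Session()
registry["sessions"].append(leaked)

# Hop 1: which containers hold the leaked object?
level1 = direct_holders(leaked)
# Hop 2: which containers hold that list?
level2 = direct_holders(registry["sessions"])

print("held by the sessions list:", any(h is registry["sessions"] for h in level1))
print("list held by registry:    ", any(h is registry for h in level2))
```

The results also include incidental holders such as the module namespace, so in practice each hop needs filtering down to the containers that are expected to outlive the object.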
JavaScript / Node.js:
├─ Closures capturing large scope
│ Fix: null out references, restructure to minimize capture
├─ Event emitter listeners without removeListener
│ Fix: AbortController, cleanup in componentWillUnmount / useEffect return
├─ Global caches without LRU eviction
│ Fix: Use lru-cache package, set max (entry count) or maxSize
├─ Detached DOM nodes
│ Fix: Remove event listeners before removing elements
└─ Unresolved Promises holding references
Fix: Add timeout, ensure rejection paths release resources
Python:
├─ __del__ preventing GC of cycles (Python < 3.4; later versions collect such cycles)
│ Fix: Use weakref, avoid __del__, use context managers
├─ Module-level mutable defaults growing
│ Fix: Reset between requests, use request-scoped storage
├─ C extension objects not properly released
│ Fix: Explicit cleanup, context managers
└─ threading.local() without cleanup
Fix: Clean up in thread exit callback
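The weakref fix can be illustrated with a cache that does not keep its values alive. This is a minimal sketch for CPython; `Report`, `strong_cache`, and `weak_cache` are hypothetical names.

```python
import gc
import weakref

class Report:
    def __init__(self, name):
        self.name = name

strong_cache = {}                              # keeps every value alive forever
weak_cache = weakref.WeakValueDictionary()     # entries vanish with their values

strong_cache["q1"] = Report("q1")

report = Report("q1")
weak_cache["q1"] = report
print("while in use:", "q1" in weak_cache)

del report       # the last strong reference goes away
gc.collect()     # not needed on CPython for this case, but harmless

print("weak cache still has q1:  ", "q1" in weak_cache)
print("strong cache still has q1:", "q1" in strong_cache)
```

A weak-value cache suits objects that are expensive to build but also referenced elsewhere while in use; for values nothing else references, a bounded cache such as `functools.lru_cache(maxsize=...)` is the more fitting tool.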
Go:
├─ Goroutine leaks (blocked goroutines never collected)
│ Fix: Always provide cancellation (context.WithCancel)
├─ time.After in loops (before Go 1.23, each timer stays allocated until it fires)
│ Fix: Use time.NewTimer with Reset
├─ Slice header retaining large underlying array
│ Fix: Copy needed elements to new slice
└─ sync.Pool retaining oversized objects (e.g. buffers that grew large)
Fix: Cap the size of objects returned to the pool, profile pool usage
Rust:
├─ Rc/Arc cycles
│ Fix: Use Weak references to break cycles
├─ Forgotten JoinHandle (task never joined/cancelled)
│ Fix: Store and join/abort all spawned tasks
└─ Unbounded channels
Fix: Use bounded channels, apply backpressure
Node.js Leak Investigation:
[ ] Lowered --max-old-space-size to surface OOM earlier
[ ] Took 3+ heap snapshots at intervals
[ ] Compared snapshots in Chrome DevTools (Objects allocated between)
[ ] Sorted by Retained Size to find largest leaked objects
[ ] Followed retainer chain from leaked object to GC root
[ ] Checked active handle count (sockets, servers, timers): process._getActiveHandles().length
[ ] Checked pending async request count: process._getActiveRequests().length (both are undocumented internals)
[ ] Tested with clinic doctor for automated diagnosis
Python Leak Investigation:
[ ] Used tracemalloc to identify top allocation sites
[ ] Compared snapshots to find growing allocations
[ ] Used objgraph.show_growth() to find growing object types
[ ] Used objgraph.show_backrefs() to find reference chains
[ ] Checked gc.garbage for uncollectable objects (rarely populated since Python 3.4)
[ ] Used memray flamegraph --leaks to identify unreleased memory
[ ] Tested with gc.collect() to distinguish real leaks from delayed GC
Go Leak Investigation:
[ ] Checked goroutine count via pprof/goroutine
[ ] Looked for goroutine profile growth over time
[ ] Used runtime.NumGoroutine() in metrics
[ ] Checked for blocked channel operations
[ ] Verified all contexts have cancel called
[ ] Used goleak in tests to catch goroutine leaks
[ ] Compared heap profiles: pprof -diff_base