name: perf-ops
description: "Performance profiling and optimization - CPU, memory, bundle analysis, load testing, flamegraphs. Use for: performance, profiling, flamegraph, pprof, py-spy, clinic.js, memray, heaptrack, bundle size, webpack analyzer, load testing, k6, artillery, vegeta, locust, benchmark, hyperfine, criterion, slow query, EXPLAIN ANALYZE, N+1, caching, optimization, latency, throughput, p99, memory leak, CPU spike, bottleneck."
allowed-tools: "Read Edit Write Bash Glob Grep Agent"
related-skills: [debug-ops, monitoring-ops, testing-ops, code-stats]
Performance Operations
Cross-language performance profiling, optimization patterns, and load testing methodology.
Performance Issue Decision Tree
What symptom are you observing?
│
├─ High CPU usage
│ ├─ Sustained 100% on one core
│ │ └─ CPU-bound: hot loop, regex backtracking, tight computation
│ │ → Profile with flamegraph (py-spy, pprof, clinic flame, samply)
│ ├─ Sustained 100% across all cores
│ │ └─ Parallelism gone wrong: fork bomb, unbounded workers, spin locks
│ │ → Check process count, thread count, lock contention
│ └─ Periodic spikes
│ └─ GC pressure, cron job, batch processing, cache stampede
│ → Correlate with GC logs, scheduled tasks, traffic patterns
│
├─ High memory usage
│ ├─ Growing over time (never decreasing)
│ │ └─ Memory leak: unclosed resources, growing caches, event listener accumulation
│ │ → Heap snapshots over time, compare retained objects
│ ├─ Sudden large allocation
│ │ └─ Unbounded buffer, loading full dataset into memory, large file read
│ │ → Check allocation sizes, switch to streaming
│ └─ High but stable
│ └─ May be normal: in-memory cache, preloaded data, memory-mapped files
│ → Verify with expected working set size
│
├─ Slow responses / high latency
│ ├─ All endpoints slow
│ │ └─ Systemic: resource exhaustion, GC pauses, DNS issues, TLS overhead
│ │ → Check resource utilization, GC metrics, network path
│ ├─ Specific endpoint slow
│ │ └─ Query-specific: N+1 queries, missing index, unoptimized algorithm
│ │ → EXPLAIN ANALYZE, query logging, endpoint profiling
│ └─ Intermittently slow (p99 spikes)
│ └─ Contention: lock wait, connection pool exhaustion, noisy neighbor
│ → Check lock metrics, pool sizes, correlated traffic
│
├─ Low throughput
│ ├─ CPU not saturated
│ │ └─ I/O bound: disk wait, network latency, blocking calls in async code
│ │ → Check iowait, network RTT, ensure async throughout
│ ├─ CPU saturated
│ │ └─ Compute bound: need algorithmic improvement or horizontal scaling
│ │ → Profile hot paths, optimize or scale out
│ └─ Queues backing up
│ └─ Consumer too slow: batch size, consumer count, downstream bottleneck
│ → Increase consumers, optimize processing, check downstream
│
├─ Large bundle size (frontend)
│ ├─ Main bundle too large
│ │ └─ Missing code splitting, tree shaking not working, barrel file imports
│ │ → Bundle analyzer, check import patterns, add dynamic imports
│ ├─ Duplicate dependencies
│ │ └─ Multiple versions of same library bundled
│ │ → Dedupe, check peer dependencies, use resolutions
│ └─ Large assets
│ └─ Unoptimized images, embedded fonts, inline data URIs
│ → Image optimization, font subsetting, external assets
│
└─ Slow database queries
  ├─ Single slow query
  │ └─ Missing index, suboptimal join order, full table scan
  │ → EXPLAIN ANALYZE, add index, rewrite query
  ├─ Many small queries (N+1)
  │ └─ ORM lazy loading, loop with individual queries
  │ → Eager loading, batch queries, dataloader pattern
  └─ Lock contention
    └─ Long transactions, row-level locks, table locks
    → Shorten transactions, check isolation level, advisory locks
Profiling Tool Selection Matrix
| Problem | Node.js | Python | Go | Rust | Browser |
|---|---|---|---|---|---|
| CPU hotspots | clinic flame, 0x | py-spy, scalene | pprof (CPU) | cargo-flamegraph, samply | DevTools Performance |
| Memory leaks | clinic doctor, heap snapshot | memray, tracemalloc | pprof (heap) | DHAT, heaptrack | DevTools Memory |
| Memory allocation | --heap-prof | memray, scalene | pprof (allocs) | DHAT | DevTools Allocation |
| Async bottlenecks | clinic bubbleprof | asyncio debug mode | pprof (goroutine) | tokio-console | DevTools Performance |
| I/O profiling | clinic doctor | strace, py-spy | pprof (block) | strace, perf | Network tab |
| GC pressure | --trace-gc | gc.set_debug | GODEBUG=gctrace=1 | N/A (no GC) | Performance timeline |
| Lock contention | N/A | py-spy (threading) | pprof (mutex) | parking_lot stats | N/A |
| Startup time | --cpu-prof | python -X importtime | GODEBUG=inittrace=1 | cargo build --timings (build time only) | Lighthouse |
CPU Profiling Quick Reference
Flamegraph Basics
Reading a flamegraph:
- X-axis: proportion of total samples (wider = more time); in most tools the left-to-right order is alphabetical, not chronological
- Y-axis: call stack depth (bottom = entry point, top = leaf)
- Color: random (not meaningful) in most tools
- Look for: wide plateaus at the top (functions with high self time)
- Ignore: narrow towers (deep call stacks that account for few samples)
Key actions:
1. Find the widest bars at the TOP of the graph
2. Trace down to see what calls them
3. Focus optimization on the widest leaf frames and the call paths that reach them
4. Re-profile after each change to verify improvement
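The "widest bars at the top" rule can also be applied to the raw data behind a flamegraph. The sketch below assumes the profile is available in the folded-stack text format (one `frame;frame;frame count` line per unique stack) that flamegraph tooling consumes. The function name `summarize_folded` and the sample stacks are hypothetical, chosen only for illustration.

```python
from collections import Counter


def summarize_folded(lines):
    """Return (self_samples, inclusive_samples) per function from folded stacks."""
    self_samples = Counter()
    inclusive = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The sample count is the last whitespace-separated field.
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        n = int(count)
        self_samples[frames[-1]] += n  # leaf frame = top of the flamegraph
        for fn in set(frames):  # count each function once per stack
            inclusive[fn] += n
    return self_samples, inclusive


# Hypothetical sample data, not taken from a real profile.
folded = [
    "main;handle_request;parse_json 70",
    "main;handle_request;render 20",
    "main;handle_request 5",
    "main;gc 5",
]

self_samples, inclusive = summarize_folded(folded)
total = sum(self_samples.values())
for fn, n in self_samples.most_common(2):
    print(f"{fn}: {n / total:.0%} self, {inclusive[fn] / total:.0%} inclusive")
```

Functions with a high self share are the wide plateaus worth optimizing. A function with a high inclusive share but low self share (here `handle_request`) is usually just a path to the real hotspot.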
Tool Quick Start
| Tool | Language | Command | Output |
|---|---|---|---|
| py-spy | Python | py-spy record -o profile.svg -- python app.py | SVG flamegraph |
| py-spy top | Python | py-spy top --pid PID | Live top-like view |
| pprof | Go | go tool pprof -http :8080 "http://localhost:6060/debug/pprof/profile?seconds=30" | Interactive web UI |
| clinic flame | Node.js | clinic flame -- node app.js | HTML flamegraph |
| 0x | Node.js | 0x app.js | Interactive HTML flamegraph |
| cargo-flamegraph | Rust | cargo flamegraph --bin myapp | SVG flamegraph |
| samply | Rust/C/C++ | samply record ./target/release/myapp | Firefox Profiler UI |
| perf | Linux (any) | perf record -g ./myapp && perf script \| inferno-collapse-perf \| inferno-flamegraph > out.svg | SVG flamegraph |
Memory Profiling Quick Reference
| Tool | Language | Command | What It Shows |
|---|---|---|---|
| memray | Python | memray run -o output.bin script.py && memray flamegraph output.bin | Allocation flamegraph, leak detection |
| tracemalloc | Python | tracemalloc.start(); snapshot = tracemalloc.take_snapshot() | Top allocators, allocation traceback |
| scalene | Python | scalene script.py | CPU + memory + GPU in one profiler |
| heaptrack | C/C++/Rust | heaptrack ./myapp && heaptrack_gui heaptrack.myapp.*.zst | Allocation timeline, flamegraph, leak candidates |
| DHAT | Rust | valgrind --tool=dhat ./target/debug/myapp | Allocation sites, short-lived allocs |
| pprof (heap) | Go | go tool pprof http://localhost:6060/debug/pprof/heap | Live heap, allocation counts |
| Chrome heap | JS/Browser | DevTools → Memory → Take heap snapshot | Object retention, detached DOM |
| clinic doctor | Node.js | clinic doctor -- node app.js | Memory + CPU + event loop diagnosis |
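To make the "heap snapshots over time" approach concrete, here is a minimal sketch using the standard-library `tracemalloc` module from the table above. The leaking `leaky_cache` list and the `handle_request` function are hypothetical stand-ins for application code. The technique is simply to take one snapshot before and one after a workload, then diff them.

```python
import tracemalloc

leaky_cache = []  # hypothetical leak: a module-level list that only grows


def handle_request(i):
    # Each call retains ~10 KB that is never released.
    leaky_cache.append(bytearray(10_000))


tracemalloc.start()
before = tracemalloc.take_snapshot()

for i in range(200):
    handle_request(i)

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Entries are sorted by the absolute size difference, largest first.
# A positive size_diff means more memory is retained at that source line.
top = after.compare_to(before, "lineno")
for stat in top[:3]:
    print(stat)
```

In a real service the two snapshots would typically be taken minutes apart under steady traffic. Lines whose `size_diff` keeps growing across successive comparisons are the leak candidates.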
Bundle Analysis Quick Reference
| Tool | Bundler | Command | Output |
|---|---|---|---|
| webpack-bundle-analyzer | Webpack | npx webpack-bundle-analyzer stats.json | Interactive treemap |
| source-map-explorer | Any | npx source-map-explorer bundle.js | Treemap from source maps |
| rollup-plugin-visualizer | Rollup/Vite | Add plugin, build | HTML treemap |
| vite-bundle-visualizer | Vite | npx vite-bundle-visualizer | Treemap visualization |
| bundlephobia | npm | npx bundlephobia <package> | Package size analysis |
| size-limit | Any | Configure in package.json, run in CI | Size budget enforcement |
Bundle Size Reduction Checklist
[ ] Dynamic imports for routes and heavy components
[ ] Tree shaking working (check for side effects in package.json)
[ ] No barrel file re-exports pulling in entire modules
[ ] Lodash: use lodash-es or individual imports (lodash/debounce)
[ ] Moment.js replaced with date-fns or dayjs
[ ] Images optimized (WebP/AVIF, responsive sizes, lazy loading)
[ ] Fonts subsetted to used characters
[ ] Gzip/Brotli compression enabled on server
[ ] Source maps excluded from production bundle size
[ ] CSS purged of unused styles (PurgeCSS, Tailwind JIT)
Database Performance Quick Reference
EXPLAIN ANALYZE Interpretation
Key metrics in EXPLAIN ANALYZE output:
│
├─ Seq Scan → Full table scan (often bad for large tables)
│ └─ Fix: Add index on filter columns
├─ Index Scan → Using index (good)
├─ Bitmap Index Scan → Multiple index conditions combined (good)
├─ Nested Loop → OK for small inner table, bad for large joins
│ └─ Fix: Add index on join column, consider Hash Join
├─ Hash Join → Good for large equi-joins
├─ Sort → Check if index can provide order
│ └─ Fix: Add index matching ORDER BY
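├─ actual time → startup..total time in milliseconds, per loop (multiply by loops)
├─ rows → Actual rows vs planned (a large gap suggests stale statistics; run ANALYZE)
└─ buffers → shared hit (cache) vs read (disk I/O); requires EXPLAIN (ANALYZE, BUFFERS)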
N+1 Detection
Symptoms:
- Many identical queries with different WHERE values
- Response time scales linearly with result count
- Query log shows repeated patterns
Detection:
- Django: django-debug-toolbar, nplusone
- Rails: Bullet gem
- SQLAlchemy: create_engine(..., echo=True), look for repeated patterns
- General: enable slow query log, count queries per request
Fix:
- Eager loading (JOIN, prefetch, include)
- Batch queries (WHERE id IN (...))
- DataLoader pattern (batch + cache per request)
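The difference between the N+1 pattern and the batch fix is easiest to see by counting queries. The sketch below uses an in-memory SQLite database from the standard library so it runs anywhere. The `authors`/`books` schema and both function names are hypothetical, and a production ORM would express the batched version through its own eager-loading API instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(i, f"author{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO books VALUES (?, ?, ?)",
                 [(i, i, f"book{i}") for i in range(1, 51)])
conn.commit()

queries = []
conn.set_trace_callback(queries.append)  # record every statement SQLite runs


def books_n_plus_one():
    """One query for the authors, then one more query per author."""
    author_ids = [row[0] for row in conn.execute("SELECT id FROM authors")]
    return [
        conn.execute("SELECT title FROM books WHERE author_id = ?", (aid,)).fetchall()
        for aid in author_ids
    ]


def books_batched():
    """One query for the authors, then a single WHERE ... IN (...) query."""
    author_ids = [row[0] for row in conn.execute("SELECT id FROM authors")]
    placeholders = ",".join("?" * len(author_ids))
    return conn.execute(
        f"SELECT author_id, title FROM books WHERE author_id IN ({placeholders})",
        author_ids,
    ).fetchall()


queries.clear()
books_n_plus_one()
n_plus_one_count = len(queries)

queries.clear()
books_batched()
batched_count = len(queries)

print(f"N+1: {n_plus_one_count} queries, batched: {batched_count} queries")
```

Counting statements per request in this way is the same check the "count queries per request" detection step describes: the N+1 count grows with the number of rows, while the batched count stays constant.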
Load Testing Quick Reference
| Tool | Language | Strengths | Command |
|---|---|---|---|
| k6 | Go (JS scripts) | Scripted scenarios, thresholds, cloud | k6 run script.js |
| artillery | Node.js | YAML config, plugins, Playwright | artillery run config.yml |
| vegeta | Go | CLI piping, constant rate | echo "GET http://localhost" \| vegeta attack -duration=30s \| vegeta report |
| wrk | C | Lightweight, Lua scripts | wrk -t4 -c100 -d30s http://localhost |
| autocannon | Node.js | Programmatic, pipelining | autocannon -c 100 -d 30 http://localhost |
| locust | Python | Python classes, distributed | locust -f locustfile.py |
Load Test Types
Test Type Selection:
│
├─ Smoke Test
│ └─ Minimal load (1-2 VUs) to verify system works
│ Duration: 1-5 minutes
│
├─ Load Test
│ └─ Expected production load
│ Duration: 15-60 minutes
│ Goal: Verify SLOs are met under normal conditions
│
├─ Stress Test
│ └─ Beyond expected load, find breaking point
│ Ramp up until errors or unacceptable latency
│ Goal: Know the system's limits
│
├─ Spike Test
│ └─ Sudden burst of traffic
│ Instant jump to high load, then drop
│ Goal: Test auto-scaling, queue behavior
│
├─ Soak Test (Endurance)
│ └─ Moderate load for extended period (hours)
│ Goal: Find memory leaks, resource exhaustion, GC issues
│
└─ Breakpoint Test
  └─ Continuously ramp up until failure
  Goal: Find maximum capacity
Benchmarking Quick Reference
| Tool | Domain | Command | Notes |
|---|---|---|---|
| hyperfine | CLI commands | hyperfine 'cmd1' 'cmd2' | Warm-up, statistical analysis, export |
| criterion | Rust | cargo bench (with criterion dep) | Statistical, HTML reports, regression detection |
| testing.B | Go | go test -bench=. -benchmem | Built-in, memory allocs, sub-benchmarks |
| pytest-benchmark | Python | pytest --benchmark-only | Statistical, histograms, comparison |
| vitest bench | JS/TS | vitest bench | Built-in to Vitest, Tinybench engine |
| Benchmark.js | JS | Programmatic setup | Statistical analysis, ops/sec |
Benchmarking Best Practices
[ ] Warm up before measuring (JIT compilation, cache population)
[ ] Run multiple iterations (minimum 10, prefer 100+)
[ ] Report statistical summary (mean, median, stddev, min, max)
[ ] Control for system noise (close other apps, pin CPU frequency)
[ ] Compare against baseline (previous version, alternative impl)
[ ] Measure what matters (end-to-end, not micro-operations in isolation)
[ ] Profile before benchmarking (know WHAT to benchmark)
[ ] Document environment (hardware, OS, runtime version, flags)
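Several of the practices above (warm-up, multiple iterations, statistical summary, comparison against an alternative) fit in a few lines of standard-library Python. This is a minimal sketch rather than a replacement for pytest-benchmark or hyperfine. The `bench` helper and the two string-building functions are hypothetical examples.

```python
import statistics
import time


def bench(fn, *, warmup=5, iterations=50):
    """Time fn() repeatedly and return a statistical summary in seconds."""
    for _ in range(warmup):  # warm caches before measuring
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
        "min": min(samples),
        "max": max(samples),
        "n": len(samples),
    }


def concat_join():
    return "".join(str(i) for i in range(10_000))


def concat_plus():
    s = ""
    for i in range(10_000):
        s += str(i)
    return s


for name, fn in [("join", concat_join), ("plus", concat_plus)]:
    r = bench(fn)
    print(f"{name}: median={r['median'] * 1e3:.3f} ms "
          f"stdev={r['stdev'] * 1e3:.3f} ms n={r['n']}")
```

Reporting the median alongside the standard deviation makes it easier to tell a real difference between the two implementations from run-to-run noise.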
Optimization Patterns Quick Reference
| Pattern | When to Use | Example |
|---|---|---|
| Caching | Repeated expensive computations or I/O | Redis, in-memory LRU, CDN, memoization |
| Lazy loading | Resources not needed immediately | Dynamic imports, virtual scrolling, pagination |
| Connection pooling | Frequent DB/HTTP connections | PgBouncer, HikariCP, urllib3 pool |
| Batch operations | Many small operations on same resource | Bulk INSERT, DataLoader, batch API calls |
| Pagination | Large result sets | Cursor-based (not offset) for large datasets |
| Compression | Network transfer of text data | Brotli > gzip for static, gzip for dynamic |
| Streaming | Processing large files or datasets | Line-by-line, chunk processing, async iterators |
| Precomputation | Predictable expensive calculations | Materialized views, build-time generation |
| Denormalization | Read-heavy with expensive joins | Duplicate data for read performance |
| Index optimization | Slow queries on large tables | Composite indexes matching query patterns |
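As an illustration of the caching row, the following sketch memoizes a function with an in-memory LRU cache from the standard library. `expensive_lookup` is a hypothetical stand-in for a slow computation or I/O call, and the `calls` counter exists only to show how often the underlying work actually runs.

```python
from functools import lru_cache

calls = 0  # counts how many times the real work is performed


@lru_cache(maxsize=128)
def expensive_lookup(key: str) -> str:
    """Hypothetical expensive operation; results are cached per key."""
    global calls
    calls += 1
    return key.upper()


for key in ["a", "b", "a", "a", "b"]:
    expensive_lookup(key)

info = expensive_lookup.cache_info()
print(f"hits={info.hits} misses={info.misses} underlying calls={calls}")
```

The hit and miss counters from `cache_info()` are worth exporting as metrics: a low hit rate means the cache adds memory cost without saving much work. An in-process cache like this is also per-process, so the invalidation concerns listed under Common Gotchas still apply when data changes.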
Common Gotchas
| Gotcha | Why It Hurts | Fix |
|---|---|---|
| Premature optimization | Wastes time on non-bottlenecks, adds complexity | Profile first, optimize the measured hot path |
| Micro-benchmarks misleading | JIT, caching, branch prediction differ from real workload | Benchmark realistic workloads, validate with production metrics |
| Profiling overhead | Profiler itself skews results (observer effect) | Use sampling profilers (py-spy, pprof) not tracing profilers |
| Cache invalidation | Stale data served, inconsistent state across nodes | TTL + event-based invalidation, cache-aside pattern |
| Optimizing cold path | Spending effort on rarely-executed code | Focus on hot paths identified by profiling |
| Ignoring tail latency | p50 looks great but p99 is 10x worse | Measure and optimize p95/p99, not just averages |
| N+1 queries hidden by ORM | Each page load fires hundreds of queries | Enable query logging, use eager loading |
| Compression on small payloads | Overhead exceeds savings for payloads <150 bytes | Only compress above minimum size threshold |
| Connection pool too large | Each connection uses memory, causes lock contention | Size pool to CPU cores x 2-3, not hundreds |
| Missing async in I/O path | One blocking call serializes all concurrent requests | Audit entire request path for blocking calls |
| Benchmarking debug builds | Debug builds 10-100x slower, misleading results | Always benchmark release/optimized builds |
| Over-indexing database | Write performance degrades, storage bloats | Only index columns in WHERE, JOIN, ORDER BY clauses |
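The tail-latency gotcha is easy to demonstrate numerically. The sketch below uses a hypothetical set of 100 request latencies in which two slow outliers barely move the median but dominate p99. The numbers are invented for illustration and do not come from a real system.

```python
import statistics

# Hypothetical latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [20.0] * 98 + [900.0, 1200.0]

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
mean = statistics.mean(latencies_ms)

print(f"mean={mean:.1f} ms p50={p50:.1f} ms p95={p95:.1f} ms p99={p99:.1f} ms")
```

Here the mean and p50 both look healthy while p99 is far higher, which is why dashboards and load-test thresholds should track high percentiles explicitly.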
Reference Files
| File | Contents | Lines |
|---|---|---|
| references/cpu-memory-profiling.md | Flamegraph interpretation, Node.js/Python/Go/Rust/Browser profiling, memory leak detection | ~700 |
| references/load-testing.md | k6, Artillery, vegeta, wrk, Locust, load testing methodology, CI integration | ~600 |
| references/optimization-patterns.md | Caching, database, frontend, API, concurrency, memory, algorithm optimization | ~550 |
See Also
| Skill | When to Combine |
|---|---|
| debug-ops | Debugging performance regressions, root cause analysis for slowdowns |
| monitoring-ops | Production metrics, alerting on latency/throughput, dashboards |
| testing-ops | Performance regression tests in CI, benchmark suites |
| code-stats | Identify complex code that may be performance-sensitive |
| postgres-ops | PostgreSQL-specific query optimization, indexing, EXPLAIN |
| container-orchestration | Resource limits, pod scaling, container performance |