
feat(skills): Rewrite perf-ops as orchestrator with tier system

Replace static reference dump with lean orchestrator (295 lines).
Introduce a three-tier safety model: T1 diagnose inline, T2 profile
via language expert agents, T3 optimize with preflight confirmation.
Add parallel profiling, a before/after protocol, and a fallback mechanism.
Extract reference material to diagnosis-quickref.md and add
ci-integration.md for performance budgets and regression detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0xDarkMatter · 3 weeks ago · commit 47e9a5fb62

+ 243 - 274
skills/perf-ops/SKILL.md

@@ -1,325 +1,294 @@
 ---
 name: perf-ops
-description: "Performance profiling and optimization - CPU, memory, bundle analysis, load testing, flamegraphs. Use for: performance, profiling, flamegraph, pprof, py-spy, clinic.js, memray, heaptrack, bundle size, webpack analyzer, load testing, k6, artillery, vegeta, locust, benchmark, hyperfine, criterion, slow query, EXPLAIN ANALYZE, N+1, caching, optimization, latency, throughput, p99, memory leak, CPU spike, bottleneck."
-allowed-tools: "Read Edit Write Bash Glob Grep Agent"
-related-skills: [debug-ops, monitoring-ops, testing-ops, code-stats]
+description: "Performance profiling and optimization orchestrator - diagnoses symptoms, dispatches profiling to language experts, manages before/after comparisons. Triggers on: performance, profiling, flamegraph, pprof, py-spy, clinic.js, memray, heaptrack, bundle size, webpack analyzer, load testing, k6, artillery, vegeta, locust, benchmark, hyperfine, criterion, slow query, EXPLAIN ANALYZE, N+1, caching, optimization, latency, throughput, p99, memory leak, CPU spike, bottleneck."
+allowed-tools: "Read Edit Write Bash Glob Grep Agent TaskCreate TaskUpdate"
+related-skills: [debug-ops, monitoring-ops, testing-ops, code-stats, postgres-ops]
 ---
 
 # Performance Operations
 
-Cross-language performance profiling, optimization patterns, and load testing methodology.
+Orchestrator for cross-language performance profiling and optimization. Classifies symptoms inline, dispatches profiling to language expert agents (background), and manages optimization with confirmation.
 
-## Performance Issue Decision Tree
+## Architecture
 
 ```
-What symptom are you observing?
-│
-├─ High CPU usage
-│  ├─ Sustained 100% on one core
-│  │  └─ CPU-bound: hot loop, regex backtracking, tight computation
-│  │     → Profile with flamegraph (py-spy, pprof, clinic flame, samply)
-│  ├─ Sustained 100% across all cores
-│  │  └─ Parallelism gone wrong: fork bomb, unbounded workers, spin locks
-│  │     → Check process count, thread count, lock contention
-│  └─ Periodic spikes
-│     └─ GC pressure, cron job, batch processing, cache stampede
-│        → Correlate with GC logs, scheduled tasks, traffic patterns
-│
-├─ High memory usage
-│  ├─ Growing over time (never decreasing)
-│  │  └─ Memory leak: unclosed resources, growing caches, event listener accumulation
-│  │     → Heap snapshots over time, compare retained objects
-│  ├─ Sudden large allocation
-│  │  └─ Unbounded buffer, loading full dataset into memory, large file read
-│  │     → Check allocation sizes, switch to streaming
-│  └─ High but stable
-│     └─ May be normal: in-memory cache, preloaded data, memory-mapped files
-│        → Verify with expected working set size
-│
-├─ Slow responses / high latency
-│  ├─ All endpoints slow
-│  │  └─ Systemic: resource exhaustion, GC pauses, DNS issues, TLS overhead
-│  │     → Check resource utilization, GC metrics, network path
-│  ├─ Specific endpoint slow
-│  │  └─ Query-specific: N+1 queries, missing index, unoptimized algorithm
-│  │     → EXPLAIN ANALYZE, query logging, endpoint profiling
-│  └─ Intermittently slow (p99 spikes)
-│     └─ Contention: lock wait, connection pool exhaustion, noisy neighbor
-│        → Check lock metrics, pool sizes, correlated traffic
-│
-├─ Low throughput
-│  ├─ CPU not saturated
-│  │  └─ I/O bound: disk wait, network latency, blocking calls in async code
-│  │     → Check iowait, network RTT, ensure async throughout
-│  ├─ CPU saturated
-│  │  └─ Compute bound: need algorithmic improvement or horizontal scaling
-│  │     → Profile hot paths, optimize or scale out
-│  └─ Queues backing up
-│     └─ Consumer too slow: batch size, consumer count, downstream bottleneck
-│        → Increase consumers, optimize processing, check downstream
-│
-├─ Large bundle size (frontend)
-│  ├─ Main bundle too large
-│  │  └─ Missing code splitting, tree shaking not working, barrel file imports
-│  │     → Bundle analyzer, check import patterns, add dynamic imports
-│  ├─ Duplicate dependencies
-│  │  └─ Multiple versions of same library bundled
-│  │     → Dedupe, check peer dependencies, use resolutions
-│  └─ Large assets
-│     └─ Unoptimized images, embedded fonts, inline data URIs
-│        → Image optimization, font subsetting, external assets
-│
-└─ Slow database queries
-   ├─ Single slow query
-   │  └─ Missing index, suboptimal join order, full table scan
-   │     → EXPLAIN ANALYZE, add index, rewrite query
-   ├─ Many small queries (N+1)
-   │  └─ ORM lazy loading, loop with individual queries
-   │     → Eager loading, batch queries, dataloader pattern
-   └─ Lock contention
-      └─ Long transactions, row-level locks, table locks
-         → Shorten transactions, check isolation level, advisory locks
+User describes performance issue or requests profiling
+    |
+    +---> T1: Diagnose (inline, fast)
+    |       +---> Classify symptom (decision tree)
+    |       +---> Detect language/runtime from project
+    |       +---> Check installed profiling tools
+    |       +---> Determine production vs development
+    |       +---> Gather system baseline (CPU/mem/disk)
+    |       +---> Present: diagnosis + recommended profiling approach
+    |
+    +---> T2: Profile (dispatch to language expert, background)
+    |       +---> Select expert agent from routing table
+    |       +---> Build perf-focused dispatch prompt
+    |       +---> Expert runs profiler, collects data, interprets results
+    |       |       +---> Fallback: general-purpose with tool commands inlined
+    |       +---> Returns: findings + bottleneck identification + suggestions
+    |       |
+    |       +---> [Optional parallel dispatch]:
+    |             +---> CPU profiling agent  ---+
+    |             +---> Memory profiling agent --+--> Consolidate findings
+    |             +---> Baseline benchmark ------+
+    |
+    +---> T3: Optimize (dispatch to expert, foreground + confirm)
+            +---> Expert proposes specific code changes
+            +---> Preflight: what changes, expected impact, risks
+            +---> User confirms
+            +---> Apply changes
+            +---> Re-benchmark for before/after delta
 ```
 
-## Profiling Tool Selection Matrix
+## Safety Tiers
 
-| Problem | Node.js | Python | Go | Rust | Browser |
-|---------|---------|--------|----|------|---------|
-| **CPU hotspots** | clinic flame, 0x | py-spy, scalene | pprof (CPU) | cargo-flamegraph, samply | DevTools Performance |
-| **Memory leaks** | clinic doctor, heap snapshot | memray, tracemalloc | pprof (heap) | DHAT, heaptrack | DevTools Memory |
-| **Memory allocation** | --heap-prof | memray, scalene | pprof (allocs) | DHAT | DevTools Allocation |
-| **Async bottlenecks** | clinic bubbleprof | asyncio debug mode | pprof (goroutine) | tokio-console | DevTools Performance |
-| **I/O profiling** | clinic doctor | strace, py-spy | pprof (block) | strace, perf | Network tab |
-| **GC pressure** | --trace-gc | gc.set_debug | GODEBUG=gctrace=1 | N/A (no GC) | Performance timeline |
-| **Lock contention** | N/A | py-spy (threading) | pprof (mutex) | parking_lot stats | N/A |
-| **Startup time** | --cpu-prof | python -X importtime | go build -v | cargo build --timings | Lighthouse |
+### T1: Diagnose - Run Inline
 
-## CPU Profiling Quick Reference
+No agent needed. Execute directly via Bash for instant results.
 
-### Flamegraph Basics
+| Operation | Command / Method |
+|-----------|-----------------|
+| Detect Python profilers | `which py-spy memray scalene` |
+| Detect Go profilers | `which go && go tool pprof -h 2>/dev/null` |
+| Detect Rust profilers | `which cargo-flamegraph samply` |
+| Detect Node profilers | `which clinic 0x` |
+| Detect benchmarking tools | `which hyperfine k6 vegeta` |
+| System CPU baseline | `top -bn1 -o %CPU \| head -20` (Linux) or `wmic cpu get loadpercentage` (Win) |
+| System memory baseline | `free -h` (Linux) or `wmic OS get FreePhysicalMemory` (Win) |
+| Disk I/O check | `iostat -x 1 3` (Linux) |
+| Identify language | Check for `package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `requirements.txt` |
+| Production vs dev | Ask user or detect from environment (NODE_ENV, FLASK_ENV, etc.) |
+| Read existing profiles | Parse `.prof`, `.svg`, `.bin` files in project |
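
The detection rows above can be run as one small inline helper. A minimal Python sketch, assuming the marker files and tool names from the T1 tables (the function names are illustrative, not a fixed API):

```python
import shutil
from pathlib import Path

# Marker files and candidate profilers, mirroring the T1 tables above.
LANGUAGE_MARKERS = {
    "python": ["pyproject.toml", "requirements.txt"],
    "go": ["go.mod"],
    "rust": ["Cargo.toml"],
    "javascript": ["package.json"],
}
PROFILERS = {
    "python": ["py-spy", "memray", "scalene"],
    "go": ["go"],
    "rust": ["cargo-flamegraph", "samply"],
    "javascript": ["clinic", "0x"],
}

def detect_language(root="."):
    """Return the first language whose marker file exists under root, else None."""
    for lang, markers in LANGUAGE_MARKERS.items():
        if any(Path(root, m).exists() for m in markers):
            return lang
    return None

def installed_profilers(lang):
    """Filter the candidate profilers down to those actually on PATH."""
    return [tool for tool in PROFILERS.get(lang, []) if shutil.which(tool)]
```

The T1 report then pairs the detected language with the installed subset of tools before recommending a profiling approach.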
 
-```
-Reading a flamegraph:
-- X-axis: proportion of total samples (wider = more time)
-- Y-axis: call stack depth (bottom = entry point, top = leaf)
-- Color: random (not meaningful) in most tools
-- Look for: wide plateaus at the top (hot functions)
-- Ignore: narrow towers (called often but fast)
-
-Key actions:
-1. Find the widest bars at the TOP of the graph
-2. Trace down to see what calls them
-3. Focus optimization on the widest top-level functions
-4. Re-profile after each change to verify improvement
-```
+**Production safety rule:** In production environments, only recommend sampling profilers (py-spy, pprof HTTP endpoint, perf). Never suggest attaching debuggers, tracing profilers, or tools that require process restart.
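
The rule can be expressed as a guard the orchestrator checks before recommending any tool. A hedged sketch; the tool sets are assumptions distilled from the tables in this skill:

```python
# Sampling profilers are safe to attach to live processes; the second set
# requires instrumentation, tracing, or a restart, so it is development-only.
SAMPLING_PROFILERS = {"py-spy", "pprof", "perf", "samply"}
INTRUSIVE_PROFILERS = {"memray", "clinic", "0x", "heaptrack", "valgrind"}

def allowed_in(environment, tool):
    """Return True if the tool may be recommended in the given environment."""
    if environment == "production":
        return tool in SAMPLING_PROFILERS
    return tool in SAMPLING_PROFILERS or tool in INTRUSIVE_PROFILERS
```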
+
+### T2: Profile - Dispatch to Expert Agent
+
+Gather context from T1 diagnosis, then dispatch to the appropriate language expert.
+
+**Language Expert Routing:**
 
-### Tool Quick Start
-
-| Tool | Language | Command | Output |
-|------|----------|---------|--------|
-| **py-spy** | Python | `py-spy record -o profile.svg -- python app.py` | SVG flamegraph |
-| **py-spy top** | Python | `py-spy top --pid PID` | Live top-like view |
-| **pprof** | Go | `go tool pprof -http :8080 http://localhost:6060/debug/pprof/profile?seconds=30` | Interactive web UI |
-| **clinic flame** | Node.js | `clinic flame -- node app.js` | HTML flamegraph |
-| **0x** | Node.js | `0x app.js` | SVG flamegraph |
-| **cargo-flamegraph** | Rust | `cargo flamegraph --bin myapp` | SVG flamegraph |
-| **samply** | Rust/C/C++ | `samply record ./target/release/myapp` | Firefox Profiler UI |
-| **perf** | Linux (any) | `perf record -g ./myapp && perf script \| inferno-flamegraph > out.svg` | SVG flamegraph |
-
-## Memory Profiling Quick Reference
-
-| Tool | Language | Command | What It Shows |
-|------|----------|---------|---------------|
-| **memray** | Python | `memray run script.py && memray flamegraph output.bin` | Allocation flamegraph, leak detection |
-| **tracemalloc** | Python | `tracemalloc.start(); snapshot = tracemalloc.take_snapshot()` | Top allocators, allocation traceback |
-| **scalene** | Python | `scalene script.py` | CPU + memory + GPU in one profiler |
-| **heaptrack** | C/C++/Rust | `heaptrack ./myapp && heaptrack_gui heaptrack.myapp.*.zst` | Allocation timeline, flamegraph, leak candidates |
-| **DHAT** | Rust | `valgrind --tool=dhat ./target/debug/myapp` | Allocation sites, short-lived allocs |
-| **pprof (heap)** | Go | `go tool pprof http://localhost:6060/debug/pprof/heap` | Live heap, allocation counts |
-| **Chrome heap** | JS/Browser | DevTools → Memory → Take heap snapshot | Object retention, detached DOM |
-| **clinic doctor** | Node.js | `clinic doctor -- node app.js` | Memory + CPU + event loop diagnosis |
-
-## Bundle Analysis Quick Reference
-
-| Tool | Bundler | Command | Output |
-|------|---------|---------|--------|
-| **webpack-bundle-analyzer** | Webpack | `npx webpack-bundle-analyzer stats.json` | Interactive treemap |
-| **source-map-explorer** | Any | `npx source-map-explorer bundle.js` | Treemap from source maps |
-| **rollup-plugin-visualizer** | Rollup/Vite | Add plugin, build | HTML treemap |
-| **vite-bundle-visualizer** | Vite | `npx vite-bundle-visualizer` | Treemap visualization |
-| **bundlephobia** | npm | `npx bundlephobia <package>` | Package size analysis |
-| **size-limit** | Any | Configure in package.json, run in CI | Size budget enforcement |
-
-### Bundle Size Reduction Checklist
+| Detected Language | Expert Agent | Key Profiling Tools |
+|-------------------|-------------|---------------------|
+| Python (.py, pyproject.toml, requirements.txt) | python-expert | py-spy, memray, scalene, tracemalloc |
+| Go (go.mod, .go files) | go-expert | pprof (CPU/heap/goroutine/mutex), benchstat |
+| Rust (Cargo.toml, .rs files) | rust-expert | cargo-flamegraph, samply, DHAT, criterion |
+| TypeScript/JavaScript (backend, package.json + server) | javascript-expert | clinic flame/doctor/bubbleprof, 0x |
+| TypeScript/JavaScript (frontend, bundle issues) | typescript-expert | webpack-bundle-analyzer, Lighthouse, source-map-explorer |
+| SQL / PostgreSQL | postgres-expert | EXPLAIN ANALYZE, pg_stat_statements, pgbench |
+| General / unknown / CLI benchmarking | general-purpose | hyperfine, perf, strace |
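
As data, the routing table might look like the following sketch. The agent names are taken from the table above; the default entry mirrors the general-purpose fallback:

```python
# Language -> (expert agent, key profiling tools). Unknown languages fall
# back to general-purpose with generic tooling.
EXPERT_ROUTING = {
    "python": ("python-expert", ["py-spy", "memray", "scalene", "tracemalloc"]),
    "go": ("go-expert", ["pprof", "benchstat"]),
    "rust": ("rust-expert", ["cargo-flamegraph", "samply", "dhat", "criterion"]),
    "javascript": ("javascript-expert", ["clinic", "0x"]),
    "typescript-frontend": ("typescript-expert", ["webpack-bundle-analyzer", "lighthouse"]),
    "sql": ("postgres-expert", ["EXPLAIN ANALYZE", "pg_stat_statements", "pgbench"]),
}

def route(language):
    """Return (agent, tools), defaulting to the general-purpose fallback."""
    return EXPERT_ROUTING.get(language, ("general-purpose", ["hyperfine", "perf", "strace"]))
```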
+
+**Dispatch template (T2):**
 
 ```
-[ ] Dynamic imports for routes and heavy components
-[ ] Tree shaking working (check for side effects in package.json)
-[ ] No barrel file re-exports pulling in entire modules
-[ ] Lodash: use lodash-es or individual imports (lodash/debounce)
-[ ] Moment.js replaced with date-fns or dayjs
-[ ] Images optimized (WebP/AVIF, responsive sizes, lazy loading)
-[ ] Fonts subsetted to used characters
-[ ] Gzip/Brotli compression enabled on server
-[ ] Source maps excluded from production bundle size
-[ ] CSS purged of unused styles (PurgeCSS, Tailwind JIT)
+You are handling a performance profiling task dispatched by the perf-ops orchestrator.
+
+## Diagnosis (from T1)
+- Symptom: {classified symptom from decision tree}
+- Language/Runtime: {detected language}
+- Environment: {production | development}
+- Installed tools: {list from tool detection}
+- System baseline: {CPU/memory/disk metrics}
+
+## Profiling Task
+{specific profiling request - e.g., "CPU profile the API server under load"}
+
+## Target
+- Process/file: {target application or endpoint}
+- Expected workload: {how to generate representative load if needed}
+
+## Instructions
+1. Run the appropriate profiler for this language and symptom
+2. Collect sufficient samples (minimum 30 seconds for CPU, multiple snapshots for memory)
+3. Interpret the results - identify the top 3-5 bottlenecks
+4. For each bottleneck: explain what it is, why it's slow, and suggest a fix
+5. Report findings in structured format with metrics
 ```
 
-## Database Performance Quick Reference
+**Execution mode:**
+
+| Scenario | Mode | Why |
+|----------|------|-----|
+| User waiting for results | `run_in_background=False` | They need findings before continuing |
+| User continuing other work | `run_in_background=True` | Don't block the main session |
+| Quick benchmark (hyperfine) | `run_in_background=False` | Fast enough to wait |
+| Load test (k6, artillery) | `run_in_background=True` | Takes minutes |
+
+### T3: Optimize - Preflight Required
+
+Dispatch to language expert with explicit instruction to produce a preflight report before any code changes.
 
-### EXPLAIN ANALYZE Interpretation
+**Dispatch template (T3 preflight):**
 
 ```
-Key metrics in EXPLAIN ANALYZE output:
-│
-├─ Seq Scan          → Full table scan (often bad for large tables)
-│  └─ Fix: Add index on filter columns
-├─ Index Scan        → Using index (good)
-├─ Bitmap Index Scan → Multiple index conditions combined (good)
-├─ Nested Loop       → OK for small inner table, bad for large joins
-│  └─ Fix: Add index on join column, consider Hash Join
-├─ Hash Join         → Good for large equi-joins
-├─ Sort              → Check if index can provide order
-│  └─ Fix: Add index matching ORDER BY
-├─ actual time       → First row..last row in milliseconds
-├─ rows              → Actual rows vs planned (estimate accuracy)
-└─ buffers           → shared hit (cache) vs read (disk I/O)
+You are handling a performance optimization dispatched by the perf-ops orchestrator.
+
+## Profiling Results (from T2)
+{bottleneck findings, metrics, flamegraph interpretation}
+
+## Optimization Request
+{specific optimization - e.g., "Fix the N+1 query in UserController.list"}
+
+IMPORTANT: Do NOT apply changes yet. Produce a Preflight Report:
+1. Exactly what code/config changes you will make
+2. Expected performance improvement (with reasoning)
+3. Risks (correctness, side effects, edge cases)
+4. How to verify the improvement (specific benchmark or test)
+5. How to revert if the optimization causes issues
 ```
 
-### N+1 Detection
+**After user confirms:** Re-dispatch with execute authority plus the before/after protocol.
+
+**Dispatch template (T3 execute + before/after):**
 
 ```
-Symptoms:
-- Many identical queries with different WHERE values
-- Response time scales linearly with result count
-- Query log shows repeated patterns
-
-Detection:
-- Django: django-debug-toolbar, nplusone
-- Rails: Bullet gem
-- SQLAlchemy: sqlalchemy.echo=True, look for repeated patterns
-- General: enable slow query log, count queries per request
-
-Fix:
-- Eager loading (JOIN, prefetch, include)
-- Batch queries (WHERE id IN (...))
-- DataLoader pattern (batch + cache per request)
+User confirmed the optimization. Proceed with execution.
+
+## Approved Changes
+{exact changes from preflight report}
+
+## Before/After Protocol
+1. Record the current benchmark baseline: {specific command from T2}
+2. Apply the approved changes
+3. Run the same benchmark again
+4. Report comparison:
+   - Metric: before value -> after value (% change)
+   - Include statistical confidence if tool supports it
+5. If regression detected: revert and report
 ```
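
Step 4 of the protocol reduces to comparing two benchmark exports. A minimal sketch for hyperfine's `--export-json` format (`results[0].mean`, in seconds); the file paths are placeholders:

```python
import json

def benchmark_delta(before_path, after_path):
    """Return 'before -> after (% change)' for the mean wall time of run 0."""
    def mean_of(path):
        with open(path) as f:
            return json.load(f)["results"][0]["mean"]
    before, after = mean_of(before_path), mean_of(after_path)
    change = (after - before) / before * 100
    return f"mean: {before:.3f}s -> {after:.3f}s ({change:+.1f}%)"
```

A negative delta means the optimization helped; if the delta comes back positive, the protocol says revert and report.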
 
-## Load Testing Quick Reference
+## Parallel Profiling
+
+When multiple independent symptoms are detected, or the user requests comprehensive profiling, dispatch parallel agents.
+
+**Parallelizable combinations:**
+
+| Agent 1 | Agent 2 | Why Independent |
+|---------|---------|-----------------|
+| CPU profiler | Memory profiler | Different tools, different data |
+| CPU profiler | Baseline benchmark | Low-overhead sampling vs. a separate timing run |
+| Backend profiler | Frontend bundle analysis | Different runtimes |
+| Service A profiler | Service B profiler | Different processes |
+
+**NOT parallelizable:**
+
+| Operation A | Operation B | Why Sequential |
+|-------------|-------------|----------------|
+| Profile | Interpret results | Dependency |
+| Before benchmark | After benchmark | Requires code change between |
+| Load test | CPU profile same process | Tool interference |
+
+**Dispatch pattern for parallel profiling:**
+
+```python
+# Example: CPU + memory profiling in parallel
+Agent(
+    subagent_type="python-expert",
+    model="sonnet",
+    run_in_background=True,
+    prompt="CPU profiling task: {cpu_prompt}"
+)
+Agent(
+    subagent_type="python-expert",
+    model="sonnet",
+    run_in_background=True,
+    prompt="Memory profiling task: {memory_prompt}"
+)
+# Both run simultaneously, consolidate findings when both complete
+```
 
-| Tool | Language | Strengths | Command |
-|------|----------|-----------|---------|
-| **k6** | Go (JS scripts) | Scripted scenarios, thresholds, cloud | `k6 run script.js` |
-| **artillery** | Node.js | YAML config, plugins, Playwright | `artillery run config.yml` |
-| **vegeta** | Go | CLI piping, constant rate | `echo "GET http://localhost" \| vegeta attack \| vegeta report` |
-| **wrk** | C | Lightweight, Lua scripts | `wrk -t4 -c100 -d30s http://localhost` |
-| **autocannon** | Node.js | Programmatic, pipelining | `autocannon -c 100 -d 30 http://localhost` |
-| **locust** | Python | Python classes, distributed | `locust -f locustfile.py` |
+## Fallback: When Expert Agent Is Unavailable
 
-### Load Test Types
+If the target language expert is not registered as a subagent type, fall back to `general-purpose` with profiling commands inlined.
 
-```
-Test Type Selection:
-│
-├─ Smoke Test
-│  └─ Minimal load (1-2 VUs) to verify system works
-│     Duration: 1-5 minutes
-│
-├─ Load Test
-│  └─ Expected production load
-│     Duration: 15-60 minutes
-│     Goal: Verify SLOs are met under normal conditions
-│
-├─ Stress Test
-│  └─ Beyond expected load, find breaking point
-│     Ramp up until errors or unacceptable latency
-│     Goal: Know the system's limits
-│
-├─ Spike Test
-│  └─ Sudden burst of traffic
-│     Instant jump to high load, then drop
-│     Goal: Test auto-scaling, queue behavior
-│
-├─ Soak Test (Endurance)
-│  └─ Moderate load for extended period (hours)
-│     Goal: Find memory leaks, resource exhaustion, GC issues
-│
-└─ Breakpoint Test
-   └─ Continuously ramp up until failure
-      Goal: Find maximum capacity
+```python
+Agent(
+    subagent_type="general-purpose",
+    model="sonnet",
+    run_in_background=True,
+    prompt="""You are acting as a performance profiling agent for {language}.
+
+Use these specific tools and commands:
+{tool commands from diagnosis-quickref.md for the detected language}
+
+{original dispatch prompt}
+"""
+)
 ```
 
-## Benchmarking Quick Reference
+For simple benchmarks (hyperfine, single command timing), skip agent dispatch entirely and run inline via Bash.
 
-| Tool | Domain | Command | Notes |
-|------|--------|---------|-------|
-| **hyperfine** | CLI commands | `hyperfine 'cmd1' 'cmd2'` | Warm-up, statistical analysis, export |
-| **criterion** | Rust | `cargo bench` (with criterion dep) | Statistical, HTML reports, regression detection |
-| **testing.B** | Go | `go test -bench=. -benchmem` | Built-in, memory allocs, sub-benchmarks |
-| **pytest-benchmark** | Python | `pytest --benchmark-only` | Statistical, histograms, comparison |
-| **vitest bench** | JS/TS | `vitest bench` | Built-in to Vitest, Tinybench engine |
-| **Benchmark.js** | JS | Programmatic setup | Statistical analysis, ops/sec |
+## Decision Logic
 
-### Benchmarking Best Practices
+When a performance-related request arrives:
 
 ```
-[ ] Warm up before measuring (JIT compilation, cache population)
-[ ] Run multiple iterations (minimum 10, prefer 100+)
-[ ] Report statistical summary (mean, median, stddev, min, max)
-[ ] Control for system noise (close other apps, pin CPU frequency)
-[ ] Compare against baseline (previous version, alternative impl)
-[ ] Measure what matters (end-to-end, not micro-operations in isolation)
-[ ] Profile before benchmarking (know WHAT to benchmark)
-[ ] Document environment (hardware, OS, runtime version, flags)
+1. Classify the request:
+   - Symptom description? -> Start at T1 (diagnose)
+   - "Profile my app"? -> T1 (detect language + tools) then T2 (profile)
+   - "Benchmark X vs Y"? -> T2 directly (hyperfine or language benchmark)
+   - "Optimize this"? -> T2 (profile first) then T3 (optimize)
+   - "Why is X slow"? -> T1 (diagnose) then T2 (targeted profile)
+
+2. T1 Diagnose (always runs first for new issues):
+   - Detect language/runtime
+   - Check installed profiling tools
+   - Classify symptom using decision tree (see diagnosis-quickref.md)
+   - Determine production vs development
+   - Present findings + recommend next step
+
+3. T2 Profile (when diagnosis points to a specific bottleneck):
+   - Route to appropriate language expert
+   - Decide foreground vs background
+   - Consider parallel dispatch if multiple symptoms
+   - Consolidate findings from all agents
+
+4. T3 Optimize (only when user wants changes applied):
+   - Always produce preflight report first
+   - Wait for explicit user confirmation
+   - Execute with before/after comparison
+   - Report delta with statistical confidence
 ```
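
Step 1 can be approximated with a keyword heuristic. The patterns below are illustrative only; a real orchestrator reasons over the full request rather than matching substrings:

```python
def classify(request):
    """Map a request to the tier sequence prescribed by the decision logic."""
    r = request.lower()
    if "benchmark" in r:
        return ["T2"]            # straight to measurement
    if "optimize" in r or "fix" in r:
        return ["T2", "T3"]      # profile first, then optimize with confirmation
    if "profile" in r:
        return ["T1", "T2"]      # detect language/tools, then dispatch
    return ["T1"]                # symptom description: diagnose first
```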
 
-## Optimization Patterns Quick Reference
-
-| Pattern | When to Use | Example |
-|---------|-------------|---------|
-| **Caching** | Repeated expensive computations or I/O | Redis, in-memory LRU, CDN, memoization |
-| **Lazy loading** | Resources not needed immediately | Dynamic imports, virtual scrolling, pagination |
-| **Connection pooling** | Frequent DB/HTTP connections | PgBouncer, HikariCP, urllib3 pool |
-| **Batch operations** | Many small operations on same resource | Bulk INSERT, DataLoader, batch API calls |
-| **Pagination** | Large result sets | Cursor-based (not offset) for large datasets |
-| **Compression** | Network transfer of text data | Brotli > gzip for static, gzip for dynamic |
-| **Streaming** | Processing large files or datasets | Line-by-line, chunk processing, async iterators |
-| **Precomputation** | Predictable expensive calculations | Materialized views, build-time generation |
-| **Denormalization** | Read-heavy with expensive joins | Duplicate data for read performance |
-| **Index optimization** | Slow queries on large tables | Composite indexes matching query patterns |
-
-## Common Gotchas
-
-| Gotcha | Why It Hurts | Fix |
-|--------|-------------|-----|
-| Premature optimization | Wastes time on non-bottlenecks, adds complexity | Profile first, optimize the measured hot path |
-| Micro-benchmarks misleading | JIT, caching, branch prediction differ from real workload | Benchmark realistic workloads, validate with production metrics |
-| Profiling overhead | Profiler itself skews results (observer effect) | Use sampling profilers (py-spy, pprof) not tracing profilers |
-| Cache invalidation | Stale data served, inconsistent state across nodes | TTL + event-based invalidation, cache-aside pattern |
-| Optimizing cold path | Spending effort on rarely-executed code | Focus on hot paths identified by profiling |
-| Ignoring tail latency | p50 looks great but p99 is 10x worse | Measure and optimize p95/p99, not just averages |
-| N+1 queries hidden by ORM | Each page load fires hundreds of queries | Enable query logging, use eager loading |
-| Compression on small payloads | Overhead exceeds savings for payloads <150 bytes | Only compress above minimum size threshold |
-| Connection pool too large | Each connection uses memory, causes lock contention | Size pool to CPU cores x 2-3, not hundreds |
-| Missing async in I/O path | One blocking call serializes all concurrent requests | Audit entire request path for blocking calls |
-| Benchmarking debug builds | Debug builds 10-100x slower, misleading results | Always benchmark release/optimized builds |
-| Over-indexing database | Write performance degrades, storage bloats | Only index columns in WHERE, JOIN, ORDER BY clauses |
+## Quick Reference
+
+| Task | Tier | Execution |
+|------|------|-----------|
+| Detect tools | T1 | Inline |
+| Check system metrics | T1 | Inline |
+| Classify symptom | T1 | Inline |
+| Identify language | T1 | Inline |
+| Run CPU profiler | T2 | Agent (bg) |
+| Run memory profiler | T2 | Agent (bg) |
+| Run load test | T2 | Agent (bg) |
+| Run benchmark | T2 | Agent (bg or inline for hyperfine) |
+| Bundle analysis | T2 | Agent (bg) |
+| EXPLAIN ANALYZE | T2 | Agent (fg) |
+| Before/after comparison | T2 | Agent (fg) |
+| Apply optimization | T3 | Agent + confirm |
+| Add index | T3 | Agent + confirm |
+| Refactor hot path | T3 | Agent + confirm |
 
 ## Reference Files
 
-| File | Contents | Lines |
-|------|----------|-------|
-| `references/cpu-memory-profiling.md` | Flamegraph interpretation, Node.js/Python/Go/Rust/Browser profiling, memory leak detection | ~700 |
-| `references/load-testing.md` | k6, Artillery, vegeta, wrk, Locust, load testing methodology, CI integration | ~600 |
-| `references/optimization-patterns.md` | Caching, database, frontend, API, concurrency, memory, algorithm optimization | ~550 |
+| File | Contents |
+|------|----------|
+| `references/diagnosis-quickref.md` | Decision tree, tool selection matrix, quick references for all profiling domains, common gotchas |
+| `references/cpu-memory-profiling.md` | Deep flamegraph interpretation, language-specific CPU/memory profiling guides |
+| `references/load-testing.md` | k6, Artillery, vegeta, wrk, Locust methodology and CI integration |
+| `references/optimization-patterns.md` | Caching, database, frontend, API, concurrency, memory optimization strategies |
+| `references/ci-integration.md` | Performance budgets, regression detection, CI pipeline patterns, benchmark baselines |
+
+Load reference files when deeper tool-specific guidance is needed beyond what the dispatch prompt provides.
 
 ## See Also
 
 | Skill | When to Combine |
 |-------|----------------|
-| `debug-ops` | Debugging performance regressions, root cause analysis for slowdowns |
-| `monitoring-ops` | Production metrics, alerting on latency/throughput, dashboards |
+| `debug-ops` | Root cause analysis for performance regressions |
+| `monitoring-ops` | Production metrics, alerting on latency/throughput |
 | `testing-ops` | Performance regression tests in CI, benchmark suites |
 | `code-stats` | Identify complex code that may be performance-sensitive |
 | `postgres-ops` | PostgreSQL-specific query optimization, indexing, EXPLAIN |

+ 316 - 0
skills/perf-ops/references/ci-integration.md

@@ -0,0 +1,316 @@
+# Performance CI Integration
+
+Patterns for enforcing performance budgets, detecting regressions, and integrating profiling into CI/CD pipelines.
+
+## Performance Budgets
+
+### Bundle Size Budgets (Frontend)
+
+**size-limit** - Enforce JavaScript bundle budgets in CI:
+
+```json
+// package.json
+{
+  "size-limit": [
+    { "path": "dist/index.js", "limit": "50 kB" },
+    { "path": "dist/vendor.js", "limit": "150 kB" },
+    { "path": "dist/**/*.css", "limit": "30 kB" }
+  ]
+}
+```
+
+```yaml
+# GitHub Actions
+- name: Check bundle size
+  run: npx size-limit
+  # Fails if any bundle exceeds limit
+```
+
+**bundlewatch** - Track bundle sizes across PRs:
+
+```yaml
+- name: Bundle size check
+  uses: jackyef/bundlewatch-gh-action@master
+  with:
+    bundlewatch-config: .bundlewatch.config.js
+    bundlewatch-github-token: ${{ secrets.GITHUB_TOKEN }}
+```
+
+### Lighthouse CI (Web Performance)
+
+```json
+// .lighthouserc.json
+{
+  "ci": {
+    "collect": {
+      "url": ["http://localhost:3000/", "http://localhost:3000/dashboard"],
+      "numberOfRuns": 3
+    },
+    "assert": {
+      "assertions": {
+        "categories:performance": ["error", { "minScore": 0.9 }],
+        "first-contentful-paint": ["warn", { "maxNumericValue": 2000 }],
+        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }],
+        "cumulative-layout-shift": ["error", { "maxNumericValue": 0.1 }],
+        "total-blocking-time": ["error", { "maxNumericValue": 300 }]
+      }
+    },
+    "upload": {
+      "target": "temporary-public-storage"
+    }
+  }
+}
+```
+
+```yaml
+# GitHub Actions
+- name: Lighthouse CI
+  run: |
+    npm install -g @lhci/cli
+    lhci autorun
+```
+
+### API Response Time Budgets
+
+**k6 thresholds** - Fail CI if response times exceed SLOs:
+
+```javascript
+// perf-test.js
+export const options = {
+  thresholds: {
+    http_req_duration: ['p(95)<500', 'p(99)<1000'],
+    http_req_failed: ['rate<0.01'],
+    iterations: ['rate>100'],
+  },
+};
+```
+
+```yaml
+- name: API performance test
+  run: k6 run --out json=results.json perf-test.js
+- name: Upload results
+  if: always()
+  uses: actions/upload-artifact@v4
+  with:
+    name: k6-results
+    path: results.json
+```
+
+## Regression Detection
+
+### Benchmark Baselines
+
+**Store benchmarks in git** for cross-commit comparison:
+
+```yaml
+# Go benchmarks with benchstat
+- name: Run benchmarks
+  run: go test -bench=. -benchmem -count=5 ./... > new.txt
+
+- name: Compare with baseline
+  run: |
+    git stash
+    go test -bench=. -benchmem -count=5 ./... > old.txt
+    git stash pop
+    benchstat old.txt new.txt
+```
+
+**Python with pytest-benchmark:**
+
+```yaml
+- name: Run benchmarks
+  run: pytest --benchmark-only --benchmark-json=benchmark.json
+
+- name: Compare with baseline
+  run: pytest --benchmark-only --benchmark-compare=0001
+  # 0001 = ID of a run saved earlier with --benchmark-save=baseline
+```
+
+**Rust with criterion:**
+
+```yaml
+- name: Benchmark
+  run: cargo bench -- --save-baseline pr-${{ github.event.number }}
+
+- name: Compare
+  run: cargo bench -- --baseline main --save-baseline pr-compare
+  # criterion outputs comparison automatically
+```
+
+**hyperfine for CLI tools:**
+
+```yaml
+- name: Benchmark CLI
+  run: |
+    hyperfine --export-json bench.json \
+      --warmup 3 \
+      './target/release/mytool process data.csv'
+```
+
+### Statistical Comparison
+
+When comparing benchmarks, avoid naive percentage comparison. Use statistical tests:
+
+```
+Good: "p95 latency increased from 45ms to 52ms (benchstat: p=0.003, statistically significant)"
+Bad:  "latency increased 15%" (no sample size, no confidence interval)
+```
+
+**benchstat** (Go) computes significance automatically:
+
+```
+name        old time/op  new time/op  delta
+Parse-8     45.2ms +- 2%  52.1ms +- 3%  +15.27% (p=0.003 n=5+5)
+```
+
+**pytest-benchmark** comparison output:
+
+```
+Name                 Min      Max     Mean    StdDev   Rounds
+test_parse        42.1ms   48.3ms   45.2ms    1.8ms       10
+test_parse (base) 38.9ms   42.1ms   40.5ms    1.1ms       10
+```
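The interval-overlap idea can be sketched with the stdlib alone (hypothetical latency samples; benchstat's actual test is more rigorous): report a delta only when the two runs' approximate 95% confidence intervals do not overlap.

```python
import math
import statistics

def summarize(samples):
    """Mean and approximate 95% confidence half-width (normal approximation)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width

def compare(old, new):
    """Percentage delta plus a crude significance check via interval overlap."""
    old_mean, old_hw = summarize(old)
    new_mean, new_hw = summarize(new)
    delta_pct = (new_mean - old_mean) / old_mean * 100
    significant = abs(new_mean - old_mean) > old_hw + new_hw
    return delta_pct, significant

# Hypothetical latency samples in ms, n=5+5 as in the benchstat output above
old_run = [44.1, 45.9, 44.8, 46.2, 45.0]  # mean 45.2
new_run = [51.3, 52.8, 51.9, 52.4, 52.1]  # mean 52.1
delta, significant = compare(old_run, new_run)
print(f"delta {delta:+.2f}%, significant: {significant}")
```

With overlapping intervals the delta would be reported as noise, which is exactly the failure mode of naive percentage comparison.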
+
+### Alerting on Regressions
+
+**GitHub Actions comment on PR:**
+
+```yaml
+- name: Comment benchmark results
+  uses: benchmark-action/github-action-benchmark@v1
+  with:
+    tool: 'go'
+    output-file-path: bench.txt
+    github-token: ${{ secrets.GITHUB_TOKEN }}
+    comment-on-alert: true
+    alert-threshold: '150%'  # Alert if 50%+ regression
+    fail-on-alert: true
+```
+
+## CI Pipeline Patterns
+
+### Pre-merge Performance Gate
+
+```yaml
+name: Performance Gate
+on: pull_request
+
+jobs:
+  bundle-size:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - run: npm ci
+      - run: npm run build
+      - run: npx size-limit
+
+  api-perf:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - run: docker compose up -d
+      - run: sleep 10  # Wait for services
+      - run: k6 run --out json=results.json tests/perf/smoke.js
+      - uses: actions/upload-artifact@v4
+        with:
+          name: perf-results
+          path: results.json
+
+  benchmark:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0  # Need history for baseline
+      - run: |
+          # Run current benchmarks
+          go test -bench=. -benchmem -count=5 ./... > new.txt
+          # Run baseline benchmarks
+          git checkout main
+          go test -bench=. -benchmem -count=5 ./... > old.txt
+          git checkout -
+          # Compare
+          benchstat old.txt new.txt | tee comparison.txt
+```
+
+### Scheduled Soak Test
+
+```yaml
+name: Nightly Soak Test
+on:
+  schedule:
+    - cron: '0 2 * * 1-5'  # 2 AM weekdays
+
+jobs:
+  soak:
+    runs-on: ubuntu-latest
+    timeout-minutes: 120
+    steps:
+      - uses: actions/checkout@v4
+      - run: docker compose up -d
+      - name: Run soak test (1 hour)
+        run: k6 run --duration 1h tests/perf/soak.js
+      - name: Check for memory leaks
+        run: |
+          # Compare start vs end memory usage
+          # NOTE: a single post-run sample cannot show growth; capture a
+          # baseline with the same command before the soak step and diff them.
+          docker stats --no-stream --format "{{.MemUsage}}" app
+```
+
+### Performance Dashboard Integration
+
+```yaml
+# Push metrics to Grafana Cloud / InfluxDB
+- name: Push to dashboard
+  run: |
+    k6 run \
+      --out influxdb=http://influxdb:8086/k6 \
+      tests/perf/load.js
+```
+
+## Tool-Specific CI Patterns
+
+### k6 in CI
+
+```yaml
+- name: Install k6
+  run: |
+    sudo gpg -k
+    sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
+      --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D68
+    echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
+      | sudo tee /etc/apt/sources.list.d/k6.list
+    sudo apt-get update && sudo apt-get install k6
+- name: Run load test
+  run: k6 run tests/perf/load.js
+```
+
+### Artillery in CI
+
+```yaml
+- name: Run Artillery
+  run: npx artillery run tests/perf/config.yml --output report.json
+- name: Generate report
+  run: npx artillery report report.json --output report.html
+```
+
+### Lighthouse in CI
+
+```yaml
+- name: Audit with Lighthouse
+  run: |
+    npm install -g @lhci/cli
+    lhci autorun --config=.lighthouserc.json
+```
+
+## Budget Sizing Guidelines
+
+| Metric | Good | Acceptable | Poor |
+|--------|------|------------|------|
+| JS bundle (gzipped) | <50 kB | <150 kB | >300 kB |
+| CSS (gzipped) | <20 kB | <50 kB | >100 kB |
+| LCP | <1.5s | <2.5s | >4.0s |
+| FCP | <1.0s | <1.8s | >3.0s |
+| CLS | <0.05 | <0.1 | >0.25 |
+| TBT | <150ms | <300ms | >600ms |
+| API p95 | <200ms | <500ms | >1000ms |
+| API p99 | <500ms | <1000ms | >3000ms |
+| API error rate | <0.1% | <1% | >5% |

+ 300 - 0
skills/perf-ops/references/diagnosis-quickref.md

@@ -0,0 +1,300 @@
+# Performance Diagnosis Quick Reference
+
+Symptom classification, tool selection, and common patterns for rapid performance triage.
+
+## Performance Issue Decision Tree
+
+```
+What symptom are you observing?
+|
++- High CPU usage
+|  +- Sustained 100% on one core
+|  |  +- CPU-bound: hot loop, regex backtracking, tight computation
+|  |     -> Profile with flamegraph (py-spy, pprof, clinic flame, samply)
+|  +- Sustained 100% across all cores
+|  |  +- Parallelism gone wrong: fork bomb, unbounded workers, spin locks
+|  |     -> Check process count, thread count, lock contention
+|  +- Periodic spikes
+|     +- GC pressure, cron job, batch processing, cache stampede
+|        -> Correlate with GC logs, scheduled tasks, traffic patterns
+|
++- High memory usage
+|  +- Growing over time (never decreasing)
+|  |  +- Memory leak: unclosed resources, growing caches, event listener accumulation
+|  |     -> Heap snapshots over time, compare retained objects
+|  +- Sudden large allocation
+|  |  +- Unbounded buffer, loading full dataset into memory, large file read
+|  |     -> Check allocation sizes, switch to streaming
+|  +- High but stable
+|     +- May be normal: in-memory cache, preloaded data, memory-mapped files
+|        -> Verify with expected working set size
+|
++- Slow responses / high latency
+|  +- All endpoints slow
+|  |  +- Systemic: resource exhaustion, GC pauses, DNS issues, TLS overhead
+|  |     -> Check resource utilization, GC metrics, network path
+|  +- Specific endpoint slow
+|  |  +- Query-specific: N+1 queries, missing index, unoptimized algorithm
+|  |     -> EXPLAIN ANALYZE, query logging, endpoint profiling
+|  +- Intermittently slow (p99 spikes)
+|     +- Contention: lock wait, connection pool exhaustion, noisy neighbor
+|        -> Check lock metrics, pool sizes, correlated traffic
+|
++- Low throughput
+|  +- CPU not saturated
+|  |  +- I/O bound: disk wait, network latency, blocking calls in async code
+|  |     -> Check iowait, network RTT, ensure async throughout
+|  +- CPU saturated
+|  |  +- Compute bound: need algorithmic improvement or horizontal scaling
+|  |     -> Profile hot paths, optimize or scale out
+|  +- Queues backing up
+|     +- Consumer too slow: batch size, consumer count, downstream bottleneck
+|        -> Increase consumers, optimize processing, check downstream
+|
++- Large bundle size (frontend)
+|  +- Main bundle too large
+|  |  +- Missing code splitting, tree shaking not working, barrel file imports
+|  |     -> Bundle analyzer, check import patterns, add dynamic imports
+|  +- Duplicate dependencies
+|  |  +- Multiple versions of same library bundled
+|  |     -> Dedupe, check peer dependencies, use resolutions
+|  +- Large assets
+|     +- Unoptimized images, embedded fonts, inline data URIs
+|        -> Image optimization, font subsetting, external assets
+|
++- Slow database queries
+   +- Single slow query
+   |  +- Missing index, suboptimal join order, full table scan
+   |     -> EXPLAIN ANALYZE, add index, rewrite query
+   +- Many small queries (N+1)
+   |  +- ORM lazy loading, loop with individual queries
+   |     -> Eager loading, batch queries, dataloader pattern
+   +- Lock contention
+      +- Long transactions, row-level locks, table locks
+         -> Shorten transactions, check isolation level, advisory locks
+```
+
+## Profiling Tool Selection Matrix
+
+| Problem | Node.js | Python | Go | Rust | Browser |
+|---------|---------|--------|----|------|---------|
+| **CPU hotspots** | clinic flame, 0x | py-spy, scalene | pprof (CPU) | cargo-flamegraph, samply | DevTools Performance |
+| **Memory leaks** | clinic doctor, heap snapshot | memray, tracemalloc | pprof (heap) | DHAT, heaptrack | DevTools Memory |
+| **Memory allocation** | --heap-prof | memray, scalene | pprof (allocs) | DHAT | DevTools Allocation |
+| **Async bottlenecks** | clinic bubbleprof | asyncio debug mode | pprof (goroutine) | tokio-console | DevTools Performance |
+| **I/O profiling** | clinic doctor | strace, py-spy | pprof (block) | strace, perf | Network tab |
+| **GC pressure** | --trace-gc | gc.set_debug | GODEBUG=gctrace=1 | N/A (no GC) | Performance timeline |
+| **Lock contention** | N/A | py-spy (threading) | pprof (mutex) | parking_lot stats | N/A |
+| **Startup time** | --cpu-prof | python -X importtime | go build -v | cargo build --timings | Lighthouse |
+
+## CPU Profiling Quick Reference
+
+### Flamegraph Basics
+
+```
+Reading a flamegraph:
+- X-axis: proportion of total samples (wider = more time)
+- Y-axis: call stack depth (bottom = entry point, top = leaf)
+- Color: random (not meaningful) in most tools
+- Look for: wide plateaus at the top (hot functions)
+- Ignore: narrow towers (called often but fast)
+
+Key actions:
+1. Find the widest bars at the TOP of the graph
+2. Trace down to see what calls them
+3. Focus optimization on the widest top-level functions
+4. Re-profile after each change to verify improvement
+```
+
+### Tool Quick Start
+
+| Tool | Language | Command | Output |
+|------|----------|---------|--------|
+| **py-spy** | Python | `py-spy record -o profile.svg -- python app.py` | SVG flamegraph |
+| **py-spy top** | Python | `py-spy top --pid PID` | Live top-like view |
+| **pprof** | Go | `go tool pprof -http :8080 http://localhost:6060/debug/pprof/profile?seconds=30` | Interactive web UI |
+| **clinic flame** | Node.js | `clinic flame -- node app.js` | HTML flamegraph |
+| **0x** | Node.js | `0x app.js` | SVG flamegraph |
+| **cargo-flamegraph** | Rust | `cargo flamegraph --bin myapp` | SVG flamegraph |
+| **samply** | Rust/C/C++ | `samply record ./target/release/myapp` | Firefox Profiler UI |
+| **perf** | Linux (any) | `perf record -g ./myapp && perf script \| inferno-flamegraph > out.svg` | SVG flamegraph |
+
+## Memory Profiling Quick Reference
+
+| Tool | Language | Command | What It Shows |
+|------|----------|---------|---------------|
+| **memray** | Python | `memray run script.py && memray flamegraph output.bin` | Allocation flamegraph, leak detection |
+| **tracemalloc** | Python | `tracemalloc.start(); snapshot = tracemalloc.take_snapshot()` | Top allocators, allocation traceback |
+| **scalene** | Python | `scalene script.py` | CPU + memory + GPU in one profiler |
+| **heaptrack** | C/C++/Rust | `heaptrack ./myapp && heaptrack_gui heaptrack.myapp.*.zst` | Allocation timeline, flamegraph, leak candidates |
+| **DHAT** | Rust | `valgrind --tool=dhat ./target/debug/myapp` | Allocation sites, short-lived allocs |
+| **pprof (heap)** | Go | `go tool pprof http://localhost:6060/debug/pprof/heap` | Live heap, allocation counts |
+| **Chrome heap** | JS/Browser | DevTools - Memory - Take heap snapshot | Object retention, detached DOM |
+| **clinic doctor** | Node.js | `clinic doctor -- node app.js` | Memory + CPU + event loop diagnosis |
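For a stdlib-only starting point in Python, `tracemalloc` can diff two heap snapshots and surface the allocation sites that grew; a minimal sketch with a deliberately leaky list:

```python
import tracemalloc

leaked = []  # simulated leak: retained for the life of the process

tracemalloc.start()
before = tracemalloc.take_snapshot()

for _ in range(10_000):
    leaked.append(bytearray(100))  # roughly 1 MB retained across the loop

after = tracemalloc.take_snapshot()

# Largest positive size deltas point at the leaking allocation site
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

In a real service, take snapshots minutes apart under steady traffic; sites whose `size_diff` keeps growing between successive diffs are leak candidates.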
+
+## Bundle Analysis Quick Reference
+
+| Tool | Bundler | Command | Output |
+|------|---------|---------|--------|
+| **webpack-bundle-analyzer** | Webpack | `npx webpack-bundle-analyzer stats.json` | Interactive treemap |
+| **source-map-explorer** | Any | `npx source-map-explorer bundle.js` | Treemap from source maps |
+| **rollup-plugin-visualizer** | Rollup/Vite | Add plugin, build | HTML treemap |
+| **vite-bundle-visualizer** | Vite | `npx vite-bundle-visualizer` | Treemap visualization |
+| **bundlephobia** | npm | `npx bundlephobia <package>` | Package size analysis |
+| **size-limit** | Any | Configure in package.json, run in CI | Size budget enforcement |
+
+### Bundle Size Reduction Checklist
+
+```
+[ ] Dynamic imports for routes and heavy components
+[ ] Tree shaking working (check for side effects in package.json)
+[ ] No barrel file re-exports pulling in entire modules
+[ ] Lodash: use lodash-es or individual imports (lodash/debounce)
+[ ] Moment.js replaced with date-fns or dayjs
+[ ] Images optimized (WebP/AVIF, responsive sizes, lazy loading)
+[ ] Fonts subsetted to used characters
+[ ] Gzip/Brotli compression enabled on server
+[ ] Source maps excluded from production bundle size
+[ ] CSS purged of unused styles (PurgeCSS, Tailwind JIT)
+```
+
+## Database Performance Quick Reference
+
+### EXPLAIN ANALYZE Interpretation
+
+```
+Key metrics in EXPLAIN ANALYZE output:
+|
++- Seq Scan          -> Full table scan (often bad for large tables)
+|  +- Fix: Add index on filter columns
++- Index Scan        -> Using index (good)
++- Bitmap Index Scan -> Multiple index conditions combined (good)
++- Nested Loop       -> OK for small inner table, bad for large joins
+|  +- Fix: Add index on join column, consider Hash Join
++- Hash Join         -> Good for large equi-joins
++- Sort              -> Check if index can provide order
+|  +- Fix: Add index matching ORDER BY
++- actual time       -> First row..last row in milliseconds
++- rows              -> Actual rows vs planned (estimate accuracy)
++- buffers           -> shared hit (cache) vs read (disk I/O)
+```
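The scan-vs-index distinction can be reproduced locally with stdlib `sqlite3`, whose `EXPLAIN QUERY PLAN` is a rough analogue of Postgres's `EXPLAIN` (table and column names here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
con.executemany("INSERT INTO users (email) VALUES (?)",
                [(f"u{i}@example.com",) for i in range(1000)])

def plan(sql):
    # Rows are (id, parent, notused, detail); detail holds the SCAN/SEARCH info
    return " | ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'u42@example.com'"
print(plan(query))  # before index: full table scan (SCAN ...)
con.execute("CREATE INDEX idx_users_email ON users (email)")
print(plan(query))  # after index: SEARCH ... USING INDEX idx_users_email
```

The same before/after check belongs in any "add an index" fix: confirm the planner actually uses it.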
+
+### N+1 Detection
+
+```
+Symptoms:
+- Many identical queries with different WHERE values
+- Response time scales linearly with result count
+- Query log shows repeated patterns
+
+Detection:
+- Django: django-debug-toolbar, nplusone
+- Rails: Bullet gem
+- SQLAlchemy: create_engine(..., echo=True), look for repeated patterns
+- General: enable slow query log, count queries per request
+
+Fix:
+- Eager loading (JOIN, prefetch, include)
+- Batch queries (WHERE id IN (...))
+- DataLoader pattern (batch + cache per request)
+```
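The batch-query fix is easiest to see framework-free; a sketch with a hypothetical in-memory store where every lookup function call stands in for one SQL query:

```python
USERS = {1: "ada", 2: "grace", 3: "linus"}
QUERY_COUNT = 0

def fetch_user(user_id):
    """N+1 style: one query per ID."""
    global QUERY_COUNT
    QUERY_COUNT += 1
    return USERS[user_id]

def fetch_users(user_ids):
    """Batched: one query for all IDs (WHERE id IN (...))."""
    global QUERY_COUNT
    QUERY_COUNT += 1
    return {uid: USERS[uid] for uid in user_ids}

ids = [1, 2, 3]

QUERY_COUNT = 0
names = [fetch_user(i) for i in ids]  # N+1: one query per row
n_plus_one_queries = QUERY_COUNT

QUERY_COUNT = 0
by_id = fetch_users(ids)              # batched: one query total
batched_queries = QUERY_COUNT

print(n_plus_one_queries, batched_queries)  # 3 1
```

The DataLoader pattern is this same batching, plus per-request caching and automatic collection of the IDs.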
+
+## Load Testing Quick Reference
+
+| Tool | Language | Strengths | Command |
+|------|----------|-----------|---------|
+| **k6** | Go (JS scripts) | Scripted scenarios, thresholds, cloud | `k6 run script.js` |
+| **artillery** | Node.js | YAML config, plugins, Playwright | `artillery run config.yml` |
+| **vegeta** | Go | CLI piping, constant rate | `echo "GET http://localhost" \| vegeta attack \| vegeta report` |
+| **wrk** | C | Lightweight, Lua scripts | `wrk -t4 -c100 -d30s http://localhost` |
+| **autocannon** | Node.js | Programmatic, pipelining | `autocannon -c 100 -d 30 http://localhost` |
+| **locust** | Python | Python classes, distributed | `locust -f locustfile.py` |
+
+### Load Test Types
+
+```
+Test Type Selection:
+|
++- Smoke Test
+|  +- Minimal load (1-2 VUs) to verify system works
+|     Duration: 1-5 minutes
+|
++- Load Test
+|  +- Expected production load
+|     Duration: 15-60 minutes
+|     Goal: Verify SLOs are met under normal conditions
+|
++- Stress Test
+|  +- Beyond expected load, find breaking point
+|     Ramp up until errors or unacceptable latency
+|     Goal: Know the system's limits
+|
++- Spike Test
+|  +- Sudden burst of traffic
+|     Instant jump to high load, then drop
+|     Goal: Test auto-scaling, queue behavior
+|
++- Soak Test (Endurance)
+|  +- Moderate load for extended period (hours)
+|     Goal: Find memory leaks, resource exhaustion, GC issues
+|
++- Breakpoint Test
+   +- Continuously ramp up until failure
+      Goal: Find maximum capacity
+```
+
+## Benchmarking Quick Reference
+
+| Tool | Domain | Command | Notes |
+|------|--------|---------|-------|
+| **hyperfine** | CLI commands | `hyperfine 'cmd1' 'cmd2'` | Warm-up, statistical analysis, export |
+| **criterion** | Rust | `cargo bench` (with criterion dep) | Statistical, HTML reports, regression detection |
+| **testing.B** | Go | `go test -bench=. -benchmem` | Built-in, memory allocs, sub-benchmarks |
+| **pytest-benchmark** | Python | `pytest --benchmark-only` | Statistical, histograms, comparison |
+| **vitest bench** | JS/TS | `vitest bench` | Built-in to Vitest, Tinybench engine |
+| **Benchmark.js** | JS | Programmatic setup | Statistical analysis, ops/sec |
+
+### Benchmarking Best Practices
+
+```
+[ ] Warm up before measuring (JIT compilation, cache population)
+[ ] Run multiple iterations (minimum 10, prefer 100+)
+[ ] Report statistical summary (mean, median, stddev, min, max)
+[ ] Control for system noise (close other apps, pin CPU frequency)
+[ ] Compare against baseline (previous version, alternative impl)
+[ ] Measure what matters (end-to-end, not micro-operations in isolation)
+[ ] Profile before benchmarking (know WHAT to benchmark)
+[ ] Document environment (hardware, OS, runtime version, flags)
+```
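The checklist condenses into a minimal stdlib harness (warm-up, repeated runs, statistical summary; the run counts are illustrative):

```python
import statistics
import time

def bench(fn, *, warmup=3, runs=30):
    """Time fn over several runs after a warm-up; return summary stats in seconds."""
    for _ in range(warmup):
        fn()  # populate caches, trigger any JIT/lazy setup
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
        "min": min(samples),
        "max": max(samples),
    }

stats = bench(lambda: sum(range(10_000)))
print({k: f"{v * 1e6:.1f}us" for k, v in stats.items()})
```

For anything serious, prefer a purpose-built tool from the table above; they also handle outlier rejection and calibration of iteration counts.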
+
+## Optimization Patterns Quick Reference
+
+| Pattern | When to Use | Example |
+|---------|-------------|---------|
+| **Caching** | Repeated expensive computations or I/O | Redis, in-memory LRU, CDN, memoization |
+| **Lazy loading** | Resources not needed immediately | Dynamic imports, virtual scrolling, pagination |
+| **Connection pooling** | Frequent DB/HTTP connections | PgBouncer, HikariCP, urllib3 pool |
+| **Batch operations** | Many small operations on same resource | Bulk INSERT, DataLoader, batch API calls |
+| **Pagination** | Large result sets | Cursor-based (not offset) for large datasets |
+| **Compression** | Network transfer of text data | Brotli > gzip for static, gzip for dynamic |
+| **Streaming** | Processing large files or datasets | Line-by-line, chunk processing, async iterators |
+| **Precomputation** | Predictable expensive calculations | Materialized views, build-time generation |
+| **Denormalization** | Read-heavy with expensive joins | Duplicate data for read performance |
+| **Index optimization** | Slow queries on large tables | Composite indexes matching query patterns |
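Of these, caching is the quickest to demonstrate; a memoization sketch using `functools.lru_cache`, where the sum-of-squares body is a stand-in for expensive work:

```python
from functools import lru_cache

CALLS = 0  # counts executions of the underlying "expensive" function

@lru_cache(maxsize=128)
def expensive(n):
    global CALLS
    CALLS += 1
    return sum(i * i for i in range(n))  # stand-in for slow computation or I/O

expensive(1000)
expensive(1000)  # served from cache, no recompute
expensive(2000)

print(CALLS, expensive.cache_info().hits)  # 2 underlying calls, 1 cache hit
```

The same trade-off as any cache applies: `maxsize` bounds memory, and arguments must be hashable and fully determine the result.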
+
+## Common Gotchas
+
+| Gotcha | Why It Hurts | Fix |
+|--------|-------------|-----|
+| Premature optimization | Wastes time on non-bottlenecks, adds complexity | Profile first, optimize the measured hot path |
+| Micro-benchmarks misleading | JIT, caching, branch prediction differ from real workload | Benchmark realistic workloads, validate with production metrics |
+| Profiling overhead | Profiler itself skews results (observer effect) | Use sampling profilers (py-spy, pprof) not tracing profilers |
+| Cache invalidation | Stale data served, inconsistent state across nodes | TTL + event-based invalidation, cache-aside pattern |
+| Optimizing cold path | Spending effort on rarely-executed code | Focus on hot paths identified by profiling |
+| Ignoring tail latency | p50 looks great but p99 is 10x worse | Measure and optimize p95/p99, not just averages |
+| N+1 queries hidden by ORM | Each page load fires hundreds of queries | Enable query logging, use eager loading |
+| Compression on small payloads | Overhead exceeds savings for payloads <150 bytes | Only compress above minimum size threshold |
+| Connection pool too large | Each connection uses memory, causes lock contention | Size pool to CPU cores x 2-3, not hundreds |
+| Missing async in I/O path | One blocking call serializes all concurrent requests | Audit entire request path for blocking calls |
+| Benchmarking debug builds | Debug builds 10-100x slower, misleading results | Always benchmark release/optimized builds |
+| Over-indexing database | Write performance degrades, storage bloats | Only index columns in WHERE, JOIN, ORDER BY clauses |