
Optimization Patterns

Proven performance optimization strategies across the stack.

Caching Strategies

Cache Selection Decision Tree

What are you caching?
│
├─ Computation result (same input → same output)
│  ├─ In-process only
│  │  └─ In-memory cache (LRU map, memoization)
│  │     Eviction: LRU, LFU, TTL
│  │     Tools: lru-cache (Node), functools.lru_cache (Python), sync.Map (Go)
│  └─ Shared across processes/servers
│     └─ Redis / Memcached
│        TTL-based, key-value, sub-millisecond latency
│
├─ Database query result
│  ├─ Rarely changes, expensive to compute
│  │  └─ Materialized view (database-level cache)
│  │     Refresh: on schedule, on trigger, on demand
│  ├─ Changes with writes
│  │  └─ Cache-aside pattern (read: cache → DB, write: DB → invalidate cache)
│  └─ Read-heavy, tolerate slight staleness
│     └─ Read replica + cache with TTL
│
├─ API response
│  ├─ Same for all users
│  │  └─ CDN cache (Cloudflare, CloudFront)
│  │     Headers: Cache-Control, ETag, Last-Modified
│  ├─ Varies by user but cacheable
│  │  └─ Vary header + CDN or reverse proxy cache
│  └─ Personalized but expensive
│     └─ Server-side cache (Redis) with user-specific keys
│
├─ Static assets (JS, CSS, images)
│  └─ CDN + long cache + content-hashed filenames
│     Cache-Control: public, max-age=31536000, immutable
│     Bust cache by changing filename (app.a1b2c3.js)
│
└─ HTML pages
   ├─ Static content
   │  └─ Pre-render at build time (SSG)
   │     Cache-Control: public, max-age=3600
   ├─ Mostly static, some dynamic
   │  └─ Stale-while-revalidate
   │     Cache-Control: public, max-age=60, stale-while-revalidate=3600
   └─ Fully dynamic
      └─ Short TTL or no-cache + ETag for conditional requests

Cache Invalidation Patterns

Pattern: TTL (Time-To-Live)
├─ Simplest approach: cache expires after N seconds
├─ Pro: no coordination needed, self-healing
├─ Con: stale data for up to TTL duration
└─ Best for: session data, config, rate limits

Pattern: Cache-Aside (Lazy Loading)
├─ Read: check cache → miss → query DB → populate cache
├─ Write: update DB → delete cache key (not update)
├─ Pro: only caches what's actually requested
├─ Con: first request after invalidation is slow (cache miss)
└─ Best for: general purpose, most common pattern
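The cache-aside read and write paths above can be sketched in a few lines of Python. This is a minimal illustration, not production code: a plain dict with expiry timestamps stands in for Redis, another dict stands in for the database, and the `get_user`/`update_user` names are hypothetical.

```python
import time

db = {"user:1": {"name": "Ada"}}   # stands in for the real database
cache = {}                          # stands in for Redis: {key: (value, expires_at)}
TTL = 60                            # seconds

def get_user(key):
    """Read path: check cache -> miss -> query DB -> populate cache."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                         # cache hit
    value = db[key]                             # cache miss: hit the DB
    cache[key] = (value, time.time() + TTL)     # populate for next reader
    return value

def update_user(key, value):
    """Write path: update DB, then delete (not update) the cache key."""
    db[key] = value
    cache.pop(key, None)    # next read repopulates from the fresh DB row
```

Deleting rather than updating the cached value on write avoids racing with concurrent readers that might write back a stale value.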

Pattern: Write-Through
├─ Write: update cache AND DB in same operation
├─ Pro: cache always consistent with DB
├─ Con: write latency increases, caches unused data
└─ Best for: read-heavy data that must be fresh

Pattern: Write-Behind (Write-Back)
├─ Write: update cache, async flush to DB
├─ Pro: fast writes, batch DB operations
├─ Con: data loss risk if cache crashes before flush
└─ Best for: high-write-throughput, non-critical data (counters, analytics)

Pattern: Event-Driven Invalidation
├─ Publish event on data change, subscribers invalidate caches
├─ Pro: low latency invalidation, decoupled
├─ Con: eventual consistency, event delivery guarantees needed
└─ Best for: microservices, distributed systems

Cache Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Cache without eviction | Memory grows unbounded | Set max size + LRU/LFU eviction |
| Thundering herd | Cache expires, all requests hit DB simultaneously | Mutex/singleflight, stale-while-revalidate |
| Cache stampede | Hot key expires under high load | Background refresh before expiry |
| Inconsistent cache + DB | Update DB, crash before invalidating cache | Delete cache first (slightly stale reads), or use distributed transactions |
| Caching errors | Error response cached, served to all users | Only cache successful responses, or cache with very short TTL |
| Over-caching | Too many cache layers, hard to debug | Cache at one layer, usually closest to consumer |
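The mutex/singleflight fix for the thundering herd can be sketched with one lock per key: the first caller computes the missing value while concurrent callers block and then reuse it. A minimal, hypothetical sketch (a real cache would also add TTL/eviction to `_results`):

```python
import threading

_locks = {}                       # one lock per cache key
_locks_guard = threading.Lock()   # protects the _locks dict itself
_results = {}                     # stands in for the cache

def singleflight(key, compute):
    """Only one caller recomputes a missing key; the rest wait and reuse it."""
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if key not in _results:       # first caller in does the work
            _results[key] = compute()
        return _results[key]          # everyone else gets the shared result
```

With this in place, 100 simultaneous misses on a hot key produce one database query instead of 100.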

Database Optimization

Indexing Strategy

Index selection decision tree:
│
├─ Query uses WHERE clause
│  ├─ Single column filter
│  │  └─ B-tree index on that column
│  ├─ Multiple column filter (AND)
│  │  └─ Composite index (most selective column first)
│  │     CREATE INDEX idx ON table(col_a, col_b, col_c)
│  │     Left-prefix rule: this index covers (a), (a,b), (a,b,c) queries
│  └─ Text search (LIKE '%term%')
│     └─ Full-text index (not B-tree, which only helps prefix LIKE 'term%')
│
├─ Query uses ORDER BY
│  └─ Index matching ORDER BY columns avoids filesort
│     Combine with WHERE columns: INDEX(where_col, order_col)
│
├─ Query uses JOIN
│  └─ Index on join columns of the inner table
│     ON a.id = b.a_id → index on b.a_id
│
├─ Query uses GROUP BY / DISTINCT
│  └─ Index matching GROUP BY columns
│
└─ High cardinality vs low cardinality
   ├─ High cardinality (many unique values): good index candidate
   └─ Low cardinality (few unique values, e.g., boolean): partial index
      CREATE INDEX idx ON orders(status) WHERE status = 'pending'

Query Optimization Patterns

-- AVOID: SELECT * (fetches unnecessary columns)
SELECT * FROM users WHERE id = 1;
-- PREFER: select only needed columns
SELECT id, name, email FROM users WHERE id = 1;

-- AVOID: N+1 queries
-- Python/ORM: for user in users: user.orders  (fires query per user)
-- PREFER: eager loading
-- SELECT * FROM users JOIN orders ON users.id = orders.user_id

-- AVOID: OFFSET for pagination on large tables
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 10000;
-- PREFER: cursor-based pagination
SELECT * FROM posts WHERE created_at < '2025-01-01' ORDER BY created_at DESC LIMIT 20;

-- AVOID: functions on indexed columns in WHERE
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
-- PREFER: expression index or store normalized
CREATE INDEX idx_email_lower ON users(LOWER(email));

-- AVOID: implicit type conversion
SELECT * FROM users WHERE id = '123';  -- string vs integer
-- PREFER: correct types
SELECT * FROM users WHERE id = 123;

-- AVOID: correlated subqueries
SELECT *, (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) FROM users;
-- PREFER: JOIN with GROUP BY
SELECT users.*, COUNT(orders.id) FROM users LEFT JOIN orders ON users.id = orders.user_id GROUP BY users.id;

Connection Pooling

Connection pool sizing:
│
├─ Too small
│  └─ Requests queue waiting for connections
│     Symptom: latency spikes, "connection pool exhausted" errors
│
├─ Too large
│  └─ Database overwhelmed with connections
│     Symptom: high memory usage on DB, mutex contention, slower queries
│
└─ Right size
   └─ Formula: connections = (core_count * 2) + effective_spindle_count
      For SSD: connections ≈ core_count * 2-3
      For cloud DB (e.g., RDS): check instance limits

Tools:
- PostgreSQL: PgBouncer (transaction pooling, session pooling)
- MySQL: ProxySQL
- Java: HikariCP (fastest JVM pool)
- Python: SQLAlchemy pool (pool_size, max_overflow, pool_timeout)
- Node.js: built-in pool in pg, mysql2, knex
- Go: database/sql built-in pool (SetMaxOpenConns, SetMaxIdleConns)

Frontend Optimization

Code Splitting

When to split:
│
├─ Route-based splitting (most impactful)
│  └─ Each page/route is a separate chunk
│     React: React.lazy(() => import('./Page'))
│     Next.js: automatic per-page
│     Vue: defineAsyncComponent(() => import('./Page.vue'))
│
├─ Component-based splitting
│  └─ Heavy components loaded on demand
│     Modal dialogs, rich text editors, charts, maps
│     <Suspense fallback={<Spinner />}><LazyComponent /></Suspense>
│
├─ Library-based splitting
│  └─ Large libraries in separate chunks
│     Moment.js, chart libraries, syntax highlighters
│     // vite.config.js
│     build: { rollupOptions: { output: {
│       manualChunks: { vendor: ['react', 'react-dom'] }
│     }}}
│
└─ Conditional feature splitting
   └─ Features only some users need (admin panel, A/B tests)
      const AdminPanel = lazy(() => import('./AdminPanel'))

Image Optimization

Format selection:
│
├─ Photographs
│  ├─ Best: AVIF (30-50% smaller than JPEG)
│  ├─ Good: WebP (25-35% smaller than JPEG)
│  └─ Fallback: JPEG (universal support)
│
├─ Graphics/logos with transparency
│  ├─ Best: WebP or AVIF
│  ├─ Good: PNG (lossless)
│  └─ Simple graphics: SVG (scalable, tiny file size)
│
└─ Icons
   └─ SVG sprites or icon fonts (not individual PNGs)

Implementation:
<picture>
  <source srcset="image.avif" type="image/avif">
  <source srcset="image.webp" type="image/webp">
  <img src="image.jpg" alt="Description"
       loading="lazy"
       decoding="async"
       width="800" height="600">
</picture>

Responsive images:
<img srcset="small.jpg 400w, medium.jpg 800w, large.jpg 1200w"
     sizes="(max-width: 600px) 400px, (max-width: 900px) 800px, 1200px"
     src="medium.jpg" alt="Description">

Critical Rendering Path

Optimization checklist:
│
├─ Critical CSS
│  ├─ Inline above-the-fold CSS in <head>
│  ├─ Defer non-critical CSS: <link rel="preload" as="style">
│  └─ Tools: critical (npm package), Critters (webpack plugin)
│
├─ JavaScript loading
│  ├─ <script defer>: download parallel, execute after HTML parse
│  ├─ <script async>: download parallel, execute immediately (order not guaranteed)
│  ├─ <script type="module">: deferred by default, strict mode
│  └─ Move non-critical JS below the fold
│
├─ Resource hints
│  ├─ <link rel="preconnect" href="https://api.example.com">
│  │  Establish early connections to known origins
│  ├─ <link rel="dns-prefetch" href="https://cdn.example.com">
│  │  Resolve DNS early for third-party domains
│  ├─ <link rel="preload" href="font.woff2" as="font" crossorigin>
│  │  Preload critical resources (fonts, hero image, key CSS)
│  └─ <link rel="prefetch" href="next-page.js">
│     Prefetch resources for likely next navigation
│
├─ Font optimization
│  ├─ font-display: swap (show fallback immediately)
│  ├─ Subset fonts to used characters
│  ├─ Preload critical fonts
│  ├─ Use woff2 format (best compression)
│  └─ Self-host instead of Google Fonts (one fewer connection)
│
└─ Layout stability (CLS)
   ├─ Always set width/height on images and videos
   ├─ Reserve space for ads and embeds
   ├─ Use CSS aspect-ratio for responsive containers
   └─ Avoid inserting content above existing content

API Optimization

Response Optimization

Reducing payload size:
│
├─ Field selection (GraphQL-style)
│  └─ GET /users/123?fields=id,name,email
│     Only return requested fields
│
├─ Pagination
│  ├─ Offset-based: ?page=3&limit=20
│  │  Simple but slow for deep pages (OFFSET 10000)
│  ├─ Cursor-based: ?after=cursor_abc&limit=20
│  │  Fast at any depth, stable during inserts
│  └─ Keyset: ?created_after=2025-01-01&limit=20
│     Uses indexed column for efficient seeking
│
├─ Compression
│  ├─ Brotli (br): best ratio for static content
│  ├─ gzip: universal support, good for dynamic content
│  ├─ Accept-Encoding: br, gzip
│  └─ Skip compression for small payloads (<150 bytes)
│
├─ Batch endpoints
│  └─ POST /batch with array of operations
│     Reduces HTTP overhead, enables server-side optimization
│
└─ Caching headers
   ├─ ETag + If-None-Match: 304 Not Modified (no body transfer)
   ├─ Last-Modified + If-Modified-Since: same as ETag
   └─ Cache-Control: max-age=60, stale-while-revalidate=600
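The ETag flow above can be sketched as a tiny handler: hash the body, compare against the client's If-None-Match, and return 304 with no body on a match. The `respond` function and the 16-character hash prefix are illustrative choices, not a standard.

```python
import hashlib

def make_etag(body):
    """Derive a strong ETag from the response body (sketch: truncated SHA-256)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body, if_none_match=None):
    """Return (status, headers, body). A 304 carries headers but no body."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""       # client's copy is still valid
    return 200, {"ETag": etag, "Cache-Control": "max-age=60"}, body
```

The second request transfers only headers, which is the whole point: the client revalidates cheaply instead of re-downloading the payload.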

HTTP/2 and HTTP/3

HTTP/2 optimizations (changes from HTTP/1.1):
│
├─ Multiplexing: multiple requests on single connection
│  ├─ STOP: domain sharding (hurts with HTTP/2)
│  ├─ STOP: sprite sheets (individual files are fine)
│  └─ STOP: concatenating CSS/JS into mega-bundles
│
├─ Server Push: server sends resources before client requests
│  └─ Mostly deprecated: use <link rel="preload"> instead
│
├─ Header compression (HPACK)
│  └─ Automatic, no action needed
│
└─ Stream prioritization
   └─ Browsers handle automatically, server must support

HTTP/3 (QUIC):
├─ Faster connection setup (0-RTT)
├─ No head-of-line blocking
├─ Better on lossy networks (mobile)
└─ Enable on CDN/reverse proxy (Cloudflare, nginx with quic module)

Concurrency Optimization

Worker Pool Patterns

When to use worker pools:
│
├─ CPU-bound tasks
│  ├─ Pool size = number of CPU cores
│  ├─ Node.js: worker_threads
│  ├─ Python: multiprocessing.Pool, ProcessPoolExecutor
│  ├─ Go: bounded goroutine pool (semaphore pattern)
│  └─ Rust: rayon::ThreadPool, tokio::task::spawn_blocking
│
├─ I/O-bound tasks
│  ├─ Pool size = much larger than CPU cores (100s-1000s)
│  ├─ Node.js: async/await (event loop handles concurrency)
│  ├─ Python: asyncio, ThreadPoolExecutor for blocking I/O
│  ├─ Go: goroutines (lightweight, thousands are fine)
│  └─ Rust: tokio tasks (lightweight, thousands are fine)
│
└─ Mixed workloads
   └─ Separate pools for CPU and I/O work
      Don't let CPU-bound tasks block I/O pool
      Don't let I/O waits occupy CPU pool slots
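A minimal Python sketch of the sizing guidance above, using the standard-library `concurrent.futures`. The `fetch` function is a hypothetical stand-in for a blocking I/O call (HTTP request, DB query); a real CPU-bound workload would use `ProcessPoolExecutor` instead of threads to sidestep the GIL.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# CPU-bound work: one worker per core is the usual upper bound.
cpu_workers = os.cpu_count() or 4

# I/O-bound work: workers mostly wait on the network, so many more are fine.
io_pool = ThreadPoolExecutor(max_workers=32)

def fetch(url):
    # stand-in for a blocking I/O call (HTTP request, DB query, ...)
    return f"fetched {url}"

urls = [f"https://example.com/{i}" for i in range(8)]
results = list(io_pool.map(fetch, urls))   # order of results matches order of urls
io_pool.shutdown()
```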

Backpressure

Without backpressure:
Producer (fast) → → → → → QUEUE OVERFLOW → OOM/Crash
                          ↓ ↓ ↓ ↓ ↓ ↓ ↓
Consumer (slow) → → →

With backpressure:
Producer (fast) → → BLOCKED (queue full)
                          ↓ ↓ ↓
Consumer (slow) → → → (processes at own pace)

Implementation strategies:
├─ Bounded queues: reject/block when full
├─ Rate limiting: token bucket, leaky bucket
├─ Circuit breaker: stop sending when consumer is unhealthy
├─ Load shedding: drop low-priority work under pressure
└─ Reactive streams: consumer signals demand to producer
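The bounded-queue strategy is the simplest to demonstrate: Python's `queue.Queue(maxsize=N)` blocks the producer on `put()` whenever the consumer falls behind, which is exactly the backpressure shown in the diagram. A small sketch with a sentinel to signal completion:

```python
import queue
import threading

buf = queue.Queue(maxsize=4)   # bounded: put() blocks when 4 items are waiting
DONE = object()                # sentinel marking end of stream
consumed = []

def producer(n):
    for i in range(n):
        buf.put(i)             # blocks here when the consumer falls behind
    buf.put(DONE)

def consumer():
    while True:
        item = buf.get()
        if item is DONE:
            break
        consumed.append(item)  # processes at its own pace

t1 = threading.Thread(target=producer, args=(100,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Memory stays bounded at 4 in-flight items no matter how fast the producer runs; without `maxsize` the queue would absorb all 100 immediately.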

Batching

Individual operations:
Request 1 → DB Query → Response
Request 2 → DB Query → Response    Total: 100 DB queries
Request 3 → DB Query → Response
... (100 times)

Batched operations:
Request 1 ─┐
Request 2 ──┤ Batch → 1 DB Query → Responses    Total: 1 DB query
Request 3 ──┤
... (100)  ─┘

Implementation:
├─ DataLoader pattern (GraphQL/general)
│  Collect individual loads within one event loop tick
│  Execute as single batch query
│  Return individual results
│
├─ Bulk INSERT
│  Collect rows, INSERT multiple in one statement
│  INSERT INTO items (a, b) VALUES (1,2), (3,4), (5,6)
│
├─ Batch API calls
│  Collect individual API calls
│  Send as single batch request
│  POST /batch [{method, path, body}, ...]
│
└─ Write coalescing
   Buffer writes for short window (10-100ms)
   Flush as single operation
   Trade latency for throughput
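The DataLoader idea can be sketched without a framework: collect requested keys, then resolve them all with one batch query on first access. Everything here is illustrative: `fetch_users_batch` stands in for a single `SELECT ... WHERE id IN (...)`, and the `queries` list exists only to count round trips.

```python
queries = []   # records each DB round trip, to show the batching win

def fetch_users_batch(ids):
    """One query for many ids, e.g. SELECT ... WHERE id IN (...)."""
    queries.append(tuple(ids))
    return {i: {"id": i, "name": f"user{i}"} for i in ids}

class UserLoader:
    def __init__(self):
        self.pending = set()   # keys requested but not yet fetched
        self.cache = {}        # keys already resolved

    def want(self, user_id):
        if user_id not in self.cache:
            self.pending.add(user_id)

    def flush(self):
        if self.pending:
            self.cache.update(fetch_users_batch(sorted(self.pending)))
            self.pending.clear()

    def get(self, user_id):
        self.flush()           # resolve all pending keys in one batch
        return self.cache[user_id]
```

Three `want()` calls followed by a `get()` issue a single query instead of three, which is the N+1 fix in miniature.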

Memory Optimization

Object Pooling

When to pool:
├─ Objects are expensive to create (DB connections, threads, buffers)
├─ Objects are created and destroyed frequently
├─ GC pressure is high (many short-lived allocations)
└─ Object initialization involves I/O or complex setup

Implementation pattern (a minimal thread-safe sketch in Python):

import threading

class ObjectPool:
    def __init__(self, factory, max_size):
        self.factory = factory        # creates a new object when pool is empty
        self.max_size = max_size      # maximum objects in existence
        self.available = []           # ready to use
        self.in_use = 0               # currently checked out
        self.cond = threading.Condition()

    def acquire(self):
        with self.cond:
            while not self.available and self.in_use >= self.max_size:
                self.cond.wait()      # or time out / reject under load
            self.in_use += 1
            return self.available.pop() if self.available else self.factory()

    def release(self, obj):
        with self.cond:
            # reset obj's state here before reuse
            self.in_use -= 1
            self.available.append(obj)
            self.cond.notify()        # wake one waiting acquirer

Language-specific:
- Go: sync.Pool (GC-aware, may evict items)
- Rust: object-pool crate, crossbeam ObjectPool
- Java: Apache Commons Pool, HikariCP (connection pool)
- Python: queue.Queue-based custom pool
- Node.js: generic-pool package

Streaming vs Buffering

Buffering (load all into memory):
├─ Simple code
├─ Random access to data
├─ DANGER: memory scales with data size
└─ Max data size limited by available RAM

Streaming (process chunk by chunk):
├─ Constant memory regardless of data size
├─ Can start processing before all data arrives
├─ Required for: large files, real-time data, video/audio
└─ Slightly more complex code

Decision:
├─ Data < 10MB → buffering is fine
├─ Data > 100MB → streaming required
├─ Data size unknown → streaming required
├─ Real-time processing → streaming required
└─ Need multiple passes → buffer or use temp file

Examples:
- Node.js: fs.createReadStream() vs fs.readFileSync()
- Python: for line in open() vs open().read()
- Go: io.Reader/io.Writer interfaces
- Rust: BufReader/BufWriter, tokio AsyncRead/AsyncWrite
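As a concrete sketch of the constant-memory property: hashing a file chunk by chunk touches at most one chunk at a time, regardless of total size. The `stream_checksum` name is hypothetical; `io.BytesIO` stands in for a real file handle.

```python
import hashlib
import io

def stream_checksum(reader, chunk_size=64 * 1024):
    """Hash a stream chunk by chunk: peak memory ~chunk_size, not file size."""
    h = hashlib.sha256()
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:           # empty read means end of stream
            break
        h.update(chunk)
    return h.hexdigest()
```

The buffered equivalent (`hashlib.sha256(reader.read())`) produces the same digest but holds the entire payload in memory at once.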

String Interning

Problem: many duplicate strings consuming memory

Before interning:
  "status: active"  →  String { ptr: 0x1000, len: 14 }
  "status: active"  →  String { ptr: 0x2000, len: 14 }  // Duplicate!
  "status: active"  →  String { ptr: 0x3000, len: 14 }  // Another!

After interning:
  "status: active"  →  all point to same allocation

When to use:
├─ Many repeated string values (status fields, tags, categories)
├─ Long-lived strings that are compared often
└─ Parsing structured data with repetitive fields

Language support:
- Python: sys.intern() (automatic for small strings)
- Java: String.intern() (JVM string pool)
- Go: no built-in, use map[string]string dedup
- Rust: string-interner crate
- JavaScript: V8 interns short strings automatically
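A small Python demonstration of the table above. The strings are built at runtime (via a list) so CPython's compile-time constant folding doesn't accidentally deduplicate them before `sys.intern` gets a chance to.

```python
import sys

parts = ["active", "active"]
a = "status: " + parts[0]   # runtime concatenation: a fresh allocation
b = "status: " + parts[1]   # equal value, typically a separate allocation

# Interning maps equal values to one shared object, so duplicates cost
# no extra memory and identity comparison (`is`) becomes valid.
a_i = sys.intern(a)
b_i = sys.intern(b)
```

After interning, comparisons on hot paths can use pointer identity instead of character-by-character equality.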

Algorithm Optimization

Big-O Awareness

Common operations and their complexity:

O(1)        Array index, hash map lookup, stack push/pop
O(log n)    Binary search, balanced BST lookup, heap insert
O(n)        Linear scan, array copy, linked list traversal
O(n log n)  Efficient sort (merge, quick, heap sort)
O(n²)       Nested loops, bubble sort, insertion sort
O(2ⁿ)       Recursive Fibonacci (naive), power set

Practical impact (n = 1,000,000):
O(1)       = 1 operation
O(log n)   = 20 operations
O(n)       = 1,000,000 operations
O(n log n) = 20,000,000 operations
O(n²)      = 1,000,000,000,000 operations  ← TOO SLOW

Quick wins:
├─ Replace list search with set/hash map: O(n) → O(1)
├─ Sort + binary search instead of linear search: O(n) → O(log n)
├─ Replace nested loops with hash join: O(n²) → O(n)
├─ Cache computed values (memoization): avoid redundant work
└─ Use appropriate data structure for access pattern
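The first quick win in one picture: membership tests against a list scan every element, while a set hashes straight to the answer. The `is_blocked` name and data are illustrative.

```python
# O(n) per lookup: scanning a list of 10,000 names
blocked_list = [f"user{i}" for i in range(10_000)]

# O(1) average per lookup: the same data as a hash set
blocked_set = set(blocked_list)

def is_blocked(user):
    return user in blocked_set   # hash lookup instead of linear scan
```

One check goes from ~10,000 comparisons to roughly one hash; across a million requests that is the difference between O(n·m) and O(m).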

Data Structure Selection

Access pattern → best data structure:
│
├─ Fast lookup by key
│  └─ Hash map / dictionary
│     O(1) average lookup, insert, delete
│
├─ Ordered iteration + fast lookup
│  └─ Balanced BST (TreeMap, BTreeMap)
│     O(log n) lookup, insert, delete, ordered iteration
│
├─ Fast insert/remove at both ends
│  └─ Deque (double-ended queue)
│     O(1) push/pop at front and back
│
├─ Priority-based access (always get min/max)
│  └─ Heap / priority queue
│     O(1) peek min/max, O(log n) insert and extract
│
├─ Fast membership testing
│  ├─ Hash set
│  │  O(1) contains check
│  └─ Bloom filter (probabilistic, space-efficient)
│     O(1) contains, may have false positives
│
├─ Fast prefix search / autocomplete
│  └─ Trie
│     O(k) lookup where k = key length
│
├─ Range queries (find all items between A and B)
│  └─ Sorted array + binary search, or B-tree
│
└─ Graph relationships
   ├─ Sparse graph → adjacency list
   └─ Dense graph → adjacency matrix
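Two of the branches above map directly onto Python's standard library: `heapq` for priority-based access and `collections.deque` for O(1) operations at both ends (a plain list pays O(n) to pop from the front). The task names are illustrative.

```python
import heapq
from collections import deque

# Priority queue: O(log n) push/pop, O(1) peek at the minimum.
tasks = []
heapq.heappush(tasks, (2, "send email"))
heapq.heappush(tasks, (1, "charge card"))
next_task = tasks[0]             # peek: lowest priority number first

# Deque: O(1) append/pop at either end; maxlen gives a sliding window.
window = deque(maxlen=3)         # keeps only the last 3 items
for x in (1, 2, 3, 4):
    window.append(x)             # oldest item is evicted automatically
```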

Anti-Patterns

Premature Optimization

"Premature optimization is the root of all evil" — Knuth

Signs of premature optimization:
├─ Optimizing before measuring
├─ Optimizing code that runs once or rarely
├─ Sacrificing readability for negligible performance gain
├─ Optimizing at the wrong level (micro vs macro)
└─ Optimizing before the feature is correct

The right approach:
1. Make it work (correct behavior)
2. Make it right (clean, maintainable code)
3. Make it fast (profile, then optimize measured bottlenecks)

Exception: architectural decisions (data model, API design) are
expensive to change later. Think about performance at design time
for structural choices.

Common Performance Anti-Patterns

| Anti-Pattern | Why It's Bad | Better Approach |
|---|---|---|
| N+1 queries | 1000 items = 1001 DB queries | Eager loading, batch queries, DataLoader |
| Unbounded growth | Cache/queue/buffer grows without limit | Set max size, eviction policy, backpressure |
| Synchronous I/O in async code | Blocks event loop/thread pool, kills throughput | Use async I/O throughout, offload blocking to thread pool |
| Re-rendering everything | UI updates trigger full tree re-render | Virtual DOM diffing, memoization, fine-grained reactivity |
| Serializing/deserializing repeatedly | Data converted between formats multiple times | Pass native objects, serialize once at boundary |
| Polling when events are available | CPU waste checking for changes | WebSockets, SSE, file watch, pub/sub |
| Logging in hot path | String formatting + I/O in tight loop | Sampling, async logging, log level guards |
| Global locks | All threads contend on single mutex | Fine-grained locks, lock-free structures, sharding |
| String concatenation in loop | O(n²) due to repeated copying | StringBuilder, join, format strings |
| Creating regex in loop | Compile regex on every iteration | Compile once, reuse compiled pattern |
| Deep cloning when shallow suffices | Copying entire object graph unnecessarily | Structural sharing, immutable data structures, shallow copy |
| Catching exceptions for flow control | Exceptions are expensive (stack trace capture) | Use return values, option types, result types |
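The string-concatenation row deserves a concrete before/after. In Python, repeated `+=` on a string copies everything accumulated so far on each iteration (O(n²) total); building a list and joining once is O(n). The function names are illustrative.

```python
def slow_csv(rows):
    out = ""
    for row in rows:
        out += ",".join(row) + "\n"   # each += recopies the whole string so far
    return out

def fast_csv(rows):
    # Build pieces, join once at the end: a single O(n) pass.
    return "".join(",".join(row) + "\n" for row in rows)
```

Both produce identical output; the difference only shows up as quadratic slowdown once `rows` gets large.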