query-tuning.md 22 KB

PostgreSQL Query Tuning Reference

Table of Contents

  1. EXPLAIN Output Reference
  2. Plan Node Reference
  3. pg_stat_statements
  4. Common Optimization Patterns
  5. Parallel Query
  6. Statistics and Planner

EXPLAIN Output Reference

Format Options

-- Default text format (human readable)
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- With actual execution stats (runs the query)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;

-- Full verbose output with all options
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT TEXT) SELECT * FROM orders WHERE customer_id = 42;

-- JSON format for programmatic parsing
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 42;

-- YAML format
EXPLAIN (ANALYZE, BUFFERS, FORMAT YAML) SELECT * FROM orders WHERE customer_id = 42;

Key Fields Decoded

Seq Scan on orders  (cost=0.00..4821.00 rows=1000 width=64)
                          ^      ^       ^         ^
                          |      |       |         estimated avg row width (bytes)
                          |      |       estimated output rows
                          |      total cost (return last row)
                          startup cost (return first row)

(actual time=0.042..18.340 rows=987 loops=1)
              ^       ^     ^        ^
              |       |     |        number of times node executed
              |       |     actual rows returned
              |       actual time to return last row (ms)
              actual time to return first row (ms)

Startup cost vs total cost: An index scan on a large table may have a high startup cost (building the bitmap) but low total cost per row. Nested loops favor low startup cost. A sort node has startup cost = full sort cost because no rows are returned until sorted.

Loops: When a node has loops=N, the actual time is per-loop average and actual rows is per-loop average. Multiply by loops to get totals. This matters for nested loop inners.

-- Identify row estimate errors (poor estimates = bad plans)
-- Look for large divergence between "rows=X" and "actual rows=Y"
-- A 10x+ difference warrants investigation via ANALYZE or statistics adjustments

Buffer Information

Buffers: shared hit=1024 read=256 dirtied=10 written=5
          ^              ^         ^           ^
          |              |         |           pages written to disk
          |              |         pages modified during query
          |              pages read from disk (cache miss)
          pages served from shared_buffers (cache hit)

Cache hit ratio for a single query:

  • hit / (hit + read) -- aim for > 0.99 in OLTP workloads

Reading Execution Time

-- Planning time vs execution time appear at bottom of EXPLAIN ANALYZE output
-- Planning Time: 1.234 ms
-- Execution Time: 45.678 ms
-- High planning time relative to execution suggests query plan caching issues
-- or extremely complex queries with many joins

Plan Node Reference

Scan Types

Sequential Scan -- reads entire table from disk in order. Chosen when: selectivity is high (returning large fraction of rows), no suitable index, small table fits in a few pages, or planner estimates index overhead exceeds benefit.

-- Force/prevent seq scan for testing (session level)
SET enable_seqscan = off;   -- discourages seq scan
SET enable_seqscan = on;    -- restore default

Index Scan -- follows index B-tree to find heap row pointers, fetches each heap page. Chosen when: high selectivity (few rows), index covers filter column, ORDER BY matches index. Drawback: random I/O on heap. Can be slower than seq scan on spinning disk for > ~5% of table.

Index Only Scan -- satisfies query entirely from index, no heap fetch (if visibility map allows). Requires: all SELECT and WHERE columns in the index. Needs up-to-date visibility map (regular VACUUM).

-- Check if index only scan is blocked by visibility map
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Create covering index to enable index only scan
CREATE INDEX idx_orders_covering ON orders (customer_id) INCLUDE (total, status, created_at);

Bitmap Index Scan + Bitmap Heap Scan -- builds bitmap of matching pages in memory, then fetches those pages in order (reduces random I/O vs plain Index Scan for moderate selectivity). Two-phase: BitmapIndexScan builds the bitmap, BitmapHeapScan fetches heap pages.

-- Bitmap scans combine multiple indexes via BitmapAnd / BitmapOr
-- Useful when query has multiple filter conditions each with their own index
EXPLAIN SELECT * FROM orders WHERE status = 'pending' AND region = 'EU';
-- May show: BitmapAnd -> BitmapIndexScan on idx_status + BitmapIndexScan on idx_region

Join Types

Nested Loop -- for each outer row, scan inner relation. Cost: O(outer_rows * inner_scan_cost). Best when outer is small and inner lookup is fast (indexed). Chosen when: outer result set is small, inner has index on join column.

-- Nested loop is ideal for:
-- SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id WHERE o.id = 99;
-- (single order -> single customer lookup via PK)

SET enable_nestloop = off;  -- force alternative join type for testing

Hash Join -- build hash table from smaller relation, probe with larger. Cost: O(build + probe). Best for large unsorted relations with no useful index on join key. Chosen when: joining large tables, no index on join columns, equality join only. Memory: controlled by work_mem. If hash table exceeds work_mem, spills to disk (batch mode).

-- Check for hash join disk spills in EXPLAIN ANALYZE
-- Batches: 4 means spilled to disk in 4 batches -- increase work_mem to fix
-- Hash Batches: 1 is ideal (all in memory)
SET work_mem = '256MB';  -- session level for large analytical queries

Merge Join -- sort both relations on join key, merge in order. Cost: O(N log N + M log M) for sorting. Best when inputs are already sorted (index). Chosen when: both sides are large, inputs already sorted, range or equality join.

SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Use sparingly in production; better to fix the cause (add index, fix statistics)

Aggregation

HashAggregate -- builds hash table of group keys, accumulates aggregates. Chosen for: unsorted input, many distinct groups. Memory: bounded by work_mem. When it exceeds work_mem, spills to disk (check Disk: XkB in EXPLAIN ANALYZE output).

GroupAggregate -- streams sorted input, emits group when key changes. Chosen when: input already sorted on GROUP BY columns (index), or few distinct groups. Zero memory overhead but requires sorted input.

-- Force sorted approach by ensuring index on GROUP BY columns
CREATE INDEX idx_orders_customer ON orders (customer_id, created_at);
-- Now GROUP BY customer_id may use GroupAggregate instead of HashAggregate

Sort Operations

EXPLAIN ANALYZE SELECT * FROM orders ORDER BY created_at DESC LIMIT 100;

-- In-memory sort: Sort Method: quicksort  Memory: 2048kB
-- Disk sort:      Sort Method: external merge  Disk: 512000kB  -- bad, increase work_mem

-- Top-N Heapsort: Sort Method: top-N heapsort  Memory: 64kB  -- efficient for LIMIT
-- Top-N heapsort is optimal for ORDER BY ... LIMIT N patterns

pg_stat_statements

Setup

-- postgresql.conf (requires restart)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000          -- number of query fingerprints tracked
pg_stat_statements.track = all          -- top|all|none (all includes nested queries)
pg_stat_statements.track_utility = on   -- track COPY, CREATE TABLE, etc.

-- After restart, create extension in each database you want to monitor
CREATE EXTENSION pg_stat_statements;

Key Columns (PostgreSQL 14+)

SELECT
    queryid,                          -- internal hash identifier
    query,                            -- normalized query text (params replaced with $1, $2)
    calls,                            -- number of times executed
    total_exec_time,                  -- total execution time (ms)
    mean_exec_time,                   -- avg execution time (ms)
    stddev_exec_time,                 -- std deviation (high = inconsistent)
    min_exec_time,
    max_exec_time,
    rows,                             -- total rows returned/affected
    shared_blks_hit,                  -- buffer cache hits
    shared_blks_read,                 -- disk reads
    shared_blks_dirtied,
    shared_blks_written,
    temp_blks_read,                   -- temp file reads (work_mem overflow)
    temp_blks_written,
    wal_bytes,                        -- WAL generated (high = write-heavy)
    toplevel                          -- true if called at top level (PG14+)
FROM pg_stat_statements;

Finding Problem Queries

-- Top 10 queries by total time (cumulative load on server)
SELECT
    round(total_exec_time::numeric, 2) AS total_ms,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total,
    left(query, 80) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Top 10 by mean execution time (slowest individual queries)
SELECT
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    round(max_exec_time::numeric, 2) AS max_ms,
    left(query, 80) AS query_snippet
FROM pg_stat_statements
WHERE calls > 10                   -- ignore rarely-run queries
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Queries with worst cache hit ratio (I/O bound candidates)
SELECT
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    shared_blks_hit + shared_blks_read AS total_blks,
    round(
        100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0),
        2
    ) AS hit_pct,
    left(query, 80) AS query_snippet
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 1000
ORDER BY hit_pct ASC
LIMIT 10;

-- Queries generating most temp files (work_mem too low or bad query)
SELECT
    calls,
    temp_blks_written,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    left(query, 80) AS query_snippet
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;

Reset Strategy

-- Reset stats for all queries (do after tuning to get fresh baseline)
SELECT pg_stat_statements_reset();

-- Reset stats for specific query (PG12+ by queryid)
SELECT pg_stat_statements_reset(userid, dbid, queryid)
FROM pg_stat_statements
WHERE query LIKE '%orders%'
LIMIT 1;

Common Optimization Patterns

CTE Materialization (PostgreSQL 12+)

-- Pre-PG12: CTEs were always materialized (optimization fence)
-- PG12+: planner decides, but you can force behavior

-- MATERIALIZED: always execute CTE once and cache result
-- Use when: CTE is expensive but referenced multiple times
WITH expensive_agg AS MATERIALIZED (
    SELECT customer_id, sum(total) AS lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT c.name, e.lifetime_value
FROM customers c
JOIN expensive_agg e ON e.customer_id = c.id;

-- NOT MATERIALIZED: inline the CTE (allow planner to push predicates in)
-- Use when: CTE is referenced once, or predicate pushdown is important
WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE status = 'complete'
)
SELECT * FROM recent_orders WHERE customer_id = 42;
-- Planner can now push "customer_id = 42" into the subquery and use an index

EXISTS vs IN vs JOIN

-- EXISTS: short-circuits on first match, good for correlated checks
-- Best when: checking existence only, inner side can be large
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.status = 'pending'
);

-- IN with subquery: similar to EXISTS in modern PG (planner converts to semi-join)
-- Bad when: subquery returns NULLs (IN with NULLs behaves unexpectedly)
SELECT c.id, c.name
FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders WHERE status = 'pending');

-- JOIN (semi-join via DISTINCT): explicit, predictable
-- Needed when: you want columns from both sides, or deduplication matters
SELECT DISTINCT c.id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id AND o.status = 'pending';

-- NOT IN danger with NULLs: returns zero rows if subquery has any NULL
-- Always use NOT EXISTS for negation checks
SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders);
-- If ANY customer_id in orders is NULL, returns no rows!
-- Use instead:
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

Lateral Join vs Subquery

-- LATERAL: allows subquery to reference columns from preceding FROM items
-- Useful for: top-N per group, correlated row-limited subqueries

-- Top 3 orders per customer (lateral is clean and indexed)
SELECT c.name, o.id, o.total
FROM customers c
CROSS JOIN LATERAL (
    SELECT id, total
    FROM orders
    WHERE customer_id = c.id
    ORDER BY total DESC
    LIMIT 3
) o;

-- Equivalent window function approach (often similar performance)
SELECT name, order_id, total
FROM (
    SELECT c.name, o.id AS order_id, o.total,
           row_number() OVER (PARTITION BY c.id ORDER BY o.total DESC) AS rn
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
) ranked
WHERE rn <= 3;

Pagination: OFFSET vs Keyset

-- OFFSET pagination: simple but degrades at high page numbers
-- At page 1000 with LIMIT 20, PostgreSQL fetches 20020 rows and discards 20000
SELECT id, name, created_at FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 20000;

-- Keyset (cursor) pagination: O(1) regardless of page depth
-- Requires: sorting by unique+indexed column(s), no arbitrary page jumping
-- After receiving last row of previous page with (created_at='2024-01-15', id=9876):
SELECT id, name, created_at
FROM orders
WHERE (created_at, id) < ('2024-01-15', 9876)  -- uses row comparison
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Index to support keyset:
CREATE INDEX idx_orders_keyset ON orders (created_at DESC, id DESC);

DISTINCT ON vs Window Function Deduplication

-- DISTINCT ON: PostgreSQL extension, returns first row per group (by ORDER BY)
-- Fast, single pass, leverages index on distinct columns
SELECT DISTINCT ON (customer_id)
    customer_id, id AS order_id, total, created_at
FROM orders
ORDER BY customer_id, created_at DESC;   -- gets most recent order per customer

-- Create index to support: (customer_id, created_at DESC)
CREATE INDEX idx_orders_latest ON orders (customer_id, created_at DESC);

-- Window function equivalent (more portable, more flexible)
SELECT customer_id, order_id, total, created_at
FROM (
    SELECT customer_id, id AS order_id, total, created_at,
           row_number() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM orders
) t
WHERE rn = 1;

Bulk Operations

-- COPY is fastest for bulk insert (bypasses most overhead)
-- From file:
COPY orders (customer_id, total, status) FROM '/tmp/orders.csv' WITH (FORMAT csv, HEADER);

-- From stdin (psql):
\COPY orders (customer_id, total, status) FROM 'orders.csv' CSV HEADER

-- unnest trick for bulk insert from application (avoids N round trips)
-- Send arrays of values, unnest server-side
INSERT INTO orders (customer_id, total, status)
SELECT * FROM unnest(
    ARRAY[1, 2, 3],              -- customer_ids
    ARRAY[100.00, 200.00, 50.00], -- totals
    ARRAY['pending', 'complete', 'pending']::text[]
) AS t(customer_id, total, status);

-- For very large bulk loads, disable indexes and re-add after:
ALTER TABLE orders DISABLE TRIGGER ALL;
-- ... COPY ...
ALTER TABLE orders ENABLE TRIGGER ALL;
-- Or: drop indexes, load, recreate (faster than incremental index updates)

-- Batch INSERT with ON CONFLICT (UPSERT)
INSERT INTO order_status_log (order_id, status, updated_at)
VALUES (1, 'shipped', now()), (2, 'delivered', now())
ON CONFLICT (order_id) DO UPDATE
    SET status = EXCLUDED.status,
        updated_at = EXCLUDED.updated_at;

Parallel Query

Configuration Settings

-- Key settings (postgresql.conf or ALTER SYSTEM)
max_parallel_workers_per_gather = 4      -- max workers per Gather node (default: 2)
max_parallel_workers = 8                  -- total parallel workers across all queries
max_worker_processes = 16                 -- total background workers (includes parallel)
min_parallel_table_scan_size = '8MB'     -- table must be > this for parallel seq scan
min_parallel_index_scan_size = '512kB'   -- index must be > this for parallel index scan
parallel_tuple_cost = 0.1                -- cost of passing tuple between workers
parallel_setup_cost = 1000               -- overhead of launching workers

When Parallel Query Engages

-- Parallel is chosen when: large table, high work_mem not limiting, no write operations
-- Check if parallel is being used:
EXPLAIN SELECT count(*), avg(total) FROM orders;
-- Should show: Gather -> Partial Aggregate -> Parallel Seq Scan on orders

-- Force parallel for testing (lower thresholds):
SET min_parallel_table_scan_size = 0;
SET parallel_setup_cost = 0;
SET max_parallel_workers_per_gather = 4;

When Parallel Does NOT Kick In

  • Queries that write (INSERT, UPDATE, DELETE, MERGE)
  • Queries inside functions marked PARALLEL UNSAFE (default for user functions)
  • Queries using cursors (DECLARE ... CURSOR FOR)
  • Queries called from another parallel worker
  • When max_parallel_workers_per_gather = 0
  • When LIMIT is small relative to table size (planner avoids parallel startup cost)
-- Mark functions parallel safe to allow parallel plans that call them
CREATE OR REPLACE FUNCTION calculate_discount(total numeric) RETURNS numeric
LANGUAGE sql
PARALLEL SAFE    -- only if function has no side effects and is truly safe
AS $$
    SELECT total * 0.9;
$$;

Statistics and Planner

Column Statistics

-- Default statistics target is 100 (samples ~30000 rows per column)
-- Increase for columns with many distinct values or skewed distributions

-- Check current statistics targets
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'orders'::regclass AND attnum > 0;

-- Increase statistics for a specific column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;  -- must re-run ANALYZE to collect new statistics

-- Check what the planner knows about a column
SELECT * FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
-- Key fields: n_distinct, most_common_vals, most_common_freqs, histogram_bounds

Extended Statistics

-- When two columns are correlated, single-column stats mislead the planner
-- Example: city and zip_code are correlated; planner underestimates after filtering both

-- Create extended statistics to capture column correlations
CREATE STATISTICS orders_region_status_stats (dependencies, ndistinct)
    ON region, status FROM orders;

ANALYZE orders;

-- Check extended statistics
SELECT * FROM pg_statistic_ext;
SELECT * FROM pg_statistic_ext_data;

-- MCV (most common values) extended statistics
CREATE STATISTICS orders_mcv ON region, status FROM orders
    WITH (kind = mcv);
ANALYZE orders;

n_distinct Overrides

-- When planner guesses wrong number of distinct values
-- Positive value = exact count, negative = fraction of total rows

-- Tell planner there are exactly 50 distinct statuses
ALTER TABLE orders ALTER COLUMN status SET (n_distinct = 50);

-- Tell planner distinct count is 10% of table rows
ALTER TABLE orders ALTER COLUMN customer_id SET (n_distinct = -0.1);

ANALYZE orders;  -- re-analyze to apply

pg_hint_plan (Last Resort)

-- Install pg_hint_plan extension (not in core, must compile or use package)
-- Use only when statistics fixes and index changes are insufficient

-- Hints are embedded in comments before the query
/*+ SeqScan(orders) */ SELECT * FROM orders WHERE status = 'pending';

/*+ IndexScan(orders idx_orders_status) */ SELECT * FROM orders WHERE status = 'pending';

/*+ HashJoin(orders customers) Leading(orders customers) */
SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;

-- Available hint types:
-- Scan: SeqScan, IndexScan, IndexOnlyScan, BitmapScan, NoSeqScan, NoIndexScan
-- Join: NestLoop, HashJoin, MergeJoin, NoNestLoop, NoHashJoin, NoMergeJoin
-- Join order: Leading(table1 table2 table3)
-- Parallel: Parallel(table N)  -- N = number of workers

-- Always document WHY a hint is needed and create a ticket to fix root cause
-- Hints become stale as data grows and can cause regressions after schema changes

Diagnosing Estimate vs Actual Divergence

-- Large divergence between estimated and actual rows is the #1 cause of bad plans
-- Use this query pattern to identify problem queries via pg_stat_statements + EXPLAIN

-- Step 1: find high-variance queries in pg_stat_statements
-- Step 2: run EXPLAIN ANALYZE and look for nodes where rows estimate is off by 10x+
-- Step 3: check pg_stats for the filtered columns

-- Example: orders table filtered on two correlated columns
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT * FROM orders WHERE region = 'US' AND status = 'pending';

-- If estimated rows = 10 but actual rows = 50000, investigate:
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname IN ('region', 'status');

-- Fix options in priority order:
-- 1. ANALYZE (if stats are stale)
-- 2. Increase statistics target: ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 500
-- 3. Create extended statistics for correlated columns
-- 4. Rewrite query to give planner better information
-- 5. pg_hint_plan as absolute last resort