
PostgreSQL Configuration & Tuning Reference

Table of Contents

  1. Memory Settings
    • shared_buffers
    • work_mem
    • maintenance_work_mem
    • effective_cache_size
    • huge_pages
  2. WAL & Checkpoint Settings
    • wal_level
    • wal_buffers
    • checkpoint_completion_target
    • max_wal_size and min_wal_size
    • full_page_writes
  3. Query Planner Settings
    • random_page_cost and seq_page_cost
    • effective_io_concurrency
    • JIT compilation
  4. Parallelism Settings
  5. Connection Settings
  6. Logging
  7. OLTP vs OLAP Profiles
  8. Extensions
    • pg_stat_statements
    • pg_trgm
    • PostGIS
    • timescaledb
    • pgcrypto
    • auto_explain

Memory Settings

shared_buffers

The PostgreSQL buffer cache: how much memory the server reserves for caching data pages.

shared_buffers = 8GB   # Recommended: 25% of total RAM

Rules of thumb:

  • Start at 25% of RAM. Going above 40% rarely helps and can hurt because the OS page cache also buffers the same pages.
  • On dedicated database servers, 25% is conservative but safe. Profile with pg_buffercache to measure actual cache hit rates.
  • Requires a server restart to take effect.

Check cache hit ratio:

SELECT
    sum(heap_blks_hit)  AS heap_hit,
    sum(heap_blks_read) AS heap_read,
    round(
        sum(heap_blks_hit)::numeric /
        nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2
    ) AS hit_ratio_pct
FROM pg_statio_user_tables;
-- Target: > 99% for OLTP, > 95% for OLAP
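As a quick worked example of the ratio computed above, using illustrative (made-up) counter values:

```python
# Hypothetical counters, as pg_statio_user_tables might report them
heap_blks_hit = 995_000    # pages served from shared_buffers
heap_blks_read = 5_000     # pages read from disk (or OS page cache)

hit_ratio_pct = round(heap_blks_hit / (heap_blks_hit + heap_blks_read) * 100, 2)
print(hit_ratio_pct)  # 99.5 -> meets the OLTP target of > 99%
```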

Identify which tables consume the most buffer space (requires pg_buffercache):

CREATE EXTENSION pg_buffercache;

SELECT
    relname,
    count(*) * 8192 / 1024 / 1024 AS cached_mb,
    round(count(*) * 100.0 / (SELECT count(*) FROM pg_buffercache), 2) AS pct_of_cache
FROM pg_buffercache bc
JOIN pg_class c ON bc.relfilenode = c.relfilenode
WHERE c.relkind = 'r'
GROUP BY relname
ORDER BY cached_mb DESC
LIMIT 20;

work_mem

Memory granted per sort, hash, or merge operation. Each query node (sort, hash join, hash aggregate) can use up to work_mem individually.

work_mem = 64MB   # Default 4MB is usually too low

Critical nuance: if a query has 5 sort nodes and 20 parallel workers, it can consume 5 * 20 * work_mem = 100x work_mem. For a 32GB server running 100 connections, setting work_mem = 320MB is catastrophic.

Sizing strategy:

  1. Estimate concurrent queries: max_connections * avg_active_fraction
  2. Reserve memory for OS + shared_buffers + maintenance_work_mem
  3. Divide remainder: work_mem = remaining / (active_connections * avg_nodes_per_query)

For most OLTP systems: 16-64MB. For analytics: 256MB-1GB with fewer connections.
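The three-step sizing strategy above can be sketched as arithmetic. All the server figures here are assumptions for illustration, not recommendations:

```python
# Hypothetical 32GB OLTP server
total_ram_mb = 32 * 1024
shared_buffers_mb = 8 * 1024      # 25% of RAM
maintenance_mb = 1024
os_reserve_mb = 4 * 1024          # headroom for OS and page cache

max_connections = 200
avg_active_fraction = 0.5         # assume half the connections are active at once
avg_nodes_per_query = 5           # sorts/hashes per query, a rough guess

# Step 1: estimate concurrent queries
active = int(max_connections * avg_active_fraction)                            # 100
# Step 2: reserve memory for OS + shared_buffers + maintenance_work_mem
remaining = total_ram_mb - shared_buffers_mb - maintenance_mb - os_reserve_mb  # 19456
# Step 3: divide the remainder
work_mem_mb = remaining // (active * avg_nodes_per_query)                      # 38
print(f"work_mem = {work_mem_mb}MB")
```

The result (38MB here) lands inside the 16-64MB OLTP range; round down rather than up, since the multipliers are guesses.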

Override per session for specific heavy queries:

SET work_mem = '512MB';
SELECT ... FROM large_table ORDER BY ...;
RESET work_mem;

Monitor actual temporary file creation to detect under-allocation:

log_temp_files = 0   # Log all temp files (0 = log everything, N = only files >= N kB)

-- Check existing temp file usage stats (requires pg_stat_statements)
SELECT query, temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;

maintenance_work_mem

Memory for maintenance operations: VACUUM, ANALYZE, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY, CLUSTER.

maintenance_work_mem = 2GB   # Recommended: up to 10% RAM or 1-4GB

Larger values dramatically speed up CREATE INDEX and VACUUM on large tables. Unlike work_mem, maintenance operations rarely run many at a time, so you can set this aggressively. One caveat: each autovacuum worker can also use up to this amount, unless autovacuum_work_mem is set separately.

Override per session before a large index build:

SET maintenance_work_mem = '4GB';
CREATE INDEX CONCURRENTLY idx_events_created_at ON events (created_at);
RESET maintenance_work_mem;

effective_cache_size

A hint to the query planner about the total memory available for caching data (shared_buffers plus the OS page cache). It does not allocate memory; it only influences cost estimates.

effective_cache_size = 24GB   # Recommended: 75% of total RAM

Higher values make the planner prefer index scans (which benefit from caching) over sequential scans. Too low a value causes the planner to choose sequential scans even when an index scan would be faster.

huge_pages

Huge pages (2MB pages on Linux instead of 4KB) reduce TLB pressure and can improve throughput on large shared_buffers values (above 8GB).

huge_pages = try    # 'try' falls back gracefully; use 'on' to enforce

Linux OS setup (must be done before starting PostgreSQL):

# Calculate pages needed: shared_buffers / 2MB
# For shared_buffers = 16GB: 16384 MB / 2 MB = 8192 huge pages, add 10% buffer
echo 9000 > /proc/sys/vm/nr_hugepages

# Persist across reboots
echo "vm.nr_hugepages = 9000" >> /etc/sysctl.conf
sysctl -p

# Verify allocation
grep HugePages /proc/meminfo
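The page arithmetic in those shell commands, spelled out (the shell example rounds the final figure to a flat 9000):

```python
# shared_buffers = 16GB, Linux huge pages of 2MB each
shared_buffers_mb = 16 * 1024
huge_page_mb = 2

pages = shared_buffers_mb // huge_page_mb   # 8192 pages to cover shared_buffers
with_headroom = int(pages * 1.10)           # 9011 with the 10% buffer
print(pages, with_headroom)
```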

WAL & Checkpoint Settings

wal_level

Controls how much information is written to WAL.

wal_level = replica    # Minimum for streaming replication
wal_level = logical    # Required for logical replication (writes more)

wal_level = minimal disables replication and reduces WAL volume slightly. Use only for standalone servers where you never need PITR.

wal_buffers

Memory for WAL writes before flushing to disk. PostgreSQL auto-tunes this to 1/32 of shared_buffers, capped at 16MB.

wal_buffers = 64MB   # Manual override; auto value is usually fine

Rarely needs manual tuning. Increase only if you see frequent WALBufMapping or WALWrite wait events in pg_stat_activity.
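The auto-tuning rule above, as a sketch (the 64kB lower floor is omitted for brevity):

```python
# wal_buffers auto value: 1/32 of shared_buffers, capped at 16MB
def auto_wal_buffers_mb(shared_buffers_mb: int) -> int:
    return min(shared_buffers_mb // 32, 16)

print(auto_wal_buffers_mb(8 * 1024))   # 16 -> 8GB / 32 = 256MB, hits the cap
print(auto_wal_buffers_mb(256))        # 8  -> 256MB / 32 = 8MB, below the cap
```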

checkpoint_completion_target

Fraction of the checkpoint interval over which to spread checkpoint I/O. Reduces I/O spikes at checkpoint time.

checkpoint_completion_target = 0.9   # Recommended (default is 0.9 in PG14+)

With max_wal_size = 4GB and checkpoint_completion_target = 0.9, PostgreSQL spreads writes over 90% of the checkpoint interval instead of flushing all at once.

max_wal_size and min_wal_size

Control WAL retention between checkpoints. Larger values reduce checkpoint frequency (less I/O) at the cost of more WAL on disk and longer crash recovery time.

min_wal_size = 1GB     # Minimum WAL to retain (default 80MB)
max_wal_size = 8GB     # Triggers checkpoint when exceeded (default 1GB)

For write-heavy workloads, increase max_wal_size to reduce checkpoint frequency. Monitor checkpoint frequency:

SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;
-- checkpoints_req >> checkpoints_timed means max_wal_size is too small

full_page_writes

After a checkpoint, PostgreSQL writes the full page image of a modified page the first time it is touched. This protects against torn page writes when the OS crashes mid-write.

full_page_writes = on   # NEVER disable this

Disabling full_page_writes can cause unrecoverable data corruption after an OS crash. The only safe way to reduce full-page write overhead is to run on a filesystem or storage layer that guarantees atomic page writes (ZFS, some SAN configurations), and even then only if you fully understand the implications.


Query Planner Settings

random_page_cost and seq_page_cost

Control the planner's cost model for I/O. Lower values make the planner favor the corresponding access method.

# For NVMe/SSD storage:
random_page_cost = 1.1
seq_page_cost = 1.0

# For traditional HDD:
random_page_cost = 4.0
seq_page_cost = 1.0

The default random_page_cost = 4.0 is calibrated for spinning disk. On SSD, it causes the planner to undervalue index scans, leading to unnecessary sequential scans. Always set random_page_cost = 1.1 on SSD-based servers.

Override per session to diagnose planner choices:

SET random_page_cost = 1.1;
EXPLAIN ANALYZE SELECT ...;

effective_io_concurrency

Number of concurrent I/O operations the planner assumes the storage can handle. Affects prefetching during bitmap heap scans.

effective_io_concurrency = 200   # NVMe SSD (high parallelism)
effective_io_concurrency = 2     # Traditional HDD (low parallelism)
effective_io_concurrency = 1     # NFS/SAN (conservative)

JIT Compilation

JIT (Just-In-Time compilation via LLVM) can speed up CPU-intensive queries (complex aggregations, many expressions) but adds compilation overhead that hurts short OLTP queries.

jit = on                  # Enable JIT globally (default on since PG12)
jit_above_cost = 100000   # Only JIT-compile queries above this cost
jit_optimize_above_cost = 500000  # Apply expensive optimizations above this cost
jit_inline_above_cost = 500000    # Inline functions above this cost

For OLTP workloads where queries are fast and simple:

jit = off   # Disable entirely to avoid overhead

Check if JIT was used in a query:

EXPLAIN (ANALYZE, VERBOSE, FORMAT TEXT)
SELECT sum(total) FROM orders WHERE created_at > now() - interval '1 year';
-- Look for "JIT:" section in output

Parallelism Settings

PostgreSQL can parallelize sequential scans, aggregations, joins, and index scans.

# Total background workers available to the instance
max_worker_processes = 16           # Default 8; should be >= CPU cores

# Maximum parallel workers available for queries at any time
max_parallel_workers = 8            # Default 8; cap at physical CPU cores

# Workers per individual query node
max_parallel_workers_per_gather = 4 # Default 2; practical limit 4-8

# Minimum table size before considering parallel scan
min_parallel_table_scan_size = 8MB  # Default; lower to enable on smaller tables
min_parallel_index_scan_size = 512kB

# Include leader process in parallel work (default on)
parallel_leader_participation = on
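How these settings interact for a single Gather node can be sketched with a simplified model: the planner's worker count is capped by max_parallel_workers_per_gather, and at execution time the query only gets what remains of the instance-wide max_parallel_workers pool. (This ignores leader participation and per-table settings.)

```python
# Simplified worker-budget model for one Gather node
def workers_granted(planned: int, per_gather_cap: int,
                    pool_size: int, pool_in_use: int) -> int:
    planned = min(planned, per_gather_cap)          # cap per query node
    available = pool_size - pool_in_use             # instance-wide pool remaining
    return max(0, min(planned, available))

# Planner wants 6 workers, per-gather cap is 4, pool of 8 has 6 already busy:
print(workers_granted(planned=6, per_gather_cap=4, pool_size=8, pool_in_use=6))  # 2
```

This is why a query can run with fewer workers than EXPLAIN planned: the pool was partly consumed by other queries.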

Force parallelism for testing (dangerous in production):

SET max_parallel_workers_per_gather = 8;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
EXPLAIN ANALYZE SELECT count(*) FROM large_table;

Disable parallelism for a session (useful when debugging):

SET max_parallel_workers_per_gather = 0;

Connection Settings

max_connections

PostgreSQL creates one process per connection. High connection counts waste memory and cause lock contention.

max_connections = 200   # Keep below 300; use pgBouncer for more

Each idle connection consumes roughly 5MB of RAM in process overhead alone. With work_mem = 64MB, a single sort-heavy query can briefly use 64MB for each of its sort or hash nodes.
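Putting rough numbers on that (the 5MB figure and node count are the assumptions stated above):

```python
idle_overhead_mb = 5       # approximate per-process overhead of an idle connection
max_connections = 200
work_mem_mb = 64
sort_nodes = 3             # hypothetical sort-heavy query

idle_total_mb = max_connections * idle_overhead_mb   # 1000 MB spent just idling
one_query_peak_mb = work_mem_mb * sort_nodes         # 192 MB for one busy query
print(idle_total_mb, one_query_peak_mb)
```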

Use PgBouncer in transaction mode for OLTP:

# pgbouncer.ini
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20   # Connections to PostgreSQL per database/user pair
Reserve connections for superusers so DBAs can get in during emergencies. Note this is a PostgreSQL setting (postgresql.conf), not a PgBouncer one:

superuser_reserved_connections = 5

TCP Keepalives

Detect dead connections (e.g., after network partition) without relying on the application:

tcp_keepalives_idle = 60      # Start keepalives after 60s idle
tcp_keepalives_interval = 10  # Retry every 10s
tcp_keepalives_count = 6      # Drop connection after 6 failed probes (1 minute)
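With these values, the worst-case time to notice a dead peer works out as follows:

```python
idle = 60        # tcp_keepalives_idle: seconds of silence before the first probe
interval = 10    # tcp_keepalives_interval: seconds between probes
count = 6        # tcp_keepalives_count: failed probes before dropping

# 60s idle wait + 6 probes at 10s spacing = 120s from last activity
detect_seconds = idle + interval * count
print(detect_seconds)  # 120
```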

Monitor current connections and their state:

SELECT
    state,
    count(*),
    max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY count DESC;

-- Find idle connections older than 10 minutes
SELECT pid, usename, application_name, state, state_change, query
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';

Logging

Slow Query Logging

log_min_duration_statement = 1000   # Log queries taking > 1 second (ms)
                                     # Set to 0 to log all; -1 to disable

Statement-Level Logging

log_statement = 'ddl'   # Recommended for most production servers
# Options: none | ddl | mod | all
# 'ddl'  = CREATE, DROP, ALTER, TRUNCATE
# 'mod'  = ddl + INSERT, UPDATE, DELETE, COPY
# 'all'  = everything (very verbose, for debugging only)

Lock Logging

log_lock_waits = on       # Log if a query waits for a lock
deadlock_timeout = 1s     # Time before checking for deadlock (and logging wait)

Deadlocks are always logged when detected (the victim transaction is aborted with an error). Ordinary lock waits that are not deadlocks are only logged when log_lock_waits = on, once the wait exceeds deadlock_timeout:

# Also useful for identifying lock contention:
log_min_duration_statement = 500    # Catch queries slow due to lock waits

Query current lock waits:

SELECT
    blocked.pid                   AS blocked_pid,
    blocked.query                 AS blocked_query,
    blocking.pid                  AS blocking_pid,
    blocking.query                AS blocking_query,
    now() - blocked.query_start   AS wait_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

auto_explain

Automatically log EXPLAIN ANALYZE for slow queries without modifying application code:

# Load as a shared library (requires restart)
shared_preload_libraries = 'pg_stat_statements, auto_explain'

# auto_explain settings (in postgresql.conf or per session)
auto_explain.log_min_duration = 5000    # Log plans for queries > 5 seconds
auto_explain.log_analyze = on           # Include ANALYZE (actual vs estimated rows)
auto_explain.log_buffers = on           # Include buffer usage
auto_explain.log_format = text          # text | json | yaml | xml
auto_explain.log_verbose = off          # Include column-level output (very noisy)
auto_explain.log_nested_statements = off # Exclude PL/pgSQL internal queries
auto_explain.sample_rate = 1.0          # Sample 100% of queries; set lower under load
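To gauge what a lower sample_rate buys you, estimate the resulting log volume (the throughput figure here is purely illustrative):

```python
qps_over_threshold = 50    # assumed queries/sec slower than log_min_duration
sample_rate = 0.01         # log plans for ~1% of qualifying queries

plans_logged_per_minute = qps_over_threshold * sample_rate * 60
print(plans_logged_per_minute)  # 30.0 plans/minute instead of 3000
```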

Enable per session without restart:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = true;

OLTP vs OLAP Profiles

Two complete configuration profiles showing key differences.

OLTP Profile (32GB RAM, NVMe SSD, 200 connections)

# Memory
shared_buffers = 8GB
work_mem = 32MB
maintenance_work_mem = 1GB
effective_cache_size = 24GB
huge_pages = try

# WAL & Checkpoints
wal_level = replica
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB
full_page_writes = on

# Planner
random_page_cost = 1.1
seq_page_cost = 1.0
effective_io_concurrency = 200
jit = off                        # Short queries don't benefit; avoid overhead

# Parallelism - conservative for OLTP
max_worker_processes = 16
max_parallel_workers = 4
max_parallel_workers_per_gather = 2

# Connections
max_connections = 200
superuser_reserved_connections = 5
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6

# Logging
log_min_duration_statement = 500
log_statement = 'ddl'
log_lock_waits = on
deadlock_timeout = 1s

# Extensions
shared_preload_libraries = 'pg_stat_statements, auto_explain'
pg_stat_statements.track = all
auto_explain.log_min_duration = 2000
auto_explain.log_analyze = on
auto_explain.log_buffers = on

OLAP Profile (128GB RAM, NVMe SSD, 20 connections, analytics workload)

# Memory - larger allocations per query
shared_buffers = 32GB
work_mem = 2GB                   # Large sorts and hash joins for analytics
maintenance_work_mem = 4GB
effective_cache_size = 96GB
huge_pages = on

# WAL & Checkpoints - less frequent, larger checkpoints
wal_level = replica
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 16GB              # Fewer checkpoints for write-heavy ETL
min_wal_size = 4GB
full_page_writes = on

# Planner - favor parallel plans and large scans
random_page_cost = 1.1
seq_page_cost = 1.0
effective_io_concurrency = 200
jit = on                         # CPU-heavy aggregations benefit from JIT
jit_above_cost = 50000           # Lower threshold to engage JIT sooner

# Parallelism - aggressive for analytics
max_worker_processes = 32
max_parallel_workers = 24
max_parallel_workers_per_gather = 12
min_parallel_table_scan_size = 1MB
min_parallel_index_scan_size = 128kB
parallel_leader_participation = on

# Connections - low count, use pooling at application layer
max_connections = 50
superuser_reserved_connections = 5
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6

# Logging
log_min_duration_statement = 5000   # Only log very slow queries
log_statement = 'ddl'
log_lock_waits = on
deadlock_timeout = 5s

# Extensions
shared_preload_libraries = 'pg_stat_statements, auto_explain'
pg_stat_statements.track = all
auto_explain.log_min_duration = 10000
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_verbose = on        # Column-level detail useful for analytics
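A quick sanity check on the OLAP profile's memory budget: work_mem = 2GB is only safe because the connection count is low. The concurrency and node-count figures below are assumptions, and parallel workers would multiply the worst case further:

```python
ram_gb = 128
shared_buffers_gb = 32
work_mem_gb = 2
active_queries = 10        # assume ~10 sessions busy at once
nodes_per_query = 3        # sorts/hashes per analytic query, a guess

worst_case_gb = active_queries * nodes_per_query * work_mem_gb   # 60
headroom_gb = ram_gb - shared_buffers_gb - worst_case_gb         # 36 left for OS cache
print(worst_case_gb, headroom_gb)
```

Run the same arithmetic with 100 connections and the budget goes deeply negative, which is why the OLTP profile uses 32MB instead.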

Extensions

pg_stat_statements

Tracks cumulative execution statistics for all SQL statements. Essential for identifying slow queries.

# postgresql.conf (requires restart)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all          # top | all (all includes nested statements)
pg_stat_statements.max = 10000          # Max distinct statements tracked
pg_stat_statements.track_utility = on   # Track VACUUM, CREATE, etc.

-- After the restart, create the extension in each database:
CREATE EXTENSION pg_stat_statements;

-- Top 10 queries by total execution time
SELECT
    round(total_exec_time::numeric, 2) AS total_ms,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows,
    round(100.0 * total_exec_time / sum(total_exec_time) OVER (), 2) AS pct,
    left(query, 120) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Queries with high I/O (temp file usage)
SELECT query, calls, total_exec_time, temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;

-- Reset statistics
SELECT pg_stat_statements_reset();

pg_trgm

Trigram similarity enables fast fuzzy text search and LIKE/ILIKE acceleration with GIN or GiST indexes.

CREATE EXTENSION pg_trgm;

-- Similarity search (0 to 1 score)
SELECT name, similarity(name, 'PostgreSQL') AS sim
FROM products
WHERE similarity(name, 'PostgreSQL') > 0.3
ORDER BY sim DESC;

-- Accelerate LIKE/ILIKE with GIN index
CREATE INDEX idx_products_name_trgm ON products USING gin (name gin_trgm_ops);

-- Now this query uses the index:
EXPLAIN ANALYZE SELECT * FROM products WHERE name ILIKE '%ostgre%';

-- Word similarity (better for phrase matching)
SELECT word_similarity('PostgreSQL', 'Postgres SQL tutorial');

PostGIS

Spatial and geographic data types, indexing, and functions. Use GiST indexes for geometry columns.

CREATE EXTENSION postgis;

-- Spatial columns
CREATE TABLE locations (
    id      bigserial PRIMARY KEY,
    name    text,
    geom    geometry(Point, 4326)   -- WGS84 (x = longitude, y = latitude)
);

CREATE INDEX idx_locations_geom ON locations USING gist (geom);

-- Find points within 10km of a given point
SELECT name, ST_Distance(geom::geography, ST_MakePoint(-73.9857, 40.7484)::geography) AS dist_m
FROM locations
WHERE ST_DWithin(geom::geography, ST_MakePoint(-73.9857, 40.7484)::geography, 10000)
ORDER BY dist_m;

timescaledb

Automatically partitions time-series data into chunks, enables continuous aggregates, and provides compression.

CREATE EXTENSION timescaledb;

-- Convert a regular table to a hypertable (partitioned by time)
CREATE TABLE metrics (
    time        timestamptz NOT NULL,
    device_id   int,
    temperature double precision
);
SELECT create_hypertable('metrics', 'time', chunk_time_interval => interval '1 day');

-- Automatic compression for old chunks
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_orderby = 'time DESC',
    timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('metrics', interval '7 days');

-- Continuous aggregate (materialized; auto-refresh requires a policy)
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket, device_id, avg(temperature) AS avg_temp
FROM metrics
GROUP BY bucket, device_id;

-- Schedule automatic refreshes of the aggregate
SELECT add_continuous_aggregate_policy('metrics_hourly',
    start_offset => interval '3 hours',
    end_offset => interval '1 hour',
    schedule_interval => interval '1 hour');

pgcrypto

Cryptographic functions for hashing, encryption, and key generation.

CREATE EXTENSION pgcrypto;

-- Password hashing (bcrypt)
INSERT INTO users (email, password_hash)
VALUES ('user@example.com', crypt('user_password', gen_salt('bf', 12)));

-- Verify password
SELECT id FROM users
WHERE email = 'user@example.com'
  AND password_hash = crypt('supplied_password', password_hash);

-- Symmetric encryption (AES via pgp_sym_encrypt)
SELECT pgp_sym_encrypt('sensitive data', 'encryption_key');
SELECT pgp_sym_decrypt(encrypted_col, 'encryption_key') FROM secrets;

-- Generate a random UUID (built into core since PostgreSQL 13)
SELECT gen_random_uuid();

-- Generate cryptographically secure random bytes
SELECT encode(gen_random_bytes(32), 'hex');

auto_explain

Logs query execution plans automatically for slow queries. Configured as a shared library (see Logging section). No SQL setup required beyond loading the library.

Load temporarily in a session for debugging without a server restart:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;    -- Log everything in this session
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;

-- Run your query; check PostgreSQL logs for the plan
SELECT * FROM orders WHERE customer_id = 12345 ORDER BY created_at DESC LIMIT 100;

Sample only a fraction of queries under high load to reduce log volume:

auto_explain.sample_rate = 0.01   # Log plans for ~1% of qualifying queries