The PostgreSQL buffer cache: how much memory the server reserves for caching data pages.
shared_buffers = 8GB # Recommended: 25% of total RAM
Rules of thumb: start at 25% of total RAM; going much beyond ~40% rarely helps, because PostgreSQL also relies on the OS page cache. Use pg_buffercache to measure actual cache usage.
Check cache hit ratio:
SELECT
    sum(heap_blks_hit) AS heap_hit,
    sum(heap_blks_read) AS heap_read,
    round(
        sum(heap_blks_hit)::numeric /
        nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2
    ) AS hit_ratio_pct
FROM pg_statio_user_tables;
-- Target: > 99% for OLTP, > 95% for OLAP
Identify which tables consume the most buffer space (requires pg_buffercache):
CREATE EXTENSION pg_buffercache;
SELECT
    relname,
    count(*) * 8192 / 1024 / 1024 AS cached_mb,
    round(count(*) * 100.0 / (SELECT count(*) FROM pg_buffercache), 2) AS pct_of_cache
FROM pg_buffercache bc
JOIN pg_class c ON bc.relfilenode = c.relfilenode
WHERE c.relkind = 'r'
GROUP BY relname
ORDER BY cached_mb DESC
LIMIT 20;
Memory granted per sort, hash, or merge operation. Each query node (sort, hash join, hash aggregate) can use up to work_mem individually.
work_mem = 64MB # Default 4MB is usually too low
Critical nuance: if a query has 5 sort nodes and 20 parallel workers, it can consume 5 * 20 * work_mem = 100x work_mem. For a 32GB server running 100 connections, setting work_mem = 320MB is catastrophic.
Sizing strategy:
1. Estimate concurrently active connections: max_connections * avg_active_fraction.
2. work_mem = remaining RAM (after shared_buffers and OS overhead) / (active_connections * avg_nodes_per_query).
For most OLTP systems: 16-64MB. For analytics: 256MB-1GB with fewer connections.
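The sizing arithmetic above can be sketched as a quick calculator. This is a rough estimate, not an official formula; every input (active fraction, nodes per query, OS overhead) is an assumption to replace with measurements from your own workload, and the function name is mine, not a PostgreSQL API.

```python
# Rough work_mem sizing sketch; all inputs are assumptions.

def suggest_work_mem_mb(total_ram_mb, shared_buffers_mb,
                        max_connections, avg_active_fraction,
                        avg_nodes_per_query, os_overhead_mb=2048):
    """Conservative per-operation work_mem in MB."""
    remaining = total_ram_mb - shared_buffers_mb - os_overhead_mb
    active = max(1, round(max_connections * avg_active_fraction))
    return remaining // (active * avg_nodes_per_query)

# 32GB server, 8GB shared_buffers, 200 connections, half of them
# active at peak, ~5 memory-hungry nodes per query:
print(suggest_work_mem_mb(32768, 8192, 200, 0.5, 5))  # -> 45 (MB)
```

The result lands in the 16-64MB band suggested above for OLTP; with fewer, heavier analytics connections the same formula yields values in the hundreds of MB.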
Override per session for specific heavy queries:
SET work_mem = '512MB';
SELECT ... FROM large_table ORDER BY ...;
RESET work_mem;
Monitor actual temporary file creation to detect under-allocation:
log_temp_files = 0 # 0 = log every temp file; N = only files larger than N kB; -1 = disable
-- Check existing temp file usage stats
SELECT query, temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;
Memory for maintenance operations: VACUUM, ANALYZE, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY, CLUSTER.
maintenance_work_mem = 2GB # Recommended: up to 10% RAM or 1-4GB
Larger values dramatically speed up CREATE INDEX and VACUUM on large tables. Unlike work_mem, maintenance operations rarely run many at a time, so you can set this aggressively. Note, however, that each autovacuum worker can also use up to this amount unless autovacuum_work_mem is set separately.
Override per session before a large index build:
SET maintenance_work_mem = '4GB';
CREATE INDEX CONCURRENTLY idx_events_created_at ON events (created_at);
RESET maintenance_work_mem;
A hint to the query planner about total memory available for caching (RAM + OS page cache). It does not allocate memory; it only influences cost estimates.
effective_cache_size = 24GB # Recommended: 75% of total RAM
Higher values make the planner prefer index scans (which benefit from caching) over sequential scans. Too low a value causes the planner to choose sequential scans even when an index scan would be faster.
Huge pages (2MB pages on Linux instead of 4KB) reduce TLB pressure and can improve throughput on large shared_buffers values (above 8GB).
huge_pages = try # 'try' falls back gracefully; use 'on' to enforce
Linux OS setup (must be done before starting PostgreSQL):
# Calculate pages needed: shared_buffers / 2MB
# For shared_buffers = 16GB: 16384 MB / 2 MB = 8192 huge pages, add 10% buffer
echo 9000 > /proc/sys/vm/nr_hugepages
# Persist across reboots
echo "vm.nr_hugepages = 9000" >> /etc/sysctl.conf
sysctl -p
# Verify allocation
grep HugePages /proc/meminfo
Controls how much information is written to WAL.
wal_level = replica # Minimum for streaming replication
wal_level = logical # Required for logical replication (writes more)
wal_level = minimal disables replication and reduces WAL volume slightly. Use only for standalone servers where you never need PITR.
Memory for WAL writes before flushing to disk. PostgreSQL auto-tunes this to 1/32 of shared_buffers, capped at 16MB.
wal_buffers = 64MB # Manual override; auto value is usually fine
Rarely needs manual tuning. Increase only if pg_stat_activity shows frequent waits on the WALBufMapping wait event (named WALBufMappingLock before PG13).
Fraction of the checkpoint interval over which to spread checkpoint I/O. Reduces I/O spikes at checkpoint time.
checkpoint_completion_target = 0.9 # Recommended (default is 0.9 in PG14+)
With max_wal_size = 4GB and checkpoint_completion_target = 0.9, PostgreSQL spreads writes over 90% of the checkpoint interval instead of flushing all at once.
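A small sketch of how the completion target turns a checkpoint burst into a sustained write rate. The dirty-buffer volume and the 5-minute checkpoint_timeout below are illustrative assumptions, not measured figures:

```python
# Checkpoint I/O spreading sketch; inputs are illustrative assumptions.

def checkpoint_write_rate_mb_s(dirty_mb, checkpoint_timeout_s, target):
    """Average write rate when checkpoint I/O is spread over
    `target` fraction of the checkpoint interval."""
    return dirty_mb / (checkpoint_timeout_s * target)

# 2GB of dirty buffers, 5-minute checkpoint_timeout:
print(round(checkpoint_write_rate_mb_s(2048, 300, 0.05)))  # burst: ~137 MB/s
print(round(checkpoint_write_rate_mb_s(2048, 300, 0.9)))   # spread: ~8 MB/s
```

The same volume of dirty data goes to disk either way; the higher target simply gives the storage a gentler, steadier rate.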
Control WAL retention between checkpoints. Larger values reduce checkpoint frequency (less I/O) at the cost of more WAL on disk and longer crash recovery time.
min_wal_size = 1GB # Minimum WAL to retain (default 80MB)
max_wal_size = 8GB # Triggers checkpoint when exceeded (default 1GB)
For write-heavy workloads, increase max_wal_size to reduce checkpoint frequency. Monitor checkpoint frequency:
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;
-- checkpoints_req >> checkpoints_timed means max_wal_size is too small
-- (in PG17+ these counters moved to pg_stat_checkpointer as num_timed / num_requested)
After a checkpoint, PostgreSQL writes the full page image of a modified page the first time it is touched. This protects against torn page writes when the OS crashes mid-write.
full_page_writes = on # NEVER disable this
Disabling full_page_writes can cause unrecoverable data corruption after an OS crash. The only safe way to reduce full-page write overhead is to use a filesystem or storage layer that guarantees atomic page writes (ZFS, some SAN configurations), and only if you fully understand the implications.
Control the planner's cost model for I/O. Lower values make the planner favor the corresponding access method.
# For NVMe/SSD storage:
random_page_cost = 1.1
seq_page_cost = 1.0
# For traditional HDD:
random_page_cost = 4.0
seq_page_cost = 1.0
The default random_page_cost = 4.0 is calibrated for spinning disk. On SSD, it causes the planner to undervalue index scans, leading to unnecessary sequential scans. Always set random_page_cost = 1.1 on SSD-based servers.
Override per session to diagnose planner choices:
SET random_page_cost = 1.1;
EXPLAIN ANALYZE SELECT ...;
Number of concurrent I/O operations the planner assumes the storage can handle. Affects bitmap index scan prefetching.
effective_io_concurrency = 200 # NVMe SSD (high parallelism)
effective_io_concurrency = 2 # Traditional HDD (low parallelism)
effective_io_concurrency = 1 # NFS/SAN (conservative)
JIT (Just-In-Time compilation via LLVM) can speed up CPU-intensive queries (complex aggregations, many expressions) but adds compilation overhead that hurts short OLTP queries.
jit = on # Enable JIT globally (default on since PG12)
jit_above_cost = 100000 # Only JIT-compile queries above this cost
jit_optimize_above_cost = 500000 # Apply expensive optimizations above this cost
jit_inline_above_cost = 500000 # Inline functions above this cost
For OLTP workloads where queries are fast and simple:
jit = off # Disable entirely to avoid overhead
Check if JIT was used in a query:
EXPLAIN (ANALYZE, VERBOSE, FORMAT TEXT)
SELECT sum(total) FROM orders WHERE created_at > now() - interval '1 year';
-- Look for "JIT:" section in output
PostgreSQL can parallelize sequential scans, aggregations, joins, and index scans.
# Total background workers available to the instance
max_worker_processes = 16 # Default 8; should be >= CPU cores
# Maximum parallel workers available for queries at any time
max_parallel_workers = 8 # Default 8; cap at physical CPU cores
# Workers per individual query node
max_parallel_workers_per_gather = 4 # Default 2; practical limit 4-8
# Minimum table size before considering parallel scan
min_parallel_table_scan_size = 8MB # Default; lower to enable on smaller tables
min_parallel_index_scan_size = 512kB
# Include leader process in parallel work (default on)
parallel_leader_participation = on
Force parallelism for testing (dangerous in production):
SET max_parallel_workers_per_gather = 8;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
EXPLAIN ANALYZE SELECT count(*) FROM large_table;
Disable parallelism for a session (useful when debugging):
SET max_parallel_workers_per_gather = 0;
PostgreSQL creates one process per connection. High connection counts waste memory and cause lock contention.
max_connections = 200 # Keep below 300; use pgBouncer for more
Each idle connection consumes ~5MB of RAM just for process overhead. With work_mem = 64MB, a single sort-heavy query with N sort nodes can briefly use 64MB * N.
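A back-of-the-envelope check of both effects. The ~5MB idle-process figure comes from the text above; the node and worker counts are assumptions for illustration:

```python
# Back-of-the-envelope memory planning for max_connections.

def worst_case_query_mb(work_mem_mb, memory_nodes, parallel_workers):
    # Each sort/hash node in each worker process may use up to work_mem.
    return work_mem_mb * memory_nodes * max(1, parallel_workers)

def idle_overhead_mb(connections, per_process_mb=5):
    # Assumed ~5MB of process overhead per idle backend.
    return connections * per_process_mb

print(worst_case_query_mb(64, 5, 20))  # -> 6400 (MB) for one query
print(idle_overhead_mb(200))           # -> 1000 (MB) for 200 idle backends
```

The 6400MB figure is the same 100x multiplier described earlier: 5 nodes times 20 workers, each entitled to a full work_mem.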
Use PgBouncer in transaction mode for OLTP:
# pgbouncer.ini
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20 # Connections to PostgreSQL per database/user pair
# In postgresql.conf (not pgbouncer.ini): reserve connections for superusers
superuser_reserved_connections = 5
Detect dead connections (e.g., after network partition) without relying on the application:
tcp_keepalives_idle = 60 # Start keepalives after 60s idle
tcp_keepalives_interval = 10 # Retry every 10s
tcp_keepalives_count = 6 # Drop connection after 6 failed probes (~1 minute of probing)
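Taken together, the three settings determine how long a dead peer lingers. A quick sketch of the arithmetic, using the values above:

```python
# Worst-case dead-peer detection window from TCP keepalive settings.

def dead_peer_detection_s(idle_s, interval_s, count):
    """Seconds from last activity until the kernel drops
    a dead connection: idle wait plus all failed probes."""
    return idle_s + interval_s * count

print(dead_peer_detection_s(60, 10, 6))  # -> 120 seconds
```

So with this configuration, a connection whose peer silently vanished is reclaimed roughly two minutes after its last activity.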
Monitor current connections and their state:
SELECT
    state,
    count(*),
    max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY count(*) DESC;
-- Find idle connections older than 10 minutes
SELECT pid, usename, application_name, state, state_change, query
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';
log_min_duration_statement = 1000 # Log queries taking > 1 second (ms)
# Set to 0 to log all; -1 to disable
log_statement = 'ddl' # Recommended for most production servers
# Options: none | ddl | mod | all
# 'ddl' = CREATE, DROP, ALTER, TRUNCATE
# 'mod' = ddl + INSERT, UPDATE, DELETE, COPY
# 'all' = everything (very verbose, for debugging only)
log_lock_waits = on # Log if a query waits for a lock
deadlock_timeout = 1s # Time before checking for deadlock (and logging wait)
Deadlocks are always logged automatically (the victim transaction is aborted with an error). Ordinary lock waits longer than deadlock_timeout are only logged when log_lock_waits = on:
# Also useful for identifying lock contention:
log_min_duration_statement = 500 # Catch queries slow due to lock waits
Query current lock waits:
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    now() - blocked.query_start AS wait_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
Automatically log EXPLAIN ANALYZE for slow queries without modifying application code:
# Load as a shared library (requires restart)
shared_preload_libraries = 'pg_stat_statements, auto_explain'
# auto_explain settings (in postgresql.conf or per session)
auto_explain.log_min_duration = 5000 # Log plans for queries > 5 seconds
auto_explain.log_analyze = on # Include ANALYZE (actual vs estimated rows)
auto_explain.log_buffers = on # Include buffer usage
auto_explain.log_format = text # text | json | yaml | xml
auto_explain.log_verbose = off # Include column-level output (very noisy)
auto_explain.log_nested_statements = off # Exclude PL/pgSQL internal queries
auto_explain.sample_rate = 1.0 # Sample 100% of queries; set lower under load
Enable per session without restart:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = true;
Two complete configuration profiles showing key differences: the first targets OLTP on a server with roughly 32GB RAM, the second analytics/OLAP on roughly 128GB.
# ---- OLTP profile ----
# Memory
shared_buffers = 8GB
work_mem = 32MB
maintenance_work_mem = 1GB
effective_cache_size = 24GB
huge_pages = try
# WAL & Checkpoints
wal_level = replica
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB
full_page_writes = on
# Planner
random_page_cost = 1.1
seq_page_cost = 1.0
effective_io_concurrency = 200
jit = off # Short queries don't benefit; avoid overhead
# Parallelism - conservative for OLTP
max_worker_processes = 16
max_parallel_workers = 4
max_parallel_workers_per_gather = 2
# Connections
max_connections = 200
superuser_reserved_connections = 5
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6
# Logging
log_min_duration_statement = 500
log_statement = 'ddl'
log_lock_waits = on
deadlock_timeout = 1s
# Extensions
shared_preload_libraries = 'pg_stat_statements, auto_explain'
pg_stat_statements.track = all
auto_explain.log_min_duration = 2000
auto_explain.log_analyze = on
auto_explain.log_buffers = on
# ---- Analytics / OLAP profile ----
# Memory - larger allocations per query
shared_buffers = 32GB
work_mem = 2GB # Large sorts and hash joins for analytics
maintenance_work_mem = 4GB
effective_cache_size = 96GB
huge_pages = on
# WAL & Checkpoints - less frequent, larger checkpoints
wal_level = replica
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 16GB # Fewer checkpoints for write-heavy ETL
min_wal_size = 4GB
full_page_writes = on
# Planner - favor parallel plans and large scans
random_page_cost = 1.1
seq_page_cost = 1.0
effective_io_concurrency = 200
jit = on # CPU-heavy aggregations benefit from JIT
jit_above_cost = 50000 # Lower threshold to engage JIT sooner
# Parallelism - aggressive for analytics
max_worker_processes = 32
max_parallel_workers = 24
max_parallel_workers_per_gather = 12
min_parallel_table_scan_size = 1MB
min_parallel_index_scan_size = 128kB
parallel_leader_participation = on
# Connections - low count, use pooling at application layer
max_connections = 50
superuser_reserved_connections = 5
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6
# Logging
log_min_duration_statement = 5000 # Only log very slow queries
log_statement = 'ddl'
log_lock_waits = on
deadlock_timeout = 5s
# Extensions
shared_preload_libraries = 'pg_stat_statements, auto_explain'
pg_stat_statements.track = all
auto_explain.log_min_duration = 10000
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_verbose = on # Column-level detail useful for analytics
Tracks cumulative execution statistics for all SQL statements. Essential for identifying slow queries.
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all # top | all (includes nested statements)
pg_stat_statements.max = 10000 # Max distinct statements tracked
pg_stat_statements.track_utility = on # Track VACUUM, CREATE, etc.
CREATE EXTENSION pg_stat_statements;
-- Top 10 queries by total execution time
SELECT
    round(total_exec_time::numeric, 2) AS total_ms,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows,
    round(100.0 * total_exec_time / sum(total_exec_time) OVER (), 2) AS pct,
    left(query, 120) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Queries with high I/O (temp file usage)
SELECT query, calls, total_exec_time, temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;
-- Reset statistics
SELECT pg_stat_statements_reset();
Trigram similarity enables fast fuzzy text search and LIKE/ILIKE acceleration with GIN or GiST indexes.
CREATE EXTENSION pg_trgm;
-- Similarity search (0 to 1 score)
SELECT name, similarity(name, 'PostgreSQL') AS sim
FROM products
WHERE similarity(name, 'PostgreSQL') > 0.3
ORDER BY sim DESC;
-- Accelerate LIKE/ILIKE with GIN index
CREATE INDEX idx_products_name_trgm ON products USING gin (name gin_trgm_ops);
-- Now this query uses the index:
EXPLAIN ANALYZE SELECT * FROM products WHERE name ILIKE '%ostgre%';
-- Word similarity (better for phrase matching)
SELECT word_similarity('PostgreSQL', 'Postgres SQL tutorial');
Spatial and geographic data types, indexing, and functions. Use GiST indexes for geometry columns.
CREATE EXTENSION postgis;
-- Spatial columns
CREATE TABLE locations (
    id bigserial PRIMARY KEY,
    name text,
    geom geometry(Point, 4326) -- WGS84 (x = longitude, y = latitude)
);
CREATE INDEX idx_locations_geom ON locations USING gist (geom);
-- Find points within 10km of a given point
SELECT name,
    ST_Distance(geom::geography,
        ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography) AS dist_m
FROM locations
WHERE ST_DWithin(geom::geography,
    ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography, 10000)
ORDER BY dist_m;
Automatically partitions time-series data into chunks, enables continuous aggregates, and provides compression.
CREATE EXTENSION timescaledb;
-- Convert a regular table to a hypertable (partitioned by time)
CREATE TABLE metrics (
    time timestamptz NOT NULL,
    device_id int,
    temperature double precision
);
SELECT create_hypertable('metrics', 'time', chunk_time_interval => interval '1 day');
-- Automatic compression for old chunks
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_orderby = 'time DESC',
    timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('metrics', interval '7 days');
-- Continuous aggregate (materialized, auto-refreshed)
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket, device_id, avg(temperature) AS avg_temp
FROM metrics
GROUP BY bucket, device_id;
Cryptographic functions for hashing, encryption, and key generation.
CREATE EXTENSION pgcrypto;
-- Password hashing (bcrypt)
INSERT INTO users (email, password_hash)
VALUES ('user@example.com', crypt('user_password', gen_salt('bf', 12)));
-- Verify password
SELECT id FROM users
WHERE email = 'user@example.com'
AND password_hash = crypt('supplied_password', password_hash);
-- Symmetric encryption (AES via pgp_sym_encrypt)
SELECT pgp_sym_encrypt('sensitive data', 'encryption_key');
SELECT pgp_sym_decrypt(encrypted_col, 'encryption_key') FROM secrets;
-- Generate random UUID
SELECT gen_random_uuid();
-- Generate cryptographically secure random bytes
SELECT encode(gen_random_bytes(32), 'hex');
Logs query execution plans automatically for slow queries. Configured as a shared library (see Logging section). No SQL setup required beyond loading the library.
Load temporarily in a session for debugging without a server restart:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0; -- Log everything in this session
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
-- Run your query; check PostgreSQL logs for the plan
SELECT * FROM orders WHERE customer_id = 12345 ORDER BY created_at DESC LIMIT 100;
Sample only a fraction of queries under high load to reduce log volume:
auto_explain.sample_rate = 0.01 # Log plans for ~1% of qualifying queries