Edit postgresql.conf on the primary:
# Minimum required for streaming replication
wal_level = replica # or 'logical' if you also need logical replication
max_wal_senders = 10 # number of concurrent standby connections
wal_keep_size = 1GB # retain WAL to prevent standby falling behind
# prefer replication slots over this setting
# Optional but recommended
hot_standby_feedback = on # prevents primary from vacuuming rows standby needs
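Note that wal_level and max_wal_senders only take effect after a server restart, not a reload. One way to spot settings that are still waiting on a restart is the pg_settings view:

```sql
-- pending_restart is true for values changed in postgresql.conf
-- but not yet applied (column available since PG9.5)
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'max_wal_senders', 'wal_keep_size');
```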
Create a replication role on the primary:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';
Allow the standby in pg_hba.conf on the primary:
# TYPE DATABASE USER ADDRESS METHOD
host replication replicator 192.168.1.0/24 scram-sha-256
Reload after editing pg_hba.conf:
SELECT pg_reload_conf();
Take a base backup from the primary (run on standby host):
pg_basebackup \
  --host=primary-host \
  --username=replicator \
  --pgdata=/var/lib/postgresql/data \
  --wal-method=stream \
  --checkpoint=fast \
  --progress
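Alternatively, --write-recovery-conf (-R) tells pg_basebackup to write primary_conninfo into postgresql.auto.conf and create standby.signal itself, so the manual replica configuration steps can be skipped:

```shell
# Same backup, but also generate the standby configuration:
# -R writes primary_conninfo and creates standby.signal in the data directory
pg_basebackup \
  --host=primary-host \
  --username=replicator \
  --pgdata=/var/lib/postgresql/data \
  --wal-method=stream \
  --checkpoint=fast \
  --write-recovery-conf \
  --progress
```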
Configure the connection to the primary in postgresql.conf (or postgresql.auto.conf) on the replica:
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret'
primary_slot_name = 'replica1_slot' # if using replication slots
hot_standby = on # allow read queries on replica
recovery_min_apply_delay = 0 # set to e.g. '30min' for delayed replica
Create the standby signal file (PG12+):
touch /var/lib/postgresql/data/standby.signal
Asynchronous (default): primary commits without waiting for standby. Risk of data loss on primary failure equal to replication lag.
Synchronous: primary waits for at least one standby to confirm WAL receipt before returning to client.
# On primary postgresql.conf
synchronous_standby_names = 'replica1'
# or for ANY 1 of multiple standbys:
synchronous_standby_names = 'ANY 1 (replica1, replica2, replica3)'
# or FIRST 2: the two highest-priority listed standbys must confirm:
synchronous_standby_names = 'FIRST 2 (replica1, replica2, replica3)'
Standby names come from the application_name in primary_conninfo:
primary_conninfo = 'host=primary port=5432 user=replicator application_name=replica1'
Trade-offs:
| Mode | Durability | Write Latency | Throughput |
|---|---|---|---|
| Async | Data loss possible | Low | Highest |
| Sync (remote_write) | WAL received, not flushed | Medium | High |
| Sync (on) | WAL flushed to disk | Higher | Lower |
| Sync (remote_apply) | Changes applied | Highest | Lowest |
# Control sync level (default is 'on' = flush to standby disk)
synchronous_commit = remote_write # faster, slight durability trade-off
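synchronous_commit can also be changed per session or per transaction, so low-value writes can skip the synchronous wait while critical ones keep it. A sketch (the audit_log table is illustrative):

```sql
-- Per-transaction override: this commit does not wait for the standby
BEGIN;
SET LOCAL synchronous_commit = 'off';
INSERT INTO audit_log (event) VALUES ('low-value event');
COMMIT;
```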
On the primary, query pg_stat_replication:
SELECT
  application_name,
  client_addr,
  state,       -- startup, catchup, streaming
  sync_state,  -- async, sync, potential
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  -- Replication lag in bytes (pg_wal_lsn_diff works on all supported
  -- versions; the bare '-' operator on pg_lsn requires PG14+)
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
  -- Replication lag in time (PG10+)
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;
On the replica, check if it is in recovery and its LSN position:
SELECT
  pg_is_in_recovery(),
  pg_last_wal_receive_lsn(),
  pg_last_wal_replay_lsn(),
  pg_last_xact_replay_timestamp(),
  -- Time lag (approximate; grows while the primary is idle, since no
  -- new transactions arrive to replay)
  now() - pg_last_xact_replay_timestamp() AS replication_delay;
Alert when lag exceeds threshold:
-- Alert if replay lag > 30 seconds
SELECT application_name, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '30 seconds';
Replication slots prevent the primary from removing WAL segments needed by a standby, eliminating the need for wal_keep_size tuning. The risk is unbounded WAL accumulation if a slot is abandoned.
Create a physical slot on the primary:
SELECT pg_create_physical_replication_slot('replica1_slot');
List all slots and check for lag:
SELECT
  slot_name,
  slot_type,
  active,
  restart_lsn,
  confirmed_flush_lsn,
  -- WAL retained by this slot in bytes
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_size
FROM pg_replication_slots;
Drop an abandoned slot to reclaim disk:
SELECT pg_drop_replication_slot('replica1_slot');
Set a safety limit to prevent disk exhaustion (PG13+):
max_slot_wal_keep_size = 10GB # invalidate slots whose retained WAL exceeds this
Logical replication decodes WAL into row-level change streams. It allows selective table sync and works across major versions.
A publication defines what changes to export:
-- All tables, all operations
CREATE PUBLICATION pub_all FOR ALL TABLES;
-- Specific tables
CREATE PUBLICATION pub_orders FOR TABLE orders, order_items;
-- Specific operations only
CREATE PUBLICATION pub_inserts FOR TABLE events WITH (publish = 'insert');
-- With row filter (PG15+): only published rows matching WHERE
CREATE PUBLICATION pub_active_orders FOR TABLE orders
  WHERE (status != 'cancelled');
-- With column list (PG15+): only publish selected columns
CREATE PUBLICATION pub_orders_summary FOR TABLE orders (id, status, total, created_at);
Manage publications:
ALTER PUBLICATION pub_orders ADD TABLE shipments;
ALTER PUBLICATION pub_orders DROP TABLE order_items;
DROP PUBLICATION pub_orders;
-- Inspect
SELECT * FROM pg_publication;
SELECT * FROM pg_publication_tables;
The publisher must have wal_level = logical:
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
On the subscriber database:
CREATE SUBSCRIPTION sub_orders
  CONNECTION 'host=primary-host dbname=mydb user=replicator password=secret'
  PUBLICATION pub_orders;
The subscriber creates a replication slot on the publisher automatically. The target tables must already exist with compatible schemas.
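Since logical replication does not copy DDL, one way to create matching tables on the subscriber is a schema-only dump from the publisher (hostnames here are illustrative):

```shell
# Copy table definitions (no data) from the publisher to the subscriber
pg_dump --schema-only --table=orders --table=order_items \
  --host=primary-host mydb \
  | psql --host=subscriber-host mydb
```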
-- Disable/re-enable a subscription
ALTER SUBSCRIPTION sub_orders DISABLE;
ALTER SUBSCRIPTION sub_orders ENABLE;
-- Refresh after publisher adds tables
ALTER SUBSCRIPTION sub_orders REFRESH PUBLICATION;
-- Skip copying initial data (for ongoing sync only)
CREATE SUBSCRIPTION sub_orders
  CONNECTION '...'
  PUBLICATION pub_orders
  WITH (copy_data = false);
-- Drop subscription (also drops remote slot)
DROP SUBSCRIPTION sub_orders;
Monitor subscriptions:
-- On subscriber
SELECT * FROM pg_stat_subscription;
-- On publisher - logical slots
SELECT slot_name, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';
Large objects (pg_largeobject) are not replicated, and neither is DDL. Use ALTER SUBSCRIPTION ... SKIP to advance past a conflict LSN. Set REPLICA IDENTITY on tables without primary keys so UPDATE and DELETE changes can be replicated:
-- Full row image (slow, safe for tables without PK)
ALTER TABLE events REPLICA IDENTITY FULL;
-- Use a unique index as identity
ALTER TABLE events REPLICA IDENTITY USING INDEX events_uuid_idx;
Trigger promotion using pg_promote() (PG12+, no file touch needed):
-- Connect to the standby and run:
SELECT pg_promote();
Or use pg_ctl:
pg_ctl promote -D /var/lib/postgresql/data
After promotion, the former standby becomes a normal read-write primary. Update primary_conninfo on remaining standbys to point to the new primary and restart them.
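Repointing a remaining standby can be sketched as follows (the new-primary hostname and application_name are illustrative; note that primary_conninfo is reloadable in PG13+, while older versions need a restart):

```sql
-- On each remaining standby:
ALTER SYSTEM SET primary_conninfo =
  'host=new-primary port=5432 user=replicator application_name=replica2';
-- PG13+: a reload picks up the new conninfo
SELECT pg_reload_conf();
```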
Every promotion increments the timeline ID. PostgreSQL uses timelines to track branching histories, allowing standbys to follow the correct WAL history.
-- Check current timeline on any server
SELECT timeline_id FROM pg_control_checkpoint();
-- View WAL segment filenames: first 8 hex chars = timeline
-- 000000020000000000000001 = timeline 2, segment 1
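To see the timeline embedded in the current WAL filename on a primary:

```sql
-- The first 8 hex digits of the result encode the current timeline
-- (pg_current_wal_lsn() cannot be called on a server in recovery)
SELECT pg_walfile_name(pg_current_wal_lsn());
```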
When a former primary comes back, configure it as a new standby using recovery_target_timeline = 'latest' (the default), which lets it follow the new timeline.
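If the old primary's data directory has diverged (it accepted writes after the standby was promoted), pg_rewind can resynchronize it instead of taking a fresh base backup. A sketch, assuming wal_log_hints = on or data checksums were enabled before the failover, and a source-server user with permission to read the new primary:

```shell
# Stop the old primary, rewind its data directory against the new primary,
# then start it as a standby
pg_ctl stop -D /var/lib/postgresql/data
pg_rewind \
  --target-pgdata=/var/lib/postgresql/data \
  --source-server='host=new-primary port=5432 user=replicator dbname=postgres'
touch /var/lib/postgresql/data/standby.signal
pg_ctl start -D /var/lib/postgresql/data
```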
HAProxy (layer 4, health-check based):
frontend postgres_write
bind *:5432
default_backend postgres_primary
backend postgres_primary
option httpchk GET /primary # Patroni health endpoint
server pg1 192.168.1.1:5432 check port 8008
server pg2 192.168.1.2:5432 check port 8008
backend postgres_replica
option httpchk GET /replica
server pg1 192.168.1.1:5432 check port 8008
server pg2 192.168.1.2:5432 check port 8008
PgBouncer target switch: update [databases] section and reload:
[databases]
mydb = host=new-primary-ip port=5432 dbname=mydb
psql -p 6432 pgbouncer -c "RELOAD"
DNS-based: Update the DNS record for pg-primary.internal to point to the new primary's IP. Works well with short TTLs (30s) and application-level retry logic.
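A minimal retry loop in the spirit of that pattern, assuming the pg-primary.internal name from above (the user is illustrative):

```shell
# Wait until the DNS name resolves to a reachable primary before proceeding
until psql "host=pg-primary.internal dbname=mydb user=app" -c 'SELECT 1' >/dev/null 2>&1; do
  echo "primary not reachable, retrying in 2s..."
  sleep 2
done
```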
Declarative partitioning (PG10+) uses PARTITION BY on the parent table. The parent table itself holds no rows.
Most common for time-series and log data:
CREATE TABLE orders (
  id bigserial,
  created_at timestamptz NOT NULL,
  customer_id bigint,
  total numeric(12,2)
) PARTITION BY RANGE (created_at);
-- Create partitions for each month
CREATE TABLE orders_2024_01
  PARTITION OF orders
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02
  PARTITION OF orders
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Catch-all default partition
CREATE TABLE orders_default
  PARTITION OF orders DEFAULT;
Useful for discrete categorical values:
CREATE TABLE products (
  id bigserial,
  region text NOT NULL,
  name text
) PARTITION BY LIST (region);
CREATE TABLE products_us PARTITION OF products FOR VALUES IN ('us', 'ca');
CREATE TABLE products_eu PARTITION OF products FOR VALUES IN ('de', 'fr', 'uk');
CREATE TABLE products_apac PARTITION OF products FOR VALUES IN ('au', 'jp', 'sg');
CREATE TABLE products_other PARTITION OF products DEFAULT;
Distributes rows evenly when there is no natural range or list split:
CREATE TABLE sessions (
  id uuid NOT NULL,
  user_id bigint,
  data jsonb
) PARTITION BY HASH (id);
-- 8 partitions, modulus = total count, remainder = partition number
CREATE TABLE sessions_0 PARTITION OF sessions FOR VALUES WITH (modulus 8, remainder 0);
CREATE TABLE sessions_1 PARTITION OF sessions FOR VALUES WITH (modulus 8, remainder 1);
-- ... through remainder 7
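Writing out the remaining partitions by hand is repetitive; in psql, format() plus \gexec can generate the other six (a convenience sketch, psql-only):

```sql
-- Each result row is a CREATE TABLE statement; \gexec executes them
SELECT format(
  'CREATE TABLE sessions_%1$s PARTITION OF sessions FOR VALUES WITH (modulus 8, remainder %1$s)',
  i
)
FROM generate_series(2, 7) AS i
\gexec
```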
Combine strategies: partition by month, then by region within each month:
CREATE TABLE events (
  id bigserial,
  created_at timestamptz NOT NULL,
  region text NOT NULL
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_01
  PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
  PARTITION BY LIST (region);
CREATE TABLE events_2024_01_us
  PARTITION OF events_2024_01
  FOR VALUES IN ('us');
The planner eliminates irrelevant partitions at plan time (static) or execution time (dynamic):
-- Enable/disable for debugging
SET enable_partition_pruning = on; -- default on
EXPLAIN SELECT * FROM orders WHERE created_at >= '2024-06-01' AND created_at < '2024-07-01';
-- Should show only orders_2024_06 in the plan, not all partitions
Each partition needs its own indexes. In PG11+, CREATE INDEX on the partitioned parent automatically creates a matching index on every partition (including partitions added later); before PG11, indexes had to be created on each partition individually:
-- Create index on all existing partitions at once (PG11+ creates on parent + all children)
CREATE INDEX ON orders (customer_id);
-- Add a new partition (no locking on existing data)
CREATE TABLE orders_2025_01
  PARTITION OF orders
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
-- Detach a partition (it becomes a standalone table, no data movement)
ALTER TABLE orders DETACH PARTITION orders_2023_01;
-- PG14+: detach concurrently (non-blocking)
ALTER TABLE orders DETACH PARTITION orders_2023_01 CONCURRENTLY;
-- Drop old data instantly (no vacuum needed)
DROP TABLE orders_2023_01;
-- Attach an existing table as a partition (add a matching CHECK constraint
-- first so ATTACH PARTITION can skip its full-table validation scan)
ALTER TABLE orders_old ADD CONSTRAINT orders_old_check
  CHECK (created_at >= '2022-01-01' AND created_at < '2023-01-01');
ALTER TABLE orders ATTACH PARTITION orders_old
  FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
Partition when old data is removed in bulk (detaching or dropping a partition replaces slow DELETEs), when most queries filter on a natural partition key, or when a single table has grown so large that VACUUM and index maintenance struggle to keep up. Do not partition just because a table is large. Partitioning adds overhead for queries that scan all partitions (no partition key filter), and a well-indexed single table often outperforms a partitioned one for OLTP workloads.
FDWs allow PostgreSQL to query external data sources as if they were local tables.
-- 1. Install extension
CREATE EXTENSION postgres_fdw;
-- 2. Define the remote server
CREATE SERVER remote_analytics
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (
    host 'analytics-db.internal',
    port '5432',
    dbname 'analytics'
  );
-- 3. Map local user to remote credentials
CREATE USER MAPPING FOR current_user
  SERVER remote_analytics
  OPTIONS (user 'readonly_user', password 'secret');
-- 4. Create individual foreign tables
CREATE FOREIGN TABLE remote_events (
  id bigint,
  event_type text,
  created_at timestamptz,
  payload jsonb
)
SERVER remote_analytics
OPTIONS (schema_name 'public', table_name 'events');
Import all (or selected) tables from a remote schema at once:
-- Import entire remote schema
IMPORT FOREIGN SCHEMA public
  FROM SERVER remote_analytics
  INTO local_remote_schema;
-- Import only specific tables
IMPORT FOREIGN SCHEMA public
  LIMIT TO (events, pageviews, sessions)
  FROM SERVER remote_analytics
  INTO local_remote_schema;
-- Exclude specific tables
IMPORT FOREIGN SCHEMA public
  EXCEPT (internal_audit_log)
  FROM SERVER remote_analytics
  INTO local_remote_schema;
postgres_fdw pushes WHERE clauses, ORDER BY, LIMIT, and aggregates to the remote server when possible, reducing data transfer.
-- Check what gets pushed down with EXPLAIN VERBOSE
EXPLAIN (VERBOSE, ANALYZE)
SELECT event_type, count(*)
FROM remote_events
WHERE created_at > now() - interval '7 days'
GROUP BY event_type;
-- Look for "Remote SQL:" in the output
Join pushdown (PG9.6+): joins between two foreign tables on the same server can be pushed down as a single remote query:
-- Both tables on same server -> single remote query
SELECT e.event_type, s.user_id
FROM remote_events e
JOIN remote_sessions s ON e.session_id = s.id
WHERE e.created_at > now() - interval '1 day';
Control pushdown behavior per server:
ALTER SERVER remote_analytics OPTIONS (
  use_remote_estimate 'true',  -- fetch remote row estimates for better plans
  fetch_size '10000'           -- rows fetched per round-trip (default 100)
);
Inspect all configured FDW objects:
SELECT srvname, srvfdw, srvoptions FROM pg_foreign_server;
SELECT * FROM pg_user_mappings;
SELECT foreign_table_schema, foreign_table_name, foreign_server_name
FROM information_schema.foreign_tables;