replication.md 17 KB

PostgreSQL Replication, Partitioning & FDW Reference

Table of Contents

  1. Streaming Replication
    • Primary Configuration
    • Replica Configuration
    • Synchronous vs Asynchronous
    • Monitoring Replication
    • Replication Slots
  2. Logical Replication
    • Publications
    • Subscriptions
    • Row Filters and Column Lists (PG15+)
    • Use Cases and Limitations
  3. Failover
    • Promoting a Standby
    • Timeline Switches
    • Connection Routing
  4. Table Partitioning
    • RANGE Partitioning
    • LIST Partitioning
    • HASH Partitioning
    • Sub-partitioning
    • Partition Maintenance
    • When to Partition
  5. Foreign Data Wrappers
    • postgres_fdw Setup
    • IMPORT FOREIGN SCHEMA
    • Performance and Pushdown

Streaming Replication

Primary Configuration

Edit postgresql.conf on the primary:

# Minimum required for streaming replication
wal_level = replica          # or 'logical' if you also need logical replication
max_wal_senders = 10         # number of concurrent standby connections
wal_keep_size = 1GB          # retain WAL to prevent standby falling behind
                             # prefer replication slots over this setting

# Optional but recommended
hot_standby_feedback = on    # prevents primary from vacuuming rows standby needs

Create a replication role on the primary:

CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';

Allow the standby in pg_hba.conf on the primary:

# TYPE  DATABASE        USER         ADDRESS          METHOD
host    replication     replicator   192.168.1.0/24   scram-sha-256

Reload after editing pg_hba.conf:

SELECT pg_reload_conf();

Replica Configuration

Take a base backup from the primary (run on standby host):

pg_basebackup \
  --host=primary-host \
  --username=replicator \
  --pgdata=/var/lib/postgresql/data \
  --wal-method=stream \
  --checkpoint=fast \
  --progress

Create postgresql.conf overrides or postgresql.auto.conf on the replica:

primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret'
primary_slot_name = 'replica1_slot'   # if using replication slots
hot_standby = on                       # allow read queries on replica
recovery_min_apply_delay = 0           # set to e.g. '30min' for delayed replica

Create the standby signal file (PG12+):

touch /var/lib/postgresql/data/standby.signal

Synchronous vs Asynchronous Replication

Asynchronous (default): primary commits without waiting for standby. Risk of data loss on primary failure equal to replication lag.

Synchronous: primary waits for at least one standby to confirm WAL receipt before returning to client.

# On primary postgresql.conf
synchronous_standby_names = 'replica1'
# or for ANY 1 of multiple standbys:
synchronous_standby_names = 'ANY 1 (replica1, replica2, replica3)'
# or require ALL listed:
synchronous_standby_names = 'FIRST 2 (replica1, replica2, replica3)'

Standby names come from the application_name in primary_conninfo:

primary_conninfo = 'host=primary port=5432 user=replicator application_name=replica1'

Trade-offs:

Mode Durability Write Latency Throughput
Async Data loss possible Low Highest
Sync (remote_write) WAL received, not flushed Medium High
Sync (on) WAL flushed to disk Higher Lower
Sync (remote_apply) Changes applied Highest Lowest
# Control sync level (default is 'on' = flush to standby disk)
synchronous_commit = remote_write   # faster, slight durability trade-off

Monitoring Replication

On the primary, query pg_stat_replication:

SELECT
    application_name,
    client_addr,
    state,                          -- startup, catchup, streaming
    sync_state,                     -- async, sync, potential
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    -- Replication lag in bytes
    (sent_lsn - replay_lsn) AS replay_lag_bytes,
    -- Replication lag in time (PG10+)
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

On the replica, check if it is in recovery and its LSN position:

SELECT
    pg_is_in_recovery(),
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn(),
    pg_last_xact_replay_timestamp(),
    -- Time lag (approximate)
    now() - pg_last_xact_replay_timestamp() AS replication_delay;

Alert when lag exceeds threshold:

-- Alert if replay lag > 30 seconds
SELECT application_name, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '30 seconds';

Replication Slots

Replication slots prevent the primary from removing WAL segments needed by a standby, eliminating the need for wal_keep_size tuning. The risk is unbounded WAL accumulation if a slot is abandoned.

Create a physical slot on the primary:

SELECT pg_create_physical_replication_slot('replica1_slot');

List all slots and check for lag:

SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    -- WAL retained by this slot in bytes
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_size
FROM pg_replication_slots;

Drop an abandoned slot to reclaim disk:

SELECT pg_drop_replication_slot('replica1_slot');

Set a safety limit to prevent disk exhaustion (PG13+):

max_slot_wal_keep_size = 10GB   # drop slot if WAL retention exceeds this

Logical Replication

Logical replication decodes WAL into row-level change streams. It allows selective table sync and works across major versions.

Publications

A publication defines what changes to export:

-- All tables, all operations
CREATE PUBLICATION pub_all FOR ALL TABLES;

-- Specific tables
CREATE PUBLICATION pub_orders FOR TABLE orders, order_items;

-- Specific operations only
CREATE PUBLICATION pub_inserts FOR TABLE events WITH (publish = 'insert');

-- With row filter (PG15+): only published rows matching WHERE
CREATE PUBLICATION pub_active_orders FOR TABLE orders
    WHERE (status != 'cancelled');

-- With column list (PG15+): only publish selected columns
CREATE PUBLICATION pub_orders_summary FOR TABLE orders (id, status, total, created_at);

Manage publications:

ALTER PUBLICATION pub_orders ADD TABLE shipments;
ALTER PUBLICATION pub_orders DROP TABLE order_items;
DROP PUBLICATION pub_orders;

-- Inspect
SELECT * FROM pg_publication;
SELECT * FROM pg_publication_tables;

The publisher must have wal_level = logical:

wal_level = logical
max_replication_slots = 10
max_wal_senders = 10

Subscriptions

On the subscriber database:

CREATE SUBSCRIPTION sub_orders
    CONNECTION 'host=primary-host dbname=mydb user=replicator password=secret'
    PUBLICATION pub_orders;

The subscriber creates a replication slot on the publisher automatically. The target tables must already exist with compatible schemas.

-- Disable/re-enable a subscription
ALTER SUBSCRIPTION sub_orders DISABLE;
ALTER SUBSCRIPTION sub_orders ENABLE;

-- Refresh after publisher adds tables
ALTER SUBSCRIPTION sub_orders REFRESH PUBLICATION;

-- Skip copying initial data (for ongoing sync only)
CREATE SUBSCRIPTION sub_orders
    CONNECTION '...'
    PUBLICATION pub_orders
    WITH (copy_data = false);

-- Drop subscription (also drops remote slot)
DROP SUBSCRIPTION sub_orders;

Monitor subscriptions:

-- On subscriber
SELECT * FROM pg_stat_subscription;

-- On publisher - logical slots
SELECT slot_name, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';

Limitations of Logical Replication

  • DDL changes are not replicated. Schema changes must be applied manually to subscribers before altering the publisher.
  • Sequences are not replicated. After failover, reset sequences on the new primary.
  • Large objects (pg_largeobject) are not replicated.
  • Conflict resolution is basic: by default, subscriber errors on unique constraint conflicts. Use ALTER SUBSCRIPTION ... SKIP to advance past a conflict LSN.
  • Requires REPLICA IDENTITY on tables without primary keys:
-- Full row image (slow, safe for tables without PK)
ALTER TABLE events REPLICA IDENTITY FULL;

-- Use a unique index as identity
ALTER TABLE events REPLICA IDENTITY USING INDEX events_uuid_idx;

Failover

Promoting a Standby

Trigger promotion using pg_promote() (PG12+, no file touch needed):

-- Connect to the standby and run:
SELECT pg_promote();

Or use pg_ctl:

pg_ctl promote -D /var/lib/postgresql/data

After promotion, the former standby becomes a normal read-write primary. Update primary_conninfo on remaining standbys to point to the new primary and restart them.

Timeline Switches

Every promotion increments the timeline ID. PostgreSQL uses timelines to track branching histories, allowing standbys to follow the correct WAL history.

-- Check current timeline on any server
SELECT timeline_id FROM pg_control_checkpoint();

-- View WAL segment filenames: first 8 hex chars = timeline
-- 000000020000000000000001 = timeline 2, segment 1

When a former primary comes back, configure it as a new standby using recovery_target_timeline = 'latest' (the default), which lets it follow the new timeline.

Connection Routing

HAProxy (layer 4, health-check based):

frontend postgres_write
    bind *:5432
    default_backend postgres_primary

backend postgres_primary
    option httpchk GET /primary  # Patroni health endpoint
    server pg1 192.168.1.1:5432 check port 8008
    server pg2 192.168.1.2:5432 check port 8008

backend postgres_replica
    option httpchk GET /replica
    server pg1 192.168.1.1:5432 check port 8008
    server pg2 192.168.1.2:5432 check port 8008

PgBouncer target switch: update [databases] section and reload:

[databases]
mydb = host=new-primary-ip port=5432 dbname=mydb
psql -p 6432 pgbouncer -c "RELOAD"

DNS-based: Update the DNS record for pg-primary.internal to point to the new primary's IP. Works well with short TTLs (30s) and application-level retry logic.


Table Partitioning

Declarative partitioning (PG10+) uses PARTITION BY on the parent table. The parent table itself holds no rows.

RANGE Partitioning

Most common for time-series and log data:

CREATE TABLE orders (
    id          bigserial,
    created_at  timestamptz NOT NULL,
    customer_id bigint,
    total       numeric(12,2)
) PARTITION BY RANGE (created_at);

-- Create partitions for each month
CREATE TABLE orders_2024_01
    PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_2024_02
    PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Catch-all default partition
CREATE TABLE orders_default
    PARTITION OF orders DEFAULT;

LIST Partitioning

Useful for discrete categorical values:

CREATE TABLE products (
    id     bigserial,
    region text NOT NULL,
    name   text
) PARTITION BY LIST (region);

CREATE TABLE products_us   PARTITION OF products FOR VALUES IN ('us', 'ca');
CREATE TABLE products_eu   PARTITION OF products FOR VALUES IN ('de', 'fr', 'uk');
CREATE TABLE products_apac PARTITION OF products FOR VALUES IN ('au', 'jp', 'sg');
CREATE TABLE products_other PARTITION OF products DEFAULT;

HASH Partitioning

Distributes rows evenly when there is no natural range or list split:

CREATE TABLE sessions (
    id      uuid NOT NULL,
    user_id bigint,
    data    jsonb
) PARTITION BY HASH (id);

-- 8 partitions, modulus = total count, remainder = partition number
CREATE TABLE sessions_0 PARTITION OF sessions FOR VALUES WITH (modulus 8, remainder 0);
CREATE TABLE sessions_1 PARTITION OF sessions FOR VALUES WITH (modulus 8, remainder 1);
-- ... through remainder 7

Sub-partitioning

Combine strategies: partition by month, then by region within each month:

CREATE TABLE events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    region     text NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01
    PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
    PARTITION BY LIST (region);

CREATE TABLE events_2024_01_us
    PARTITION OF events_2024_01
    FOR VALUES IN ('us');

Partition Pruning

The planner eliminates irrelevant partitions at plan time (static) or execution time (dynamic):

-- Enable/disable for debugging
SET enable_partition_pruning = on;  -- default on

EXPLAIN SELECT * FROM orders WHERE created_at >= '2024-06-01' AND created_at < '2024-07-01';
-- Should show only orders_2024_06 in the plan, not all partitions

Each partition should have its own indexes. Indexes on the parent do not cascade automatically (they do in PG11+ for primary keys and unique constraints created on the parent):

-- Create index on all existing partitions at once (PG11+ creates on parent + all children)
CREATE INDEX ON orders (customer_id);

Partition Maintenance

-- Add a new partition (no locking on existing data)
CREATE TABLE orders_2025_01
    PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Detach a partition (it becomes a standalone table, no data movement)
ALTER TABLE orders DETACH PARTITION orders_2023_01;
-- PG14+: detach concurrently (non-blocking)
ALTER TABLE orders DETACH PARTITION orders_2023_01 CONCURRENTLY;

-- Drop old data instantly (no vacuum needed)
DROP TABLE orders_2023_01;

-- Attach an existing table as a partition (verify constraint first)
ALTER TABLE orders_old ADD CONSTRAINT orders_old_check
    CHECK (created_at >= '2022-01-01' AND created_at < '2023-01-01');
ALTER TABLE orders ATTACH PARTITION orders_old
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

When to Partition

Partition when:

  • Table exceeds ~100M rows or 100GB and queries frequently filter on the partition key
  • You need instant bulk deletes (drop a partition vs DELETE + VACUUM)
  • You want to spread data across tablespaces on different disks
  • Autovacuum cannot keep up with a single large table

Do not partition just because a table is large. Partitioning adds overhead for queries that scan all partitions (no partition key filter). A well-indexed single table often outperforms a partitioned one for OLTP workloads.


Foreign Data Wrappers

FDWs allow PostgreSQL to query external data sources as if they were local tables.

postgres_fdw Setup

-- 1. Install extension
CREATE EXTENSION postgres_fdw;

-- 2. Define the remote server
CREATE SERVER remote_analytics
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (
        host 'analytics-db.internal',
        port '5432',
        dbname 'analytics'
    );

-- 3. Map local user to remote credentials
CREATE USER MAPPING FOR current_user
    SERVER remote_analytics
    OPTIONS (user 'readonly_user', password 'secret');

-- 4. Create individual foreign tables
CREATE FOREIGN TABLE remote_events (
    id         bigint,
    event_type text,
    created_at timestamptz,
    payload    jsonb
)
SERVER remote_analytics
OPTIONS (schema_name 'public', table_name 'events');

IMPORT FOREIGN SCHEMA

Import all (or selected) tables from a remote schema at once:

-- Import entire remote schema
IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_analytics
    INTO local_remote_schema;

-- Import only specific tables
IMPORT FOREIGN SCHEMA public
    LIMIT TO (events, pageviews, sessions)
    FROM SERVER remote_analytics
    INTO local_remote_schema;

-- Exclude specific tables
IMPORT FOREIGN SCHEMA public
    EXCEPT (internal_audit_log)
    FROM SERVER remote_analytics
    INTO local_remote_schema;

Performance and Pushdown

postgres_fdw pushes WHERE clauses, ORDER BY, LIMIT, and aggregates to the remote server when possible, reducing data transfer.

-- Check what gets pushed down with EXPLAIN VERBOSE
EXPLAIN (VERBOSE, ANALYZE)
SELECT event_type, count(*)
FROM remote_events
WHERE created_at > now() - interval '7 days'
GROUP BY event_type;
-- Look for "Remote SQL:" in the output

Join pushdown (PG14+): joins between two foreign tables on the same server are pushed down to a single remote query:

-- Both tables on same server -> single remote query
SELECT e.event_type, s.user_id
FROM remote_events e
JOIN remote_sessions s ON e.session_id = s.id
WHERE e.created_at > now() - interval '1 day';

Control pushdown behavior per server:

ALTER SERVER remote_analytics OPTIONS (
    use_remote_estimate 'true',   -- fetch remote row estimates for better plans
    fetch_size '10000'             -- rows fetched per round-trip (default 100)
);

Inspect all configured FDW objects:

SELECT srvname, srvfdw, srvoptions FROM pg_foreign_server;
SELECT * FROM pg_user_mappings;
SELECT foreign_table_schema, foreign_table_name, foreign_server_name
FROM information_schema.foreign_tables;