# PostgreSQL Replication, Partitioning & FDW Reference ## Table of Contents 1. [Streaming Replication](#streaming-replication) - Primary Configuration - Replica Configuration - Synchronous vs Asynchronous - Monitoring Replication - Replication Slots 2. [Logical Replication](#logical-replication) - Publications - Subscriptions - Row Filters and Column Lists (PG15+) - Use Cases and Limitations 3. [Failover](#failover) - Promoting a Standby - Timeline Switches - Connection Routing 4. [Table Partitioning](#table-partitioning) - RANGE Partitioning - LIST Partitioning - HASH Partitioning - Sub-partitioning - Partition Maintenance - When to Partition 5. [Foreign Data Wrappers](#foreign-data-wrappers) - postgres_fdw Setup - IMPORT FOREIGN SCHEMA - Performance and Pushdown --- ## Streaming Replication ### Primary Configuration Edit `postgresql.conf` on the primary: ```ini # Minimum required for streaming replication wal_level = replica # or 'logical' if you also need logical replication max_wal_senders = 10 # number of concurrent standby connections wal_keep_size = 1GB # retain WAL to prevent standby falling behind # prefer replication slots over this setting # Optional but recommended hot_standby_feedback = on # prevents primary from vacuuming rows standby needs ``` Create a replication role on the primary: ```sql CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret'; ``` Allow the standby in `pg_hba.conf` on the primary: ``` # TYPE DATABASE USER ADDRESS METHOD host replication replicator 192.168.1.0/24 scram-sha-256 ``` Reload after editing `pg_hba.conf`: ```sql SELECT pg_reload_conf(); ``` ### Replica Configuration Take a base backup from the primary (run on standby host): ```bash pg_basebackup \ --host=primary-host \ --username=replicator \ --pgdata=/var/lib/postgresql/data \ --wal-method=stream \ --checkpoint=fast \ --progress ``` Create `postgresql.conf` overrides or `postgresql.auto.conf` on the replica: ```ini primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret' primary_slot_name = 'replica1_slot' # if using replication slots hot_standby = on # allow read queries on replica recovery_min_apply_delay = 0 # set to e.g. '30min' for delayed replica ``` Create the standby signal file (PG12+): ```bash touch /var/lib/postgresql/data/standby.signal ``` ### Synchronous vs Asynchronous Replication **Asynchronous** (default): primary commits without waiting for standby. Risk of data loss on primary failure equal to replication lag. **Synchronous**: primary waits for at least one standby to confirm WAL receipt before returning to client. ```ini # On primary postgresql.conf synchronous_standby_names = 'replica1' # or for ANY 1 of multiple standbys: synchronous_standby_names = 'ANY 1 (replica1, replica2, replica3)' # or require ALL listed: synchronous_standby_names = 'FIRST 2 (replica1, replica2, replica3)' ``` Standby names come from the `application_name` in `primary_conninfo`: ```ini primary_conninfo = 'host=primary port=5432 user=replicator application_name=replica1' ``` Trade-offs: | Mode | Durability | Write Latency | Throughput | |------|-----------|---------------|------------| | Async | Data loss possible | Low | Highest | | Sync (remote_write) | WAL received, not flushed | Medium | High | | Sync (on) | WAL flushed to disk | Higher | Lower | | Sync (remote_apply) | Changes applied | Highest | Lowest | ```ini # Control sync level (default is 'on' = flush to standby disk) synchronous_commit = remote_write # faster, slight durability trade-off ``` ### Monitoring Replication On the primary, query `pg_stat_replication`: ```sql SELECT application_name, client_addr, state, -- startup, catchup, streaming sync_state, -- async, sync, potential sent_lsn, write_lsn, flush_lsn, replay_lsn, -- Replication lag in bytes (sent_lsn - replay_lsn) AS replay_lag_bytes, -- Replication lag in time (PG10+) write_lag, flush_lag, replay_lag FROM pg_stat_replication; ``` On the replica, check if it is in recovery and its LSN position: ```sql SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp(), -- Time lag (approximate) now() - pg_last_xact_replay_timestamp() AS replication_delay; ``` Alert when lag exceeds threshold: ```sql -- Alert if replay lag > 30 seconds SELECT application_name, replay_lag FROM pg_stat_replication WHERE replay_lag > interval '30 seconds'; ``` ### Replication Slots Replication slots prevent the primary from removing WAL segments needed by a standby, eliminating the need for `wal_keep_size` tuning. The risk is unbounded WAL accumulation if a slot is abandoned. Create a physical slot on the primary: ```sql SELECT pg_create_physical_replication_slot('replica1_slot'); ``` List all slots and check for lag: ```sql SELECT slot_name, slot_type, active, restart_lsn, confirmed_flush_lsn, -- WAL retained by this slot in bytes pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_size FROM pg_replication_slots; ``` Drop an abandoned slot to reclaim disk: ```sql SELECT pg_drop_replication_slot('replica1_slot'); ``` Set a safety limit to prevent disk exhaustion (PG13+): ```ini max_slot_wal_keep_size = 10GB # drop slot if WAL retention exceeds this ``` --- ## Logical Replication Logical replication decodes WAL into row-level change streams. It allows selective table sync and works across major versions. ### Publications A publication defines what changes to export: ```sql -- All tables, all operations CREATE PUBLICATION pub_all FOR ALL TABLES; -- Specific tables CREATE PUBLICATION pub_orders FOR TABLE orders, order_items; -- Specific operations only CREATE PUBLICATION pub_inserts FOR TABLE events WITH (publish = 'insert'); -- With row filter (PG15+): only published rows matching WHERE CREATE PUBLICATION pub_active_orders FOR TABLE orders WHERE (status != 'cancelled'); -- With column list (PG15+): only publish selected columns CREATE PUBLICATION pub_orders_summary FOR TABLE orders (id, status, total, created_at); ``` Manage publications: ```sql ALTER PUBLICATION pub_orders ADD TABLE shipments; ALTER PUBLICATION pub_orders DROP TABLE order_items; DROP PUBLICATION pub_orders; -- Inspect SELECT * FROM pg_publication; SELECT * FROM pg_publication_tables; ``` The publisher must have `wal_level = logical`: ```ini wal_level = logical max_replication_slots = 10 max_wal_senders = 10 ``` ### Subscriptions On the subscriber database: ```sql CREATE SUBSCRIPTION sub_orders CONNECTION 'host=primary-host dbname=mydb user=replicator password=secret' PUBLICATION pub_orders; ``` The subscriber creates a replication slot on the publisher automatically. The target tables must already exist with compatible schemas. ```sql -- Disable/re-enable a subscription ALTER SUBSCRIPTION sub_orders DISABLE; ALTER SUBSCRIPTION sub_orders ENABLE; -- Refresh after publisher adds tables ALTER SUBSCRIPTION sub_orders REFRESH PUBLICATION; -- Skip copying initial data (for ongoing sync only) CREATE SUBSCRIPTION sub_orders CONNECTION '...' PUBLICATION pub_orders WITH (copy_data = false); -- Drop subscription (also drops remote slot) DROP SUBSCRIPTION sub_orders; ``` Monitor subscriptions: ```sql -- On subscriber SELECT * FROM pg_stat_subscription; -- On publisher - logical slots SELECT slot_name, active, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_type = 'logical'; ``` ### Limitations of Logical Replication - DDL changes are not replicated. Schema changes must be applied manually to subscribers before altering the publisher. - Sequences are not replicated. After failover, reset sequences on the new primary. - Large objects (`pg_largeobject`) are not replicated. - Conflict resolution is basic: by default, subscriber errors on unique constraint conflicts. Use `ALTER SUBSCRIPTION ... SKIP` to advance past a conflict LSN. - Requires `REPLICA IDENTITY` on tables without primary keys: ```sql -- Full row image (slow, safe for tables without PK) ALTER TABLE events REPLICA IDENTITY FULL; -- Use a unique index as identity ALTER TABLE events REPLICA IDENTITY USING INDEX events_uuid_idx; ``` --- ## Failover ### Promoting a Standby Trigger promotion using `pg_promote()` (PG12+, no file touch needed): ```sql -- Connect to the standby and run: SELECT pg_promote(); ``` Or use `pg_ctl`: ```bash pg_ctl promote -D /var/lib/postgresql/data ``` After promotion, the former standby becomes a normal read-write primary. Update `primary_conninfo` on remaining standbys to point to the new primary and restart them. ### Timeline Switches Every promotion increments the timeline ID. PostgreSQL uses timelines to track branching histories, allowing standbys to follow the correct WAL history. ```sql -- Check current timeline on any server SELECT timeline_id FROM pg_control_checkpoint(); -- View WAL segment filenames: first 8 hex chars = timeline -- 000000020000000000000001 = timeline 2, segment 1 ``` When a former primary comes back, configure it as a new standby using `recovery_target_timeline = 'latest'` (the default), which lets it follow the new timeline. ### Connection Routing **HAProxy** (layer 4, health-check based): ``` frontend postgres_write bind *:5432 default_backend postgres_primary backend postgres_primary option httpchk GET /primary # Patroni health endpoint server pg1 192.168.1.1:5432 check port 8008 server pg2 192.168.1.2:5432 check port 8008 backend postgres_replica option httpchk GET /replica server pg1 192.168.1.1:5432 check port 8008 server pg2 192.168.1.2:5432 check port 8008 ``` **PgBouncer** target switch: update `[databases]` section and reload: ```ini [databases] mydb = host=new-primary-ip port=5432 dbname=mydb ``` ```bash psql -p 6432 pgbouncer -c "RELOAD" ``` **DNS-based**: Update the DNS record for `pg-primary.internal` to point to the new primary's IP. Works well with short TTLs (30s) and application-level retry logic. --- ## Table Partitioning Declarative partitioning (PG10+) uses `PARTITION BY` on the parent table. The parent table itself holds no rows. ### RANGE Partitioning Most common for time-series and log data: ```sql CREATE TABLE orders ( id bigserial, created_at timestamptz NOT NULL, customer_id bigint, total numeric(12,2) ) PARTITION BY RANGE (created_at); -- Create partitions for each month CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); CREATE TABLE orders_2024_02 PARTITION OF orders FOR VALUES FROM ('2024-02-01') TO ('2024-03-01'); -- Catch-all default partition CREATE TABLE orders_default PARTITION OF orders DEFAULT; ``` ### LIST Partitioning Useful for discrete categorical values: ```sql CREATE TABLE products ( id bigserial, region text NOT NULL, name text ) PARTITION BY LIST (region); CREATE TABLE products_us PARTITION OF products FOR VALUES IN ('us', 'ca'); CREATE TABLE products_eu PARTITION OF products FOR VALUES IN ('de', 'fr', 'uk'); CREATE TABLE products_apac PARTITION OF products FOR VALUES IN ('au', 'jp', 'sg'); CREATE TABLE products_other PARTITION OF products DEFAULT; ``` ### HASH Partitioning Distributes rows evenly when there is no natural range or list split: ```sql CREATE TABLE sessions ( id uuid NOT NULL, user_id bigint, data jsonb ) PARTITION BY HASH (id); -- 8 partitions, modulus = total count, remainder = partition number CREATE TABLE sessions_0 PARTITION OF sessions FOR VALUES WITH (modulus 8, remainder 0); CREATE TABLE sessions_1 PARTITION OF sessions FOR VALUES WITH (modulus 8, remainder 1); -- ... through remainder 7 ``` ### Sub-partitioning Combine strategies: partition by month, then by region within each month: ```sql CREATE TABLE events ( id bigserial, created_at timestamptz NOT NULL, region text NOT NULL ) PARTITION BY RANGE (created_at); CREATE TABLE events_2024_01 PARTITION OF events FOR VALUES FROM ('2024-01-01') TO ('2024-02-01') PARTITION BY LIST (region); CREATE TABLE events_2024_01_us PARTITION OF events_2024_01 FOR VALUES IN ('us'); ``` ### Partition Pruning The planner eliminates irrelevant partitions at plan time (static) or execution time (dynamic): ```sql -- Enable/disable for debugging SET enable_partition_pruning = on; -- default on EXPLAIN SELECT * FROM orders WHERE created_at >= '2024-06-01' AND created_at < '2024-07-01'; -- Should show only orders_2024_06 in the plan, not all partitions ``` Each partition should have its own indexes. Indexes on the parent do not cascade automatically (they do in PG11+ for primary keys and unique constraints created on the parent): ```sql -- Create index on all existing partitions at once (PG11+ creates on parent + all children) CREATE INDEX ON orders (customer_id); ``` ### Partition Maintenance ```sql -- Add a new partition (no locking on existing data) CREATE TABLE orders_2025_01 PARTITION OF orders FOR VALUES FROM ('2025-01-01') TO ('2025-02-01'); -- Detach a partition (it becomes a standalone table, no data movement) ALTER TABLE orders DETACH PARTITION orders_2023_01; -- PG14+: detach concurrently (non-blocking) ALTER TABLE orders DETACH PARTITION orders_2023_01 CONCURRENTLY; -- Drop old data instantly (no vacuum needed) DROP TABLE orders_2023_01; -- Attach an existing table as a partition (verify constraint first) ALTER TABLE orders_old ADD CONSTRAINT orders_old_check CHECK (created_at >= '2022-01-01' AND created_at < '2023-01-01'); ALTER TABLE orders ATTACH PARTITION orders_old FOR VALUES FROM ('2022-01-01') TO ('2023-01-01'); ``` ### When to Partition Partition when: - Table exceeds ~100M rows or 100GB and queries frequently filter on the partition key - You need instant bulk deletes (drop a partition vs DELETE + VACUUM) - You want to spread data across tablespaces on different disks - Autovacuum cannot keep up with a single large table Do not partition just because a table is large. Partitioning adds overhead for queries that scan all partitions (no partition key filter). A well-indexed single table often outperforms a partitioned one for OLTP workloads. --- ## Foreign Data Wrappers FDWs allow PostgreSQL to query external data sources as if they were local tables. ### postgres_fdw Setup ```sql -- 1. Install extension CREATE EXTENSION postgres_fdw; -- 2. Define the remote server CREATE SERVER remote_analytics FOREIGN DATA WRAPPER postgres_fdw OPTIONS ( host 'analytics-db.internal', port '5432', dbname 'analytics' ); -- 3. Map local user to remote credentials CREATE USER MAPPING FOR current_user SERVER remote_analytics OPTIONS (user 'readonly_user', password 'secret'); -- 4. Create individual foreign tables CREATE FOREIGN TABLE remote_events ( id bigint, event_type text, created_at timestamptz, payload jsonb ) SERVER remote_analytics OPTIONS (schema_name 'public', table_name 'events'); ``` ### IMPORT FOREIGN SCHEMA Import all (or selected) tables from a remote schema at once: ```sql -- Import entire remote schema IMPORT FOREIGN SCHEMA public FROM SERVER remote_analytics INTO local_remote_schema; -- Import only specific tables IMPORT FOREIGN SCHEMA public LIMIT TO (events, pageviews, sessions) FROM SERVER remote_analytics INTO local_remote_schema; -- Exclude specific tables IMPORT FOREIGN SCHEMA public EXCEPT (internal_audit_log) FROM SERVER remote_analytics INTO local_remote_schema; ``` ### Performance and Pushdown postgres_fdw pushes WHERE clauses, ORDER BY, LIMIT, and aggregates to the remote server when possible, reducing data transfer. ```sql -- Check what gets pushed down with EXPLAIN VERBOSE EXPLAIN (VERBOSE, ANALYZE) SELECT event_type, count(*) FROM remote_events WHERE created_at > now() - interval '7 days' GROUP BY event_type; -- Look for "Remote SQL:" in the output ``` Join pushdown (PG14+): joins between two foreign tables on the same server are pushed down to a single remote query: ```sql -- Both tables on same server -> single remote query SELECT e.event_type, s.user_id FROM remote_events e JOIN remote_sessions s ON e.session_id = s.id WHERE e.created_at > now() - interval '1 day'; ``` Control pushdown behavior per server: ```sql ALTER SERVER remote_analytics OPTIONS ( use_remote_estimate 'true', -- fetch remote row estimates for better plans fetch_size '10000' -- rows fetched per round-trip (default 100) ); ``` Inspect all configured FDW objects: ```sql SELECT srvname, srvfdw, srvoptions FROM pg_foreign_server; SELECT * FROM pg_user_mappings; SELECT foreign_table_schema, foreign_table_name, foreign_server_name FROM information_schema.foreign_tables; ```