Commit 79e265d

Merge pull request #80 from wireapp/ha-postgres-cluster-setup

WPB-19318: Add postgresql-cluster setup guide

2 parents 556e301 + d79d0b9

2 files changed: +289 -3

Lines changed: 274 additions & 0 deletions
# PostgreSQL High Availability Cluster - Quick Setup

## What You're Building

A three-node PostgreSQL cluster with one primary node (handling writes) and two replicas (ready to take over on failure). The system includes automatic failover via [repmgr](https://www.repmgr.org/docs/current/index.html) and split-brain protection to prevent data corruption during network partitions.

## Prerequisites

Three Ubuntu servers with static IP addresses and SSH access configured.

## Step 1: Define Your Inventory

Create or edit your inventory file at `ansible/inventory/offline/hosts.ini` to define your PostgreSQL servers and their configuration.

```ini
[all]
postgresql1 ansible_host=192.168.122.236
postgresql2 ansible_host=192.168.122.233
postgresql3 ansible_host=192.168.122.206

[postgresql:vars]
postgresql_network_interface = enp1s0
wire_dbname = wire-server
wire_user = wire-server

[postgresql]
postgresql1
postgresql2
postgresql3

[postgresql_rw]
postgresql1

[postgresql_ro]
postgresql2
postgresql3
```

The `postgresql_rw` group designates your primary node (accepts writes), while `postgresql_ro` contains replica nodes (follow the primary and can be promoted if needed). The network interface variable specifies which adapter to use for cluster communication.
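
If you are unsure which interface name to put in `postgresql_network_interface`, you can list the interfaces and their addresses on each node first:

```bash
# Compact per-interface overview (name, state, assigned addresses)
ip -br addr show
```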

## Step 2: Test Connectivity

Verify Ansible can reach all three servers:

```bash
d ansible all -i ansible/inventory/offline/hosts.ini -m ping
```

You should see three successful responses. If any node fails, check your SSH configuration and network connectivity.
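
A successful run looks roughly like this (one block per host; exact fields vary with the Ansible version):

```
postgresql1 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
postgresql2 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
postgresql3 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
```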

## Step 3: Deploy the Complete Cluster

Run the deployment playbook (takes 10-15 minutes):

```bash
d ansible-playbook -i ansible/inventory/offline/hosts.ini ansible/postgresql-deploy.yml
```

This playbook installs PostgreSQL 17 and repmgr on all nodes, configures the primary node, clones and registers the replicas, deploys split-brain detection, creates the Wire database with credentials, and runs health checks.

## Step 4: Verify the Cluster

Check cluster status from any node:

```bash
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
```

You should see one primary node (marked with an asterisk) and two standby nodes, all running. Standby nodes should list the primary as their upstream node.
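
For illustration, a healthy cluster built from the example inventory would report something along these lines (columns and connection strings vary by repmgr version and configuration):

```
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string
----+-------------+---------+-----------+-------------+----------+----------+----------+-------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 100      | 1        | host=192.168.122.236 user=repmgr dbname=repmgr
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=192.168.122.233 user=repmgr dbname=repmgr
 3  | postgresql3 | standby |   running | postgresql1 | default  | 100      | 1        | host=192.168.122.206 user=repmgr dbname=repmgr
```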

Verify critical services are running:

```bash
sudo systemctl status postgresql@17-main repmgrd@17-main detect-rogue-primary.timer
```

All three services should be active: postgresql (the database engine), repmgrd (cluster health monitoring and failover), and the detect-rogue-primary timer (checks for conflicting primaries every 30 seconds).

## Step 5: Check Replication Status

On the primary node, verify both replicas are receiving data via streaming replication:

```bash
sudo -u postgres psql -c "SELECT application_name, client_addr, state FROM pg_stat_replication;"
```

You should see two rows (one per replica) with state "streaming", confirming continuous replication to both standby nodes.
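
With the example inventory, healthy output looks something like:

```
 application_name | client_addr     |   state
------------------+-----------------+-----------
 postgresql2      | 192.168.122.233 | streaming
 postgresql3      | 192.168.122.206 | streaming
(2 rows)
```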

## Step 6: Wire Database Credentials

The playbook generates a secure password and stores it in the `wire-postgresql-external-secret` Kubernetes secret. Running `bin/offline-deploy.sh` automatically syncs this password to the `brig` and `galley` service secrets in `values/wire-server/secrets.yaml`.
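
After the sync, the relevant fragment of `values/wire-server/secrets.yaml` should look roughly like this (key paths taken from the sync script shown below; the value is the generated password):

```yaml
brig:
  secrets:
    pgPassword: "<generated-password>"
galley:
  secrets:
    pgPassword: "<generated-password>"
```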

If deploying or upgrading wire-server manually, use one of these methods:

### Option 1: Run the sync script in the adminhosts container

```bash
# Sync PostgreSQL password from K8s secret to secrets.yaml
./bin/sync-k8s-secret-to-wire-secrets.sh \
  wire-postgresql-external-secret \
  password \
  values/wire-server/secrets.yaml \
  .brig.secrets.pgPassword \
  .galley.secrets.pgPassword
```

This script retrieves the password from `wire-postgresql-external-secret`, updates the given YAML paths, creates a backup at `secrets.yaml.bak`, and verifies the updates; it works with any Kubernetes secret and YAML file.
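
To spot-check the result, you can compare the synced values against the source secret (this prints the password in plain text, so treat the output accordingly):

```bash
# Values now in the secrets file
grep -n 'pgPassword' values/wire-server/secrets.yaml

# Password as stored in the Kubernetes secret
kubectl get secret wire-postgresql-external-secret -n default \
  -o jsonpath='{.data.password}' | base64 --decode; echo
```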

### Option 2: Manual Password Override

Override passwords during helm installation:

```bash
# Retrieve password from Kubernetes secret
PG_PASSWORD=$(kubectl get secret wire-postgresql-external-secret \
  -n default \
  -o jsonpath='{.data.password}' | base64 --decode)

# Install/upgrade with password override
helm upgrade --install wire-server ./charts/wire-server \
  --namespace default \
  -f values/wire-server/values.yaml \
  -f values/wire-server/secrets.yaml \
  --set brig.secrets.pgPassword="${PG_PASSWORD}" \
  --set galley.secrets.pgPassword="${PG_PASSWORD}"
```
## Optional: Test Automatic Failover

To verify automatic failover works, simulate a primary failure by stopping the PostgreSQL service on the primary node:

```bash
sudo systemctl mask postgresql@17-main && sudo systemctl stop postgresql@17-main
```

Wait 30 seconds, then check cluster status from a replica node:

```bash
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
```

One replica should now be promoted to primary: the repmgrd daemons detect the failure, form a quorum, select the best candidate based on replication lag and priority, and promote it. The remaining replica automatically reconfigures to follow the new primary.
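
The old primary stays masked and stopped after this test. A sketch of bringing it back as a standby, assuming repmgr's standard `node rejoin` workflow (check the recovery documentation linked below before running this against production):

```bash
# On the old primary: allow the service to be started again
sudo systemctl unmask postgresql@17-main

# Rejoin as a standby of the new primary while PostgreSQL is still stopped.
# <new-primary-ip> is a placeholder; --force-rewind reconciles a diverged
# timeline using pg_rewind.
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf \
  node rejoin -d 'host=<new-primary-ip> user=repmgr dbname=repmgr' --force-rewind

# Confirm the node is back and following the new primary
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
```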

## What Happens During Failover

When the primary fails, the repmgrd daemons retry connections every five seconds. After five failures (~25 seconds), the replicas reach consensus that the primary is down (requiring a two-node quorum to prevent false positives). The system promotes the most up-to-date replica with the highest priority using PostgreSQL's native promotion function. The remaining replica detects the new primary and begins following it, while postgres-endpoint-manager updates Kubernetes services to point to the new primary.
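
These timings correspond to standard repmgr settings; a configuration in this spirit (illustrative values, not necessarily exactly what the playbook writes) looks like:

```ini
# /etc/repmgr/17-main/repmgr.conf (excerpt, illustrative)
failover=automatic       # repmgrd may promote a standby on its own
reconnect_attempts=5     # failed checks before declaring the primary dead
reconnect_interval=5     # seconds between attempts (5 x 5 = ~25 seconds)
priority=100             # higher priority wins ties during candidate selection
```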

## Recovery Time Expectations

The cluster recovers within 30 seconds of a primary failure. Applications running in Kubernetes may take up to 2 minutes to reconnect due to the postgres-endpoint-manager's polling cycle, resulting in 30 seconds to 2 minutes of database unavailability during an unplanned failover.
## Troubleshooting

### Common Issues During Deployment

#### PostgreSQL Service Won't Start

If PostgreSQL fails to start after deployment:

```bash
# Check PostgreSQL logs
sudo journalctl -u postgresql@17-main -f

# Verify configuration files exist and are readable
sudo test -f /etc/postgresql/17/main/postgresql.conf && echo "Config file exists" || echo "Config file missing"
sudo -u postgres test -r /etc/postgresql/17/main/postgresql.conf && echo "Config readable by postgres user" || echo "Config not readable"

# Check PostgreSQL configuration syntax
sudo -u postgres /usr/lib/postgresql/17/bin/postgres --config-file=/etc/postgresql/17/main/postgresql.conf -C shared_preload_libraries

# Check disk space and permissions
df -h /var/lib/postgresql/
sudo ls -la /var/lib/postgresql/17/main/
```

#### Replication Issues

If standby nodes show "disconnected" status:

```bash
# On the primary: check if replicas are connecting
sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

# Verify repmgr configuration
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf node check
```

### Post-Deployment Issues

#### Split-Brain Detection

If you suspect multiple primaries exist, check the cluster status on each node:

```bash
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show

# Check detect-rogue-primary logs
sudo journalctl -u detect-rogue-primary.timer -u detect-rogue-primary.service
```

#### Failed Automatic Failover

If failover doesn't happen automatically:

```bash
# Check repmgrd status and logs
sudo systemctl status repmgrd@17-main
sudo journalctl -u repmgrd@17-main -f

# Verify quorum requirements
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show --compact

# Manual failover if needed (run on the standby to be promoted)
sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf standby promote
```

#### Replication Lag Issues

If standby nodes fall behind:

```bash
# Check replication lag (in bytes) on the primary
sudo -u postgres psql -c "SELECT client_addr, sent_lsn, write_lsn, flush_lsn, replay_lsn, (sent_lsn - replay_lsn) AS lag FROM pg_stat_replication;"
```

#### Kubernetes Integration Issues

If the postgres-external chart fails to detect the primary:

```bash
# Check postgres-endpoint-manager logs
d kubectl logs -l app=postgres-endpoint-manager

# Verify service endpoints
d kubectl get endpoints postgresql-external-rw postgresql-external-ro

# Test connectivity from within the cluster
d kubectl run test-pg --rm -it --image=postgres:17 -- psql -h postgresql-external-rw -U wire-server -d wire-server
```

### Recovery Scenarios

For detailed recovery procedures covering complex scenarios such as the following, please refer to the [comprehensive PostgreSQL cluster recovery documentation](https://github.com/wireapp/wire-server-deploy/blob/main/offline/postgresql-cluster.md) in the wire-server-deploy repository:

- Complete cluster failure recovery
- Corrupt data node replacement
- Network partition recovery
- Emergency manual intervention
- Backup and restore procedures
- Disaster recovery planning

## Next Steps

With your PostgreSQL HA cluster running, integrate it with your Wire server deployment. The cluster runs independently outside Kubernetes. The postgres-endpoint-manager component (deployed with the postgres-external helm chart) keeps Kubernetes services pointed at the current primary, ensuring seamless connectivity during failover.

### Install the postgres-external helm chart

From the wire-server-deploy directory:

```bash
d helm upgrade --install postgresql-external ./charts/postgresql-external
```

This configures the `postgresql-external-rw` and `postgresql-external-ro` services with corresponding endpoints.
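
With the example inventory, the resulting endpoints should resemble the following (assuming postgresql1 is still the primary; column spacing varies):

```
$ d kubectl get endpoints postgresql-external-rw postgresql-external-ro
NAME                     ENDPOINTS                                   AGE
postgresql-external-rw   192.168.122.236:5432                        1m
postgresql-external-ro   192.168.122.233:5432,192.168.122.206:5432   1m
```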

The helm chart deploys a postgres-endpoint-manager cronjob that runs every 2 minutes to check the current primary. On failover, it updates the endpoints with the current primary and standbys; when the cluster is stable, each run acts as a health probe.

Check the cronjob logs:

```bash
# Get cronjob pods
d kubectl get pods -A | grep postgres-endpoint-manager

# Inspect logs to see primary/standby detection and updates
d kubectl logs postgres-endpoint-manager-29329300-6zphm # replace with actual pod name
```

See the [postgres-endpoint-manager](https://github.com/wireapp/postgres-endpoint-manager) repository for endpoint update details.
src/how-to/install/ansible-VMs.md

Lines changed: 15 additions & 3 deletions
```diff
@@ -22,13 +22,14 @@ kubespray, via ansible. This section covers installing VMs with ansible.
 | Cassandra | 3 | 2 | 4 | 80 |
 | MinIO | 3 | 1 | 2 | 400 |
 | ElasticSearch | 3 | 1 | 2 | 60 |
+| postgresql | 3 | 1 | 2 | 50 |
 | Kubernetes³ | 3 || 8 | 40 |
 | Restund⁴ | 2 | 1 | 2 | 10 |
-| **Per-Server Totals** || 11 CPU Cores | 18 GB Memory | 590 GB Disk Space |
+| **Per-Server Totals** || 12 CPU Cores | 20 GB Memory | 640 GB Disk Space |
 | Admin Host² | 1 | 1 | 4 | 40 |
 | Asset Host² | 1 | 1 | 4 | 100 |
-| **Per-Server Totals with<br/>Admin and Asset Hosts** || 13 CPU Cores | 26 GB Memory | 730 GB Disk Space |
-- ¹ Kubernetes hosts may need more ressources to support SFT (Conference Calling). See “Conference Calling Hardware Requirements” below.
+| **Per-Server Totals with<br/>Admin and Asset Hosts** || 14 CPU Cores | 28 GB Memory | 780 GB Disk Space |
+- ¹ Kubernetes hosts may need more resources to support SFT (Conference Calling). See “Conference Calling Hardware Requirements” below.
 - ² Admin and Asset Hosts can run on any one of the 3 servers, but that server must not allocate additional resources as indicated in the table above.
 - ³ Etcd is run inside of Kubernetes, hence no specific resource allocation
 - ⁴ Restund may be hosted on only 2 of the 3 servers, or all 3. Two nodes are enough to ensure high availability of Restund services
```
```diff
@@ -283,6 +284,17 @@ minio_network_interface=ens3
 
 This configures the Cargohold service with its IAM user credentials to securely manage the `assets` bucket.
 
+### Postgresql cluster
+
+To set up a high-availability PostgreSQL cluster for Wire Server, refer to the [PostgreSQL High Availability Cluster - Quick Setup](../administrate/postgresql-cluster.md) guide. This will walk you through deploying a three-node PostgreSQL cluster with automatic failover capabilities using repmgr.
+
+The setup includes:
+- Three PostgreSQL nodes (one primary, two replicas)
+- Automatic failover and split-brain protection
+- Integration with Kubernetes via the postgres-endpoint-manager
+
+After deploying the PostgreSQL cluster, make sure to install the `postgresql-external` helm chart as described in the guide to integrate it with your Wire Server deployment.
+
 ### Restund
 
 For instructions on how to install Restund, see [this page](restund.md#install-restund).
```