8 changes: 8 additions & 0 deletions docs/operator.md
@@ -102,6 +102,14 @@ Name of the pgBackRest repository in the primary cluster this standby cluster co
| ---------- | ------- |
| :material-code-string: string | `repo1` |

### `standby.maxAcceptableLag`

The maximum amount of WAL data, in bytes, that the standby cluster is allowed to fall behind the primary cluster. When the WAL lag exceeds this value, the primary pod in the standby cluster is marked as unready, the cluster goes into the `initializing` state, and a `StandbyLagging` condition is set in the status. If unset, lag is not checked. Use the Kubernetes quantity format (for example, `10Mi`, `1Gi`).

| Value type | Example |
| ---------- | ------- |
| :material-code-string: string | `10Mi` |
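
A minimal Custom Resource excerpt with this option set might look like the following sketch; the repository name and the `10Mi` threshold are placeholder values, not recommendations:

```yaml
spec:
  standby:
    enabled: true
    repoName: repo1          # pgBackRest repository this standby restores WAL from
    maxAcceptableLag: 10Mi   # mark the standby primary unready once WAL lag exceeds 10 MiB
```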

### `secrets.customRootCATLSSecret.name`

Name of the secret with the custom root CA certificate and key for secure connections to the PostgreSQL server. See [Transport Layer Security (TLS)](TLS.md) for details.
5 changes: 3 additions & 2 deletions docs/standby-backup.md
@@ -59,9 +59,9 @@ The pgBackRest repo-based standby is the simplest one. The following is the arch

### Configure DR site

The configuration of the disaster recovery site is similar [to that of the Main site](#configure-main-site), with the only difference in standby settings.
The configuration of the disaster recovery site is similar [to that of the Main site](#configure-main-site), with the differences being in the standby settings.

The following manifest has `standby.enabled` set to `true` and points to the `repoName` where backups are (GCS in our case):
The following manifest has `standby.enabled` set to `true` and points to the `repoName` where backups are stored (GCS in our case). Optionally, add `standby.maxAcceptableLag` to enable [replication lag detection](standby.md#detect-replication-lag-for-standby-cluster). Check the [known limitation for this standby type](standby.md#known-limitation-for-a-repo-based-standby-cluster).

```yaml
metadata:
@@ -80,6 +80,7 @@ spec:
standby:
enabled: true
repoName: repo1
# maxAcceptableLag: 10Mi # optional: enables replication lag detection
```

Deploy the standby cluster by applying the manifest:
2 changes: 2 additions & 0 deletions docs/standby-streaming.md
@@ -80,10 +80,12 @@ Apart from setting certificates correctly, you should also set standby configura
standby:
enabled: true
host: main-ha.main-pg.svc
maxAcceptableLag: 10Mi # optional: enables replication lag detection
```

* `standby.enabled` controls whether the cluster runs as a standby cluster
* `standby.host` must point to the primary node of a Main cluster. In this example it is a `main-ha` Service in another namespace.
* `standby.maxAcceptableLag` (optional) enables replication lag detection. When the WAL lag exceeds this value, the standby primary pod is marked as unready, the cluster goes into the `initializing` state, and a `StandbyLagging` condition is set in the status.

Deploy the standby cluster by applying the manifest:

20 changes: 20 additions & 0 deletions docs/standby.md
@@ -10,3 +10,23 @@ Operators automate routine tasks and remove toil. Percona Operator for PostgreSQ
2. A streaming standby receives WAL files by connecting to the primary over the network. The primary site must be accessible over the network and allow secure authentication with TLS. The standby cluster must securely authenticate to the primary. For this reason, both sites must have the same custom TLS certificates. For the setup, you provide the host and port of the primary cluster and the certificates. Learn more about the setup in the [Standby cluster deployment based on streaming replication](standby-streaming.md) tutorial.
3. Streaming standby with external repository is the combination of the two previous types and is configured with the options from both. In this setup, the standby cluster streams WAL records from the primary. If the streaming replication falls behind, the cluster recovers WAL from the backup repo (see the sketch below).
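
To make this third option concrete, a minimal sketch of the `standby` section combining both mechanisms might look like the following; the Service hostname and repository name come from the tutorials linked above and stand in for your own values:

```yaml
spec:
  standby:
    enabled: true
    host: main-ha.main-pg.svc   # primary cluster Service to stream WAL records from
    repoName: repo1             # pgBackRest repository to recover WAL from if streaming falls behind
```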

## Detect replication lag for standby cluster

If your primary cluster generates a large volume of WAL files, the standby cluster may not be able to apply them quickly enough and may fall behind. This lag can result in replication issues and temporarily leave some data unavailable on the standby cluster.

You can enable replication lag detection for any standby type by setting the [`standby.maxAcceptableLag`](operator.md#standbymaxacceptablelag) option in the Custom Resource. When the WAL lag exceeds this value, the following occurs:

* The primary pod in the standby cluster is marked as unready
* The cluster goes into the `initializing` state
* The `StandbyLagging` condition is set in the cluster status. You can check the conditions with the `kubectl describe pg <cluster-name> -n <namespace>` command; see the illustrative excerpt below.
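
The resulting condition in the cluster status might look roughly like the excerpt below. This is a sketch only: the `StandbyLagging` type comes from this page, while the `status`, `reason`, and `message` values are illustrative assumptions and may differ in your Operator version.

```yaml
status:
  conditions:
    - type: StandbyLagging                                # condition type described above
      status: "True"                                      # assumed value while the lag persists
      reason: MaxAcceptableLagExceeded                    # hypothetical reason, for illustration only
      message: WAL lag exceeds standby.maxAcceptableLag   # hypothetical message, for illustration only
      lastTransitionTime: "2025-01-01T00:00:00Z"
```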

This helps you understand whether replication is lagging or broken. By surfacing the standby lag condition, you get a clear signal when your standby is not ready to serve traffic, enabling faster troubleshooting and preventing application downtime during disaster recovery scenarios.

### Known limitation for a repo-based standby cluster

For WAL lag detection to work in this standby type, the Operator must have access to the primary cluster. Therefore, WAL lag detection is available in these setups:

* Primary and standby clusters are deployed in the same namespace
* Primary and standby clusters are deployed in different namespaces, and the Operator is installed in cluster-wide mode