diff --git a/docs/operator.md b/docs/operator.md
index c7ece52c..648656cd 100755
--- a/docs/operator.md
+++ b/docs/operator.md
@@ -102,6 +102,14 @@ Name of the pgBackRest repository in the primary cluster this standby cluster co
 | ---------- | ------- |
 | :material-code-string: string | `repo1` |
 
+### `standby.maxAcceptableLag`
+
+The maximum amount of WAL data, in bytes, that the standby cluster is allowed to lag behind the primary cluster. When the WAL lag exceeds this value, the primary pod in the standby cluster is marked as unready, the cluster goes into the `initializing` state, and the `StandbyLagging` condition is set in the status. If unset, lag is not checked. Use the Kubernetes quantity format (for example, `10Mi`, `1Gi`).
+
+| Value type | Example |
+| ---------- | ------- |
+| :material-code-string: string | `10Mi` |
+
 ### `secrets.customRootCATLSSecret.name`
 
 Name of the secret with the custom root CA certificate and key for secure connections to the PostgreSQL server, see [Transport Layer Security (TLS)](TLS.md) for details.
diff --git a/docs/standby-backup.md b/docs/standby-backup.md
index 5c098ab1..db5bd945 100644
--- a/docs/standby-backup.md
+++ b/docs/standby-backup.md
@@ -59,9 +59,9 @@ The pgBackRest repo-based standby is the simplest one. The following is the arch
 
 ### Configure DR site
 
-The configuration of the disaster recovery site is similar [to that of the Main site](#configure-main-site), with the only difference in standby settings.
+The configuration of the disaster recovery site is similar [to that of the Main site](#configure-main-site); the difference is in the standby settings.
 
-The following manifest has `standby.enabled` set to `true` and points to the `repoName` where backups are (GCS in our case):
+The following manifest has `standby.enabled` set to `true` and points to the `repoName` where backups are stored (GCS in our case). Optionally, add `standby.maxAcceptableLag` to enable [replication lag detection](standby.md#detect-replication-lag-for-standby-cluster). Check the [known limitation for this standby type](standby.md#known-limitation-for-a-repo-based-standby-cluster).
 
 ```yaml
 metadata:
@@ -80,6 +80,7 @@ spec:
   standby:
     enabled: true
     repoName: repo1
+    # maxAcceptableLag: 10Mi # optional: enables replication lag detection
 ```
 
 Deploy the standby cluster by applying the manifest:
diff --git a/docs/standby-streaming.md b/docs/standby-streaming.md
index 1b4469f8..f5a15380 100644
--- a/docs/standby-streaming.md
+++ b/docs/standby-streaming.md
@@ -80,10 +80,12 @@ Apart from setting certificates correctly, you should also set standby configura
   standby:
     enabled: true
     host: main-ha.main-pg.svc
+    maxAcceptableLag: 10Mi # optional: enables replication lag detection
 ```
 
 * `standby.enabled` controls if it is a standby cluster or not
 * `standby.host` must point to the primary node of a Main cluster. In this example it is a `main-ha` Service in another namespace.
+* `standby.maxAcceptableLag` (optional) enables replication lag detection. When the WAL lag exceeds this value, the standby primary pod is marked as unready, the cluster goes into the `initializing` state, and a `StandbyLagging` condition is set in the status.
 
 Deploy the standby cluster by applying the manifest:
 
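The manifests in the two diffs above set `standby.maxAcceptableLag` at deploy time. As a minimal sketch of enabling the same option on a standby cluster that is already running, a `kubectl patch` call like the one below should be enough; the cluster name `standby2` and the namespace `dr-pg` are placeholders (not taken from the docs), and the `pg` short name is assumed to resolve to the cluster Custom Resource, as in the `kubectl describe pg` example used by these docs.

```bash
# Sketch: enable replication lag detection on a running standby cluster.
# "standby2" and "dr-pg" are placeholder cluster name and namespace.
kubectl patch pg standby2 -n dr-pg --type merge \
  -p '{"spec":{"standby":{"maxAcceptableLag":"10Mi"}}}'

# Verify that the value was applied to the Custom Resource
kubectl get pg standby2 -n dr-pg -o jsonpath='{.spec.standby.maxAcceptableLag}'
```
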
diff --git a/docs/standby.md b/docs/standby.md
index 2ae20ec1..fb2cba5b 100644
--- a/docs/standby.md
+++ b/docs/standby.md
@@ -10,3 +10,23 @@ Operators automate routine tasks and remove toil. Percona Operator for PostgreSQ
 2. A streaming standby receives WAL files by connecting to the primary over the network. The primary site must be accessible over the network and allow secure authentication with TLS. The standby cluster must securely authenticate to the primary. For this reason, both sites must have the same custom TLS certificates. For the setup, you provide the host and port of the primary cluster and the certificates. Learn more about the setup in the [Standby cluster deployment based on streaming replication](standby-streaming.md) tutorial.