1 change: 1 addition & 0 deletions src/current/cockroachcloud/create-an-advanced-cluster.md
@@ -144,6 +144,7 @@ Click **Create cluster**. Your cluster will be created in approximately 20-30 mi
To start using your CockroachDB {{ site.data.products.advanced }} cluster, refer to:

- [Connect to your cluster]({% link cockroachcloud/connect-to-your-cluster.md %})
- Run the [fault tolerance demo]({% link {{ site.versions["stable"] }}/demo-cockroachdb-resilience.md %}#run-a-guided-demo-in-cockroachdb-cloud)

Review comment:

I noticed that the headline that results in this page anchor uses the macro {{ site.data.products.cloud }}, but it's hard-coded here in the anchor. That means if we ever changed `cloud`, the anchor would break. This is obviously unlikely, and perhaps we'd catch it with an automated link scanner, but it suggests that the current system for linking within pages is lacking. It would be better if we had some layer of indirection for each headline that allowed us to change its name without changing its id. Then we'd use the id to look up the current name and generate the anchor on the fly. Obviously we wouldn't do any of that in this PR. Just food for thought.

@jhlodin (Contributor Author), Mar 2, 2026:

Yep, this is a known limitation of the current way Jekyll/Liquid handles variables. The band-aid solution is that our CI/CD runs an internal link check for anchors against the rendered HTML before publishing, so we'll get notified in this situation and be able to fix it in flight.

That said, this is another thing we're hoping our new site has better solutions for.

- [Manage access]({% link cockroachcloud/managing-access.md %})
- [Deploy a Python To-Do App with Flask, Kubernetes, and CockroachDB {{ site.data.products.cloud }}]({% link cockroachcloud/deploy-a-python-to-do-app-with-flask-kubernetes-and-cockroachcloud.md %})
- For a multi-region cluster, it is important to choose the most appropriate [survival goal]({% link {{site.current_cloud_version}}/multiregion-survival-goals.md %}) for each database and the most appropriate [table locality]({% link {{site.current_cloud_version}}/table-localities.md %}) for each table. Otherwise, your cluster may experience unexpected latency and reduced resiliency. For more information, refer to [Multi-Region Capabilities Overview]({% link {{ site.current_cloud_version}}/multiregion-overview.md %}).
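As an illustration of the survival-goal and table-locality settings mentioned above (the database and table names here are hypothetical), both are configured with DDL statements:

~~~sql
-- Hypothetical database name; keep this database available through a full region outage.
ALTER DATABASE mydb SURVIVE REGION FAILURE;

-- Hypothetical table name; optimize access per row based on the row's home region.
ALTER TABLE mydb.users SET LOCALITY REGIONAL BY ROW;
~~~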
4 changes: 4 additions & 0 deletions src/current/releases/cloud.md
Review comment (Contributor Author):

This page is woefully out of date. There is an ongoing discussion to either remove this page altogether or commit to better maintenance, but for right now it's probably better to add the note than not.

@@ -14,6 +14,10 @@ Get future release notes emailed to you:

{% include marketo.html formId=1083 %}

## Feb 24, 2026

CockroachDB {{ site.data.products.cloud }} {{ site.data.products.advanced }} users can now run a built-in [fault tolerance demo]({% link {{ site.versions["stable"] }}/demo-cockroachdb-resilience.md %}#run-a-guided-demo-in-cockroachdb-cloud) that allows you to monitor query execution during a simulated failure and recovery. The fault tolerance demo is in [Preview]({% link {{ site.versions["stable"] }}/cockroachdb-feature-availability.md %}).
Review comment: typo on execution


## Aug 5, 2025

Console users with the [Billing Coordinator role]({% link cockroachcloud/authorization.md %}#billing-coordinator) can now [export invoices]({% link cockroachcloud/billing-management.md %}#export-invoices) in a PDF format, rendering billing information into a traditional invoice format for ease of distribution.
12 changes: 8 additions & 4 deletions src/current/v26.1/cockroachdb-feature-availability.md
@@ -65,7 +65,7 @@ The `metrics` Prometheus endpoint is commonly used and is the default in Prometh

### Value separation

- [Value separation]({% link {{ page.version.version }}/architecture/storage-layer.md %}#value-separation) reduces write amplification by storing large values separately from the LSM in blob files. Value separation can reduce write amplification by up to 50% for large-value workloads, while introducing minor read overhead and a slight increase in disk space usage. This feature is available in Preview.
+ [Value separation]({% link {{ page.version.version }}/architecture/storage-layer.md %}#value-separation) reduces write amplification by storing large values separately from the LSM in blob files. Value separation can reduce write amplification by up to 50% for large-value workloads, while introducing minor read overhead and a slight increase in disk space usage. This feature is available in preview.

### `database` and `application_name` labels for certain metrics

@@ -86,7 +86,7 @@ The SQL built-in function [workload_index_recs]({% link {{ page.version.version

### Triggers

- [Triggers]({% link {{ page.version.version }}/triggers.md %}) are in Preview. A trigger executes a function when one or more specified SQL operations is performed on a table. Triggers respond to data changes by adding logic within the database, rather than in an application. They can be used to modify data before it is inserted, maintain data consistency across rows or tables, or record an update to a row.
+ [Triggers]({% link {{ page.version.version }}/triggers.md %}) are in preview. A trigger executes a function when one or more specified SQL operations is performed on a table. Triggers respond to data changes by adding logic within the database, rather than in an application. They can be used to modify data before it is inserted, maintain data consistency across rows or tables, or record an update to a row.

### JWT authorization

@@ -98,16 +98,20 @@ The SQL built-in function [workload_index_recs]({% link {{ page.version.version

### Admission control for ingesting snapshots

- The [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) `kvadmission.store.snapshot_ingest_bandwidth_control.enabled` is in Preview. When configured, it limits the disk impact of ingesting snapshots on a node.
+ The [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) `kvadmission.store.snapshot_ingest_bandwidth_control.enabled` is in preview. When configured, it limits the disk impact of ingesting snapshots on a node.

### Admission control to limit the bandwidth for a store

- The [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) `kvadmission.store.provisioned_bandwidth` is in Preview. When configured, the store's bandwidth is limited to the configured bandwidth, expressed in bytes per second,
+ The [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) `kvadmission.store.provisioned_bandwidth` is in preview. When configured, the store's bandwidth is limited to the configured value, expressed in bytes per second.

### CockroachDB Standard

CockroachDB {{ site.data.products.standard }} is our new, [enterprise-ready plan](https://www.cockroachlabs.com/pricing), recommended for most applications. You can start small with [provisioned capacity that can scale on demand]({% link cockroachcloud/plan-your-cluster.md %}), along with enterprise-level security and availability. Compute for CockroachDB {{ site.data.products.standard }} is pre-provisioned and storage is usage-based. You can easily switch a CockroachDB {{ site.data.products.basic }} cluster to CockroachDB {{ site.data.products.standard }} in place.

### Fault tolerance demo

CockroachDB {{ site.data.products.advanced }} includes a [built-in fault tolerance demo]({% link {{ page.version.version }}/demo-cockroachdb-resilience.md %}#run-a-guided-demo-in-cockroachdb-cloud) that allows you to monitor query execution during a simulated failure and recovery. The fault tolerance demo is in preview.

### CockroachDB Cloud Folders

[Organizing CockroachDB {{ site.data.products.cloud }} clusters using folders]({% link cockroachcloud/folders.md %}) is in preview. Folders allow you to organize and manage access to your clusters according to your organization's requirements. For example, you can create top-level folders for each business unit in your organization, and within those folders, organize clusters by geographic location and then by level of maturity, such as production, staging, and testing.
49 changes: 35 additions & 14 deletions src/current/v26.1/demo-cockroachdb-resilience.md
@@ -1,17 +1,38 @@
---
title: CockroachDB Resilience Demo
- summary: Use a local cluster to explore how CockroachDB remains available during, and recovers after, failure.
+ summary: Run a demo to explore how CockroachDB remains available during a failure and recovery.
toc: true
docs_area: deploy
---

- This page guides you through a simple demonstration of how CockroachDB remains available during, and recovers after, failure. Starting with a 6-node local cluster with the default 3-way replication, you'll run a sample workload, terminate a node to simulate failure, and see how the cluster continues uninterrupted. You'll then leave that node offline for long enough to watch the cluster repair itself by re-replicating missing data to other nodes. You'll then prepare the cluster for 2 simultaneous node failures by increasing to 5-way replication, then take two nodes offline at the same time, and again see how the cluster continues uninterrupted.
+ This page describes how to run a hands-on demonstration of CockroachDB's fault-tolerant design, which allows services to remain available during failures and recovery.

- ## Before you begin
+ ## Run a guided demo in CockroachDB {{ site.data.products.cloud }}

CockroachDB {{ site.data.products.cloud }} {{ site.data.products.advanced }} includes a built-in fault tolerance demo in the {{ site.data.products.cloud }} Console that automatically runs a sample workload and simulates a node failure on your cluster, showing real-time metrics of query latency and failure rate during the outage and recovery.
Review comment: it's technically an availability zone failure (could be multiple nodes) for larger clusters.


{{ site.data.alerts.callout_info }}
The CockroachDB {{ site.data.products.cloud }} fault tolerance demo is in [Preview]({% link {{ page.version.version }}/cockroachdb-feature-availability.md %}).

Review comment:

Not for this PR but it seems like we should have prebuilt macros for each of visibilities. Its annoying that we'd have to add the version and availability to each place we use this. Ideally we could do this and have it link up correctly.

fault tolerance demo is in {{site.data.visibility.preview}}.

@jhlodin (Contributor Author):

Agreed, some amount of these callouts can be turned into a macro/snippet. We're planning a migration to a new docs site that has its own tools, so rather than create a DOC ticket to do this in the current system we'll earmark it to investigate later.

Review comment (Contributor):

nit: suggest lower-case "preview" since we changed everything on the feature availability page to be lowercased (really embarrassed i wrote this, don't care that much)

{{ site.data.alerts.end }}

The following prerequisites must be met to run the fault tolerance demo:

- The cluster is a [CockroachDB {{ site.data.products.advanced }} cluster]({% link cockroachcloud/create-an-advanced-cluster.md %}) with at least three nodes.
- All nodes are healthy.
- The cluster's CPU utilization is below 30%.
- The cluster is not currently in a locked state as a result of maintenance, such as scaling.

To run the fault tolerance demo, open the {{ site.data.products.cloud }} Console and navigate to **Actions > Fault tolerance demo**.
@jaiayu, Mar 4, 2026:

should we have a disclaimer here that we don't recommend running the fault tolerance demo on your production cluster? It is a live cluster demo.

Other things to consider:

- You need cluster operator and/or cluster admin permissions to run the demo.
- The demo injects a temporary database and workload into your cluster, and cleans up after the demo is complete. The cleanup step may take a few minutes after your demo ends.
- You cannot run a second demo on a cluster if one is already running (this applies if you have multiple cluster admins for the same cluster).
- The demo can take 10-15 minutes to complete end to end.


## Run a manual demo on a local machine

This guide walks you through a simple demonstration of CockroachDB's resilience on a local cluster deployment. Starting with a 6-node local cluster with the default 3-way replication, you'll run a sample workload, terminate a node to simulate failure, and see how the cluster continues uninterrupted. You'll then leave that node offline for long enough to watch the cluster repair itself by re-replicating missing data to other nodes. You'll then prepare the cluster for 2 simultaneous node failures by increasing to 5-way replication, then take two nodes offline at the same time, and again see how the cluster continues uninterrupted.
Review comment (Contributor):

this paragraph is pretty dense. non-blocking suggest to rewrite for more scannability, e.g. something like


This guide walks you through a simple demonstration of CockroachDB’s resilience on a local cluster deployment.

Starting with a 6-node local cluster using the default 3-way replication, you will:

- Run a sample workload
- Terminate one node to simulate a failure
- Observe the cluster continue serving traffic uninterrupted

Next, you’ll leave that node offline for long enough to watch the cluster repair itself by re-replicating missing data to other nodes.

Finally, you’ll prepare the cluster for 2 simultaneous node failures by increasing to 5-way replication, then:

- Take two nodes offline at the same time
- See how the cluster continues uninterrupted


### Before you begin

Make sure you have already [installed CockroachDB]({% link {{ page.version.version }}/install-cockroachdb.md %}).

- ## Step 1. Start a 6-node cluster
+ ### Step 1. Start a 6-node cluster

1. In separate terminal windows, use the [`cockroach start`]({% link {{ page.version.version }}/cockroach-start.md %}) command to start six nodes:

@@ -82,7 +103,7 @@ Make sure you have already [installed CockroachDB]({% link {{ page.version.versi
$ cockroach init --insecure --host=localhost:26257
~~~

- ## Step 2. Set up load balancing
+ ### Step 2. Set up load balancing

In this tutorial, you run a sample workload to simulate multiple client connections. Each node is an equally suitable SQL gateway for the load, but it's always recommended to [spread requests evenly across nodes]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing). This section shows how to set up the open-source [HAProxy](http://www.haproxy.org/) load balancer.
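A minimal `haproxy.cfg` sketch for this topology follows. The ports assume the six-node local defaults (SQL on 26257-26262, HTTP health checks on 8080-8085) and a load-balancer port of 26000; `cockroach gen haproxy` can generate a complete file for a running cluster, so treat this as an illustration rather than a canonical config:

~~~
global
  maxconn 4096

defaults
    mode                tcp
    # Long client/server timeouts so idle SQL connections are not dropped.
    timeout connect     10s
    timeout client      10m
    timeout server      10m

listen psql
    bind :26000
    mode tcp
    balance roundrobin
    # Route only to nodes that report ready on their HTTP endpoint.
    option httpchk GET /health?ready=1
    server cockroach1 localhost:26257 check port 8080
    server cockroach2 localhost:26258 check port 8081
    server cockroach3 localhost:26259 check port 8082
    server cockroach4 localhost:26260 check port 8083
    server cockroach5 localhost:26261 check port 8084
    server cockroach6 localhost:26262 check port 8085
~~~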

@@ -133,7 +154,7 @@ In this tutorial, you run a sample workload to simulate multiple client connecti
$ haproxy -f haproxy.cfg &
~~~

- ## Step 3. Run a sample workload
+ ### Step 3. Run a sample workload

Use the [`cockroach workload`]({% link {{ page.version.version }}/cockroach-workload.md %}) command to run CockroachDB's built-in version of the YCSB benchmark, simulating multiple client connections, each performing mixed read/write operations.

@@ -179,7 +200,7 @@ Use the [`cockroach workload`]({% link {{ page.version.version }}/cockroach-work

After the specified duration (20 minutes in this case), the workload will stop and you'll see totals printed to standard output.

- ## Step 4. Check the workload
+ ### Step 4. Check the workload

Initially, the workload creates a new database called `ycsb`, creates the table `public.usertable` in that database, and inserts rows into the table. Soon, the load generator starts executing approximately 95% reads and 5% writes.

@@ -207,7 +228,7 @@ Initially, the workload creates a new database called `ycsb`, creates the table

<img src="{{ 'images/v26.1/fault-tolerance-6.png' | relative_url }}" alt="DB Console Overview" style="border:1px solid #eee;max-width:100%" />

- ## Step 5. Simulate a single node failure
+ ### Step 5. Simulate a single node failure

When a node fails, the cluster waits for the node to remain offline for 5 minutes by default before considering it dead, at which point the cluster automatically repairs itself by re-replicating any of the replicas on the down nodes to other available nodes.
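The 5-minute window is controlled by the `server.time_until_store_dead` [cluster setting]; this walkthrough later shortens it so the repair is observable quickly. As a sketch (1m15s is the minimum allowed value):

~~~sql
-- Consider an unresponsive store dead after 75 seconds instead of 5 minutes.
SET CLUSTER SETTING server.time_until_store_dead = '1m15s';
~~~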

@@ -242,23 +263,23 @@ When a node fails, the cluster waits for the node to remain offline for 5 minute
kill -TERM 53708
~~~

- ## Step 6. Check load continuity and cluster health
+ ### Step 6. Check load continuity and cluster health

Go back to the DB Console, click **Metrics** on the left, and verify that the cluster as a whole continues serving data, despite one of the nodes being unavailable and marked as **Suspect**:

<img src="{{ 'images/v26.1/fault-tolerance-7.png' | relative_url }}" alt="DB Console Suspect Node" style="border:1px solid #eee;max-width:100%" />

This shows that when all ranges are replicated 3 times (the default), the cluster can tolerate a single node failure because the surviving nodes have a majority of each range's replicas (2/3).
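The majority arithmetic behind this claim can be sketched in a few lines of Python (illustrative only, not CockroachDB code):

```python
def max_tolerated_failures(replication_factor: int) -> int:
    """Nodes that can fail while every range still keeps a majority of replicas."""
    majority = replication_factor // 2 + 1  # e.g. 2 of 3, 3 of 5
    return replication_factor - majority

# Default 3-way replication survives one node failure; 5-way survives two.
assert max_tolerated_failures(3) == 1
assert max_tolerated_failures(5) == 2
```

Note that an even replication factor buys nothing extra: 4 replicas still need a 3-replica majority, so only one failure is tolerated, which is why odd factors are used.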

- ## Step 7. Watch the cluster repair itself
+ ### Step 7. Watch the cluster repair itself

Click **Overview** on the left:

<img src="{{ 'images/v26.1/fault-tolerance-5.png' | relative_url }}" alt="DB Console Cluster Repair" style="border:1px solid #eee;max-width:100%" />

Because you reduced the time it takes for the cluster to consider the down node dead, after 1 minute or so, the cluster will consider the down node "dead", and you'll see the replica count on the remaining nodes increase and the number of under-replicated ranges decrease to 0. This shows the cluster repairing itself by re-replicating missing replicas.

- ## Step 8. Prepare for two simultaneous node failures
+ ### Step 8. Prepare for two simultaneous node failures

At this point, the cluster has recovered and is ready to handle another failure. However, the cluster cannot handle two _near-simultaneous_ failures in this configuration. Failures are "near-simultaneous" if they are closer together than the `server.time_until_store_dead` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) plus the time taken for the number of replicas on the dead node to drop to zero. If two failures occurred in this configuration, some ranges would become unavailable until one of the nodes recovers.
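Raising the replication factor is a zone configuration change; for the default replication zone it is a single statement like the following sketch:

~~~sql
-- Replicate every range in the default zone 5 ways instead of 3.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
~~~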

@@ -290,7 +311,7 @@ To be able to tolerate 2 of 5 nodes failing simultaneously without any service i

This shows the cluster up-replicating so that each range has 5 replicas, one on each node.

- ## Step 9. Simulate two simultaneous node failures
+ ### Step 9. Simulate two simultaneous node failures

Gracefully shut down **2 nodes**, specifying the [process IDs you retrieved earlier](#step-5-simulate-a-single-node-failure):

@@ -299,7 +320,7 @@ Gracefully shut down **2 nodes**, specifying the [process IDs you retrieved earl
kill -TERM {process IDs}
~~~

- ## Step 10. Check cluster status and service continuity
+ ### Step 10. Check cluster status and service continuity

1. Click **Overview** on the left, and verify the state of the cluster:

@@ -343,7 +364,7 @@ kill -TERM {process IDs}

This shows that when all ranges are replicated 5 times, the cluster can tolerate 2 simultaneous node outages because the surviving nodes have a majority of each range's replicas (3/5).

- ## Step 11. Clean up
+ ### Step 11. Clean up

1. In the terminal where the YCSB workload is running, press **CTRL + c**.
