Tq expunge and zfs change key #9843

Draft

andrewjstone wants to merge 25 commits into main from
tq-expunge-and-zfs-change-key

Conversation

@andrewjstone (Contributor)

DO NOT MERGE! For testing purposes only.

andrewjstone and others added 25 commits January 29, 2026 02:22
I tested this out by first trying to abort and watching it fail because
there is no trust quorum configuration. Then I issued an LRTQ upgrade,
which failed because I hadn't restarted the sled-agents to pick up the
LRTQ shares. Then I aborted that configuration, which was stuck in
prepare. Lastly, I successfully issued a new LRTQ upgrade after
restarting the sled agents and watched it commit.

Here are the external API calls:

```
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
error; status code: 404 Not Found
{
  "error_code": "Not Found",
  "message": "No trust quorum configuration exists for this rack",
  "request_id": "819eb6ab-3f04-401c-af5f-663bb15fb029"
}
error
➜  oxide.rs git:(main) ✗
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
{
  "members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "rack_id": "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6",
  "state": "aborted",
  "time_aborted": "2026-01-29T01:54:02.590683Z",
  "time_committed": null,
  "time_created": "2026-01-29T01:37:07.476451Z",
  "unacknowledged_members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "version": 2
}
```

Here are the omdb calls:

```
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Error: lrtq upgrade

Caused by:
    Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "8503cd68-7ff4-4bf1-b358-0e70279c6347", "content-length": "124", "date": "Thu, 29 Jan 2026 01:37:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "8503cd68-7ff4-4bf1-b358-0e70279c6347" }

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: Aborted,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: Some(
        2026-01-29T01:54:02.590683Z,
    ),
    abort_reason: Some(
        "Aborted via API request",
    ),
}

root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Started LRTQ upgrade at epoch 3

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: Committed,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: Some(
        EncryptedRackSecrets {
            salt: Salt(
                [
                    143,
                    198,
                    3,
                    63,
                    136,
                    48,
                    212,
                    180,
                    101,
                    106,
                    50,
                    2,
                    251,
                    84,
                    234,
                    25,
                    46,
                    39,
                    139,
                    46,
                    29,
                    99,
                    252,
                    166,
                    76,
                    146,
                    78,
                    238,
                    28,
                    146,
                    191,
                    126,
                ],
            ),
            data: [
                167,
                223,
                29,
                18,
                50,
                230,
                103,
                71,
                159,
                77,
                118,
                39,
                173,
                97,
                16,
                92,
                27,
                237,
                125,
                173,
                53,
                51,
                96,
                242,
                203,
                70,
                36,
                188,
                200,
                59,
                251,
                53,
                126,
                48,
                182,
                141,
                216,
                162,
                240,
                5,
                4,
                255,
                145,
                106,
                97,
                62,
                91,
                161,
                51,
                110,
                220,
                16,
                132,
                29,
                147,
                60,
            ],
        },
    ),
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 13c0a6113e55963ed35b275e49df4c3f0b3221143ea674bb1bd5188f4dac84,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.792674Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 8557d74f678fa4e8278714d917f14befd88ed1411f27c57d641d4bf6c77f3b,
            ),
            time_prepared: Some(
                2026-01-29T02:20:47.236089Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: d61888c42a1b5e83adcb5ebe29d8c6c66dc586d451652e4e1a92befe41719cd,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.809779Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:52.248351Z,
            ),
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: Some(
        2026-01-29T02:20:47.597276Z,
    ),
    time_committed: Some(
        2026-01-29T02:21:52.263198Z,
    ),
    time_aborted: None,
    abort_reason: None,
}
```
After chatting with @davepacheco, I changed the authz checks in the
datastore to do lookups with Rack scope. This fixed the test bug, but it
is only a shortcut. Trust quorum should have its own authz object, and
I'm going to open an issue for that.

Additionally, for methods that already took an authorized connection, I
removed the unnecessary authz checks and the opctx parameter.

This commit adds a three-phase mechanism for sled expungement.

The first phase is to remove the sled from the latest trust quorum
configuration via omdb. The second phase is to reboot the sled after
polling for the trust quorum removal to commit. The third phase is to
issue the existing omdb expunge command, which changes the sled policy
as before.

The first and second phases remove the need to physically remove the
sled before expungement. They act as a software gate that prevents the
sled-agent from restarting on the sled and doing work when the sled
should be treated as "absent". We've discussed this numerous times in
the update huddle, and it is finally arriving!

The third phase is what informs reconfigurator that the sled is gone.
It remains the same as before, except for an extra sanity check that
the last committed trust quorum configuration does not contain the sled
that is to be expunged.

The removed sled may be added back to this rack, or to another, after
being clean-slated. I tested this by deleting the files in the internal
"cluster" and "config" directories and rebooting the removed sled in
a4x2, and it worked.

This PR is marked draft because it changes the current sled-expunge
pathway to depend on real trust quorum. We cannot safely merge it until
the key-rotation work from #9737 is merged.

This also builds on #9741 and should merge after that PR.

When Trust Quorum commits a new epoch, all U.2 crypt datasets must have
their encryption keys rotated to use the new epoch's derived key. This
change implements the key rotation flow triggered by epoch commits.

## Trust Quorum Integration

- Add watch channel to `NodeTaskHandle` for epoch change notifications
- Initialize channel with current committed epoch on startup
- Notify subscribers via `send_if_modified()` when epoch changes (sketched below)
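
For reference, here is a minimal sketch of that notification pattern using `tokio::sync::watch`; the struct and method names are illustrative stand-ins, not the actual `NodeTaskHandle` API:

```rust
use tokio::sync::watch;

// Illustrative stand-in for the real handle; names are assumptions.
struct NodeTaskHandle {
    committed_epoch_tx: watch::Sender<Option<u64>>,
}

impl NodeTaskHandle {
    // Called when trust quorum commits a configuration. Subscribers
    // are only woken if the committed epoch actually changed.
    fn on_commit(&self, epoch: u64) {
        self.committed_epoch_tx.send_if_modified(|current| {
            if *current == Some(epoch) {
                false
            } else {
                *current = Some(epoch);
                true
            }
        });
    }

    // The config reconciler holds the receiving end and runs a
    // reconciliation pass whenever the value changes.
    fn subscribe_committed_epoch(&self) -> watch::Receiver<Option<u64>> {
        self.committed_epoch_tx.subscribe()
    }
}
```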

## Config Reconciler Integration

- Accept `committed_epoch_rx` watch channel from trust quorum
- Trigger reconciliation when epoch changes
- Track per-disk encryption epoch in `ExternalDisks`
- Add `rekey_for_epoch()` to coordinate key rotation:
  - Filter disks needing rekey (cached epoch < target OR unknown); see the sketch after this list
  - Derive keys for each disk via `StorageKeyRequester`
  - Send batch request to dataset task
  - Update cached epochs on success
  - Retry on failure via normal reconciliation retry logic
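
A minimal sketch of that filter step, assuming simplified stand-in types (`CryptDisk` and `cached_epoch` are illustrative names). Rust orders `None` below any `Some`, so a single comparison covers both the behind-target and unknown cases:

```rust
// Illustrative types; the real ones live in the config reconciler.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Epoch(u64);

struct CryptDisk {
    name: String,
    // Last epoch this disk's crypt dataset is known to be keyed to;
    // None if it has never been read back from ZFS.
    cached_epoch: Option<Epoch>,
}

fn disks_needing_rekey(disks: &[CryptDisk], target: Epoch) -> Vec<&CryptDisk> {
    disks
        .iter()
        // None < Some(_) in Option's ordering, so disks with unknown
        // epochs are conservatively included.
        .filter(|d| d.cached_epoch < Some(target))
        .collect()
}
```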

## Dataset Task Changes

- Add `RekeyRequest`/`RekeyResult` types for batch rekey operations
- Add `datasets_rekey()` with idempotency check (skip if already at target; sketched below)
- Use `Zfs::change_key()` for atomic key + epoch property update
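
The idempotency check is what makes retries safe; in sketch form (the helper names here are hypothetical stand-ins, not real illumos-utils functions):

```rust
use anyhow::Result;

// Hypothetical stand-ins for the real helpers.
fn read_epoch_property(_dataset: &str) -> Result<Option<u64>> { Ok(None) }
fn change_key(_dataset: &str, _epoch: u64) -> Result<()> { Ok(()) }

// Skip the rekey if the dataset's `oxide:epoch` already matches the
// target, so a retried batch request is a no-op for datasets that
// were rekeyed before a crash.
fn maybe_rekey(dataset: &str, target: u64) -> Result<()> {
    if read_epoch_property(dataset)? == Some(target) {
        return Ok(());
    }
    change_key(dataset, target)
}
```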

## ZFS Utilities

- Add `Zfs::change_key()` using `zfs_atomic_change_key` crate
- Add `Zfs::load_key()`, `unload_key()`, `dataset_exists()`
- Add `epoch` field to `DatasetProperties`
- Add structured error types for key operations

## Crash Recovery

- Add trial decryption recovery in `sled-storage` for datasets with
  missing epoch property (e.g., crash during initial creation); see the
  sketch after this list
- Unload key before each trial attempt to handle crash-after-load-key
- Set epoch property after successful recovery
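
In sketch form, the recovery loop looks roughly like this (helper names are hypothetical):

```rust
// Hypothetical stand-ins for the real key-handling helpers.
fn unload_key(_dataset: &str) -> Result<(), String> { Ok(()) }
fn try_load_key_for_epoch(_dataset: &str, _epoch: u64) -> Result<(), String> { Ok(()) }
fn set_epoch_property(_dataset: &str, _epoch: u64) -> Result<(), String> { Ok(()) }

// Walk candidate epochs, attempting to unlock the dataset with each
// epoch's derived key. On success, persist the epoch property so this
// recovery path is never needed again for this dataset.
fn recover_missing_epoch(dataset: &str, candidates: &[u64]) -> Option<u64> {
    for &epoch in candidates {
        // A crash after `zfs load-key` but before the property was set
        // can leave a key loaded; unload so the trial is meaningful.
        let _ = unload_key(dataset);
        if try_load_key_for_epoch(dataset, epoch).is_ok() {
            set_epoch_property(dataset, epoch).ok()?;
            return Some(epoch);
        }
    }
    None
}
```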

## Safety Properties

- Atomic: Key and epoch property set together via `zfs_atomic_change_key`
- Idempotent: Skip rekey if dataset already at target epoch
- Crash-safe: Epoch read from ZFS on restart rebuilds cache correctly
- Conservative: Unknown epochs (None) trigger rekey attempt

Create a new key-manager-types crate containing the disk encryption key
types (Aes256GcmDiskEncryptionKey and VersionedAes256GcmDiskEncryptionKey)
that were previously defined in key-manager. This breaks the dependency
from illumos-utils to key-manager, allowing illumos-utils to depend only
on the minimal types crate.

The key-manager crate re-exports VersionedAes256GcmDiskEncryptionKey for
backwards compatibility.
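
The compatibility shim is roughly a one-line re-export in key-manager's `lib.rs`:

```rust
// Keep existing `key_manager::VersionedAes256GcmDiskEncryptionKey`
// imports compiling after the type moved to key-manager-types.
pub use key_manager_types::VersionedAes256GcmDiskEncryptionKey;
```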
- Format ZFS_GET_PROPS const with concat! and clarify epoch field docs
- Preserve error source chain with anyhow::Error::from instead of formatting
- Convert KeyRotationError from enum to struct (single variant)
- Log current and new epochs in key rotation success and failure paths
- Change rekey_for_epoch to return ReconciliationResult instead of bool
- Add error log for unexpected epoch-ahead-of-target condition
- Simplify epoch filter using Option<Epoch> ordering
- Move dataset_name allocation into Ok branch to minimize scope
- Upgrade all-key-derivations-failed log from info to warn
- Inline dataset_exists helper, calling Zfs::dataset_exists directly
- Log warning on best-effort unload_key failure

The mark_unchanged() call before the reconciler loop was a no-op:
borrow_and_update() inside do_reconcilation() already reads and
processes the current epoch unconditionally, marking it as seen.
Also remove a stale comment about copying epoch out of the Ref,
which was informational and no longer adds clarity.

Now that illumos ZFS supports `zfs change-key -o oxide:epoch=N`,
we no longer need the zfs-atomic-change-key crate that embedded
key material in Lua scripts.

Zfs::change_key() now just runs the native command, taking
(dataset, epoch) instead of (dataset, key). Keyfile management
is lifted to the caller in datasets_rekey(), which uses
KeyFile::create + zero_and_unlink — matching the existing pattern
used for dataset creation and trial decryption. This ensures key
material is zeroed from tmpfs promptly after use.
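
For reference, the native invocation has roughly this shape (a sketch, not the real `Zfs::change_key`; the signature and error handling are simplified, and `keyfile` is the absolute tmpfs path the caller created and will zero-and-unlink afterward):

```rust
use std::io;
use std::process::Command;

fn change_key(dataset: &str, keyfile: &str, epoch: u64) -> io::Result<()> {
    // An absolute `keyfile` path yields a valid `file:///...`
    // keylocation. Setting `-o oxide:epoch=N` in the same command is
    // the illumos extension described above, so the key change and
    // the epoch property land atomically.
    let status = Command::new("/usr/sbin/zfs")
        .arg("change-key")
        .arg("-o")
        .arg(format!("keylocation=file://{keyfile}"))
        .arg("-o")
        .arg(format!("oxide:epoch={epoch}"))
        .arg(dataset)
        .status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::other(format!("zfs change-key failed: {status}")))
    }
}
```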
@andrewjstone (Contributor, Author)

This all worked like a charm.
