Updated RFC8 to current status #2637

thijsvanemmerik · 2026-01-15T14:35:59Z

Storage: Airtable → PostgreSQL
Slack channels: #dz-contributor-incidents / #dz-contributor-maintenance (no @mentions in MVP)
Contributor Onboarding: CLI ops-manager setup → wallet connect (Phantom/Solflare/Coinbase) → API keys → create records
Permission Model: Contributor keys (own devices only) vs Admin keys (network-wide); DZX link ownership by A-side
New fields: root_cause (incidents only), assignee, internal_reference
Status Lifecycle: Full incident/maintenance status flows with definitions
Severity Levels: sev1/sev2/sev3 defined
Root Cause Codes: 14 codes for incident resolution
Other: UTC timestamps, human-readable device/link dropdowns, simplified status tables

rfcs/rfc8-contributor-incident-maintenance-logging.md

ben-malbeclabs

LGTM

## Summary of Changes

This renames the ClickHouse column and Go struct fields from device_code to device_pubkey to align with the actual data being stored (a Solana pubkey), and updates the test data targets to use fake pubkey values.

## Testing Verification

Existing unit/integration tests pass.

This pull request introduces new health and desired status fields for both devices and links in the `serviceability/state.go` file, enhancing the ability to track and serialize their operational states. These changes add new enums, fields, and JSON serialization logic to support more granular state management.

**Device state enhancements:**

- Added new enums: `DeviceHealth` and `DeviceDesiredStatus`, with corresponding string values and JSON serialization methods.
- Updated the `Device` struct to include `DeviceHealth` and `DeviceDesiredStatus` fields.
- Modified the `Device` JSON serialization to output the new health and desired status fields.
- Added a new `DeviceStatusDrained` status to the existing `DeviceStatus` enum.

**Link state enhancements:**

- Added new enums: `LinkHealth` and `LinkDesiredStatus`, with corresponding string values and JSON serialization methods.
- Updated the `Link` struct to include `LinkHealth` and `LinkDesiredStatus` fields.
- Modified the `Link` JSON serialization to output the new health and desired status fields.

---------

Co-authored-by: [email protected] <[email protected]>

## Summary of Changes

- Start a gNMI tunnel client in the telemetry agent that connects to a gRPC tunnel server and proxies sessions to the local gNMI service
- Enables collection of device state via gNMI
- Closes #2633

## Testing Verification

- Added test coverage for the new code
- Deployed to and validated in devnet

#2638)

## Summary of Changes

- add tools/gnmi-prototext-convert to convert raw gnmi get responses to subscribe response format for testdata files
- update tests to handle interfaces with multiple subinterfaces
- document conversion tool in readme

## Testing Verification

```
=== RUN   TestUnmarshal_InterfacesIfindex
    processor_test.go:810: successfully unmarshalled 198 interfaces
--- PASS: TestUnmarshal_InterfacesIfindex (0.15s)
PASS
ok      github.com/malbeclabs/doublezero/telemetry/gnmi-writer/internal/gnmi    0.693s
=== RUN   TestIntegration
=== RUN   TestIntegration/InterfaceIfindex
    processor_integration_test.go:294: published 25018 bytes to topic gnmi-notifications
msg="wrote records to clickhouse" count=202
msg="processed notifications" count=202
    processor_integration_test.go:442: found 202 interface ifindex records
--- PASS: TestIntegration/InterfaceIfindex (15.31s)
PASS
```

## Summary of Changes

This is mainly a documentation update so we track the gnmi path that generated the associated prototext file used for integration tests.

## Testing Verification

Existing tests pass.

#2646)

## Summary of Changes

Add three new gNMI telemetry collections:

- transceiver_state: optical power metrics (input/output power, laser bias current)
- interface_state: admin/oper status and interface counters
- transceiver_thresholds: alarm thresholds per severity level

Changes:

- regenerate oc package with openconfig-platform-transceiver module
- add record types, extractors, and ClickHouse schemas for each collection
- add integration tests and isolation unit tests
- document extractor ordering requirements in code and README

## Testing Verification

```
Unit Tests (Isolation):
=== RUN   TestExtractTransceiverState_Isolation
    processor_test.go:864: extracted 52 transceiver state records
--- PASS: TestExtractTransceiverState_Isolation (0.15s)
=== RUN   TestExtractInterfaceState_Isolation
    processor_test.go:925: extracted 198 interface state records
--- PASS: TestExtractInterfaceState_Isolation (0.16s)
=== RUN   TestExtractTransceiverThresholds_Isolation
    processor_test.go:974: extracted 600 transceiver threshold records
--- PASS: TestExtractTransceiverThresholds_Isolation (0.16s)
PASS

Integration Tests (end-to-end with ClickHouse + Redpanda containers):
--- PASS: TestIntegration (45.85s)
    --- PASS: TestIntegration/TransceiverState (16.44s)
        → 52 records written to ClickHouse
    --- PASS: TestIntegration/InterfaceState (14.52s)
        → 198 records written to ClickHouse
    --- PASS: TestIntegration/TransceiverThresholds (14.89s)
        → 600 records written to ClickHouse
PASS
```

This pull request introduces several significant improvements and changes to device and link lifecycle management, telemetry, and CLI commands. The most notable updates include the addition of health management and status controls for devices and links, new CLI commands for internal health operations, and changes to device/link provisioning and activation flows. There are also some test fixture and e2e test updates to align with the new status model.

**Device and Link Lifecycle & Status Management:**

* Introduced health management for Devices and Links, adding explicit health states, authorized health updates, and related enhancements to state, processors, and tests.
* Added support for a `desired-status` parameter when creating and updating devices and links, allowing explicit control over their activation state. This is reflected in both the code and e2e tests.
* Updated device and link event processors to handle new provisioning and activation states, including logging and state transitions (e.g., `DeviceProvisioning`, `ReadyForService`).

**CLI Command Changes:**

* Added hidden/internal CLI commands for setting the health status of device and link interfaces (`SetDeviceHealthCliCommand`, `SetLinkHealthCliCommand`). These are intended for operational or testing use and not exposed in the public CLI surface.
* Removed the public `Suspend` and `Resume` device commands from both the user and admin CLIs, streamlining device status management.

**Testing and Fixture Updates:**

* Updated e2e tests to use the new `--desired-status activated` flag when creating devices and links, ensuring that test resources start in the correct state (see `e2e/device_telemetry_test.go`).
* Adjusted IBRL agent configuration test fixtures to use more realistic IS-IS metrics and interface states, improving test coverage and accuracy.

**Changelog and Documentation:**

* Updated the `CHANGELOG.md` to reflect the introduction of health management, desired status for devices and links, and telemetry improvements for entities in provisioning/draining states.

These changes collectively improve the robustness, operational control, and observability of device and link management in the system.

## Testing Verification

* All tests pass.

## Summary of Changes

* DeactivateMulticastGroup now only succeeds when both publisher_count == 0 and subscriber_count == 0 (in addition to status == Deleting).
* If either count is non-zero, the instruction fails and the account is not closed.
* Regression test added: a new test verifies deactivation fails when publisher/subscriber counts are non-zero and passes when they’re zero.
* Updated Changelog.

## Testing Verification

* Existing rust + lint checks pass.

Closes #2219

---------

Signed-off-by: ANISH-SR <[email protected]>
Co-authored-by: Juan Olveira <[email protected]>
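The guard described in this summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the program's actual code: `MulticastGroup`, `GroupStatus`, `GuardError`, and `can_deactivate` are hypothetical names, and only the three conditions (status is Deleting, zero publishers, zero subscribers) come from the summary.

```rust
/// Hypothetical status enum; only `Deleting` is named in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupStatus {
    Activated,
    Suspended,
    Deleting,
}

/// Hypothetical, simplified view of a multicast group account.
#[derive(Debug, Clone)]
pub struct MulticastGroup {
    pub status: GroupStatus,
    pub publisher_count: u32,
    pub subscriber_count: u32,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum GuardError {
    InvalidStatus,
    GroupStillInUse,
}

/// Returns Ok only when the group is Deleting and has no publishers
/// or subscribers left; otherwise the account must stay open.
pub fn can_deactivate(group: &MulticastGroup) -> Result<(), GuardError> {
    if group.status != GroupStatus::Deleting {
        return Err(GuardError::InvalidStatus);
    }
    if group.publisher_count != 0 || group.subscriber_count != 0 {
        return Err(GuardError::GroupStillInUse);
    }
    Ok(())
}
```

Checking the status first and the counts second keeps the two failure reasons distinguishable in tests, which is one way a regression test could assert on the exact failure.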

## Summary of Changes

* Now only allows reactivation when multicastgroup.status is Suspended, otherwise returns InvalidStatus.
* New test ensures reactivation fails when the group is not Suspended.
* Changelog.md updated.

## Testing Verification

* Rust checks + lint green.

Closes #2227

Signed-off-by: ANISH-SR <[email protected]>

…are empty + add regression test (#2635)

## Summary of Changes

* Added a safety guard to process_closeaccount_user so a user account can only be closed when both publishers and subscribers are empty.
* Added a regression test.
* Updated CHANGELOG.md

## Testing Verification

* Existing rust checks + lint checks pass.

Resolves #2218

Signed-off-by: ANISH-SR <[email protected]>
Co-authored-by: Juan Olveira <[email protected]>

## Summary of Changes

This fixes two issues:

- Devices send transceiver descriptions as separate gnmi update messages, which were being parsed as individual records and written as a duplicate row with zero values.
- Transceiver thresholds are sent as individual updates for each severity. Each update was being written as its own row; they are now aggregated into a single row per interface/severity.

## Testing Verification

Transceiver state before, containing a duplicate row with zero values:

```
53. │ 2026-01-16 04:08:51.197958931 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1 │ 0 │ -1.95 │ -2.35 │ 6.25 │
54. │ 2026-01-16 04:08:51.197958931 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1 │ 0 │ 0 │ 0 │ 0 │
```

Transceiver state now:

```
SELECT *
FROM transceiver_state_latest
WHERE (device_pubkey = '9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r') AND (interface_name = 'Ethernet1')

   ┌─────────────────────timestamp─┬─device_pubkey────────────────────────────────┬─interface_name─┬─channel_index─┬─input_power─┬─output_power─┬─laser_bias_current─┐
1. │ 2026-01-16 04:20:56.145597378 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1 │ 0 │ -1.94 │ -2.33 │ 6.25 │
   └───────────────────────────────┴──────────────────────────────────────────────┴────────────────┴───────────────┴─────────────┴──────────────┴────────────────────┘
```

Transceiver thresholds before:

```
41551. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 2.97 │ 0 │
41552. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 3.6300000000000003 │
41553. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 2 │ 0 │ 0 │ 0 │ 0 │ 0 │
41554. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 14 │ 0 │ 0 │ 0 │ 0 │
41555. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ -13.904055907747797 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41556. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 3.000082025538127 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41557. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ -11.301817920206716 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41558. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 3.000082025538127 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41559. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ -5 │ 0 │ 0 │ 0 │
41560. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 75 │ 0 │ 0 │
```

Transceiver thresholds now (single record per interface/severity):

```
SELECT *
FROM transceiver_thresholds_latest
WHERE (device_pubkey = '9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r') AND (interface_name = 'Ethernet54') AND (severity = 'CRITICAL')

Row 1:
──────
timestamp:                2026-01-16 04:24:56.624986659
device_pubkey:            9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r
interface_name:           Ethernet54
severity:                 CRITICAL
input_power_lower:        -13.904055907747797
input_power_upper:        3.000082025538127
output_power_lower:       -11.301817920206716
output_power_upper:       3.000082025538127
laser_bias_current_lower: 2
laser_bias_current_upper: 14
module_temperature_lower: -5
module_temperature_upper: 75
supply_voltage_lower:     2.97
supply_voltage_upper:     3.6300000000000003
```
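The aggregation idea (many single-field updates folded into one row per interface/severity) might look something like the sketch below. It is an illustration only and is written in Rust rather than the gnmi-writer's own language: `ThresholdUpdate`, `ThresholdField`, `ThresholdRow`, and `aggregate` are hypothetical names, only four of the ten threshold columns are modelled, and missing values are represented as `None` instead of the zeros visible in the tables.

```rust
use std::collections::BTreeMap;

/// Which threshold column a single update carries (subset, for brevity).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThresholdField {
    InputPowerLower,
    InputPowerUpper,
    OutputPowerLower,
    OutputPowerUpper,
}

/// One partial update: a single field for one interface/severity pair.
#[derive(Debug, Clone, PartialEq)]
pub struct ThresholdUpdate {
    pub interface_name: String,
    pub severity: String,
    pub field: ThresholdField,
    pub value: f64,
}

/// The aggregated row; fields stay `None` until an update fills them.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ThresholdRow {
    pub input_power_lower: Option<f64>,
    pub input_power_upper: Option<f64>,
    pub output_power_lower: Option<f64>,
    pub output_power_upper: Option<f64>,
}

/// Folds per-field updates into one row keyed by (interface, severity).
pub fn aggregate(updates: &[ThresholdUpdate]) -> BTreeMap<(String, String), ThresholdRow> {
    let mut rows: BTreeMap<(String, String), ThresholdRow> = BTreeMap::new();
    for update in updates {
        let key = (update.interface_name.clone(), update.severity.clone());
        let row = rows.entry(key).or_default();
        match update.field {
            ThresholdField::InputPowerLower => row.input_power_lower = Some(update.value),
            ThresholdField::InputPowerUpper => row.input_power_upper = Some(update.value),
            ThresholdField::OutputPowerLower => row.output_power_lower = Some(update.value),
            ThresholdField::OutputPowerUpper => row.output_power_upper = Some(update.value),
        }
    }
    rows
}
```

Keying on the (interface, severity) pair is what collapses the ten rows shown for `Ethernet54` / `CRITICAL` into a single record.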

Resolves: #2660

## Summary of Changes

* Rename ReadyForService LinkStatus field to Provisioning

# Summary

Fix #2657

This PR fixes the bug by calling the `contains` method in the `ip_to_index` helper function, which ensures that the IP is actually within the ranges. Without this safety check, deleting an interface with an out-of-pool IP would make the activator panic and fail to restart.

## Without Fix

```
activator_interface_delete_test.go:89:
    Error Trace: /home/rahul/malbec-labs/doublezero/e2e/activator_interface_delete_test.go:89
    Error:       Should be true
    Test:        TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running
    Messages:    activator container is not running - it likely crashed
--- FAIL: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP (181.13s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/create_loopback_with_out_of_pool_ip (0.53s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/wait_for_interface_activation (14.01s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/delete_interface_with_out_of_pool_ip (5.53s)
    --- FAIL: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running (0.00s)
FAIL
exit status 1
FAIL    github.com/malbeclabs/doublezero/e2e    227.649s
make: *** [Makefile:21: test] Error 1
```

## With Fix

```
2026-01-17 01:37:41 INF ==> Verifying activator container is still running
2026-01-17 01:37:41 INF --> Activator is still running
=== RUN   TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_interface_removed
2026-01-17 01:37:41 INF ==> Verifying interface is removed from chain
2026-01-17 01:37:49 INF --> Interface successfully removed from chain
--- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP (190.70s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/create_loopback_with_out_of_pool_ip (0.53s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/wait_for_interface_activation (14.01s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/delete_interface_with_out_of_pool_ip (6.04s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running (0.00s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_interface_removed (8.01s)
PASS
ok      github.com/malbeclabs/doublezero/e2e    240.147s
```
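To illustrate why a `contains` check in front of the index computation matters, here is a minimal sketch. It is not the activator's implementation: `Ipv4Pool`, its fields, and this `ip_to_index` signature are assumptions made for the example. One plausible failure mode without the check is an integer underflow or out-of-range index when the address lies outside the pool.

```rust
use std::net::Ipv4Addr;

/// Hypothetical IPv4 pool described by a network base and a size.
#[derive(Debug, Clone, Copy)]
pub struct Ipv4Pool {
    base: u32,
    size: u32,
}

impl Ipv4Pool {
    /// Builds a pool from a network address and a prefix length (1..=32).
    pub fn new(network: Ipv4Addr, prefix_len: u8) -> Option<Self> {
        if prefix_len == 0 || prefix_len > 32 {
            return None;
        }
        let size = 1u32 << (32 - u32::from(prefix_len));
        // Mask off host bits so `base` is the first address of the block.
        let base = u32::from(network) & !(size - 1);
        Some(Self { base, size })
    }

    /// True when `ip` falls inside the pool.
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        let value = u32::from(ip);
        value >= self.base && value - self.base < self.size
    }

    /// Maps an address to its offset in the pool, or `None` when the
    /// address is outside the pool (instead of panicking).
    pub fn ip_to_index(&self, ip: Ipv4Addr) -> Option<u32> {
        if !self.contains(ip) {
            return None;
        }
        Some(u32::from(ip) - self.base)
    }
}
```

Returning `Option` pushes the out-of-pool case to the caller, which can then skip deallocation for that address rather than crash.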

## Summary of Changes

* e2e: add influxdb and device-health-oracle containers
* These containers will be used to validate functionality described in rfcs/rfc12-device-provisioning.md

## Testing Verification

* Includes a check to verify that each device is writing to the InfluxDB intfCounters table

Resolves: #2513

## Summary of Changes

* Run telemetry agent on pending and drained links

## Testing Verification

* Added tests

…2630)

Resolves: #2622

## Summary of Changes

* Removed unknown status
* Renamed status labels to be more descriptive:
  * pending → Pending BGP Session
  * initializing → Initializing BGP Session
  * down → BGP Session Down
  * up → BGP Session Up
* Added BGP Session Failed (TCP connected but BGP handshake timed out after 5 seconds)
* Added Network Unreachable (TCP connection failed, likely firewall issue)

## Testing Verification

* Fixed existing tests
* Added new tests
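The label mapping listed above can be written down as a small lookup. The sketch below is illustrative only: `SessionStatus` and `from_legacy` are hypothetical names, while the old and new label strings are taken from the summary.

```rust
use std::fmt;

/// Hypothetical status enum covering the labels listed in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionStatus {
    Pending,
    Initializing,
    Down,
    Up,
    Failed,
    NetworkUnreachable,
}

impl SessionStatus {
    /// Parses one of the old short labels; the removed `unknown`
    /// label (and anything else) yields `None`.
    pub fn from_legacy(label: &str) -> Option<Self> {
        match label {
            "pending" => Some(Self::Pending),
            "initializing" => Some(Self::Initializing),
            "down" => Some(Self::Down),
            "up" => Some(Self::Up),
            _ => None,
        }
    }
}

impl fmt::Display for SessionStatus {
    /// Renders the new, more descriptive label.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let label = match self {
            Self::Pending => "Pending BGP Session",
            Self::Initializing => "Initializing BGP Session",
            Self::Down => "BGP Session Down",
            Self::Up => "BGP Session Up",
            Self::Failed => "BGP Session Failed",
            Self::NetworkUnreachable => "Network Unreachable",
        };
        f.write_str(label)
    }
}
```

A consumer that still receives old labels could call `from_legacy` first and fall back to matching the new strings, which is one way to stay compatible with clients that have not been upgraded yet.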

## Summary of Changes

* Adds as_id/as_ip method to IpOrId enum. Returns Option<>

## Testing Verification

* Adds unit tests
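Accessors of this kind usually follow the standard "return `Some` for the matching variant" pattern. The sketch below assumes a two-variant enum with `Ipv4Addr` and `u16` payloads; the real `IpOrId` definition is not shown in this PR, so the variant names and payload types here are hypothetical.

```rust
use std::net::Ipv4Addr;

/// Hypothetical shape of the enum; payload types are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpOrId {
    Ip(Ipv4Addr),
    Id(u16),
}

impl IpOrId {
    /// Returns the address when this value is the `Ip` variant.
    pub fn as_ip(&self) -> Option<Ipv4Addr> {
        match self {
            IpOrId::Ip(ip) => Some(*ip),
            IpOrId::Id(_) => None,
        }
    }

    /// Returns the identifier when this value is the `Id` variant.
    pub fn as_id(&self) -> Option<u16> {
        match self {
            IpOrId::Id(id) => Some(*id),
            IpOrId::Ip(_) => None,
        }
    }
}
```

Returning `Option` lets callers chain with `ok_or(...)` or `if let` instead of matching on the enum at every call site.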

## Summary of Changes

* SetGlobalConfig instruction changed to create global resource accounts
* ActivateDevice instruction changed to create device resource accounts
* UpdateDevice instruction changed to create/update device resource accounts
* CloseAccountDevice instruction changed to close device resource accounts
* SDK updated for these instructions
* Added a resource close cli command/instruction

## Testing Verification

* Tests updated
* New tests for smart contract as well as sdk commands

Closes #2623
Closes #2624
Closes #2625

…requests_total metric (#2674)

## Summary of Changes

* e2e: add prometheus container and validate controller_grpc_getconfig_requests_total metric
* controller and device-health-oracle containers now include alloy, which forwards metrics to prometheus container
* e2e/main_test.go validates that each device published controller_grpc_getconfig_requests_total metrics
* Required for rfcs/rfc-network-provisioning.md

## Summary of Changes

* e2e: check for old status up string for backward compatibility
  - This avoids failures due to the QA tests expecting the new status string even though the clients on QA hosts have not yet been upgraded to a version that outputs the new string

## Testing Verification

* Ran a successful test from the command line

# Summary

Fix #2401
Fix #2404

Implement on-chain resource allocation for User activation and deallocation via ResourceExtension bitmaps, making the activator stateless for User lifecycle operations.

Changes:

- ActivateUser: optionally allocate tunnel_net, tunnel_id, dz_ip from ResourceExtension bitmaps (8-account layout) or use legacy args (5-account)
- CloseAccountUser: optionally deallocate resources back to bitmaps (9-account layout) or use legacy behavior (6-account)
- Extend authorization to allow foundation_allowlist members
- DZ IP allocation follows UserType logic (IBRL uses client_ip, others allocate)
- SDK commands add use_onchain_allocation/use_onchain_deallocation flags
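As a rough illustration of bitmap-based allocation in general, the sketch below hands out the lowest free slot and accepts it back on deallocation. It is a generic technique demo, not the ResourceExtension account layout: `ResourceBitmap` and its methods are hypothetical names, and how a slot index maps to a tunnel_net, tunnel_id, or dz_ip is left out.

```rust
/// Hypothetical fixed-capacity bitmap: bit i set means slot i is in use.
#[derive(Debug, Clone)]
pub struct ResourceBitmap {
    words: Vec<u64>,
    capacity: usize,
}

impl ResourceBitmap {
    /// Creates a bitmap with `capacity` free slots.
    pub fn new(capacity: usize) -> Self {
        Self {
            words: vec![0; (capacity + 63) / 64],
            capacity,
        }
    }

    /// Marks the lowest free slot as used and returns its index,
    /// or `None` when every slot is taken.
    pub fn allocate(&mut self) -> Option<usize> {
        for (w, word) in self.words.iter_mut().enumerate() {
            if *word != u64::MAX {
                let bit = (!*word).trailing_zeros() as usize;
                let index = w * 64 + bit;
                if index >= self.capacity {
                    // Only padding bits past the capacity remain.
                    return None;
                }
                *word |= 1u64 << bit;
                return Some(index);
            }
        }
        None
    }

    /// Frees a slot; returns false if it was out of range or already free.
    pub fn deallocate(&mut self, index: usize) -> bool {
        if index >= self.capacity {
            return false;
        }
        let mask = 1u64 << (index % 64);
        let word = &mut self.words[index / 64];
        let was_set = *word & mask != 0;
        *word &= !mask;
        was_set
    }
}
```

Because all allocation state lives in the bitmap itself, whoever drives allocation does not need to remember which slots are taken, which matches the stated goal of a stateless activator.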

## Summary of Changes

* device-health-oracle: calculate burn-in period
* Required by rfcs/rfc12-network-provisioning.md

## Testing Verification

* Unit tests

Resolves: #2485

## Summary of Changes

* Added logic to `tunnel.tmpl` that shuts down user BGP, IBGP sessions, MSDP neighbors, and ISIS when `device.status` is `Drained`
* Added test case and fixture file to verify drained device configuration
* This implements the device maintenance workflow from RFC-12 (Network Provisioning Framework)

## Testing Verification

* Added unit test `render_drained_device_config_successfully` that verifies shutdown commands are included in rendered config when device status is Drained

I went through the classic process of draining and undraining a device by establishing an IBRL tunnel from one of our bm hosts to a dn dzd. This was the verification:

```
(base) ubuntu@chi-dn-bm1:~$ doublezero status
 Tunnel Status | Last Session Update | Tunnel Name | Tunnel Src | Tunnel Dst | Doublezero IP | User Type | Current Device | Lowest Latency Device | Metro | Network
 unknown | 1970-01-01 00:00:00 UTC | doublezero0 | 137.174.145.138 | 100.0.0.1 | 137.174.145.138 | IBRL | chi-dn-dzd1 | N/A | Chicago | devnet
(base) ubuntu@chi-dn-bm1:~$
----------------------------------------------------------------------
(base) ubuntu@chi-dn-bm1:~$ doublezero latency
 Pubkey | Code | IP | Min | Max | Avg | reachable
 JATksU22Uc6uwJ5bQvEisf3XWFJAtJrdh3n7eSNmrK7C | test123 | 1.2.3.4 | 0.00ms | 0.00ms | 0.00ms | false
 4CkvmyquGN4qtXLNj3hpJcqYbb7PCanLbU1rQHHdp6xp | chi-dn-dzd3-delete-me | 0.1.2.3 | 0.00ms | 0.00ms | 0.00ms | false
 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r | chi-dn-dzd3 | 100.0.0.33 | 0.00ms | 0.00ms | 0.00ms | false
 3JT6EPj4ESTRevv6MadpLYLvijBVDTQXhuHWuZzFgNfV | dz-test | 1.2.3.7 | 0.00ms | 0.00ms | 0.00ms | false
 7g6TT8RU2iBKaWzAxBx87S4aYq5HMztTA1vedQmMpREZ | test789 | 1.2.3.6 | 0.00ms | 0.00ms | 0.00ms | false
 7sk4SevuKLWNDLDjCy8m9bMk9MtXPDxmL5TQrchDPeca | chi-dn-dzd4 | 100.0.0.49 | 0.00ms | 0.00ms | 0.00ms | false
 Cu9n4EreVz2iUieSAyLxbLMtcKCTzggLomn68oUge5ww | test456 | 1.2.3.5 | 0.00ms | 0.00ms | 0.00ms | false
 3cSe5iowxN1tzTXKHS9DH8PofiyuHyLg5X3GXD5aF6ri | chi-dn-dzd2 | 100.0.0.17 | 0.00ms | 0.00ms | 0.00ms | false
----------------------------------------------------------------------
chi-dn-dzd1#show running-config | section isis
schedule isis-upload interval 360 timeout 2 max-log-files 100 command bash /mnt/flash/upload-wrapper.sh
management api models provider isis link-state-database
interface Loopback255
   isis enable 1
interface Loopback256
   isis enable 1
interface Switch1/11/2
   isis enable 1
   isis circuit-type level-2
   isis hello-interval 1
   isis metric 1000
   isis hello padding
   isis network point-to-point
interface Switch1/11/4
   isis enable 1
   isis circuit-type level-2
   isis hello-interval 1
   isis metric 1000
   isis hello padding
   isis network point-to-point
interface Switch1/12/3
   isis circuit-type level-2
   isis metric 1
   isis network point-to-point
router isis 1
   net 49.0000.ac10.0006.0000.00
   router-id ipv4 172.16.0.6
   log-adjacency-changes
   set-overload-bit
   !
   address-family ipv4 unicast
   !
   segment-routing mpls
      no shutdown
```

After I undrain the device, `set-overload-bit` is no longer there:

```
router isis 1
   net 49.0000.ac10.0006.0000.00
   router-id ipv4 172.16.0.6
   log-adjacency-changes
   !
   address-family ipv4 unicast
   !
   segment-routing mpls
      no shutdown
```

# Title

Enforce Activated status before suspend + add negative tests

# Summary

This PR makes the suspend logic in the serviceability program a bit stricter and more consistent.

Right now, some *_suspend handlers don’t check the current status of the entity before allowing a suspend. That means it’s possible to try to suspend something that is still Pending or already Suspended, while link/suspend.rs already enforces that only Activated entities can be suspended.

Here I align all suspend handlers with that same rule and add one negative test per entity type to make sure we don’t regress in the future.

Closes: #2221.

## What changed

1. Enforce Activated before suspend

For the following handlers:

- `location/suspend.rs`
- `exchange/suspend.rs`
- `contributor/suspend.rs`
- `device/suspend.rs`
- `user/suspend.rs`
- `multicastgroup/suspend.rs`

I now:

- Deserialize the entity
- Check that `entity.status == EntityStatus::Activated`
- Return `DoubleZeroError::InvalidStatus` otherwise (with a small debug log under `#[cfg(test)]`)

The check follows the same pattern already used in link/suspend.rs:

```
if entity.status != EntityStatus::Activated {
    #[cfg(test)]
    msg!("{:?}", entity);
    return Err(DoubleZeroError::InvalidStatus.into());
}
```

This doesn’t change behaviour for valid suspends (entities already Activated). It just tightens validation when the status is not valid for a suspend.

2. New negative tests

I added one negative test per entity type to make the new checks explicit and guarded by tests:

- `test_suspend_location_from_suspended_fails`
- `test_suspend_exchange_from_suspended_fails`
- `test_suspend_contributor_from_suspended_fails`
- `test_suspend_device_from_pending_fails`
- `test_suspend_user_from_suspended_fails`
- `test_suspend_multicastgroup_from_pending_fails`

Each test:

- Puts the entity in an invalid state for suspend (Pending or Suspended depending on the case),
- Calls the corresponding *_suspend processor,
- Asserts that it fails with `DoubleZeroError::InvalidStatus`.

This way each handler’s status check is covered and future changes will be forced to keep this behaviour.

3. Fix in device activation test setup

While working on the tests, I also fixed the account list passed to device activation. device/activate expects:

- `device`
- `globalstate`
- `payer`
- `system_program`

The previous test setup was missing some of these, so it didn’t really mirror the on-chain invocation. I updated the test to provide the full, correct account list. This doesn’t change behaviour, but it makes the test more realistic and closer to what actually happens at runtime.

# Testing

From the smartcontract workspace:

```
# During development: specific suspend tests
# (test-harness flags such as --test-threads go after the `--` separator)
cargo test -p doublezero-serviceability --test user_tests test_suspend -- --test-threads=1

# All suspend-related tests
# (the name filter is a substring match, not a regex, so this runs every test whose name contains "test_suspend")
cargo test -p doublezero-serviceability test_suspend -- --test-threads=1

# Full test suite for the serviceability crate
cargo test -p doublezero-serviceability

# Lint
cargo clippy -p doublezero-serviceability
```

# Results

- All 6 new negative tests pass
- All existing tests in doublezero-serviceability pass
- cargo clippy clean (no new warnings)

## Summary of Changes

- Fix global monitor release workflow `go.mod` config

## Testing Verification

* Show evidence of testing the change

…-users (#2683)

Resolves: #2486

## Summary of Changes

Added QA identity allowlist to bypass device capacity and status checks for QA testing:

**Smart Contract (Solana)**:

- Added `qa_allowlist: Vec<Pubkey>` to `GlobalState`
- Created `AddQaAllowlist` and `RemoveQaAllowlist` instructions
- Modified `user/create.rs` and `user/create_subscribe.rs` to bypass `device.status` and `max_users` checks when payer is in `qa_allowlist`
- Only `foundation_allowlist` members can manage the `qa_allowlist`

**CLI (`doublezero`)**:

- Added `doublezero global-config qa-allowlist {list|add|remove}` commands to manage QA allowlist

**E2E Tests**:

- Added `--skip-capacity-check` flag to `ValidDevices()` function
- When enabled, skips client-side device capacity filtering
- Flag defaults to `false` for backward compatibility

**Two-Phase Rollout**:

1. **Phase 1 (this PR)**: Merge with flag disabled - tests run as before
2. **Phase 2 (post-deployment)**: Add QA host pubkeys to allowlist, then enable `--skip-capacity-check` in CI workflows

## Testing Verification

```
cd /home/martin/Documents/malbec/doublezero && cargo run -p doublezero -- global-config qa-allowlist --help 2>&1
Manage the QA allowlist

Usage: doublezero global-config qa-allowlist [OPTIONS] <COMMAND>

Commands:
  list    List QA allowlist
  add     Add a pubkey to the QA allowlist
  remove  Remove a pubkey from the QA allowlist
  help    Print this message or the help of the given subcommand(s)

Options:
  -e, --env <ENV>                DZ env (testnet, devnet, or mainnet-beta)
      --url <RPC_URL>            DZ ledger RPC URL
      --ws <WEBSOCKET_URL>       DZ ledger WebSocket URL
      --program-id <PROGRAM_ID>  DZ program ID (testnet or devnet)
      --keypair <KEYPAIR>        Path to the keypair file
      --no-version-warning       Suppress version warning output
  -h, --help                     Print help
```
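The bypass described under "Smart Contract" can be pictured with the sketch below. It is a simplified model, not the code in `user/create.rs`: `Pubkey` is a stand-in byte array, `Device` reduces status to a boolean, and `check_user_create` and `CreateUserError` are hypothetical names. Only the rule itself (payers in `qa_allowlist` skip the `device.status` and `max_users` checks) comes from the summary.

```rust
/// Stand-in for a Solana public key in this sketch.
pub type Pubkey = [u8; 32];

/// Simplified global state holding only the QA allowlist.
pub struct GlobalState {
    pub qa_allowlist: Vec<Pubkey>,
}

/// Simplified device: `activated` stands in for the device.status check.
pub struct Device {
    pub activated: bool,
    pub users_count: u16,
    pub max_users: u16,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum CreateUserError {
    InvalidStatus,
    MaxUsersExceeded,
}

/// Applies the status and capacity checks unless the payer is on the
/// QA allowlist, in which case both checks are skipped.
pub fn check_user_create(
    state: &GlobalState,
    device: &Device,
    payer: &Pubkey,
) -> Result<(), CreateUserError> {
    if state.qa_allowlist.contains(payer) {
        return Ok(());
    }
    if !device.activated {
        return Err(CreateUserError::InvalidStatus);
    }
    if device.users_count >= device.max_users {
        return Err(CreateUserError::MaxUsersExceeded);
    }
    Ok(())
}
```

Putting the allowlist test first keeps the normal path unchanged for every other payer, which fits the two-phase rollout: nothing changes until pubkeys are actually added to the list.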

thijsvanemmerik added the skip-changelog label Jan 15, 2026

thijsvanemmerik requested a review from ben-malbeclabs January 19, 2026 16:39

ben-malbeclabs reviewed Jan 20, 2026

rfcs/rfc8-contributor-incident-maintenance-logging.md (outdated, resolved)

ben-malbeclabs reviewed Jan 20, 2026

rfcs/rfc8-contributor-incident-maintenance-logging.md (outdated, resolved)

ben-malbeclabs reviewed Jan 27, 2026

rfcs/rfc8-contributor-incident-maintenance-logging.md (outdated, resolved)

ben-malbeclabs approved these changes Feb 2, 2026

thijsvanemmerik force-pushed the rfc8-enhancements branch 4 times, most recently from 216edcf to 67ef180 Compare February 3, 2026 15:32

thijsvanemmerik and others added 20 commits February 3, 2026 16:38

Updated RFC8 to current status

8a4e622

RFC8 - update to current status

cdf1d38

gnmi-writer: add gnmiPath field to document testdata source paths (#2641)

5dee40d

gnmi-writer: use interface-level ifindex path (#2645)

28468e0

docs: clarify DZ env shorthand options in SetConfigCliCommand (#2589)

90ac979

Rename ReadyForService LinkStatus field to Provisioning (#2661)

7df3e9a

Run telemetry agent on pending and drained (#2619)

8378919

elitegreg and others added 12 commits February 3, 2026 16:38

Resource Extension: add methods for getting ip/id off IpOrId enum (#2670)

ed0bd1e

device-health-oracle: calculate burn-in period (#2672)

13ed6ab

global-monitor: fix release workflow (#2686)

592bbb0

Updated Severity Levels

9b6af10

update severity section

6364a59

thijsvanemmerik force-pushed the rfc8-enhancements branch from 67ef180 to 6364a59 Compare February 3, 2026 16:42

Merge main into rfc8-enhancements

1fb892a
