@thijsvanemmerik

  • Storage: Airtable → PostgreSQL
  • Slack channels: #dz-contributor-incidents / #dz-contributor-maintenance (no @mentions in MVP)
  • Contributor Onboarding: CLI ops-manager setup → wallet connect (Phantom/Solflare/Coinbase) → API keys → create records
  • Permission Model: Contributor keys (own devices only) vs Admin keys (network-wide); DZX link ownership by A-side
  • New fields: root_cause (incidents only), assignee, internal_reference
  • Status Lifecycle: Full incident/maintenance status flows with definitions
  • Severity Levels: sev1/sev2/sev3 defined
  • Root Cause Codes: 14 codes for incident resolution
  • Other: UTC timestamps, human-readable device/link dropdowns, simplified status tables


@ben-malbeclabs ben-malbeclabs left a comment

LGTM

@thijsvanemmerik thijsvanemmerik force-pushed the rfc8-enhancements branch 4 times, most recently from 216edcf to 67ef180 Compare February 3, 2026 15:32
thijsvanemmerik and others added 20 commits February 3, 2026 16:38
## Summary of Changes
This renames the ClickHouse column and Go struct fields from device_code
to device_pubkey to match the data actually being stored (a Solana
pubkey), and updates the test data targets to use fake pubkey
values.

## Testing Verification
Existing unit/integration tests pass.
This pull request introduces new health and desired status fields for
both devices and links in the `serviceability/state.go` file, enhancing
the ability to track and serialize their operational states. These
changes add new enums, fields, and JSON serialization logic to support
more granular state management.

**Device state enhancements:**
- Added new enums: `DeviceHealth` and `DeviceDesiredStatus`, with
corresponding string values and JSON serialization methods.
- Updated the `Device` struct to include `DeviceHealth` and
`DeviceDesiredStatus` fields.
- Modified the `Device` JSON serialization to output the new health and
desired status fields.
[[1]](diffhunk://#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1R399-R400)
[[2]](diffhunk://#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1R420-R421)
- Added a new `DeviceStatusDrained` status to the existing
`DeviceStatus` enum.

**Link state enhancements:**
- Added new enums: `LinkHealth` and `LinkDesiredStatus`, with
corresponding string values and JSON serialization methods.
- Updated the `Link` struct to include `LinkHealth` and
`LinkDesiredStatus` fields.
- Modified the `Link` JSON serialization to output the new health and
desired status fields.
[[1]](diffhunk://#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1R557-R558)
[[2]](diffhunk://#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1R569-R570)

---------

Co-authored-by: [email protected] <[email protected]>
## Summary of Changes
- Start gNMI tunnel client in telemetry agent, that connects to a gRPC
tunnel server and proxies sessions to the local gNMI service
- Enables collection of device state via gNMI
- Closes #2633

## Testing Verification
- Added test coverage for the new code
- Deployed to and validated in devnet
#2638)

## Summary of Changes
- add tools/gnmi-prototext-convert to convert raw gnmi get responses to
subscribe response format for testdata files
- update tests to handle interfaces with multiple subinterfaces
- document conversion tool in readme

## Testing Verification
```
  === RUN   TestUnmarshal_InterfacesIfindex
      processor_test.go:810: successfully unmarshalled 198 interfaces
  --- PASS: TestUnmarshal_InterfacesIfindex (0.15s)
  PASS
  ok    github.com/malbeclabs/doublezero/telemetry/gnmi-writer/internal/gnmi    0.693s

  === RUN   TestIntegration
  === RUN   TestIntegration/InterfaceIfindex
      processor_integration_test.go:294: published 25018 bytes to topic gnmi-notifications
      msg="wrote records to clickhouse" count=202
      msg="processed notifications" count=202
      processor_integration_test.go:442: found 202 interface ifindex records
  --- PASS: TestIntegration/InterfaceIfindex (15.31s)
  PASS
  ```
)

## Summary of Changes
This is mainly a documentation update so we track the gnmi path that
generated the associated prototext file used for integration tests.

## Testing Verification
Existing tests pass.
#2646)

## Summary of Changes
Add three new gNMI telemetry collections:
- transceiver_state: optical power metrics (input/output power, laser
bias current)
- interface_state: admin/oper status and interface counters
- transceiver_thresholds: alarm thresholds per severity level

Changes:
- regenerate oc package with openconfig-platform-transceiver module
- add record types, extractors, and ClickHouse schemas for each
collection
- add integration tests and isolation unit tests
- document extractor ordering requirements in code and README

## Testing Verification
```
Unit Tests (Isolation):
  === RUN   TestExtractTransceiverState_Isolation
      processor_test.go:864: extracted 52 transceiver state records
  --- PASS: TestExtractTransceiverState_Isolation (0.15s)
  === RUN   TestExtractInterfaceState_Isolation
      processor_test.go:925: extracted 198 interface state records
  --- PASS: TestExtractInterfaceState_Isolation (0.16s)
  === RUN   TestExtractTransceiverThresholds_Isolation
      processor_test.go:974: extracted 600 transceiver threshold records
  --- PASS: TestExtractTransceiverThresholds_Isolation (0.16s)
  PASS

  Integration Tests (end-to-end with ClickHouse + Redpanda containers):
  --- PASS: TestIntegration (45.85s)
      --- PASS: TestIntegration/TransceiverState (16.44s)    → 52 records written to ClickHouse
      --- PASS: TestIntegration/InterfaceState (14.52s)      → 198 records written to ClickHouse
      --- PASS: TestIntegration/TransceiverThresholds (14.89s) → 600 records written to ClickHouse
  PASS
```
This pull request introduces several significant improvements and
changes to device and link lifecycle management, telemetry, and CLI
commands. The most notable updates include the addition of health
management and status controls for devices and links, new CLI commands
for internal health operations, and changes to device/link provisioning
and activation flows. There are also some test fixture and e2e test
updates to align with the new status model.

**Device and Link Lifecycle & Status Management:**

* Introduced health management for Devices and Links, adding explicit
health states, authorized health updates, and related enhancements to
state, processors, and tests.
* Added support for a `desired-status` parameter when creating and
updating devices and links, allowing explicit control over their
activation state. This is reflected in both the code and e2e tests.
[[1]](diffhunk://#diff-06572a96a58dc510037d5efa622f9bec8519bc1beab13c9f251e97e657a9d4edR76-R86)
F6d596d0L61R61,
[[2]](diffhunk://#diff-3c1eb3950d7513dd8bea87ae37cce2238dec3aedc2512fe2bef786c7213c7fc5L219-R224)
* Updated device and link event processors to handle new provisioning
and activation states, including logging and state transitions (e.g.,
`DeviceProvisioning`, `ReadyForService`).
[[1]](diffhunk://#diff-0aed512054e126f7dc8b0bffbc2aad2bd5e2aade35d7e902e08138cc36f94341L39-R39)
[[2]](diffhunk://#diff-0aed512054e126f7dc8b0bffbc2aad2bd5e2aade35d7e902e08138cc36f94341R53-R67)
[[3]](diffhunk://#diff-d9c91ea17e4fe3e26302887a60fd0596f950bcc4949005b71ae4fbf2acfc5ebbL64-R64)

**CLI Command Changes:**

* Added hidden/internal CLI commands for setting the health status of
device and link interfaces (`SetDeviceHealthCliCommand`,
`SetLinkHealthCliCommand`). These are intended for operational or
testing use and not exposed in the public CLI surface.
[[1]](diffhunk://#diff-f5900422e2e0b00392068b9cb949ab7ad2a8f162e87546f952cfe0212a6c518aL12-R12)
[[2]](diffhunk://#diff-f5900422e2e0b00392068b9cb949ab7ad2a8f162e87546f952cfe0212a6c518aL62-R70)
[[3]](diffhunk://#diff-02e4eb858ebd20f998f552ba9c1116fa4d51afd1be85dbf71d3af49247a37e1aL2-R5)
[[4]](diffhunk://#diff-02e4eb858ebd20f998f552ba9c1116fa4d51afd1be85dbf71d3af49247a37e1aL51-R55)
[[5]](diffhunk://#diff-9b18c350b4f1b096b7187215002603fceaaa72368cbe6b66dacfd105ea8eace2R180)
[[6]](diffhunk://#diff-9b18c350b4f1b096b7187215002603fceaaa72368cbe6b66dacfd105ea8eace2R193)
* Removed the public `Suspend` and `Resume` device commands from both
the user and admin CLIs, streamlining device status management.
[[1]](diffhunk://#diff-f5900422e2e0b00392068b9cb949ab7ad2a8f162e87546f952cfe0212a6c518aL12-R12)
[[2]](diffhunk://#diff-f5900422e2e0b00392068b9cb949ab7ad2a8f162e87546f952cfe0212a6c518aL62-R70)
[[3]](diffhunk://#diff-4c553a5c541ace8b1cfe447e3d968ee7c71feb6357037b95d18b9665f174fd14L12-L13)
[[4]](diffhunk://#diff-4c553a5c541ace8b1cfe447e3d968ee7c71feb6357037b95d18b9665f174fd14L62-L67)
[[5]](diffhunk://#diff-9b18c350b4f1b096b7187215002603fceaaa72368cbe6b66dacfd105ea8eace2L172-L173)
[[6]](diffhunk://#diff-901dffbcc0f09778ec585863ab718fec67ebd602c6fb8a98c47907d6bbf71f0cL133-L134)

**Testing and Fixture Updates:**

* Updated e2e tests to use the new `--desired-status activated` flag
when creating devices and links, ensuring that test resources start in
the correct state. (F6d596d0L61R61,
[e2e/device_telemetry_test.goL219-R224](diffhunk://#diff-3c1eb3950d7513dd8bea87ae37cce2238dec3aedc2512fe2bef786c7213c7fc5L219-R224))
* Adjusted IBRL agent configuration test fixtures to use more realistic
IS-IS metrics and interface states, improving test coverage and
accuracy.
[[1]](diffhunk://#diff-45da870eae10769bbc547bec872ab640e8bf1bfd290aeb24c7569d254de3adfcL94-R94)
[[2]](diffhunk://#diff-45da870eae10769bbc547bec872ab640e8bf1bfd290aeb24c7569d254de3adfcL108-R108)
[[3]](diffhunk://#diff-9136bc38a39707583a0387e3b75ddbb531ae5b1da05374e80ff26f8d0ebcb042L94-R94)
[[4]](diffhunk://#diff-9136bc38a39707583a0387e3b75ddbb531ae5b1da05374e80ff26f8d0ebcb042L108-R108)
[[5]](diffhunk://#diff-04dd3c3c0bcf9f5f4d7718b5133038dd573c7a00a1a0c876e859370bc5d69af3L94-R94)
[[6]](diffhunk://#diff-04dd3c3c0bcf9f5f4d7718b5133038dd573c7a00a1a0c876e859370bc5d69af3L108-R108)
[[7]](diffhunk://#diff-0c75c0e0e5b0c9fc5e214749f9cb1a8e7c3c04f9caf31876e3ae2d215a7a531bL94-R94)
[[8]](diffhunk://#diff-0c75c0e0e5b0c9fc5e214749f9cb1a8e7c3c04f9caf31876e3ae2d215a7a531bL108-R108)

**Changelog and Documentation:**

* Updated the `CHANGELOG.md` to reflect the introduction of health
management, desired status for devices and links, and telemetry
improvements for entities in provisioning/draining states.
[[1]](diffhunk://#diff-06572a96a58dc510037d5efa622f9bec8519bc1beab13c9f251e97e657a9d4edR55-R58)
[[2]](diffhunk://#diff-06572a96a58dc510037d5efa622f9bec8519bc1beab13c9f251e97e657a9d4edR76-R86)

These changes collectively improve the robustness, operational control,
and observability of device and link management in the system.

## Testing Verification
* All tests pass.
## Summary of Changes
* DeactivateMulticastGroup now only succeeds when both publisher_count
== 0 and subscriber_count == 0 (in addition to status == Deleting).
* If either count is non-zero, the instruction fails and the account is
not closed.
* Regression test added: New test verifies deactivation fails when
publisher/subscriber counts are non-zero and passes when they’re zero.
* Updated Changelog.
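
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: `MulticastGroup`, `MulticastGroupStatus`, `GuardError`, and `check_can_deactivate` are hypothetical stand-ins, and only the field names `publisher_count` and `subscriber_count` and the `Deleting` status come from the description.

```rust
/// Hypothetical, trimmed-down stand-in for the on-chain status enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MulticastGroupStatus {
    Pending,
    Activated,
    Suspended,
    Deleting,
}

/// Hypothetical error type used only in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum GuardError {
    InvalidStatus,
    StillInUse,
}

/// Hypothetical account layout holding just the fields the guard reads.
#[derive(Debug)]
pub struct MulticastGroup {
    pub status: MulticastGroupStatus,
    pub publisher_count: u32,
    pub subscriber_count: u32,
}

/// Returns Ok(()) only when the account may be closed: the group must be
/// in Deleting status and have no remaining publishers or subscribers.
pub fn check_can_deactivate(group: &MulticastGroup) -> Result<(), GuardError> {
    if group.status != MulticastGroupStatus::Deleting {
        return Err(GuardError::InvalidStatus);
    }
    if group.publisher_count != 0 || group.subscriber_count != 0 {
        return Err(GuardError::StillInUse);
    }
    Ok(())
}
```

Returning early on the first failed condition leaves the account untouched, which matches the "instruction fails and the account is not closed" behaviour.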

## Testing Verification
* Existing rust + lint checks pass.

Closes #2219

---------

Signed-off-by: ANISH-SR <[email protected]>
Co-authored-by: Juan Olveira <[email protected]>
## Summary of Changes
* Now only allows reactivation when multicastgroup.status is Suspended,
otherwise returns InvalidStatus.
* New test ensures reactivation fails when the group is not Suspended.
* Changelog.md updated.

## Testing Verification
* Rust checks + lint green.

Closes #2227

Signed-off-by: ANISH-SR <[email protected]>
…are empty + add regression test (#2635)

## Summary of Changes
* Added a safety guard to process_closeaccount_user so a user account
can only be closed when both publishers and subscribers are empty.
* Added a regression test.
* Updated CHANGELOG.md

## Testing Verification
* Existing rust checks + lint checks pass.

Resolves #2218

Signed-off-by: ANISH-SR <[email protected]>
Co-authored-by: Juan Olveira <[email protected]>
## Summary of Changes
This fixes two issues:
- Devices send transceiver descriptions as separate gnmi update
messages, which were being parsed as individual records and written as a
duplicate row with zero values.
- Transceiver thresholds are sent as individual updates for each
threshold field. Each update was being written as its own row and they
are now aggregated into a single row per interface/severity.
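
The aggregation idea can be illustrated with a small sketch. The gnmi-writer itself is a Go module, so the Rust below shows only the folding logic, not the real implementation. `ThresholdUpdate`, `ThresholdRows`, and `aggregate_thresholds` are hypothetical names, and carrying the threshold field as a string is an assumption made for brevity.

```rust
use std::collections::BTreeMap;

/// One partial update as it might arrive from the device: a single
/// threshold field for one interface/severity pair (illustrative shape).
pub struct ThresholdUpdate {
    pub interface_name: String,
    pub severity: String,
    pub field: String, // e.g. "input_power_lower"
    pub value: f64,
}

/// Key: (interface_name, severity). Value: field name -> threshold value.
pub type ThresholdRows = BTreeMap<(String, String), BTreeMap<String, f64>>;

/// Folds per-field updates into one row per interface/severity instead of
/// emitting one mostly-zero row per update.
pub fn aggregate_thresholds(updates: Vec<ThresholdUpdate>) -> ThresholdRows {
    let mut rows = ThresholdRows::new();
    for u in updates {
        rows.entry((u.interface_name, u.severity))
            .or_default()
            .insert(u.field, u.value);
    }
    rows
}
```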

## Testing Verification
Transceiver state before the fix, containing a duplicate row with zero values:
```
 53. │ 2026-01-16 04:08:51.197958931 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1      │             0 │       -1.95 │        -2.35 │               6.25 │
 54. │ 2026-01-16 04:08:51.197958931 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1      │             0 │           0 │            0 │                  0 │
```
Transceiver state now:
```
SELECT *
FROM transceiver_state_latest
WHERE (device_pubkey = '9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r') AND (interface_name = 'Ethernet1')

   ┌─────────────────────timestamp─┬─device_pubkey────────────────────────────────┬─interface_name─┬─channel_index─┬─input_power─┬─output_power─┬─laser_bias_current─┐
1. │ 2026-01-16 04:20:56.145597378 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1      │             0 │       -1.94 │        -2.33 │               6.25 │
   └───────────────────────────────┴──────────────────────────────────────────────┴────────────────┴───────────────┴─────────────┴──────────────┴────────────────────┘
```

Transceiver thresholds before:
```
41551. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │                   0 │                   0 │                        0 │                        0 │                        0 │                        0 │                 2.97 │                    0 │
41552. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │                   0 │                   0 │                        0 │                        0 │                        0 │                        0 │                    0 │   3.6300000000000003 │
41553. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │                   0 │                   0 │                        2 │                        0 │                        0 │                        0 │                    0 │                    0 │
41554. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │                   0 │                   0 │                        0 │                       14 │                        0 │                        0 │                    0 │                    0 │
41555. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │ -13.904055907747797 │                   0 │                   0 │                   0 │                        0 │                        0 │                        0 │                        0 │                    0 │                    0 │
41556. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │   3.000082025538127 │                   0 │                   0 │                        0 │                        0 │                        0 │                        0 │                    0 │                    0 │
41557. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │ -11.301817920206716 │                   0 │                        0 │                        0 │                        0 │                        0 │                    0 │                    0 │
41558. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │                   0 │   3.000082025538127 │                        0 │                        0 │                        0 │                        0 │                    0 │                    0 │
41559. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │                   0 │                   0 │                        0 │                        0 │                       -5 │                        0 │                    0 │                    0 │
41560. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54     │ CRITICAL │                   0 │                   0 │                   0 │                   0 │                        0 │                        0 │                        0 │                       75 │                    0 │                    0 │
```

 Transceiver thresholds now (single record per interface/severity):
```
SELECT *
FROM transceiver_thresholds_latest
WHERE (device_pubkey = '9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r') AND (interface_name = 'Ethernet54') AND (severity = 'CRITICAL')

Row 1:
──────
timestamp:                2026-01-16 04:24:56.624986659
device_pubkey:            9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r
interface_name:           Ethernet54
severity:                 CRITICAL
input_power_lower:        -13.904055907747797
input_power_upper:        3.000082025538127
output_power_lower:       -11.301817920206716
output_power_upper:       3.000082025538127
laser_bias_current_lower: 2
laser_bias_current_upper: 14
module_temperature_lower: -5
module_temperature_upper: 75
supply_voltage_lower:     2.97
supply_voltage_upper:     3.6300000000000003
```
Resolves: #2660

## Summary of Changes
* Rename ReadyForService LinkStatus field to Provisioning
# Summary

Fix #2657

This PR fixes the bug by calling the `contains` method in the
`ip_to_index` helper function, which ensures the IP is actually within
the pool's range. Without this check, deleting an interface with an
out-of-pool IP caused the activator to panic and fail to restart.
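
A minimal sketch of the kind of check described, assuming a simple IPv4 pool type. `Ipv4Pool` and its fields are hypothetical; only the names `contains` and `ip_to_index` come from the description, and the real helper's signature may differ. The sketch assumes `prefix_len` is at most 32.

```rust
use std::net::Ipv4Addr;

/// Hypothetical IPv4 pool described by its network address and prefix length.
pub struct Ipv4Pool {
    pub network: Ipv4Addr,
    pub prefix_len: u8, // assumed to be in 0..=32
}

impl Ipv4Pool {
    /// Network mask for the pool; a /0 pool is handled separately because
    /// shifting a u32 by 32 bits would overflow.
    fn mask(&self) -> u32 {
        if self.prefix_len == 0 {
            0
        } else {
            u32::MAX << (32 - u32::from(self.prefix_len))
        }
    }

    /// True when `ip` falls inside the pool's range.
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        let mask = self.mask();
        (u32::from(ip) & mask) == (u32::from(self.network) & mask)
    }

    /// Offset of `ip` within the pool, or None for an out-of-pool address.
    /// Without the `contains` check, the subtraction below could underflow
    /// (and panic in a debug build) for addresses below the pool base.
    pub fn ip_to_index(&self, ip: Ipv4Addr) -> Option<usize> {
        if !self.contains(ip) {
            return None;
        }
        let base = u32::from(self.network) & self.mask();
        Some((u32::from(ip) - base) as usize)
    }
}
```

Returning `Option` pushes the out-of-pool case to the caller, which can then skip deallocation instead of crashing.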

## Without Fix

```
    activator_interface_delete_test.go:89:
        	Error Trace:	/home/rahul/malbec-labs/doublezero/e2e/activator_interface_delete_test.go:89
        	Error:      	Should be true
        	Test:       	TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running
        	Messages:   	activator container is not running - it likely crashed
--- FAIL: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP (181.13s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/create_loopback_with_out_of_pool_ip (0.53s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/wait_for_interface_activation (14.01s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/delete_interface_with_out_of_pool_ip (5.53s)
    --- FAIL: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running (0.00s)
FAIL
exit status 1
FAIL	github.com/malbeclabs/doublezero/e2e	227.649s
make: *** [Makefile:21: test] Error 1
```

## With Fix

```
2026-01-17 01:37:41 INF ==> Verifying activator container is still running
2026-01-17 01:37:41 INF --> Activator is still running
=== RUN   TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_interface_removed
2026-01-17 01:37:41 INF ==> Verifying interface is removed from chain
2026-01-17 01:37:49 INF --> Interface successfully removed from chain
--- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP (190.70s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/create_loopback_with_out_of_pool_ip (0.53s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/wait_for_interface_activation (14.01s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/delete_interface_with_out_of_pool_ip (6.04s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running (0.00s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_interface_removed (8.01s)
PASS
ok  	github.com/malbeclabs/doublezero/e2e	240.147s
```
## Summary of Changes
* e2e: add influxdb and device-health-oracle containers
* These containers will be used to validate functionality described in
rfcs/rfc12-device-provisioning.md

## Testing Verification
* Includes a check to verify that each device is writing to the InfluxDB
intfCounters table
Resolves: #2513

## Summary of Changes
* Run telemetry agent on pending and drained links

## Testing Verification
* Added tests
…2630)

Resolves: #2622

## Summary of Changes
* Removed unknown status
* Renamed status labels to be more descriptive:
  * pending → Pending BGP Session
  * initializing → Initializing BGP Session
  * down → BGP Session Down
  * up → BGP Session Up
* Added BGP Session Failed (TCP connected but BGP handshake timed out
after 5 seconds)
* Added Network Unreachable (TCP connection failed, likely firewall
issue)
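
The renamed labels can be summarised as a small mapping. This is only an illustrative sketch: `SessionStatus`, `label`, and `from_legacy` are hypothetical names, and the client code that actually produces these strings may be structured differently. Only the label strings and the old status names are taken from the list above.

```rust
/// Hypothetical status enum; only the label strings come from the change list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionStatus {
    Pending,
    Initializing,
    Down,
    Up,
    Failed,      // TCP connected but the BGP handshake timed out
    Unreachable, // TCP connection failed
}

impl SessionStatus {
    /// New, more descriptive label for each status.
    pub fn label(self) -> &'static str {
        match self {
            SessionStatus::Pending => "Pending BGP Session",
            SessionStatus::Initializing => "Initializing BGP Session",
            SessionStatus::Down => "BGP Session Down",
            SessionStatus::Up => "BGP Session Up",
            SessionStatus::Failed => "BGP Session Failed",
            SessionStatus::Unreachable => "Network Unreachable",
        }
    }

    /// Maps an old status string to the new enum. "unknown" was removed,
    /// so it (like any unrecognised string) yields None.
    pub fn from_legacy(s: &str) -> Option<Self> {
        match s {
            "pending" => Some(SessionStatus::Pending),
            "initializing" => Some(SessionStatus::Initializing),
            "down" => Some(SessionStatus::Down),
            "up" => Some(SessionStatus::Up),
            _ => None,
        }
    }
}
```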


## Testing Verification
* Fixed existing tests
* Added new tests
elitegreg and others added 12 commits February 3, 2026 16:38
)

## Summary of Changes
* Adds `as_id`/`as_ip` methods to the `IpOrId` enum; each returns an `Option<>`
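
A sketch of what such accessors typically look like. The variant names and payload types (`Ipv4Addr`, `u16`) are assumptions made for illustration; the real `IpOrId` definition is not shown in this change description.

```rust
use std::net::Ipv4Addr;

/// Assumed shape of the enum: either an IP address or a numeric ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpOrId {
    Ip(Ipv4Addr),
    Id(u16),
}

impl IpOrId {
    /// Some(ip) when this value holds an IP, None otherwise.
    pub fn as_ip(&self) -> Option<Ipv4Addr> {
        match self {
            IpOrId::Ip(ip) => Some(*ip),
            IpOrId::Id(_) => None,
        }
    }

    /// Some(id) when this value holds an ID, None otherwise.
    pub fn as_id(&self) -> Option<u16> {
        match self {
            IpOrId::Id(id) => Some(*id),
            IpOrId::Ip(_) => None,
        }
    }
}
```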

## Testing Verification
* Adds unit tests
## Summary of Changes
* SetGlobalConfig instruction changed to create global resource accounts
* ActivateDevice instruction changed to create device resource accounts
* UpdateDevice instruction changed to create/update device resource
accounts
* CloseAccountDevice instruction changed to close device resource
accounts
* SDK updated for these instructions
* Added a resource close cli command/instruction

## Testing Verification
* Tests updated
* New tests for smart contract as well as sdk commands

Closes #2623
Closes #2624
Closes #2625
…requests_total metric (#2674)

## Summary of Changes
* e2e: add prometheus container and validate
controller_grpc_getconfig_requests_total metric
* controller and device-health-oracle containers now include alloy,
which forwards metrics to prometheus container
* e2e/main_test.go validates that each device published
controller_grpc_getconfig_requests_total metrics
* Required for rfcs/rfc-network-provisioning.md
## Summary of Changes
* e2e: check for old status up string for backward compatibility
- This avoids failures due to the QA tests expecting the new status
string even though the clients on QA hosts have not yet been upgraded to
a version that outputs the new string

## Testing Verification
* Ran a successful test from the command line
# Summary

Fix #2401
Fix #2404

Implement on-chain resource allocation for User activation and
deallocation via ResourceExtension bitmaps, making the activator
stateless for User lifecycle operations.

Changes:
- ActivateUser: optionally allocate tunnel_net, tunnel_id, dz_ip from
ResourceExtension bitmaps (8-account layout) or use legacy args
(5-account)
- CloseAccountUser: optionally deallocate resources back to bitmaps
(9-account layout) or use legacy behavior (6-account)
- Extend authorization to allow foundation_allowlist members
- DZ IP allocation follows UserType logic (IBRL uses client_ip, others
allocate)
- SDK commands add use_onchain_allocation/use_onchain_deallocation flags
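
To illustrate bitmap-based allocation in general terms, here is a minimal sketch. `ResourceBitmap` and its methods are hypothetical and are not the `ResourceExtension` layout. It shows only the idea that a set bit marks an allocated slot, so allocation and deallocation need no state outside the bitmap.

```rust
/// Hypothetical fixed-capacity bitmap; bit i set means slot i is allocated.
pub struct ResourceBitmap {
    words: Vec<u64>,
    capacity: usize,
}

impl ResourceBitmap {
    pub fn new(capacity: usize) -> Self {
        Self { words: vec![0; (capacity + 63) / 64], capacity }
    }

    /// Claims the lowest free slot and returns its index, or None when full.
    pub fn allocate(&mut self) -> Option<usize> {
        for (w, word) in self.words.iter_mut().enumerate() {
            if *word != u64::MAX {
                // Lowest zero bit in this word.
                let bit = (!*word).trailing_zeros() as usize;
                let index = w * 64 + bit;
                if index >= self.capacity {
                    // Only padding bits past the capacity are free.
                    return None;
                }
                *word |= 1u64 << bit;
                return Some(index);
            }
        }
        None
    }

    /// Frees a slot. Returns false if the index is out of range or the
    /// slot was not allocated.
    pub fn deallocate(&mut self, index: usize) -> bool {
        if index >= self.capacity {
            return false;
        }
        let mask = 1u64 << (index % 64);
        let word = &mut self.words[index / 64];
        let was_set = (*word & mask) != 0;
        *word &= !mask;
        was_set
    }
}
```

In this sketch an allocated index would be mapped to a concrete resource (for example, an offset into a tunnel network or ID range) by the caller; that mapping is outside what the bitmap itself tracks.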
## Summary of Changes
* device-health-oracle: calculate burn-in period
* Required by rfcs/rfc12-network-provisioning.md

## Testing Verification
* Unit tests
Resolves: #2485

## Summary of Changes
* Added logic to `tunnel.tmpl` that shuts down user BGP, IBGP sessions,
MSDP neighbors, and ISIS when `device.status` is `Drained`
* Added test case and fixture file to verify drained device
configuration
* This implements the device maintenance workflow from RFC-12 (Network
Provisioning Framework)

## Testing Verification
* Added unit test `render_drained_device_config_successfully` that
verifies shutdown commands are included in rendered config when device
status is Drained

I went through the classic process of draining and undraining a device
by establishing an IBRL tunnel from one of our bm hosts to a dn dzd.
This was the verification:

```
(base) ubuntu@chi-dn-bm1:~$ doublezero status
 Tunnel Status | Last Session Update     | Tunnel Name | Tunnel Src      | Tunnel Dst | Doublezero IP   | User Type | Current Device | Lowest Latency Device | Metro   | Network 
 unknown       | 1970-01-01 00:00:00 UTC | doublezero0 | 137.174.145.138 | 100.0.0.1  | 137.174.145.138 | IBRL      | chi-dn-dzd1    | N/A                   | Chicago | devnet  
(base) ubuntu@chi-dn-bm1:~$ 

----------------------------------------------------------------------

(base) ubuntu@chi-dn-bm1:~$ doublezero latency
 Pubkey                                       | Code                  | IP         | Min    | Max    | Avg    | reachable 
 JATksU22Uc6uwJ5bQvEisf3XWFJAtJrdh3n7eSNmrK7C | test123               | 1.2.3.4    | 0.00ms | 0.00ms | 0.00ms | false     
 4CkvmyquGN4qtXLNj3hpJcqYbb7PCanLbU1rQHHdp6xp | chi-dn-dzd3-delete-me | 0.1.2.3    | 0.00ms | 0.00ms | 0.00ms | false     
 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r | chi-dn-dzd3           | 100.0.0.33 | 0.00ms | 0.00ms | 0.00ms | false     
 3JT6EPj4ESTRevv6MadpLYLvijBVDTQXhuHWuZzFgNfV | dz-test               | 1.2.3.7    | 0.00ms | 0.00ms | 0.00ms | false     
 7g6TT8RU2iBKaWzAxBx87S4aYq5HMztTA1vedQmMpREZ | test789               | 1.2.3.6    | 0.00ms | 0.00ms | 0.00ms | false     
 7sk4SevuKLWNDLDjCy8m9bMk9MtXPDxmL5TQrchDPeca | chi-dn-dzd4           | 100.0.0.49 | 0.00ms | 0.00ms | 0.00ms | false     
 Cu9n4EreVz2iUieSAyLxbLMtcKCTzggLomn68oUge5ww | test456               | 1.2.3.5    | 0.00ms | 0.00ms | 0.00ms | false     
 3cSe5iowxN1tzTXKHS9DH8PofiyuHyLg5X3GXD5aF6ri | chi-dn-dzd2           | 100.0.0.17 | 0.00ms | 0.00ms | 0.00ms | false  

----------------------------------------------------------------------

chi-dn-dzd1#show running-config | section isis
schedule isis-upload interval 360 timeout 2 max-log-files 100 command bash /mnt/flash/upload-wrapper.sh
management api models
   provider isis
      link-state-database
interface Loopback255
   isis enable 1
interface Loopback256
   isis enable 1
interface Switch1/11/2
   isis enable 1
   isis circuit-type level-2
   isis hello-interval 1
   isis metric 1000
   isis hello padding
   isis network point-to-point
interface Switch1/11/4
   isis enable 1
   isis circuit-type level-2
   isis hello-interval 1
   isis metric 1000
   isis hello padding
   isis network point-to-point
interface Switch1/12/3
   isis circuit-type level-2
   isis metric 1
   isis network point-to-point
router isis 1
   net 49.0000.ac10.0006.0000.00
   router-id ipv4 172.16.0.6
   log-adjacency-changes
   set-overload-bit
   !
   address-family ipv4 unicast
   !
   segment-routing mpls
      no shutdown
```

After undraining the device, `set-overload-bit` is no longer present:
```
router isis 1
   net 49.0000.ac10.0006.0000.00
   router-id ipv4 172.16.0.6
   log-adjacency-changes
   !
   address-family ipv4 unicast
   !
   segment-routing mpls
      no shutdown
```
# Title

Enforce Activated status before suspend + add negative tests
# Summary

This PR makes the suspend logic in the serviceability program a bit
stricter and more consistent.

Right now, some *_suspend handlers don’t check the current status of the
entity before allowing a suspend. That means it’s possible to try to
suspend something that is still Pending or already Suspended, while
link/suspend.rs already enforces that only Activated entities can be
suspended.

Here I align all suspend handlers with that same rule and add one
negative test per entity type to make sure we don’t regress in the
future.

Closes #2221.

## What changed
1. Enforce Activated before suspend

For the following handlers:

`location/suspend.rs`

`exchange/suspend.rs`

`contributor/suspend.rs`

`device/suspend.rs`

`user/suspend.rs`

`multicastgroup/suspend.rs`

I now:

Deserialize the entity

Check that `entity.status == EntityStatus::Activated`

Return `DoubleZeroError::InvalidStatus` otherwise (with a small debug
log under `#[cfg(test)]`)

The check follows the same pattern already used in link/suspend.rs:
```
if entity.status != EntityStatus::Activated {
    #[cfg(test)]
    msg!("{:?}", entity);
    return Err(DoubleZeroError::InvalidStatus.into());
}
```


This doesn’t change behaviour for valid suspends (entities already
Activated). It just tightens validation when the status is not valid for
a suspend.

2. New negative tests

I added one negative test per entity type to make the new checks
explicit and guarded by tests:

`test_suspend_location_from_suspended_fails`

`test_suspend_exchange_from_suspended_fails`

`test_suspend_contributor_from_suspended_fails`

`test_suspend_device_from_pending_fails`

`test_suspend_user_from_suspended_fails`

`test_suspend_multicastgroup_from_pending_fails`

Each test:

Puts the entity in an invalid state for suspend (Pending or Suspended
depending on the case),

Calls the corresponding *_suspend processor,

Asserts that it fails with `DoubleZeroError::InvalidStatus`.

This way each handler’s status check is covered and future changes will
be forced to keep this behaviour.
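
The shape of a handler check plus its negative test can be sketched in a self-contained form. The types below are simplified stand-ins, not the program's real account types, and the transition to `Suspended` on success is an assumption. The real tests drive the processors through the program test harness instead of calling a function directly.

```rust
/// Simplified stand-in for the entity status enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityStatus {
    Pending,
    Activated,
    Suspended,
}

/// Simplified stand-in for the program error type.
#[derive(Debug, PartialEq, Eq)]
pub enum DoubleZeroError {
    InvalidStatus,
}

/// Simplified stand-in for any suspendable entity.
#[derive(Debug)]
pub struct Entity {
    pub status: EntityStatus,
}

/// Suspends the entity only if it is currently Activated.
pub fn process_suspend(entity: &mut Entity) -> Result<(), DoubleZeroError> {
    if entity.status != EntityStatus::Activated {
        return Err(DoubleZeroError::InvalidStatus);
    }
    entity.status = EntityStatus::Suspended; // assumed success transition
    Ok(())
}

/// Negative test: suspending a Pending entity must fail and leave it unchanged.
#[test]
fn test_suspend_from_pending_fails() {
    let mut entity = Entity { status: EntityStatus::Pending };
    assert_eq!(process_suspend(&mut entity), Err(DoubleZeroError::InvalidStatus));
    assert_eq!(entity.status, EntityStatus::Pending);
}
```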

3. Fix in device activation test setup

While working on the tests, I also fixed the account list passed to
device activation.

device/activate expects:

`device`

`globalstate`

`payer`

`system_program`

The previous test setup was missing some of these, so it didn’t really
mirror the on-chain invocation. I updated the test to provide the full,
correct account list.

This doesn’t change behaviour, but it makes the test more realistic and
closer to what actually happens at runtime.

Testing

From the smartcontract workspace:
```
# During development: specific suspend tests
cargo test -p doublezero-serviceability --test user_tests test_suspend -- --test-threads=1

# All suspend-related tests, including the new negative ones
# (the test filter is a substring match, not a regex)
cargo test -p doublezero-serviceability test_suspend -- --test-threads=1

# Full test suite for the serviceability crate
cargo test -p doublezero-serviceability

# Lint
cargo clippy -p doublezero-serviceability
```

# Results:

All 6 new negative tests pass

All existing tests in doublezero-serviceability pass

cargo clippy clean (no new warnings)
## Summary of Changes
- Fix global monitor release workflow `go.mod` config

## Testing Verification
* Show evidence of testing the change
…-users (#2683)

Resolves: #2486

## Summary of Changes

Added QA identity allowlist to bypass device capacity and status checks
for QA testing:

**Smart Contract (Solana)**:
- Added `qa_allowlist: Vec<Pubkey>` to `GlobalState`
- Created `AddQaAllowlist` and `RemoveQaAllowlist` instructions
- Modified `user/create.rs` and `user/create_subscribe.rs` to bypass
`device.status` and `max_users` checks when payer is in `qa_allowlist`
- Only `foundation_allowlist` members can manage the `qa_allowlist`
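
The bypass logic can be sketched as below. This is a simplified illustration under stated assumptions: `Pubkey` is a plain byte-array stand-in for the Solana type, and `Device`, `users_count`, `CreateUserError`, and `check_user_create` are hypothetical. Only `qa_allowlist`, `GlobalState`, `device.status`, and `max_users` are named in the description.

```rust
/// Stand-in for the Solana pubkey type (assumption for this sketch).
pub type Pubkey = [u8; 32];

/// Only the field relevant to this sketch.
pub struct GlobalState {
    pub qa_allowlist: Vec<Pubkey>,
}

/// Hypothetical, trimmed-down device status enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceStatus {
    Pending,
    Activated,
    Drained,
}

/// Hypothetical device view with just the fields the checks read.
pub struct Device {
    pub status: DeviceStatus,
    pub users_count: u16,
    pub max_users: u16,
}

/// Hypothetical error type used only in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum CreateUserError {
    InvalidStatus,
    MaxUsersExceeded,
}

/// Applies the device status and capacity checks unless the payer is on
/// the QA allowlist, in which case both checks are skipped.
pub fn check_user_create(
    globalstate: &GlobalState,
    device: &Device,
    payer: &Pubkey,
) -> Result<(), CreateUserError> {
    if globalstate.qa_allowlist.contains(payer) {
        return Ok(());
    }
    if device.status != DeviceStatus::Activated {
        return Err(CreateUserError::InvalidStatus);
    }
    if device.users_count >= device.max_users {
        return Err(CreateUserError::MaxUsersExceeded);
    }
    Ok(())
}
```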

**CLI (`doublezero`)**:
- Added `doublezero global-config qa-allowlist {list|add|remove}`
commands to manage QA allowlist

**E2E Tests**:
- Added `--skip-capacity-check` flag to `ValidDevices()` function
- When enabled, skips client-side device capacity filtering
- Flag defaults to `false` for backward compatibility

**Two-Phase Rollout**:
1. **Phase 1 (this PR)**: Merge with flag disabled - tests run as before
2. **Phase 2 (post-deployment)**: Add QA host pubkeys to allowlist, then
enable `--skip-capacity-check` in CI workflows

## Testing Verification
```
cd /home/martin/Documents/malbec/doublezero && cargo run -p doublezero -- global-config qa-allowlist --help 2>&1

Manage the QA allowlist

Usage: doublezero global-config qa-allowlist [OPTIONS] <COMMAND>

Commands:
  list    List QA allowlist
  add     Add a pubkey to the QA allowlist
  remove  Remove a pubkey from the QA allowlist
  help    Print this message or the help of the given subcommand(s)

Options:
  -e, --env <ENV>                DZ env (testnet, devnet, or mainnet-beta)
      --url <RPC_URL>            DZ ledger RPC URL
      --ws <WEBSOCKET_URL>       DZ ledger WebSocket URL
      --program-id <PROGRAM_ID>  DZ program ID (testnet or devnet)
      --keypair <KEYPAIR>        Path to the keypair file
      --no-version-warning       Suppress version warning output
  -h, --help                     Print help
```