Updated RFC8 to current status #2637
Open
thijsvanemmerik wants to merge 33 commits into main from rfc8-enhancements
Contributor
thijsvanemmerik
commented
Jan 15, 2026
- Storage: Airtable → PostgreSQL
- Slack channels: #dz-contributor-incidents / #dz-contributor-maintenance (no @mentions in MVP)
- Contributor Onboarding: CLI ops-manager setup → wallet connect (Phantom/Solflare/Coinbase) → API keys → create records
- Permission Model: Contributor keys (own devices only) vs Admin keys (network-wide); DZX link ownership by A-side
- New fields: root_cause (incidents only), assignee, internal_reference
- Status Lifecycle: Full incident/maintenance status flows with definitions
- Severity Levels: sev1/sev2/sev3 defined
- Root Cause Codes: 14 codes for incident resolution
- Other: UTC timestamps, human-readable device/link dropdowns, simplified status tables
ben-malbeclabs
approved these changes
Feb 2, 2026
Contributor
ben-malbeclabs
left a comment
LGTM
216edcf to
67ef180
Compare
## Summary of Changes
This renames the ClickHouse column and Go struct fields from device_code to device_pubkey to align with the actual data being stored (a Solana pubkey), and updates the test data targets to use fake pubkey values.

## Testing Verification
Existing unit/integration tests pass.
This pull request introduces new health and desired status fields for both devices and links in the `serviceability/state.go` file, enhancing the ability to track and serialize their operational states. These changes add new enums, fields, and JSON serialization logic to support more granular state management.

**Device state enhancements:**
- Added new enums: `DeviceHealth` and `DeviceDesiredStatus`, with corresponding string values and JSON serialization methods.
- Updated the `Device` struct to include `DeviceHealth` and `DeviceDesiredStatus` fields.
- Modified the `Device` JSON serialization to output the new health and desired status fields.
- Added a new `DeviceStatusDrained` status to the existing `DeviceStatus` enum.

**Link state enhancements:**
- Added new enums: `LinkHealth` and `LinkDesiredStatus`, with corresponding string values and JSON serialization methods.
- Updated the `Link` struct to include `LinkHealth` and `LinkDesiredStatus` fields.
- Modified the `Link` JSON serialization to output the new health and desired status fields.

---------

Co-authored-by: [email protected] <[email protected]>
## Summary of Changes
- Start a gNMI tunnel client in the telemetry agent that connects to a gRPC tunnel server and proxies sessions to the local gNMI service
- Enables collection of device state via gNMI
- Closes #2633

## Testing Verification
- Added test coverage for the new code
- Deployed to and validated in devnet
#2638)
## Summary of Changes
- add tools/gnmi-prototext-convert to convert raw gnmi get responses to subscribe response format for testdata files
- update tests to handle interfaces with multiple subinterfaces
- document conversion tool in readme

## Testing Verification
```
=== RUN   TestUnmarshal_InterfacesIfindex
    processor_test.go:810: successfully unmarshalled 198 interfaces
--- PASS: TestUnmarshal_InterfacesIfindex (0.15s)
PASS
ok  	github.com/malbeclabs/doublezero/telemetry/gnmi-writer/internal/gnmi	0.693s
=== RUN   TestIntegration
=== RUN   TestIntegration/InterfaceIfindex
    processor_integration_test.go:294: published 25018 bytes to topic gnmi-notifications
    msg="wrote records to clickhouse" count=202
    msg="processed notifications" count=202
    processor_integration_test.go:442: found 202 interface ifindex records
--- PASS: TestIntegration/InterfaceIfindex (15.31s)
PASS
```
#2646)
## Summary of Changes
Add three new gNMI telemetry collections:
- transceiver_state: optical power metrics (input/output power, laser bias current)
- interface_state: admin/oper status and interface counters
- transceiver_thresholds: alarm thresholds per severity level

Changes:
- regenerate oc package with openconfig-platform-transceiver module
- add record types, extractors, and ClickHouse schemas for each collection
- add integration tests and isolation unit tests
- document extractor ordering requirements in code and README

## Testing Verification
```
Unit Tests (Isolation):
=== RUN   TestExtractTransceiverState_Isolation
    processor_test.go:864: extracted 52 transceiver state records
--- PASS: TestExtractTransceiverState_Isolation (0.15s)
=== RUN   TestExtractInterfaceState_Isolation
    processor_test.go:925: extracted 198 interface state records
--- PASS: TestExtractInterfaceState_Isolation (0.16s)
=== RUN   TestExtractTransceiverThresholds_Isolation
    processor_test.go:974: extracted 600 transceiver threshold records
--- PASS: TestExtractTransceiverThresholds_Isolation (0.16s)
PASS

Integration Tests (end-to-end with ClickHouse + Redpanda containers):
--- PASS: TestIntegration (45.85s)
    --- PASS: TestIntegration/TransceiverState (16.44s)
        → 52 records written to ClickHouse
    --- PASS: TestIntegration/InterfaceState (14.52s)
        → 198 records written to ClickHouse
    --- PASS: TestIntegration/TransceiverThresholds (14.89s)
        → 600 records written to ClickHouse
PASS
```
This pull request introduces several significant improvements and changes to device and link lifecycle management, telemetry, and CLI commands. The most notable updates include the addition of health management and status controls for devices and links, new CLI commands for internal health operations, and changes to device/link provisioning and activation flows. There are also some test fixture and e2e test updates to align with the new status model.

**Device and Link Lifecycle & Status Management:**
* Introduced health management for Devices and Links, adding explicit health states, authorized health updates, and related enhancements to state, processors, and tests.
* Added support for a `desired-status` parameter when creating and updating devices and links, allowing explicit control over their activation state. This is reflected in both the code and e2e tests.
* Updated device and link event processors to handle new provisioning and activation states, including logging and state transitions (e.g., `DeviceProvisioning`, `ReadyForService`).

**CLI Command Changes:**
* Added hidden/internal CLI commands for setting the health status of device and link interfaces (`SetDeviceHealthCliCommand`, `SetLinkHealthCliCommand`). These are intended for operational or testing use and are not exposed in the public CLI surface.
* Removed the public `Suspend` and `Resume` device commands from both the user and admin CLIs, streamlining device status management.

**Testing and Fixture Updates:**
* Updated e2e tests to use the new `--desired-status activated` flag when creating devices and links, ensuring that test resources start in the correct state.
* Adjusted IBRL agent configuration test fixtures to use more realistic IS-IS metrics and interface states, improving test coverage and accuracy.

**Changelog and Documentation:**
* Updated the `CHANGELOG.md` to reflect the introduction of health management, desired status for devices and links, and telemetry improvements for entities in provisioning/draining states.

These changes collectively improve the robustness, operational control, and observability of device and link management in the system.

## Testing Verification
* All tests pass
## Summary of Changes
* DeactivateMulticastGroup now only succeeds when both publisher_count == 0 and subscriber_count == 0 (in addition to status == Deleting).
* If either count is non-zero, the instruction fails and the account is not closed.
* Regression test added: a new test verifies that deactivation fails when publisher/subscriber counts are non-zero and passes when they’re zero.
* Updated Changelog.

## Testing Verification
* Existing rust + lint checks pass.

Closes #2219

---------

Signed-off-by: ANISH-SR <[email protected]>
Co-authored-by: Juan Olveira <[email protected]>
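The guard described in this commit can be illustrated with a minimal sketch. The types below (`MulticastGroup`, `MulticastGroupStatus`, `GuardError`) and the function `check_can_deactivate` are hypothetical stand-ins for illustration only; the actual on-chain structs and error codes are not shown in this conversation and may differ.

```rust
/// Hypothetical, simplified stand-ins for the on-chain types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MulticastGroupStatus {
    Pending,
    Activated,
    Suspended,
    Deleting,
}

/// Hypothetical error type; the real program uses its own error enum.
#[derive(Debug, PartialEq, Eq)]
pub enum GuardError {
    InvalidStatus,
    GroupStillInUse,
}

pub struct MulticastGroup {
    pub status: MulticastGroupStatus,
    pub publisher_count: u32,
    pub subscriber_count: u32,
}

/// Returns Ok(()) only when the group is in `Deleting` status and has
/// no remaining publishers or subscribers.
pub fn check_can_deactivate(group: &MulticastGroup) -> Result<(), GuardError> {
    if group.status != MulticastGroupStatus::Deleting {
        return Err(GuardError::InvalidStatus);
    }
    if group.publisher_count != 0 || group.subscriber_count != 0 {
        // The account must stay open while anything still references it.
        return Err(GuardError::GroupStillInUse);
    }
    Ok(())
}
```

In this sketch the status check runs before the count check, so a group that is not in `Deleting` is rejected regardless of its counts.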
## Summary of Changes
* Now only allows reactivation when multicastgroup.status is Suspended, otherwise returns InvalidStatus.
* New test ensures reactivation fails when the group is not Suspended.
* Changelog.md updated.

## Testing Verification
* Rust checks + lint green.

Closes #2227

Signed-off-by: ANISH-SR <[email protected]>
…are empty + add regression test (#2635)
## Summary of Changes
* Added a safety guard to process_closeaccount_user so a user account can only be closed when both publishers and subscribers are empty.
* Added a regression test.
* Updated CHANGELOG.md

## Testing Verification
* Existing rust checks + lint checks pass.

Resolves #2218

Signed-off-by: ANISH-SR <[email protected]>
Co-authored-by: Juan Olveira <[email protected]>
## Summary of Changes
This fixes two issues:
- Devices send transceiver descriptions as separate gnmi update messages, which were being parsed as individual records and written as a duplicate row with zero values.
- Transceiver thresholds are sent as individual updates for each severity. Each update was being written as its own row; they are now aggregated into a single row per interface/severity.

## Testing Verification
Transceiver state before, containing a duplicate row with zero values:
```
53. │ 2026-01-16 04:08:51.197958931 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1 │ 0 │ -1.95 │ -2.35 │ 6.25 │
54. │ 2026-01-16 04:08:51.197958931 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1 │ 0 │ 0 │ 0 │ 0 │
```

Transceiver state now:
```
SELECT *
FROM transceiver_state_latest
WHERE (device_pubkey = '9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r') AND (interface_name = 'Ethernet1')

   ┌─────────────────────timestamp─┬─device_pubkey────────────────────────────────┬─interface_name─┬─channel_index─┬─input_power─┬─output_power─┬─laser_bias_current─┐
1. │ 2026-01-16 04:20:56.145597378 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet1 │ 0 │ -1.94 │ -2.33 │ 6.25 │
   └───────────────────────────────┴──────────────────────────────────────────────┴────────────────┴───────────────┴─────────────┴──────────────┴────────────────────┘
```

Transceiver thresholds before:
```
41551. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 2.97 │ 0 │
41552. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 3.6300000000000003 │
41553. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 2 │ 0 │ 0 │ 0 │ 0 │ 0 │
41554. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 14 │ 0 │ 0 │ 0 │ 0 │
41555. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ -13.904055907747797 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41556. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 3.000082025538127 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41557. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ -11.301817920206716 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41558. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 3.000082025538127 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
41559. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ -5 │ 0 │ 0 │ 0 │
41560. │ 2026-01-16 03:19:02.908902719 │ 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r │ Ethernet54 │ CRITICAL │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 75 │ 0 │ 0 │
```

Transceiver thresholds now (single record per interface/severity):
```
SELECT *
FROM transceiver_thresholds_latest
WHERE (device_pubkey = '9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r') AND (interface_name = 'Ethernet54') AND (severity = 'CRITICAL')

Row 1:
──────
timestamp:                2026-01-16 04:24:56.624986659
device_pubkey:            9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r
interface_name:           Ethernet54
severity:                 CRITICAL
input_power_lower:        -13.904055907747797
input_power_upper:        3.000082025538127
output_power_lower:       -11.301817920206716
output_power_upper:       3.000082025538127
laser_bias_current_lower: 2
laser_bias_current_upper: 14
module_temperature_lower: -5
module_temperature_upper: 75
supply_voltage_lower:     2.97
supply_voltage_upper:     3.6300000000000003
```
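The aggregation of per-value threshold updates into one row per interface/severity can be sketched as follows. This is only an illustration of the idea: the gnmi-writer in this repository is written in Go, and the names `ThresholdUpdate`, `ThresholdField`, `ThresholdRow`, and `aggregate` are assumptions made for this example, covering only a subset of the threshold columns.

```rust
use std::collections::HashMap;

/// Which threshold a single update carries (subset, for illustration).
#[derive(Debug, Clone, Copy)]
pub enum ThresholdField {
    InputPowerLower,
    InputPowerUpper,
    OutputPowerLower,
    OutputPowerUpper,
}

/// One incoming update with exactly one threshold value (hypothetical shape).
pub struct ThresholdUpdate {
    pub interface_name: String,
    pub severity: String,
    pub field: ThresholdField,
    pub value: f64,
}

/// One aggregated output row; unset thresholds stay `None` instead of 0.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ThresholdRow {
    pub input_power_lower: Option<f64>,
    pub input_power_upper: Option<f64>,
    pub output_power_lower: Option<f64>,
    pub output_power_upper: Option<f64>,
}

/// Merges all updates that share (interface_name, severity) into one row.
pub fn aggregate(updates: &[ThresholdUpdate]) -> HashMap<(String, String), ThresholdRow> {
    let mut rows: HashMap<(String, String), ThresholdRow> = HashMap::new();
    for u in updates {
        let key = (u.interface_name.clone(), u.severity.clone());
        let row = rows.entry(key).or_default();
        match u.field {
            ThresholdField::InputPowerLower => row.input_power_lower = Some(u.value),
            ThresholdField::InputPowerUpper => row.input_power_upper = Some(u.value),
            ThresholdField::OutputPowerLower => row.output_power_lower = Some(u.value),
            ThresholdField::OutputPowerUpper => row.output_power_upper = Some(u.value),
        }
    }
    rows
}
```

Keying the map on the (interface, severity) pair is what collapses the ten partially filled rows shown in the "before" output into the single row shown in the "now" output.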
Resolves: #2660
## Summary of Changes
* Rename ReadyForService LinkStatus field to Provisioning
# Summary
Fix #2657

This PR fixes the bug by calling the `contains` method in the `ip_to_index` helper function, which verifies that the IP is actually within the pool's range. Without this check, deleting an interface whose IP is outside the pool caused the activator to panic and fail to restart.

## Without Fix
```
    activator_interface_delete_test.go:89:
        Error Trace:    /home/rahul/malbec-labs/doublezero/e2e/activator_interface_delete_test.go:89
        Error:          Should be true
        Test:           TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running
        Messages:       activator container is not running - it likely crashed
--- FAIL: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP (181.13s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/create_loopback_with_out_of_pool_ip (0.53s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/wait_for_interface_activation (14.01s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/delete_interface_with_out_of_pool_ip (5.53s)
    --- FAIL: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running (0.00s)
FAIL
exit status 1
FAIL	github.com/malbeclabs/doublezero/e2e	227.649s
make: *** [Makefile:21: test] Error 1
```

## With Fix
```
2026-01-17 01:37:41 INF ==> Verifying activator container is still running
2026-01-17 01:37:41 INF --> Activator is still running
=== RUN   TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_interface_removed
2026-01-17 01:37:41 INF ==> Verifying interface is removed from chain
2026-01-17 01:37:49 INF --> Interface successfully removed from chain
--- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP (190.70s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/create_loopback_with_out_of_pool_ip (0.53s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/wait_for_interface_activation (14.01s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/delete_interface_with_out_of_pool_ip (6.04s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_activator_running (0.00s)
    --- PASS: TestE2E_ActivatorInterfaceDeleteOutOfPoolIP/verify_interface_removed (8.01s)
PASS
ok  	github.com/malbeclabs/doublezero/e2e	240.147s
```
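A minimal sketch of the kind of range check described above, using only the standard library. The `Ipv4Pool` type and its fields are hypothetical and written for this example; the activator's real `ip_to_index` helper and the network type it operates on are not shown here and likely differ. The sketch returns `None` for an out-of-pool address instead of computing an invalid index.

```rust
use std::net::Ipv4Addr;

/// Hypothetical IPv4 pool described by a base address and a prefix length.
pub struct Ipv4Pool {
    pub base: Ipv4Addr,
    pub prefix_len: u8,
}

impl Ipv4Pool {
    /// Network mask for the pool; prefix lengths above 32 are clamped to 32.
    fn mask(&self) -> u32 {
        let len = u32::from(self.prefix_len.min(32));
        if len == 0 {
            0
        } else {
            u32::MAX << (32 - len)
        }
    }

    /// True when `ip` falls inside the pool's address range.
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        let mask = self.mask();
        (u32::from(ip) & mask) == (u32::from(self.base) & mask)
    }

    /// Offset of `ip` from the start of the pool, or `None` when the
    /// address is outside the pool (the case that previously panicked).
    pub fn ip_to_index(&self, ip: Ipv4Addr) -> Option<u32> {
        if !self.contains(ip) {
            return None;
        }
        let network = u32::from(self.base) & self.mask();
        Some(u32::from(ip) - network)
    }
}
```

Returning an `Option` forces the caller to decide what to do with an out-of-pool address, for example skipping the release step, rather than subtracting across ranges and panicking.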
## Summary of Changes
* e2e: add influxdb and device-health-oracle containers
* These containers will be used to validate functionality described in rfcs/rfc12-device-provisioning.md

## Testing Verification
* Includes a check to verify that each device is writing to the Influxdb intfCounters table
Resolves: #2513
## Summary of Changes
* Run telemetry agent on pending and drained links

## Testing Verification
* Added tests
…2630)
Resolves: #2622
## Summary of Changes
* Removed unknown status
* Renamed status labels to be more descriptive:
  * pending → Pending BGP Session
  * initializing → Initializing BGP Session
  * down → BGP Session Down
  * up → BGP Session Up
* Added BGP Session Failed (TCP connected but BGP handshake timed out after 5 seconds)
* Added Network Unreachable (TCP connection failed, likely firewall issue)

## Testing Verification
* Fixed existing tests
* Added new tests
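The renamed labels can be summarized as an enum with a display string per state. This is an illustrative sketch only: the `SessionStatus` enum and the `from_legacy_label` helper are hypothetical names chosen for this example, and the client's actual types are not shown in this conversation. The label strings themselves are taken from the list above.

```rust
use std::fmt;

/// Hypothetical enum covering the status labels listed in this commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionStatus {
    Pending,
    Initializing,
    Down,
    Up,
    Failed,
    NetworkUnreachable,
}

impl fmt::Display for SessionStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let label = match self {
            SessionStatus::Pending => "Pending BGP Session",
            SessionStatus::Initializing => "Initializing BGP Session",
            SessionStatus::Down => "BGP Session Down",
            SessionStatus::Up => "BGP Session Up",
            SessionStatus::Failed => "BGP Session Failed",
            SessionStatus::NetworkUnreachable => "Network Unreachable",
        };
        f.write_str(label)
    }
}

/// Maps a pre-rename label to the new status. The removed `unknown`
/// label has no counterpart and yields `None`.
pub fn from_legacy_label(label: &str) -> Option<SessionStatus> {
    match label {
        "pending" => Some(SessionStatus::Pending),
        "initializing" => Some(SessionStatus::Initializing),
        "down" => Some(SessionStatus::Down),
        "up" => Some(SessionStatus::Up),
        _ => None,
    }
}
```

A helper like `from_legacy_label` is one way a consumer could keep accepting output from clients that have not yet been upgraded to the new strings.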
## Summary of Changes
* SetGlobalConfig instruction changed to create global resource accounts
* ActivateDevice instruction changed to create device resource accounts
* UpdateDevice instruction changed to create/update device resource accounts
* CloseAccountDevice instruction changed to close device resource accounts
* SDK updated for these instructions
* Added a resource close cli command/instruction

## Testing Verification
* Tests updated
* New tests for smart contract as well as sdk commands

Closes #2623
Closes #2624
Closes #2625
…requests_total metric (#2674)
## Summary of Changes
* e2e: add prometheus container and validate controller_grpc_getconfig_requests_total metric
* controller and device-health-oracle containers now include alloy, which forwards metrics to prometheus container
* e2e/main_test.go validates that each device published controller_grpc_getconfig_requests_total metrics
* Required for rfcs/rfc-network-provisioning.md
## Summary of Changes
* e2e: check for old status up string for backward compatibility
  - This avoids failures due to the QA tests expecting the new status string even though the clients on QA hosts have not yet been upgraded to a version that outputs the new string

## Testing Verification
* Ran a successful test from the command line
# Summary
Fix #2401
Fix #2404

Implement on-chain resource allocation for User activation and deallocation via ResourceExtension bitmaps, making the activator stateless for User lifecycle operations.

Changes:
- ActivateUser: optionally allocate tunnel_net, tunnel_id, dz_ip from ResourceExtension bitmaps (8-account layout) or use legacy args (5-account)
- CloseAccountUser: optionally deallocate resources back to bitmaps (9-account layout) or use legacy behavior (6-account)
- Extend authorization to allow foundation_allowlist members
- DZ IP allocation follows UserType logic (IBRL uses client_ip, others allocate)
- SDK commands add use_onchain_allocation/use_onchain_deallocation flags
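To illustrate what allocating and deallocating from a bitmap means in general, here is a small self-contained sketch. It is not the ResourceExtension account layout from this PR, which is not shown in this conversation; the `Bitmap` type and its `allocate`/`deallocate` methods are hypothetical and only demonstrate the technique of tracking used slots as bits so that no off-chain state is needed.

```rust
/// Hypothetical fixed-capacity bitmap allocator; a set bit means "in use".
pub struct Bitmap {
    words: Vec<u64>,
    capacity: usize,
}

impl Bitmap {
    /// Creates a bitmap with `capacity` free slots.
    pub fn new(capacity: usize) -> Self {
        Self {
            words: vec![0; (capacity + 63) / 64],
            capacity,
        }
    }

    /// Marks the lowest free slot as used and returns its index,
    /// or `None` when every slot is taken.
    pub fn allocate(&mut self) -> Option<usize> {
        for (w, word) in self.words.iter_mut().enumerate() {
            if *word != u64::MAX {
                // Lowest zero bit in this word.
                let bit = (!*word).trailing_zeros() as usize;
                let index = w * 64 + bit;
                if index >= self.capacity {
                    // Only padding bits beyond the capacity remain.
                    return None;
                }
                *word |= 1u64 << bit;
                return Some(index);
            }
        }
        None
    }

    /// Frees a slot. Returns true if the slot was previously in use.
    pub fn deallocate(&mut self, index: usize) -> bool {
        if index >= self.capacity {
            return false;
        }
        let (w, bit) = (index / 64, index % 64);
        let was_set = self.words[w] & (1u64 << bit) != 0;
        self.words[w] &= !(1u64 << bit);
        was_set
    }
}
```

In a scheme like this, the returned index would be mapped to a concrete resource (for example an offset into an address block or an ID range), and closing the account would hand the index back with `deallocate`.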
## Summary of Changes
* device-health-oracle: calculate burn-in period
* Required by rfcs/rfc12-network-provisioning.md

## Testing Verification
* Unit tests
Resolves: #2485 ## Summary of Changes * Added logic to `tunnel.tmpl` that shuts down user BGP, IBGP sessions, MSDP neighbors, and ISIS when `device.status` is `Drained` * Added test case and fixture file to verify drained device configuration * This implements the device maintenance workflow from RFC-12 (Network Provisioning Framework) ## Testing Verification * Added unit test `render_drained_device_config_successfully` that verifies shutdown commands are included in rendered config when device status is Drained I went through the classic process of draining and undraining a device by establishing an IBRL tunnel from one of our bm hosts to a dn dzd. This was the verification: ``` (base) ubuntu@chi-dn-bm1:~$ doublezero status Tunnel Status | Last Session Update | Tunnel Name | Tunnel Src | Tunnel Dst | Doublezero IP | User Type | Current Device | Lowest Latency Device | Metro | Network unknown | 1970-01-01 00:00:00 UTC | doublezero0 | 137.174.145.138 | 100.0.0.1 | 137.174.145.138 | IBRL | chi-dn-dzd1 | N/A | Chicago | devnet (base) ubuntu@chi-dn-bm1:~$ ---------------------------------------------------------------------- (base) ubuntu@chi-dn-bm1:~$ doublezero latency Pubkey | Code | IP | Min | Max | Avg | reachable JATksU22Uc6uwJ5bQvEisf3XWFJAtJrdh3n7eSNmrK7C | test123 | 1.2.3.4 | 0.00ms | 0.00ms | 0.00ms | false 4CkvmyquGN4qtXLNj3hpJcqYbb7PCanLbU1rQHHdp6xp | chi-dn-dzd3-delete-me | 0.1.2.3 | 0.00ms | 0.00ms | 0.00ms | false 9rSYq2eyR5sPiu5Ug5bBP8AVNtXf1rD59pbAgrT6Yx5r | chi-dn-dzd3 | 100.0.0.33 | 0.00ms | 0.00ms | 0.00ms | false 3JT6EPj4ESTRevv6MadpLYLvijBVDTQXhuHWuZzFgNfV | dz-test | 1.2.3.7 | 0.00ms | 0.00ms | 0.00ms | false 7g6TT8RU2iBKaWzAxBx87S4aYq5HMztTA1vedQmMpREZ | test789 | 1.2.3.6 | 0.00ms | 0.00ms | 0.00ms | false 7sk4SevuKLWNDLDjCy8m9bMk9MtXPDxmL5TQrchDPeca | chi-dn-dzd4 | 100.0.0.49 | 0.00ms | 0.00ms | 0.00ms | false Cu9n4EreVz2iUieSAyLxbLMtcKCTzggLomn68oUge5ww | test456 | 1.2.3.5 | 0.00ms | 0.00ms | 0.00ms | false 
3cSe5iowxN1tzTXKHS9DH8PofiyuHyLg5X3GXD5aF6ri | chi-dn-dzd2 | 100.0.0.17 | 0.00ms | 0.00ms | 0.00ms | false ---------------------------------------------------------------------- chi-dn-dzd1#show running-config | section isis schedule isis-upload interval 360 timeout 2 max-log-files 100 command bash /mnt/flash/upload-wrapper.sh management api models provider isis link-state-database interface Loopback255 isis enable 1 interface Loopback256 isis enable 1 interface Switch1/11/2 isis enable 1 isis circuit-type level-2 isis hello-interval 1 isis metric 1000 isis hello padding isis network point-to-point interface Switch1/11/4 isis enable 1 isis circuit-type level-2 isis hello-interval 1 isis metric 1000 isis hello padding isis network point-to-point interface Switch1/12/3 isis circuit-type level-2 isis metric 1 isis network point-to-point router isis 1 net 49.0000.ac10.0006.0000.00 router-id ipv4 172.16.0.6 log-adjacency-changes set-overload-bit ! address-family ipv4 unicast ! segment-routing mpls no shutdown ``` After I undrain overload-bit is no longer there: ``` router isis 1 net 49.0000.ac10.0006.0000.00 router-id ipv4 172.16.0.6 log-adjacency-changes ! address-family ipv4 unicast ! segment-routing mpls no shutdown ```
# Title
Enforce Activated status before suspend + add negative tests

# Summary
This PR makes the suspend logic in the serviceability program stricter and more consistent.

Right now, some *_suspend handlers don’t check the current status of the entity before allowing a suspend. That means it’s possible to try to suspend something that is still Pending or already Suspended, while link/suspend.rs already enforces that only Activated entities can be suspended.

Here I align all suspend handlers with that same rule and add one negative test per entity type to make sure we don’t regress in the future.

Closes: #2221.

## What changed
1. Enforce Activated before suspend

For the following handlers:
- `location/suspend.rs`
- `exchange/suspend.rs`
- `contributor/suspend.rs`
- `device/suspend.rs`
- `user/suspend.rs`
- `multicastgroup/suspend.rs`

I now:
- Deserialize the entity
- Check that `entity.status == EntityStatus::Activated`
- Return `DoubleZeroError::InvalidStatus` otherwise (with a small debug log under `#[cfg(test)]`)

The check follows the same pattern already used in link/suspend.rs:
```
if entity.status != EntityStatus::Activated {
    #[cfg(test)]
    msg!("{:?}", entity);
    return Err(DoubleZeroError::InvalidStatus.into());
}
```
This doesn’t change behaviour for valid suspends (entities already Activated). It just tightens validation when the status is not valid for a suspend.

2. New negative tests

I added one negative test per entity type to make the new checks explicit and guarded by tests:
- `test_suspend_location_from_suspended_fails`
- `test_suspend_exchange_from_suspended_fails`
- `test_suspend_contributor_from_suspended_fails`
- `test_suspend_device_from_pending_fails`
- `test_suspend_user_from_suspended_fails`
- `test_suspend_multicastgroup_from_pending_fails`

Each test:
- Puts the entity in an invalid state for suspend (Pending or Suspended depending on the case),
- Calls the corresponding *_suspend processor,
- Asserts that it fails with `DoubleZeroError::InvalidStatus`.

This way each handler’s status check is covered and future changes will be forced to keep this behaviour.

3. Fix in device activation test setup

While working on the tests, I also fixed the account list passed to device activation. device/activate expects:
- `device`
- `globalstate`
- `payer`
- `system_program`

The previous test setup was missing some of these, so it didn’t really mirror the on-chain invocation. I updated the test to provide the full, correct account list. This doesn’t change behaviour, but it makes the test more realistic and closer to what actually happens at runtime.

# Testing
From the smartcontract workspace:
```
# During development: specific suspend tests
# (test-harness flags such as --test-threads go after the `--` separator)
cargo test -p doublezero-serviceability --test user_tests test_suspend -- --test-threads=1
# All suspend-related tests (the test filter is a substring match, not a regex)
cargo test -p doublezero-serviceability test_suspend -- --test-threads=1
# Full test suite for the serviceability crate
cargo test -p doublezero-serviceability
# Lint
cargo clippy -p doublezero-serviceability
```

# Results
- All 6 new negative tests pass
- All existing tests in doublezero-serviceability pass
- cargo clippy clean (no new warnings)
## Summary of Changes
- Fix global monitor release workflow `go.mod` config

## Testing Verification
* Show evidence of testing the change
…-users (#2683)
Resolves: #2486
## Summary of Changes
Added QA identity allowlist to bypass device capacity and status checks for QA testing:

**Smart Contract (Solana)**:
- Added `qa_allowlist: Vec<Pubkey>` to `GlobalState`
- Created `AddQaAllowlist` and `RemoveQaAllowlist` instructions
- Modified `user/create.rs` and `user/create_subscribe.rs` to bypass `device.status` and `max_users` checks when payer is in `qa_allowlist`
- Only `foundation_allowlist` members can manage the `qa_allowlist`

**CLI (`doublezero`)**:
- Added `doublezero global-config qa-allowlist {list|add|remove}` commands to manage QA allowlist

**E2E Tests**:
- Added `--skip-capacity-check` flag to `ValidDevices()` function
- When enabled, skips client-side device capacity filtering
- Flag defaults to `false` for backward compatibility

**Two-Phase Rollout**:
1. **Phase 1 (this PR)**: Merge with flag disabled - tests run as before
2. **Phase 2 (post-deployment)**: Add QA host pubkeys to allowlist, then enable `--skip-capacity-check` in CI workflows

## Testing Verification
```
cd /home/martin/Documents/malbec/doublezero && cargo run -p doublezero -- global-config qa-allowlist --help 2>&1
Manage the QA allowlist

Usage: doublezero global-config qa-allowlist [OPTIONS] <COMMAND>

Commands:
  list    List QA allowlist
  add     Add a pubkey to the QA allowlist
  remove  Remove a pubkey from the QA allowlist
  help    Print this message or the help of the given subcommand(s)

Options:
  -e, --env <ENV>                DZ env (testnet, devnet, or mainnet-beta)
      --url <RPC_URL>            DZ ledger RPC URL
      --ws <WEBSOCKET_URL>       DZ ledger WebSocket URL
      --program-id <PROGRAM_ID>  DZ program ID (testnet or devnet)
      --keypair <KEYPAIR>        Path to the keypair file
      --no-version-warning       Suppress version warning output
  -h, --help                     Print help
```
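The bypass described under "Smart Contract (Solana)" can be sketched roughly as follows. Everything here is a simplified assumption for illustration: `Pubkey` is a plain byte array standing in for the Solana type, `Device` carries only the fields the check needs, and `CreateUserError::MaxUsersExceeded` and `check_can_create_user` are hypothetical names. Only the field name `qa_allowlist` and the checks being bypassed come from the description above.

```rust
/// Stand-in for the Solana `Pubkey` type (assumption for this sketch).
pub type Pubkey = [u8; 32];

/// Simplified global state holding only the QA allowlist.
pub struct GlobalState {
    pub qa_allowlist: Vec<Pubkey>,
}

/// Simplified device: whether it is activated and its user capacity.
pub struct Device {
    pub activated: bool,
    pub users_count: u16,
    pub max_users: u16,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum CreateUserError {
    InvalidStatus,
    MaxUsersExceeded,
}

/// Applies the device status and capacity checks unless the payer is on
/// the QA allowlist, in which case both checks are skipped.
pub fn check_can_create_user(
    global_state: &GlobalState,
    device: &Device,
    payer: &Pubkey,
) -> Result<(), CreateUserError> {
    if global_state.qa_allowlist.contains(payer) {
        return Ok(());
    }
    if !device.activated {
        return Err(CreateUserError::InvalidStatus);
    }
    if device.users_count >= device.max_users {
        return Err(CreateUserError::MaxUsersExceeded);
    }
    Ok(())
}
```

With an empty allowlist the function behaves exactly like the unconditional checks, which matches the Phase 1 rollout where nothing changes until QA pubkeys are added.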
67ef180 to
6364a59
Compare