fix: deflake //rs/execution_environment:dts_test by basvandijk · Pull Request #9312 · dfinity/ic

basvandijk · 2026-03-11T12:10:48Z

Root Cause

The background_sending_thread in UnixStreamMessageWriter (rs/canister_sandbox/src/transport.rs) has a sending loop:

while !buf.is_empty() {
    send_message(&self.socket, &mut buf, &mut fds, 0);
}

When the remote end closes the socket (e.g., sandbox process exits after receiving a TerminateRequest), sendmsg() fails with EPIPE or EBADF and returns -1. However, send_message() only drains the buffer when num_bytes_sent > 0, so on error the buffer is never drained and buf.is_empty() stays false — causing the loop to spin forever.

This stuck background thread prevents the test process from exiting, leading to the 60-second timeout. The "Failed to update sandbox IPC socket timeout: Bad file descriptor (os error 9)" message in the logs is a symptom of the same underlying race — the socket fd becoming invalid when the sandbox process terminates while the controller still holds a reference.

Fix

Check the return value of send_message() in the background sending loop. If it returns a negative value (error), exit the thread immediately since the socket is broken and all subsequent sends would also fail.

Verification

All unit tests in //rs/canister_sandbox:backend_lib_test pass
//rs/execution_environment:dts_test passes with --runs_per_test=3 --jobs=3 (3 parallel runs)
All other affected tests (//rs/replay:player_integration_tests, //rs/replay:replay_test) pass

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Break out of the background sending loop in UnixStreamMessageWriter when send_message() returns an error (negative value). Previously, when the remote end closed the socket (e.g. sandbox process exiting), sendmsg() would fail with EPIPE or EBADF but the buffer was never drained, causing the while-loop to spin forever. This stuck thread prevented the test from completing within the timeout. The fix checks the return value of send_message() and exits the background sending thread when the socket is broken, since all subsequent sends would also fail.

Copilot

Pull request overview

Fixes a hang/flaky timeout in the canister sandbox IPC writer by ensuring the background sending thread stops when the underlying unix socket is broken (e.g., peer exits and sendmsg() fails), preventing an infinite spin loop that can block test process termination.

Changes:

Add error handling in UnixStreamMessageWriter::background_sending_thread to exit the thread when send_message() returns an error.
Document why exiting is necessary to avoid spinning when the buffer isn’t drained on send errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-11T12:15:34Z

rs/canister_sandbox/src/transport.rs

+                if send_message(&self.socket, &mut buf, &mut fds, 0) < 0 {
+                    // The socket is broken (e.g., EPIPE or EBADF because the
+                    // other end closed the connection). There is no point in
+                    // retrying because all subsequent sends will also fail.
+                    // Without this check the loop would spin forever since
+                    // `send_message` does not drain the buffer on error.
+                    return;
+                }


On send error this return exits the background thread without updating shared writer state. In particular, sending_in_background may still be true, so subsequent write_frame() calls will only append to state.buf and never attempt to send or notify (since the background thread is gone), potentially leading to unbounded buffering and stuck IPC writes. Consider taking the mutex before returning to mark the writer as closed/broken (e.g., set quit_requested = true, set sending_in_background = false, and/or store an error flag that write_frame() checks to fail/drop writes).

Copilot · 2026-03-11T12:15:34Z

rs/canister_sandbox/src/transport.rs

+                if send_message(&self.socket, &mut buf, &mut fds, 0) < 0 {
+                    // The socket is broken (e.g., EPIPE or EBADF because the
+                    // other end closed the connection). There is no point in
+                    // retrying because all subsequent sends will also fail.
+                    // Without this check the loop would spin forever since
+                    // `send_message` does not drain the buffer on error.
+                    return;
+                }


This change fixes an important failure mode (send error -> exit thread), but there’s no regression test covering it. Since this file already has unit tests, consider adding a test that closes the peer socket, triggers a background send, and asserts the sender thread doesn’t spin/hang (e.g., the test completes quickly / stop() returns promptly).

github-actions bot added the fix label Mar 11, 2026

basvandijk requested a review from Copilot March 11, 2026 12:11

Copilot started reviewing on behalf of basvandijk March 11, 2026 12:11 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: deflake //rs/execution_environment:dts_test#9312

fix: deflake //rs/execution_environment:dts_test#9312
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-dts_test-2026-03-11

basvandijk commented Mar 11, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 11, 2026

Uh oh!

Copilot AI Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

basvandijk commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root Cause

Fix

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

basvandijk commented Mar 11, 2026 •

edited

Loading