Skip to content

fix: deflake //rs/execution_environment:dts_test#9312

Draft
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-dts_test-2026-03-11
Draft

fix: deflake //rs/execution_environment:dts_test#9312
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-dts_test-2026-03-11

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Mar 11, 2026

Root Cause

The background_sending_thread in UnixStreamMessageWriter (rs/canister_sandbox/src/transport.rs) has a sending loop:

while !buf.is_empty() {
    send_message(&self.socket, &mut buf, &mut fds, 0);
}

When the remote end closes the socket (e.g., sandbox process exits after receiving a TerminateRequest), sendmsg() fails with EPIPE or EBADF and returns -1. However, send_message() only drains the buffer when num_bytes_sent > 0, so on error the buffer is never drained and buf.is_empty() stays false — causing the loop to spin forever.

This stuck background thread prevents the test process from exiting, leading to the 60-second timeout. The "Failed to update sandbox IPC socket timeout: Bad file descriptor (os error 9)" message in the logs is a symptom of the same underlying race — the socket fd becoming invalid when the sandbox process terminates while the controller still holds a reference.

Fix

Check the return value of send_message() in the background sending loop. If it returns a negative value (error), exit the thread immediately since the socket is broken and all subsequent sends would also fail.

Verification

  • All unit tests in //rs/canister_sandbox:backend_lib_test pass
  • //rs/execution_environment:dts_test passes with --runs_per_test=3 --jobs=3 (3 parallel runs)
  • All other affected tests (//rs/replay:player_integration_tests, //rs/replay:replay_test) pass

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Break out of the background sending loop in UnixStreamMessageWriter when
send_message() returns an error (negative value). Previously, when the
remote end closed the socket (e.g. sandbox process exiting), sendmsg()
would fail with EPIPE or EBADF but the buffer was never drained, causing
the while-loop to spin forever. This stuck thread prevented the test
from completing within the timeout.

The fix checks the return value of send_message() and exits the
background sending thread when the socket is broken, since all
subsequent sends would also fail.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a hang/flaky timeout in the canister sandbox IPC writer by ensuring the background sending thread stops when the underlying unix socket is broken (e.g., peer exits and sendmsg() fails), preventing an infinite spin loop that can block test process termination.

Changes:

  • Add error handling in UnixStreamMessageWriter::background_sending_thread to exit the thread when send_message() returns an error.
  • Document why exiting is necessary to avoid spinning when the buffer isn’t drained on send errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +183 to +190
if send_message(&self.socket, &mut buf, &mut fds, 0) < 0 {
// The socket is broken (e.g., EPIPE or EBADF because the
// other end closed the connection). There is no point in
// retrying because all subsequent sends will also fail.
// Without this check the loop would spin forever since
// `send_message` does not drain the buffer on error.
return;
}
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On send error this return exits the background thread without updating shared writer state. In particular, sending_in_background may still be true, so subsequent write_frame() calls will only append to state.buf and never attempt to send or notify (since the background thread is gone), potentially leading to unbounded buffering and stuck IPC writes. Consider taking the mutex before returning to mark the writer as closed/broken (e.g., set quit_requested = true, set sending_in_background = false, and/or store an error flag that write_frame() checks to fail/drop writes).

Copilot uses AI. Check for mistakes.
Comment on lines +183 to +190
if send_message(&self.socket, &mut buf, &mut fds, 0) < 0 {
// The socket is broken (e.g., EPIPE or EBADF because the
// other end closed the connection). There is no point in
// retrying because all subsequent sends will also fail.
// Without this check the loop would spin forever since
// `send_message` does not drain the buffer on error.
return;
}
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fixes an important failure mode (send error -> exit thread), but there’s no regression test covering it. Since this file already has unit tests, consider adding a test that closes the peer socket, triggers a background send, and asserts the sender thread doesn’t spin/hang (e.g., the test completes quickly / stop() returns promptly).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants