fix: deflake //rs/execution_environment:dts_test#9312
fix: deflake //rs/execution_environment:dts_test#9312basvandijk wants to merge 1 commit intomasterfrom
Conversation
Break out of the background sending loop in UnixStreamMessageWriter when send_message() returns an error (negative value). Previously, when the remote end closed the socket (e.g. sandbox process exiting), sendmsg() would fail with EPIPE or EBADF but the buffer was never drained, causing the while-loop to spin forever. This stuck thread prevented the test from completing within the timeout. The fix checks the return value of send_message() and exits the background sending thread when the socket is broken, since all subsequent sends would also fail.
There was a problem hiding this comment.
Pull request overview
Fixes a hang/flaky timeout in the canister sandbox IPC writer by ensuring the background sending thread stops when the underlying unix socket is broken (e.g., peer exits and sendmsg() fails), preventing an infinite spin loop that can block test process termination.
Changes:
- Add error handling in
UnixStreamMessageWriter::background_sending_threadto exit the thread whensend_message()returns an error. - Document why exiting is necessary to avoid spinning when the buffer isn’t drained on send errors.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if send_message(&self.socket, &mut buf, &mut fds, 0) < 0 { | ||
| // The socket is broken (e.g., EPIPE or EBADF because the | ||
| // other end closed the connection). There is no point in | ||
| // retrying because all subsequent sends will also fail. | ||
| // Without this check the loop would spin forever since | ||
| // `send_message` does not drain the buffer on error. | ||
| return; | ||
| } |
There was a problem hiding this comment.
On send error this return exits the background thread without updating shared writer state. In particular, sending_in_background may still be true, so subsequent write_frame() calls will only append to state.buf and never attempt to send or notify (since the background thread is gone), potentially leading to unbounded buffering and stuck IPC writes. Consider taking the mutex before returning to mark the writer as closed/broken (e.g., set quit_requested = true, set sending_in_background = false, and/or store an error flag that write_frame() checks to fail/drop writes).
| if send_message(&self.socket, &mut buf, &mut fds, 0) < 0 { | ||
| // The socket is broken (e.g., EPIPE or EBADF because the | ||
| // other end closed the connection). There is no point in | ||
| // retrying because all subsequent sends will also fail. | ||
| // Without this check the loop would spin forever since | ||
| // `send_message` does not drain the buffer on error. | ||
| return; | ||
| } |
There was a problem hiding this comment.
This change fixes an important failure mode (send error -> exit thread), but there’s no regression test covering it. Since this file already has unit tests, consider adding a test that closes the peer socket, triggers a background send, and asserts the sender thread doesn’t spin/hang (e.g., the test completes quickly / stop() returns promptly).
Root Cause
The
background_sending_threadinUnixStreamMessageWriter(rs/canister_sandbox/src/transport.rs) has a sending loop:When the remote end closes the socket (e.g., sandbox process exits after receiving a
TerminateRequest),sendmsg()fails withEPIPEorEBADFand returns-1. However,send_message()only drains the buffer whennum_bytes_sent > 0, so on error the buffer is never drained andbuf.is_empty()staysfalse— causing the loop to spin forever.This stuck background thread prevents the test process from exiting, leading to the 60-second timeout. The
"Failed to update sandbox IPC socket timeout: Bad file descriptor (os error 9)"message in the logs is a symptom of the same underlying race — the socket fd becoming invalid when the sandbox process terminates while the controller still holds a reference.Fix
Check the return value of
send_message()in the background sending loop. If it returns a negative value (error), exit the thread immediately since the socket is broken and all subsequent sends would also fail.Verification
//rs/canister_sandbox:backend_lib_testpass//rs/execution_environment:dts_testpasses with--runs_per_test=3 --jobs=3(3 parallel runs)//rs/replay:player_integration_tests,//rs/replay:replay_test) passThis PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.