Fix race condition causing crash when setting content filter during spin#1372
Fix race condition causing crash when setting content filter during spin#1372minggangw merged 2 commits intoRobotWebTools:developfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a race condition that was causing crashes when setting content filters on subscriptions while the node is spinning. The fix involves stopping the subscriber node before modifying the content filter, then restarting it afterwards.
Changes:
- Modified C++ error handling to capture error strings before calling cleanup operations
- Updated test cases to stop spinning before calling
setContentFilter()to prevent race conditions - Added
stop()calls at the end of async test promises to clean up properly
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/rcl_subscription_bindings.cpp | Fixed error handling by capturing error string before cleanup operations that may overwrite it |
| test/test-subscription-content-filter.js | Added stop/spin pattern around setContentFilter() calls and cleanup stop() calls in async promises |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/rcl_subscription_bindings.cpp
Outdated
| std::string error_string; | ||
| if (ret != RCL_RET_OK) { | ||
| error_string = rcl_get_error_string().str; | ||
| rcl_reset_error(); | ||
| } |
There was a problem hiding this comment.
The variable error_string is declared without initialization and is only assigned when ret != RCL_RET_OK. If fini_ret != RCL_RET_OK but ret == RCL_RET_OK (line 300-305), the code will reuse this uninitialized error_string. Consider initializing it to an empty string or using a separate variable for the fini_ret error.
|
|
||
| const p2 = new Promise((resolve) => | ||
| setTimeout(() => { | ||
| this.subscriberNode.stop(); |
There was a problem hiding this comment.
This test is stopping the node inside an async promise timeout without restarting it before the test ends. Unlike lines 299-301 and 372-374, this creates an inconsistent pattern where the node remains stopped. This could cause issues if the test framework expects proper cleanup. Consider whether the node should be left spinning or if this is intentional cleanup.
|
|
||
| const p2 = new Promise((resolve) => | ||
| setTimeout(() => { | ||
| this.subscriberNode.stop(); |
There was a problem hiding this comment.
Similar to the previous issue, this test stops the node inside an async promise timeout without restarting it before the test ends. This creates an inconsistent pattern with the stop/spin/stop pattern used around setContentFilter() calls (lines 299-301 and 372-374). Consider whether the node should be left spinning or if this is intentional cleanup.
…pin (#1372) Investigated and resolved an intermittent crash (`SIGABRT` / "pure virtual method called") occurring in the underlying FastDDS layer. **Crash log** ```bash ✔ setContentFilter (2003ms) pure virtual method called terminate called without an active exception Aborted (core dumped) Error: Process completed with exit code 134. ``` **Callstack** ```bash #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x00007cfa4744527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x00007cfa474288ff in __GI_abort () at ./stdlib/abort.c:79 #5 0x00007cfa478a5ff5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95 #6 0x00007cfa478bb0da in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48 #7 0x00007cfa478a5a55 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:58 #8 0x00007cfa478bbfc3 in __cxxabiv1::__cxa_pure_virtual () at ../../../../src/libstdc++-v3/libsupc++/pure.cc:50 #9 0x00007cf9f689e916 in eprosima::fastdds::dds::detail::DataReaderHistory::get_first_untaken_info (this=0x2d825968, info=...) at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/history/DataReaderHistory.cpp:364 #10 0x00007cf9f688d9fb in eprosima::fastdds::dds::DataReaderImpl::get_first_untaken_info (this=<optimized out>, info=info@entry=0x7cf9e57f8d80) at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/DataReaderImpl.cpp:793 #11 0x00007cf9f688a2dd in eprosima::fastdds::dds::DataReader::get_first_untaken_info (this=<optimized out>, info=info@entry=0x7cf9e57f8d80) at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/DataReader.cpp:278 #12 0x00007cfa44644c46 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=<optimized out>, subscriptions=<optimized out>, guard_conditions=<optimized out>, services=0x7cf9dc000bc0, clients=0x7cf9dc000ba8, events=0x7cf9dc000bd8, wait_set=0x7cf9dc000f60, wait_timeout=0x7cf9e57f8ea0) at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:235 #13 0x00007cfa446ac27f in rmw_wait (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, wait_timeout=0x7cf9e57f8ea0) at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33 #14 0x00007cfa44adf742 in rcl_wait (wait_set=0x7cf9e57f9890, timeout=<optimized out>) at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/wait.c:670 #15 0x00007cfa44b54625 in rclnodejs::Executor::WaitForReadyCallbacks(rcl_wait_set_s*, int) () from /home/minggang/proj/rclnodejs/build/Release/rclnodejs.node #16 0x00007cfa44b561aa in rclnodejs::Executor::Run(void*) () from /home/minggang/proj/rclnodejs/build/Release/rclnodejs.node #17 0x00007cfa4749caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447 #18 0x00007cfa47529c6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78 ``` **Root Cause Analysis:** The crash was caused by a race condition between the main JavaScript thread and the background worker thread: 1. The node creates a background thread when `node.spin()` is called to run the `rcl` executor and wait for messages. 2. Calling `subscription.setContentFilter()` from the main thread invokes `rcl_subscription_set_content_filter`, which modifies the underlying DDS DataReader. 3. If the background thread is simultaneously accessing that same DataReader (e.g., in `rcl_wait` or checking for new data), it leads to memory corruption or accessing an object in an invalid state, resulting in the "pure virtual method called" termination. **Fix Implementation:** Updated tests, examples, and documentation to enforce a thread-safe pattern. The node must now explicitly stop the background executor before modifying the content filter, and restart it afterwards: ```javascript node.stop(); // Stop background thread subscription.setContentFilter(filter); // Safe modification node.spin(); // Restart background thread ``` Fix: #1368
Investigated and resolved an intermittent crash (
SIGABRT/ "pure virtual method called") occurring in the underlying FastDDS layer.Crash log
✔ setContentFilter (2003ms) pure virtual method called terminate called without an active exception Aborted (core dumped) Error: Process completed with exit code 134.Callstack
Root Cause Analysis:
The crash was caused by a race condition between the main JavaScript thread and the background worker thread:
node.spin()is called to run therclexecutor and wait for messages.subscription.setContentFilter()from the main thread invokesrcl_subscription_set_content_filter, which modifies the underlying DDS DataReader.rcl_waitor checking for new data), it leads to memory corruption or accessing an object in an invalid state, resulting in the "pure virtual method called" termination.Fix Implementation:
Updated tests, examples, and documentation to enforce a thread-safe pattern. The node must now explicitly stop the background executor before modifying the content filter, and restart it afterwards:
Fix: #1368