Skip to content

Use C++ shared_ptr for IPC file descriptor cleanup#1832

Merged
Andy-Jost merged 4 commits intoNVIDIA:mainfrom
Andy-Jost:fd-handle-ipc
Mar 30, 2026
Merged

Use C++ shared_ptr for IPC file descriptor cleanup#1832
Andy-Jost merged 4 commits intoNVIDIA:mainfrom
Andy-Jost:fd-handle-ipc

Conversation

@Andy-Jost
Copy link
Copy Markdown
Contributor

@Andy-Jost Andy-Jost commented Mar 30, 2026

Summary

When a DeviceMemoryResource is destroyed after Python finalization has begun, its IPC allocation handle (a POSIX file descriptor on Linux) must still be closed. Previously, IPCAllocationHandle.__dealloc__ called os.close() to do this, but during late interpreter shutdown the os module may already be partially torn down, producing cryptic "Error in sys.excepthook" messages about unraisable exceptions.

This PR fixes the problem by managing the file descriptor through a C++ shared_ptr with a custom deleter that calls POSIX close() directly, bypassing Python entirely. The new FileDescriptorHandle follows the same pattern as other handle types in the resource_handles module.

Changes

  • resource_handles.hpp/cpp — add FileDescriptorHandle typedef, create_fd_handle (owning, calls ::close), create_fd_handle_ref (non-owning), as_intptr and as_py overloads
  • _resource_handles.pxd/pyx — declare and expose the new handle type and factory functions
  • _ipc.pxd/pyx — replace int _handle with FileDescriptorHandle _h_fd; close() becomes reset(); remove __dealloc__

Test Plan

  • Full IPC test suite (73 passed, 3 skipped on single-GPU)
  • No sys.excepthook errors after test run
  • CI

Made with Cursor

Replace Python-level os.close() in IPCAllocationHandle with a C++
shared_ptr custom deleter that calls POSIX close() directly, avoiding
unraisable exception errors during late interpreter shutdown.

Made-with: Cursor
@Andy-Jost Andy-Jost added this to the cuda.core v0.7.0 milestone Mar 30, 2026
@Andy-Jost Andy-Jost added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Mar 30, 2026
@Andy-Jost Andy-Jost self-assigned this Mar 30, 2026
@Andy-Jost Andy-Jost requested review from cpcloud, leofang, mdboom, rparolin and rwgk and removed request for leofang March 30, 2026 17:16
@github-actions
Copy link
Copy Markdown

@Andy-Jost Andy-Jost marked this pull request as draft March 30, 2026 17:34
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot bot commented Mar 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

create_fd_handle and create_fd_handle_ref throw std::runtime_error
on Windows since POSIX file descriptors are Linux-only.

Made-with: Cursor
@Andy-Jost Andy-Jost marked this pull request as ready for review March 30, 2026 18:16
Copy link
Copy Markdown
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me (manual review + Cursor).

Cursor GPT-5.3 Codex High flagged: I don’t see a dedicated automated test for the specific regression (“late interpreter shutdown / unraisable os.close path”).

I think that's fine to ignore. I've seen the "Error in sys.excepthook" messages many times, we definitely have plenty of indirect coverage.

Copy link
Copy Markdown
Contributor

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ fd-handle cleanup approach looks correct and preserves the IPC ownership model while avoiding late-interpreter-shutdown issues.

@Andy-Jost Andy-Jost merged commit 3c22874 into NVIDIA:main Mar 30, 2026
88 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.core Everything related to the cuda.core module P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants