perf(profiling): read greenlet frames at sample time via offset discovery#17518
taegyunkim wants to merge 7 commits into main
Conversation
Phase 1 of unwind_greenlets() previously traversed parent chains and allocated a visited set per leaf greenlet while holding greenlet_info_map_lock. This caused update_greenlet_frame() (called on every greenlet switch) to stall, leading to resource exhaustion in gevent applications (e.g. PostgreSQL connection pool growth).

Phase 1 is now a pure O(N) flat copy of greenlet state and the parent map; all chain traversal happens in Phase 2 outside the lock.

Refs SCP-1141 / SCP-1039.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
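The two-phase split described in this commit can be sketched in Python as follows. This is an illustration of the locking pattern only, not the profiler's native code: `greenlet_names`, `parent_map`, and `unwind_chains` are hypothetical stand-ins, and the lock merely borrows the name `greenlet_info_map_lock` from the commit message.

```python
import threading
from typing import Dict, List, Optional

# Hypothetical stand-ins for the profiler's shared greenlet state.
greenlet_info_map_lock = threading.Lock()
greenlet_names: Dict[int, str] = {1: "hub", 2: "worker", 3: "leaf"}
parent_map: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2}


def unwind_chains() -> Dict[int, List[str]]:
    # Phase 1: O(N) flat copy under the lock. No traversal and no
    # per-leaf allocations happen while the lock is held.
    with greenlet_info_map_lock:
        names = dict(greenlet_names)
        parents = dict(parent_map)

    # Phase 2: walk parent chains on the private copies, lock released.
    chains: Dict[int, List[str]] = {}
    for gid in names:
        chain: List[str] = []
        visited = set()  # guards against cycles in the parent map
        cur: Optional[int] = gid
        while cur is not None and cur in names and cur not in visited:
            visited.add(cur)
            chain.append(names[cur])
            cur = parents.get(cur)
        chains[gid] = chain
    return chains


chains = unwind_chains()
```

Because the writer side (here, whatever would mutate `greenlet_names`) only ever contends with the short copy in Phase 1, its worst-case wait no longer grows with chain depth.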
…very

Eliminate ALL per-switch C calls from the greenlet tracer by reading gr_frame and stack_stop directly from greenlet object memory at sample time (~100Hz).

Previously, the profiler installed a greenlet.settrace() callback that called update_greenlet_frame() twice per switch, each crossing into C with GIL release, mutex acquire, and hash map lookup. This added a ~3x latency multiplier to every greenlet switch.

Now, at init time we discover three byte offsets in the greenlet struct via ctypes memory probing (accounting for the pimpl indirection pattern introduced in greenlet 2.0):

    PyGreenlet + pimpl_offset      -> Greenlet* (pimpl)
    Greenlet*  + frame_offset      -> struct _frame* (gr_frame)
    Greenlet*  + stack_stop_offset -> char* (non-NULL = started)

At sample time, the sampler thread reads these values using copy_generic (two pointer hops, no GIL). stack_stop distinguishes on-CPU greenlets (NULL frame + started) from unstarted ones (NULL frame + not started).

Falls back to per-switch update_greenlet_frame() if offset discovery fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
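As a rough illustration of the "two pointer hops" above, the sketch below follows the same offset arithmetic on synthetic ctypes structures. It does not touch a real greenlet: `FakeObject`, `FakeImpl`, `read_ptr`, and `read_frame_and_started` are hypothetical names, and `ctypes.string_at` stands in for the profiler's `copy_generic`, which this sketch does not attempt to reproduce.

```python
import ctypes
import sys
from typing import Optional, Tuple

PTR_SIZE = ctypes.sizeof(ctypes.c_void_p)


def read_ptr(addr: int) -> int:
    """Read one pointer-sized word from a raw address (no safety checks)."""
    return int.from_bytes(ctypes.string_at(addr, PTR_SIZE), sys.byteorder)


def read_frame_and_started(
    obj_addr: int, pimpl_offset: int, frame_offset: int, stack_stop_offset: int
) -> Optional[Tuple[int, bool]]:
    pimpl = read_ptr(obj_addr + pimpl_offset)         # hop 1: object -> pimpl
    if pimpl == 0:
        return None
    frame = read_ptr(pimpl + frame_offset)            # hop 2: pimpl -> frame pointer
    stack_stop = read_ptr(pimpl + stack_stop_offset)  # non-NULL means started
    return frame, stack_stop != 0


# Synthetic layout, assumed purely for demonstration.
class FakeImpl(ctypes.Structure):
    _fields_ = [
        ("pad", ctypes.c_void_p),
        ("stack_stop", ctypes.c_void_p),
        ("frame", ctypes.c_void_p),
    ]


class FakeObject(ctypes.Structure):
    _fields_ = [
        ("refcnt", ctypes.c_ssize_t),
        ("type", ctypes.c_void_p),
        ("pimpl", ctypes.c_void_p),
    ]


impl = FakeImpl(pad=0, stack_stop=0x1000, frame=0xBEEF)
obj = FakeObject(refcnt=1, type=0, pimpl=ctypes.addressof(impl))

result = read_frame_and_started(
    ctypes.addressof(obj),
    FakeObject.pimpl.offset,
    FakeImpl.frame.offset,
    FakeImpl.stack_stop.offset,
)
```

In the synthetic case the offsets come straight from the structure definitions; the point of the discovery step in this PR is that the real offsets are not published and have to be found by probing.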
- Add return type annotations to inner functions (_probe_fn, _stack_probe_fn)
- Wrap ctypes buffer slices with bytes() for int.from_bytes compatibility
- Check both pimpl_offset and frame_offset for None before proceeding
- Add set_greenlet_offsets to the stack module type stub

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract _read_ptr() helper with type: ignore for the ctypes buffer slice, avoiding repeated type-ignore annotations on every call site.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
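A helper of this kind might look like the sketch below. Only the name `_read_ptr` comes from the commit message; the signature, the `PTR_SIZE` constant, and the demonstration buffer are assumptions made for illustration.

```python
import ctypes
import sys

PTR_SIZE = ctypes.sizeof(ctypes.c_void_p)


def _read_ptr(buf: ctypes.Array, offset: int) -> int:
    """Decode one pointer-sized word from a ctypes char buffer."""
    # bytes() normalizes the slice for int.from_bytes; the single
    # type-ignore lives here instead of at every call site.
    raw = bytes(buf[offset:offset + PTR_SIZE])  # type: ignore[arg-type]
    return int.from_bytes(raw, sys.byteorder)


# Demonstration on memory we own: view a two-pointer array as raw chars.
values = (ctypes.c_void_p * 2)(0x1234, 0xABCD)
buf = (ctypes.c_char * ctypes.sizeof(values)).from_address(ctypes.addressof(values))

first = _read_ptr(buf, 0)
second = _read_ptr(buf, PTR_SIZE)
```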
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 089f00e85e
    gl_size = sys.getsizeof(g)
    type_addr = id(type(g))

    buf = (ctypes.c_char * gl_size).from_address(id(g))
Use object basicsize for greenlet memory probing
_discover_greenlet_offsets() uses sys.getsizeof(g) as the byte length for from_address(id(g)), but getsizeof includes GC overhead for GC-tracked objects while id(g) points to the object body. That makes the scan read past the greenlet struct into unrelated memory, and those out-of-bounds words are then treated as candidate pointers in ctypes.from_address(...); invalid addresses can segfault the interpreter instead of raising a Python exception. In production this can crash process startup when profiling patches gevent.
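The size mismatch the review points at can be observed on any GC-tracked CPython object. The sketch below assumes CPython (where `id()` is the object's address) and uses a plain list rather than a greenlet; `find_type_pointer_offset` is a hypothetical helper that bounds its scan by `__basicsize__`, as the review suggests, instead of `sys.getsizeof`.

```python
import ctypes
import sys
from typing import Optional

PTR_SIZE = ctypes.sizeof(ctypes.c_void_p)


def find_type_pointer_offset(obj: object) -> Optional[int]:
    """Scan the object body for the word that equals id(type(obj))."""
    size = type(obj).__basicsize__  # body only, no GC header
    buf = (ctypes.c_char * size).from_address(id(obj))
    target = id(type(obj))
    for off in range(0, size - PTR_SIZE + 1, PTR_SIZE):
        word = int.from_bytes(bytes(buf[off:off + PTR_SIZE]), sys.byteorder)
        if word == target:
            return off
    return None


sample: list = []
# sys.getsizeof adds the GC header, which sits *before* id(sample),
# so using it as a length from id(sample) would read past the body.
overshoot = sys.getsizeof(sample) - type(sample).__basicsize__
offset = find_type_pointer_offset(sample)
```

Whether `__basicsize__` covers everything the greenlet probe needs is a question for the PR author; the sketch only shows that it stays inside the object body.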
…ilable

When offset discovery succeeds, don't install a greenlet.settrace() callback at all. This eliminates all per-switch Python code, not just the C calls. Greenlet tracking relies on the patched spawn/joinall; dead detection uses rawlink. The settrace callback is only installed as a fallback when offset discovery fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
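The selection logic in this commit amounts to "try discovery, install the tracer only on failure". A minimal sketch, with the three callables injected so it runs without greenlet installed; `configure_greenlet_tracking` and its parameters are hypothetical names, not the profiler's API.

```python
from typing import Callable, List, Optional, Tuple

Offsets = Tuple[int, int, int]  # (pimpl_offset, frame_offset, stack_stop_offset)


def configure_greenlet_tracking(
    discover: Callable[[], Optional[Offsets]],
    install_offsets: Callable[[int, int, int], None],
    install_tracer: Callable[[], None],
) -> str:
    """Return which mode was selected: 'offsets' or 'settrace'."""
    try:
        offsets = discover()
    except Exception:
        offsets = None  # treat any probing error as "discovery failed"

    if offsets is not None:
        install_offsets(*offsets)  # sample-time reads; no per-switch callback
        return "offsets"

    install_tracer()  # fallback: per-switch callback
    return "settrace"


# Usage with fakes that record what was installed.
calls: List[object] = []
mode_ok = configure_greenlet_tracking(
    lambda: (16, 8, 24),
    lambda a, b, c: calls.append(("offsets", a, b, c)),
    lambda: calls.append("tracer"),
)
mode_fallback = configure_greenlet_tracking(
    lambda: None,
    lambda a, b, c: calls.append(("offsets", a, b, c)),
    lambda: calls.append("tracer"),
)
```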
Description
Eliminates ALL per-switch C calls from the greenlet tracer by reading gr_frame and stack_stop directly from greenlet object memory at sample time (~100Hz).

Depends on #17467 (lock optimization + try_lock), which this PR includes as base commits.

Problem: Even with #17467's lock optimization, the profiler's greenlet.settrace() callback still called update_greenlet_frame() twice per greenlet switch, each crossing Python-to-C with GIL release, mutex acquire, and hash map lookup. This added a ~3x latency multiplier to every greenlet switch.

Solution: At init time, discover three byte offsets in the greenlet struct via ctypes memory probing (accounting for the pimpl indirection pattern in greenlet 2.0+):

    PyGreenlet + pimpl_offset      -> Greenlet* (pimpl)
    Greenlet*  + frame_offset      -> struct _frame* (gr_frame)
    Greenlet*  + stack_stop_offset -> char* (non-NULL = started)

At sample time, the sampler thread reads these values using copy_generic (two pointer hops, no GIL). stack_stop distinguishes on-CPU greenlets (NULL frame + started) from unstarted ones (NULL frame + not started).

Falls back gracefully to per-switch update_greenlet_frame() if offset discovery fails.

Benchmark (Apple M3 Max, Python 3.14, 256 greenlets):

[benchmark table not recovered; surviving fragment: "settrace())"]

Upstream: Filed greenlet#505 proposing debug offsets so profilers don't need runtime probing.
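The stack_stop rule in the description can be written as a small decision function. The sketch below is an interpretation, not the PR's code: `GreenletState` and `classify` are hypothetical names, and treating a non-NULL frame as "suspended" is an assumption based on how gr_frame behaves for paused greenlets. Dead greenlets are not modelled here.

```python
from enum import Enum


class GreenletState(Enum):
    SUSPENDED = "suspended"  # assumed: non-NULL frame, stack can be walked
    ON_CPU = "on_cpu"        # NULL frame + started
    UNSTARTED = "unstarted"  # NULL frame + not started


def classify(frame_ptr: int, stack_stop_ptr: int) -> GreenletState:
    if frame_ptr != 0:
        return GreenletState.SUSPENDED
    # NULL frame is ambiguous on its own; stack_stop breaks the tie.
    if stack_stop_ptr != 0:
        return GreenletState.ON_CPU
    return GreenletState.UNSTARTED


states = [
    classify(0xBEEF, 0x1000),
    classify(0, 0x1000),
    classify(0, 0),
]
```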
Testing
_discover_greenlet_offsets() fails.

Risks
copy_generic reads from greenlet memory at sample time could read briefly stale data if a greenlet dies between Phase 1 and Phase 2. Same safety model as existing copy_type() path.

Additional Notes
scp-1141-repro/ (not committed).

🤖 Generated with Claude Code