perf(profiling): read greenlet frames at sample time via offset discovery#17518

Draft
taegyunkim wants to merge 7 commits into main from taegyunkim/scp-1141-no-switch

@taegyunkim
Contributor

Description

Eliminates ALL per-switch C calls from the greenlet tracer by reading gr_frame and stack_stop directly from greenlet object memory at sample time (~100Hz).

Depends on #17467 (lock optimization + try_lock), which this PR includes as base commits.

Problem: Even with #17467's lock optimization, the profiler's greenlet.settrace() callback still called update_greenlet_frame() twice per greenlet switch, each crossing Python-to-C with GIL release, mutex acquire, and hash map lookup. This added a ~3x latency multiplier to every greenlet switch.

Solution: At init time, discover three byte offsets in the greenlet struct via ctypes memory probing (accounting for the pimpl indirection pattern in greenlet 2.0+):

PyGreenlet + pimpl_offset      -> Greenlet* (pimpl)
Greenlet*  + frame_offset      -> struct _frame* (gr_frame)
Greenlet*  + stack_stop_offset -> char* (non-NULL = started)

At sample time, the sampler thread reads these values using copy_generic (two pointer hops, no GIL). stack_stop distinguishes on-CPU greenlets (NULL frame + started) from unstarted ones (NULL frame + not started).

Falls back gracefully to per-switch update_greenlet_frame() if offset discovery fails.
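
A minimal sketch of the sample-time read described above, assuming a made-up struct layout. `FakePyGreenlet`, `FakeImpl`, `read_ptr`, and `classify` are hypothetical names for illustration, and the field order does not match any greenlet release. The sketch uses plain ctypes loads in place of the profiler's `copy_generic`, so it is only safe on valid in-process memory.

```python
import ctypes


class FakeImpl(ctypes.Structure):
    """Hypothetical stand-in for the C++ Greenlet object behind the pimpl."""
    _fields_ = [
        ("padding", ctypes.c_void_p),
        ("stack_stop", ctypes.c_void_p),  # non-NULL once the greenlet has started
        ("gr_frame", ctypes.c_void_p),    # NULL while running or unstarted
    ]


class FakePyGreenlet(ctypes.Structure):
    """Hypothetical stand-in for the PyGreenlet object header."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("pimpl", ctypes.c_void_p),
    ]


# In the PR these offsets are discovered at init time; here they are known.
PIMPL_OFFSET = FakePyGreenlet.pimpl.offset
FRAME_OFFSET = FakeImpl.gr_frame.offset
STACK_STOP_OFFSET = FakeImpl.stack_stop.offset


def read_ptr(addr: int) -> int:
    """Read one pointer-sized word at addr; NULL is returned as 0."""
    return ctypes.c_void_p.from_address(addr).value or 0


def classify(greenlet_addr: int) -> str:
    impl = read_ptr(greenlet_addr + PIMPL_OFFSET)  # hop 1: PyGreenlet -> Greenlet*
    if not impl:
        return "unknown"
    frame = read_ptr(impl + FRAME_OFFSET)          # hop 2: Greenlet* -> gr_frame
    stack_stop = read_ptr(impl + STACK_STOP_OFFSET)
    if frame:
        return "paused"
    # NULL frame: started means on-CPU, not started means unstarted.
    return "on-cpu" if stack_stop else "unstarted"
```

Passing `ctypes.addressof(obj)` for a `FakePyGreenlet` whose `pimpl` points at a live `FakeImpl` returns one of the three states; the real sampler would go on to unwind from the frame pointer.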

Benchmark (Apple M3 Max, Python 3.14, 256 greenlets):

Upstream: Filed greenlet#505 proposing debug offsets so profilers don't need runtime probing.

Testing

  • All 284 profiling tests pass (Python 3.14 gevent venv).
  • Offset discovery validated across greenlet states (paused, active, unstarted) with cross-validation on a second probe greenlet.
  • Graceful fallback if _discover_greenlet_offsets() fails.
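
The internals of `_discover_greenlet_offsets()` are not shown in this description, so the following is only a sketch of the general technique: scan an object's bytes for a pointer value that is already known, accept the offset only if it is unambiguous, then cross-validate it on a second probe object. `find_pointer_offset` and `discover` are hypothetical helpers.

```python
import ctypes
import sys
from typing import Optional, Tuple

WORD = ctypes.sizeof(ctypes.c_void_p)

# (base address, size in bytes, pointer value expected somewhere inside)
Probe = Tuple[int, int, int]


def find_pointer_offset(base: int, size: int, target: int) -> Optional[int]:
    """Return the single word-aligned offset in [base, base + size) that
    holds target, or None if there are zero or several matches."""
    raw = ctypes.string_at(base, size)
    hits = [
        off
        for off in range(0, size - WORD + 1, WORD)
        if int.from_bytes(raw[off:off + WORD], sys.byteorder) == target
    ]
    return hits[0] if len(hits) == 1 else None


def discover(first: Probe, second: Probe) -> Optional[int]:
    """Discover an offset on the first probe and confirm it on the second."""
    offset = find_pointer_offset(*first)
    if offset is None:
        return None
    base, size, target = second
    if offset + WORD > size:
        return None
    word = int.from_bytes(ctypes.string_at(base + offset, WORD), sys.byteorder)
    return offset if word == target else None
```

Returning `None` on any ambiguity or mismatch is what lets the caller fall back instead of trusting a guessed offset.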

Risks

  • ctypes memory probing depends on greenlet's pimpl struct layout. The discovered offsets are validated at runtime and discovery falls back on failure, so a layout change degrades to the old per-switch behavior rather than crashing.
  • copy_generic reads of greenlet memory at sample time could briefly see stale data if a greenlet dies between Phase 1 and Phase 2. This is the same safety model as the existing copy_type() path.

Additional Notes

  • Benchmark script and upstream proposal draft are in scp-1141-repro/ (not committed).

🤖 Generated with Claude Code

taegyunkim and others added 6 commits April 13, 2026 21:35
Phase 1 of unwind_greenlets() previously traversed parent chains and
allocated a visited set per leaf greenlet while holding
greenlet_info_map_lock. This caused update_greenlet_frame() (called on
every greenlet switch) to stall, leading to resource exhaustion in
gevent applications (e.g. PostgreSQL connection pool growth).

Phase 1 is now a pure O(N) flat copy of greenlet state and the parent
map; all chain traversal happens in Phase 2 outside the lock.

Refs SCP-1141 / SCP-1039.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
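
The two-phase shape described in this commit can be sketched in Python as follows. The real code is the profiler's native `unwind_greenlets()`; `GreenletTable` and its methods are hypothetical, and frames are represented by plain strings.

```python
import threading
from typing import Dict, List, Optional


class GreenletTable:
    """Hypothetical stand-in for the profiler's greenlet info map."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frames: Dict[int, str] = {}
        self._parents: Dict[int, Optional[int]] = {}

    def update(self, gid: int, frame: str, parent: Optional[int] = None) -> None:
        # Hot path (per switch): must never wait long for the sampler.
        with self._lock:
            self._frames[gid] = frame
            self._parents[gid] = parent

    def snapshot_chains(self) -> Dict[int, List[int]]:
        # Phase 1: flat O(N) copy while holding the lock, nothing else.
        with self._lock:
            frames = dict(self._frames)
            parents = dict(self._parents)

        # Phase 2: parent-chain traversal on the private copies, lock released.
        chains: Dict[int, List[int]] = {}
        for gid in frames:
            chain: List[int] = []
            seen = set()
            cur: Optional[int] = gid
            while cur is not None and cur not in seen:  # guard against cycles
                seen.add(cur)
                chain.append(cur)
                cur = parents.get(cur)
            chains[gid] = chain
        return chains
```

The visited set is now allocated outside the critical section, so the time the lock is held grows only with the size of the two copies.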
…very

Eliminate ALL per-switch C calls from the greenlet tracer by reading
gr_frame and stack_stop directly from greenlet object memory at sample
time (~100Hz).

Previously, the profiler installed a greenlet.settrace() callback that
called update_greenlet_frame() twice per switch, each crossing into C
with GIL release, mutex acquire, and hash map lookup. This added a ~3x
latency multiplier to every greenlet switch.

Now, at init time we discover three byte offsets in the greenlet struct
via ctypes memory probing (accounting for the pimpl indirection pattern
introduced in greenlet 2.0):

  PyGreenlet + pimpl_offset      -> Greenlet* (pimpl)
  Greenlet*  + frame_offset      -> struct _frame* (gr_frame)
  Greenlet*  + stack_stop_offset -> char* (non-NULL = started)

At sample time, the sampler thread reads these values using copy_generic
(two pointer hops, no GIL). stack_stop distinguishes on-CPU greenlets
(NULL frame + started) from unstarted ones (NULL frame + not started).

Falls back to per-switch update_greenlet_frame() if offset discovery
fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add return type annotations to inner functions (_probe_fn, _stack_probe_fn)
- Wrap ctypes buffer slices with bytes() for int.from_bytes compatibility
- Check both pimpl_offset and frame_offset for None before proceeding
- Add set_greenlet_offsets to the stack module type stub

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract _read_ptr() helper with type: ignore for the ctypes buffer
slice, avoiding repeated type-ignore annotations on every call site.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 089f00e85e

Comment on lines +122 to +125
gl_size = sys.getsizeof(g)
type_addr = id(type(g))

buf = (ctypes.c_char * gl_size).from_address(id(g))

P1: Use object basicsize for greenlet memory probing

_discover_greenlet_offsets() uses sys.getsizeof(g) as the byte length for from_address(id(g)), but getsizeof includes GC overhead for GC-tracked objects while id(g) points to the object body. That makes the scan read past the greenlet struct into unrelated memory, and those out-of-bounds words are then treated as candidate pointers in ctypes.from_address(...); invalid addresses can segfault the interpreter instead of raising a Python exception. In production this can crash process startup when profiling patches gevent.
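
The size mismatch this comment describes can be observed on CPython with ordinary objects. The sketch below uses hypothetical helpers (`gc_header_overhead`, `probe_length`) and bounds the scan by the type's `__basicsize__`, which is one reading of the suggestion; whether the PR adopted it is not shown here. `__basicsize__` excludes any variable-length tail, so this assumes a fixed-size object.

```python
import sys


def gc_header_overhead(obj: object) -> int:
    """Bytes that sys.getsizeof() adds on top of obj.__sizeof__().

    On CPython this is the GC header for GC-managed types and 0 otherwise.
    The header sits before the address returned by id(obj), not after it.
    """
    return sys.getsizeof(obj) - obj.__sizeof__()


def probe_length(obj: object) -> int:
    """Conservative byte count for a scan that starts at id(obj)."""
    return type(obj).__basicsize__


if __name__ == "__main__":
    for sample in ([], 1.5):
        print(type(sample).__name__, gc_header_overhead(sample), probe_length(sample))
```

For a GC-managed object the overhead is non-zero, so a buffer of `sys.getsizeof(obj)` bytes starting at `id(obj)` extends past the end of the object by that amount.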


…ilable

When offset discovery succeeds, don't install a greenlet.settrace()
callback at all.  This eliminates all per-switch Python code, not just
the C calls.  Greenlet tracking relies on the patched spawn/joinall;
dead detection uses rawlink.

The settrace callback is only installed as a fallback when offset
discovery fails.
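
The selection logic in this commit can be sketched as below. The three callables are injected and hypothetical: in the PR they would correspond to `_discover_greenlet_offsets()`, `set_greenlet_offsets` (whose real signature is an assumption here), and the code that calls `greenlet.settrace()`.

```python
from typing import Callable, Optional, Tuple

# Assumed shape: (pimpl_offset, frame_offset, stack_stop_offset)
Offsets = Tuple[int, int, int]


def configure_tracking(
    discover: Callable[[], Optional[Offsets]],
    set_offsets: Callable[[int, int, int], None],
    install_tracer: Callable[[], None],
) -> str:
    """Prefer sample-time reads; install the per-switch tracer only as fallback."""
    try:
        offsets = discover()
    except Exception:
        # Treat any probing error as "discovery failed".
        offsets = None

    if offsets is not None:
        set_offsets(*offsets)
        return "sample-time"  # no settrace callback installed at all

    install_tracer()
    return "settrace"
```

Keeping the decision in one place makes it straightforward to test both paths without importing greenlet.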

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cit-pr-commenter-54b7da

Codeowners resolved as

ddtrace/profiling/_gevent.py                                            @DataDog/profiling-python
