Skip to content

Shrink the inline storage in connect_awaitable#1974

Merged
ericniebler merged 26 commits intoNVIDIA:mainfrom
ispeters:inline_coroutine_allocation
Apr 3, 2026
Merged

Shrink the inline storage in connect_awaitable#1974
ericniebler merged 26 commits intoNVIDIA:mainfrom
ispeters:inline_coroutine_allocation

Conversation

@ispeters
Copy link
Copy Markdown
Contributor

@ispeters ispeters commented Mar 30, 2026

This diff adapts an idea originally due to @lewissbaker, originally shared at https://godbolt.org/z/zGG9fsPrz,
and combines it with the synthetic coroutine frame idea from @sgerbino and @vinniefalco in Capy.

Changes to Lewis's original include:

  • never invoke coro.destroy() since it's a no-op anyway
  • support unhandled_stopped() and get_env()
  • no need for a real coroutine since we're synthesizing a frame from a single function pointer

Changes to the synthetic coroutine frame:

  • don't store a void* data and use reinterpret_cast and known offsets instead

While we're in here optimizing for size, also make storage of the optional awaitable objects conditional on
their existence (i.e. don't store the result of as_awaitable(auto& promise) or operator co_await if we're
falling back to the defaults).

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines +125 to +127
// the first implementation of storing the coroutine frame inline in __opstate using the
// technique in this file is due to Lewis Baker <lewissbaker@gmail.com>, and was first
// shared at https://godbolt.org/z/zGG9fsPrz
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lewissbaker, do have any feedback on this? Placement, content, whatever?

Copy link
Copy Markdown
Collaborator

@ericniebler ericniebler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 4f59b3c

@ericniebler
Copy link
Copy Markdown
Collaborator

so, it looks like CI has been vacuously green for a couple of weeks, and in that time i managed to break coroutine support for msvc. i can't fix the CI issue until i get the coro tests passing on msvc again. i'll try to get to that soon.

sorry for the delay.

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test a678007

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 2d64a93

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 065b81a

@ispeters ispeters force-pushed the inline_coroutine_allocation branch from 065b81a to 5d3ba11 Compare April 1, 2026 16:54
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 5d3ba11

@ispeters ispeters force-pushed the inline_coroutine_allocation branch from 5d3ba11 to e8baca6 Compare April 1, 2026 20:50
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test e8baca6

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 2e49c6f

@ispeters
Copy link
Copy Markdown
Contributor Author

ispeters commented Apr 2, 2026

Oh blarg, I upgraded Clang and I guess I got a new clang-format, which has ruined the formatting. Uhh… that's annoying. I'll see what I can do to fix my local tools.

@ericniebler
Copy link
Copy Markdown
Collaborator

Oh blarg, I upgraded Clang and I guess I got a new clang-format, which has ruined the formatting. Uhh… that's annoying. I'll see what I can do to fix my local tools.

the whole formatting thing has been a bit of a pain. CI uses clang-format-21 version 21.1.5. i also have clang-format-21, but i have 21.0.0, which formats slightly differently. it's crazy-making.

@ispeters
Copy link
Copy Markdown
Contributor Author

ispeters commented Apr 2, 2026

Whoops, I missed the template-template bug; I'll have to iterate. How in the heck did any of this work locally?

@ispeters ispeters force-pushed the inline_coroutine_allocation branch from 2e49c6f to 1636833 Compare April 2, 2026 20:51
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 1636833

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 661283a

@ericniebler
Copy link
Copy Markdown
Collaborator

this looks interesting:

30: test.stdexec: /home/coder/stdexec/include/stdexec/__detail/__connect_await
able.hpp:419: static constexpr void* std::execution::__connect_await::__promis
e<_Awaitable, _Receiver>::operator new(std::size_t, __opstate_t&) [with _Await
able = {anonymous}::ready_awaitable<int>; _Receiver = {anonymous}::expect_valu
e_receiver<std::execution::env<>, int>; std::size_t = long unsigned int; __ops
tate_t = std::execution::__connect_await::__promise<{anonymous}::ready_awaitab
le<int>, {anonymous}::expect_value_receiver<std::execution::env<>, int> >::__
opstate_t]: Assertion `__bytes == __storage_size + 1' failed.

@ispeters
Copy link
Copy Markdown
Contributor Author

ispeters commented Apr 2, 2026

Kinda looks like a bunch of the failures are compiler-specific size mismatches. I can't run GCC with ASAN on Apple silicon, nor can I run MSVC, so I plan to add a commit that prints the values of the various asserted-upon variables to stdout so the failure log is more illuminating.

As for the GCC 11 failure, it's an ICE that repros in debug mode but not release mode, so that's a bit wild.

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 3163082

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test bdd539a

ispeters and others added 9 commits April 3, 2026 09:01
This ought to fix the MSVC build; maybe it'll fix the GCC ICE.
I figured out how to run `clang-format` v21.1.6, which appears to be
compatible with v21.1.5.

I updated the decorators in `test_cpo_connect_awaitable.cpp` to fix the
template-template build error.
Also fix the throw-on-as_awaitable wrapper's template-template bug.
I can't figure out the CI failures on the previous commit because I
can't repro them locally. This diff adds some verbose debug logging in
hopes of using the resulting spew to diagnose the problem(s) from the
failure report in the CI runner.
Update the required coroutine frame sizes based on the results of the
previous CI run, and exclude the `connect_awaitable` tests from
compilers that lack coroutine support.
Maybe this will make the error logs more useful while I debug the frame
size.
I typo'd `STDEXEC_CONNECT_AWAITABLE_NUM_POINTERS` for the Clang + CUDA 12.0 case; this diff fixes it.
@ispeters ispeters force-pushed the inline_coroutine_allocation branch from 939341a to 562648b Compare April 3, 2026 16:07
This diff ought to fix the coroutine frame size constants for MSVC, and
it emits more debug info for CUDA-on-Clang builds in hopes of fixing the
constants for that combo.
@ispeters ispeters force-pushed the inline_coroutine_allocation branch from 562648b to 2f38e0b Compare April 3, 2026 16:11
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 2f38e0b

MSVC didn't like me using `#if/#endif` inside the invocation of a
funciton-like macro, which alerted me to the fact I might be misusing
the preprocessor, which might be the reason I didn't see the desired
debug output in the CUDA builds. This diff fixes the preprocessor usage.
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 76bebc4

@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 68d0f53

This diff is inspired by the synthetic coroutine frame in @sgerbino's
fork of Capy, which I looked at here:
https://github.com/sgerbino/capy/blob/pr/beman-bench/bench/beman/awaitable_sender.hpp#L87-L92

The idea is to fake out a `coroutine_handle<__promise>` from a struct
like this:
```
struct __synthetic_coro_frame
{
  void (*__resume)(void*) noexcept;
  void (*__destroy_)(void*) noexcept;
};
```
and lie to the compiler with code like this:
```
// in __opstate<...>::__co_impl, which is a static member function
__std::coroutine_handle<__promise_t>::from_address(&__op.__synthetic_frame_);
```

With the right pointer math and a sprinkling of `reinterpret_cast`, this
code builds and passes all tests with Clang 22 and GCC 15 on Apple
silicon. Let's see what CI says.
@ispeters ispeters force-pushed the inline_coroutine_allocation branch from 68d0f53 to d1c3308 Compare April 3, 2026 19:27
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test d1c3308

Make `__synthetic_coro_frame` a byte smaller so the `bool __started_`
member in `__opstate` can sneak into the trailing padding, saving a
pointer.
Replace `_A` with `_Awaitable2` and use `__not_decay_to` as recommended.
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test f9f3181

The synthesized coroutine handle is passed to user code when it connects
an awaitable that suspends (it's received as the argument to
`await_suspend`), which means that user code can invoke `coro.destroy()`
even though we never do that in library code. The previous memory layout
stored `__opstate<...>::__started_` in the last byte of its synthesized
frame by artificially shortening the struct by a byte and marking the
`__started_` member with `[[no_unique_address]]`, which means that
`destroy` could point to a somewhat arbitrary address, depending on how
a `true` `bool` in the last byte is interpreted. This is potentially a
security hole so, to be safe, let's not squeeze so hard and just make
room for a function pointer and assign it the address of a no-op
function.
@ericniebler
Copy link
Copy Markdown
Collaborator

/ok to test 0995025

@ericniebler ericniebler merged commit 0646567 into NVIDIA:main Apr 3, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants