Shrink the inline storage in connect_awaitable#1974
Shrink the inline storage in connect_awaitable#1974ericniebler merged 26 commits intoNVIDIA:mainfrom
Conversation
| // the first implementation of storing the coroutine frame inline in __opstate using the | ||
| // technique in this file is due to Lewis Baker <lewissbaker@gmail.com>, and was first | ||
| // shared at https://godbolt.org/z/zGG9fsPrz |
There was a problem hiding this comment.
@lewissbaker, do have any feedback on this? Placement, content, whatever?
|
/ok to test 4f59b3c |
|
so, it looks like CI has been vacuously green for a couple of weeks, and in that time i managed to break coroutine support for msvc. i can't fix the CI issue until i get the coro tests passing on msvc again. i'll try to get to that soon. sorry for the delay. |
|
/ok to test a678007 |
|
/ok to test 2d64a93 |
|
/ok to test 065b81a |
065b81a to
5d3ba11
Compare
|
/ok to test 5d3ba11 |
5d3ba11 to
e8baca6
Compare
|
/ok to test e8baca6 |
|
/ok to test 2e49c6f |
|
Oh blarg, I upgraded Clang and I guess I got a new |
the whole formatting thing has been a bit of a pain. CI uses clang-format-21 version 21.1.5. i also have clang-format-21, but i have 21.0.0, which formats slightly differently. it's crazy-making. |
|
Whoops, I missed the template-template bug; I'll have to iterate. How in the heck did any of this work locally? |
2e49c6f to
1636833
Compare
|
/ok to test 1636833 |
|
/ok to test 661283a |
|
this looks interesting: |
|
Kinda looks like a bunch of the failures are compiler-specific size mismatches. I can't run GCC with ASAN on Apple silicon, nor can I run MSVC, so I plan to add a commit that prints the values of the various asserted-upon variables to stdout so the failure log is more illuminating. As for the GCC 11 failure, it's an ICE that repros in debug mode but not release mode, so that's a bit wild. |
|
/ok to test 3163082 |
|
/ok to test bdd539a |
This ought to fix the MSVC build; maybe it'll fix the GCC ICE.
I figured out how to run `clang-format` v21.1.6, which appears to be compatible with v21.1.5. I updated the decorators in `test_cpo_connect_awaitable.cpp` to fix the template-template build error.
Also fix the throw-on-as_awaitable wrapper's template-template bug.
I can't figure out the CI failures on the previous commit because I can't repro them locally. This diff adds some verbose debug logging in hopes of using the resulting spew to diagnose the problem(s) from the failure report in the CI runner.
Update the required coroutine frame sizes based on the results of the previous CI run, and exclude the `connect_awaitable` tests from compilers that lack coroutine support.
Maybe this will make the error logs more useful while I debug the frame size.
I typo'd `STDEXEC_CONNECT_AWAITABLE_NUM_POINTERS` for the Clang + CUDA 12.0 case; this diff fixes it.
939341a to
562648b
Compare
This diff ought to fix the coroutine frame size constants for MSVC, and it emits more debug info for CUDA-on-Clang builds in hopes of fixing the constants for that combo.
562648b to
2f38e0b
Compare
|
/ok to test 2f38e0b |
MSVC didn't like me using `#if/#endif` inside the invocation of a funciton-like macro, which alerted me to the fact I might be misusing the preprocessor, which might be the reason I didn't see the desired debug output in the CUDA builds. This diff fixes the preprocessor usage.
|
/ok to test 76bebc4 |
|
/ok to test 68d0f53 |
This diff is inspired by the synthetic coroutine frame in @sgerbino's fork of Capy, which I looked at here: https://github.com/sgerbino/capy/blob/pr/beman-bench/bench/beman/awaitable_sender.hpp#L87-L92 The idea is to fake out a `coroutine_handle<__promise>` from a struct like this: ``` struct __synthetic_coro_frame { void (*__resume)(void*) noexcept; void (*__destroy_)(void*) noexcept; }; ``` and lie to the compiler with code like this: ``` // in __opstate<...>::__co_impl, which is a static member function __std::coroutine_handle<__promise_t>::from_address(&__op.__synthetic_frame_); ``` With the right pointer math and a sprinkling of `reinterpret_cast`, this code builds and passes all tests with Clang 22 and GCC 15 on Apple silicon. Let's see what CI says.
68d0f53 to
d1c3308
Compare
|
/ok to test d1c3308 |
Make `__synthetic_coro_frame` a byte smaller so the `bool __started_` member in `__opstate` can sneak into the trailing padding, saving a pointer.
|
/ok to test f9f3181 |
The synthesized coroutine handle is passed to user code when it connects an awaitable that suspends (it's received as the argument to `await_suspend`), which means that user code can invoke `coro.destroy()` even though we never do that in library code. The previous memory layout stored `__opstate<...>::__started_` in the last byte of its synthesized frame by artificially shortening the struct by a byte and marking the `__started_` member with `[[no_unique_address]]`, which means that `destroy` could point to a somewhat arbitrary address, depending on how a `true` `bool` in the last byte is interpreted. This is potentially a security hole so, to be safe, let's not squeeze so hard and just make room for a function pointer and assign it the address of a no-op function.
|
/ok to test 0995025 |
This diff adapts an idea originally due to @lewissbaker, originally shared at https://godbolt.org/z/zGG9fsPrz,
and combines it with the synthetic coroutine frame idea from @sgerbino and @vinniefalco in Capy.
Changes to Lewis's original include:
coro.destroy()since it's a no-op anywayunhandled_stopped()andget_env()Changes to the synthetic coroutine frame:
void* dataand usereinterpret_castand known offsets insteadWhile we're in here optimizing for size, also make storage of the optional awaitable objects conditional on
their existence (i.e. don't store the result of
as_awaitable(auto& promise)oroperator co_awaitif we'refalling back to the defaults).