Description
Summary
The sched_ext rt_stall selftest remains flaky in CI despite the synchronization fix in commit 0b82cc331d2e. The test fails because it measures total CPU time since fork rather than the delta during the measurement window, causing pre-sleep CPU accumulation to skew the ratio below the 4% threshold.
Failure Details
- Test / Component: selftests/sched_ext rt_stall
- Frequency: Most sched_ext CI runs on x86_64 (observed in 3+ independent PRs over 3 days)
- Failure mode: Flaky — ratio drops to 3.0-3.9% instead of expected ~5%
- Affected architectures: x86_64
- CI runs observed:
  - https://github.com/kernel-patches/bpf/actions/runs/22834153226 (bpf-next_test)
  - https://github.com/kernel-patches/bpf/actions/runs/22803463271 (fix constant blinding bypass)
  - https://github.com/kernel-patches/bpf/actions/runs/22788551155 (perf_link: avoid failures)
Root Cause Analysis
The rt_stall test verifies that the sched_ext deadline server prevents RT tasks from starving EXT/FAIR tasks. It forks two children pinned to the same CPU — one EXT/FAIR and one SCHED_FIFO — then measures their CPU time ratio after sleep(5).
Commit 0b82cc331d2e added pipe-based synchronization so children complete their setup before the parent starts sleep(RUN_TIME). However, children start busy-looping immediately after signal_ready(), while the parent still needs to process both pipe reads before calling sleep(). During this gap, both children accumulate CPU time — with the RT child dominating.
The measurement reads total CPU time from /proc/<pid>/stat (utime + stime), which includes this pre-sleep time. Because most of that time belongs to the RT child, it inflates the RT term in the ratio's denominator:
| Run | Failing iteration | EXT/FAIR | RT | Ratio | Note |
|---|---|---|---|---|---|
| bpf-next_test | i=2 (FAIR) | 0.180s | 4.740s | 3.66% | |
| constant blinding | i=3 (EXT) | 0.180s | 5.690s | 3.07% | RT > RUN_TIME proves pre-sleep accumulation |
| perf_link | i=3 (EXT) | 0.190s | 4.740s | 3.85% | |
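The RT task getting 5.69s of CPU time in a 5-second sleep window is conclusive evidence: a task pinned to a single CPU cannot consume more than 5s of CPU time in 5s of wall time, so at least ~0.69s of RT time was accumulated before the measurement started.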
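The Ratio column is consistent with the EXT/FAIR share of the combined CPU time, ext / (ext + rt). A minimal sketch of that arithmetic follows. The function name ext_share_percent is hypothetical, and the formula is inferred from the table values rather than quoted from rt_stall.c.

```c
#include <assert.h>

/*
 * Share of the combined CPU time that went to the EXT/FAIR child,
 * in percent. Formula inferred from the table above (assumption).
 */
static double ext_share_percent(double ext_sec, double rt_sec)
{
	double total = ext_sec + rt_sec;

	/* CPU times are cumulative counters and can never be negative. */
	assert(ext_sec >= 0.0 && rt_sec >= 0.0);

	return total > 0.0 ? 100.0 * ext_sec / total : 0.0;
}
```

Feeding in the three table rows gives roughly 3.66%, 3.07%, and 3.85%, all below the 4% threshold. Any RT time accumulated before the sleep grows the denominator and pushes the share down further.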
The failure tends to occur in later iterations (i=2 or i=3) because the parent has more work to do between iterations (destroying/re-attaching the sched_ext link), giving children more time to accumulate pre-sleep CPU time.
Relevant code: tools/testing/selftests/sched_ext/rt_stall.c:sched_stress_test()
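For readers unfamiliar with the measurement, here is a rough sketch of reading utime + stime from /proc/<pid>/stat. It is not the selftest's actual get_process_runtime(). The helper names parse_stat_ticks and read_cpu_seconds are hypothetical, and the field positions (utime is field 14, stime is field 15) follow the proc(5) layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Parse utime + stime (in clock ticks) from one /proc/<pid>/stat line.
 * Returns 0 on success, -1 on a malformed line.
 */
static int parse_stat_ticks(const char *line, unsigned long *ticks)
{
	unsigned long utime, stime;
	const char *p;

	assert(line && ticks);

	/* comm (field 2) may contain spaces or ')', so skip to the last ')'. */
	p = strrchr(line, ')');
	if (!p)
		return -1;

	/* Skip fields 3..13 (state through cmajflt), then read 14 and 15. */
	if (sscanf(p + 1,
		   " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
		   &utime, &stime) != 2)
		return -1;

	*ticks = utime + stime;
	return 0;
}

/* Total CPU seconds used by pid since it started, or -1.0 on error. */
static double read_cpu_seconds(pid_t pid)
{
	char path[64], line[1024];
	unsigned long ticks;
	long hz = sysconf(_SC_CLK_TCK);
	FILE *f;

	if (hz <= 0)
		return -1.0;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1.0;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1.0;
	}
	fclose(f);

	if (parse_stat_ticks(line, &ticks))
		return -1.0;
	return (double)ticks / (double)hz;
}
```

The value returned is cumulative since the process started, which is why a single reading taken after sleep() also counts whatever the child burned before the sleep began.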
Proposed Fix
Take before/after snapshots of CPU time around the sleep(RUN_TIME) window and compute deltas, rather than using total CPU time since fork. This eliminates the pre-measurement bias regardless of how long the gap between signal_ready() and sleep() takes.
See attached patch: 0001-selftests-sched_ext-Fix-rt_stall-flaky-measurement-w.patch
The fix adds:
- A pre-sleep snapshot of both children's CPU times via get_process_runtime()
- Subtraction of the pre-sleep snapshot from the post-sleep reading
- Error handling for the new snapshot reads
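The delta computation can be sketched as follows. This is an illustration of the idea, not the attached patch: struct cpu_snapshot and window_ext_share_percent are hypothetical names, and the share formula ext / (ext + rt) is an assumption about how the test forms its ratio.

```c
#include <assert.h>

/* CPU seconds consumed so far by each child at one point in time. */
struct cpu_snapshot {
	double ext_sec;	/* EXT/FAIR child */
	double rt_sec;	/* SCHED_FIFO child */
};

/*
 * EXT/FAIR share (in percent) of the CPU time consumed between two
 * snapshots. Anything accumulated before 'before' cancels out.
 * Returns -1.0 if the snapshots are inconsistent or the window is empty.
 */
static double window_ext_share_percent(struct cpu_snapshot before,
				       struct cpu_snapshot after)
{
	double ext, rt;

	/* Callers are expected to have handled read errors already. */
	assert(before.ext_sec >= 0.0 && before.rt_sec >= 0.0);

	ext = after.ext_sec - before.ext_sec;
	rt = after.rt_sec - before.rt_sec;

	if (ext < 0.0 || rt < 0.0 || ext + rt <= 0.0)
		return -1.0;

	return 100.0 * ext / (ext + rt);
}
```

In the test, the first snapshot would be taken immediately before sleep(RUN_TIME) and the second immediately after it, so the result no longer depends on how long the parent takes between the children signalling readiness and the start of the sleep.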
Impact
Without this fix, the sched_ext test job fails in most CI runs, blocking unrelated PRs from passing CI. The sched_ext job is not marked continue_on_error, so any sched_ext test failure fails the entire workflow.
References
- Prior fix: 0b82cc331d2e ("selftests/sched_ext: Fix rt_stall flaky failure")
- Original test: be621a76341c ("selftests/sched_ext: Add test for sched_ext dl_server")
- Related vmtest issue: [bpf-ci-bot] The sched_ext numa selftest is flaky #453 (covers numa test, not rt_stall)