wip: vmm_tests: shard vmm tests across multiple runners #2813
justus-camp-microsoft wants to merge 1 commit into microsoft:main
Pull request overview
This PR implements test sharding for VMM tests to distribute test execution across multiple CI runners, reducing overall test execution time.
Changes:
- Introduces a `NextestPartition` struct to represent test shard configuration (shard number and total shards)
- Threads the partition parameter through the entire test execution pipeline (from job configuration to nextest command generation)
- Configures specific test jobs to run in 3 shards: `x64-windows-intel` and `x64-windows-amd`
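
As a rough illustration of the first two bullets, here is a minimal sketch of how a partition value could be turned into nextest's `--partition count:shard/total` argument. The field names, the `new` constructor, and the `partition_args` helper are assumptions made for this example; the actual definitions live in `gen_cargo_nextest_run_cmd.rs` and may differ.

```rust
use std::fmt;

/// Hypothetical mirror of the PR's `NextestPartition`; field names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NextestPartition {
    /// 1-based index of this shard.
    pub shard: usize,
    /// Total number of shards.
    pub total: usize,
}

impl NextestPartition {
    /// Returns `None` unless `1 <= shard <= total`.
    pub fn new(shard: usize, total: usize) -> Option<Self> {
        (shard >= 1 && shard <= total).then_some(Self { shard, total })
    }
}

impl fmt::Display for NextestPartition {
    /// Formats the value the way nextest's counted partitioning expects.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "count:{}/{}", self.shard, self.total)
    }
}

/// Hypothetical helper: extra CLI arguments for an optional partition.
/// `None` adds nothing, so unsharded jobs keep their existing command line.
pub fn partition_args(partition: Option<NextestPartition>) -> Vec<String> {
    match partition {
        Some(p) => vec!["--partition".to_string(), p.to_string()],
        None => Vec::new(),
    }
}
```

With this shape, `partition_args(NextestPartition::new(2, 3))` yields `["--partition", "count:2/3"]`, while the build-time and unit-test callers that pass `None` get an empty list.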
Reviewed changes
Copilot reviewed 11 out of 14 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `flowey/flowey_lib_common/src/gen_cargo_nextest_run_cmd.rs` | Adds `NextestPartition` struct and generates `--partition count:shard/total` argument for nextest |
| `flowey/flowey_lib_common/src/run_cargo_nextest_run.rs` | Threads `nextest_partition` parameter through the run pipeline |
| `flowey/flowey_lib_hvlite/src/test_nextest_vmm_tests_archive.rs` | Adds partition parameter to test archive execution |
| `flowey/flowey_lib_hvlite/src/run_cargo_nextest_run.rs` | Passes partition parameter to common library |
| `flowey/flowey_lib_hvlite/src/build_nextest_vmm_tests.rs` | Sets partition to `None` for build-time test execution |
| `flowey/flowey_lib_hvlite/src/build_nextest_unit_tests.rs` | Sets partition to `None` for unit test builds |
| `flowey/flowey_lib_hvlite/src/test_nextest_unit_tests_archive.rs` | Sets partition to `None` for unit test archives |
| `flowey/flowey_lib_hvlite/src/_jobs/local_build_and_run_nextest_vmm_tests.rs` | Sets partition to `None` for local builds |
| `flowey/flowey_lib_hvlite/src/_jobs/consume_and_test_nextest_vmm_tests_archive.rs` | Adds partition parameter and passes it through to test execution |
| `flowey/flowey_hvlite/src/pipelines/checkin_gates.rs` | Configures `shard_count` for specific test jobs and generates multiple jobs per shard |
| `ci-flowey/openvmm-pr.yaml` | Generated ADO pipeline with new sharded test jobs (job16-job18 for intel, etc.) |
| `.github/workflows/openvmm-ci.yaml` | Generated GitHub Actions workflow with new sharded test jobs |
825c136 to 9ebd801
    needs_prep_run: bool,
    /// Number of shards to split test execution across.
    /// None means no sharding (single job).
    shard_count: Option<usize>,
What's the difference between `None` and `Some(1)`? Why not just make this a plain `usize`?
No issue I guess, the nextest invocation would just end up having `--partition count:1/1`, which is harmless. In any case, based on my testing, I'm not sure sharding this way ends up being worth it.
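
To make the `None` versus `Some(1)` question concrete, here is a small sketch of how a `shard_count` might expand into per-job partition settings. `expand_shards` is a hypothetical helper written for this comment thread, not code from `checkin_gates.rs`.

```rust
/// Hypothetical helper: expands `shard_count` into one entry per CI job.
/// Each entry is either `None` (no `--partition` flag) or
/// `Some((shard, total))` with a 1-based shard index.
pub fn expand_shards(shard_count: Option<usize>) -> Vec<Option<(usize, usize)>> {
    match shard_count {
        // No sharding: a single job, nextest invoked without `--partition`.
        None => vec![None],
        // `Some(n)`: `n` jobs, each running `count:i/n`.
        // Note that `Some(0)` produces no jobs at all; a plain `usize`
        // would need the same guard.
        Some(n) => (1..=n).map(|i| Some((i, n))).collect(),
    }
}
```

Under this sketch, `None` and `Some(1)` both produce exactly one job; the only difference is whether that job passes `count:1/1` to nextest.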
So I tried 2 shards and 3 shards; here's the result. It looks like 2 shards ends up being basically wall-clock neutral because of how nextest shards the tests across runners. 3 shards gives us ~15 minutes of wall-clock time savings at the cost of a 10% increase in total CI runner time. Does that seem worth it?
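
One way to see how 2 shards can be wall-clock neutral while 3 shards helps is a toy model of count-based partitioning. The sketch below assumes tests are dealt to shards round-robin in list order, which is a simplification and not nextest's actual implementation, and the durations in the usage note are invented for illustration, not measurements from this PR.

```rust
/// Simplified model (an assumption, not nextest's real algorithm):
/// test `i` (0-based, in list order) is assigned to shard `i % shards`.
/// Returns the summed duration of each shard.
pub fn shard_totals(durations: &[u64], shards: usize) -> Vec<u64> {
    assert!(shards > 0, "need at least one shard");
    let mut totals = vec![0u64; shards];
    for (i, d) in durations.iter().enumerate() {
        totals[i % shards] += d;
    }
    totals
}

/// Wall-clock time is set by the slowest shard, since shards run in parallel.
pub fn wall_clock(durations: &[u64], shards: usize) -> u64 {
    shard_totals(durations, shards)
        .into_iter()
        .max()
        .unwrap_or(0)
}
```

With hypothetical durations `[30, 1, 30, 1, 30, 1]` (minutes), one shard takes 93, two shards take 90 because every slow test lands on the same shard, and three shards take 31. The model ignores per-runner setup cost, which is where an increase in total runner time would come from.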
No description provided.