-
Notifications
You must be signed in to change notification settings - Fork 189
Fix CPU detection in cgroup v2 constrained environments #303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Remove sysconf(_SC_NPROCESSORS_CONF) validation that causes false failures when the library runs in a process with cgroup v2 CPU limits. When cpu.max is set in cgroup v2, sysconf(_SC_NPROCESSORS_CONF) returns the cgroup-limited CPU count rather than the actual system CPU count. However, scanning /sys/devices/system/cpu always returns the real number of cores in the system, causing a mismatch and spurious check failures. By removing the sysconf() call and relying solely on the sysfs scan, we ensure consistent CPU topology detection regardless of cgroup constraints. Signed-off-by: Mikhail Malyshev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes CPU detection failures in cgroup v2 constrained environments by removing the sysconf(_SC_NPROCESSORS_CONF) validation check that caused false positives when CPU limits were applied.
Key Changes:
- Removed sysconf() validation that conflicted with sysfs scanning in cgroup v2 environments
- Enhanced error messages with more specific context including the SYSTEM_CPU path
- Added debug/error logging for CPU topology detection failures
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| num_cpus = scandir(SYSTEM_CPU, &namelist, filter_cpu, cpu_sort); | ||
|
|
||
| if (pqos_set_no_files_limit(max_core_count)) { | ||
| if (pqos_set_no_files_limit(num_cpus)) { |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scandir() result is not validated before being used in pqos_set_no_files_limit(). If scandir() fails and returns a negative value, this will be passed to pqos_set_no_files_limit() which expects a positive count. The validation at line 278 occurs after this call. Move the validation before line 262 to prevent passing invalid values.
| num_cpus = scandir(SYSTEM_CPU, &namelist, filter_cpu, cpu_sort); | ||
| if (num_cpus <= 0 || max_core_count < num_cpus) { | ||
| LOG_ERROR("Failed to read proc cpus!\n"); | ||
| if (num_cpus <= 0 ) { |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra space before closing parenthesis. Remove the trailing space for consistency with code style.
| LOG_DEBUG("Logical core %u is offline, skipping\n", lcore); | ||
| continue; |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentation uses spaces instead of consistent indentation style. The continue statement and surrounding braces at lines 288-291 should match the indentation pattern used elsewhere in the function (which appears to use 8-space indentation based on the surrounding code).
| LOG_DEBUG("Logical core %u is offline, skipping\n", lcore); | |
| continue; | |
| LOG_DEBUG("Logical core %u is offline, skipping\n", lcore); | |
| continue; |
| if (retval != PQOS_RETVAL_OK) | ||
| break; | ||
| if (retval != PQOS_RETVAL_OK) { | ||
| LOG_ERROR("Failed to get socket for lcore %u\n", lcore); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The newly added error logging blocks at lines 294-297, 300-303, and 307-310 use inconsistent indentation (spaces instead of tabs). Ensure these blocks match the indentation style used in the rest of the file for consistency.
| if (retval != PQOS_RETVAL_OK) | ||
| break; | ||
| if (retval != PQOS_RETVAL_OK) { | ||
| LOG_ERROR("Failed to get NUMA node for lcore %u\n", lcore); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The newly added error logging blocks at lines 294-297, 300-303, and 307-310 use inconsistent indentation (spaces instead of tabs). Ensure these blocks match the indentation style used in the rest of the file for consistency.
| if (retval != PQOS_RETVAL_OK) | ||
| break; | ||
| if (retval != PQOS_RETVAL_OK) { | ||
| LOG_ERROR("Failed to get cache IDs for lcore %u\n", lcore); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The newly added error logging blocks at lines 294-297, 300-303, and 307-310 use inconsistent indentation (spaces instead of tabs). Ensure these blocks match the indentation style used in the rest of the file for consistency.
|
@rucoder Please make sure that your patch follows the project's style. Also make sure that unit-tests are passed: |
Description
Remove sysconf(_SC_NPROCESSORS_CONF) validation that causes false failures when the library runs in a process with cgroup v2 CPU limits.
When cpu.max is set in cgroup v2, sysconf(_SC_NPROCESSORS_CONF) returns the cgroup-limited CPU count rather than the actual system CPU count. However, scanning /sys/devices/system/cpu always returns the real number of cores in the system, causing a mismatch and spurious check failures.
By removing the sysconf() call and relying solely on the sysfs scan, we ensure consistent CPU topology detection regardless of cgroup constraints.
Affected parts
Motivation and Context
https://github.com/lf-edge/eve uses libpqos from privileged system container pkg/pillar. This is a houskeeping conitainer that creates user workloads and it is restricted only to core 0 but has full visibility of /sys/fs so the library did not initilize
How Has This Been Tested?
This was tested from EVE OS on a real device with intel RDT support
Types of changes
Checklist: