Add support for CPU overcommit #222

@bonachea

Description

Caffeine currently assumes implicitly that each target node has enough CPU resources to dedicate at least one CPU core (or at least one hardware thread) to every image in a multi-image run. If a user runs more images on a node than there are available physical cores (i.e. "CPU overcommit"), performance may degrade. Running with CPU overcommit is probably never recommended in production runs, but it can occasionally be useful when debugging on a laptop or workstation defects that only arise at larger image scales.

In many cases the runtime can detect CPU overcommit. The actual impact of overcommit on system performance is a complicated topic that depends on many details, including the communication transports in use and the OS process scheduling policies. Once overcommit is detected (or indicated by a user setting), adjustments can be made (e.g. in some busy-wait synchronization algorithms) to share core resources more cooperatively and hopefully avoid some of the worst performance penalties in heavy overcommit scenarios.
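To make the idea concrete, here is a minimal C sketch of one possible detection and mitigation approach. It is not Caffeine's implementation. The function names (`is_overcommitted`, `detect_overcommit`, `wait_until_nonzero`) and the `local_images` parameter are hypothetical, and the runtime is assumed to already know how many images share the node. `_SC_NPROCESSORS_ONLN` is a widely available extension (Linux, macOS, BSDs) rather than strict POSIX, and it does not account for CPU affinity masks or container limits, so a real detector would likely need more than this.

```c
#include <sched.h>      /* sched_yield */
#include <stdatomic.h>  /* atomic_int, atomic_load_explicit */
#include <unistd.h>     /* sysconf */

/* Hypothetical policy check: overcommit means more co-located images
 * than online CPUs. An unknown CPU count is treated as "no overcommit". */
static int is_overcommitted(long local_images, long online_cpus) {
    if (online_cpus < 1) return 0;
    return local_images > online_cpus;
}

/* Hypothetical detector: local_images is assumed to be supplied by the
 * runtime (e.g. from its job launcher or node-local rank information). */
static int detect_overcommit(long local_images) {
    long online_cpus = sysconf(_SC_NPROCESSORS_ONLN); /* -1 on failure */
    return is_overcommitted(local_images, online_cpus);
}

/* Busy-wait until *flag becomes nonzero. When overcommitted, yield the
 * core on every failed poll so a co-scheduled image can make progress. */
static void wait_until_nonzero(atomic_int *flag, int overcommitted) {
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (overcommitted) sched_yield();
    }
}
```

In this sketch a user setting could simply override the result of `detect_overcommit` before it is passed to `wait_until_nonzero`. Yielding on every poll trades some wake-up latency for fairness, which is typically the right trade only when cores are actually shared.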

This issue exists to track progress on this topic.

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
