Description
Caffeine currently assumes, implicitly, that on a given target node it can dedicate at least one CPU core (or at least one hardware thread) to each image in a multi-image run. If a user runs more images on a node than there are available physical cores (i.e. "CPU overcommit"), performance might degrade. Running with CPU overcommit is probably never recommended for production runs, but it can occasionally be useful when using a laptop/workstation to debug defects that only arise at larger image counts.
In many cases the runtime can detect CPU overcommit. The actual impact of overcommit on system performance is a complicated topic that depends on many details, including the communication transports in use and the OS process scheduling policies. Once an overcommit scenario is detected (or indicated by a user setting), adjustments can be made (e.g. in some busy-wait synchronization algorithms) to share core resources more cooperatively and hopefully avoid some of the worst performance penalties in heavy overcommit scenarios.
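As a rough illustration of the idea above, here is a minimal C sketch, not Caffeine's actual implementation. All function names (`is_overcommitted`, `online_cpus`, `wait_for_flag`) are hypothetical, and the sketch assumes the runtime already knows how many images share the node. It uses `sysconf(_SC_NPROCESSORS_ONLN)`, which is widely available on Linux and macOS but reports logical CPUs and does not account for affinity masks or container limits, so a real detector would likely need more care.

```c
#include <assert.h>
#include <sched.h>      /* sched_yield */
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>     /* sysconf */

/* Hypothetical helper: true when more images share a node than it has CPUs.
 * An unknown CPU count (<= 0) is conservatively treated as "not overcommitted". */
static bool is_overcommitted(long images_on_node, long cpus_on_node) {
    assert(images_on_node > 0);
    return cpus_on_node > 0 && images_on_node > cpus_on_node;
}

/* Logical CPUs currently online, or -1 if the system cannot report it. */
static long online_cpus(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Hypothetical busy-wait: spin until *flag becomes nonzero.
 * When yield_when_waiting is set, give up the CPU on every failed poll
 * so that other images sharing the core can make progress. */
static void wait_for_flag(atomic_int *flag, bool yield_when_waiting) {
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (yield_when_waiting) {
            sched_yield();
        }
    }
}
```

A caller might compute `is_overcommitted(images_on_node, online_cpus())` once at startup, possibly overridden by a user setting, and pass the result to every wait loop. Yielding on each poll trades some wake-up latency in the non-overcommitted case for fairer core sharing under overcommit, which is why the sketch makes it conditional.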
This issue exists to track progress on this topic.