[Platform Manager] adds exponential backoff retry to hwmon path detection #841
Pre-submission checklist
pip install -r requirements-dev.txt && pre-commit install
pre-commit run
Summary
Issue
There is currently a race condition in platform_manager's device exploration logic: if too much time elapses between when a device is initialized and when its hwmon attributes are enumerated, then DevicePathResolver::resolveSensorPath can fail to find a hwmon path that exists.
Here is an example to illustrate this issue. Several Arista platforms use an I2C CPLD to detect the presence of fan modules. Defined in a platform_manager configuration, this looks like:
In a system where a single CPLD manages multiple fan slots, we have seen issues where platform_manager is not able to correctly detect FAN_SLOT presence, which depends on checking the hwmon/hwmon[n]/fanX_present sysfs file provided by the CPLD kernel driver after the driver is loaded. The platform_manager service ends up logging errors despite the driver correctly creating the sysfs files:
To determine the sequence of events, I wrote a small script to log millisecond-granularity timestamps and the contents of the CPLD driver's sysfs directory, polled at a 3ms interval. I ran this script concurrently with platform_manager and traced the sequence of events. In sysfs, I saw the following sequence:
And here is the corresponding platform_manager log:
Based on the timestamps, the FAN_SLOT device exploration occurs between 11:53:47.501047 and 11:53:47.501942, which is entirely between when the CPLD device has been created (11:53:47.449) and when the hwmon subsystem is created (11:53:47.568). With this same configuration, I sometimes observed platform_manager completing device exploration with no errors. In the case where platform_manager succeeded with no errors, the CPLD hwmon creation took only 64ms; in the case where it failed, the hwmon creation took 119ms.
platform_manager does not take into account that device creation and hwmon enumeration are not atomic: there can be tens to hundreds of milliseconds between when an I2C device is created in the kernel and when its hwmon endpoints are published.
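The polling script used for the traces above is not reproduced here. For anyone wanting to repeat the measurement, a poller along the same lines could look like the sketch below. It is an illustration only: the names snapshotTree and logTreeChanges are hypothetical, and it prints wall-clock milliseconds since the epoch so that entries can be lined up against platform_manager log timestamps.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <filesystem>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: sorted relative paths of everything under root.
// Returns an empty list if root does not exist. Directory symlinks are not
// followed (the iterator's default), which avoids loops in sysfs.
std::vector<std::string> snapshotTree(const fs::path& root) {
  std::vector<std::string> entries;
  std::error_code ec;
  for (fs::recursive_directory_iterator it(root, ec), end; !ec && it != end;
       it.increment(ec)) {
    entries.push_back(it->path().lexically_relative(root).string());
  }
  std::sort(entries.begin(), entries.end());
  return entries;
}

// Hypothetical helper: polls root every `interval` for roughly `duration` and
// writes a timestamped line each time the listing changes (including the
// first snapshot). Returns the number of lines written.
int logTreeChanges(const fs::path& root, std::chrono::milliseconds interval,
                   std::chrono::milliseconds duration, std::ostream& out) {
  assert(interval.count() > 0);
  const auto start = std::chrono::steady_clock::now();
  std::vector<std::string> previous;
  bool first = true;
  int changes = 0;
  while (std::chrono::steady_clock::now() - start <= duration) {
    std::vector<std::string> current = snapshotTree(root);
    if (first || current != previous) {
      const auto wallMs = std::chrono::duration_cast<std::chrono::milliseconds>(
          std::chrono::system_clock::now().time_since_epoch());
      out << wallMs.count() << "ms:";
      for (const auto& entry : current) {
        out << ' ' << entry;
      }
      out << '\n';
      ++changes;
      first = false;
      previous = std::move(current);
    }
    std::this_thread::sleep_for(interval);
  }
  return changes;
}
```

Calling logTreeChanges on the CPLD device's sysfs directory with a 3ms interval while platform_manager starts would show when the hwmon directory appears relative to device creation.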
Solution
This PR presents one proposed solution to this issue. In DevicePathResolver.cpp, I have updated the DevicePathResolver::resolveSensorPath function to implement a retry loop with exponential backoff, making detection of hwmon subsystems robust to small fluctuations in timing. If the endpoint is already enumerated, then there is no added overhead. If the endpoint is not present, then the function will wait 10ms plus backoff for the device path. In the worst case, where the driver has failed to create an endpoint, the function would add a maximum of 5 seconds for any sensor path.
Positive aspects of this approach:
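The retry loop described in this section can be sketched roughly as follows. This is a standalone illustration, not the code in DevicePathResolver.cpp: the function names findHwmonDir and waitForHwmonDir are hypothetical, and the doubling factor is an assumption. Only the 10ms initial wait and the 5 second ceiling come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <filesystem>
#include <optional>
#include <string>
#include <thread>

namespace fs = std::filesystem;

// Hypothetical helper: returns the first hwmon[n] directory under
// <devicePath>/hwmon, or std::nullopt if none has been published yet.
std::optional<std::string> findHwmonDir(const fs::path& devicePath) {
  std::error_code ec;
  const fs::path hwmonRoot = devicePath / "hwmon";
  for (fs::directory_iterator it(hwmonRoot, ec), end; !ec && it != end;
       it.increment(ec)) {
    const std::string name = it->path().filename().string();
    if (name.rfind("hwmon", 0) == 0 && it->is_directory(ec)) {
      return it->path().string();
    }
  }
  return std::nullopt;
}

// Hypothetical retry wrapper: checks immediately, then sleeps with a delay
// that doubles after every miss (assumed factor) until the total time slept
// reaches maxTotalWait.
std::optional<std::string> waitForHwmonDir(
    const fs::path& devicePath,
    std::chrono::milliseconds initialDelay = std::chrono::milliseconds(10),
    std::chrono::milliseconds maxTotalWait = std::chrono::milliseconds(5000)) {
  assert(initialDelay.count() > 0);
  std::chrono::milliseconds delay = initialDelay;
  std::chrono::milliseconds waited{0};
  while (true) {
    if (auto dir = findHwmonDir(devicePath)) {
      return dir;  // Fast path: no sleep when the endpoint already exists.
    }
    if (waited >= maxTotalWait) {
      return std::nullopt;  // Budget exhausted; the caller reports the error.
    }
    // Never sleep past the overall budget.
    const auto step = std::min(delay, maxTotalWait - waited);
    std::this_thread::sleep_for(step);
    waited += step;
    delay *= 2;
  }
}
```

With these assumed parameters, a hwmon directory that appears about 120ms after device creation would be found after roughly four sleeps (10 + 20 + 40 + 80 ms), while a device whose hwmon directory is already present returns on the first check without sleeping.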
Test Plan
The platform_manager binary builds successfully and clang-format passes.
To verify the proposed solution, I loaded the same example setup with a CPLD managing eight fan slots. Here is the platform_manager log after this improvement:
There are no errors reported. From the log, we can see that all fan slots are now correctly identified as present, and the improvement added only 150ms to the total device exploration time:
Here's the sysfs sequence during the platform_manager exploration, where the hwmon endpoint takes 139ms to enumerate: