teuthology/task/kernel: always hard reboot #2147
On the new trial machines, the `shutdown -r now` routine is hanging somewhere before reboot. The cause is unknown; it's been very resistant to debugging. So, just sync file systems, remount RO, and then do a hard reboot.

Fixes: https://tracker.ceph.com/issues/74717
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
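The sequence described above (sync, remount read-only, hard reboot) might be sketched as follows. This is a minimal illustration, not the PR's actual code: `build_hard_reboot_argv` is a hypothetical helper name, the `mount -o remount,ro` step and the final `reboot -f` trigger are assumptions (the diff excerpt does not show them), and only the two `/sys/kernel/reboot` writes are taken from the diff under review.

```python
import shlex


def build_hard_reboot_argv(mountpoint="/", reboot_type="pci"):
    """Build an argv that syncs, remounts read-only, then forces a reboot.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    steps = [
        # Flush dirty pages to disk before anything else.
        "sync",
        # Assumption: remounting this one file system read-only is enough.
        f"mount -o remount,ro {shlex.quote(mountpoint)}",
        # The two sysfs writes that appear in the diff under review.
        f"echo {shlex.quote(reboot_type)} > /sys/kernel/reboot/type",
        "echo 1 > /sys/kernel/reboot/force",
        # Assumption: the final trigger; the diff excerpt does not show it.
        "reboot -f",
    ]
    return ["sudo", "bash", "-c", " && ".join(steps)]


if __name__ == "__main__":
    # Only prints the script; nothing is executed here.
    print(build_hard_reboot_argv()[-1])
```

Because the steps are joined with `&&`, a failed remount would stop the sequence before the reboot; whether that is the desired behavior is a design choice this sketch does not settle.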
'sudo', 'bash', '-c',
"""
echo pci > /sys/kernel/reboot/type && \
echo 1 > /sys/kernel/reboot/force && \
Are both pci and force required? Have you tried passing just force?
I'm curious because it might help with root causing the issue. IIUC force causes the kernel to skip attempting to stop other CPUs and things like interrupt controllers, hardware timers and IOMMU. If the problem lies in one of those, then the choice of the reboot method/type should be irrelevant because that happens before the actual reboot is issued.
I don't know. Let's experiment with that after merging this?
I did look over the kernel code for logging about which phase has been completed and it's amazingly devoid of any logging, even optional. I guess the guys that wrote this code, every single one of them, got it 100% right the first time on every change, and no one has ever encountered a platform bug that needs diagnosis before us.
and also @idryomov no one appreciates an actual root-causing more than me, but even if we do identify which thing makes the difference, it's likely going to be a bug in the Supermicro hardware or firmware that's the culprit, and at this point we really need this to Just Work, so we're pretty deep in "it doesn't matter" territory.
@dmick I get your point, I asked just because there is a reason pci isn't one of the reboot methods that Linux tries by default (my understanding is that Windows doesn't use it and there are known cases where writing to the PCI port doesn't work at all -- most x86 hardware is still getting tested only on Windows by the manufacturers). Putting root causing aside, since this is being generalized now (added to Remote class for more widespread use in teuthology, injected into the reimage process, etc), if we can get away with one special quirk/override instead of two, we should do that IMO to minimize the chances of having to undo (some of) this in the future. We haven't tested pci on smithis, for example.
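The force-only experiment raised in this thread could be set up along these lines. This is only a sketch under stated assumptions: `reboot_override_variants` and `render_writes` are hypothetical names, and the set of variants is an illustrative guess at what one might compare, not something the PR defines. The sysfs paths are the ones from the diff.

```python
def reboot_override_variants():
    """Map a label to the sysfs writes to apply before triggering a reboot.

    Hypothetical helper; the variant set is an assumption for illustration.
    """
    force = ("/sys/kernel/reboot/force", "1")
    pci = ("/sys/kernel/reboot/type", "pci")
    return {
        "force-only": [force],
        "pci-only": [pci],
        "pci+force": [pci, force],
    }


def render_writes(writes):
    """Render (path, value) pairs as a single shell fragment."""
    return " && ".join(f"echo {value} > {path}" for path, value in writes)


if __name__ == "__main__":
    for label, writes in reboot_override_variants().items():
        print(f"{label}: {render_writes(writes)}")
```

Running each variant on one trial machine and one smithi would show whether a single override is enough, which is the question the thread leaves open.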
djgalloway left a comment
I can confirm the order of commands causes the trial machines to reboot...
There are a bunch of python-y things I do not understand in here (e.g., `procs[role_remote.name] = role_remote.safe_hard_reboot()` but then `procs[role_remote.name] = role_remote.run(` elsewhere), so I'm not comfy Approving.
Zack approved on Slack. I'm still seeing failing jobs because of this. I say we merge it.
Second recreation due to mixing up repositories with mine vs. upstream (to test).
Last PR was #2145