teuthology/task/kernel: always hard reboot#2147

Merged
djgalloway merged 1 commit into ceph:main from batrick:kernel-fix
Feb 17, 2026

batrick commented Feb 13, 2026

Second re-creation of this PR, because I mixed up my repository and upstream (for testing).

The previous PR was #2145.

batrick requested a review from a team as a code owner February 13, 2026 13:06
batrick requested review from kamoltat and zmc and removed request for a team February 13, 2026 13:06
On the new trial machines, `shutdown -r now` hangs somewhere before the
reboot. The cause is unknown and has been very resistant to debugging.
So, just sync the file systems, remount them read-only, and then do a
hard reboot.

Fixes: https://tracker.ceph.com/issues/74717
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
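
The sequence described in the commit message (sync, remount read-only, hard reboot) could be sketched roughly as below. This is an illustration only, not the code merged in this PR. The helper name `build_hard_reboot_args` is hypothetical, and only the two sysfs writes appear in the excerpt under review. The `sync`, `echo u > /proc/sysrq-trigger`, and `reboot -f` steps are assumptions about how the remaining steps might be expressed.

```python
def build_hard_reboot_args(reboot_type="pci", force=True):
    """Build an argv list for a forced reboot (hypothetical helper).

    Only the two /sys/kernel/reboot writes come from the reviewed excerpt;
    the sync, remount, and final reboot steps are assumptions.
    """
    steps = []
    if reboot_type:
        # Select the reboot method the kernel should use.
        steps.append(f"echo {reboot_type} > /sys/kernel/reboot/type")
    if force:
        # Ask the kernel to skip the usual pre-reboot shutdown work.
        steps.append("echo 1 > /sys/kernel/reboot/force")
    # Assumed: flush dirty data to disk.
    steps.append("sync")
    # Assumed: SysRq 'u' requests an emergency read-only remount.
    steps.append("echo u > /proc/sysrq-trigger")
    # Assumed: reboot without going through the init system.
    steps.append("reboot -f")
    return ["sudo", "bash", "-c", " && ".join(steps)]


if __name__ == "__main__":
    # Print the command instead of running it; running it reboots the host.
    print(build_hard_reboot_args()[-1])
```

Building the argument list separately from running it keeps the sketch safe to execute on a workstation and makes each override easy to toggle.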
'sudo', 'bash', '-c',
"""
echo pci > /sys/kernel/reboot/type && \
echo 1 > /sys/kernel/reboot/force && \
Contributor

Are both `pci` and `force` required? Have you tried passing just `force`?

I'm curious because it might help with root-causing the issue. IIUC, `force` causes the kernel to skip attempting to stop other CPUs and things like interrupt controllers, hardware timers, and the IOMMU. If the problem lies in one of those, then the choice of the reboot method/type should be irrelevant, because that happens before the actual reboot is issued.

Member Author

I don't know. Let's experiment with that after merging this?
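
The experiment discussed in this thread (trying `force` alone, `pci` alone, and both together) could be enumerated with something like the following. It is a sketch only: `override_variants` is a hypothetical helper, not part of teuthology, and it only produces the shell fragments to try on a test machine; it runs nothing.

```python
from itertools import product

SYSFS_REBOOT = "/sys/kernel/reboot"


def override_variants():
    """Yield (label, shell_snippet) for every combination of the two overrides."""
    for use_pci, use_force in product((False, True), repeat=2):
        writes = []
        if use_pci:
            writes.append(f"echo pci > {SYSFS_REBOOT}/type")
        if use_force:
            writes.append(f"echo 1 > {SYSFS_REBOOT}/force")
        enabled = [name for name, on in (("pci", use_pci), ("force", use_force)) if on]
        # An empty snippet means "leave the kernel defaults untouched".
        yield "+".join(enabled) or "defaults", " && ".join(writes)


if __name__ == "__main__":
    for label, snippet in override_variants():
        print(f"{label}: {snippet or '(no overrides)'}")
```

If the `force` variant alone is enough to get the trial machines to reboot, that would support the idea that the hang happens in the pre-reboot shutdown work rather than in the reboot method itself.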

Member

I did look over the kernel code for logging about which phase has been completed and it's amazingly devoid of any even optional logging. I guess the guys that wrote this code, every single one of them, got it 100% right the first time on every change, and no one has ever encountered a platform bug that needs diagnosis before us

Member

and also @idryomov no one appreciates an actual root-causing more than me, but even if we do identify which thing makes the difference, it's likely going to be a bug in the Supermicro hardware or firmware that's the culprit, and at this point we really need this to Just Work, so we're pretty deep in "it doesn't matter" territory.

Contributor

@dmick I get your point, I asked just because there is a reason pci isn't one of the reboot methods that Linux tries by default (my understanding is that Windows doesn't use it and there are known cases where writing to the PCI port doesn't work at all -- most x86 hardware is still getting tested only on Windows by the manufacturers). Putting root causing aside, since this is being generalized now (added to Remote class for more widespread use in teuthology, injected into the reimage process, etc), if we can get away with one special quirk/override instead of two, we should do that IMO to minimize the chances of having to undo (some of) this in the future. We haven't tested pci on smithis, for example.

Contributor

djgalloway left a comment

I can confirm the order of commands causes the trial machines to reboot...

There are a bunch of python-y things I do not understand in here (e.g., `procs[role_remote.name] = role_remote.safe_hard_reboot()` but then `procs[role_remote.name] = role_remote.run(` elsewhere), so I'm not comfy approving.
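
One plausible reading of the pattern questioned above is that both calls return a process handle, which is stored per remote and waited on later. The sketch below uses stand-in classes (`FakeProc`, `FakeRemote`) rather than teuthology's real ones. It assumes `safe_hard_reboot()` returns the same kind of handle as a non-blocking `run()`, which the excerpt does not confirm.

```python
class FakeProc:
    """Stand-in for a handle to a command started on a remote host."""

    def __init__(self, label):
        self.label = label

    def wait(self):
        # A real handle would block until the remote command exits.
        return self.label


class FakeRemote:
    """Stand-in for a remote host object; not teuthology's Remote class."""

    def __init__(self, name):
        self.name = name

    def run(self, args, wait=True):
        proc = FakeProc(f"{self.name}: {' '.join(args)}")
        if wait:
            proc.wait()
        return proc

    def safe_hard_reboot(self):
        # Assumed to hand back a handle, just like run(wait=False).
        return FakeProc(f"{self.name}: hard reboot")


def reboot_all(remotes, hard=True):
    """Start a reboot on every remote, then wait for all of the handles."""
    procs = {}
    for remote in remotes:
        if hard:
            procs[remote.name] = remote.safe_hard_reboot()
        else:
            procs[remote.name] = remote.run(["sudo", "shutdown", "-r", "now"], wait=False)
    # Either way the dict holds handles, so one loop can wait on all of them.
    return {name: proc.wait() for name, proc in procs.items()}


if __name__ == "__main__":
    print(reboot_all([FakeRemote("host-a"), FakeRemote("host-b")]))
```

Under that assumption, it does not matter which method filled a given `procs` entry, because the later code only needs something it can wait on.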

@djgalloway
Contributor

Zack approved on Slack. I'm still seeing failing jobs because of this. I say we merge it.

djgalloway merged commit 7fc6083 into ceph:main Feb 17, 2026
9 checks passed