teuthology/task/kernel: always hard reboot by batrick · Pull Request #2147 · ceph/teuthology

batrick · 2026-02-13T13:06:41Z

Second re-creation of this PR: I mixed up my own repository with upstream (to test).

Last PR was #2145

On the new trial machines, the `shutdown -r now` routine hangs somewhere before the reboot. The cause is unknown and has been very resistant to debugging. So, just sync the file systems, remount them read-only, and then do a hard reboot.

Fixes: https://tracker.ceph.com/issues/74717
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
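The sequence described above (sync, remount read-only, hard reboot) can be sketched as a small command builder. This is a minimal sketch, not the code merged in this PR: the function name `build_hard_reboot_argv` is hypothetical, and the last three shell steps (`sync`, the sysrq `u` remount, and `reboot -f`) are assumptions, since the diff excerpt in the review only shows the two `/sys/kernel/reboot` writes.

```python
import shlex

# Values the kernel accepts for /sys/kernel/reboot/type on x86.
KNOWN_REBOOT_TYPES = ("bios", "acpi", "kbd", "triple", "efi", "pci")


def build_hard_reboot_argv(reboot_type="pci", force=True):
    """Return an argv list for a sync / remount-RO / hard-reboot sequence.

    Hypothetical helper: only the two sysfs writes are taken from the
    PR's diff excerpt; the remaining steps are assumptions.
    """
    if reboot_type is not None and reboot_type not in KNOWN_REBOOT_TYPES:
        raise ValueError(f"unknown reboot type: {reboot_type!r}")

    steps = []
    if reboot_type is not None:
        # Override the method the kernel uses to reset the machine.
        steps.append(
            f"echo {shlex.quote(reboot_type)} > /sys/kernel/reboot/type"
        )
    if force:
        # Ask the kernel to skip the orderly machine shutdown on restart.
        steps.append("echo 1 > /sys/kernel/reboot/force")

    # ASSUMPTION: flush dirty data, remount everything read-only through
    # sysrq, then reboot without going through the init system.
    steps += ["sync", "echo u > /proc/sysrq-trigger", "reboot -f"]

    return ["sudo", "bash", "-c", " && ".join(steps)]
```

Calling `build_hard_reboot_argv(reboot_type=None)` produces the same sequence with only the `force` override, which is one way to try the two quirks separately.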

idryomov · 2026-02-13T13:43:06Z

teuthology/orchestra/remote.py

+            'sudo', 'bash', '-c',
+            """
+            echo pci > /sys/kernel/reboot/type && \
+            echo 1 > /sys/kernel/reboot/force && \


Are both pci and force required? Have you tried passing just force?

I'm curious because it might help with root causing the issue. IIUC force causes the kernel to skip attempting to stop other CPUs and things like interrupt controllers, hardware timers and IOMMU. If the problem lies in one of those, then the choice of the reboot method/type should be irrelevant because that happens before the actual reboot is issued.

batrick · 2026-02-13

I don't know. Let's experiment with that after merging this?

dmick · 2026-02-13

I did look over the kernel code for logging about which phase has been completed, and it's amazingly devoid of any logging, even optional. I guess the guys who wrote this code, every single one of them, got it 100% right the first time on every change, and no one before us has ever encountered a platform bug that needs diagnosis.

dmick · 2026-02-13

And also, @idryomov, no one appreciates an actual root-causing more than me, but even if we do identify which thing makes the difference, the culprit is likely going to be a bug in the Supermicro hardware or firmware, and at this point we really need this to Just Work, so we're pretty deep in "it doesn't matter" territory.

idryomov · 2026-02-13

@dmick I get your point. I asked only because there is a reason pci isn't one of the reboot methods that Linux tries by default: my understanding is that Windows doesn't use it, and there are known cases where writing to the PCI port doesn't work at all (most x86 hardware is still tested only on Windows by the manufacturers). Putting root-causing aside, since this is being generalized now (added to the Remote class for more widespread use in teuthology, injected into the reimage process, etc.), if we can get away with one special quirk/override instead of two, we should do that IMO, to minimize the chances of having to undo (some of) this in the future. We haven't tested pci on the smithis, for example.
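When comparing machine types (for example, before trying `pci` on hardware where it has not been tested), it can help to record what the kernel is currently configured to do. The sketch below reads the `mode`, `type`, `cpu`, and `force` attributes under `/sys/kernel/reboot`; the function name `read_reboot_settings` is hypothetical and is not part of teuthology, and on kernels without this sysfs directory every value comes back as `None`.

```python
from pathlib import Path


def read_reboot_settings(base="/sys/kernel/reboot"):
    """Return the current kernel reboot settings as a dict.

    Hypothetical helper for illustration. Attributes that are missing
    or unreadable are reported as None instead of raising.
    """
    settings = {}
    for name in ("mode", "type", "cpu", "force"):
        try:
            settings[name] = (Path(base) / name).read_text().strip()
        except OSError:
            settings[name] = None
    return settings
```

Logging this dict before and after applying an override gives a cheap record of which quirks were actually in effect on a given node.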

djgalloway

I can confirm the order of commands causes the trial machines to reboot...

There are a bunch of python-y things I do not understand in here (e.g., `procs[role_remote.name] = role_remote.safe_hard_reboot()` but then `procs[role_remote.name] = role_remote.run(` elsewhere), so I'm not comfy approving.
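For readers with the same question: the `procs[...] = ...` pattern usually means "start something on each node without waiting, keep the handle, and wait for all of them later". Presumably `safe_hard_reboot()` and `run(` both return that kind of handle, though the excerpt does not show it. The sketch below illustrates the general pattern with local subprocesses only; `start_all` and `wait_all` are hypothetical names, not teuthology functions.

```python
import subprocess


def start_all(commands):
    """Start every command without waiting; return {name: process handle}."""
    return {name: subprocess.Popen(argv) for name, argv in commands.items()}


def wait_all(procs):
    """Block until every started process exits; return {name: exit code}."""
    return {name: proc.wait() for name, proc in procs.items()}
```

Because the dict only stores handles, it does not matter which call produced each one, as long as every handle supports being waited on.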

djgalloway · 2026-02-17T00:16:18Z

Zack approved on Slack. I'm still seeing failing jobs because of this. I say we merge it.

batrick requested a review from a team as a code owner February 13, 2026 13:06

batrick requested review from kamoltat and zmc and removed request for a team February 13, 2026 13:06

batrick force-pushed the kernel-fix branch from 57ca474 to 1475e16 Compare February 13, 2026 13:28

batrick force-pushed the kernel-fix branch from 1475e16 to 76f36e6 Compare February 13, 2026 13:30

idryomov reviewed Feb 13, 2026

djgalloway reviewed Feb 13, 2026

djgalloway approved these changes Feb 17, 2026

djgalloway merged commit 7fc6083 into ceph:main Feb 17, 2026
9 checks passed

djgalloway merged 1 commit into ceph:main from batrick:kernel-fix
