Description
On a Raspberry Pi 5, the system can become fully unresponsive under certain workload conditions.
When this happens:
- SSH becomes unreachable
- most userland processes and containers stop responding
- openHAB may still update some internal state
- Homematic device communication ceases entirely
- only a hard power-cycle restores operation
The behavior does not resemble an immediate kernel panic, out-of-memory kill, or userspace crash.
Instead, it is preceded by repeated kernel log messages that appear to originate from a UART/driver path, followed by systemd-journald warnings and an eventual stall.
Observed Behavior
In every recorded instance of this issue, shortly before the system becomes unresponsive, the kernel log shows a repeating pattern similar to:
raw-uart raw-uart1: generic_raw_uart_handle_rx_char(): rx fifo full
eq3loop: eq3loop_write_master() mmd_hmip: not enough space in the buffers
eq3loop: eq3loop_write_master() return error
This is followed by systemd-journald messages indicating buffer overruns and watchdog timeouts:
systemd-journald: /dev/kmsg buffer overrun, some messages lost
systemd-journald.service: Watchdog timeout
systemd-journald.service: Killing process systemd-journal
After these messages, thousands of kernel messages are lost, journald is restarted, and the system eventually becomes largely unresponsive.
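For context on why this pattern can escalate, here is a minimal, simplified sketch (not the actual raw-uart/eq3loop source; the struct, function names, and fifo size are illustrative assumptions) of the kind of rx path that produces such a flood: if every dropped byte emits an unthrottled kernel error line, sustained RF traffic while the downstream consumer (eq3loop/mmd_hmip) is stalled can saturate /dev/kmsg and starve journald.

```c
#include <linux/device.h>
#include <linux/kfifo.h>

/* Illustrative only: not the real driver structures or symbol names. */
struct raw_uart_port {
	struct device *dev;
	DECLARE_KFIFO(rx_fifo, unsigned char, 256);	/* size is an assumption */
};

static void handle_rx_char(struct raw_uart_port *port, unsigned char c)
{
	if (kfifo_is_full(&port->rx_fifo)) {
		/*
		 * One error line per dropped byte: under sustained
		 * back-pressure this floods the kernel log ring buffer.
		 */
		dev_err(port->dev, "handle_rx_char(): rx fifo full\n");
		return;
	}
	kfifo_put(&port->rx_fifo, c);
}
```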
Expected Behavior
Under heavy load, or in the presence of occasional driver buffer exhaustion, the system should remain responsive and recover gracefully. Specifically:
- driver buffer exhaustion should be handled without log flooding
- kernel log streams should be rate limited to avoid overwhelming the logging subsystem (see the sketch after this list)
- unrelated kernel subsystems (network, scheduling, file systems) should not be impacted by issues in a single hardware driver path
- the system should remain responsive even under high I/O pressure
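As a concrete illustration of the rate-limiting expectation, the same condition could be reported through the kernel's built-in ratelimiting helpers, e.g. dev_err_ratelimited() (or printk_ratelimited()), which cap the message rate instead of emitting one line per dropped byte. This is again a hedged sketch reusing the illustrative struct from above, not a patch against the actual driver:

```c
#include <linux/device.h>
#include <linux/kfifo.h>

static void handle_rx_char_ratelimited(struct raw_uart_port *port,
				       unsigned char c)
{
	if (kfifo_is_full(&port->rx_fifo)) {
		/*
		 * Same information, but the kernel's ratelimit machinery
		 * suppresses repeats, so a stalled consumer cannot flood
		 * /dev/kmsg and trip journald's watchdog.
		 */
		dev_err_ratelimited(port->dev,
				    "handle_rx_char(): rx fifo full, dropping\n");
		return;
	}
	kfifo_put(&port->rx_fifo, c);
}
```

Whether ratelimiting alone would be sufficient, or whether the underlying stall in the eq3loop consumer also needs to be addressed, I cannot tell from the logs.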
Workloads Observed
The issue has been observed under the following conditions:
- Raspberry Pi 5 with NVMe root filesystem
- RaspberryMatic running inside Docker
- Other containers including openHAB, InfluxDB, Frigate
- NVMe subjected to high sustained I/O (e.g., video timeline scrub, large historical data queries)
- RF traffic concurrently active via a USB-connected Homematic stick
The problem is rare and not reliably reproducible; it occurs under combinations of high I/O and sustained driver activity.
System information
- Platform: Raspberry Pi 5 (ARM64 / aarch64)
- Kernel: 6.6.x+rpt-rpi
- OS: Raspberry Pi OS 64-bit
- Root filesystem on NVMe (Pineboards AI Bundle M-Key)
- Docker engine (dockerd / containerd)
- Homematic RF stick connected directly by USB
- Other RF devices on a powered USB hub
Full journal logs around the freeze events are attached and show consistent patterns across multiple incidents:
dmesg_after_reboot.log
journal_prevboot.log
kernel.log
lastlog.txt
lsmod.log
wtmp.txt
These logs consistently contain the repeating UART/eq3loop buffer exhaustion messages preceding the journal overruns and subsequent system stall.
Additional context
- USB power supply issues have been ruled out by testing with both passive and active (powered) hubs
- No undervoltage or CPU throttling flags were reported by firmware telemetry
- The system is normally stable for extended periods (weeks–months) between incidents
- The issue does not appear to be tied to any specific userland container or service alone
Additional observations
I have observed two closely related failure modes:
- A complete system hang requiring a hard power cycle
- A partial failure where Homematic commands are delayed by ~15–30 seconds, followed by execution of multiple queued commands almost simultaneously
In the second case, the system remains partially responsive for a short time
(SSH sometimes still works, non-Homematic services continue to operate),
but the Homematic communication path appears stalled.
In at least one incident, this delayed state escalated into an automatic reboot
rather than a permanent freeze. This suggests a kernel-level failure that
progresses over time rather than an immediate hard lockup.
Shortly before both types of failures, kernel logs show sustained flooding of:
raw-uart: generic_raw_uart_handle_rx_char(): rx fifo full
eq3loop: not enough space in the buffers
followed by /dev/kmsg buffer overruns and journald watchdog events.
This indicates that the delayed Homematic behavior may be an early warning sign
of the same kernel-level issue that later results in a full system hang or reboot.