Once or twice a month, my IPFire system locks up with a screen full of watchdog errors. From what I have read, watchdog should be rebooting the system in these cases, which it is not doing. A manual reboot gets the system back up.
Curious why watchdog is even running? When looking on my IPFire admin site, under IPFire → Pakfire addons, it shows watchdog as an available addon, but it is not under Installed addons.
Also, I cannot locate the /etc/watchdog.conf file to see what configuration watchdog is using.
Nov 6 20:21:20 alpha kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 641s! [ksoftirqd/0:13]
Nov 6 20:21:46 alpha kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:1H:477]
Nov 6 20:23:41 alpha kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 48s! [kworker/3:1H:477]
Nov 6 20:27:29 alpha kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 74s! [kworker/3:1H:477]
Oct 7 18:56:42 alpha kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [swapper/0:0]
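For anyone who wants to see which CPUs and tasks show up most often, here is a minimal sketch that summarises log lines in the format above. The function names (`parse_lockups`, `summarise`) are made up for illustration, and the regular expression assumes the syslog layout shown in this post; other log formats would need adjusting.

```python
import re

# Assumed line layout (taken from the log excerpt above):
# Nov 6 20:21:20 alpha kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 641s! [ksoftirqd/0:13]
LOCKUP_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def parse_lockups(lines):
    """Return one dict per soft lockup message found in the given lines."""
    events = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            events.append({
                "stamp": match.group("stamp"),
                "cpu": int(match.group("cpu")),
                "secs": int(match.group("secs")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return events


def summarise(events):
    """Return {cpu: {"count": n, "longest": seconds, "tasks": set_of_names}}."""
    summary = {}
    for event in events:
        entry = summary.setdefault(
            event["cpu"], {"count": 0, "longest": 0, "tasks": set()}
        )
        entry["count"] += 1
        entry["longest"] = max(entry["longest"], event["secs"])
        entry["tasks"].add(event["task"])
    return summary
```

Feeding it the lines of the system log (for example `summarise(parse_lockups(open(path)))`, with `path` pointing at wherever the kernel messages are stored) gives a quick per-CPU overview.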
This is not the addon watchdog. It is a watchdog that operates inside the kernel. Searching for this error message turned up the following information.
A ‘soft lockup’ is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds without giving other tasks a chance to run.
The watchdog daemon will send a non-maskable interrupt (NMI) to all CPUs in the system, which in turn print the stack traces of their currently running tasks.
Under normal circumstances, these messages may go away if the load decreases.
This ‘soft lockup’ can happen if the kernel is busy, working on a huge amount of objects which need to be scanned, freed, or allocated, respectively.
The stack traces of those tasks can give a first idea about what the tasks were doing. However, to be able to examine the cause behind the messages, a kernel dump would be needed.
When you start getting these messages, and before the system freezes, are you able to see what the CPU load is, both overall and per core?
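If it is hard to catch this live, one option is to record the load periodically. The following is only a rough sketch that works out per-core load from two readings of `/proc/stat`; the function names are hypothetical, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for every cpu line."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue
        # Fields user..steal; guest time is already included in user.
        values = [int(v) for v in parts[1:9]]
        if len(values) < 4:
            continue
        total = sum(values)
        iowait = values[4] if len(values) > 4 else 0
        idle = values[3] + iowait
        stats[parts[0]] = (total - idle, total)
    return stats


def cpu_load_percent(before, after):
    """Return {cpu_name: busy percentage} between two parsed snapshots."""
    loads = {}
    for name, (busy1, total1) in after.items():
        if name not in before:
            continue
        busy0, total0 = before[name]
        delta = total1 - total0
        loads[name] = 100.0 * (busy1 - busy0) / delta if delta > 0 else 0.0
    return loads


def sample(interval=1.0, path="/proc/stat"):
    """Read the stat file twice, `interval` seconds apart, and return the load."""
    with open(path) as handle:
        before = parse_proc_stat(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_proc_stat(handle.read())
    return cpu_load_percent(before, after)
```

The key `cpu` in the result is the overall figure and `cpu0`, `cpu1`, and so on are the individual cores. Writing the output of `sample()` to a file at regular intervals would leave a record of the load leading up to a freeze.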
What system are you running IPFire on? Can you provide the fireinfo link?
Yeah, from what I have read, the kernel watchdog’s only real job is to reboot the system when it detects this kind of issue (but for some reason it is not rebooting in my case).
I usually don’t see it happen live, so haven’t checked the system load at the time.
Here is my fireinfo link. It was not working when I tried it, but maybe that is because I only just submitted it.
This might be a kernel bug. What version of IPFire are you running?
These messages seem to be related, as the AMD-Vi one I have found is linked to iommu. In CU169 or CU170 there were new kernels which did appear to have iommu bugs with some hardware, and that was raised in the community. However, since the updated kernel version in CU171 I haven’t seen any more reports about kernel issues except for yours and one other. That other one did not have any info on what the kernel message was, and his symptoms are quite different from yours. https://community.ipfire.org/t/green-no-internet-on-171/8976
When I looked up the above messages, I found that some people had added iommu=off or iommu=soft to the GRUB kernel command line, but that was on desktop computers, not IPFire.
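To see whether any such option is already active, the running kernel's command line can be read from `/proc/cmdline`. This is just a small sketch with a made-up helper name; besides `iommu` it also looks for `amd_iommu` and `intel_iommu`, which are other kernel parameters in the same area and were not mentioned in the reports I found.

```python
IOMMU_KEYS = ("iommu", "amd_iommu", "intel_iommu")


def iommu_options(cmdline):
    """Return the IOMMU-related options present on a kernel command line."""
    found = {}
    for token in cmdline.split():
        # Options look like key=value; a bare flag yields an empty value.
        key, _, value = token.partition("=")
        if key in IOMMU_KEYS:
            found[key] = value
    return found


def current_iommu_options(path="/proc/cmdline"):
    """Read the command line of the running kernel and filter it."""
    with open(path) as handle:
        return iommu_options(handle.read())
```

An empty result from `current_iommu_options()` means none of these options were passed at boot, so the kernel is using its default IOMMU behaviour.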
I will upgrade to Core Update 171 and see how it goes. It looks like the kernel was updated to 5.15.71 in that core update.
I would still like to know why the kernel watchdog is not rebooting the system, or where the kernel watchdog config file is located in the IPFire distro, though.
IPFire does not use the hardware watchdogs. These require the userspace process being monitored to reset the watchdog regularly, so they are only useful for special purposes.
The kernel-internal watchdog that detects soft lockups has nothing to do with them.
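As far as I know, the kernel's lockup detector has no config file of its own. It is tuned through sysctl entries under `/proc/sys/kernel`, and by default it only logs the message. A reboot would only follow if `softlockup_panic` is enabled and `panic` is set to a non-zero timeout. Below is a minimal sketch for checking this; the helper names are hypothetical, and I have not checked what values IPFire ships with.

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/kernel")
NAMES = ("watchdog", "watchdog_thresh", "softlockup_panic", "panic")


def read_settings(base=SYSCTL_DIR):
    """Return {name: int or None} for the lockup-related sysctl entries."""
    settings = {}
    for name in NAMES:
        try:
            settings[name] = int((Path(base) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            settings[name] = None  # entry missing or unreadable
    return settings


def soft_lockup_threshold(settings):
    """Soft lockups are reported after 2 * watchdog_thresh seconds."""
    thresh = settings.get("watchdog_thresh")
    return None if thresh is None else 2 * thresh


def will_reboot(settings):
    """True only if a soft lockup panics AND a panic leads to a reboot."""
    return bool(settings.get("softlockup_panic")) and bool(settings.get("panic"))
```

With the usual defaults (`watchdog_thresh` of 10, `softlockup_panic` of 0, `panic` of 0), `soft_lockup_threshold` gives 20 seconds and `will_reboot` is false, which would match a system that prints the messages but never reboots.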
Corrupt swap space or bad memory is often the reason for such lockups, so check your memory and disks.