Watchdog errors, system locks up

Once or twice a month, my IPFire system locks up with a screen full of watchdog errors. From what I have read, watchdog should be rebooting the system in these cases, which it is not doing. A manual reboot gets the system back up.

Curious why watchdog is even running? When looking on my IPFire admin site, under IPFire → Pakfire addons, it shows watchdog as an available addon, but it is not under Installed addons.

Also, I cannot locate the /etc/watchdog.conf file to see what configuration watchdog is using.

Nov 6 20:21:20 alpha kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 641s! [ksoftirqd/0:13]
Nov 6 20:21:46 alpha kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:1H:477]
Nov 6 20:23:41 alpha kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 48s! [kworker/3:1H:477]
Nov 6 20:27:29 alpha kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 74s! [kworker/3:1H:477]
Oct 7 18:56:42 alpha kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [swapper/0:0]
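
For anyone wanting to see which CPUs and tasks are affected over time, log lines like the ones above can be tallied with a short script. This is only a minimal sketch: the regular expression assumes the exact message format shown in this post, and `summarize_lockups` is a hypothetical helper name, not part of IPFire.

```python
import re
from collections import defaultdict

# Matches the kernel message format seen in the log lines above, e.g.
# "watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:1H:477]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Group soft-lockup reports by CPU.

    Returns {cpu: {"count": int, "longest": int, "tasks": set of str}}.
    """
    summary = defaultdict(lambda: {"count": 0, "longest": 0, "tasks": set()})
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match is None:
            continue  # not a soft-lockup report
        entry = summary[int(match["cpu"])]
        entry["count"] += 1
        entry["longest"] = max(entry["longest"], int(match["secs"]))
        entry["tasks"].add(match["task"])
    return dict(summary)


if __name__ == "__main__":
    # Two of the lines quoted above; in practice, read them from the system log.
    sample = [
        "Nov 6 20:21:20 alpha kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 641s! [ksoftirqd/0:13]",
        "Nov 6 20:21:46 alpha kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:1H:477]",
    ]
    for cpu, info in sorted(summarize_lockups(sample).items()):
        tasks = ", ".join(sorted(info["tasks"]))
        print(f"CPU#{cpu}: {info['count']} report(s), longest {info['longest']}s, tasks: {tasks}")
```

Lines that are not soft-lockup reports are skipped, so the whole log can be passed in unfiltered.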

Hi @wh0dat

Welcome to the IPFire community.

This is not the addon watchdog. This is a watchdog that operates in the kernel. Searching for this error message turned up the following information:

A ‘soft lockup’ is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds without giving other tasks a chance to run.
The watchdog daemon will send a non-maskable interrupt (NMI) to all CPUs in the system, which in turn print the stack traces of their currently running tasks.
Under normal circumstances, these messages may go away if the load decreases.
This ‘soft lockup’ can happen if the kernel is busy, working on a huge amount of objects which need to be scanned, freed, or allocated, respectively.
The stack traces of those tasks can give a first idea about what the tasks were doing. However, to be able to examine the cause behind the messages, a kernel dump would be needed.

When you start getting these messages, and before the system freezes, are you able to see what the CPU load is, both overall and per core?
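
One way to capture the overall and per-core load is to sample `/proc/stat` twice and compare the counters. The sketch below assumes the standard Linux `/proc/stat` layout (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for illustration.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (idle_ticks, total_ticks)} from /proc/stat content."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue
        # user nice system idle iowait irq softirq steal; the guest fields
        # are already counted in user/nice, so they are left out of the total.
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        counters[fields[0]] = (idle, sum(ticks))
    return counters


def busy_percent(before, after):
    """Percentage of non-idle time per CPU between two parse_proc_stat() results."""
    usage = {}
    for name, (idle_after, total_after) in after.items():
        idle_before, total_before = before.get(name, (0, 0))
        total_delta = total_after - total_before
        if total_delta <= 0:
            continue  # no time elapsed for this CPU, or the counters were reset
        idle_delta = idle_after - idle_before
        usage[name] = 100.0 * (total_delta - idle_delta) / total_delta
    return usage


if __name__ == "__main__":
    with open("/proc/stat") as stat_file:
        first = parse_proc_stat(stat_file.read())
    time.sleep(1.0)
    with open("/proc/stat") as stat_file:
        second = parse_proc_stat(stat_file.read())
    for name, percent in sorted(busy_percent(first, second).items()):
        print(f"{name}: {percent:5.1f}% busy")
```

The `cpu` entry is the total across all cores and `cpu0`, `cpu1`, and so on are the individual cores. Running this in a loop and logging the output to disk may preserve the last readings before a freeze.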

What system are you running IPFire on? Can you provide the fireinfo link?

Yeah, from what I have read about the kernel watchdog, its only real job is to reboot the system when it detects this kind of issue (but for some reason it is not rebooting in my case):

https://linuxhint.com/linux-kernel-watchdog-explained/

I usually don’t see it happen live, so haven’t checked the system load at the time.
Here is my fireinfo link. It did not work when I tried it, but maybe that is because I just submitted it.

https://fireinfo.ipfire.org/profile/806c035220a76b5fea4b6c74dbe07733ac141dd0

Here is a full dump for one of the watchdog events:

Nov 22 14:52:23 alpha kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 93s! [kworker/0:1:16]
Nov 22 14:52:23 alpha kernel: Modules linked in: it87 hwmon_vid nfsd auth_rpcgss nfs_acl lockd grace sunrpc tun nfnetlink_queue xt_NFQUEUE xt_MASQUERADE cfg80211 rfkill 8021q garp ip_set xt_hashlimit xt_mac xt_policy xt_TCPMSS xt_conntrack xt_comment ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_mark xt_connmark nf_log_syslog iptable_raw iptable_mangle iptable_filter vfat fat sch_cake snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio edac_mce_amd snd_hda_codec_hdmi kvm_amd snd_hda_intel ccp snd_intel_dspcfg snd_hda_codec kvm snd_hda_core ax88179_178a snd_hwdep irqbypass usbnet snd_pcm mii snd_timer ppdev r8169 snd pcspkr k10temp parport_pc soundcore i2c_piix4 realtek i2c_core parport acpi_cpufreq efivarfs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sp5100_tco ohci_pci video
Nov 22 14:52:23 alpha kernel: CPU: 0 PID: 16 Comm: kworker/0:1 Tainted: G        W    L    5.15.49-ipfire #1
Nov 22 14:52:23 alpha kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F4 11/27/2013
Nov 22 14:52:23 alpha kernel: Workqueue: events usbnet_deferred_kevent [usbnet]
Nov 22 14:52:23 alpha kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x1f/0x30
Nov 22 14:52:23 alpha kernel: Code: 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 c6 07 00 f7 c6 00 02 00 00 75 08 31 c0 89 c6 89 c7 c3 cc fb 66 0f 1f 44 00 00 <31> c0 89 c6 89 c7 c3 cc 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
Nov 22 14:52:23 alpha kernel: RSP: 0018:ffffaabec0003e38 EFLAGS: 00000206
Nov 22 14:52:23 alpha kernel: RAX: 0000000000000001 RBX: 0000000000007000 RCX: 0000000000000000
Nov 22 14:52:23 alpha kernel: RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff8c5ec1b11b58
Nov 22 14:52:23 alpha kernel: RBP: 000000007ef48000 R08: 0000000000000000 R09: 0000000000000000
Nov 22 14:52:23 alpha kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c5ec3cea800
Nov 22 14:52:23 alpha kernel: R13: ffff8c5ec1b11a10 R14: ffff8c5ec3cea808 R15: 000000007ef48040
Nov 22 14:52:23 alpha kernel: FS:  0000000000000000(0000) GS:ffff8c5f1b400000(0000) knlGS:0000000000000000
Nov 22 14:52:23 alpha kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 22 14:52:23 alpha kernel: CR2: 00007ec0e2741090 CR3: 0000000004ecc000 CR4: 00000000000406f0
Nov 22 14:52:23 alpha kernel: Call Trace:
Nov 22 14:52:23 alpha kernel:  <IRQ>
Nov 22 14:52:23 alpha kernel:  __iommu_dma_unmap+0x124/0x190
Nov 22 14:52:23 alpha kernel:  iommu_dma_unmap_page+0x44/0xa0
Nov 22 14:52:23 alpha kernel:  usb_hcd_unmap_urb_for_dma+0x77/0x120
Nov 22 14:52:23 alpha kernel:  __usb_hcd_giveback_urb+0x4b/0x100
Nov 22 14:52:23 alpha kernel:  usb_giveback_urb_bh+0xaa/0x110
Nov 22 14:52:23 alpha kernel:  tasklet_action_common.constprop.0+0xbf/0x130
Nov 22 14:52:23 alpha kernel:  __do_softirq+0xc6/0x27f
Nov 22 14:52:23 alpha kernel:  irq_exit_rcu+0x8a/0xb0
Nov 22 14:52:23 alpha kernel:  common_interrupt+0x80/0xa0
Nov 22 14:52:23 alpha kernel:  </IRQ>
Nov 22 14:52:23 alpha kernel:  <TASK>
Nov 22 14:52:23 alpha kernel:  asm_common_interrupt+0x1e/0x40
Nov 22 14:52:23 alpha kernel: RIP: 0010:clear_page_rep+0x7/0x10
Nov 22 14:52:23 alpha kernel: Code: 48 89 d1 48 c1 f9 02 85 c9 48 0f 45 c2 eb ce 31 c0 eb ca e8 db da 59 00 cc cc cc cc cc cc cc cc cc cc cc b9 00 02 00 00 31 c0 <f3> 48 ab c3 cc 0f 1f 40 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00
Nov 22 14:52:23 alpha kernel: RSP: 0018:ffffaabec0127b50 EFLAGS: 00000246
Nov 22 14:52:23 alpha kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 00000000000001c6
Nov 22 14:52:23 alpha kernel: RDX: ffffe6488027d180 RSI: ffff8c5ec1915d00 RDI: ffff8c5ec9f461d0
Nov 22 14:52:23 alpha kernel: RBP: ffffffffb640c080 R08: ffffe6488027d200 R09: 0000000000000000
Nov 22 14:52:23 alpha kernel: R10: 0000000000000246 R11: ffff8c5f1b42fb90 R12: ffff8c5f1b42faf0
Nov 22 14:52:23 alpha kernel: R13: 0000000000000003 R14: ffff8c5f1b4282e8 R15: ffffffffb640c080
Nov 22 14:52:23 alpha kernel:  kernel_init_free_pages.part.0+0x46/0x70
Nov 22 14:52:23 alpha kernel:  get_page_from_freelist+0x898/0xc40
Nov 22 14:52:23 alpha kernel:  ? sched_clock_local+0xe/0x90
Nov 22 14:52:23 alpha kernel:  ? psi_task_change+0x57/0x130
Nov 22 14:52:23 alpha kernel:  __alloc_pages_slowpath.constprop.0+0x3d9/0xc90
Nov 22 14:52:23 alpha kernel:  __alloc_pages+0x2d7/0x2f0
Nov 22 14:52:23 alpha kernel:  kmalloc_order+0x2d/0xc0
Nov 22 14:52:23 alpha kernel:  kmalloc_order_trace+0x19/0xa0
Nov 22 14:52:23 alpha kernel:  __alloc_skb+0x84/0x1d0
Nov 22 14:52:23 alpha kernel:  __netdev_alloc_skb+0x3e/0x170
Nov 22 14:52:23 alpha kernel:  rx_submit+0x41/0x300 [usbnet]
Nov 22 14:52:23 alpha kernel:  usbnet_deferred_kevent+0x87/0x34d [usbnet]
Nov 22 14:52:23 alpha kernel:  process_one_work+0x232/0x3d0
Nov 22 14:52:23 alpha kernel:  worker_thread+0x4d/0x3e0
Nov 22 14:52:23 alpha kernel:  ? process_one_work+0x3d0/0x3d0
Nov 22 14:52:23 alpha kernel:  kthread+0x127/0x150
Nov 22 14:52:23 alpha kernel:  ? set_kthread_struct+0x50/0x50
Nov 22 14:52:23 alpha kernel:  ret_from_fork+0x22/0x30
Nov 22 14:52:23 alpha kernel:  </TASK>
Nov 22 14:52:23 alpha kernel: AMD-Vi: Completion-Wait loop timed out

This might be a kernel bug. What version of IPFire are you running?

These messages seem to be related, as I found the AMD-Vi one to be linked to the IOMMU. CU169 or CU170 brought new kernels that appeared to have IOMMU bugs with some hardware, which were raised in the community. However, since the updated kernel version in CU171, I haven’t seen any more reports about kernel issues except for yours and one other. That other report had no information on what the kernel message was, and its symptoms are quite different from yours.
https://community.ipfire.org/t/green-no-internet-on-171/8976

Looking up the above messages, some people added iommu=off or iommu=soft to the GRUB kernel command line, but that was on desktop computers, not IPFire.
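
Before changing anything, it may help to check whether the running kernel was already booted with an IOMMU option. This is a small sketch that reads `/proc/cmdline`; the helper name is hypothetical, and the list of keys (`iommu`, `amd_iommu`, `intel_iommu`) is an assumption about which parameters are of interest here.

```python
# Kernel parameters treated as IOMMU-related in this sketch (an assumption).
IOMMU_KEYS = ("iommu", "amd_iommu", "intel_iommu")


def iommu_options(cmdline):
    """Return the IOMMU-related parameters found on a kernel command line."""
    found = []
    for param in cmdline.split():
        key = param.split("=", 1)[0]
        if key in IOMMU_KEYS:
            found.append(param)
    return found


if __name__ == "__main__":
    with open("/proc/cmdline") as cmdline_file:
        options = iommu_options(cmdline_file.read())
    print("IOMMU options:", ", ".join(options) if options else "none set")
```

An empty result means the kernel is running with its default IOMMU behaviour.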

Looks like I am a couple versions behind :

IPFire 2.27 (x86_64) - Core Update 169

I will upgrade to Core Update 171 and see how it goes, looks like the kernel was updated to 5.15.71 in that core update.

I would still like to know why the kernel watchdog is not rebooting the system, or where the kernel watchdog config file is located in the IPFire distro, though.

IPFire does not use the hardware watchdogs. Those need the userspace process being monitored to reset the watchdog regularly, so they are only useful for special purposes.

The kernel-internal watchdog that detects soft lockups has nothing to do with this.
Corrupt swap space or bad memory are often the reason for such lockups, so check the memory and disks.
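
On the question of where the kernel watchdog is configured: the lockup detector has no config file of its own and is tuned through sysctl entries under `/proc/sys/kernel`. By default a soft lockup only logs a warning. The kernel reboots only if `softlockup_panic` is 1 and `panic` is non-zero. The sketch below reads those values. The helper names are made up, and what IPFire actually sets is not confirmed in this thread, so treat the output as something to check rather than an expected result.

```python
from pathlib import Path

# Standard Linux sysctl entries for the lockup detector and panic behaviour.
TUNABLES = ("watchdog", "watchdog_thresh", "softlockup_panic", "panic")


def read_lockup_settings(base="/proc/sys/kernel"):
    """Read lockup-detector tunables; missing or unreadable entries map to None."""
    settings = {}
    for name in TUNABLES:
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


def reboots_on_soft_lockup(settings):
    """True only if a soft lockup causes a panic AND a panic causes a reboot."""
    panics = settings.get("softlockup_panic") == 1
    # kernel.panic: 0 means loop forever after a panic, non-zero means reboot.
    reboots = settings.get("panic") not in (None, 0)
    return panics and reboots


if __name__ == "__main__":
    current = read_lockup_settings()
    for name, value in current.items():
        print(f"kernel.{name} = {value}")
    print("Reboots on soft lockup:", reboots_on_soft_lockup(current))
```

The soft lockup threshold is twice `watchdog_thresh`, which matches the roughly 20 seconds mentioned earlier when that value is at its default of 10.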
