Greetings,
I have a LightningWireLabs eco from 2015 that I adore and that has served me well for 6 years. After the upgrade to 161 on Saturday, it started hard locking up on me. After an update to 162, I can’t get it to boot at all.
I think that the hardware just gave up on me, but I want to report what went down and see if anyone has any thoughts/suggestions.
With a monitor plugged in, all I see is:
NMI: IOCK error (debug interrupt?) for reason 70 on CPU0
NMI: IOCK error (debug interrupt?) for reason 60 on CPU0
And the keyboard wouldn’t respond. A hard reset “works”, sometimes for a few hours, sometimes for only a few minutes. The IOCK errors do not appear in any log files from before the upgrade. They start appearing very quickly after IPFire has booted. In dmesg I see things like this:
Dec 6 16:31:22 ipfire kernel: NMI: IOCK error (debug interrupt?) for reason 60 on CPU 0.
Dec 6 16:31:22 ipfire kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 5.10.76-ipfire #1
Dec 6 16:31:22 ipfire kernel: Hardware name: ASUS P9A-I/2550/4L/P9A-I/2550/4L, BIOS 0206 05/14/2014
Dec 6 16:31:22 ipfire kernel: RIP: 0010:mwait_idle_with_hints.constprop.0+0x52/0xa0
Dec 6 16:31:22 ipfire kernel: Code: 65 48 8b 04 25 c0 7b 01 00 0f 01 c8 48 8b 00 a8 08 75 17 e9 07 00 00 00 0f 00 2d 95 84 9b 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 7b 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
Dec 6 16:31:22 ipfire kernel: RSP: 0018:ffffffffb0c03e38 EFLAGS: 00000046
Dec 6 16:31:22 ipfire kernel: RAX: 0000000000000051 RBX: ffff8cc337c33f20 RCX: 0000000000000001
Dec 6 16:31:22 ipfire kernel: RDX: 0000000000000000 RSI: ffffffffb0daa3a0 RDI: 0000000000000051
Dec 6 16:31:22 ipfire kernel: RBP: 0000000000000002 R08: 000000290b3459e6 R09: 00000029218d2bef
Dec 6 16:31:22 ipfire kernel: R10: 0000000000007938 R11: 000000000000f5d1 R12: 0000000000000002
Dec 6 16:31:22 ipfire kernel: R13: ffffffffb0daa488 R14: 0000000000000002 R15: 0000000000000000
Dec 6 16:31:22 ipfire kernel: FS: 0000000000000000(0000) GS:ffff8cc337c00000(0000) knlGS:0000000000000000
Dec 6 16:31:22 ipfire kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 6 16:31:22 ipfire kernel: CR2: 0000000001d445f8 CR3: 0000000188c0c000 CR4: 00000000001006f0
Dec 6 16:31:22 ipfire kernel: Call Trace:
Dec 6 16:31:22 ipfire kernel: intel_idle+0x1f/0x30
Dec 6 16:31:22 ipfire kernel: cpuidle_enter_state+0x89/0x370
Dec 6 16:31:22 ipfire kernel: cpuidle_enter+0x29/0x40
Dec 6 16:31:22 ipfire kernel: do_idle+0x1cc/0x230
Dec 6 16:31:22 ipfire kernel: cpu_startup_entry+0x19/0x20
Dec 6 16:31:22 ipfire kernel: start_kernel+0x83e/0x879
Dec 6 16:31:22 ipfire kernel: secondary_startup_64_no_verify+0xb0/0xbb
Dec 6 16:31:24 ipfire kernel: DROP_INPUT IN=red0 OUT= MAC=78:24:af:82:9e:f2:00:00:5e:00:01:8c:08:00 SRC=185.195.232.162 DST=136.35.226.111 LEN=132 TOS=0x00 PREC=0x00 TTL=53 ID=51558 PROTO=UDP SPT=51820 DPT=48925 LEN=112 MARK=0x80000000
Dec 6 16:31:27 ipfire kernel: NMI: IOCK error (debug interrupt?) for reason 70 on CPU 0.
Dec 6 16:31:27 ipfire kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 5.10.76-ipfire #1
Dec 6 16:31:27 ipfire kernel: Hardware name: ASUS P9A-I/2550/4L/P9A-I/2550/4L, BIOS 0206 05/14/2014
Dec 6 16:31:27 ipfire kernel: RIP: 0010:mwait_idle_with_hints.constprop.0+0x52/0xa0
Dec 6 16:31:27 ipfire kernel: Code: 65 48 8b 04 25 c0 7b 01 00 0f 01 c8 48 8b 00 a8 08 75 17 e9 07 00 00 00 0f 00 2d 95 84 9b 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 7b 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
Dec 6 16:31:27 ipfire kernel: RSP: 0018:ffffffffb0c03e38 EFLAGS: 00000046
Dec 6 16:31:27 ipfire kernel: RAX: 0000000000000051 RBX: ffff8cc337c33f20 RCX: 0000000000000001
:
Message from syslogd@ipfire at Mon Dec 6 16:32:56 2021 ...
ipfire kernel: NMI: IOCK error (debug interrupt?) for reason 70 on CPU 0.
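In case anyone wants to check their own logs for the same messages, here is a minimal sketch in Python that finds the first IOCK line and counts how often each reason code shows up. The function name `summarize_iock` and the shortened sample lines are made up for illustration. The regex is based only on the two message spellings shown above ("CPU0" on the console, "CPU 0." in the log).

```python
import re
from collections import Counter

# Matches both spellings seen above: "on CPU0" and "on CPU 0."
IOCK_RE = re.compile(
    r"NMI: IOCK error \(debug interrupt\?\) for reason ([0-9a-fA-F]+) on CPU ?(\d+)"
)

def summarize_iock(lines):
    """Return (first matching line or None, Counter of reason codes)."""
    first = None
    reasons = Counter()
    for line in lines:
        match = IOCK_RE.search(line)
        if match is None:
            continue  # not an IOCK message, skip it
        if first is None:
            first = line.rstrip("\n")
        reasons[match.group(1)] += 1
    return first, reasons

# Shortened sample lines modeled on the log excerpt above.
sample = [
    "Dec 6 16:31:22 ipfire kernel: NMI: IOCK error (debug interrupt?) for reason 60 on CPU 0.",
    "Dec 6 16:31:24 ipfire kernel: DROP_INPUT IN=red0 OUT= PROTO=UDP",
    "Dec 6 16:31:27 ipfire kernel: NMI: IOCK error (debug interrupt?) for reason 70 on CPU 0.",
    "NMI: IOCK error (debug interrupt?) for reason 70 on CPU0",
]

first, reasons = summarize_iock(sample)
print("first occurrence:", first)
for reason, count in sorted(reasons.items()):
    print(f"reason {reason}: {count}")
```

To run it against a real log, pass an open file object instead of `sample`. Which file holds the kernel messages depends on the system, so the path is left out here.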
Hoping it was just a bad kernel update, I updated to 162 to see if that would fix it. I rebooted and… nothing… Once IPFire starts to boot I get a blinking cursor. Well… dang… Hard power off and try again… and… blinking cursor.
Fine. Yank power, let it sit 10 seconds, plug in, and hit the power button… And it won’t even POST anymore…
Damn.
Grabbed an old desktop sitting under my desk, grabbed a bunch of USB network dongles from an old RPi project, yanked the SSD out of the IPFire system, crammed it into the desktop, and… it works!! Re-did setup. Fought hard with my ISP to give a proper DHCP lease to a new device it wasn’t expecting… and my firewall lives!!!
But my eco does not…
I got 6 hard years of always-on from that thing. Maybe it was time and this was all a horrible coincidence. In fact, I’m 99% confident this was just bad timing on the hardware failing. However, the sequence looks suspicious to me: I did an update Saturday night, these errors appeared Sunday, and Sunday night and all day Monday I was fighting hard lockups. Not to mention that the device stopped booting immediately after the 162 update. So for that <1% chance that something went horribly wrong in the 161/162 update process, I’m asking for other perspectives.
Does anyone else have reason to suspect that something in 161/162 might not have gone well for an old Eco?
Thanks!