Hardware issue? Kernel errors on protectli box…

hi all, running ipfire for 3 days now on my protectli box…had stability issues with my last os but now i suspect a hardware issue due to this error in my daily log summary appearing every day:

Daily log summary:

WARNING:  Kernel Errors Present
   pcieport 0000:00:1d.4:   device [8086:02b4] error status/mask=0000 ...:  24 Time(s)
   pcieport 0000:00:1d.4: AER: Correctable error message received ...:  24 Time(s)
   pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correc ...:  24 Time(s)
3 Time(s): Kernel log daemon terminating.
3 Time(s): Kernel logging (proc) stopped.
4 Time(s): igc 0000:01:00.0 green0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
7 Time(s): igc 0000:02:00.0 red0: NIC Link is Down
11 Time(s): igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
4 Time(s): pci 0000:00:1f.3: deferred probe pending: snd_hda_intel: couldn't bind with audio component
24 Time(s): pcieport 0000:00:1d.4:    [ 0] RxErr                  (First)

Full kernel log from yesterday, for example (same thing the day prior and today):

23:07:27 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
23:07:27 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
23:07:27 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
23:07:27 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
23:07:21 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
23:07:21 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
23:07:21 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
23:07:21 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
18:25:07 	kernel: 	pci 0000:00:1f.3: deferred probe pending: snd_hda_intel: couldn't bind with audio component
18:25:03 	kernel: 	igc 0000:01:00.0 green0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
18:25:03 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
18:24:13 	kernel: 	Kernel log daemon terminating.
18:24:13 	kernel: 	Kernel logging (proc) stopped.
14:52:53 	kernel: 	pci 0000:00:1f.3: deferred probe pending: snd_hda_intel: couldn't bind with audio component
14:52:48 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
14:52:48 	kernel: 	igc 0000:01:00.0 green0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
14:51:58 	kernel: 	Kernel log daemon terminating.
14:51:58 	kernel: 	Kernel logging (proc) stopped.
10:46:31 	kernel: 	pci 0000:00:1f.3: deferred probe pending: snd_hda_intel: couldn't bind with audio component
10:46:26 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
10:46:26 	kernel: 	igc 0000:01:00.0 green0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
10:35:28 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
10:35:20 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Down
10:35:19 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
10:35:16 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Down
10:34:59 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
10:34:55 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Down
09:24:06 	kernel: 	pci 0000:00:1f.3: deferred probe pending: snd_hda_intel: couldn't bind with audio component
09:24:01 	kernel: 	igc 0000:01:00.0 green0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
09:24:01 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
09:23:11 	kernel: 	Kernel log daemon terminating.
09:23:11 	kernel: 	Kernel logging (proc) stopped.
09:02:09 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
09:02:09 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
09:02:09 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
09:02:09 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
09:00:31 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
09:00:31 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
09:00:31 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
09:00:31 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
07:34:38 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
07:34:38 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
07:34:38 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
07:34:38 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
07:07:06 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
07:07:06 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
07:07:06 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
07:07:06 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
06:59:09 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
06:59:09 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
06:59:09 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
06:59:09 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
06:28:36 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
06:28:36 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
06:28:36 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
06:28:36 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
06:19:36 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
06:19:36 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
06:19:36 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
06:19:36 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
05:11:07 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
05:11:07 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
05:11:07 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
05:11:07 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
04:21:19 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
04:21:19 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
04:21:19 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
04:21:19 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
04:21:11 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
04:21:11 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
04:21:11 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
04:21:11 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
04:19:03 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
04:19:03 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
04:19:03 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
04:19:03 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
03:28:01 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
03:28:01 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
03:28:01 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
03:28:01 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
03:13:01 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
03:13:01 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
03:13:01 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
03:13:01 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
03:06:36 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
03:06:36 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
03:06:36 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
03:06:36 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4
03:01:48 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
03:01:40 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Down
03:01:39 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
03:01:35 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Down
03:01:18 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
03:01:15 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Down
03:01:13 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
03:01:10 	kernel: 	igc 0000:02:00.0 red0: NIC Link is Down
01:05:07 	kernel: 	pcieport 0000:00:1d.4: [ 0] RxErr (First)
01:05:07 	kernel: 	pcieport 0000:00:1d.4: device [8086:02b4] error status/mask=00000001/00002000
01:05:07 	kernel: 	pcieport 0000:00:1d.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
01:05:07 	kernel: 	pcieport 0000:00:1d.4: AER: Correctable error message received from 0000:00:1d.4 

i previously tested opnsense for 10 days and fought with crash/freezing every 1-6 hours…tried everything to get it stable thinking it must be some setting or combo of settings so i gave up on it and moved over here to ipfire. is there something there in my summary that definitively points to hardware or are these pcie bus errors a nothing burger? thanks

appreciate any help

I have never seen such messages from the PCIe root port. But it sounds like bit errors on the PCIe link. So maybe this is a hardware problem, or the firmware has not set the speed of this PCIe port correctly. Are there settings in the BIOS to slow down the PCIe ports?
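
For anyone reading along, the status/mask=00000001/00002000 pair in the log lines above can be decoded bit by bit. The sketch below is illustrative only: it assumes the standard bit layout of the PCIe AER Correctable Error Status register and the short names the Linux AER driver prints (such as RxErr). The names AER_CORRECTABLE_BITS, decode_bits and parse_status_mask are made up for this example.

```python
import re

# Bit positions in the PCIe AER Correctable Error Status/Mask registers,
# labelled with the short names the Linux AER driver uses in its log output.
AER_CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error (physical layer)
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

_STATUS_MASK = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all known bits set in a 32-bit register value."""
    return [name for bit, name in sorted(AER_CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def parse_status_mask(line):
    """Extract (status, mask) as integers from a kernel log line, or None."""
    match = _STATUS_MASK.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


if __name__ == "__main__":
    line = ("pcieport 0000:00:1d.4: device [8086:02b4] "
            "error status/mask=00000001/00002000")
    status, mask = parse_status_mask(line)
    print("status bits:", decode_bits(status))
    print("masked bits:", decode_bits(mask))
```

With the values from the log, the status word has only bit 0 (RxErr, a receiver error) set, and the mask word has only bit 13 (advisory non-fatal errors) masked. That agrees with the "severity=Correctable, type=Physical Layer" text the kernel printed: the link saw a receive error and recovered from it.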

thanks so much arne! i’ll check on any applicable bios settings…

ok just checked my coreboot bios, no setting like that exposed in its gui. i’d better really run this down hard since you haven’t ever seen this before, given your breadth of expertise being on the dev team. so greatly appreciate your time to drop in. should i ask protectli support at this point, given what i suppose is a high likelihood of a hardware based issue?

No hardware issue, just configuration.

If you have a way to disable PCIe ASPM in the BIOS, do so, along with all other power management, as this is an always-on system.

If you don’t have that control, this must be passed in grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

Ideally PCIe ASPM would be disabled in the BIOS. When the BIOS does not expose that control and the OS installer has not turned ASPM off, adding the kernel parameter above is the standard way to disable it manually.


oh wow! thank you dave…for sure no setting about power management exposed in coreboot bios either…settings are incredibly sparse pretty much limited to enabling a password, changing boot options, and setting secure boot.

so would adding that to grub only be suppressing errors vs actually fixing something? could that really have been what is causing my sporadic crashes, often to the extent that a power button press doesn’t respond, thus requiring a press and hold?

lastly, sorry for not knowing for sure how to do this. i know i should ssh into the box to edit grub, but do you know the exact commands to edit ipfire’s grub? i think every distro out there has its own way of passing parameters to grub or something…sorry again for not knowing more…

That is the kernel-level switch that was added to the Linux kernel a few years ago, according to my Linux kernel dev notes. Linus was not happy that people were writing sloppy BIOSes for hardware when this was added.

so add this line below the GRUB_CMDLINE_LINUX statement in /etc/default/grub file and then run update-grub


thanks again. “update-grub” didn’t work for one reason or another so i did “sudo grub-mkconfig -o /boot/grub/grub.cfg” instead. i’ll monitor kernel logs throughout tomorrow and check the log summary to see if that did the trick. so so hope my crashing/freezing is behind me…2 weeks is a long time to deal with this instability
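
After regenerating grub.cfg and rebooting, one quick sanity check is whether the running kernel was actually booted with the parameter. The sketch below is a minimal, hedged example: it only inspects /proc/cmdline (the command line the running kernel was started with) and does not prove that ASPM is off in hardware. The helper name kernel_param is made up for illustration.

```python
from pathlib import Path


def kernel_param(cmdline, name):
    """Look up a parameter on a kernel command line.

    Returns the value for a name=value token, an empty string for a bare
    flag such as 'quiet', or None if the parameter is absent. If the
    parameter appears more than once, the last occurrence is returned.
    """
    result = None
    for token in cmdline.split():
        key, _sep, value = token.partition("=")
        if key == name:
            result = value
    return result


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        value = kernel_param(proc_cmdline.read_text(), "pcie_aspm")
        if value is None:
            print("pcie_aspm is not set on the kernel command line")
        else:
            print("pcie_aspm =", value)
```

If the script reports that pcie_aspm is not set, the edit to /etc/default/grub probably did not make it into the generated grub.cfg, and the AER messages would be expected to continue.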

Just to be extra-cheeky, in case you are looking for an alternative, we recommend the following:


So what was wrong with the motherboard’s bios that you installed LinuxBios aka coreboot?

You should flash the motherboard back to the original BIOS, because coreboot doesn’t handle ACPI properly.

good news folks…no crash/freeze for 38 hours now!

daily summary is much cleaner:

 8 Time(s): igc 0000:02:00.0 red0: NIC Link is Down
 8 Time(s): igc 0000:02:00.0 red0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
 1 Time(s): perf: interrupt took too long (2595 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
 1 Time(s): perf: interrupt took too long (3248 > 3243), lowering kernel.perf_event_max_sample_rate to 61000

i think those nic down times correspond to when i updated ips rules throughout the day. although, is there anything i can do about these sample rate adjustments or should i just ignore?

p.s. michael, appreciate the heads-up on ipfire appliances and maybe going to ami bios again…


Protectli devices come with the option of coreboot when you buy them. I’m guessing he chose this configuration at purchase, rather than changing it from the original BIOS after the fact.


I think they are contributors to the coreboot project. They really should have gone through it, set it up correctly for the device, and disabled all the power saving and sleep modes before flashing all of their devices with coreboot.


thanks again michael. i’ll ping them…they’re really great guys over there at protectli so i’ll ask if they are aware

It was just a PCIe ASPM (power management) issue that was easily fixed by disabling it.

The sampling rate notice comes from the perf subsystem: a perf sampling interrupt, typically a non-maskable interrupt, took longer than expected, so the kernel lowered kernel.perf_event_max_sample_rate. The related sysctl is kernel.perf_cpu_time_max_percent. It hints to the kernel how much CPU time it should be allowed to use to handle perf sampling events. If the perf subsystem is informed that its samples are exceeding this limit, it will drop its sampling frequency to attempt to reduce its CPU usage.

Some perf sampling happens in NMIs. If these samples unexpectedly take too long to execute, the NMIs can become stacked up next to each other so much that nothing else is allowed to execute.

0: disable the mechanism. Do not monitor or correct perf’s sampling rate no matter how much CPU time it takes.

1-100: attempt to throttle perf’s sample rate to this percentage of CPU. Note: the kernel calculates an “expected” length of each sample event. 100 here means 100% of that expected length. Even if this is set to 100, you may still see sample throttling if this length is exceeded. Set to 0 if you truly do not care how much CPU is consumed.

I would just leave this alone, as it is only a notice that system load was high at that moment and the kernel chose to lower the perf sampling rate in response…
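
The two perf lines in the daily summary can be reproduced with a little arithmetic. This is a sketch based on a reading of the throttling logic in the kernel's perf core, not an authoritative description. It assumes a tick rate of HZ=1000 and the default perf_cpu_time_max_percent of 25, neither of which was confirmed for this box, and the function name perf_throttle is made up for illustration.

```python
def perf_throttle(avg_len_ns, hz=1000, cpu_time_max_percent=25):
    """Model one perf throttling step.

    avg_len_ns is the measured average sample length in nanoseconds (the
    first number in the log message). Returns a tuple of
    (new allowed length in ns, new perf_event_max_sample_rate).
    """
    tick_nsec = 1_000_000_000 // hz

    # Raise the allowed length to 125% of the measured average
    # (integer division, as in kernel arithmetic).
    new_allowed_ns = avg_len_ns + avg_len_ns // 4

    # CPU time budget per tick that perf sampling may consume.
    budget_ns = (tick_nsec // 100) * cpu_time_max_percent

    # How many samples of the new allowed length fit in that budget.
    if new_allowed_ns < budget_ns:
        samples_per_tick = budget_ns // new_allowed_ns
    else:
        samples_per_tick = 1

    return new_allowed_ns, samples_per_tick * hz


if __name__ == "__main__":
    for measured in (2595, 3248):
        allowed, rate = perf_throttle(measured)
        print(f"measured {measured} ns: new limit {allowed} ns, rate {rate}")
```

Under those assumptions, perf_throttle(2595) gives a new limit of 3243 ns and a rate of 77000, and perf_throttle(3248) gives 4060 ns and 61000. Both match the log: the first message lowers the rate to 77000, and the second message compares against the 3243 ns limit and lowers the rate to 61000. Each step is a small, self-limiting adjustment, which is consistent with the advice to leave it alone.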


ok sounds good, i’ll leave that alone. really appreciate everything!
