Intermittent problem with Intel igb driver and quad-port I211 Gigabit card

Hello,

I run IPFire on a mini PC with four built-in Intel I211 network ports. Sometimes, when a device connected to IPFire (such as a managed switch) disconnects and reconnects, for example while the switch installs a firmware update, IPFire fails to bring the link back up.

Linux acts as if the physical cable is still unplugged, even though it is not. I have tried disconnecting, waiting and reconnecting the cable, but the link never comes back up.

The only way to fix it, without rebooting, is to reload the igb kernel module and restart all IPFire networks!

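#!/bin/sh
# Unload and reload the igb driver (this resets every igb port)
modprobe -r igb
sleep 1
modprobe igb
sleep 1
# Bring each IPFire network zone back up
/etc/init.d/network restart red
/etc/init.d/network restart green
/etc/init.d/network restart blue
/etc/init.d/network restart orange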

The problem I have looks identical to an old bug reported against Fedora with kernel 5.3.7, although IPFire obviously has a very different kernel. I could not find the specific errors on my own system today, but I’m sure the symptoms are exactly the same as this:

[   35.883590] igb 0000:04:00.0 enp4s0: PCIe link lost, device now detached
[   35.891333] br0: port 1(enp4s0) entered blocking state
[   35.891338] br0: port 1(enp4s0) entered disabled state

Specifics of my hardware:

# lspci | grep Network
01:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
# lspci -v -s 04:00.0
04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
	Subsystem: Intel Corporation I211 Gigabit Network Connection
	Flags: bus master, fast devsel, latency 0, IRQ 19
	Memory at 88600000 (32-bit, non-prefetchable) [size=128K]
	I/O ports at b000 [size=32]
	Memory at 88620000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 40-62-31-ff-ff-08-a4-db
	Capabilities: [1a0] Transaction Processing Hints
	Kernel driver in use: igb
	Kernel modules: igb

If you have any idea how this could be diagnosed I’d really appreciate it!

Thank you in advance.

PCIe link loss has nothing to do with the network link on the LAN cable.
The NIC was disconnected from (or crashed on) the PCIe bus, and the chip no longer answers driver requests. Reloading the module resets the chip and reloads the firmware into the NIC.

Have you checked your cabling for grounding problems (e.g. high voltage on the shield or similar issues)?

I’ve come across this before and this post http://lkml.iu.edu/hypermail/linux/kernel/1806.1/00872.html seems to be a very close fit.

It is reproducible using kernel 4.9.107 and 4.17.0.
It is not reproducible using kernels 4.1.48, 4.4.136.
So it might be related to the changes in the igb versions from 5.3.0-k
(good) to 5.4.0-k (bad).

IPFire is also on the 5.4.0-k driver. I’m not sure when the change to 5.4.0-k occurred but I checked an instance of Debian Stretch I have (kernel 4.9.0-14) and it’s also on 5.4.0-k.

If the post I linked to is correct, and it is a driver issue, then it’s unlikely to be fixed anytime soon. You’ll therefore have to work around it as best you can with some kind of daemon or cron job that checks for this failed condition and restarts the interface(s). You might also consider only restarting the affected interface, rather than all of them.
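
As a rough illustration of the cron-job idea, here is a minimal sketch. The function names are hypothetical, the pattern is written against the "PCIe link lost" message quoted earlier in this thread, and the recovery step simply reuses the reload script from the first post, so treat it as a starting point rather than a tested fix:

```shell
#!/bin/sh
# Sketch only: detached_ifaces and recover_igb are made-up names.

# Read kernel log text on stdin and print each interface that igb
# reported as detached from the PCIe bus (one name per line, no repeats).
detached_ifaces() {
    sed -n 's/.*igb [0-9a-f:.]* \([^: ]*\): PCIe link lost.*/\1/p' | sort -u
}

# Reload the driver and restart the IPFire zones, as in the first post.
# Needs root, and briefly interrupts traffic on every igb port.
recover_igb() {
    modprobe -r igb
    sleep 1
    modprobe igb
    sleep 1
    for zone in red green blue orange; do
        /etc/init.d/network restart "$zone"
    done
}

# Possible use from a cron job (nothing is executed by this sketch):
#   dmesg | detached_ifaces | grep -q . && recover_igb
```

One caveat: the old message stays in the kernel ring buffer after the reload, so a real job would have to remember which occurrence it has already handled (for example by recording the last timestamp it acted on), otherwise it would reload the driver on every run.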

Another approach that you could try is to force a rescan of the pci bus for the affected interface. It’ll be something like:

echo 1 > /sys/bus/pci/devices/$port/rescan

where $port is the PCI address of the NIC port.
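
If it helps, here is a small sketch of how that could be scripted. Note that it uses a slightly different variant from the one-liner above: it removes the port from the bus first and then triggers a bus-wide rescan through /sys/bus/pci/rescan. The function names and the SYSFS_ROOT variable are my own inventions for illustration, and I have not tried this on the I211:

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override so the lookup can be tried
# against a scratch directory; on a real system it stays /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Print the PCI address (e.g. 0000:04:00.0) behind a network interface.
pci_addr_of() {
    link="$SYSFS_ROOT/class/net/$1/device"
    [ -e "$link" ] || return 1
    basename "$(readlink -f "$link")"
}

# Remove one port from the PCI bus, then ask the kernel to rediscover it.
# Needs root; the interface disappears until the rescan completes.
reprobe_port() {
    port=$(pci_addr_of "$1") || return 1
    echo 1 > "$SYSFS_ROOT/bus/pci/devices/$port/remove"
    sleep 1
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}
```

Calling reprobe_port with the name of the affected interface would touch only that port. The matching IPFire zone would probably still need a network restart afterwards.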

Good luck.

Thanks @arne_f but while I have not seen the problem recently, I suspect that @krasnal is right.
The cables between the IPFire system and the other devices are only 25 cm long, and I have tried replacing them. If the chip had crashed as you described, why is usually only 1 of the 4 interfaces affected? The problem can also happen with any device attached to any of the 4 ports, not only the switch I mentioned (it’s just the only thing which reboots itself to update).

Thanks @krasnal I’ll try the rescan idea from the console when the problem happens next (if I’m not in a hurry anyway!).

I’m having the same trouble on my APU.3C4 running core 153.

Is there any permanent solution to this problem?

In my environment the red0 interface goes wild. If I reboot, the system ends with a segmentation error…

Can someone advise me on how to downgrade igb to 5.3.0-k?

Greetz

Sorry to hear that you’re having the problem too. Yours sounds worse as a reboot doesn’t immediately resolve it.

I haven’t found any solution. The problem possibly happens less now, but when it does happen I don’t have time to troubleshoot, so I just press the power button on the mini PC to shut it down gracefully, then power it on again. Sorry that’s of no help to you!

It would be difficult to get an old kernel module working now.

I notice that IPFire has sadly lagged behind on kernel version again. It’s possible that a much newer kernel may have the problem resolved. You might try a Linux distro using a much newer kernel, like Fedora or Ubuntu to see if you can reproduce the problem on it?

You can test whether kernel 5.10 works better:
https://people.ipfire.org/~arne_f/highly-experimental/kernel-5.10/

The “modprobe” workaround does not solve the problem in my setup. Only after a full reboot does the system seem to work as usual.

This is a pity!

I migrated (using Clonezilla) to an older APU1 with a Realtek NIC, and the system works without problems.

@arne_f I’ll try the experimental kernel ASAP!!

Greetz

Strange association:
I have some DVB-T2 equipment where the driver doesn’t survive any sleep mode …

Could this perhaps be related to power saving techniques?
Perhaps try disabling every power-saving feature, as a test.

@manfred_knick

This could be a good point of investigation. How can I stop this power-saving thing?
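
One possible way to test this, as a sketch: the two sysfs knobs below are the standard Linux ones for runtime power management and the PCIe ASPM policy, but whether either has anything to do with this igb problem is only a guess. The function name and the SYSFS_ROOT variable are hypothetical:

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for trying this against a
# scratch directory; on a real system it stays /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Turn off two PCIe power-saving mechanisms for one NIC port.
# $1 is the PCI address, e.g. 0000:04:00.0. Needs root.
disable_power_saving() {
    # "on" keeps the device fully powered (no runtime suspend).
    echo on > "$SYSFS_ROOT/bus/pci/devices/$1/power/control" || return 1
    # Select the ASPM policy with no link power saving. The write is
    # rejected on systems where the firmware keeps control of ASPM.
    echo performance > "$SYSFS_ROOT/module/pcie_aspm/parameters/policy"
}
```

For example, disable_power_saving 0000:04:00.0 for the port shown in the lspci output above. The settings do not survive a reboot. Other things worth trying are booting with pcie_aspm=off on the kernel command line, and switching off Energy-Efficient Ethernet with ethtool --set-eee on the affected interface (with eee off), if the driver accepts it.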