Intermittent problem with Intel igb driver and quad-port I211 Gigabit card

Hello,

I run IPFire on a mini PC with four built-in Intel I211 network ports. Sometimes, when a device connected to IPFire (such as a managed switch) disconnects and reconnects (for example, while the switch installs a firmware update), IPFire fails to bring the link back up.

Linux acts as if the physical cable is still unplugged, even though it is not. I have tried disconnecting, waiting and reconnecting the cable, but the link never comes back up.

The only way to fix it, without rebooting, is to reload the igb kernel module and restart all IPFire networks!

#!/bin/sh
# Unload and reload the igb driver to reset the NICs
modprobe -r igb
sleep 1
modprobe igb
sleep 1
# Bring every IPFire zone back up
/etc/init.d/network restart red
/etc/init.d/network restart green
/etc/init.d/network restart blue
/etc/init.d/network restart orange

The problem I have looks identical to this old bug reported against Fedora with kernel 5.3.7, although IPFire obviously runs a very different kernel. I cannot find the specific errors from my own system today, but I’m sure the symptoms are exactly the same as this:

[   35.883590] igb 0000:04:00.0 enp4s0: PCIe link lost, device now detached
[   35.891333] br0: port 1(enp4s0) entered blocking state
[   35.891338] br0: port 1(enp4s0) entered disabled state

Specifics of my hardware:

# lspci | grep Network
01:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
# lspci -v -s 04:00.0
04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
	Subsystem: Intel Corporation I211 Gigabit Network Connection
	Flags: bus master, fast devsel, latency 0, IRQ 19
	Memory at 88600000 (32-bit, non-prefetchable) [size=128K]
	I/O ports at b000 [size=32]
	Memory at 88620000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 40-62-31-ff-ff-08-a4-db
	Capabilities: [1a0] Transaction Processing Hints
	Kernel driver in use: igb
	Kernel modules: igb

If you have any idea how this could be diagnosed I’d really appreciate it!

Thank you in advance.

PCIe link loss has nothing to do with the network link on the LAN cable.
The NIC was disconnected from, or crashed on, the PCIe bus and the chip no longer answers driver requests. Reloading the module resets the chip and reloads the firmware into the NIC.

Have you checked your cabling for grounding problems (e.g. high voltage on the shield or similar issues)?


I’ve come across this before and this post http://lkml.iu.edu/hypermail/linux/kernel/1806.1/00872.html seems to be a very close fit.

It is reproducible using kernel 4.9.107 and 4.17.0.
It is not reproducible using kernels 4.1.48, 4.4.136.
So it might be related to the changes in the igb versions from 5.3.0-k
(good) to 5.4.0-k (bad).

IPFire is also on the 5.4.0-k driver. I’m not sure when the change to 5.4.0-k occurred but I checked an instance of Debian Stretch I have (kernel 4.9.0-14) and it’s also on 5.4.0-k.

If the post I linked to is correct, and it is a driver issue, then it’s unlikely to be fixed anytime soon. You’ll therefore have to work around it as best you can with some kind of daemon or cron job that checks for this failed condition and restarts the interface(s). You might also consider only restarting the affected interface, rather than all of them.
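A cron-driven check along those lines could look roughly like the sketch below. It is only a sketch under assumptions: that the failure always logs the "PCIe link lost, device now detached" line quoted earlier in the thread, and that the state file and recovery script paths (both made up here) suit your system. It remembers the last detach line it acted on, so an old message still sitting in the kernel log does not trigger a reload on every run.

```shell
#!/bin/sh
# Watchdog sketch for cron (run as root). Assumptions, none confirmed in this
# thread: the failure always logs the "PCIe link lost" line, and the paths
# below (state file, recovery script) are hypothetical.
STATE=${STATE:-/var/run/igb-watchdog.last}
RECOVER=${RECOVER:-/usr/local/bin/igb-reload.sh}   # e.g. the reload script above

# Read kernel log text on stdin; print the most recent igb detach line, if any.
last_detach_line() {
    grep 'igb .*PCIe link lost, device now detached' | tail -n 1
}

# Extract the interface name (e.g. enp4s0) from a detach line.
iface_of() {
    echo "$1" | sed -n 's/.*igb [0-9a-f:.]* \([^ :]*\): PCIe link lost.*/\1/p'
}

# Succeed only if this detach line has not been handled yet, and remember it,
# so a stale message in the kernel log does not cause a reload loop.
is_new_event() {
    [ -n "$1" ] || return 1
    [ "$1" != "$(cat "$STATE" 2>/dev/null)" ] || return 1
    echo "$1" > "$STATE"
}

main() {
    line=$(dmesg | last_detach_line)
    if is_new_event "$line"; then
        logger -t igb-watchdog "$(iface_of "$line") detached, running $RECOVER"
        "$RECOVER"
    fi
}

# Only act when asked to, e.g. from cron: igb-watchdog.sh --run
if [ "${1:-}" = "--run" ]; then main; fi
```

Note that reloading igb takes down every port the driver handles, so a recovery script that reloads the module still has to restart all zones; restarting only the affected zone makes sense if the recovery step itself is per-port.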

Another approach that you could try is to force a rescan of the pci bus for the affected interface. It’ll be something like:

echo 1 > /sys/bus/pci/devices/$port/rescan

where $port is the PCI address of the NIC port in the form sysfs uses, including the domain prefix (for example 0000:04:00.0 rather than 04:00.0).
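A slightly fuller variant is to remove the one port from the bus first and then ask the kernel to rescan the whole bus, using the standard sysfs `remove` and `rescan` files. This is an untested sketch, not a confirmed fix for this fault: the function names and the `SYSFS_ROOT` override (there only to allow a dry run against a scratch directory) are made up for illustration, and it assumes the NIC sits in PCI domain 0000 as in the log line quoted earlier.

```shell
#!/bin/sh
# Sketch: detach one NIC port from the PCI bus and let the kernel rediscover it.
# SYSFS_ROOT can be overridden for dry runs; it defaults to the real /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Expand a short lspci address (04:00.0) to the sysfs form (0000:04:00.0).
# Assumes PCI domain 0000 when no domain is given.
full_pci_addr() {
    case "$1" in
        ????:??:??.?) echo "$1" ;;
        ??:??.?)      echo "0000:$1" ;;
        *)            return 1 ;;
    esac
}

# Remove the device, wait briefly, then trigger a rescan of all PCI buses.
# The device directory disappears on removal, hence the bus-wide rescan file.
reset_port() {
    addr=$(full_pci_addr "$1") || { echo "bad PCI address: $1" >&2; return 1; }
    dev="$SYSFS_ROOT/bus/pci/devices/$addr"
    [ -e "$dev/remove" ] || { echo "no such device: $addr" >&2; return 1; }
    echo 1 > "$dev/remove"
    sleep 1
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}

# Usage (as root): reset_port 04:00.0
```

After the port reappears, restart whichever IPFire zone is bound to it with `/etc/init.d/network restart` followed by the zone name, as in the original workaround.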

Good luck.

Thanks @arne_f but while I have not seen the problem recently, I suspect that @krasnal is right.
The cables between the IPFire system and the other devices are only 25 cm long, and I have tried replacing them. If the chip had crashed as you described, why is usually only 1 of the 4 interfaces affected? The problem can also happen with any device attached to any of the 4 ports, not only the switch I mentioned (it’s just the only thing that reboots itself to update).

Thanks @krasnal, I’ll try the rescan idea from the console the next time the problem happens (if I’m not in a hurry, anyway!).