Random connection drops on RED

mxgeek · 14 June 2020 19:00

I'm running IPFire 2.25 (x86_64) - Core Update 145 on a Lenovo ThinkCentre i5, with GREEN on a Netgear Gbit card and RED on the motherboard’s Intel NIC. I have been getting random disconnects on the RED interface since earlier in the year, and they seem to have gotten worse after Core Update 144.

I’ve found that I can trigger a network drop about 9 times out of 10 by running the speed test at https://www.dslreports.com/speedtest. The Ookla test (speedtest.net) also causes disconnects, though less often. Both fail during the upload phase. I saw elsewhere in the forum that the Intel NIC driver might be misbehaving.

Is there a fix or workaround?

pmueller · 15 June 2020 11:31

Hi,

@arne_f is currently working on kernel 4.14.184, but I am not sure if it already contains fixes for the e1000e NIC driver.

Thanks, and best regards,
Peter Müller

arne_f · 15 June 2020 12:50

The Intel e1000e driver in kernel 4.14.x has not been touched by the kernel developers since October 2019
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/ethernet/intel/e1000e?h=v4.14.184

and core 144 doesn’t contain drivers at all.

mxgeek · 16 June 2020 01:05

Thanks for your responses. I guess the easiest solution is to install a second NIC with a chip other than Intel. As I wrote, the disconnects were there before, but got worse. It might have been a coincidence. I will add the second NIC and report back. I have to do it when my wife is not using the connection or there will be hell to pay…

pmueller · 17 June 2020 18:06

I see, thanks for clarifying and sorry for my noise…

mxgeek · 20 June 2020 18:22

Well, I finally got to add a second NIC with a Realtek chip and assigned RED to it. So far IPFire has run without disconnects for 28 hours; previously it was dropping the connection several times a day, so that confirms it is the Intel NIC. The kernel log shows:

Time	Section
09:08:02	kernel:	e1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
09:08:02	kernel:	TDH
09:08:02	kernel:	TDT
09:08:02	kernel:	next_to_use
09:08:02	kernel:	next_to_clean
09:08:02	kernel:	buffer_info[next_to_clean]:
09:08:02	kernel:	time_stamp <10a4426cc>
09:08:02	kernel:	next_to_watch
09:08:02	kernel:	jiffies <10a442800>
09:08:02	kernel:	next_to_watch.status <0>
09:08:02	kernel:	MAC Status <80083>
09:08:02	kernel:	PHY Status <796d>
09:08:02	kernel:	PHY 1000BASE-T Status <3c00>
09:08:02	kernel:	PHY Extended Status <3000>
09:08:02	kernel:	PCI Status <10>
09:08:04	kernel:	e1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
09:08:04	kernel:	TDH
09:08:04	kernel:	TDT
09:08:04	kernel:	next_to_use
09:08:04	kernel:	next_to_clean
09:08:04	kernel:	buffer_info[next_to_clean]:
09:08:04	kernel:	time_stamp <10a4426cc>
09:08:04	kernel:	next_to_watch
09:08:04	kernel:	jiffies <10a442a40>
09:08:04	kernel:	next_to_watch.status <0>
09:08:04	kernel:	MAC Status <80083>
09:08:04	kernel:	PHY Status <796d>
09:08:04	kernel:	PHY 1000BASE-T Status <3c00>
09:08:04	kernel:	PHY Extended Status <3000>
09:08:04	kernel:	PCI Status <10>
09:08:06	kernel:	e1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
09:08:06	kernel:	TDH
09:08:06	kernel:	TDT
09:08:06	kernel:	next_to_use
09:08:06	kernel:	next_to_clean
09:08:06	kernel:	buffer_info[next_to_clean]:
09:08:06	kernel:	time_stamp <10a4426cc>
09:08:06	kernel:	next_to_watch
09:08:06	kernel:	jiffies <10a442cc0>
09:08:06	kernel:	next_to_watch.status <0>
09:08:06	kernel:	MAC Status <80083>
09:08:06	kernel:	PHY Status <796d>
09:08:06	kernel:	PHY 1000BASE-T Status <3c00>
09:08:06	kernel:	PHY Extended Status <3000>
09:08:06	kernel:	PCI Status <10>
09:08:08	kernel:	e1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
09:08:08	kernel:	TDH
09:08:08	kernel:	TDT
09:08:08	kernel:	next_to_use
09:08:08	kernel:	next_to_clean
09:08:08	kernel:	buffer_info[next_to_clean]:
09:08:08	kernel:	time_stamp <10a4426cc>
09:08:08	kernel:	next_to_watch
09:08:08	kernel:	jiffies <10a442f00>
09:08:08	kernel:	next_to_watch.status <0>
09:08:08	kernel:	MAC Status <80083>
09:08:08	kernel:	PHY Status <796d>
09:08:08	kernel:	PHY 1000BASE-T Status <3c00>
09:08:08	kernel:	PHY Extended Status <3000>
09:08:08	kernel:	PCI Status <10>
09:08:10	kernel:	e1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
09:08:10	kernel:	TDH
09:08:10	kernel:	TDT
09:08:10	kernel:	next_to_use
09:08:10	kernel:	next_to_clean
09:08:10	kernel:	buffer_info[next_to_clean]:
09:08:10	kernel:	time_stamp <10a4426cc>
09:08:10	kernel:	next_to_watch
09:08:10	kernel:	jiffies <10a443140>
09:08:10	kernel:	next_to_watch.status <0>
09:08:10	kernel:	MAC Status <80083>
09:08:10	kernel:	PHY Status <796d>
09:08:10	kernel:	PHY 1000BASE-T Status <3c00>
09:08:10	kernel:	PHY Extended Status <3000>
09:08:10	kernel:	PCI Status <10>
09:08:11	kernel:	e1000e 0000:00:19.0 red0: Reset adapter unexpectedly
09:08:15	kernel:	e1000e: red0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
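
For anyone who wants to check a saved copy of their own kernel log for the same pattern, here is a minimal Python sketch. It assumes the log is plain text in the tab-separated layout shown above; the function name `summarize_hangs` and the shortened sample string are made up for illustration. It counts the hang reports and, for each one, reports how far `jiffies` has advanced past `time_stamp`.

```python
import re

# Marks the first line of each hang report from the e1000e driver.
HANG_RE = re.compile(r"Detected Hardware Unit Hang")
# Captures the hex value of the time_stamp and jiffies fields.
FIELD_RE = re.compile(r"kernel:\s+(time_stamp|jiffies)\s+<([0-9a-fA-F]+)>")


def summarize_hangs(log_text):
    """Return (hang_count, ages) for a kernel log excerpt.

    ages holds, per hang report, jiffies minus time_stamp, i.e. how long
    the buffer at next_to_clean had been pending, measured in jiffies.
    """
    hang_count = 0
    ages = []
    time_stamp = None
    for line in log_text.splitlines():
        if HANG_RE.search(line):
            hang_count += 1
            time_stamp = None  # start a fresh report
            continue
        match = FIELD_RE.search(line)
        if not match:
            continue
        name, value = match.group(1), int(match.group(2), 16)
        if name == "time_stamp":
            time_stamp = value
        elif time_stamp is not None:
            ages.append(value - time_stamp)
    return hang_count, ages


# Shortened, hypothetical sample built from the first two reports above.
SAMPLE_KERNEL_LOG = """\
09:08:02\tkernel:\te1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
09:08:02\tkernel:\ttime_stamp <10a4426cc>
09:08:02\tkernel:\tjiffies <10a442800>
09:08:04\tkernel:\te1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
09:08:04\tkernel:\ttime_stamp <10a4426cc>
09:08:04\tkernel:\tjiffies <10a442a40>
"""

if __name__ == "__main__":
    count, ages = summarize_hangs(SAMPLE_KERNEL_LOG)
    print(count, ages)
```

In the excerpt above, `time_stamp` stays at the same value in every report while `jiffies` keeps climbing, which is consistent with the same transmit buffer staying stuck until the adapter resets.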

The RED log shows:

Time	Section
09:08:11	dhcpcd[1583]	: red0: carrier lost
09:08:11	dhcpcd[1583]	: red0: deleting route to xx.xx.xx.xx/20
09:08:11	dhcpcd[1583]	: red0: deleting default route via xx.xx.xx.xx
09:08:15	dhcpcd[1583]	: red0: carrier acquired
09:08:15	dhcpcd[1583]	: red0: IAID e6:c6:08:4d
09:08:16	dhcpcd[1583]	: red0: soliciting an IPv6 router
09:08:16	dhcpcd[1583]	: red0: rebinding lease of xx.xx.xx.xx
09:08:17	dhcpcd[1583]	: red0: probing address xx.xx.xx.xx/20
09:08:22	dhcpcd[1583]	: red0: leased xx.xx.xx.xx for 42442 seconds
09:08:22	dhcpcd[1583]	: red0: adding route to xx.xx.xx.xx/20
09:08:22	dhcpcd[1583]	: red0: adding default route via xx.xx.xx.xx
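
The length of each outage can be read off the RED log by pairing a "carrier lost" line with the next "leased" line. The following is a rough Python sketch of that arithmetic, assuming the same tab-separated layout as above; `outage_seconds` and the sample string are illustrative names, not part of IPFire or dhcpcd. Because the log lines carry no date, an outage that spans midnight would not be handled correctly.

```python
from datetime import datetime


def outage_seconds(log_text):
    """Return the duration of each outage in whole seconds.

    An outage starts at a "carrier lost" line and ends at the next
    "leased" line, when dhcpcd holds an address again.
    """
    outages = []
    lost_at = None
    for line in log_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        try:
            stamp = datetime.strptime(parts[0], "%H:%M:%S")
        except ValueError:
            continue  # header row or other non-log line
        message = parts[1]
        if "carrier lost" in message:
            lost_at = stamp
        elif " leased " in message and lost_at is not None:
            outages.append(int((stamp - lost_at).total_seconds()))
            lost_at = None
    return outages


# Shortened sample taken from the RED log above.
SAMPLE_RED_LOG = """\
Time\tSection
09:08:11\tdhcpcd[1583]\t: red0: carrier lost
09:08:15\tdhcpcd[1583]\t: red0: carrier acquired
09:08:16\tdhcpcd[1583]\t: red0: rebinding lease of xx.xx.xx.xx
09:08:22\tdhcpcd[1583]\t: red0: leased xx.xx.xx.xx for 42442 seconds
"""

if __name__ == "__main__":
    print(outage_seconds(SAMPLE_RED_LOG))
```

For the event logged above, that is 11 seconds from losing the carrier at 09:08:11 to holding a lease again at 09:08:22.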

After switching NICs I can also run the HTML5 dslreports.com speedtest with no disconnects (why?).

mxgeek · 20 June 2020 18:44

Indeed, I had done that earlier. I had a PUMA 6 chipset modem that was resetting itself several times a day, so I bugged the ISP until they sent a technician, who confirmed the problem, replaced every connector between the modem and the utility pole outside, and also found low power on the highest frequency channels at the pole. A second technician came and probably replaced all the connectors on the main cable going around the neighborhood, and after that I had good power on all channels and good SNR. But the problem continued, so I got a new DOCSIS 3.1 modem with a Broadcom chipset from the ISP. This reduced the number of disconnects and shortened their duration, but they were still happening. I read somewhere in this forum about similar behaviour on Intel NICs, and lo and behold, my RED was an Intel NIC. Just yesterday I installed a Realtek chip NIC on the firewall box and reassigned RED to it, and so far no more disconnects. I hope that will be the end of the problem.

Mike Clarke (mwclarke13) · 20 June

Since I moved to new hardware and installed Core 144 (the old hardware would not support updates past 122), I have had many drops, requiring a restart of my cable modem and firewall to get a DHCP lease again once the cable modem was back up. However, this looks to be just a coincidence: the modem logs show timeouts between the modem and the ISP, and the ISP was reporting issues in my area when this started. It is still happening occasionally. The ISP is no longer finding problems, but I still think an ISP issue is causing my problem. You may want to double-check everything upstream of IPFire on your end.

bonnietwin · 5 January 2025 10:55

A post was split to a new topic: E10001 detected hardware unit hang