e1000e detected hardware unit hang

yavorz · 5 January 2025 09:35

Dear all, I never had this problem before the latest update, Core Update 190. My red interface has now hung for the second time. Here are the logs:

11:20:32	kernel:	PCI Status <10>
11:20:32	kernel:	PHY Extended Status <3000>
11:20:32	kernel:	PHY 1000BASE-T Status <7800>
11:20:32	kernel:	PHY Status <792d>
11:20:32	kernel:	MAC Status <80283>
11:20:32	kernel:	next_to_watch.status <0>
11:20:32	kernel:	jiffies <107af1d00>
11:20:32	kernel:	next_to_watch <92>
11:20:32	kernel:	time_stamp <107a84952>
11:20:32	kernel:	buffer_info[next_to_clean]:
11:20:32	kernel:	next_to_clean <90>
11:20:32	kernel:	next_to_use
11:20:32	kernel:	TDT
11:20:32	kernel:	TDH <92>
11:20:32	kernel:	e1000e 0000:00:19.0 red0: Detected Hardware Unit Hang:
The red interface is an Intel NIC: Intel Corporation 82566DM-2 Gigabit Network Connection [e1000e].

Can you tell me about any solutions?

bonnietwin · 5 January 2025 11:13

This type of issue with the Intel e1000e driver seems to have been around since at least 2013.

From my internet searching, I have found people complaining about this from 2013 all the way up to 2024, and yours now makes 2025.

It looks like Intel has no plans to fix it. From my searching, the solution that kept being reported as fixing the problem is to turn off GSO, GRO and TSO on the NIC. These are offloading functions (generic segmentation offload, generic receive offload and TCP segmentation offload), and I would expect that disabling them would affect performance to some extent.

If you want to do that, you would need to run this command:

ethtool -K red0 gso off gro off tso off

That is what the users on the various websites I found did:

https://forum.proxmox.com/threads/intel-nic-e1000e-hardware-unit-hang.106001/
https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
https://www.reddit.com/r/Proxmox/comments/1drs89s/intel_nic_e1000e_hardware_unit_hang/
https://gist.github.com/brunneis/0c27411a8028610117fefbe5fb669d10

However, I suspect that the offloads would get turned on again when you reboot, so you would need to add that command to the rc.local file so that it runs during the boot cycle.
https://www.ipfire.org/docs/pkgs/rc-local

As to why it only started now, there was a kernel update in CU190 so either there was a change to the e1000e driver or something else changed in the kernel that causes the e1000e driver to hang.
The other thing that worked for some users suffering from this problem was a change of NIC but that is only feasible if the NIC is an add-in card.
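
For anyone wanting to try the rc.local route, here is a minimal sketch of what could go into the file. It wraps the same ethtool command in a small function so it can be dry-run first. The function name disable_offloads and the ETHTOOL override variable are made up for this example; they are not part of ethtool or IPFire. The location of rc.local is described on the wiki page linked above.

```shell
#!/bin/sh
# Sketch only: turn off GSO, GRO and TSO on one interface.
# ETHTOOL is a hypothetical override used for dry runs (e.g. ETHTOOL=echo);
# when it is unset, the real ethtool binary is called.

disable_offloads() {
    iface="$1"
    # Same command as posted above, with the interface name as a parameter.
    "${ETHTOOL:-ethtool}" -K "$iface" gso off gro off tso off
}
```

In rc.local the function would then be called as disable_offloads red0. Running it once by hand with ETHTOOL=echo set only prints the arguments that would be passed, without touching the NIC.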

dr_techno · 5 January 2025 14:08

Grimm did add the missing Intel drivers, but I recall him mentioning that he removed the reinitialise routine on network drivers so they can remain in sync with power management. Of course, in our application we should have them in an ‘always on’ state instead. So power management will be something to look at if anything operates or behaves differently.

yavorz · 5 January 2025 15:00

I could try disabling these settings, but I don't want to risk it. The Intel NIC is onboard on my system. I have replaced it with an external Realtek… but in all the years I have had this system there was not a single error on the Intel NIC. This is the first time, after yesterday's update to Core 190. Maybe it would be good if the part of the kernel related to the Intel e1000e driver were rolled back.

dr_techno · 11 January 2025 11:02

It's not really a change in the driver, but in the system. But to let you know, I loaded 190 from a burned ISO CD-ROM because updating caused my 4-port Intel card to act up. So the update must be missing something, for example something that should be shut down before it is applied.

jronpaul · 15 February 2025 10:25

Is there a fix for this?

I'm seeing it on a Dell Precision T5600
running Linux Mint 21.3.

ethtool -i enp0s25
driver: e1000e
version: 5.15.0-131-generic
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ethtool --show-features enp0s25 | egrep "gso|gro|tso"
tx-gso-robust: off [fixed]
tx-gso-partial: off [fixed]
tx-gso-list: off [fixed]
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off

I also added pcie_aspm=off to the kernel command line in GRUB.
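
A note on the --show-features output above: the pattern gso|gro|tso only matches sub-feature names such as tx-gso-robust. As far as I know, ethtool lists the three main offloads under their long names (tcp-segmentation-offload, generic-segmentation-offload, generic-receive-offload), so they never appear in that grep and the output does not show whether the ethtool -K command took effect. A minimal sketch of a filter that matches the long names instead; the function name filter_offloads is made up for this example:

```shell
#!/bin/sh
# Sketch only: reduce `ethtool -k` output (read from stdin) to the three
# offloads discussed in this thread. Leading whitespace is allowed in case
# a line is indented.

filter_offloads() {
    grep -E '^[[:space:]]*(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload):'
}
```

Usage would be ethtool -k enp0s25 | filter_offloads. If the earlier command worked, all three lines should read off.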

bonnietwin · 15 February 2025 11:11

Hi @jronpaul

Welcome to the IPFire community.

Be aware this forum is for the IPFire Firewall distribution and not a general purpose Linux distribution.

From a general point of view on this driver problem: as this driver has been reported to cause this problem for users since 2013, it looks like Intel is not likely to fix it.

From that point of view, the only fix available is to disable the offloading functions in the driver.

jronpaul · 16 February 2025 03:53

Ohh dang… I'm sorry about that… I didn't even realize.
Yes, I have found the same thing… it's a known issue, but it seems like
setting gso, gro, tso wasn't taking?
But I'll try again…
I might need to set up a script to run it each time after a reboot,
unless it can be added to /etc/network/interfaces?

bonnietwin · 16 February 2025 08:17

In my earlier post in this thread I mentioned that you have to add the command to rc.local so that it is set every time you boot.

There is a link to the rc.local wiki page as well in that post.

So far, for the other users on this forum, disabling those offloads has worked.

If disabling the offloads does not fix the issue for you, then I have no idea what to do.

jronpaul · 20 February 2025 04:51

Unfortunately there is no rc.local in Ubuntu,
but it looks like I can possibly add it to the actual network config file.
Thanks again.

nickh · 20 February 2025 13:12

Nothing to do with IPF, but you could set up a oneshot systemd service, or use a cron job @reboot.
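
To make the oneshot service idea concrete, here is a rough sketch that prints such a unit file. The function name print_offload_unit, the ordering after network-online.target and the default path /usr/sbin/ethtool are assumptions for illustration; check command -v ethtool on your own system before using it.

```shell
#!/bin/sh
# Sketch only: print a oneshot systemd unit that disables GSO, GRO and TSO
# on one interface at boot.
# $1 = interface name; $2 = optional ethtool path (assumed default below).

print_offload_unit() {
    iface="$1"
    ethtool_bin="${2:-/usr/sbin/ethtool}"
    cat <<EOF
[Unit]
Description=Disable GSO/GRO/TSO on $iface
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=$ethtool_bin -K $iface gso off gro off tso off
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}
```

The output could be saved with print_offload_unit enp0s25 | sudo tee /etc/systemd/system/nic-offload-off.service (the file name is hypothetical), followed by sudo systemctl daemon-reload and sudo systemctl enable --now nic-offload-off.service. The cron alternative would be a single line in root's crontab: @reboot /usr/sbin/ethtool -K enp0s25 gso off gro off tso off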

dr_techno · 21 February 2025 00:01

/etc/rc.local is an optional file, so if it's not there in Ubuntu, you just create it.

It must have this at the top of the file:

  #!/bin/sh -e

You also have to mark it executable and make root the owner:

sudo chown root /etc/rc.local
sudo chmod 755 /etc/rc.local

You can test your script by executing it like this:

sudo /etc/init.d/rc.local start

After boot in Ubuntu, rc-local.service should show as active (exited). To check this:

systemctl status rc-local.service