I face a strange problem on one instance of IPFire that I just upgraded from core 168 to core 170. It seems that there are problems with some transmitted packets coming from the IPFire’s LAN ports.
This results in very bad RTT due to retransmissions. I think about 50% of the packets are lost. This reading is from one host connected to a dedicated, additional port on that IPFire (not green, orange or blue):
RX packets 59551 bytes 10061459 (9.5 MiB)
RX errors 45175 dropped 0 overruns 0 frame 0
TX packets 56693 bytes 11750353 (11.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
As you can see, only the RX packets are in trouble.
On the IPFire’s side all looks fine:
RX packets 78639 bytes 13943835 (13.2 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 97137 bytes 50830661 (48.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
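The same counters shown above can also be read directly from sysfs on either end, which is handy for watching them change over time. A minimal sketch; the interface name is an assumption, substitute your actual NIC (e.g. green0 on the IPFire side):

```shell
# Read the RX/TX error counters straight from sysfs.
# IFACE is an assumption; adjust to the NIC you want to inspect.
IFACE=${IFACE:-eth1}
# Fall back to the loopback device so the sketch runs anywhere.
[ -d /sys/class/net/"$IFACE" ] || IFACE=lo
for ctr in rx_packets rx_errors rx_dropped tx_packets tx_errors tx_dropped; do
  printf '%-12s %s\n' "$ctr" "$(cat /sys/class/net/"$IFACE"/statistics/$ctr)"
done
```

Running this in a loop (e.g. with `watch`) on both sides makes it easy to see which counter increments during a transfer.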
I’m quite sure this must be connected to my update yesterday, because the problems started right after the update and I never faced them before.
The ethernet-chip on IPFire side is:
Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
Subsystem: Intel Corporation I211 Gigabit Network Connection
Kernel driver in use: igb
Kernel modules: igb
Any ideas what could have changed between 168 and 170 and what I could tweak to get it working correctly again?
EDIT: Added kernel module info
So do I get you right: the client shows RX errors but IPFire shows none?
I wonder how this is even possible, because the RX error count on the client side should match the TX error count on the server side.
With core 170 the kernel has been updated: blog.ipfire.org - IPFire 2.27 - Core Update 170 released
I haven’t checked whether any Intel drivers have been updated. The easiest and fastest way would be to restore a previous version of IPFire and check whether this issue is really related to core 170.
@Terry: Thanks for your reply.
Yes, you’re right. It’s strange that there are no transmission errors on the IPFire side. I’d also think they should normally count up if a packet could not be delivered correctly to the receiver.
Unfortunately I can not easily roll back, since this is a production system; most of our company’s internet traffic goes through it. So quick testing can’t be done on this system.
I have a dedicated testing instance with almost the same specs. I did not face the problems there, but there is no real load on that system. Maybe it’s an IRQ-related problem that only appears when there’s a certain level of load on the system.
Since I upgraded from 168 straight to 170, and core 169 contained at least a kernel update, the cause of my problems may come from there.
I noticed there are threads with similar problems on Intel chips after the update to core 170, but the problems described there seem different from mine. Strangely enough, it does not affect all ports. So it seems to matter what’s on the other side, too.
I tried the command @pmueller suggested here, but no measurable change.
Kernel Update has been in core170. Linux Kernel 5.15.59.
We are still on core 169 and have many Intel 2xx chips in use and don’t have that problem. The old PRO/1000 cards use a different driver.
I don’t think you have a choice, because I’m not even sure it’s a problem caused by IPFire. Live with it, roll back, or even use the backup system. The weekend is just around the corner; this is the best time to restore/replace the system.
My system next to me with 2x Intel I211 on core 169 for green and orange transferred more than 100 GB from yesterday till this morning without a single RX/TX error.
Yes, the other interfaces also seem to work without problems. All interfaces are I211, so as I said, the problem only appears in conjunction with specific hardware on the receiver side. Still, the problems started right after updating to core 170.
A faster step: put a simple unmanaged switch between IPFire and that “client” and see if this changes anything (just for troubleshooting).
I think I made an important observation regarding this problem:
The RTT drops back to normal as long as there is enough activity on the interface! If I run a second connection that permanently transfers data between the two hosts, the problem disappears!
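As a stop-gap, that observation suggests keeping the link busy so it never idles into the suspected power-saving state. A minimal workaround sketch; the target address is an assumption and should point at the affected peer (note: `ping` intervals below 0.2 s require root on Linux, 0.2 s itself does not):

```shell
# Keepalive sketch: generate constant low-rate traffic in the background
# while latency-sensitive work runs. TARGET is an assumption.
TARGET=${TARGET:-127.0.0.1}
ping -i 0.2 -q "$TARGET" >/dev/null 2>&1 &
KEEPALIVE_PID=$!

# ... run the latency-sensitive traffic here ...
sleep 1

# Stop the background keepalive when done.
kill "$KEEPALIVE_PID" 2>/dev/null
```

This is only a diagnostic/band-aid, of course, not a fix for the underlying driver issue.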
This looks like misbehaving power management in the igb driver to me. I remember that there was some big trouble with “EEE” (green Ethernet) and the Intel network drivers a long time ago.
Maybe this is a regression in the updated igb module?
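If EEE is the suspect, its state can be queried and toggled per port with ethtool. A sketch, assuming the interface name and that the driver/NIC expose EEE at all; `--set-eee` needs root:

```shell
# Inspect (and optionally disable) Energy-Efficient Ethernet on one port.
# IFACE is an assumption; on IPFire this might be green0, orange0, etc.
IFACE=${IFACE:-green0}
if command -v ethtool >/dev/null 2>&1; then
  # Show current EEE state; not all NICs/drivers support this query.
  ethtool --show-eee "$IFACE" 2>/dev/null || echo "EEE not readable on $IFACE"
  # To disable EEE for testing, uncomment (requires root):
  # ethtool --set-eee "$IFACE" eee off
else
  echo "ethtool not installed"
fi
```

If disabling EEE makes the latency spikes go away, that would strongly support the power-management theory.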
Some more findings:
The device on the “other side” uses an AX88179 USB-to-LAN adapter.
I’ve seen that there are problems when one uses this adapter on the IPFire itself. In my case it’s on the remote side. Could this still be the cause of the problems? If so, is there any workaround besides replacing all devices using this chip?
ethtool -i eth1
If the adaptor was on an IPFire system then the only workaround that I have seen mentioned is to reinstall CU169 and wait for CU171 to be released with the updated kernel that has had the bug fixed.
As the adaptor is on a machine on your LAN, you need to check if the involved OS has an upgrade to kernel 5.15.68 as that is the kernel version going into CU171 and shown to not have the problem. Alternatively you could look at downgrading the kernel on your LAN machine to 5.15.49, which is the kernel running on CU169 and didn’t show the problem, until the OS ships an update with the required kernel version.
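A quick way to check where a LAN machine stands relative to those two kernel versions is a version-sort comparison against `uname -r`. A sketch; this only compares the upstream version string and says nothing about distro backports of the fix:

```shell
# Compare the running kernel against 5.15.68, the version named in this
# thread as carrying the ax88179 fix. Distro backports are not detected.
fixed=5.15.68
current=$(uname -r | cut -d- -f1)   # strip any distro suffix
lowest=$(printf '%s\n%s\n' "$fixed" "$current" | sort -V | head -n1)
if [ "$lowest" = "$fixed" ]; then
  echo "kernel $current is >= $fixed: fix should be present"
else
  echo "kernel $current is < $fixed: possibly affected"
fi
```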
@bonnietwin thanks for your reply and the clarification.
So, core 171 will contain a fix? Is there a testing branch of this version available yet? If so, I’d like to test it on my staging device.
At the moment it is still in unstable. The people in the AX88179 thread did their testing with that Unstable branch. Their testing confirmed that the new kernel version solved the problem they had been experiencing with the AX88179 USB adaptors…
I would suspect that CU171 is close to going to the Testing branch, my feeling is late next week sometime, but I am not involved in the detail of preparing the CU releases, so something might come up while doing the unstable nightly builds that might delay it.
Tried this, doesn’t change a thing
So, I think I’ll wait for core 171 to test and hope the issue is addressed by the updated kernel then.
This is the proof that it has nothing to do with IPFire. The switch would have filtered broken packets.
So, the change to CU171 will not solve the issue.
The other side of the line should be corrected ( up-/downgraded as @bonnietwin recommended in an earlier post ).
I don’t think so. The problem suddenly arose right after updating the IPFire. The remote system was NOT updated. It also is not solved by e.g. replacing the USB-to-Ethernet adapter with a device based on a Realtek chip.
However, I will run tests with core 171 on my staging device and see if anything has changed.
A switch does not decode packets in normal switching mode. It just forwards frames based on its MAC address table.
The packet errors don’t show up in the IPFire stats. As before, this is not an IPFire issue, and as bbitsch mentioned you should check the client kernel version etc. for this known bug of your USB-LAN dongle.
My switches can detect corrupted packets, so they must have a CRC check. I don’t think that they will still forward broken packets to the recipient.
I can confirm this issue/error.
I use a PC Engines APU with 3 I211AT NICs. Since the update to 170 I have a lot of packet losses on the remote side/workstation computers.
The problem occurs on all my computers in the network. So it shouldn’t be an issue with the NIC on the remote side.
For example, I use a banking app on my phone to receive the TANs for a transaction. I only receive the TAN when I close the WLAN connection to the IPFire and use the mobile network instead.
If any logs or other infos are necessary please let me know.
Thanks a lot!
After a few more tests I can second this. This, for example, is a ping against a Netgear GS724T switch that normally has an RTT of about one millisecond:
PING 172.16.227.200 (172.16.227.200) 56(84) bytes of data.
64 bytes from 172.16.227.200: icmp_seq=1 ttl=64 time=710 ms
64 bytes from 172.16.227.200: icmp_seq=2 ttl=64 time=759 ms
64 bytes from 172.16.227.200: icmp_seq=3 ttl=64 time=427 ms
64 bytes from 172.16.227.200: icmp_seq=4 ttl=64 time=674 ms
64 bytes from 172.16.227.200: icmp_seq=5 ttl=64 time=77.7 ms
64 bytes from 172.16.227.200: icmp_seq=6 ttl=64 time=505 ms
64 bytes from 172.16.227.200: icmp_seq=7 ttl=64 time=83.0 ms
Doing a bandwidth test against my ISP also shows that the burst bandwidth has decreased, so I assume that all ports are affected by the bug to some degree.
Like Mike says, I’m also willing to help by trying out builds or settings on my testing systems.
I filed a bug report for this, as I’m now sure it’s not the same as Bug #12750.
Any new information on this?
It is getting very annoying, especially when you use 2FA. The packet losses are so heavy that the tokens are no longer in time sync and a login is almost impossible.
Please let me know if the developers need any information, logs and so on.