I face a strange problem on one instance of IPFire that I just upgraded from core 168 to core 170. It seems that there are problems with some transmitted packets coming from the IPFire’s LAN ports.
This results in very bad RTT due to retransmissions. I think about 50% of the packets are lost. This reading is from one host connected to a dedicated, additional port on that IPFire (not green, orange or blue):
I haven’t checked whether any Intel drivers have been updated. The easiest and fastest way would be to restore a previous version of IPFire and check whether this issue is really related to core 170.
Yes, you’re right. It’s strange that there are no transmission errors on the IPFire side. I’d also think they should normally count up if a packet could not be delivered correctly to the receiver.
Unfortunately I cannot easily roll back, since this is a production system; most of our company’s internet traffic goes through it. So quick testing can’t be done on this system.
I have a dedicated testing instance with almost the same specs. I did not see the problem there, but there is no real load on that system. Maybe it’s an IRQ-related problem that only appears when there’s a certain level of load on the system.
Since I upgraded from 168 to 170, there are at least the kernel updates from core 169 in between, so maybe the cause of my problems comes from those.
I noticed there are threads with similar problems on Intel chips after updating to core 170, but the problems described there seem to be different from mine. Strangely enough, it does not affect all ports. So it seems to matter what’s on the other side, too.
I tried the command @pmueller suggested here, but no measurable change.
The kernel update was in core 170: Linux kernel 5.15.59.
We are still on core 169, have many Intel 2xx chips in use, and don’t have that problem. The old PRO/1000 uses a different driver.
I don’t think you have a choice, because I’m not even sure it’s a problem caused by IPFire. Live with it, roll back, or even use the backup system. The weekend is just around the corner; this is the best time to restore/replace the system.
My system next to me, with 2x Intel I211 on core 169 for green and orange, transmitted more than 100 GB from yesterday until this morning with not a single RX/TX error.
Yes, the other interfaces also seem to work without problems. All interfaces are I211, so like I said the problem only appears in conjunction with specific hardware on the receiver side. Yet, the problems started right after updating to core 170.
I think I made an important observation regarding this problem:
The RTT drops back to normal as long as there is enough activity on the interface! If I run a second connection that permanently transfers data between the two hosts, the problem disappears!
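As a stopgap, the link can be kept busy with a low-interval background ping. This is only a workaround sketch, not a fix; the target address is a placeholder for the affected peer:

```shell
# Workaround sketch (assumption: keeping the link busy masks the latency spikes).
# 172.16.227.200 is a placeholder; use the address of the affected peer.
ping -i 0.2 -q 172.16.227.200 > /dev/null &
KEEPALIVE_PID=$!

# ... run the normal traffic while the keepalive is active ...

kill "$KEEPALIVE_PID"   # stop the keepalive when done testing
```

Note that 0.2 seconds is the smallest interval `ping` allows without root privileges.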
This looks like misbehaving power management in the igb driver to me. I remember that there was big trouble with “EEE” (green Ethernet) and the Intel network drivers a long time ago.
Maybe this is a regression in the updated igb module?
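If EEE is the suspect, `ethtool` can show and toggle it per interface. A sketch, assuming an interface named green0 (the name is an assumption; adjust it, and note that not every NIC/driver combination exposes EEE):

```shell
# Show the current EEE state for the interface (interface name is an assumption)
ethtool --show-eee green0

# Temporarily disable EEE to test whether the latency recovers
ethtool --set-eee green0 eee off

# Re-check afterwards; this setting does not survive a reboot
ethtool --show-eee green0
```

If latency recovers with EEE off, that would support the power-management theory without ruling out other causes.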
The device on the “other side” uses an AX88179 USB-to-LAN adapter.
I’ve seen that there are problems when this adapter is used on the IPFire itself. In my case it’s on the remote side. Could this still be the cause of the problems? If so, is there any workaround besides replacing all devices using this chip?
ethtool -i eth1
driver: ax88179_178a
version: 5.15.70-v7+
firmware-version:
expansion-rom-version:
bus-info: 1-1.1.2:1.0
supports-statistics: no
supports-test: no
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
If the adaptor was on an IPFire system then the only workaround that I have seen mentioned is to reinstall CU169 and wait for CU171 to be released with the updated kernel that has had the bug fixed.
As the adaptor is on a machine on your LAN, you need to check if the involved OS has an upgrade to kernel 5.15.68 as that is the kernel version going into CU171 and shown to not have the problem. Alternatively you could look at downgrading the kernel on your LAN machine to 5.15.49, which is the kernel running on CU169 and didn’t show the problem, until the OS ships an update with the required kernel version.
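A quick way to check the LAN machine is to compare its running kernel against 5.15.68 using a version-aware sort. A sketch (the suffix stripping assumes a local tag like the “-v7+” visible in the ethtool output above):

```shell
# Check whether the running kernel is at least 5.15.68
# (the version quoted above as not showing the AX88179 problem).
required=5.15.68
current=$(uname -r | cut -d- -f1)   # strip local suffixes such as "-v7+"

# sort -V orders version strings numerically; if the required version
# sorts first (or is equal), the current kernel is new enough.
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -n1)" = "$required" ]; then
    echo "kernel $current is >= $required"
else
    echo "kernel $current is older than $required"
fi
```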
At the moment it is still in unstable. The people in the AX88179 thread did their testing with that Unstable branch. Their testing confirmed that the new kernel version solved the problem they had been experiencing with the AX88179 USB adaptors…
I would suspect that CU171 is close to going to the Testing branch (my feeling is late next week sometime), but I am not involved in the details of preparing the CU releases, so something might come up while doing the unstable nightly builds that delays it.
So, the change to CU171 will not solve the issue.
The other side of the line should be corrected (up- or downgraded, as @bonnietwin recommended in an earlier post).
I don’t think so. The problem arose suddenly, right after updating the IPFire. The remote system was NOT updated. It is also not solved by, e.g., replacing the USB-to-Ethernet adapter with a device based on a Realtek chip.
However, I will run tests with core 171 on my staging device and see if anything has changed.
The packet errors don’t show up in the IPFire stats. As said before, this is not an IPFire issue, and as bbitsch mentioned, you should check the client kernel version etc. for this known bug of your USB-LAN dongle.
My switches can detect corrupted packets, so they must have a CRC check. I don’t think they would still forward broken packets to the recipient.
I use a PC Engines APU with 3 i211AT NICs. Since the update to 170 I have a lot of packet loss on the remote side/workstation computers.
The problem occurs on all my computers in the network, so it shouldn’t be an issue with the NIC on the remote side.
For example, I use a banking app on my phone to receive the TANs for a transaction. I only receive the TAN when I close the WLAN connection to the IPFire and use the mobile network instead.
If any logs or other info are necessary, please let me know.
After a few more tests I can confirm this. This, for example, is a ping against a Netgear GS724T switch that normally has an RTT of about one millisecond:
ping 172.16.227.200
PING 172.16.227.200 (172.16.227.200) 56(84) bytes of data.
64 bytes from 172.16.227.200: icmp_seq=1 ttl=64 time=710 ms
64 bytes from 172.16.227.200: icmp_seq=2 ttl=64 time=759 ms
64 bytes from 172.16.227.200: icmp_seq=3 ttl=64 time=427 ms
64 bytes from 172.16.227.200: icmp_seq=4 ttl=64 time=674 ms
64 bytes from 172.16.227.200: icmp_seq=5 ttl=64 time=77.7 ms
64 bytes from 172.16.227.200: icmp_seq=6 ttl=64 time=505 ms
64 bytes from 172.16.227.200: icmp_seq=7 ttl=64 time=83.0 ms
(...)
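To compare runs before and after a change (e.g. toggling a setting), the RTT spread can be summarized from saved ping output. A small sketch; ping.log is a hypothetical capture file:

```shell
# Summarize min/avg/max RTT from saved ping output (ping.log is an assumption).
# Works on reply lines like: "64 bytes from ...: icmp_seq=1 ttl=64 time=710 ms"
awk -F'time=' '/time=/ {
    split($2, a, " ")                 # a[1] holds the RTT value in ms
    rtt = a[1] + 0
    if (n == 0 || rtt < min) min = rtt
    if (rtt > max) max = rtt
    sum += rtt; n++
}
END {
    if (n) printf "min/avg/max = %.1f/%.1f/%.1f ms over %d replies\n", min, sum/n, max, n
}' ping.log
```

Running this over the capture above versus a healthy run makes the latency spikes easy to document in a bug report.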
Doing a bandwidth test against my ISP also shows that the burst bandwidth has decreased, so I assume that all ports are affected by the bug to some degree.
Like Mike says, I’m also willing to help by trying out builds or settings on my testing systems.
I filed a bug report for this, as I’m now sure it’s not the same as Bug #12750.
It is very annoying, especially when you use 2FA. The packet losses are so heavy that the tokens are no longer in time sync and a login is almost impossible.
Please let me know if the developers need any information, logs and so on.