IPFire went down last night, can't find cause

Hi Chris, stop OpenVPN, delete the certificate, upgrade to Core 171, and reboot. Then create a new certificate and start OpenVPN.

Seems to help me. It’s worth a try and less work than reinstalling everything.

Best regards

Jürgen Schamberger


If it is a bug in OpenVPN < 2.5.7 then it is not an endemic problem; it must be a combination of an earlier version and something else.

I have been using OpenVPN with IPFire for several years and have not seen any Out Of Memory events in that time (I have grepped through the IPFire logs for "oom" and "Out of memory" and found nothing), and my memory graph has not shown any large spikes. The used memory has varied between 7% min and 27% max.
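For anyone who wants to repeat that check outside of grep, here is a minimal Python sketch of the same search. The function name `find_oom_lines` and the sample lines are made up for illustration and are not part of IPFire. The pattern ignores case and skips words that merely contain "oom", such as "zoom".

```python
import re
from typing import Iterable, List

# "oom" must not be preceded by a letter, so "oom-kill" and "global_oom"
# match while ordinary words like "zoom" or "room" do not.
OOM_PATTERN = re.compile(r"(?<![a-z])oom|out of memory", re.IGNORECASE)


def find_oom_lines(lines: Iterable[str]) -> List[str]:
    """Return every log line that mentions an OOM event."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Hypothetical sample lines, only to show the behaviour.
sample = [
    "kernel: oom-kill:constraint=CONSTRAINT_NONE\n",
    "kernel: Out of memory: Killed process 1234 (example)\n",
    "user joined a zoom call\n",
]
for hit in find_oom_lines(sample):
    print(hit)
```

To scan a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines.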

It might have some relation to interactions with other packages. I have seen some references to qemu also being impacted but I don’t use qemu on my IPFire system.


I’ll try that and post the results later today.

Chris


Unfortunately I was happy too soon. The same thing happened again tonight.

Oct 21 01:49:21 famschamrouter kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=openvpn-authent,pid=7917,uid=0
Oct 21 01:49:21 famschamrouter kernel: Out of memory: Killed process 7917 (openvpn-authent) total-vm:8616456kB, anon-rss:6662008kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:15108kB oom_score_adj:0
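To put those numbers in perspective, here is a small Python sketch that pulls the PID, process name, and memory figures out of an "Out of memory: Killed process" line and converts kB to GiB. The function name `parse_oom_kill` and the returned keys are my own choices; the regular expression only covers the fields visible in the line above.

```python
import re
from typing import Dict, Optional, Union

KILL_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KB_PER_GIB = 1024 * 1024


def parse_oom_kill(line: str) -> Optional[Dict[str, Union[int, str, float]]]:
    """Extract PID, name and memory sizes (in GiB) from an OOM kill line."""
    match = KILL_RE.search(line)
    if match is None:
        return None  # not an OOM kill line
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_gib": int(match.group("total_vm")) / KB_PER_GIB,
        "anon_rss_gib": int(match.group("anon_rss")) / KB_PER_GIB,
    }


# The kernel line quoted in this post.
line = (
    "Oct 21 01:49:21 famschamrouter kernel: Out of memory: Killed process 7917 "
    "(openvpn-authent) total-vm:8616456kB, anon-rss:6662008kB, file-rss:0kB, "
    "shmem-rss:0kB, UID:0 pgtables:15108kB oom_score_adj:0"
)
info = parse_oom_kill(line)
print(f"{info['name']} (pid {info['pid']}): {info['anon_rss_gib']:.2f} GiB resident")
```

For the line above, that works out to roughly 6.35 GiB of resident memory held by a single process at the moment it was killed.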


So far it has not happened to me, uptime is now a whopping 14 hours and counting.

I have run into another issue though. Since upgrading both sides of an IPSec tunnel to 171, we have lost communication between the sites. The tunnel appears connected, but I cannot ping either side from the other. Has anyone else experienced this?

I created a bug report in IPFire Bugzilla

https://bugzilla.ipfire.org/show_bug.cgi?id=12963

Jürgen


I installed Core 168, 169, 170 and 171 on another machine today and tested it with a minimal configuration.

The problem seems to have existed since Core 169. On Core 168 the openvpn-authenticator process doesn’t start because the file in /usr/sbin doesn’t exist.

I created a certificate in Core 168 and then upgraded to Core 171. With this “old” certificate, OpenVPN can be stopped and started without any problems.

I think there is a problem with Core 169’s certificate generation.
Was a new OpenVPN version installed with Core 169 or is it TOPT?

Jürgen

I have no idea what TOPT means.

OpenVPN was updated from 2.5.4 to 2.5.6 in CU169 and then to 2.5.7 in CU171

From 2.5.4 to 2.5.6 there was no change to the rootfile and openvpn-authenticator does not come from OpenVPN. It is a separate program created as part of the 2FA addition to OpenVPN, which was introduced with CU169.

Are you using the 2FA option with OTP selected on your clients?

I am not using OTP with my OpenVPN connection and I don’t have openvpn-authenticator running at all, although it is available in the sbin directory. If you are using OTP, what happens if you turn it off? Does openvpn-authenticator still max out on memory?

If the issue is actually with the certificate generation then that would not be coming from OpenVPN but from OpenSSL.

OpenSSL was updated from 1.1.1o to 1.1.1p in CU169 and then to 1.1.1q in CU170

In OpenVPN-2.5.6 the only change mentioned related to certificates is

repair handling of EC certificates on Windows with pkcs11-helper
So this relates to Windows and to pkcs11, neither of which is used by OpenVPN on IPFire.


Sorry, I’m not an expert, just a user. I meant OTP. I can only report what I observe. The problem seems to occur from Core 169 onwards when creating a new certificate. I don’t know if it’s OpenSSL.
I don’t use OTP.

If you are not using OTP then I don’t understand why openvpn-authenticator is even running and consuming memory. On my system openvpn-authenticator does not appear at all in the list that htop shows.
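If you want to check this without htop, here is a rough Python sketch that looks through /proc for a process by name. The function name `pids_with_name` and the `proc_root` parameter are my own. Linux truncates the short process name in /proc/<pid>/comm to 15 characters, which is why the kernel log earlier in this thread shows "openvpn-authent" rather than the full name, so the sketch compares against the truncated form.

```python
import os
from typing import List


def pids_with_name(name: str, proc_root: str = "/proc") -> List[int]:
    """Return the PIDs whose /proc/<pid>/comm matches the given name."""
    wanted = name[:15]  # comm holds at most 15 characters
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                comm = handle.read().strip()
        except OSError:
            continue  # the process exited while we were looking
        if comm == wanted:
            pids.append(int(entry))
    return sorted(pids)
```

Calling `pids_with_name("openvpn-authenticator")` on the firewall returns an empty list when the process is not running.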

When you mention

Do you mean creating the overall global root and host certificate or do you mean creating a client connection certificate?
When I can get some time I will test creating the appropriate certificate on a CU169 system in my VM testbed.

Yes exactly, I created the entire global root and host certificate.

After that the process openvpn-authenticator starts.

The client connections are all deleted and new ones must be created.

Okay. All my testing has been based on an existing host certificate created from much earlier.

I will test with a new host certificate as soon as I can get some time.
If I can duplicate your issue, that will help the developers, as they can then check why it is happening on their own systems.


I remove this

[Screenshot from 2022-10-22 17-55-02]

and then I create this

and then I create a new client connection and start OpenVPN.

When I stop OpenVPN, the process openvpn-authenticator starts.


At 8:10 AM this morning, while running Core Update 171, IPFire restarted for no apparent reason, right in the middle of the engineering meeting when all of the managers were on a Zoom call. This is the worst possible time.

I am seeing nothing in the logs except a black hole of time. The issue occurred at 08:08:20, then there’s a gap in /var/log/messages until 08:10:26. Log file attached for this portion of messages and the bootlog is attached. My boss has been getting frustrated as have I.
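In case it helps others look for similar holes in their own logs, here is a minimal Python sketch that reports gaps between consecutive syslog timestamps. The function name `find_log_gaps`, the 60-second threshold, and the sample lines (including the hostname "host") are assumptions for illustration. Syslog timestamps carry no year, so one has to be supplied, and month names are parsed assuming an English locale.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def find_log_gaps(
    lines: Iterable[str],
    min_gap: timedelta = timedelta(seconds=60),
    year: int = 2022,
) -> List[Tuple[datetime, datetime]]:
    """Return (before, after) timestamp pairs separated by at least min_gap."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters look like "Oct 24 08:08:20".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


# Hypothetical lines using the two timestamps mentioned in this post.
sample = [
    "Oct 24 08:08:20 host kernel: last message before the gap",
    "Oct 24 08:10:26 host kernel: first message after the gap",
    "Oct 24 08:10:27 host kernel: normal logging resumes",
]
for before, after in find_log_gaps(sample):
    print(f"gap of {after - before} between {before} and {after}")
```

A gap like this only shows that nothing was written to the log; it does not say whether the cause was power, hardware, or software.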

Please help.
bootlog.gz (14.2 KB)

messages.10.24.22.txt.gz (100.1 KB)

In the WUI /status/hardware graphs, is the CPU temperature in the normal range?

There was a brief spike after the outage, at about 8:30, but only to 38 degrees C.

That is considerably lower than at other times in the month or the year.

[Graph: Year View]

Is this the same machine we spoke about in Post 28 (7 days ago)?

Did anything change besides the CU 171 update?

I see the black hole but I do not see any issues in the message log. Hopefully someone else can take a look and make sure.

I don’t see any OOM messages. And it doesn’t look like openvpn was the cause either. I don’t see openvpn until 08:11:09.

Can you look through the other graphs for blips near 8:10?

Sorry to say this still seems like a power issue or a hardware issue of some sort. I hate to blame those things but that is my best guess at the moment.

EDIT: the temperatures look fine. My device runs hotter, near 46 °C. I can post graphs of my device if that helps.

Yes, it is the same machine. Nothing else has changed except the CU 171 update. I’m shutting down the device and moving the power source after hours tonight.

It could be hardware, it could be power, or it could be an issue with the OpenVPN authentication process consuming all of the CPU until the unit reboots, which is what Jürgen and Adolf were discussing above.

I have two identical physical hardware devices here. Both are Supermicro SYS-E200-9B units with quad-core Intel Pentium N3710 1.6 GHz processors, 8 GB RAM, a 120 GB SSD, quad Intel I210 NIC controllers, and IPMI for monitoring. I also have a custom-built ATX 4U server in a rackmount case with a 2.4 GHz Xeon E5 CPU, 8 GB RAM, a 500 GB SATA WD Black drive, four Intel NICs on the motherboard, and a single 600 W power supply.

All three of those units have had some kind of outage like this. It seems unlikely that three separate machines would all fail with hardware problems, so at the moment I’m leaning towards one of three explanations: a slight brownout in the power supplied to the machines; a bug in the OpenVPN authentication mechanism that overloads the CPU and makes the system reboot regardless of the hardware present; or something corrupted in my configuration, so that when I back up the config and restore it to the other units, the problems in the config follow it onto the other hardware. I have not tried a clean reinstall of IPFire on one of these devices and typing in all of the configuration entries by hand, but I can try that if I have to.

Chris

Ok, not the CPU thermal. How about:

  • RAM failure;
  • hard disk failure;
  • power supply issues (e.g. overheating).

Can’t think of anything else that would cause a reboot, as opposed to a crash.

That sounds odd.
Are they happening at the same times?
Perhaps time for a UPS.