IPFire went down last night, can't find cause

cfusco · 24 October 2022 18:13

In the WUI /status/hardware graphs is the CPU temperature in normal range?

cwensink · 24 October 2022 18:45

There was a brief spike after the outage, at about 8:30, but only to 38 degrees C.

That is considerably lower than other times in the month:

Or Year:


jon · 24 October 2022 19:34

Is this the same machine we spoke about in Post 28 (7 days ago)?

Did anything change besides the CU 171 update?

I see the black hole but I do not see any issues in the message log. Hopefully someone else can take a look and make sure.

I don’t see any OOM messages. And it doesn’t look like openvpn was the cause either. I don’t see openvpn until 08:11:09.

Can you look through the other graphs for blips near 8:10?

Sorry to say this still seems like a power issue or a hardware issue of some sort. I hate to blame those things but that is my best guess at the moment.

EDIT: the temperatures look fine. My device runs hotter near 46ºC. I can post graphs of my device if that helps.

cwensink · 24 October 2022 19:48

Yes, it is the same machine. Nothing else has changed except the CU 171 update. I’m shutting down the device and moving the power source after hours tonight.

It could be hardware, it could be power, or it could be an issue with the openvpn authentication process consuming all of the CPU until the unit reboots, which is what Jurgen and Adolf were discussing above.

I have two identical physical hardware devices here. Both are Supermicro SYS-E200-9B units with quad-core Intel Pentium N3710 1.6 GHz processors, 8 GB RAM, a 120 GB SSD, quad Intel I210 NIC controllers and an IPMI for monitoring. I also have a custom-built ATX 4U server in a rackmount case that has a Xeon 2.4 GHz E5 CPU, 8 GB RAM, and a 500 GB SATA WD Black drive in it. Its motherboard has 4 Intel NICs as well, and it has a single 600 W power supply.

All three of those units have had some kind of outage like this. It seems unlikely that 3 separate machines would all fail with hardware problems, so at the moment I’m leaning towards one of three causes: a slight brownout in the power supplied to the machines; a bug in the openvpn authentication mechanism that overloads the CPU and makes the system reboot regardless of the hardware present; or something corrupted in my configuration, so that when I back up the config and restore it to the other units, the problems follow the config onto the other hardware. I have not tried a clean re-install of IPFire on one of these devices with all of the configuration entries typed in by hand, but I can try that if I have to.

Chris

cfusco · 24 October 2022 19:48

Ok, not the CPU thermal. How about:

ram failure;
hard disk failure;
power supply issues (e.g. overheating).

Can’t think of anything else that would cause a reboot, as opposed to a crash.

hvacguy · 24 October 2022 19:51

That sounds odd.
Are they happening at the same times?
Perhaps time for a UPS.

cfusco · 24 October 2022 19:52

Linux kernel should kill the process, not trigger a reboot.

cwensink · 24 October 2022 20:02

I ran memtest86 for 86 cycles on the unit on my desk and it reported 0 errors. I do see a hard drive performance difference between the two SYS-E200 units, though. The unit in production right now, when running hdparm -tT /dev/sda, is getting:
[root@ipfire ~]# hdparm -tT /dev/sda

/dev/sda:
Timing cached reads: 2748 MB in 1.99 seconds = 1377.91 MB/sec
Timing buffered disk reads: 1162 MB in 3.00 seconds = 387.00 MB/sec
[root@ipfire ~]#

The unit on my desk:

[root@ipfire ~]# hdparm -tT /dev/sda

/dev/sda:
Timing cached reads: 2724 MB in 1.99 seconds = 1365.61 MB/sec
Timing buffered disk reads: 212 MB in 3.01 seconds = 70.52 MB/sec
[root@ipfire ~]#

Both units appear to be online and working fine. I don’t know of a better Linux hard drive test to run.
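
To put a number on the difference between the two hdparm runs, here is a minimal Python sketch. It assumes the hdparm output has been saved as text, and it recomputes each rate from the "MB in N seconds" figures rather than trusting the printed MB/sec value. The names `parse_hdparm` and `buffered_ratio` are made up for illustration.

```python
import re

# Matches hdparm timing lines such as:
#   Timing buffered disk reads: 1162 MB in 3.00 seconds = 387.00 MB/sec
HDPARM_LINE = re.compile(
    r"Timing (cached reads|buffered disk reads):\s+(\d+) MB in\s+([\d.]+) seconds"
)

def parse_hdparm(output: str) -> dict:
    """Return {'cached reads': MB/s, 'buffered disk reads': MB/s}."""
    rates = {}
    for kind, megabytes, seconds in HDPARM_LINE.findall(output):
        # Recompute the rate from the raw figures on the line.
        rates[kind] = int(megabytes) / float(seconds)
    return rates

def buffered_ratio(fast: str, slow: str) -> float:
    """How many times faster the first unit's buffered disk reads are."""
    key = "buffered disk reads"
    return parse_hdparm(fast)[key] / parse_hdparm(slow)[key]

# Figures copied from the two runs in this post.
PRODUCTION = """/dev/sda:
Timing cached reads: 2748 MB in 1.99 seconds = 1377.91 MB/sec
Timing buffered disk reads: 1162 MB in 3.00 seconds = 387.00 MB/sec
"""
DESK = """/dev/sda:
Timing cached reads: 2724 MB in 1.99 seconds = 1365.61 MB/sec
Timing buffered disk reads: 212 MB in 3.01 seconds
"""

if __name__ == "__main__":
    print(f"production is {buffered_ratio(PRODUCTION, DESK):.1f}x faster on buffered reads")
```

The cached-read figures are nearly identical, so the gap is in the drive or its link rather than in CPU or RAM. A single hdparm run is noisy, so it may be worth repeating it a few times before drawing conclusions.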

As far as the overheating, there’s an issue with the AC unit in our server room, where it is stuck in “demo on” mode, not auto, so it’s actually blowing AC constantly at a much cooler temp than what I need it to be, which is why the temps are low on the graphs. In that room the temp is in the 50’s indoors with cold air blowing on it. I usually wear a hoodie when going in there because it’s so chilly.

A failing power supply is a possibility. I am moving the unit in production to a UPS after hours. When troubleshooting this a while ago, I plugged a different power supply directly into the wall.

What other hard drive tests are out there to test out the drive(s)?

Is there any APC battery backup monitoring logging software that can capture the voltage levels coming into the unit and graph that over time to make sure the device is getting clean power?
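
One possible approach, sketched in Python under some assumptions: the apcupsd daemon is installed and talking to the UPS, its `apcaccess status` command is available on the machine running the script, and the UPS reports the `LINEV` (line voltage) field. The function names and the sample values below are illustrative only and are not taken from this setup.

```python
import csv
import subprocess
import time

def parse_apcaccess(text: str) -> dict:
    """Split the 'KEY : value' lines printed by apcaccess into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def line_voltage(fields: dict) -> float:
    """LINEV looks like '121.0 Volts'; return the number only."""
    return float(fields["LINEV"].split()[0])

def log_voltage(path: str, interval_s: int = 10) -> None:
    """Append (timestamp, status, line voltage) rows to a CSV file until stopped."""
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        while True:
            out = subprocess.run(
                ["apcaccess", "status"],
                capture_output=True, text=True, check=True,
            ).stdout
            fields = parse_apcaccess(out)
            writer.writerow([
                time.strftime("%Y-%m-%d %H:%M:%S"),
                fields.get("STATUS", ""),
                line_voltage(fields),
            ])
            fh.flush()  # keep the rows on disk if the box loses power
            time.sleep(interval_s)

# Illustrative sample only; real output has many more fields.
SAMPLE_STATUS = """DATE     : 2022-10-24 19:02:30 -0500
STATUS   : ONLINE
LINEV    : 121.0 Volts
BCHARGE  : 100.0 Percent
"""

if __name__ == "__main__":
    print(line_voltage(parse_apcaccess(SAMPLE_STATUS)))
```

The resulting CSV can be graphed with any spreadsheet. A short polling interval helps, because a brownout that lasts under a second can easily fall between two samples.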

Does anyone use other power meters or voltage regulators to protect their IPFire for enterprise environments?

I’m also getting quotes for new IPFire hardware. I would like:
-Xeon Processor
-8GB+ ECC Ram
-RAID 1 SSD drives
-Redundant power supplies
-4x Intel dedicated controller cards, at least 1 Gbps each
-Dedicated IPMI for hardware monitoring.

I looked into Lightning Wire Labs but they don’t have any models with redundant drives or redundant power supplies.

Any suggestions for a vendor that meets those specs?

jon · 24 October 2022 20:23

Hopefully you’d see a CPU overload on this page
https://ipfire.localdomain:444/cgi-bin/system.cgi

or a Memory overload on this page:
https://ipfire.localdomain:444/cgi-bin/memory.cgi

Since electrical is common to all of the IPFire devices, a UPS may help point us in the right direction. The UPS should have a way to monitor for blips. And maybe voltage.

I am curious, what is IPMI? Does it have the ability to reset the IPFire device? (I am grasping at straws here!)

cwensink · 24 October 2022 20:36

CPU Page:

Memory Page:

IPMI Info: Supermicro Intelligent Management (IPMI) | Supermicro Server Management Utilities | Supermicro
All the major vendors have this kind of functionality. Dell calls it iDRAC, HP calls it iLO, for Integrated Lights-Out. It’s essentially a web interface that runs through Java and lets you look at the local console remotely, so you can reset the machine and look at the BIOS. There are also some integrations and monitoring tools. It is sort of like a KVM over IP. These are handy if you have a server in a remote location that goes down while the network is still up: you can remotely get into the device, run hardware diagnostics, give it a reboot, check for hardware problems, etc.

Chris

jon · 24 October 2022 20:49

All of the graphs look right to me (nothing odd anyway).

Is IPMI currently installed on the existing IPFire devices?

cfusco · 24 October 2022 20:52

Java you say? You can do anything to the machine,

remotely.

In a java environment.

I am sure that is not exploitable at all.

Out of curiosity, can you rip that thing out and burn it at the subatomic level?

cwensink · 25 October 2022 02:15

Sure, Putin seems to have some ideas on burning atoms lately

cwensink · 25 October 2022 02:16

I don’t have IPMI plugged in at this moment, as I’m trying to narrow down the possibilities of what might be failing.

Chris

cwensink · 25 October 2022 02:27

Update for everyone: IPFire went down without explanation, this time while plugged into an APC Smart UPS 2200 battery backup unit. I could not get into it remotely by any means. I came into the server room to find it completely off. I swapped out the power supply for the one from the unit on my desk, moved the power cord to the other battery backup unit in the server room (same model), and booted back up.

The last entry of the log before losing power:
Oct 24 19:02:34 ipfire kernel: DROP_INPUT in-red0 OUT= MAC: XX:XX:XX:XX:XX:XX

Then after Power on:
Oct 24 20:34:14 ipfire syslogd 1.5.1: restart (remote reception).
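
To find silent stretches like this across a whole messages file, a minimal Python sketch follows. It assumes the classic syslog prefix `Mon DD HH:MM:SS` with no year, so the year has to be supplied, and a log that spans New Year would need extra handling. The names `parse_ts` and `find_gaps` are made up for illustration.

```python
from datetime import datetime, timedelta

def parse_ts(line: str, year: int) -> datetime:
    """Parse the 15-character 'Mon DD HH:MM:SS' prefix of a syslog line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, year, threshold=timedelta(minutes=5)):
    """Return (last_seen, next_seen, gap) for every silence longer than threshold."""
    gaps = []
    prev_ts = None
    for line in lines:
        try:
            ts = parse_ts(line, year)
        except ValueError:
            continue  # skip lines without a timestamp, e.g. runs of '#'
        if prev_ts is not None and ts - prev_ts > threshold:
            gaps.append((prev_ts, ts, ts - prev_ts))
        prev_ts = ts
    return gaps

# The two entries quoted above, with a junk line in between.
SAMPLE = [
    "Oct 24 19:02:34 ipfire kernel: DROP_INPUT in-red0 OUT= MAC: XX:XX:XX:XX:XX:XX",
    "####################",
    "Oct 24 20:34:14 ipfire syslogd 1.5.1: restart (remote reception).",
]

if __name__ == "__main__":
    for last_seen, next_seen, gap in find_gaps(SAMPLE, 2022):
        print(f"silent from {last_seen} to {next_seen} ({gap})")
```

For the two entries above, the gap is 1 hour 31 minutes 40 seconds, which covers both the outage itself and the time until the unit was powered back on by hand.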

Chris

schami23 · 25 October 2022 10:36

Have you activated the connection scheduler?

If so, please disable this.

I think the problem is when a new connection is initialized, the openvpn-authenticator process is started and runs until the memory is full.

Or you kill the openvpn-authenticator process manually.

Maybe that will bring a solution until the bug is fixed.
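
To see whether the process really is growing before deciding to kill it, here is a minimal Python sketch. It assumes a Linux `/proc` filesystem and that the process can be recognised by the string `openvpn-authenticator` in its command line. It matches on the command line because the short process name the kernel keeps is cut off at 15 characters. The function names are made up for illustration.

```python
import os

def rss_kib(status_text: str) -> int:
    """Extract VmRSS (resident memory in KiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line

def find_processes(needle: str, proc_root: str = "/proc"):
    """Yield (pid, rss_kib) for each process whose command line contains needle."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                # Arguments are NUL-separated in /proc/<pid>/cmdline.
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
            if needle not in cmdline:
                continue
            with open(os.path.join(proc_root, entry, "status")) as fh:
                status = fh.read()
        except OSError:
            continue  # the process exited while we were looking
        yield int(entry), rss_kib(status)

if __name__ == "__main__":
    for pid, rss in find_processes("openvpn-authenticator"):
        print(f"pid {pid}: {rss / 1024:.1f} MiB resident")
```

Run from cron every minute with the output appended to a file, this would show whether memory climbs steadily in the run-up to an outage.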

cwensink · 25 October 2022 13:01

I had a connection scheduled but turned off, and I removed the connection at 5:00 AM. I’ll watch for CPU spikes and problems during the day today. It may just have been the power supply of the unit the whole time. We will see.

Chris

cwensink · 25 October 2022 13:24

I’m not sure if this is related but upon boot up of the system at my desk running 171, I’m also getting this error:

jon · 25 October 2022 16:06

Do you use nrpe?

If you use nrpe then you can try uninstalling and then re-installing.

You’ll see the symlink in the “Create symlinks” section of the wiki.

https://wiki.ipfire.org/addons/nagios-nrpe#create-symlinks

bonnietwin · 25 October 2022 19:32

Reading through the messages file, there is a normal kernel message about a firewall DROP_INPUT for some packet on the red interface, and then just a load of # characters. After that comes the message that IPFire is restarting, with the checks for the network interfaces, the CPU, etc., which happens after the filesystem has been recovered as recorded in the bootlog.

For there to be no message in the log at all before IPFire starts again suggests that something caused a hard restart, the equivalent of pulling the plug on the pc.
A normal restart process would have all the messages about IPFire stopping the various services in order and none of these are present.
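
The same check can be sketched in Python: for every syslogd restart line, look at the few lines before it for any sign of an orderly shutdown. The marker string `exiting on signal` is an assumption about what syslogd writes when it is stopped cleanly, so adjust it to whatever a known-good reboot leaves in your own log. The name `classify_restarts` is made up for illustration.

```python
def classify_restarts(lines, shutdown_markers=("exiting on signal",), lookback=5):
    """Label each syslogd restart 'clean' or 'hard'.

    A restart counts as 'clean' if one of shutdown_markers (an assumed,
    adjustable list) appears in the `lookback` lines just before it.
    """
    results = []
    for i, line in enumerate(lines):
        if "syslogd" in line and ": restart" in line:
            window = lines[max(0, i - lookback):i]
            clean = any(marker in prev for prev in window for marker in shutdown_markers)
            results.append((line[:15], "clean" if clean else "hard"))
    return results

# The entries quoted earlier in this thread, with the '#' junk between them.
OUTAGE = [
    "Oct 24 19:02:34 ipfire kernel: DROP_INPUT in-red0 OUT= MAC: XX:XX:XX:XX:XX:XX",
    "####################",
    "Oct 24 20:34:14 ipfire syslogd 1.5.1: restart (remote reception).",
]

if __name__ == "__main__":
    for stamp, kind in classify_restarts(OUTAGE):
        print(stamp, kind)
```

On the excerpt from this outage it reports a hard restart, which matches the reading above: nothing was logged between the last firewall entry and the daemon starting again.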

Reading through the bootlog for the messages after the system came back up, I found the file recovery messages for sda4 (root file system) indicating that the whole system was just effectively switched off as the filesystem had to be checked due to files that were not closed properly.
The bootlog finishes with the remounting of the drives in read write mode after the boot process has finished.

I have done a search about finding the causes for random reboots with no log messages and unfortunately I couldn’t find anything to help track it down any further.