IPFire went down last night, can't find cause

I found some threads discussing certain combinations of firmware and Linux kernel, like this one, and there are many more to be found. There is also a Linux bug report here.

Unfortunately, it happened again. This time, as I came into the office at 7:55 AM, I was told the network was down and that it had gone down within the last few minutes.

My work PC could not reach any IP address, and I could not ping the router. I went into the server room: the router was still powered on, I could log in at the console, and I could ping Google from the console.

I tried restarting dhcpd, but the other machines still had no internet. I tried restarting unbound, still nothing. I then rebooted the router, and we were back online. No changes were made to any other devices.

Attached are the sanitized bootlog and the portion of /var/log/messages from today, with names, IPs and MACs omitted for security. Please help.

bootlog.gz (14.2 KB)
messags.10.27.22.txt.gz (906.2 KB)

Please add these graphs:

Here they are

Out of curiosity: do you have IPsec enabled also? Do you use it?

https://ipfire.localdomain:444/cgi-bin/vpnmain.cgi

Yes, it is enabled and we use it for a point to point vpn tunnel to a second office.

Things were running when ROOT logged in and did a restart near 8:01.

It looks like the green network stopped talking to the IPFire device near 07:40. This is the last communication:

Oct 27 07:40:40 ipfire kernel: FORWARDFW IN=green0 OUT=red0 MAC=<green-mac>:7c:cb:e2:e4:d0:b8:08:00 SRC=<green-network.>26 DST=40.126.28.14 LEN=52 TOS=0x00 PREC=0x00 TTL=127 ID=2972 DF PROTO=TCP SPT=64922 DPT=443 WINDOW=8192 RES=0x00 SYN URGP=0

(blue stopped a minute before)

but I don’t see the “why”…
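To double-check when each zone went quiet, the last logged packet per inbound interface can be pulled out of the log with a short script. This is only a rough sketch: it assumes the firewall lines all look like the one quoted above (`... kernel: <PREFIX> IN=<iface> ...`), and since syslog timestamps carry no year, the `year=2022` default is my assumption. The function name is made up for illustration.

```python
import re
import sys
from datetime import datetime

# Matches firewall log lines shaped like the one quoted above, e.g.
# "Oct 27 07:40:40 ipfire kernel: FORWARDFW IN=green0 OUT=red0 ..."
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\S+ IN=(?P<iface>\S*) "
)


def last_seen_per_interface(lines, year=2022):
    """Return {interface: datetime of the last packet logged with IN=interface}."""
    last = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m or not m.group("iface"):
            continue  # not a firewall line, or no inbound interface recorded
        # Syslog timestamps have no year, so one is supplied (assumption).
        stamp = datetime.strptime(
            f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        iface = m.group("iface")
        if iface not in last or stamp > last[iface]:
            last[iface] = stamp
    return last


if __name__ == "__main__":
    # Usage: python3 last_seen.py /var/log/messages
    with open(sys.argv[1], errors="replace") as fh:
        for iface, stamp in sorted(last_seen_per_interface(fh).items()):
            print(f"{iface}: last logged packet at {stamp}")
```

If green0 and blue0 both stop within a minute of each other while red0 keeps logging, that would confirm the pattern described above, though it still would not explain the cause.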

Can you post these graphs:
https://ipfire.localdomain:444/cgi-bin/netinternal.cgi

The system uptime was 2.5 days; it had been up since I replaced the power supply on the unit and moved it to a different battery backup. I thought that would ultimately be the fix, but it went down again today … :thinking:

To me, this looks very different from the other issues (again, I don’t know why).

What are the GREEN and BLUE networks plugged into? Is it two different switches (or hubs)?

It is odd that BLUE & GREEN both stopped within a minute. And RED kept on running.
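If it happens again, it might be worth capturing the link state of each interface at the console before rebooting, to see whether the kernel still thinks GREEN and BLUE have a link. Below is a minimal sketch that reads the standard Linux sysfs entries under `/sys/class/net`. The interface names `green0`, `blue0` and `red0` are assumed from the log line and the usual IPFire naming, and the helper name is hypothetical.

```python
from pathlib import Path


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Collect link state and packet counters for one interface from sysfs."""
    base = Path(sysfs_root) / iface

    def read(rel):
        try:
            return (base / rel).read_text().strip()
        except OSError:
            # Missing interface, or an attribute that cannot be read right now
            # (for example 'carrier' while the interface is down).
            return None

    return {
        "operstate": read("operstate"),            # e.g. "up" or "down"
        "carrier": read("carrier"),                # "1" if a link is detected
        "rx_packets": read("statistics/rx_packets"),
        "tx_packets": read("statistics/tx_packets"),
    }


if __name__ == "__main__":
    # Interface names are assumptions; adjust to match the system.
    for iface in ("green0", "blue0", "red0"):
        print(iface, read_link_state(iface))
```

Running it twice a few seconds apart shows whether the rx/tx counters are still moving. A link that is up with frozen counters points in a different direction than a link that has dropped.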

Blue is plugged into a standalone wireless access point that we use for guest internet access for customers who come into the building.

Green is plugged into our core switch

Red is plugged into the Fiber modem provided by our ISP

I’m strongly considering refreshing the hardware of our IPFire, but the prebuilt Lightning Wire Labs boxes I reviewed don’t have some of the features that I want. The specs I’m looking at are these:

Rackmount Unit
Single Xeon CPU
8+ GB ECC RAM
RAID 1 w/ 2 SSDs and controller card
Redundant power supplies
4 Intel NIC Controller Cards all 1Gbps, or possibly some 10Gbps cards for green to go between the IPFire and the Core Switch.
Onboard Encryption hardware chip

Thoughts? What vendors do you guys use?

Lots! Selfmade.

What brands of components do you like?

Never had probs with Advantech or Supermicro motherboards. Case → Chenbro. I don’t use hardware raid anymore → oldschool. For reg. ECC memory I use Samsung. SSD for enterprise usage WD red.

Is the core switch also plugged into a UPS?

Yes it is.

Any other thoughts on troubleshooting why it went down? Should I just replace the hardware?

Chris

My thought is the following: this last event is different from the previous events, or at least it looks different. Before, a reboot. Now, loss of connectivity.

Also, despite your best efforts, you cannot pinpoint an obvious culprit, like a faulty hard disk or memory bank.

My opinion is that either there is a third cause outside your IPFire machine(s), or you have a strange and unlucky combination of several unrelated events.

I would favor the first hypothesis, but I have seen more than once that the second happens more frequently than we would believe possible.

I regret I cannot be more specific, as your troubleshooting is clearly challenging. I hope you can find out what’s going on.

What’s your BIOS setting for ‘Power on after power fail’? Usually the system stays off. I have a NAS with a Jetway motherboard that had a bad power supply and also recently restarted even though it should have stayed off after a power outage. However, I replaced the power supply and the problem is gone.
