IPFire went down last night, can't find cause

Please add these graphs:

Here they are

Out of curiosity: do you have IPsec enabled also? Do you use it?

https://ipfire.localdomain:444/cgi-bin/vpnmain.cgi

Yes, it is enabled and we use it for a point to point vpn tunnel to a second office.

Things were running when ROOT logged in and did a restart near 8:01.

It looks like the green network stopped talking to the IPFire device near 07:40. This is the last communication:

Oct 27 07:40:40 ipfire kernel: FORWARDFW IN=green0 OUT=red0 MAC=<green-mac>:7c:cb:e2:e4:d0:b8:08:00 SRC=<green-network.>26 DST=40.126.28.14 LEN=52 TOS=0x00 PREC=0x00 TTL=127 ID=2972 DF PROTO=TCP SPT=64922 DPT=443 WINDOW=8192 RES=0x00 SYN URGP=0

(blue stopped a minute before)

But I don’t see the “why”…
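To compare when each zone went quiet, the last logged packet per inbound interface can be pulled out of /var/log/messages. The following is only a minimal sketch: the function name `last_seen_per_interface` is made up for illustration, the regular expression assumes the kernel log format shown in the line above, and because syslog timestamps carry no year, the `year` argument is a placeholder. Month names are parsed in the English/C locale.

```python
import re
from datetime import datetime

# Matches the syslog timestamp and the IN= field of a netfilter log line,
# e.g. "Oct 27 07:40:40 ipfire kernel: FORWARDFW IN=green0 OUT=red0 ..."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?\bIN=(\w*)"
)


def last_seen_per_interface(lines, year=2000):
    """Return {interface: datetime of the last logged packet}.

    `year` is a placeholder (assumption): syslog lines do not record one.
    """
    last = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match or not match.group(2):
            continue  # not a firewall line, or no inbound interface
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        iface = match.group(2)
        if iface not in last or stamp > last[iface]:
            last[iface] = stamp
    return last


if __name__ == "__main__":
    # The single line quoted in the post above, used as sample input.
    sample = [
        "Oct 27 07:40:40 ipfire kernel: FORWARDFW IN=green0 OUT=red0 "
        "MAC=<green-mac>:7c:cb:e2:e4:d0:b8:08:00 SRC=<green-network.>26 "
        "DST=40.126.28.14 LEN=52 TOS=0x00 PREC=0x00 TTL=127 ID=2972 DF "
        "PROTO=TCP SPT=64922 DPT=443 WINDOW=8192 RES=0x00 SYN URGP=0"
    ]
    for iface, stamp in sorted(last_seen_per_interface(sample).items()):
        print(iface, stamp.strftime("%b %d %H:%M:%S"))
```

Passing an open file handle for the log instead of the sample list would give one timestamp each for green0, blue0 and red0, which makes a gap between zones easy to spot.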

Can you post these graphs:
https://ipfire.localdomain:444/cgi-bin/netinternal.cgi

The system’s uptime was 2.5 days; it had been up since I replaced the power supply on the unit and moved it to a different battery backup. I thought that would ultimately be the fix, but it went down again today … :thinking:

To me, this looks very different from the other issues (again, I don’t know why).

What are the GREEN and BLUE networks plugged into? Are they on two different switches (or hubs)?

It is odd that BLUE and GREEN both stopped within a minute of each other, while RED kept on running.

Blue is plugged into a standalone wireless access point that we use for guest internet access for customers who come into the building.

Green is plugged into our core switch.

Red is plugged into the fiber modem provided by our ISP.

I’m strongly considering refreshing the hardware of our IPFire, but the prebuilt Lightning Wire Labs boxes I reviewed don’t have some of the features that I want. The specs I’m looking at are these:

Rackmount Unit
Single Xeon CPU
8+ GB ECC RAM
RAID 1 w/ 2 SSDs and controller card
Redundant power supplies
4 Intel NIC Controller Cards all 1Gbps, or possibly some 10Gbps cards for green to go between the IPFire and the Core Switch.
Onboard Encryption hardware chip

Thoughts? What vendors do you guys use?

Lots! Self-built.

What brands of components do you like?

I’ve never had problems with Advantech or Supermicro motherboards. Case → Chenbro. I don’t use hardware RAID anymore → old-school. For registered ECC memory I use Samsung. For enterprise-use SSDs, WD Red.

Is the core switch also plugged into a UPS?

Yes it is.

Any other thoughts on troubleshooting why it went down? Should I just replace the hardware?

Chris

My thought is the following: this last event is different from the previous events, or at least it looks different. Before: reboot. Now: loss of connectivity.

Also, despite all your best efforts, you cannot pinpoint an obvious culprit, like a faulty hard disk or memory bank.

My opinion is that either there is a third cause outside your IPFire machine(s), or you have a strange and unlucky combination of several unrelated events.

I would favor the first hypothesis, but I have seen more than once that the second happens more frequently than we would believe possible.

I regret I cannot be more specific, as your troubleshooting is clearly challenging. I hope you can find out what’s going on.

What’s your BIOS setting for ‘Power on after power fail’? Usually the system stays off. I have a NAS with a Jetway motherboard that had a bad power supply; it also recently restarted even though it should have stayed off after a power outage. After I replaced the power supply, the problem was gone.

Well, the problem continues under exactly the same circumstances. IPFire rebooted unexpectedly today (while I was out of the building at lunch), and from the logs I see that it also restarted at 5:25 PM last night. So far I have plugged the hardware appliance’s power supply directly into a Smart APC 2200 power supply, swapped out the hardware appliance, and swapped the power supply itself for a duplicate model. The IPFire 171 and 170 units both had the same problems. On the backup unit at my desk I saw decreased hard drive performance compared to the live, working unit, but it was still online.

Memory Graph:

System:

/var/log/messages shows normal DROP_INPUT traffic right up to the point of failure, where there is only a long run of ^@ characters followed by the syslogd restart:

^@^@^@^@^@^@^@^@ [run of ^@ shortened] Nov 7 13:10:05 ipfire syslogd 1.5.1: restart (remote reception).
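^@ is how many pagers and editors display a NUL byte. Runs like this are often reported after an abrupt power loss or hard reset, when log data never made it to disk, but that is an assumption here, not a diagnosis. To find every such gap in the file together with the last line written before it, something like the following sketch could be used. The function name `find_nul_runs` and the 16-byte threshold are arbitrary choices for illustration.

```python
import re


def find_nul_runs(data, min_len=16):
    """Return [(offset, length, line_before, line_after)] for NUL-byte runs.

    `data` is the raw content of the log file as bytes. `min_len` is an
    arbitrary threshold (assumption) to skip short runs.
    """
    runs = []
    for match in re.finditer(rb"\x00{%d,}" % min_len, data):
        # Walk back over newlines so we report the last non-empty line.
        end = match.start()
        while end > 0 and data[end - 1:end] == b"\n":
            end -= 1
        start = data.rfind(b"\n", 0, end) + 1
        before = data[start:end]

        # Whatever follows the run, up to the next newline.
        newline = data.find(b"\n", match.end())
        stop = newline if newline != -1 else len(data)
        after = data[match.end():stop]

        runs.append((
            match.start(),
            match.end() - match.start(),
            before.decode(errors="replace").strip(),
            after.decode(errors="replace").strip(),
        ))
    return runs


# Possible use on the firewall itself (path taken from the post above):
#   with open("/var/log/messages", "rb") as fh:
#       for offset, length, before, after in find_nul_runs(fh.read()):
#           print(offset, length, before, after, sep=" | ")
```

The timestamp on the line before each run gives a tighter bound on when the machine actually stopped than the restart line alone.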

Could this be a bad batch of power supplies, with capacitors all going out at the same time? A bad batch of motherboards? Bad RAM? A bug in IPFire that affects all hardware no matter what it runs on?

Sunday night the system seems to have just rebooted and then come back up on its own. Today I was out of the office when it happened, and my boss pulled the plug to reset it.

Thoughts?

One other strange thing I noticed: I wonder whether OpenVPN taking up a lot of resources has anything to do with this. The OpenVPN authentication service was using a lot of CPU. I stopped the OpenVPN service from the GUI, yet top in a bash shell showed the OpenVPN authenticator service still running and still using a lot of CPU. It seems that stopping the service in the GUI did not kill it. I killed it in the shell, and then CPU usage went back down to normal.
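To confirm what is still running after stopping the service in the GUI, the process table can be checked directly instead of watching top. This is a rough sketch that reads the Linux /proc filesystem; the function name `find_processes` is made up for illustration, and matching on the substring "openvpn" is an assumption about how the leftover process shows up in its command line.

```python
import os


def find_processes(needle, proc_root="/proc"):
    """Return [(pid, cmdline)] for processes whose command line contains needle."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited in the meantime or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\x00", b" ").decode(errors="replace").strip()
        if needle in cmdline:
            matches.append((int(entry), cmdline))
    return sorted(matches)


if __name__ == "__main__":
    for pid, cmdline in find_processes("openvpn"):
        print(pid, cmdline)
```

If this still lists a process after the GUI reports the service as stopped, that would back up the observation that the stop action did not terminate it.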

Here’s the openvpn server.conf
[root@ipfire ovpn]# cat server.conf
#OpenVPN Server conf

daemon openvpnserver
writepid /var/run/openvpn.pid
#DAN prepare OpenVPN for listening on blue and orange
;local
dev tun
proto udp
port 1194
script-security 3
ifconfig-pool-persist /var/ipfire/ovpn/ovpn-leases.db 3600
client-config-dir /var/ipfire/ovpn/ccd
tls-server
ca /var/ipfire/ovpn/ca/cacert.pem
cert /var/ipfire/ovpn/certs/servercert.pem
key /var/ipfire/ovpn/certs/serverkey.pem
dh /var/ipfire/ovpn/ca/dh1024.pem
server 255.255.255.0
tun-mtu 1400
mssfix
keepalive 10 60
status-version 1
status /var/run/ovpnserver.log 30
ncp-disable
cipher AES-256-CBC
auth SHA512
tls-version-min 1.2
push "dhcp-option DNS "
push "dhcp-option WINS "
max-clients 100
tls-verify /usr/lib/openvpn/verify
crl-verify /var/ipfire/ovpn/crls/cacrl.pem
auth-user-pass-optional
reneg-sec 86400
user nobody
group nobody
persist-key
persist-tun
verb 4

# Log clients connecting/disconnecting
client-connect "/usr/sbin/openvpn-metrics client-connect"
client-disconnect "/usr/sbin/openvpn-metrics client-disconnect"

# Enable Management Socket
management /var/run/openvpn.sock unix
management-client-auth


See any concerns there?
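On a quick read, two directives in the listing are worth a second look from a security standpoint, though neither is an obvious cause of the reboots. The `dh` file is named dh1024.pem, which suggests 1024-bit Diffie-Hellman parameters (generally considered weak), and `script-security 3` allows passwords to be passed to scripts through environment variables. A small sketch that flags such lines is below. The function name `review_config` and the two checks are my own illustration, not an IPFire or OpenVPN tool, and the DH check only looks at the file name rather than the actual parameter size.

```python
import re

# (pattern, reason) pairs; both checks are illustrative assumptions.
CHECKS = [
    (
        re.compile(r"^dh\s+\S*dh1024"),
        "DH file name suggests 1024-bit parameters (generally considered weak)",
    ),
    (
        re.compile(r"^script-security\s+3\b"),
        "level 3 lets passwords reach scripts via environment variables",
    ),
]


def review_config(text):
    """Return [(line_number, directive, reason)] for flagged config lines."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # skip blank lines and comments
        for pattern, reason in CHECKS:
            if pattern.search(stripped):
                findings.append((number, stripped, reason))
    return findings


# Possible use (path taken from the post above):
#   with open("/var/ipfire/ovpn/server.conf") as fh:
#       for number, directive, reason in review_config(fh.read()):
#           print(f"line {number}: {directive} -> {reason}")
```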