Firewall crashes, Statistics partially missing

Hi,

One of our customers has a problem with a constantly crashing firewall. The syslog does not show any errors. The customer has to restart the device several times a day to get it working again.

I’ve seen that there are many gaps in the system statistics (CPU, Memory, Traffic, Hardware); see the example:

I don’t know whether the whole system is crashing or only some processes, as I don’t have access to the device in the crashed state. I’ve checked the SMART status of the built-in SSD, which seems OK. There’s enough free space left on the device.

Do you have any suggestions for how I can find out what is wrong?

I’ve recently updated to Core-Update 198.

Thank you in advance.

The syslog is stored for a longer time.
Can you see any suspicious messages around the time of outage?
The gaps in the graph represent the time between ‘crash’ and reboot, IMO. Is it possible to monitor with a higher resolution (‘day’ or ‘hour’)?
Are there other anomalies in other graphs?
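
To narrow down the time of an outage, one option is to look for stretches where the syslog itself goes silent. The following is only a rough sketch: it assumes the traditional syslog timestamp format (`Nov 19 12:34:56 host ...`), which omits the year, so the year has to be passed in. The function name `find_log_gaps` and the 5-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, min_gap=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs where the log was silent for min_gap or longer.

    Assumes each line starts with a 15-character timestamp such as
    'Nov 19 12:34:56'. Lines that do not parse are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            # The year is prepended because the assumed format does not carry one.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation lines, garbage, other formats
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

It could be used roughly like this (the log path `/var/log/messages` is an assumption, adjust as needed): `with open("/var/log/messages", errors="replace") as f: print(find_log_gaps(f, year=2025))`. The last messages before each reported gap are the ones worth reading closely.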

The gaps are way bigger than the timespan of failure. E.g. between Nov 19 and 21 it was in normal use (with the mentioned manual restarts in between).

Yes, I can select the daily or hourly graph, but then there is not much data visible. I will check the syslog in more detail later. We had some other issues with an orphaned IP of an undocumented device, so the syslog is very big (10 errors per second). This was solved yesterday, but the problem still persists.

The data for the graphs is stored in RAM and saved only once a day to minimize write access to SSDs. If the system crashes, this data will be lost.

You can change this by setting RAMDISK_MODE=0 in /etc/sysconfig/ramdisk and rebooting.
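
If the change has to be made on several boxes, it could be scripted. This is a minimal sketch, not an official IPFire tool: it assumes the file uses plain `KEY=VALUE` lines, and the helper name `set_ramdisk_mode` and the `.bak` backup are made up for illustration.

```python
import re
from pathlib import Path


def set_ramdisk_mode(config_path, mode):
    """Set RAMDISK_MODE=<mode> in a KEY=VALUE style file.

    Writes a '<name>.bak' copy of the old content first and returns True
    if the file was changed, False if it already had the requested value.
    """
    path = Path(config_path)
    text = path.read_text()
    new_line = f"RAMDISK_MODE={mode}"
    pattern = re.compile(r"^RAMDISK_MODE=.*$", flags=re.MULTILINE)
    if pattern.search(text):
        new_text = pattern.sub(new_line, text)
    else:
        # Assumption: appending the setting is acceptable if it is missing.
        separator = "" if text == "" or text.endswith("\n") else "\n"
        new_text = text + separator + new_line + "\n"
    if new_text == text:
        return False
    path.with_name(path.name + ".bak").write_text(text)
    path.write_text(new_text)
    return True
```

Run as root, `set_ramdisk_mode("/etc/sysconfig/ramdisk", 0)` would apply the setting; the reboot is still needed afterwards.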

Thank you for the hint, this is very useful and explains the gaps. I will temporarily set the RAMDISK_MODE to 0 to have the values just before the crash.

@arne_f
nice find :rocket:
https://www.ipfire.org/docs/search?q=RAMDISK_MODE :man_shrugging:

@hensch
https://www.ipfire.org/docs/addons/mcelog

@current_user Thanks, I installed mcelog. Let’s see if it shows some useful information.

With RAMDISK_MODE set to 0 there is no sign of abnormal sensor or system parameters. I will be there tomorrow and install a new device to see if it’s a hardware failure.

you’re welcome, and let us know what mcelog was able to do for you. :man_detective:
i was once able to identify a defective cpu/ram-controller :trophy:
also it would be nice if you could name the hardware :man_shrugging:
the one that will be replaced and the one that replaces :wink:

After checking the system on site with a screen connected, it turned out it had a filesystem error caused by a hardware failure. I swapped the SSD, and after two reboots it froze. So I replaced it with an identical but new system. It’s an MPC-4LAN-N3700 MiniPC.

The strange thing is that I saw nothing in the kernel log.

It seems to work fine now. Thank you for your help.

Thanks for sharing your experience, glad the issue has been solved.

Did you replace only the PC box, or also the power supply?

Are you willing to share more info and experience about this box?

I replaced the whole system including the power supply for safety.

We have been using these boxes for three years for small companies and private customers. In total there are 16 of these boxes out there (and more to come). Including this case, we have had 2 failures so far. The other failure was a BIOS losing its settings after power loss (the BIOS battery was good).

Nevertheless, the other systems are very reliable.

Thanks for sharing your experience, I’m glad you are satisfied with these boxes.

Granted, SBC-like appliances (such as several sold by brand-name firewall vendors) are less complex and have fewer parts than a full-fledged computer (which these boxes are). Even so, a failure rate of 12.5% (2 out of 16) in three years is a tad on the “too much” side to my mind. Anyway, it is still a small sample to draw conclusions from.
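
To put a number on how small the sample is, here is a rough sketch that computes the observed rate together with a 95% Wilson score interval. The choice of the Wilson interval is mine, purely for illustration, and it treats both failures as the same kind of event, which they were not.

```python
from math import sqrt


def wilson_interval(failures, total, z=1.96):
    """Return (rate, low, high) for a binomial proportion.

    z=1.96 corresponds to a 95% confidence level.
    """
    rate = failures / total
    denom = 1 + z * z / total
    center = (rate + z * z / (2 * total)) / denom
    half = z * sqrt(rate * (1 - rate) / total + z * z / (4 * total * total)) / denom
    return rate, center - half, center + half


rate, low, high = wilson_interval(2, 16)
print(f"observed {rate:.1%}, plausible range {low:.1%} to {high:.1%}")
```

For 2 failures out of 16 the range comes out at roughly 3.5% to 36%, so the figures are compatible with both a quite reliable and a rather unreliable box.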