One of our customers has a problem with a constantly crashing firewall. The syslog does not show any errors. The customer has to restart the device several times a day to get it working again.
I’ve seen that there are many gaps in the system statistics (CPU, memory, traffic, hardware), see the example:
I don’t know whether the whole system is crashing or only some processes, as I don’t have access to the device while it is in the crashed state. I’ve checked the SMART status of the built-in SSD, which seems OK. There’s enough free space left on the device.
Do you have any suggestions on how I can find out what is wrong?
The syslog is stored for a longer time.
Can you see any suspicious messages around the time of outage?
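One way to narrow down the time of each outage is to look for silent stretches in the syslog itself. Below is a minimal Python sketch of that idea, not a finished tool. It assumes the traditional syslog line format that starts with a "Mon DD HH:MM:SS" timestamp and carries no year, so the year has to be passed in. The function names, the 5-minute threshold, and the sample lines are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_timestamp(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' stamp; return None if the line has none."""
    try:
        # The first 15 characters hold the stamp in the traditional syslog format.
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, min_gap=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs where the log was silent for min_gap or longer."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_timestamp(line, year)
        if stamp is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


if __name__ == "__main__":
    # Invented sample lines, only to show the expected input shape.
    sample = [
        "Nov 19 10:15:30 fw kernel: example message",
        "Nov 19 10:15:32 fw kernel: example message",
        "Nov 19 11:02:10 fw kernel: example message",
    ]
    for last_seen, next_seen in find_gaps(sample, year=2023):
        print(f"silent from {last_seen} to {next_seen}")
```

The last few messages before each reported gap are the ones worth reading closely. On a quiet system the threshold would need to be larger to avoid false positives.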
I think the gaps in the graph represent the time between the ‘crash’ and the reboot. Is it possible to monitor with a higher resolution (‘day’ or ‘hour’)?
Are there other anomalies in other graphs?
The gaps are much larger than the duration of the failures. For example, between Nov 19 and 21 the device was in normal use (with the mentioned manual restarts in between).
Yes, I can select the daily or hourly graph, but then there is not much data visible. I will check the syslog in more detail later. We had another issue with an orphaned IP of an undocumented device, so the syslog is very big (10 errors per second). That was solved yesterday, but the crashes still persist.
Thank you for the hint, this is very useful and explains the gaps. I will temporarily set the RAMDISK_MODE to 0 to have the values just before the crash.
@current_user Thanks, I installed mcelog. Let’s see if it shows some useful information.
With RAMDISK_MODE set to 0 there is no sign of abnormal sensor or system parameters. I will be on site tomorrow and swap in a new device to see whether it’s a hardware failure.
You’re welcome, and let us know what mcelog was able to do for you.
I was once able to identify a defective CPU/RAM controller with it.
It would also be nice if you could name the hardware:
the device that will be replaced and the one that replaces it.
After checking the system on site with a screen connected, it turned out that it had a filesystem error caused by a hardware failure. I swapped the SSD, and after two reboots it froze again. So I replaced it with an identical but new system. It’s an MPC-4LAN-N3700 MiniPC.
The strange thing is that I saw nothing in the kernel log.
It seems to work fine now. Thank you for your help.
I replaced the whole system, including the power supply, to be safe.
We have been using these boxes for three years for small companies and private customers. In total there are 16 of these boxes out there (and more to come). Including this case we have had 2 failures so far. The other failure was a BIOS losing its settings after a power loss (the BIOS battery was good).
Nevertheless, the other systems are very reliable.
Thanks for sharing your experience, I’m glad you are satisfied with these boxes.
Even acknowledging that SBC-like appliances (several brand-name firewalls are delivered on such hardware) are less complex and have fewer parts than a full-fledged computer (which these boxes are), a failure rate of 12.5% (2 out of 16) in three years is a tad on the “too much” side in my view. Anyway, it is still too small a sample to draw conclusions from.
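To put a number on how small a sample 2 out of 16 is, here is a rough Python sketch that computes the observed failure rate together with a 95% Wilson score interval. The choice of the Wilson interval and the function name are my own assumptions for illustration; it also treats each box as an independent trial, which ignores how long each one has been deployed.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Return (low, high) of the Wilson score interval for a proportion.

    z=1.96 corresponds to a confidence level of roughly 95%.
    """
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    failures, total = 2, 16  # figures taken from this thread
    low, high = wilson_interval(failures, total)
    print(f"observed rate: {failures / total:.1%}")
    print(f"95% interval:  {low:.1%} to {high:.1%}")
```

With these figures the interval spans roughly 3.5% to 36%, so the data is compatible with both a quite reliable and a rather unreliable box.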