ipFire crashed: How to analyze?

IPFire 2.27 (x86_64) - Core-Update 176

Hi,
since two days ipFire crashed during the night. At the morning no I-Net is available. After I molest my provider (NEVER before ipFire crashed). I check the ipFire. But I could not login. Keyboard is dead. → Restart.

In the middle of the day - Again …; Today in the morning … again …
(And it is not needed to say that there were no changes in the config before :wink:

In /var/log/messages I found no helpful entrys.
Last entry is a outgoing package and the next entry (some hours later, because it crashed early in the morning) is the restart of the syslogd.

Jul 24 05:21:04 goofy kernel: DROP_HOSTILE IN=red0 OUT= MAC=00:e0:67:2e:b2:33:00:1d:aa:88:67:74:08:00 SRC=77.90.185.143 DST=192.168.251.2 LEN=44 TOS=0x00 PREC=0x00 TTL=247 ID=54321 PROTO=TCP SPT=57842 DPT=49757 WINDOW=65535 RES=0x00 SYN URGP=0 
Jul 24 05:21:12 goofy kernel: INPUTFW IN=blue0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:d0:93:5d:99:70:08:00 SRC=192.168.253.22 DST=192.168.253.255 LEN=86 TOS=0x00 PREC=0x00 TTL=64 ID=5899 DF PROTO=UDP SPT=9522 DPT=9522 LEN=66 
Jul 24 05:21:12 goofy kernel: OUTGOINGFW IN= OUT=red0 SRC=192.168.251.2 DST=192.168.251.1 LEN=104 TOS=0x00 PREC=0x00 TTL=255 ID=1917 DF PROTO=ICMP TYPE=8 CODE=0 ID=54518 SEQ=34081 MARK=0x80000000 
Jul 24 05:23:16 goofy syslogd 1.5.1: restart (remote reception).
Jul 24 05:23:17 goofy unbound: [4132:0] notice: init module 0: validator
Jul 24 05:23:17 goofy unbound: [4132:0] notice: init module 1: iterator
Jul 24 05:23:17 goofy acpid: starting up with netlink and the input layer
Jul 24 05:23:17 goofy acpid: 1 rule loaded
Jul 24 05:23:17 goofy acpid: waiting for events: event logging is off
Jul 24 05:23:17 goofy unbound: [4132:0] in...

Where should I start my investigations?

BTW: Smart did not shown any interesting information It looks like a default:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.30-ipfire] (IPFire 2.27)
Copyright (C) 2002-22  Bruce Allen  Christian Franke  www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Silicon Motion based SSDs
Device Model:     TS128GMSA230S
Serial Number:    H073080649
LU WWN Device Id: 5 7c3548 1e1316f49
Firmware Version: R0427GN1
User Capacity:    128 035 676 160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
TRIM Command:     Available  deterministic  zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3  6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jul 25 08:52:28 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       8957
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       17
148 Total_SLC_Erase_Ct      0x0000   100   100   000    Old_age   Offline      -       267
149 Max_SLC_Erase_Ct        0x0000   100   100   000    Old_age   Offline      -       20
150 Min_SLC_Erase_Ct        0x0000   100   100   000    Old_age   Offline      -       16
151 Average_SLC_Erase_Ct    0x0000   100   100   000    Old_age   Offline      -       17
159 DRAM_1_Bit_Error_Count  0x0000   100   100   000    Old_age   Offline      -       0
160 Uncorrectable_Error_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
161 Valid_Spare_Block_Cnt   0x0000   100   100   000    Old_age   Offline      -       48
163 Initial_Bad_Block_Count 0x0000   100   100   000    Old_age   Offline      -       9
164 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       35354
165 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       81
166 Min_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       17
167 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       74
168 Max_Erase_Count_of_Spec 0x0000   100   100   000    Old_age   Offline      -       1500
169 Remaining_Lifetime_Perc 0x0000   100   100   000    Old_age   Offline      -       96
177 Wear_Leveling_Count     0x0000   100   100   050    Old_age   Offline      -       0
181 Program_Fail_Cnt_Total  0x0000   100   100   000    Old_age   Offline      -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
192 Power-Off_Retract_Count 0x0000   100   100   000    Old_age   Offline      -       9
194 Temperature_Celsius     0x0000   100   100   000    Old_age   Offline      -       53
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   016    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   050    Old_age   Offline      -       18
232 Available_Reservd_Space 0x0000   100   100   000    Old_age   Offline      -       100
241 Host_Writes_32MiB       0x0000   100   100   000    Old_age   Offline      -       102663
242 Host_Reads_32MiB        0x0000   100   100   000    Old_age   Offline      -       25978
245 TLC_Writes_32MiB        0x0000   100   100   000    Old_age   Offline      -       318186

Almost certainly you have an hardware problem. I would not rely on the smartmoontool info, it has a lot of false negatives. I would swap the disk to see if the problem goes away. Can you have a power supply problem?

3 Likes

Arg … sorry, this :zipper_mouth_face:forum SW is not my friend …

This are the related lines from the log:

Jul 25 03:50:30 goofy kernel: OUTGOINGFW IN= OUT=red0 SRC=192.168.251.2 DST=192.168.251.1 LEN=104 TOS=0x00 PREC=0x00 TTL=255 ID=13085 DF PROTO=ICMP TYPE=8 CODE=0 ID=61360 SEQ=1503 MARK=0x80000000 
Jul 25 03:50:31 goofy kernel: DROP_HOSTILE IN=red0 OUT= MAC=00:e0:67:2e:b2:33:00:1d:aa:88:67:74:08:00 SRC=194.26.135.166 DST=192.168.251.2 LEN=44 TOS=0x00 PREC=0x00 TTL=247 ID=4763 PROTO=TCP SPT=56240 DPT=9595 WINDOW=1024 RES=0x00 SYN URGP=0 
Jul 25 03:50:38 goofy kernel: FORWARDFW IN=orange0 OUT=red0 MAC=00:e0:67:2e:b2:35:7c:2f:80:a4:02:0e:08:00 SRC=192.168.252.18 DST=217.0.11.225 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=60836 PROTO=UDP SPT=3478 DPT=3478 LEN=28 MARK=0x80000000 
Jul 25 03:50:38 goofy kernel: SNAT IN=orange0 OUT=red0 MAC=00:e0:67:2e:b2:35:7c:2f:80:a4:02:0e:08:00 SRC=192.168.252.18 DST=217.0.11.225 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=60836 PROTO=UDP SPT=3478 DPT=3478 LEN=28 MARK=0x80000000 
Jul 25 03:50:51 goofy kernel: INPUTFW IN=blue0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:d0:93:5d:99:70:08:00 SRC=192.168.253.22 DST=192.168.253.255 LEN=86 TOS=0x00 PREC=0x00 TTL=64 ID=11351 DF PROTO=UDP SPT=9522 DPT=9522 LEN=66 
Jul 25 03:50:52 goofy kernel: DNAT IN=blue0 OUT= MAC=00:e0:67:2e:b2:34:00:d0:93:5d:99:70:08:00 SRC=192.168.253.22 DST=192.168.253.2 LEN=65 TOS=0x00 PREC=0x00 TTL=64 ID=14237 DF PROTO=UDP SPT=42459 DPT=53 LEN=45 
Jul 25 03:50:52 goofy kernel: INPUTFW IN=blue0 OUT= MAC=00:e0:67:2e:b2:34:00:d0:93:5d:99:70:08:00 SRC=192.168.253.22 DST=192.168.253.2 LEN=65 TOS=0x00 PREC=0x00 TTL=64 ID=14237 DF PROTO=UDP SPT=42459 DPT=53 LEN=45 
Jul 25 03:50:52 goofy kernel: OUTGOINGFW IN= OUT=red0 SRC=192.168.251.2 DST=149.233.24.179 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=61861 DF PROTO=TCP SPT=46310 DPT=80 WINDOW=64240 RES=0x00 SYN URGP=0 MARK=0x80000000 
Jul 25 07:11:26 goofy syslogd 1.5.1: restart (remote reception).
Jul 25 07:11:26 goofy unbound: [4135:0] notice: init module 0: validator
Jul 25 07:11:26 goofy unbound: [4135:0] notice: init module 1: iterator
Jul 25 07:11:26 goofy acpid: starting up with netlink and the input layer

There you see at 03:50:52 is the last normal action. At 07:11:26 I do the PowerOff/PowerOn.

When the problem happens are you able to still get a login screen via the console terminal or the serial console terminal?

If you can’t login via one of the consoles anymore that would suggest it has just stopped working and I would agree with @cfusco view that the likely problem is hardware related.

You could reboot IPFire and from the console terminal run the memtest to see if any issues show up with that.

No, at the console it is not possible to enter something it is frozen. Today the screen shows coloured snow …

During the restart there was a line about file system error. But it was gone to fast …

You should find the kernel boot log in /var/log/bootlog. The error could be due to a kernel crash leading to an improper shutdown of the filesystem. In that case if your file system is EXT4 you should have an automatic run of fsck during the boot, which would be logged.

If I were in your position I would make sure to have a good backup. Just in case.

Things I would check if you have kernel freeze without logs (in order of probability):

  1. Hard disk;
  2. RAM;
  3. power supply;
  4. CPU overheat (see /status/hardware graphs);
  5. PCIe or USB peripherals (disconnect them to test);
  6. BIOS or UEFI problems (do you have an UP to date firmware?);
  7. faulty network cards.

I personally experienced only the first three, the first one several times.

3 Likes

It reads like the box suddenly lost power at 03:50:52. So if nothing else in your location lost power (no blinking clocks, etc.) then it points to a power supply or whatever the power supply is plugged into.

I had a similar issue a while back with my IPF when it would just shutdown. Ran memtest and it would just stop and also shutdown. Turned out to be a CPU overheating issue. Replaced the mobo and issue fixed.

2 Likes

I am astonished … today the firewall is running since yesterday …
Tomorrow, it is possible to shut down and do the tests … I will report.

No,
03:50:52 was the last entry in the log before the FW froze. When I check the console, I see the normal screen (first time) or coloured snow (second time).
07:11:26 was the first entry after a power off/power on.

I had a very similar experience happen to me, after a bunch of outages and reboots like this, replacing the hardware resolved the issue. Get an old PC, put a second NIC Card in it, and do a fresh IPFire install and put that in place overnight. (Do not do a restore of backup at first, do a fresh install). Then run that for a few days, if there are no issues then restore a backup from the existing box and run that for a few days, if no crashes occur then you know it’s hardware related, then start swapping parts.

2 Likes

:shushing_face:

however

there is: https://mcelog.org/
wiki.ipfire.org - mcelog
try to find out if this works with your problem :scientist:
well, good luck :exclamation:

2 Likes

To get this thread to an end …

Indeed it was an HW-Error :frowning_with_open_mouth:
I replaced this box with a replaced one from the seller. (But it is the same type … :face_with_diagonal_mouth:
And placed a fan onto this ‘fanless’ box … :face_with_spiral_eyes:
Now I will see how long this works …

Thank you, for your hints!

edit: typo

1 Like