jon
(Jon)
3 August 2024 23:43
1
My test IPFire APU4D4 suddenly rebooted. It has been sitting idle for the past few days and I heard a familiar “beep” reboot at Aug 3 14:09:57.
FYI - There is nothing in the log at the time of reboot:
message.log.zip (4.0 KB)
but I did find this from a few days ago. This message popped up about 10 times since the beginning of July.
Jul 30 10:18:12 ipfireAPU2 The database has been updated recently
Jul 30 11:51:40 ipfireAPU2 kernel: mce: [Hardware Error]: Machine check events logged
Jul 30 11:51:41 ipfireAPU2 kernel: [Hardware Error]: Corrected error, no action required.
Jul 30 11:51:41 ipfireAPU2 kernel: [Hardware Error]: CPU:0 (16:30:1) MC1_STATUS[-|CE|-|AddrV|-|-|-]: 0x9400000000000151
Jul 30 11:51:41 ipfireAPU2 kernel: [Hardware Error]: Error Addr: 0x0000ffffbd41b770
Jul 30 11:51:41 ipfireAPU2 kernel: [Hardware Error]: MC1 Error: Data/tag array parity error for a tag hit.
Jul 30 11:51:41 ipfireAPU2 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Jul 30 11:55:48 ipfireAPU2 The database has been updated recently
I loaded up mcelog to try and decode but the APU does not have a supported CPU.
Any thoughts on what this might indicate?
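For anyone who wants to check how often this shows up, here is a minimal Python sketch that tallies the `MCx_STATUS` lines per day. The regular expression assumes the exact syslog layout shown above, and `count_mce_events` is a name made up for this example, so treat it as a starting point and not as a tested tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# Jul 30 11:51:41 ipfireAPU2 kernel: [Hardware Error]: CPU:0 (16:30:1) MC1_STATUS[...]: 0x9400000000000151
# Assumption: classic syslog layout "Mon DD HH:MM:SS host kernel: ...".
STATUS_RE = re.compile(
    r"^(?P<day>\w{3}\s+\d+)\s+\S+\s+\S+\s+kernel:.*"
    r"MC(?P<bank>\d+)_STATUS\[[^\]]*\]:\s*(?P<status>0x[0-9a-fA-F]+)"
)

def count_mce_events(lines):
    """Count MCx_STATUS lines, keyed by (day, bank number, status value)."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            day = " ".join(m["day"].split())  # normalise "Aug  3" to "Aug 3"
            counts[(day, int(m["bank"]), m["status"].lower())] += 1
    return counts
```

The function takes any iterable of lines, so an open log file object can be passed in directly; which log file to read depends on the system.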
jon
(Jon)
4 August 2024 00:38
2
I saw that my firmware version was not up to date, so I just did the update with www.ipfire.org - firmware-update.
Hopefully that will help!
jon
(Jon)
21 September 2024 02:23
3
I am still seeing these same errors, along with occasional “reboots” for no known reason (besides the below).
Any ideas what this might mean?
Sep 19 03:20:24 ipfireAPU2 kernel: mce: [Hardware Error]: Machine check events logged
Sep 19 03:20:24 ipfireAPU2 kernel: [Hardware Error]: Corrected error, no action required.
Sep 19 03:20:24 ipfireAPU2 kernel: [Hardware Error]: CPU:0 (16:30:1) MC1_STATUS[-|CE|-|AddrV|-|-|-]: 0x9400000000000151
Sep 19 03:20:24 ipfireAPU2 kernel: [Hardware Error]: Error Addr: 0x0000ffffa2145270
Sep 19 03:20:24 ipfireAPU2 kernel: [Hardware Error]: MC1 Error: Data/tag array parity error for a tag hit.
Sep 19 03:20:24 ipfireAPU2 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Also, mcelog did not help me:
[root@ipfireAPU2 ~]# mcelog
mcelog: ERROR: AMD Processor family 22: mcelog does not support this processor. Please use the edac_mce_amd module instead.
CPU is unsupported
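Since mcelog refuses this CPU, the status value can also be picked apart by hand. Below is a rough Python sketch that decodes the generic x86 machine-check flag bits and the "memory hierarchy" error code in the low 16 bits. The abbreviations are chosen to match what the kernel printed above, `decode_status` is a name made up for this example, and AMD-specific bits are ignored, so this is an illustration and not a full decoder.

```python
# Architectural flag bits in an MCi_STATUS register (generic x86 MCA layout).
FLAGS = {63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV", 57: "PCC"}

# Field values for the memory-hierarchy error code format 0000 0001 RRRR TTLL.
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TT = ["INSN", "DATA", "GEN", "RESV"]
LL = ["RESV", "L1", "L2", "L3/GEN"]

def decode_status(status):
    """Decode the flag bits and, if applicable, the memory-hierarchy error code."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "corrected": "UC" not in flags, "error_code": hex(code)}
    if (code & 0xFF00) == 0x0100:  # upper byte 0x01 marks a memory-hierarchy error
        r = (code >> 4) & 0xF
        result["mem_tx"] = RRRR[r] if r < len(RRRR) else "RESV"
        result["tx"] = TT[(code >> 2) & 0x3]
        result["cache_level"] = LL[code & 0x3]
    return result

print(decode_status(0x9400000000000151))
```

For the value in my log this reports the flags Val, En and AddrV with UC clear (a corrected error), plus IRD / INSN / L1, which lines up with the kernel's "cache level: L1, tx: INSN, mem-tx: IRD" line.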
bonnietwin
(Adolf Belka)
21 September 2024 06:46
4
Found this in the kernel bugzilla.
https://bugzilla.kernel.org/show_bug.cgi?id=207907
While it is not exactly the same error, it appears that these types of messages are logged when a physical error in memory has occurred that the kernel was able to correct, and as long as it can be corrected you have to live with it.
jon
(Jon)
21 September 2024 14:50
5
Thanks!
I am not sure it is really corrected since I also experience this…
I am running a memtest now since the error mentions a “parity error”.
jon
(Jon)
21 September 2024 15:11
6
Adolf,
I thought there was a memtest within the IPFire build. I see an LFS file
but I cannot locate it within pakfire or on any of my ipfire devices. (Maybe I need more coffee)
Am I blind?!?
bonnietwin
(Adolf Belka)
21 September 2024 15:17
7
I believe it goes into the CD-ROM boot menu, not into the installed system itself.
So you could boot from a USB stick with the IPFire ISO on it, select memtest from that menu, and after running it just exit the installer, so nothing on your installation changes.
jon
(Jon)
22 September 2024 19:50
8
BIOS memtest has been running OK (no errors). So I am thinking it may be this:
opened 08:17PM - 09 May 21 UTC
Hi all,
I'm having some serious stability issues with APU2E4 and CPB with BIOS 4.13.0.1 and 4.13.0.5
This is brand new hardware which was believed to be "DOA" but the replacement I got had the exact same issue.
After disabling CPB the system appears to be stable and has an uptime of a record high 4 days and going.
Operating system tested OPNsense 21.1 and 21.1.5.
I've tried booting from msata, sd card and USB but it gives me the same issue.
I've also tried multiple power adapters.
The CPU Temperature is typically in the range 54-56c and the system isn't even connected to any network just the console cable.
The system has been very unstable and is core dumping every 4-12 hours. BIOS 4.13.0.1, but did see similar issues when testing 4.13.0.5.
From the logs/console I see the following:
FreeBSD/amd64 (OPNsense.localdomain) (ttyu0)
login: MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0x282060
[HBSD SEGVGUARD] [/usr/local/bin/python3 (5880)] Suspension expired.
-> pid: 5880 ppid: 1302 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
And:
"root@OPNsense:/var/db/rrd # MCA: Bank 1, Status 0xd400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR OVER ICACHE L1 IRD error
MCA: Address 0xffff80d1ff60"
Let me know if additional details are required.
Broken hardware, bios bug, OPNsense/HardenedBSD compatibility issues?
I am going to disable CPB (core performance boost) in the BIOS and see what happens…
jon
(Jon)
5 October 2024 02:32
9
So far so good! No mce: Hardware errors and no spontaneous (unplanned!) reboots since Sep 20. Yay.
Steps:
1. Go to the BIOS via F10 on boot
2. Choose setup
3. Disable Core Performance Boost
4. Save it!
I do not know if there are any performance issues related to this change.
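To double-check from the running system that boost really is off, something like the Python sketch below might work. It assumes the kernel exposes the boost switch at `/sys/devices/system/cpu/cpufreq/boost`, which is the case with the acpi-cpufreq driver on many machines; I have not confirmed that the APU provides this file, so the function returns `None` when it cannot be read. `boost_state` is a made-up helper name.

```python
from pathlib import Path

# Assumption: acpi-cpufreq style sysfs node; it may not exist on every board.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")

def boost_state(path=BOOST_PATH):
    """Return True if boost is enabled, False if disabled, None if unknown."""
    try:
        text = Path(path).read_text().strip()
    except OSError:
        return None  # node missing or unreadable
    if text == "1":
        return True
    if text == "0":
        return False
    return None  # unexpected content

print(boost_state())
```

A result of `False` would agree with the BIOS setting; `None` only means the file was not available, not that boost is on or off.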