Kernel Bug: khugepaged

maintech · 7 September 2023 10:28

WARNING: Kernel Errors Present
BUG: Bad page map in process khugepaged pte …: 25 Time(s)

This is my latest issue. It occurs once every 24 hours. Here is the complete log.

BUG: Bad page map in process khugepaged pte:8000400163dc9067 pmd:104b7f067
00:17:05 kernel: addr:000000000239f000 vm_flags:00100073 anon_vma:ffff9af603ce16e8 mapping:0000000000000000 index:239f
00:17:05 kernel: file:(null) fault:0x0 mmap:0x0 read_folio:0x0
00:17:05 kernel: CPU: 0 PID: 61 Comm: khugepaged Tainted: G B 6.1.45-ipfire #1
00:17:05 kernel: Hardware name: Gigabyte Technology Co., Ltd. GA-78LMT-USB3/GA-78LMT-USB3, BIOS F A 04/23/2013
00:17:05 kernel: Call Trace:
00:17:05 kernel:
00:17:05 kernel: dump_stack_lvl+0x48/0x6a
00:17:05 kernel: print_bad_pte.cold+0x73/0xd6
00:17:05 kernel: ? _raw_spin_lock_irq+0x1d/0x50
00:17:05 kernel: ? _raw_spin_unlock_irq+0x1c/0x50
00:17:05 kernel: vm_normal_page+0xc3/0xe0
00:17:05 kernel: hpage_collapse_scan_pmd+0x2bc/0x580
00:17:05 kernel: khugepaged+0x510/0x990
00:17:05 kernel: ? collapse_pte_mapped_thp+0x3d0/0x3d0
00:17:05 kernel: kthread+0xed/0x120
00:17:05 kernel: ? kthread_complete_and_exit+0x20/0x20
00:17:05 kernel: ret_from_fork+0x22/0x30

Not sure where I should post this. Sorry.

bonnietwin · 7 September 2023 10:40

I searched on the term

BUG: Bad page map in process khugepaged pte:

and all the entries I found related either to overclocking or to hardware problems, sometimes, but not always, involving memory.

maintech · 7 September 2023 10:41

Thanks!!! At least I know now where to start looking!

cfusco · 7 September 2023 10:52

This is what the GPT-4 model surmised from the logs. Since the error is not associated with a specific file, I would consider a hardware failure of the hard disk less likely. As @bonnietwin said, it might be an issue with the clock speed, or it might be a failing memory bank. I am a bit concerned about this “tainted” state of the kernel. I would also consider reinstalling the OS from scratch to see if the error persists.

The error message you provided is a kernel warning indicating a “Bad page map” in the process “khugepaged”. Let’s break down the different components of this message to understand it better:

Overview:

BUG: Bad page map in process khugepaged: This is the main error message, indicating that there is a bug involving a bad page map in the khugepaged process. khugepaged is a kernel thread responsible for collapsing smaller pages into huge pages to optimize memory management.
pte:8000400163dc9067 pmd:104b7f067: These are details about the page table entries (PTE) and page middle directory (PMD) involved in the error.
addr:000000000239f000 vm_flags:00100073 anon_vma:ffff9af603ce16e8 mapping:0000000000000000 index:239f: These are details about the memory address, virtual memory area flags, anonymous virtual memory area, and other memory mapping details involved in the error.
file:(null) fault:0x0 mmap:0x0 read_folio:0x0: These are details about the file and memory mapping involved in the error, which appear to be null in this case, indicating that the error is not associated with a specific file.
CPU: 0 PID: 61 Comm: khugepaged Tainted: G B 6.1.45-ipfire #1: This part of the message provides details about the CPU core, process ID, and the command that triggered the error. It also shows that the kernel is tainted, which means it has been modified or is in an unstable state.
Hardware name: Gigabyte Technology Co., Ltd. GA-78LMT-USB3/GA-78LMT-USB3, BIOS F A 04/23/2013: This part gives details about the hardware and BIOS version of the system where the error occurred.
Call Trace: This section provides a stack trace showing the function calls leading up to the error, which can be useful for debugging.
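
As a rough illustration of what the pte: value encodes, here is a minimal Python sketch that splits it into flag bits and a page frame number. It assumes the standard x86-64 layout for a 4 KiB page table entry (present, writable, user, accessed and dirty in the low bits, no-execute in bit 63, frame address in bits 12 to 51). The names decode_pte, PTE_FLAGS and ADDR_MASK are made up for this example and are not kernel APIs.

```python
# Sketch only: assumes the standard x86-64 4 KiB page table entry layout.
# decode_pte, PTE_FLAGS and ADDR_MASK are illustrative names, not kernel APIs.
PTE_FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
    3: "write-through",
    4: "cache-disabled",
    5: "accessed",
    6: "dirty",
    63: "no-execute",
}
ADDR_MASK = 0x000FFFFFFFFFF000  # bits 12..51 hold the physical frame address


def decode_pte(pte):
    """Split a raw PTE value into its flag names and physical frame."""
    flags = [name for bit, name in PTE_FLAGS.items() if pte & (1 << bit)]
    phys_addr = pte & ADDR_MASK
    return {"flags": flags, "phys_addr": phys_addr, "pfn": phys_addr >> 12}


if __name__ == "__main__":
    # Value taken from the first log in this thread.
    info = decode_pte(0x8000400163DC9067)
    print(info["flags"])
    print(hex(info["phys_addr"]))
```

Under that assumption, the value from the first log decodes to a frame address of 0x400163dc9000, which is above 64 TiB. Comparing that address with the amount of RAM actually installed is one way to judge whether the entry looks plausible.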

Analysis:

This error is a serious kernel issue, often indicative of a bug in the kernel itself or a hardware problem, possibly involving the system’s RAM. The khugepaged process is encountering an issue while trying to manage memory pages.

Steps to take:

Update Your System: Ensure your system is up-to-date, including the kernel. Sometimes such issues are resolved in newer kernel versions.
Hardware Issues: Consider the possibility of hardware issues. You might want to run a memory test using tools like Memtest86+ to check for RAM errors.
Kernel Parameters: Sometimes tweaking kernel parameters related to huge pages can help. You might consider disabling transparent huge pages to see if it resolves the issue.
Logs and Forums: Check system logs for any other related warnings or errors. You might also consider reporting this issue in IPFire forums or other Linux kernel communities with all the necessary details for more targeted assistance.
Expert Assistance: If you are unable to resolve the issue yourself, you might consider seeking assistance from someone with expertise in Linux kernel debugging to help identify and fix the issue.

Remember to always backup important data to prevent data loss in case of system instability.
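
For the transparent huge pages suggestion above, the current mode can be read from sysfs before changing anything. The sketch below is in Python and assumes the usual format of /sys/kernel/mm/transparent_hugepage/enabled, where the active choice is shown in square brackets (for example "always [madvise] never"). The helper name active_thp_mode is hypothetical.

```python
import re
from pathlib import Path

# Standard sysfs location for the transparent huge page mode.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) option, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print(active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("transparent huge pages are not available on this kernel")
```

Writing never to the same file as root, or booting with transparent_hugepage=never, switches the feature off. Whether that is a sensible workaround on an IPFire install is not something this thread establishes.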

cfusco · 7 September 2023 11:03

I asked GPT4 for more information on the flags associated with the error (G and B). Basically, the CPU is locked up in a state from which it can’t progress.

In the message “CPU: 0 PID: 61 Comm: khugepaged Tainted: G B 6.1.45-ipfire”, there are two taint flags present: ‘G’ and ‘B’. Here’s what each of them signifies:

‘G’ - Proprietary Module Loaded: This flag indicates that a proprietary (non-open-source) kernel module has been loaded. It suggests that the kernel has been exposed to code that is not open-source and hasn’t been reviewed by the wider Linux community, which might introduce instability or security risks.
‘B’ - Soft Lockup has Occurred: This flag indicates that a soft lockup has occurred, which is a situation where the kernel detects that a CPU is stuck in a condition where it is unable to make progress. It’s a serious issue that can indicate problems with the hardware or with the kernel itself.

So, in this specific kernel message, both the ‘G’ and ‘B’ taint flags are active, indicating the presence of both conditions described above at the time the message was generated. It suggests that the system is in a state that could potentially be unstable due to the loaded proprietary module and the occurrence of a soft lockup.

maintech · 7 September 2023 11:31

The install was a standard one and to my knowledge there were no proprietary modules loaded. If there were, the system is what loaded them. I have reset some things in the BIOS. I will see if it makes any difference.

And thank you for your replies. They help more than you can imagine.

cfusco · 7 September 2023 11:43

You might consider memtesting your RAM as well.

bonnietwin · 7 September 2023 11:52

All the driver firmware that is loaded, such as those files from Intel, AMD, RPi, PC-Engines, DVB … are binary blobs, hence proprietary modules.

maintech · 7 September 2023 15:56

Your reply is obvious to anyone who has the ability to think.

maintech · 8 September 2023 11:58

My first try at resetting things in my BIOS was a failure. I actually made it worse. It went from 24 hours to about 2 hours before the bug killed the system. So on my second try I went the other direction in BIOS settings. It has now been 24 hours and so far no bug. Maybe I got it this time. If so, I owe it all to you guys for the guidance. I appreciate it more than I can express. Thank you…

cfusco · 8 September 2023 12:15

On the contrary, it pointed you in the right direction. When you are certain of the result, can you tell us which BIOS parameters you modified?

No “thank you” is needed (but it is appreciated). Just pay it forward by helping the next IPFire user.

maintech · 9 September 2023 13:34

After manually setting the timings and voltages in the BIOS, with every attempt failing, I decided to try something I was sure would fail: I reset the BIOS to “failsafe”. I had run out of ideas on settings. Since rebooting in failsafe I have had no bugs, and it has run smoothly and with fewer issues. I haven’t looked at what the voltage and timing settings are now, but I am sure they are set very low. I will have to wait and see over time, but I feel it is now fixed. I wish to thank you again for your help.

maintech · 22 September 2023 03:04

I was wrong. That BIOS setting didn’t solve my problems. Here is a short copy of my logs. I hope you can tell what the issue is. Those logs are just a little too complex for my understanding. Any suggestions, including additional logs to collect so there is enough information to determine what is causing the issue, will be gladly accepted. Here is the short kernel log.

BUG: Bad page map in process khugepaged pte:8000400117a26067 pmd:10476f067
addr:00000000073df000 vm_flags:00100073 anon_vma:ffff8bb607ec3270 mapping:0000000000000000 index:73df
file:(null) fault:0x0 mmap:0x0 read_folio:0x0
CPU: 6 PID: 61 Comm: khugepaged Tainted: G B 6.1.45-ipfire #1
Hardware name: Gigabyte Technology Co., Ltd. GA-78LMT-USB3/GA-78LMT-USB3, BIOS F A 04/23/2013
Call Trace:

dump_stack_lvl+0x48/0x6a
print_bad_pte.cold+0x73/0xd6
? _raw_spin_lock_irq+0x1d/0x50
? _raw_spin_unlock_irq+0x1c/0x50
vm_normal_page+0xc3/0xe0
hpage_collapse_scan_pmd+0x2bc/0x580
khugepaged+0x510/0x990
? collapse_pte_mapped_thp+0x3d0/0x3d0
kthread+0xed/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30

Last question. That bug wouldn’t be because of one of my network cards would it?

arne_f · 22 September 2023 07:58

That is not certain. A driver bug can also trigger such errors, but more often they are caused by memory or disk errors, because khugepaged is part of the swap/memory management.

bonnietwin · 22 September 2023 08:11

I have done some further reading of the kernel documentation, especially with regard to tainting.

The flags are listed as G & B and the kernel documentation gives the following explanation:-

Bit 0 is marked as G which means that all kernel modules loaded have a GPL or compatible licence. If any of the loaded modules were proprietary then bit 0 would be marked as P. Therefore the tainting is not related to non standard modules.

Bit 5 is marked as B which means that a page-release function has found a bad page reference or some unexpected page flags. This indicates a hardware problem or a kernel bug.

You can confirm the tainted state of your kernel by running
cat /proc/sys/kernel/tainted
which should result in zero if not tainted, which should normally be the case. In your case you should find that it comes up with 32. This is 0 for bit 0 and 32 for bit 5.
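
The arithmetic above (0 for bit 0 plus 32 for bit 5) can be checked with a small decoder. This is a minimal Python sketch, not an official tool: the table holds only a subset of the taint bits described in the kernel's tainted-kernels documentation, and decode_taint is a name chosen for this example.

```python
from pathlib import Path

# Subset of the taint bits from the kernel's tainted-kernels documentation.
# When bit 0 is clear, the log line shows 'G' instead of 'P'.
TAINT_BITS = {
    0: ("P", "proprietary module was loaded"),
    4: ("M", "processor reported a machine check exception"),
    5: ("B", "bad page referenced or unexpected page flags"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "externally built (out-of-tree) module was loaded"),
    14: ("L", "soft lockup occurred"),
}


def decode_taint(value):
    """Return (letter, description) for every known bit set in value."""
    return [
        (letter, description)
        for bit, (letter, description) in TAINT_BITS.items()
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    path = Path("/proc/sys/kernel/tainted")
    value = int(path.read_text()) if path.exists() else 32
    for letter, description in decode_taint(value):
        print(letter, description)
```

With the value 32 expected here, only the B entry is reported, matching the "G B" shown in the log line (G because bit 0 is clear).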

More details about tainted kernels can be seen in this
https://www.kernel.org/doc/html/v5.8/admin-guide/tainted-kernels.html

As the taint is regarding bad memory page references then I would not expect it to be related to network hardware. More likely to be related to memory issues.

The taint info also says that if it is not a hardware issue then it could be a kernel bug but from what I have read the kernel developers will not accept a bug report if the kernel is marked as tainted.

EDIT:
Reading the kernel documentation on bug reporting and hunting, the example stack dump shown there comes from a tainted kernel, so maybe the kernel developers will accept bug reports from tainted kernels, provided testing has first been carried out to confirm that non-kernel-related issues have been eliminated.

maintech · 22 September 2023 10:13

cat /proc/sys/kernel/tainted outputs 0 or as you said “not tainted”.

bonnietwin · 22 September 2023 10:28

Once set the taint value stays in that file until the system is rebooted and then it is reset back to zero.

The fact that it stays at zero after a reboot means that the system normally runs without a bad page reference, but periodically a failure occurs, the bad page reference is detected, and the taint value is set.

You mention that this problem occurs every 24 hours and your original stack report had a time of around 00:17.

Looking through the standard crontab file, there is nothing that occurs around that time that might be overloading the memory in some way. Some entries are scheduled nightly but at a random time, in which case the problem would not occur every 24 hours but at more variable intervals.

Is there some script that you are running around that time that might be triggering the problem?

Have you run a memtest on the RAM?

maintech · 22 September 2023 11:32

I am not running any scripts at all. I have not run a memtest yet.

maintech · 18 October 2023 20:43

I switched computers, which moved me from an AMD CPU to Intel. I haven’t had this issue on the new computer. It was the quickest fix I could think of.