High CPU usage (CPU waiting for IO)

Hello!

I noticed a high CPU load but did not attach any importance to it at first. Several updates later, I looked at the event log and saw problems.

I cleared the graphs, because they had stopped displaying information again:

Kernel and Firewall:
 WARNING:  Kernel Errors Present
             res 41/00:01:2b:88:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  3 Time(s)
             res 41/00:02:1a:23:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  32 Time(s)
             res 41/00:02:fa:99:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  32 Time(s)
             res 41/00:06:9e:84:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  167 Time(s)
             res 41/00:06:f6:22:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  95 Time(s)
             res 41/00:07:9d:84:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  31 Time(s)
             res 41/00:07:f5:22:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  32 Time(s)
             res 41/00:09:2b:88:e7/00:00:0f:00:00/40 Emask 0x1 (device error) ...:  2 Time(s)
             res 41/40:00:1a:23:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  1 Time(s)
             res 41/40:00:2b:88:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  2 Time(s)
             res 41/40:00:9d:84:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  1 Time(s)
             res 41/40:00:9e:84:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  2 Time(s)
             res 41/40:00:c3:21:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  2 Time(s)
             res 41/40:00:f4:22:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  1 Time(s)
             res 41/40:00:f5:22:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  1 Time(s)
             res 41/40:00:f6:22:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  5 Time(s)
             res 41/40:00:fb:99:e7/00:00:0f:00:00/40 Emask 0x409 (media error) <F> ...:  2 Time(s)
             res 51/40:01:1b:23:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  4 Time(s)
             res 51/40:01:2b:88:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  4 Time(s)
             res 51/40:01:53:36:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  2 Time(s)
             res 51/40:01:73:a3:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
             res 51/40:01:fb:99:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  4 Time(s)
             res 51/40:02:1a:23:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  2 Time(s)
             res 51/40:02:52:36:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  2 Time(s)
             res 51/40:03:d9:21:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
             res 51/40:04:d8:21:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  2 Time(s)
             res 51/40:05:0f:7b:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  2 Time(s)
             res 51/40:06:0e:7b:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
             res 51/40:06:9e:84:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
             res 51/40:06:d6:21:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
             res 51/40:07:0d:7b:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  4 Time(s)
             res 51/40:07:75:70:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  5 Time(s)
             res 51/40:08:74:70:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  2 Time(s)
             res 51/40:08:d4:21:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  3 Time(s)
             res 51/40:0e:06:a6:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
             res 51/40:0f:75:70:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
             res 51/40:12:52:36:e7/00:00:00:00:00/ef Emask 0x9 (media error) ...:  1 Time(s)
    I/O error, dev sda, sector ...:  71 Time(s)
    ata1.00: NCQ disabled due to excessive errors ...:  2 Time(s)
    ata1.00: error: { UNC } ...:  61 Time(s)
    ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5) ...:  12 Time(s)
    sd 0:0:0:0: [sda] tag#0 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#0 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#1 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#1 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#11 Add. Sense: Unrecovered read error - auto reallocat ...:  4 Time(s)
    sd 0:0:0:0: [sda] tag#11 Sense Key : Medium Error [current]  ...:  4 Time(s)
    sd 0:0:0:0: [sda] tag#12 Add. Sense: Unrecovered read error - auto reallocat ...:  4 Time(s)
    sd 0:0:0:0: [sda] tag#12 Sense Key : Medium Error [current]  ...:  4 Time(s)
    sd 0:0:0:0: [sda] tag#14 Add. Sense: Unrecovered read error - auto reallocat ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#14 Sense Key : Medium Error [current]  ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#15 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#15 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#16 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#16 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#17 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#17 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#18 Add. Sense: Unrecovered read error - auto reallocat ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#18 Sense Key : Medium Error [current]  ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocat ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current]  ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#2 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#2 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#20 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#20 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#21 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#21 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#22 Add. Sense: Unrecovered read error - auto reallocat ...:  7 Time(s)
    sd 0:0:0:0: [sda] tag#22 Sense Key : Medium Error [current]  ...:  7 Time(s)
    sd 0:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#24 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#24 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#26 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#26 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#29 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#29 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#3 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#3 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#30 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#30 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#31 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#31 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#4 Add. Sense: Unrecovered read error - auto reallocat ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#4 Sense Key : Medium Error [current]  ...:  1 Time(s)
    sd 0:0:0:0: [sda] tag#5 Add. Sense: Unrecovered read error - auto reallocat ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#5 Sense Key : Medium Error [current]  ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#7 Add. Sense: Unrecovered read error - auto reallocat ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#7 Sense Key : Medium Error [current]  ...:  3 Time(s)
    sd 0:0:0:0: [sda] tag#8 Add. Sense: Unrecovered read error - auto reallocat ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#8 Sense Key : Medium Error [current]  ...:  2 Time(s)
    sd 0:0:0:0: [sda] tag#9 Add. Sense: Unrecovered read error - auto reallocat ...:  4 Time(s)
    sd 0:0:0:0: [sda] tag#9 Sense Key : Medium Error [current]  ...:  4 Time(s)

I would back up your system ASAP and order a new hard disk. Your present disk will soon be dead.

ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5) ...:  12 Time(s)

The drive is having trouble even identifying itself, which is a serious concern.


@cfusco there are backups, so that is not a problem. But I don’t understand why there is such a high CPU load?

It is not actually a heavy load. The yellow colour is the CPU waiting for I/O: the CPU is spending a lot of time waiting for data it has requested from the disk, because each read has to be retried many times before it succeeds.

It means that things started to get very bad with that drive around 4:00 am this morning, as that is when CPU waiting for IO increased to 40%. Even before that you had around 2% to 10% occurring, which is not good for a healthy drive.

My IPFire system has an average of 0.01% for CPU waiting for IO.
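For anyone curious where that percentage comes from: the kernel exposes cumulative per-state CPU tick counters in /proc/stat, and iowait is simply that counter's share of the interval between two snapshots. Here is a minimal Python sketch of the calculation (field order per the proc(5) man page; this is an illustration, not necessarily how IPFire's graphing collects it):

```python
# Minimal sketch: "CPU waiting for IO" percentage between two snapshots of the
# aggregate "cpu" line of /proc/stat. Field order after the "cpu" label (see
# proc(5)): user nice system idle iowait irq softirq steal guest guest_nice.

def iowait_percent(stat_line_before: str, stat_line_after: str) -> float:
    before = [int(f) for f in stat_line_before.split()[1:]]
    after = [int(f) for f in stat_line_after.split()[1:]]
    delta_total = sum(after) - sum(before)
    delta_iowait = after[4] - before[4]  # iowait is the 5th counter
    return 100.0 * delta_iowait / delta_total if delta_total else 0.0

# Two hypothetical snapshots taken a second apart:
t0 = "cpu  100 0 100 700 100 0 0 0 0 0"
t1 = "cpu  150 0 150 1100 500 0 0 0 0 0"
print(f"{iowait_percent(t0, t1):.1f}% iowait")  # prints "44.4% iowait"
```

On a live system you would read /proc/stat twice with a short sleep in between; tools like `vmstat` and `iostat` report the same counter as `wa` / `%iowait`.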


@bonnietwin Thank you! Maybe in future versions it will be possible to display a notification on the IPFire home page when the hard disk is degrading?

That is not so easy to do, as CPU waiting for IO is not a reliable indicator of hard disk failure: it can also be high for valid reasons, such as a file server running on IPFire serving large files. Conversely, a hard disk can be failing without ever showing elevated CPU waiting for IO.

What do you see on the WUI Status - Media page regarding disk accesses per day? Is there anything unusual in the values there?
What does SMART show if you press the SMART Information button on the same page?

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.45-ipfire] (IPFire 2.27)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HM321HX
Serial Number:    S26VJ9FZ427726
LU WWN Device Id: 5 0024e9 2025b4487
Firmware Version: 2AJ10001
User Capacity:    320 072 933 376 bytes [320 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6  3.0 Gb/s
Local Time is:    Fri Aug 18 14:49:58 2023 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       3510
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   093   090   025    Pre-fail  Always       -       2224
  4 Start_Stop_Count        0x0032   093   093   000    Old_age   Always       -       7584
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       33170
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   086   086   000    Old_age   Always       -       14867
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2139
191 G-Sense_Error_Rate      0x0022   099   099   000    Old_age   Always       -       13039
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   054   052   000    Old_age   Always       -       46 (Min/Max 2/48)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   099   099   000    Old_age   Always       -       66
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       755
223 Load_Retry_Count        0x0032   086   086   000    Old_age   Always       -       14867
225 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2676178

And that shows the problem with SMART interpretation: the table shows no failing parameters at all, yet the logs indicate big problems.

SMART sometimes shows no messages and then the drive just dies.
Sometimes it shows messages and nothing happens to the drive. It just keeps working.
Sometimes it shows potential failures and those are an indication of real problems to come.

Evaluating hard disk degradation really requires reviewing multiple sources of information together, combined with built-up experience, to be able to make a decision.

I had a hard disk on my desktop that showed some errors classified as the disk controller electronics starting to fail to read from the disk.
It turned out the SATA connector was not seated well. I replaced the cable with a new SATA cable that has a clip to hold it in place, and the drive works fine. However, because that parameter is considered critical for drive health, it still flags a failure on every SMART scan I run at boot, even though the value has never increased again.

In your case you could check the seating of the hard drive connector and see whether the problem disappears completely, but I would also purchase a new disk to have available as a precaution.
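To illustrate the "multiple sources of information" point: any automated check ends up being a heuristic over several attributes at once. Here is a toy Python sketch using the raw values from the table in this thread. The attribute names match the smartctl output above, but the rule set and thresholds are my own illustrative assumptions, not an official smartmontools or IPFire rule:

```python
# Toy multi-attribute disk-health heuristic. The thresholds are illustrative
# assumptions -- NOT an official smartmontools or IPFire rule set.

WARNING_RULES = {
    "Current_Pending_Sector": 0,   # any pending sector is suspicious
    "Reallocated_Sector_Ct": 0,
    "Offline_Uncorrectable": 0,
    "UDMA_CRC_Error_Count": 0,     # often cabling, not the platters
}

def health_warnings(raw_values):
    """raw_values: dict mapping attribute name -> raw value (int)."""
    return [name for name, limit in WARNING_RULES.items()
            if raw_values.get(name, 0) > limit]

# Raw values taken from the SMART table earlier in this thread:
drive = {
    "Current_Pending_Sector": 66,
    "Reallocated_Sector_Ct": 0,
    "Offline_Uncorrectable": 0,
    "UDMA_CRC_Error_Count": 1,
}
print(health_warnings(drive))  # ['Current_Pending_Sector', 'UDMA_CRC_Error_Count']
```

Even this tiny example shows the interpretation problem: the drive "PASSED" its overall self-assessment, yet a per-attribute look at the raw values already raises two flags.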


I completely agree, and this matches my experience. I have actually never seen a SMART table indicate failure, yet plenty of disks have died with a clean bill of health.

These are the three concerning parameters from the table:

  • Current_Pending_Sector: The value is 66. These are sectors that are currently unstable and might become reallocated sectors.

  • Load_Cycle_Count: This is at 2,676,178, which is a high count. Load cycles refer to the number of times the drive head parks itself. A high count indicates a lot of wear and tear, potentially leading to drive failure in the future.

  • Power_On_Hours: This drive has been powered on for 33,170 hours, which equates to nearly 3.8 years of continuous operation.

Considering this is a mechanical hard disk, it is probably simply wearing out and approaching the end of its life.
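The figures above are easy to sanity-check with a couple of lines of arithmetic (for context, 2.5-inch laptop-class drives are often rated on the order of a few hundred thousand load cycles, so the count here is remarkable):

```python
# Sanity-checking the SMART raw values quoted above.
power_on_hours = 33170
load_cycles = 2_676_178

years = power_on_hours / (24 * 365.25)
print(f"{years:.1f} years of power-on time")               # prints "3.8 years of power-on time"

cycles_per_hour = load_cycles / power_on_hours
print(f"~{cycles_per_hour:.0f} head load cycles per hour")
```

Roughly 80 load cycles per hour over the drive's whole life: the heads have been parking more than once a minute, on average, which is consistent with the wear-and-tear reading.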


Thank you for your help, there is definitely nothing wrong with the hard drive connector… Over the weekend, we will prepare new backup network equipment, since the current network hardware works in a production environment.


Personal opinion/experience: Samsung was not the greatest hard drive brand while it was on the market. SSDs are far better products in terms of reliability and, usually, performance.

If this is for corporate use, or even SMB, please consider asking your IT infrastructure team if they’re willing to donate to the IPFire project, if they do not already.

IPFire is community-supported, and a donation can go a long way in ensuring its continual development and improvement, especially since this thread has helped in potentially preventing an outage. :blush:

IPFire Donate


This kind of equipment was purchased a long time ago, when there were other priorities. Time does not stand still, and as equipment becomes obsolete, the requirements also grow. Therefore, over time we will replace the old equipment completely, since it has been working 24/7 for a fairly long period in not very good conditions.


I’ve taken the liberty of checking the warranty of the drive at Seagate (model: ST320LM000): the warranty expired on 28 April 2012. So if the warranty was 2 years, the drive was produced in 2010, and that is a long time ago. :slightly_smiling_face:

So it is; everything comes to an end. :slightly_smiling_face:

In fairness, if it has been operating correctly since 2010, 12 years of almost 24/7 operation puts this sample at least among the best in terms of reliability.
However, in my experience this seems to be an exception rather than the average. The reliability of this hard drive may also have been helped by a good PSU and a UPS filtering out power issues.
The 2000s were tough years for PSUs; the market was flooded with cheap but crappy products.

The electronics industry moved from leaded solder to lead-free solder in the early to mid-2000s, so there were lots of learning-curve issues related to poorly soldered components.

and to add to the misery!

There were LOTS of bad electrolytic capacitors manufactured from the late 1990s until around 2007. And PSUs need good (really good) electrolytic capacitors!

And the consumers got to pay for it all!


I hope this will be the shortest OT Ever… So.
AFAIK RoHS kicked in in 2012, at least, the “tougher one” RoHS 2, which hurted a lot european automotive, and Ilmor-designed engines branded Mercedes (bye bye beryllium). This lead (pun intended) to several difficulties for make metals and electronics work… less crappier than with hazardous materials. Capacitors issue were the same during all electronics life, a Youtuber that I follow that restores a lot of electronic equipment as “standard procedure” hit all capacitors for evaluate the status, and change them if are damaged, not responding to specs or… simply crappy (due to original material). Devices are from '60, more '70, and '80.
Rollback to the issues: a lot of mass production changed gear around 2017/2018, most of the times changing processes and tinkering enough with soldering mixes.

In any case, whenever possible we try to replace the equipment every 5 years, since it works 24/7. Yes, even in new equipment you may come across not-so-fresh components… But I am glad that IPFire is not very demanding on hardware: if we come across, for example, not-very-fast hard drives of an older generation, we do not worry about it; the main thing is that it keeps working for a long time. :slightly_smiling_face: