Can I make md (Linux software RAID) more fault tolerant?

Question

I have a particular hard drive in a RAID 1 mirror that gets failed out under heavy load, typically when I run a full backup.

There is nothing wrong with the drive. OK, it is getting one error when writing the superblock, but that's it. It's always this same disk and this same array that needs to be manually re-added every time I run the backup process.

Are there any settings that will make md more tolerant of whatever is causing this drive to get failed when under load?

This is Linux software RAID on Debian.

UPDATE: As requested, dmesg output at the time of failure:

[2347429.116507] print_req_error: I/O error, dev sda, sector 15751347328
[2347429.116511] sd 1:0:0:0: [sda] tag#1058 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[2347429.116516] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116518] sd 1:0:0:0: [sda] tag#1058 CDB: Write(16) 8a 08 00 00 00 00 00 00 00 28 00 00 00 08 00 00
[2347429.116522] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116523] print_req_error: I/O error, dev sda, sector 40
[2347429.116526] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116529] print_req_error: I/O error, dev sda, sector 40
[2347429.116532] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116533] md: super_written gets error=10
[2347429.116536] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116538] md/raid1:md127: Disk failure on sda, disabling device.
                 md/raid1:md127: Operation continuing on 1 devices.

I also just ran a short SMART offline test, which completed without error. The status is:

SMART overall-health self-assessment test result: PASSED

UPDATE 2: output of smartctl -a /dev/sda

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-25-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar He10
Device Model:     HGST HUH721010ALE604
Serial Number:    1EK1W8WZ
LU WWN Device Id: 5 000cca 27eeb2150
Firmware Version: LHGNW384
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 17 13:20:55 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   93) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (1167) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   151   151   024    Pre-fail  Always       -       429 (Average 442)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       99
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       28468
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       99
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1327
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1327
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 17/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       6783827

SMART Error Log Version: 1
ATA Error Count: 65535 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 65535 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.906  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.906  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.904  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.890  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.890  READ LOG EXT

Error 65534 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.890  READ FPDMA QUEUED
  60 00 08 00 a2 db 40 00  14d+02:47:36.880  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.865  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.865  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.856  READ FPDMA QUEUED

Error 65533 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.865  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.865  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.856  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.840  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.840  READ LOG EXT

Error 65532 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.840  READ FPDMA QUEUED
  60 00 08 00 a2 db 40 00  14d+02:47:36.836  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.823  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.823  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.820  READ FPDMA QUEUED

Error 65531 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.823  READ FPDMA QUEUED
  60 00 08 00 a2 db 40 00  14d+02:47:36.820  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.806  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.806  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.796  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     28440         -
# 2  Extended offline    Completed without error       00%        18         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It would be interesting to see the SMART data of the drive right after it was kicked out and again after the resync completed (the resync rewrites the drive completely, which gives it a chance to remap bad blocks; that should show up in SMART, and that is what I am hoping to observe). If it does not, check the PSU and the details of what happens when the drive is kicked out (dmesg, etc.). Maybe the PSU voltages fall too low under continuous load and the drive becomes unreliable because the power supply fails it. Or the power cable is just seated poorly. If you determine this is not the case, the generic answer below says it all. – Nikita Kipriyanov, Nov 15 at 12:17
Can you share the brand and model numbers for all the drives in the raid array? — Criggie, Nov 16 at 2:54
That is not normal. There's a reason the drive is kicked out. Find out what that reason is, and fix it. It may well be that the drive is faulty in some way after all. — marcelm, Nov 16 at 8:42
"There is nothing wrong with the drive". Surely there is. Have you checked dmesg and logs? — jcaron, Nov 16 at 9:43
I didn't ask for the SMART self-test. I asked for smartctl -a /dev/sda right after the drive was kicked out of the array, and again right after it finished rebuilding once you put it back. Also, please tell me: is the sector number reported in the I/O error (the first line of your log excerpt) always the same, or does it change each time? – Nikita Kipriyanov, Nov 16 at 17:13

vidarlo · Accepted Answer · 2023-11-15 12:02:19Z

30

When you do a backup, you read a lot of data. Probably the drive returns read errors, and is dropped for that reason. This may only happen for some specific area of the drive, which is not normally read.

The problem is that the drive is not reliable. You should replace the drive, not attempt to make MD accept it. MD drops it for a reason - it's not trustworthy.


1

This is the correct answer. The last thing you need is to do heavy writes on a drive that's giving detectable errors. Any time those errors start showing up, the drive 100% will die; it's a matter of when. It's not like the drive "gets sick" and heals itself over time.
– Nelson
Nov 16 at 8:56


shodanshok · Accepted Answer · 2023-11-16 14:44:53Z

A single read error will not kick the disk out of the array, at least on post-2012 kernels.

From md man page:

In later kernels, a read-error will instead cause md to attempt a recovery by overwriting the bad block. i.e. it will find the correct data from elsewhere, write it over the block that failed, and then try to read it back again. If either the write or the re-read fail, md will treat the error the same way that a write error is treated, and will fail the whole device.
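As a rough illustration of the behaviour the man page describes, here is a toy Python model. It is not kernel code: the `Member` class, its `unreadable`/`unwritable` sets, and `read_with_recovery` are hypothetical names invented for this sketch, and it assumes a two-disk mirror.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical stand-in for one RAID 1 member (not a kernel structure)."""
    name: str
    blocks: dict                                   # sector -> data
    unreadable: set = field(default_factory=set)   # sectors that fail on read
    unwritable: set = field(default_factory=set)   # sectors that fail on write
    faulty: bool = False

    def read(self, sector):
        if sector in self.unreadable:
            raise OSError(f"{self.name}: read error at sector {sector}")
        return self.blocks[sector]

    def write(self, sector, data):
        if sector in self.unwritable:
            raise OSError(f"{self.name}: write error at sector {sector}")
        self.blocks[sector] = data
        # In this toy model a successful rewrite clears the read fault.
        self.unreadable.discard(sector)


def read_with_recovery(primary, mirror, sector):
    """Read a sector; on a read error, repair the primary from the mirror."""
    try:
        return primary.read(sector)
    except OSError:
        data = mirror.read(sector)         # find the correct data elsewhere
        try:
            primary.write(sector, data)    # write it over the block that failed
            primary.read(sector)           # then try to read it back
        except OSError:
            primary.faulty = True          # handled like a write error
        return data
```

In this model a member whose bad sector can be rewritten and re-read stays in the array, while a member that also fails the rewrite is marked faulty.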

For the device to be removed from the array, one of these two things should happen:

a read error is followed by a write error for the affected sector
a link reset is issued by the kernel (you can find it via dmesg)
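To look for either condition in the kernel log, a small filter along these lines may help. It is only a sketch: the patterns are assumptions modelled on typical kernel messages (block-layer I/O errors, md failure lines, and the "resetting link" wording libata uses), and drives attached through other drivers, such as SAS HBAs, may log differently. The function name `classify` is made up for this example.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive list).
PATTERNS = {
    "io_error": re.compile(r"I/O error, dev (\w+), sector (\d+)"),
    "superblock_write": re.compile(r"md: super_written gets error"),
    "md_failure": re.compile(r"md/raid\d+:(\w+): Disk failure on (\w+)"),
    "link_reset": re.compile(r"(hard|soft) resetting link"),
}


def classify(lines):
    """Return (event_name, line) for every line matching a known pattern."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((name, line.strip()))
                break  # report each line at most once
    return events
```

Feed it the lines of saved `dmesg` output (reading the kernel log may require root) and look at the order of events leading up to the `md_failure` entry.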

If you are sure the disk is OK, try re-seating it and/or change SATA/power cables.

If issues persist, replace it.

EDIT: your dmesg output clearly shows that sda has a serious problem. I would replace it as soon as possible.

"If issues persist, replace it." Before replacing any disks, I'd check all system temperatures during backup, too. Maybe the system just needs its fans cleaned. Then I'd also calculate the power requirements of a fully loaded system and check that against the actual power supply capacity. Are there any voltage drops when the backups are running? – Andrew Henle, Nov 16 at 15:02

bobflux · Accepted Answer · 2023-11-16 00:10:04Z

13

I had a similar problem. I first checked the drive's SMART info for read error count, but there weren't any. Yet the OS reported errors and the drive got kicked out of the RAID.

It turned out to be a faulty SATA cable.
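For anyone wanting to make the same distinction from `smartctl -a` output, here is a minimal Python sketch. It assumes the classic ATA attribute table layout (ID in the first column, raw value in the tenth) and plain integer raw values. The function names and the wording of the verdicts are made up for illustration. Attributes 5, 197 and 198 count problem sectors on the media, while 199 (UDMA_CRC_Error_Count) counts CRC errors on the link between drive and controller.

```python
MEDIA_ATTRS = {5, 197, 198}  # reallocated, pending, offline-uncorrectable sectors
LINK_ATTRS = {199}           # UDMA_CRC_Error_Count: errors on the SATA link


def parse_attributes(smartctl_text):
    """Map attribute ID -> (name, raw value) from `smartctl -a` text output."""
    attrs = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows: numeric ID, name, 0x.... flag, ..., raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            # Assumes the raw value is a plain integer (true for the IDs used here).
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


def diagnose(attrs):
    """Very rough verdict: media problem, link problem, or neither."""
    media = sum(attrs[i][1] for i in MEDIA_ATTRS if i in attrs)
    link = sum(attrs[i][1] for i in LINK_ATTRS if i in attrs)
    if media:
        return "media errors: suspect the drive"
    if link:
        return "link CRC errors only: suspect cable, connector, or port"
    return "no media or link errors recorded"
```

A drive with zeros in 5, 197 and 198 but a large raw value in 199 matches the faulty-cable pattern described here.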


3

Good answer as it is always important to remember these things can be caused by faulty cables - even brand-new cables - and frequently nothing really points to it. I blogged about how this bit me some years ago.
– davidbak
Nov 16 at 21:16
1

I'd like to extend this to checking the power connection as well. I had odd issues with hardware and simply moving the connection from the middle header in the power cable to the terminal/end header resolved it like magic.
– Joshua K
Nov 17 at 15:53
1

The "UDMA_CRC_Error_Count" in the SMART output that just got posted is strongly suggestive of a bad cable.
– Mark
Nov 17 at 22:30
It's very likely. The drive gets hotter under continuous load, and due to thermal expansion a loose cable's contacts become faulty. Otherwise, the drive itself looks pretty OK and reliable. Replace or at least reseat the SATA cable and power connection. And had you provided this information from the very beginning, it could have been the only answer.
– Nikita Kipriyanov
Nov 18 at 4:32
