It happens: you just lost a disk on your RAID5 MD array, or things don’t look the way they should… How do we troubleshoot this?
First things first: what’s the name of your MD device? You can easily find out by issuing:
cat /proc/mdstat
This should output something similar to:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd sda sdb
      2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/4] [.AAA]
      bitmap: 0/8 pages [0KB], 65536KB chunk
Here we have an MD device, /dev/md0. (If you don’t see a response to this, you might have lost your MD device, which could be a bigger issue!)
Another thing that we see (or rather don’t see) here is that sda, sdb and sdd are in the array but sdc is nowhere to be found! This is our problem.
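If you would rather not eyeball the output, the same check can be scripted. The following is a minimal sketch, not a replacement for reading the status yourself: `degraded_arrays` is a hypothetical helper name, and it assumes the status line carries a bracketed pair of member counters (such as `[4/3]`), flagging any array where the two numbers differ.

```shell
#!/bin/sh
# Print the name of every MD array whose two bracketed member counters differ,
# i.e. arrays running with fewer members than configured.
# Reads mdstat-style text from the file given as $1 (default: /proc/mdstat).
# NOTE: degraded_arrays is an illustrative helper, not part of mdadm.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }    # remember which array we are inside
        name != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
            # strip the brackets and split "x/y" into its two counters
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[1] != n[2]) print name
            name = ""                  # only the first counter pair per array
        }
    ' "${1:-/proc/mdstat}"
}

# Usage (on the affected machine):
#   degraded_arrays
```

An empty result means every array found reported matching counters; it does not prove the disks themselves are healthy.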
For some reason /dev/sdc is not in the RAID group anymore. Let’s see what’s going on with it:
mdadm --examine /dev/sdc
In my case this hung for a long time. When I ran dmesg on another console, I saw a lot of I/O errors for this disk, which tells me that the disk is malfunctioning.
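To put a number on "a lot of I/O errors", the kernel log can be filtered for the suspect disk. This is a rough sketch: `io_error_count` is a made-up helper, and the pattern assumes the kernel words these messages as `I/O error, dev sdc, sector N`, which is common but may differ between kernel versions.

```shell
#!/bin/sh
# Count kernel log lines that report an I/O error for one disk.
# Reads log text on stdin; $1 is the bare device name, e.g. sdc.
# NOTE: the message wording matched here is an assumption about your kernel.
io_error_count() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps that case from looking like a failure to the caller.
    grep -c "I/O error, dev $1," || true
}

# Usage (on the affected machine):
#   dmesg | io_error_count sdc
```

A count of zero only means nothing matched this particular pattern, so it is still worth skimming dmesg by hand.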
I shut down the server and wiggled the disk. After a reboot it was back online. My array now has four disks, however only three of them are “functioning”, since after the reboot MD kicked /dev/sdc out of the array.
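One way to see why MD no longer trusts the disk is to compare the event counter that `mdadm --examine` reports for each member: a disk that dropped out usually shows a lower number than the others. The helper below is a small sketch under that assumption; `member_events` is a hypothetical name, and it expects the `Events : N` line that mdadm prints for version 1.x superblocks.

```shell
#!/bin/sh
# Print the event counter found in `mdadm --examine` output read on stdin.
# NOTE: member_events is an illustrative helper, not an mdadm feature.
member_events() {
    # Fields are separated by " : "; print the value on the Events line.
    awk -F' : ' '/^[[:space:]]*Events[[:space:]]*:/ { print $2; exit }'
}

# Usage (on the affected machine), one line per member:
#   for d in /dev/sd[abcd]; do
#       printf '%s %s\n' "$d" "$(mdadm --examine "$d" | member_events)"
#   done
```

If one member trails the rest, that is the stale disk MD refused to include.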
We need to reassemble the array and let RAID5 do its magic. First stop the MD device:
mdadm --stop /dev/md0
Then we need to add /dev/sdc back into the array:
mdadm /dev/md0 -a /dev/sdc
Then, depending on the situation (for example, if the array is not running and the add is refused), we might need to reassemble the array with all four members:
mdadm --assemble /dev/md0 /dev/sd[abcd] --verbose --force
/dev/sdc is now back in your array. This should start a long(er) process to sync the array state to all disks, and hopefully you now have your array back!
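While the rebuild runs, /proc/mdstat shows a progress line for the array. The sketch below pulls just the percentage out of it; `sync_progress` is a hypothetical helper, and it assumes the progress line contains text of the form `recovery = 12.6%` (or `resync = ...`), which is how the kernel usually reports it.

```shell
#!/bin/sh
# Print the rebuild progress ("recovery = NN.N%" or "resync = NN.N%") found in
# mdstat-style text. $1 is the file to read (default: /proc/mdstat).
# NOTE: sync_progress is an illustrative helper; grep -o is a common extension
# (GNU and BSD grep) rather than strict POSIX.
sync_progress() {
    # Prints nothing when no rebuild is in progress.
    grep -oE '(recovery|resync) *= *[0-9.]+%' "${1:-/proc/mdstat}" || true
}

# Usage (on the affected machine), polling once a minute:
#   while [ -n "$(sync_progress)" ]; do sync_progress; sleep 60; done
```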
After the sync completes, I would still run fsck on the filesystem:
$ fsck.ext4 /dev/md0
e2fsck 1.45.5 (07-Jan-2020)
data: recovering journal
JBD2: Invalid checksum recovering block 185073680 in log
JBD2: Invalid checksum recovering block 89 in log
Journal checksum error found in data
data was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (704006059, counted=696320594).
Fix? yes
Free inodes count wrong (182701042, counted=182694547).
Fix? yes
data: FILE SYSTEM WAS MODIFIED
data: 429421/183123968 files (0.2% non-contiguous), 36152110/732472704 blocks
You can use these same steps (or similar ones) to remove /dev/sdc and replace it with a brand new hard drive. In my case wiggling solved the problem for now. (I will probably need a new drive in the near future.)
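For the replacement case, the sequence would look roughly like the sketch below. It only prints the commands so you can review them before running anything. `replacement_plan` is a made-up helper, and it assumes the old disk is still listed as a member of the array; if MD has already dropped it, as in my case, the first command is probably unnecessary.

```shell
#!/bin/sh
# Print (do NOT run) the mdadm commands for swapping out one array member.
#   $1 = array, $2 = failing member,
#   $3 = replacement (defaults to $2, i.e. the new disk gets the same name)
# NOTE: replacement_plan is an illustrative helper; review its output first.
replacement_plan() {
    md=$1
    old=$2
    new=${3:-$2}
    # Mark the old disk as failed and detach it from the array.
    echo "mdadm $md --fail $old --remove $old"
    # After physically swapping the drive, add the new one; MD then rebuilds.
    echo "mdadm $md --add $new"
}

# Usage:
#   replacement_plan /dev/md0 /dev/sdc
```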
I hope this helped someone. It surely will help me when I have to do this again 😛