Adventures in rebuilding an Intel Rapid Storage Technology enterprise mirror on a Supermicro when one disk is problematic

An adventure with mdadm on a broken hard drive

A Supermicro server started acting up. The problem was evident on startup: a lot of “wait” time for the RAID to initialize, not really succeeding, and eventually timing out. Quite a number of times emergency mode was invoked. Sometimes leaving the system alone for a long while meant it kind of rebuilt. The symptoms on startup were something like:

A start job is running for dev-md124p1.device (30s / 1min 30s).

Running smartctl on the four drives in the system was also interesting: three disks were rather snappy and the fourth always lagged. One could also see with cat /proc/mdstat that the rebuild speed of the one mirror was excruciatingly slow.
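The "one drive always lagged" observation can be made a bit less hand-wavy by timing the same query against each drive. This is a minimal sketch, not what we ran at the time: probe_seconds is a made-up helper name, the timing is only whole seconds, and the device names in the example are the ones from this system.

```shell
# Sketch: time one command per drive and print "<device> <seconds>".
# probe_seconds is a hypothetical helper name, not part of smartmontools.
probe_seconds() {
    dev=$1
    shift
    start=$(date +%s)
    # Run the given command with the device appended. Output is discarded
    # and failures are ignored, since only the elapsed time matters here.
    "$@" "$dev" > /dev/null 2>&1 || true
    end=$(date +%s)
    echo "$dev $((end - start))"
}

# Example (needs root and smartmontools installed):
# for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
#     probe_seconds "$dev" smartctl -a
# done
```

A drive that consistently takes many seconds longer than its siblings is the one to look at first.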

I don’t think there is an exact formula for fixing a broken drive, but at least in this case having a mirror helped. However, detaching this drive from the mirror became a nightmare. Most of this adventure is about trying to detach it. In total there were 4 drives in two mirrors: Mirror A (faulty) and Mirror B (working fine). All drives are Western Digital 2.7 TB drives.

Eventually the call was made to remove this drive, first from the software configuration, and then from the system. But how? How to not get drowned in terminology? It’s really hard to google. This is somewhat complicated stuff but for your reading pleasure we give you a transcript:

First of all, cat /proc/mdstat is rather useful. It showed the problem quite clearly: continuous rebuilding that took exceedingly long, hours on end. I could compare the healthy mirror to the broken mirror, and the speed was also an obvious problem. After a long time the mirror then actually appears fine, but rebooting starts the whole process from scratch again.

Personalities : [raid1]
md124 : active raid1 sdc[1] sdd[0]
2783756288 blocks super external:/md125/0 [2/1] [U_]
[>....................] recovery = 0.0% (197568/2783756288) finish=77145.4min speed=600K/sec

md125 : inactive sdd[1](S) sdc[0](S)
10402 blocks super external:imsm

md126 : active raid1 sda[1] sdb[0]
2783756288 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive sda[1](S) sdb[0](S)
10402 blocks super external:imsm

unused devices: <none>
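Reading this output by eye gets old quickly, so here is a minimal sketch that pulls out the two things we kept looking for: which array has a missing member, and the finish and speed figures of a running rebuild. It assumes the mdstat layout shown above; mdstat_summary is a made-up helper name, and the optional file argument exists only so it can be tried on a saved copy.

```shell
# Sketch: summarise degraded arrays and rebuild progress from mdstat.
# mdstat_summary is a hypothetical helper name; pass a file to test offline.
mdstat_summary() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9]+ :/ { name = $1 }
        # A member map such as [U_] has one underscore per missing disk.
        /\[[U_]*_[U_]*\]/ { print name ": degraded" }
        # Progress lines carry finish= and speed= fields.
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^(finish|speed)=/) print name ": " $i
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments it reads /proc/mdstat; on the output above it would report md124 as degraded along with its finish and speed values, and say nothing about md126.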

But eventually this got boring, so we moved to strike the disk. The catch was that as long as that thing said recovery, one could not break the mirror.
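To put the finish= figure from the mdstat output in perspective, the arithmetic can be redone by hand. This is a rough sketch that assumes the two numbers in the parentheses on the recovery line and the K in the speed are all in the same unit (KiB); eta_minutes is a made-up helper name, and the shell needs 64-bit arithmetic for numbers this large.

```shell
# Sketch: estimate remaining recovery time in whole minutes.
# Arguments: amount already synced, total amount, speed per second,
# all assumed to be in KiB. eta_minutes is a hypothetical helper name.
eta_minutes() {
    echo $(( ($2 - $1) / $3 / 60 ))
}

# Values taken from the recovery line of md124.
eta_minutes 197568 2783756288 600   # prints 77321
```

That is in the same ballpark as the 77145.4min mdstat reported, and either way it works out to more than 53 days at that speed.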

To examine more closely, let’s look at the two disks which are okay:

mdadm --detail /dev/md126p2
/dev/md126p2:
Container : /dev/md/imsm0, member 0
Raid Level : raid1
Array Size : 1048576 (1024.00 MiB 1073.74 MB)
Used Dev Size : 18446744073709551615
Raid Devices : 2
Total Devices : 2

State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Consistency Policy : resync


UUID : 39900a34:f3e2b971:5d553540:b110c1e7
Number Major Minor RaidDevice State
1 8 0 0 active sync /dev/sda
0 8 16 1 active sync /dev/sdb

Moving on to the mirror that is broken. Through observation we deduced that the bad disk was /dev/sdd.

mdadm --detail /dev/md124p1
/dev/md124p1:
Container : /dev/md/imsm1, member 0
Raid Level : raid1
Array Size : 2783755247 (2654.80 GiB 2850.57 GB)
Used Dev Size : 2783756288 (2654.80 GiB 2850.57 GB)
Raid Devices : 2
Total Devices : 2

State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1

Consistency Policy : resync

Rebuild Status : 0% complete


UUID : c6956ed2:f692b781:631f768b:8404a220
Number Major Minor RaidDevice State
1 8 32 0 active sync /dev/sdc
0 8 48 1 spare rebuilding /dev/sdd

On the surface this looks good, but since we had tried this 10 times already and seen that waiting ends in nothing but errors, let’s move on to removing this disk:

mdadm --fail /dev/md/imsm1 --remove /dev/sdd
mdadm: /dev/sdd is still in use, cannot remove.

Whilst rebuilding, you can’t remove it. You have to wait. Even if you umount it, it will still be in use.

So let’s wait. Then eventually:

mdadm --fail /dev/md/imsm1 --remove /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md/imsm1

Now that’s hot. Okay, with most of the heavy terminology out of the way: it’s removed, now what?

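Please note that the removal above targets imsm1 rather than md124. The mdadm --detail output earlier lists /dev/md/imsm1 as the container of which the broken mirror is member 0, and with this kind of external metadata the disks belong to the parent container, so that is where they get removed.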
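The container relationship is also visible in /proc/mdstat: the md124 entry says super external:/md125/0, which we read as "metadata lives in container md125, member 0". A minimal sketch for pulling that out, assuming the mdstat layout shown earlier in this post. container_of is a made-up helper name, and that /dev/md/imsm1 is simply another name for that md125 node is our assumption, not something we verified.

```shell
# Sketch: print the container an array belongs to, based on the
# "super external:/<container>/<member>" field in mdstat.
# container_of is a hypothetical helper name; $2 is an optional file.
container_of() {
    awk -v want="$1" '
        # Track which array the current lines describe.
        /^md[0-9]+ :/ { name = $1 }
        name == want {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^external:\//) {
                    # external:/md125/0 splits into external:, md125, 0
                    split($i, parts, "/")
                    print parts[2]
                    exit
                }
        }
    ' "${2:-/proc/mdstat}"
}

# Example: container_of md124
```

For a container itself (an entry that says super external:imsm) it prints nothing, since there is no parent to report.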

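Once it’s removed, it’s not really removed. You have more work to do. The information we found says you have to wipe the leftover signatures. This is what we ran; since the disk removed above was /dev/sdd and /dev/sdc is the healthy half of the mirror, double-check the device name before copying it: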

wipefs -a /dev/sdc1

This doesn’t give any output.

Now our “mirror” has one drive. How do we tell it about that? This is supposed to reduce the array from 2 disks to 1, but it did not succeed:

mdadm --grow /dev/md124 --raid-devices=1
mdadm: '1' is an unusual number of drives for an array, so it is probably
a mistake. If you really mean it you will need to specify --force before
setting the number of drives.

For now just a reboot to observe behavior.

What if it’s still in use?

We tried all of the below to stop the rebuild and recovery on the disk, but none of it worked:

echo "idle" > /sys/block/md124/md/sync_action

That didn’t work, so we tried this:

echo 0 > /proc/sys/dev/raid/speed_limit_max

Much more googling, then this (shown as we found it, written for md0; the array on this system is md124):

echo frozen > /sys/block/md0/md/sync_action
echo none > /sys/block/md0/md/resync_start
echo idle > /sys/block/md0/md/sync_action

After all of this, cat /proc/mdstat still showed recovery, and the disk was apparently still in use. It appears that once a broken disk is “locked” by the operating system, it’s a real nightmare to remove. We had to wait it out.
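Since waiting was the only thing that worked, the waiting can at least be automated. A minimal sketch, assuming the sync_action file used above reads idle once the array has stopped recovering; wait_until_idle is a made-up helper name, and the second and third arguments (an alternative file and a polling interval) are there only to make it easy to try out.

```shell
# Sketch: block until an array's sync_action reads "idle".
# wait_until_idle is a hypothetical helper name.
# $1 = array name, $2 = optional path override, $3 = optional interval (s).
wait_until_idle() {
    action_file=${2:-/sys/block/$1/md/sync_action}
    interval=${3:-60}
    # Bail out instead of looping forever if the file is not there.
    [ -r "$action_file" ] || return 1
    while [ "$(cat "$action_file")" != "idle" ]; do
        sleep "$interval"
    done
}

# Example: wait_until_idle md124 && echo "md124 is idle, try the removal now"
```

Given the rebuild speeds we saw, expect this to sit there for a very long time.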

Eventually this worked again and we could remove it:

mdadm --fail /dev/md/imsm1 --remove /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md/imsm1

At this point it’s not a member anymore, but rebooting makes it a member again. So you have to zero out some bits as described above and do more work.
