Anatomy of a disk crash caused by freezing / unlocking WHM/cPanel and stopping a Proxmox VM

Unfortunately, I had a WHM/cPanel disk crash after the following sequence of events:

  1. Back up the VM using PBS (Proxmox Backup Server)
  2. The backup freezes
  3. Press Stop in the backup user interface
  4. Unlock the VM with qm unlock
  5. Stop the VM

After that, the server never started again. The VM console was stuck on this last line:

[   2.639060] Btrfs loaded, crc32c=crc32c-generic

It never came back up.

The disks are SSDs in a RAID 5 array.

On the Proxmox hypervisor I got this:

hv7 login: [39419.081982] megaraid_sas 0000:17:00.0: FW in FAULT state Fault code:0x10000 subcode:0x0 func:megasas_wait_for_outstanding_fusion

I’ve captured this additional output:

[1402256.673421] INFO: task loop6:574 blocked for more than 120 seconds.
[1402256.674685]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.675962] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.678842] INFO: task WTCheck.tThread:1419 blocked for more than 120 seconds.
[1402256.680226]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.681250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.682823] INFO: task JournalFlusher:1421 blocked for more than 120 seconds.
[1402256.684193]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.685123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.686440] INFO: task tailwatchd:1143 blocked for more than 120 seconds.
[1402256.687535]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.688544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.689930] INFO: task queueprocd - pr:50043 blocked for more than 120 seconds.
[1402256.691117]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.692095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.693421] INFO: task php-fpm:54593 blocked for more than 120 seconds.
[1402256.694497]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.695441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.696783] INFO: task cPhulkd - dbpro:56688 blocked for more than 120 seconds.
[1402256.698018]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.698911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.700163] INFO: task nscd:56700 blocked for more than 120 seconds.
[1402256.701175]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.702135] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.703448] INFO: task nscd:56701 blocked for more than 120 seconds.
[1402256.704554]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.705718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.707043] INFO: task nscd:56702 blocked for more than 120 seconds.
[1402256.708133]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.709067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
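
The hung-task messages above repeat the same three lines for every blocked process, so it helps to condense them into one line per task name. Below is a minimal Python sketch, assuming the kernel log is available as plain text (for example, saved from the console). The function name summarize_hung_tasks and the sample string are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Matches kernel hung-task lines such as:
# [1402256.673421] INFO: task loop6:574 blocked for more than 120 seconds.
# The task name may contain spaces and dots, so it is captured greedily
# up to the last ":<pid>" before " blocked".
HUNG_TASK = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def summarize_hung_tasks(log_text):
    """Return a Counter mapping task name to the number of hung-task reports."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HUNG_TASK.match(line.strip())
        if match:
            counts[match.group("name")] += 1
    return counts


# A few lines copied from the output above, used as sample input.
sample = """\
[1402256.673421] INFO: task loop6:574 blocked for more than 120 seconds.
[1402256.674685]       Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.689930] INFO: task queueprocd - pr:50043 blocked for more than 120 seconds.
[1402256.700163] INFO: task nscd:56700 blocked for more than 120 seconds.
[1402256.703448] INFO: task nscd:56701 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for name, count in summarize_hung_tasks(sample).most_common():
        print(f"{name}: {count}")
```

When many unrelated processes (here cPanel daemons, php-fpm, nscd and a loop device) are all blocked at the same time, the common factor is usually the storage underneath them rather than any single service.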

Additional output retrieved at various points during the crash:

qemu-img: iSCSI GET_LBA_STATUS failed at lba 37748727: SENSE KEY:ILLEGAL_REQUEST(5) ASCQ:INVALID_FIELD_IN_CDB(0x2400)

The final output, which could indicate how to repair the disk, is this:

Begin: Running /scripts/local-premount ... [    2.646415] Btrfs loaded, crc32c=crc32c-generic
Scanning for Btrfs filesystems
done.
Begin: Will now check root file system ... fsck from util-linux 2.34
[/usr/sbin/fsck.ext4 (1) -- /dev/sda1] fsck.ext4 -a -C0 /dev/sda1
cloudimg-rootfs: Superblock needs_recovery flag is clear, but journal has data.
cloudimg-rootfs: Run journal anyway

cloudimg-rootfs: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)
fsck exited with status code 4
done.
Failure: File system check of the root filesystem failed
The root filesystem on /dev/sda1 requires a manual fsck

BusyBox v1.30.1 (Ubuntu 1:1.30.1-4ubuntu6.4) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) Connection to 41.72.151.242 closed.
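
The line "fsck exited with status code 4" is worth decoding. According to the fsck man page, the exit status is a sum of bit flags, and 4 means that filesystem errors were left uncorrected. That matches the log's own hint to run the check again manually on /dev/sda1 without the -a or -p options. Below is a minimal Python sketch of that decoding; the names FSCK_EXIT_BITS and decode_fsck_status are hypothetical.

```python
# Bit flags of the fsck exit status, as documented in the fsck(8) man page.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Translate an fsck exit status into a list of human-readable meanings."""
    if status == 0:
        return ["no errors"]
    # The status is a bitmask, so more than one condition can be set at once.
    return [meaning for bit, meaning in FSCK_EXIT_BITS.items() if status & bit]


if __name__ == "__main__":
    # Status taken from the boot log above.
    print(decode_fsck_status(4))
```

In this case the automatic check refused to make changes on its own, so the filesystem stayed unrepaired and the boot dropped to the initramfs shell.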

If you’re inexperienced, doing disk repair under high pressure is really hard.

The crash happened over a weekend and I opted to restore from backup.

Restoring the backups took two and a half days and was performed by two people. It was not much fun at all.

Conclusion

Don’t use PBS with cPanel and qemu-guest-agent.

 
