Unfortunately, I had a WHM/cPanel disk crash after the following sequence of events:
- Backup VM using PBS
- Experience freeze
- Press stop in the backup user interface
- QM Unlock VM
- Stop VM
After that, the server never started again. The last output of the VM showed it stuck on this line:
[ 2.639060] Btrfs loaded, crc32c=crc32c-generic
It never came back up.
The disks are SSDs in a RAID 5 array.
On the Proxmox Hypervisor I got this:
hv7 login: [39419.081982] megaraid_sas 0000:17:00.0: FW in FAULT state Fault code:0x10000 subcode:0x0 func:megasas_wait_for_outstanding_fusion
I’ve captured this additional output:
[1402256.673421] INFO: task loop6:574 blocked for more than 120 seconds.
[1402256.674685] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.675962] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.678842] INFO: task WTCheck.tThread:1419 blocked for more than 120 seconds.
[1402256.680226] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.681250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.682823] INFO: task JournalFlusher:1421 blocked for more than 120 seconds.
[1402256.684193] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.685123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.686440] INFO: task tailwatchd:1143 blocked for more than 120 seconds.
[1402256.687535] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.688544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.689930] INFO: task queueprocd - pr:50043 blocked for more than 120 seconds.
[1402256.691117] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.692095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.693421] INFO: task php-fpm:54593 blocked for more than 120 seconds.
[1402256.694497] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.695441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.696783] INFO: task cPhulkd - dbpro:56688 blocked for more than 120 seconds.
[1402256.698018] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.698911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.700163] INFO: task nscd:56700 blocked for more than 120 seconds.
[1402256.701175] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.702135] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.703448] INFO: task nscd:56701 blocked for more than 120 seconds.
[1402256.704554] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.705718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1402256.707043] INFO: task nscd:56702 blocked for more than 120 seconds.
[1402256.708133] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.709067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
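The hung-task output above is long and repetitive, and what matters is which processes were blocked on I/O. A minimal sketch for pulling that list out of a saved console log could look like the following. The helper name blocked_tasks and the shortened sample text are my own for illustration; the regular expression only assumes the "INFO: task NAME:PID blocked for more than N seconds" wording seen in the output above.

```python
import re

# Matches the first line of each hung-task report, e.g.
# "INFO: task nscd:56700 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds"
)


def blocked_tasks(log_text):
    """Return (command, pid, seconds) for every hung-task report found."""
    return [
        (comm, int(pid), int(secs))
        for comm, pid, secs in HUNG_TASK.findall(log_text)
    ]


# Shortened excerpt of the console output, used only as sample input.
sample = """\
[1402256.673421] INFO: task loop6:574 blocked for more than 120 seconds.
[1402256.674685] Not tainted 5.4.0-174-generic #193-Ubuntu
[1402256.689930] INFO: task queueprocd - pr:50043 blocked for more than 120 seconds.
[1402256.700163] INFO: task nscd:56700 blocked for more than 120 seconds.
"""

for comm, pid, secs in blocked_tasks(sample):
    print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

Because the pattern does not depend on line breaks, it should also work on a log that was pasted as one long line.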
Additional output retrieved at various points during the crash:
qemu-img: iSCSI GET_LBA_STATUS failed at lba 37748727: SENSE KEY:ILLEGAL_REQUEST(5) ASCQ:INVALID_FIELD_IN_CDB(0x2400)
The final output, which may indicate how to repair the disk, is this:
Begin: Running /scripts/local-premount ... [ 2.646415] Btrfs loaded, crc32c=crc32c-generic
Scanning for Btrfs filesystems
done.
Begin: Will now check root file system ... fsck from util-linux 2.34
[/usr/sbin/fsck.ext4 (1) -- /dev/sda1] fsck.ext4 -a -C0 /dev/sda1
cloudimg-rootfs: Superblock needs_recovery flag is clear, but journal has data.
cloudimg-rootfs: Run journal anyway
cloudimg-rootfs: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 4
done.
Failure: File system check of the root filesystem failed
The root filesystem on /dev/sda1 requires a manual fsck
BusyBox v1.30.1 (Ubuntu 1:1.30.1-4ubuntu6.4) built-in shell (ash)
Enter 'help' for a list of built-in commands.
(initramfs)
Connection to 41.72.151.242 closed.
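The "status code 4" in that output is a bit mask, not a plain error number. As a rough sketch, assuming the exit-status values listed in the fsck(8) manual page, it can be decoded like this. The helper name decode_fsck_status is hypothetical and not part of any tool.

```python
# Exit-status bits as listed in the fsck(8) manual page.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the meanings of all bits set in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [
        meaning
        for bit, meaning in FSCK_EXIT_BITS.items()
        if status & bit
    ]


# The status reported in the boot output above.
print(decode_fsck_status(4))
```

For status 4 only the "errors left uncorrected" bit is set, which matches the message that a manual fsck (without -a or -p) is required.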
If you’re inexperienced, doing disk repair under high pressure is really hard.
The crash happened over a weekend and I opted to restore from backup.
Restoring the backups took two and a half days and was performed by two people. It was not much fun at all.
Conclusion
Don’t use PBS with cPanel and qemu-guest-agent