Qemu Guest Agent on WHM Servers causes Proxmox Backup Server to never unfreeze

Scenario

You’re trying to back up a WHM/cPanel server using Proxmox Backup Server. The server is running qemu-guest-agent.

The next minute, instead of being backed up, the server freezes. You examine the log and see it’s stuck on “issuing guest-agent 'fs-freeze' command”:

INFO: starting new backup job: vzdump 123 --remove 0 --mailto [email protected] --mode snapshot --node hv07 --storage backup.example.com --notes-template '{{guestname}}'
INFO: Starting Backup of VM 123 (qemu)
INFO: Backup started at 2024-04-20 08:59:14
INFO: status = running
INFO: VM Name: whm-cpanel.example.com
INFO: include disk 'scsi0' 'local-lvm:vm-123-disk-0' 300G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/123/2024-04-20T06:59:14Z'
INFO: issuing guest-agent 'fs-freeze' command

You panic. It’s been more than 2 minutes and the server doesn’t respond to ping. You stop the backup job and get this additional output:

closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
ERROR: interrupted by signal
INFO: issuing guest-agent 'fs-thaw' command

The server doesn’t thaw. It still doesn’t respond to ping. You issue this command to remove the backup lock from the VM:

qm unlock 123

The server is still offline. You have no choice but to stop it.

Unfortunately this is a known problem with WHM/cPanel servers running qemu-guest-agent and being backed up by Proxmox Backup Server. The problem is nice and complicated and basically goes like this:

The issue is related to Qemu Guest Agent, which does not freeze the file system(s) correctly.

What is actually happening: When VM backup is invoked, the qemu-guest-agent freezes the file systems, so no single change will be made during the backup. But qemu-guest-agent does not respect the loop* devices in freezing order (we have checked its sources), which leads to the next situation:

1) freeze loopback fs —> send async reqs to loopback thread
2) freeze main fs
3) loopback thread wakes up and tries to write data to the main fs, which is still frozen, and this finally leads to a hung task and kernel crash.

Reference: https://support.cpanel.net/hc/en-us/community/posts/19161470868375
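Since the hang depends on a loop-backed file system being mounted inside the guest, it can help to check whether your server has one. The following is a minimal sketch, not an official diagnostic: the function name list_loop_mounts is made up for illustration, and it simply prints every entry in the mounts table whose source is a /dev/loopN device. On cPanel servers, /tmp is often such a mount, but treat that as an assumption to verify on your own system.

```shell
#!/bin/sh
# Hypothetical helper: list mounts whose source device is a loop device.
# Reads a mounts table in /proc/mounts format (defaults to /proc/mounts).
list_loop_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # Fields: device mountpoint fstype options dump pass
    # Print device, mount point and file system type for /dev/loopN sources.
    awk '$1 ~ /^\/dev\/loop[0-9]+$/ { print $1, $2, $3 }' "$mounts_file"
}
```

Run list_loop_mounts inside the guest with no arguments. If it prints nothing, no loop-backed file systems are currently mounted, and the freeze ordering described above should not apply.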

A number of users on the Proxmox forum have reported this problem, and they are all using CloudLinux and WHM/cPanel.

Solutions?

There are no official solutions for this problem, but you may want to consider this workaround:

  • Don’t use qemu-guest-agent

For some users this will be a problem, for others it won’t. qemu-guest-agent is really useful, and most VM hosts don’t want to switch it off.
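If you do decide to go without the agent, a rough sketch of the two steps looks like this. The function names are invented for illustration. The first wraps qm set on the Proxmox VE host, using VM ID 123 from the log above in the usage note; the second assumes a systemd-based guest where the service is called qemu-guest-agent. Both are assumptions to check against your own setup.

```shell
#!/bin/sh
# On the Proxmox VE host: turn off the guest agent option for a VM.
# $1 is the VM ID.
disable_agent_option() {
    qm set "$1" --agent 0
}

# Inside the guest (assumes systemd and a service named qemu-guest-agent):
# stop the agent now and keep it from starting at boot.
stop_agent_in_guest() {
    systemctl disable --now qemu-guest-agent
}
```

On the host you would call disable_agent_option 123, and inside the guest stop_agent_in_guest. The host-side option may only take full effect after the VM has been stopped and started again, so it is worth confirming in the next backup log that the fs-freeze line no longer appears.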

References

This forum post lists three unsanctioned fixes, including manipulating /tmp and jails:

https://forum.proxmox.com/threads/whats-the-difference-between-clone-and-move-disk-fs-freeze-gets-stuck-in-snapshot-mode-schedule-backups.107962/#post-464386

That post in turn refers to https://docs.cpanel.net/knowledge-base/security/tips-to-make-your-server-more-secure/#harden-your-tmp-partition

Other notable mentions on the forum:

This post outlines some useful fsfreeze commands. However, since your server will have crashed and qemu-guest-agent won’t be running anymore, they are pretty useless:

https://forum.proxmox.com/threads/snapshot-stopping-vm.59701/
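To give an idea of the kind of fsfreeze commands meant here (not necessarily the exact ones from that post), this sketch asks the guest agent for its freeze status from the Proxmox VE host and issues a thaw only if the answer contains "frozen". The function name thaw_if_frozen is hypothetical. It only works while the agent still responds, which is exactly what you no longer have once the guest kernel has hung.

```shell
#!/bin/sh
# Run on the Proxmox VE host. $1 is the VM ID.
# "qm guest cmd" forwards a command to the guest agent inside the VM.
thaw_if_frozen() {
    vmid="$1"
    # Ask the agent for the freeze state; give up if the agent does not answer.
    status=$(qm guest cmd "$vmid" fsfreeze-status) || return 1
    echo "fsfreeze-status: $status"
    # Only send a thaw when the reported state contains "frozen".
    case "$status" in
        *frozen*) qm guest cmd "$vmid" fsfreeze-thaw ;;
    esac
}
```

For the VM in the log above, that would be thaw_if_frozen 123. If the status query itself fails or times out, the function returns non-zero without attempting a thaw.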

More:

https://forum.proxmox.com/threads/vm-hang-during-backup-fs-freeze.80152/

https://forum.proxmox.com/threads/error-vm-100-qmp-command-guest-fsfreeze-thaw-failed-got-timeout.68082/

https://forum.proxmox.com/threads/proxmox-vm-freezes-for-real-when-issuing-guest-agent-fs-freeze.105868/

 

 
