Proxmox Cluster Quick Start

Running a Cluster over a WAN

You shouldn’t. Corosync needs latency in the 5 to 10 ms range and is highly sensitive to jitter. We tried running it over a WAN with around 20 ms of latency, and this is what shows up in the syslog:

Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 98
Nov 21 13:26:33 host corosync[2269607]: [KNET ] link: host: 4 link: 0 is down
Nov 21 13:26:33 host corosync[2269607]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 21 13:26:33 host corosync[2269607]: [KNET ] host: host: 4 has no active links
Nov 21 13:26:35 host corosync[2269607]: [KNET ] rx: host: 4 link: 0 is up

Unfortunately you then lose a lot of functionality if you have multiple data centres. You’ll lose high availability, migrations between nodes, and possibly the ability to use PBS across one cluster. In our case PBS worked well for a month, but when the cluster became unreliable things broke down miserably.

So yes, it’s possible. But the documentation makes it clear it’s better to have a dedicated network for Corosync, which is the underlying synchronization protocol.
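
If you do have a spare NIC for cluster traffic, the dedicated network can be given to Corosync when the cluster is created. A minimal sketch, assuming a second interface reserved for Corosync and hypothetical 10.10.10.x addresses:

# On the first node: create the cluster with Corosync bound to the dedicated network
pvecm create mycluster --link0 10.10.10.1

# On each additional node: join via an existing member, binding this node's
# Corosync traffic to its own dedicated address
pvecm add 10.10.10.1 --link0 10.10.10.2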

Removing Cluster Nodes That Still Show in the UI

You need to remove the whole directory /etc/pve/nodes/<nodename>.
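
For example, a sketch run from a node that is still part of the cluster (the node name is whatever leftover entry you still see in the UI):

rm -rf /etc/pve/nodes/<nodename>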

Useful Troubleshooting Commands

Change Quorum Quantity

Change the expected quorum vote count (super risky, so read about the risks first):

pvecm expected X
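
For example, if one node of a two-node cluster is down and you need the surviving node to become writable again, a sketch (the value 1 only makes sense in that specific situation):

pvecm expected 1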

Status

pvecm status

Deleting a node

pvecm delnode <id>

The documentation refers to names, but you can just use the ID that the pvecm status command gives you.
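
A sketch of the whole sequence, assuming the departing node has ID 4 and the (hypothetical) name pve4, run from a node that stays in the cluster after the departing node is powered off:

pvecm status                 # note the nodeid of the node to remove
pvecm delnode 4              # by ID, as noted above; the docs use the node name
rm -rf /etc/pve/nodes/pve4   # clean up the leftover UI entry (see earlier section)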

Upgrading a Cluster

Note 1: You can’t upgrade Proxmox if your cluster isn’t healthy.
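
The upgrade checklist script will flag an unhealthy cluster (among other things) before you commit to anything; a sketch, assuming the pve7to8 checker shipped with your 7.4 packages:

pve7to8 --full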

Note 2: Multicast versus unicast and knet

We had multicast issues over a WAN and decided to try switching to unicast via the following change:

cat /etc/corosync/corosync.conf
...
totem {
  cluster_name: cluster
  config_version: 7
  transport: udpu
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
}
...

Then, during the upgrade from 7 to 8, we got this warning:

Checking totem settings..
FAIL: Corosync transport explicitly set to 'udpu' instead of implicit default!

Fixing Corosync.conf

This is tricky because the file lives on the synchronized cluster filesystem, so your edit can be overwritten very quickly and you can easily break the cluster.

Here is the part of the manual that talks about it:
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_configuration

  • Make a copy of /etc/pve/corosync.conf
  • Do your editing on the copy
  • Then make a backup of the original
  • Then move the copy over the original

Once you’re ready for the risky part, this is the sequence:

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
$EDITOR /etc/pve/corosync.conf.new   # make your changes here and bump config_version
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
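
As a concrete example, here is roughly what the edited copy’s totem block could look like to clear the udpu warning from the upgrade check: the explicit transport line is removed so Corosync falls back to its knet default, and config_version is incremented so the change propagates. A sketch only; your other values will differ:

totem {
  cluster_name: cluster
  # config_version bumped from 7 so the cluster picks up the change
  config_version: 8
  # no explicit "transport:" line, which lets Corosync use its default (knet)
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
}
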
Service Commands

systemctl status corosync.service
journalctl -b -u corosync
systemctl status pve-cluster.service

Other Caveats

  1. You probably shouldn’t mix and match Proxmox 7.4 and PBS 8.x.
  2. It seems you have to remove your VMs when you disjoin from the cluster. You back them up first and take them offline, of course (see the sketch after this list), but this is a big problem for most production environments.
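
For point 2, a sketch of the backup half of that workflow on the node being removed, assuming a guest with VMID 101 and a storage named local that accepts backups:

qm shutdown 101
vzdump 101 --storage local --mode stop --compress zstd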

Cluster Join Errors

Joining a Proxmox Cluster is trivial. But what if you can’t? Here are two common errors:

detected the following error(s):
* this host already contains virtual guests
* local node address: cannot use IP '192.168.100.2', not found on local node!

Also please read the official Proxmox material as their documentation is really good:
https://pve.proxmox.com/wiki/Cluster_Manager

  • For a reliable quorum you need at least three nodes
  • Changing the hostname and IP is not possible after cluster creation

Once you have this working, you’ll see this output instead of more errors:

Establishing API connection with host '192.168.1.24'
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '192.168.1.21'
Request addition of this node

At this point you might see “Connection error” in the web UI (the node’s certificates change during the join, which breaks the current session), but when you go to another node you’ll see that the new node has joined the cluster.

Fixing "this host already contains virtual guests"

You can’t join a cluster when you have existing virtual machines. You have to back them up first and you’ll have to restore them after you’ve joined the cluster.

Important note: If you’ve used the built-in Datacenter => Backup facility of Proxmox, you will find after joining the cluster that your backup job configuration is gone. Simply use the following command line to restore the VMs:

SSH to the Proxmox VE host, then:

cd /var/lib/vz/dump
qmrestore vzdump-qemu-101-2023_10_16-11_05_33.vma.zst 101 -storage local-lvm
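
If several guests need restoring, a small loop saves typing. A sketch that assumes the default vzdump-qemu-<vmid>-<timestamp> file naming and restores every dump in the directory to local-lvm:

cd /var/lib/vz/dump
for f in vzdump-qemu-*.vma.zst; do
  vmid=$(echo "$f" | cut -d- -f3)   # the third dash-separated field is the VMID
  qmrestore "$f" "$vmid" -storage local-lvm
done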

Fixing local node address: cannot use IP 'a.b.c.d', not found on local node

This will happen if you’ve changed the IP address of your Proxmox hypervisor.

When changing the IP of the box, it seems you also have to manually update the /etc/hosts entry:
https://forum.proxmox.com/threads/adding-node-to-cluster-local-node-address-cannot-use-ip-error.57100/
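
The entry in question maps the node’s hostname to its IP. A sketch, assuming a hypothetical node named pve1 that has moved to 192.168.100.2:

# /etc/hosts -- the hostname must resolve to the node's new address
127.0.0.1       localhost
192.168.100.2   pve1.example.local pve1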

 
