Proxmox Cluster Quick Start

Running a Cluster over a WAN

You shouldn’t. Corosync needs 5 to 10 ms latency and is highly sensitive to jitter. We tried running it over a WAN with roughly 20 ms of latency, and this is what showed up in the syslog:

Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 94 96
Nov 21 13:26:32 host corosync[2269607]: [TOTEM ] Retransmit List: 98
Nov 21 13:26:33 host corosync[2269607]: [KNET ] link: host: 4 link: 0 is down
Nov 21 13:26:33 host corosync[2269607]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 21 13:26:33 host corosync[2269607]: [KNET ] host: host: 4 has no active links
Nov 21 13:26:35 host corosync[2269607]: [KNET ] rx: host: 4 link: 0 is up

Unfortunately, if you have multiple data centres, you now lose a lot of functionality: high availability, migrations between nodes, and possibly the ability to use PBS across a single cluster. In our case PBS worked well for a month, but when the cluster became unreliable things broke down miserably.

So yes, it’s technically possible, but the documentation makes it clear that it’s better to have a dedicated, low-latency network for Corosync, the underlying synchronization protocol.
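If you do have a spare network, a dedicated Corosync link can be configured in /etc/pve/corosync.conf. The nodelist fragment below is only a sketch: the node name and both IP addresses are placeholders, with ring0_addr on the dedicated network and ring1_addr acting as a fallback link over the regular LAN:

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 192.168.1.21
  }
}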

Removing Cluster Nodes That Still Appear in the UI

You need to remove the whole directory /etc/pve/nodes/<nodename>.
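For example, run this on one of the remaining cluster nodes (the pmxcfs cluster filesystem replicates the deletion to the other nodes):

rm -r /etc/pve/nodes/<nodename>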

There are other tricks too, but be careful about whether a given command expects the node name, the node ID, or something else. Example:

pvecm status shows the node is gone:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 a.b.c.1 (local)
0x00000003 1 a.b.c.2
0x00000004 1 a.b.c.3
0x00000005 1 a.b.c.4
0x00000006 1 a.b.c.5
0x00000007 1 a.b.c.6

root@node:~# pvecm delnode by_ip_address
400 Parameter verification failed.
node: invalid format - value does not look like a valid node name

pvecm delnode <node>

root@node:~# ls /etc/pve/nodes/
name1/ name2/ name3/ name4/ name5/ name6/ deleted-name/
root@node:~# pvecm delnode deleted-name
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 1

root@node:~# pvecm delnode deleted-name
Node/IP: deleted-name is not a known host of the cluster.

Useful Troubleshooting Commands

Change Expected Votes

Change the expected number of votes (this is risky, so read about the implications first):

pvecm expected X
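As an example of why you might do this: if so many nodes are down that the survivors can no longer reach quorum, /etc/pve becomes read-only. On a single surviving node you could force it writable again with (again, at your own risk):

pvecm expected 1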

Status

pvecm status

Deleting a node

Official documentation: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node

The official documentation clearly states that a node “must be powered off” before removal, but you may find such nodes still lingering in your user interface. That might mean delnode didn’t work properly, or you simply have to refresh your browser.

The command to delete a node is delnode, but when do you do it? There are at least these scenarios:

  1. The node has died. You will never bring it up again. Perhaps the hardware broke or the hardware was removed or recommissioned.
  2. The node has died. You’re panicking and you want to bring it up but you’re concerned about your cluster being inconsistent.
  3. You have properly decommissioned the node, and you’ve switched it off.
  4. You have properly decommissioned the node, but you haven’t switched it off yet.

In the next section, we’ll only cover #4, the best case scenario.

Properly Decommissioning a Proxmox Cluster Node

As per the documentation:

Log in to a different cluster node (not hp4), and issue a pvecm nodes command to identify the node ID to remove:

Always make sure you have IPMI access before taking this action, because you might still need to switch the host back on. Also note that once you commit to this procedure, *do not switch the node on again*! This is a one-way street, mostly.

Here are the full steps, 1 to 6, with a command sketch after the list.

  1. Ensure the node you want to remove is still on
  2. Ensure there are no more VMs on the node.
  3. Log into a different node and do this: pvecm nodes
  4. Make a note of the node name (3rd column)
  5. Power down the node you want to remove.
  6. Run pvecm delnode <name>
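Put together, and assuming the node to remove is called nodeX (a placeholder name), the sequence looks roughly like this:

pvecm nodes              # run on a node that stays; note the Name column
# power off nodeX, e.g. via IPMI
pvecm delnode nodeX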

The official documentation refers to the command below, but we couldn’t get it to work:

pvecm delnode hvX
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 3

The documentation refers to node names, but you can also use the node ID that the pvecm status command gives you. Also make sure you read the official documentation about removing a node.

Upgrading a Cluster

Note 1: You can’t upgrade Proxmox if your cluster isn’t healthy.
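For the 7 to 8 upgrade discussed below, the pve7to8 checklist script is a quick way to check whether the cluster (and the rest of the node) is healthy enough to proceed:

pve7to8 --full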

Note 2: Multicast versus unicast and knet

We had multicast issues over a WAN and decided to try unicast by setting the following transport option:

cat /etc/corosync/corosync.conf
...
totem {
  cluster_name: cluster
  config_version: 7
  transport: udpu
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
...

Then during upgrade from 7 to 8, we had this warning:

Checking totem settings..
FAIL: Corosync transport explicitly set to 'udpu' instead of implicit default!
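The warning comes from the explicit transport line. Below is a sketch of the same totem section with the transport reverted to the implicit default (knet on corosync 3); note the bumped config_version, and roll the change out carefully as described in the next section:

totem {
  cluster_name: cluster
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
}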

Fixing corosync.conf

This is tricky because any change is propagated to the whole cluster almost immediately, and you can easily break the cluster.

Here is the part of the manual that talks about it:
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_configuration

  • Make a copy of the live config
  • Do your edits on the copy
  • Take another backup of the live config
  • Move the copy over the live config

Now that you’re ready for this high-risk work, here are the commands:

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
edit /etc/pve/corosync.conf.new (and remember to increment config_version)
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

Service Commands

systemctl status corosync.service
journalctl -b -u corosync
systemctl status pve-cluster.service

Other Caveats

  1. You probably shouldn’t mix and match Proxmox 7.4 and PBS 8.x.
  2. It seems you have to remove your VMs when a node leaves the cluster. You can back them up first and take them offline, but this is a big problem for most production environments.

Cluster Error about existing virtual machine guests

Joining a Proxmox Cluster is trivial. But what if you can’t? Here are two common errors:

detected the following error(s):
* this host already contains virtual guests
* local node address: cannot use IP ‘192.168.100.2’, not found on local node!

Also please read the official Proxmox material as their documentation is really good:
https://pve.proxmox.com/wiki/Cluster_Manager

  • To get a reliable quorum you need at least three nodes
  • Changing the hostname and IP is not possible after cluster creation

Once you have this working, you’ll see this output instead of more errors:

Establishing API connection with host '192.168.1.24'
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '192.168.1.21'
Request addition of this node

At this point you might see a “Connection error” message, but when you go to another node you’ll see that the new node has actually joined the cluster.

Fixing this host already contains virtual guests

You can’t join a cluster when you have existing virtual machines. You have to back them up first and you’ll have to restore them after you’ve joined the cluster.
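A sketch of how that might look for a single VM, assuming VM ID 101 and a local storage called local (adjust both to your environment), run before joining the cluster:

vzdump 101 --storage local --mode stop --compress zstd
qm destroy 101        # only after verifying the backup file exists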

Important note: If you’ve used the built-in Datacenter => Backup facility of Proxmox, you will find after joining the cluster that your backup job configuration is gone. Simply use the following command line to restore the VMs:

SSH to the Proxmox VE host:

cd /var/lib/vz/dump
qmrestore vzdump-qemu-101-2023_10_16-11_05_33.vma.zst 101 -storage local-lvm
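If there are several dumps to bring back, a small loop can handle them; this is only a sketch and assumes the VM ID is the third dash-separated field of the file name and the same local-lvm target storage as in the example above:

cd /var/lib/vz/dump
for f in vzdump-qemu-*.vma.zst; do
  vmid=$(echo "$f" | cut -d- -f3)
  qmrestore "$f" "$vmid" -storage local-lvm
done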

Fixing local node address: cannot use IP ‘a.b.c.d’, not found on local node

This will happen if you’ve changed the IP address of your Proxmox hypervisor.

When changing the IP of the box it seems that you have to manually change the /etc/hosts entry as well:
https://forum.proxmox.com/threads/adding-node-to-cluster-local-node-address-cannot-use-ip-error.57100/
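As an illustration, /etc/hosts should contain an entry that resolves the node’s hostname to its new IP; the hostname pve1 and the addresses below are placeholders only:

127.0.0.1       localhost
192.168.100.2   pve1.example.lan pve1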

 
