Troubleshoot clustered pools
Citrix Hypervisor pools that use GFS2 to thin provision their shared block storage are clustered. These pools behave differently to pools that use shared file-based storage or LVM with shared block storage. As a result, there are some specific issues that might occur in Citrix Hypervisor clustered pools and GFS2 environments.
Use the following information to troubleshoot minor issues that might occur when using this feature.
All my hosts can ping each other, but I can’t create a cluster. Why?
The clustering mechanism uses specific ports. If your hosts can’t communicate on these ports (even if they can communicate on other ports), you can’t enable clustering for the pool.
Ensure that the hosts in the pool can communicate on the following ports:
- TCP: 8892, 8896, 21064
- UDP: 5404, 5405 (not multicast)
If there are any firewalls or similar between the hosts in the pool, ensure that these ports are open.
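For example, you can check the TCP ports from another host in the pool. The following is a minimal sketch, not an official Citrix diagnostic; it assumes bash in the control domain and uses 10.0.0.2 as a placeholder for the IP address of the host that you are checking:

# Probe each clustering TCP port on a peer host (replace 10.0.0.2 with
# the peer's address on the cluster network).
for port in 8892 8896 21064; do
    if timeout 3 bash -c "</dev/tcp/10.0.0.2/$port"; then
        echo "TCP $port reachable"
    else
        echo "TCP $port blocked or closed"
    fi
done

UDP ports 5404 and 5405 can't be verified this way because a UDP probe doesn't confirm delivery; inspect your firewall rules for those ports directly.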
If you have previously configured HA in the pool, disable it before enabling clustering.
Why am I getting an error when I try to join a new host to an existing clustered pool?
When clustering is enabled on a pool, every pool membership change must be agreed by every member of the cluster before it can succeed. If a cluster member isn’t contactable, operations that change cluster membership (such as host add or host remove) fail.
To add your new host to the clustered pool:
- Ensure that all of your hosts are online and can be contacted.
- Ensure that the hosts in the pool can communicate on the following ports:
  - TCP: 8892, 8896, 21064
  - UDP: 5404, 5405 (not multicast)
- Ensure that the joining host has an IP address allocated on the NIC that joins the cluster network of the pool.
- Ensure that no host in the pool is offline when a new host is trying to join the clustered pool.
- If an offline host cannot be recovered, mark it as dead to remove it from the cluster. For more information, see A host in my clustered pool is offline and I can’t recover it. How do I remove the host from my cluster?
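Before retrying the join, you can also confirm from the pool coordinator that every existing cluster member is reported as enabled. The following is a sketch using the xe CLI; it assumes that the cluster-host object exposes an enabled field, as in recent Citrix Hypervisor releases:

# List each cluster member and whether its clustering service is enabled.
xe cluster-host-list params=uuid,host,enabled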
What do I do if some members of the clustered pool aren’t joining the cluster automatically?
This issue might be caused by members of the clustered pool losing synchronization.
To resync the members of the clustered pool, use the following command:
xe cluster-pool-resync cluster-uuid=<cluster_uuid>
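If you don’t have the cluster UUID to hand, you can look it up first. This assumes the standard xe cluster object that clustered pools create:

# Print the UUID of the pool's cluster object.
xe cluster-list --minimal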
If the issue persists, you can try to reattach the GFS2 SR. You can do this task by using the xe CLI or through XenCenter.
Reattach the GFS2 SR by using the xe CLI:

- Detach the GFS2 SR from the pool. On each host, run the xe CLI command:
  xe pbd-unplug uuid=<uuid_of_pbd>
- Disable the clustered pool by using the command:
  xe cluster-pool-destroy cluster-uuid=<cluster_uuid>
  If the preceding command is unsuccessful, you can forcibly disable the clustered pool by running the following command on every host in the pool:
  xe cluster-host-force-destroy uuid=<cluster_host>
- Enable the clustered pool again by using the command:
  xe cluster-pool-create network-uuid=<network_uuid> [cluster-stack=cluster_stack] [token-timeout=token_timeout] [token-timeout-coefficient=token_timeout_coefficient]
- Reattach the GFS2 SR by running the following command on each host:
  xe pbd-plug uuid=<uuid_of_pbd>
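If you don’t have the PBD UUIDs to hand, you can look them up from the SR. The following is a minimal sketch for one host; it assumes that <sr_uuid> is the UUID of your GFS2 SR and that each host’s name-label matches its hostname (adjust the lookup if yours differ):

# Find and unplug this host's PBD for the GFS2 SR.
HOST_UUID=$(xe host-list name-label="$(hostname)" --minimal)
PBD_UUID=$(xe pbd-list sr-uuid=<sr_uuid> host-uuid="$HOST_UUID" --minimal)
xe pbd-unplug uuid="$PBD_UUID"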
Alternatively, to use XenCenter to reattach the GFS2 SR:
- In the pool Storage tab, right-click on the GFS2 SR and select Detach….
- From the toolbar, select Pool > Properties.
- In the Clustering tab, deselect Enable clustering.
- Click OK to apply your change.
- From the toolbar, select Pool > Properties.
- In the Clustering tab, select Enable clustering and choose the network to use for clustering.
- Click OK to apply your change.
- In the pool Storage tab, right-click on the GFS2 SR and select Repair.
How do I know if my host has self-fenced?
If your host self-fenced, it might have rejoined the cluster when it restarted. To see whether a host has self-fenced and recovered, check the /var/opt/xapi-clusterd/boot-times file, which records the times the host started. If the file contains start times that you did not expect to see, the host has self-fenced.
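For example, you can cross-check the recorded start times against the reboot history that the control domain itself keeps. This is a sketch that assumes standard dom0 tooling:

# Print the start times recorded by the clustering daemon.
cat /var/opt/xapi-clusterd/boot-times
# Compare against the reboots that dom0 recorded.
last reboot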
Why is my host offline? How can I recover it?
There are many possible reasons for a host to go offline. Depending on the reason, the host might or might not be recoverable.
The following reasons for a host to be offline are more common and can be addressed by recovering the host:
- Clean shutdown
- Forced shutdown
- Temporary power failure
- Reboot
The following reasons for a host to be offline are less common:
- Permanent host hardware failure
- Permanent host power supply failure
- Network partition
- Network switch failure
These issues can be addressed by replacing hardware or by marking failed hosts as dead.
A host in my clustered pool is offline and I can’t recover it. How do I remove the host from my cluster?
You can tell the cluster to forget the host. This action removes the host from the cluster permanently and decreases the number of live hosts required for quorum.
To remove an unrecoverable host, use the following command:
xe host-forget uuid=<host_uuid>
Note:
If the host isn’t offline, this command can cause data loss. You’re asked to confirm before the command proceeds.
After a host is marked as dead, it can’t be added back into the cluster. To add this host back into the cluster, you must do a fresh installation of Citrix Hypervisor on the host.
I’ve repaired a host that was marked as dead. How do I add it back into my cluster?
A Citrix Hypervisor host that has been marked as dead can’t be added back into the cluster. To add this system back into the cluster, you must do a fresh installation of Citrix Hypervisor. This fresh installation appears to the cluster as a new host.
What do I do if my cluster keeps losing quorum and its hosts keep fencing?
If one or more of the Citrix Hypervisor hosts in the cluster gets into a fence loop because of continuously losing and gaining quorum, you can boot the host with the nocluster kernel command-line argument. Connect to the physical or serial console of the host and edit the boot arguments in grub. (For a one-off boot, you can instead press e at the grub menu and append nocluster to the kernel line rather than editing grub.cfg permanently.)
Example (/boot/grub/grub.cfg):

menuentry 'XenServer' {
    search --label --set root root-oyftuj
    multiboot2 /boot/xen.gz dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=192M,below=4G console=vga vga=mode-0x0311
    module2 /boot/vmlinuz-4.4-xen root=LABEL=root-oyftuj ro nolvm hpet=disable xencons=hvc console=hvc0 console=tty0 quiet vga=785 splash plymouth.ignore-serial-consoles nocluster
    module2 /boot/initrd-4.4-xen.img
}

menuentry 'Citrix Hypervisor (Serial)' {
    search --label --set root root-oyftuj
    multiboot2 /boot/xen.gz com1=115200,8n1 console=com1,vga dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=192M,below=4G
    module2 /boot/vmlinuz-4.4-xen root=LABEL=root-oyftuj ro nolvm hpet=disable console=tty0 xencons=hvc console=hvc0 nocluster
    module2 /boot/initrd-4.4-xen.img
}
What happens when the pool master gets restarted in a clustered pool?
In most cases, the behavior when the pool master is shut down or restarted in a clustered pool is the same as when any other pool member shuts down or restarts.
How the host is shut down or restarted can affect the quorum of the clustered pool. For more information about quorum, see Quorum.
The only difference in behavior depends on whether HA is enabled in your pool:
- If HA is enabled, a new master is selected and general service is maintained.
- If HA is not enabled, there is no master for the pool. Running VMs on the remaining hosts continue to run. Most administrative operations aren’t available until the master restarts.
Why has my pool vanished after a host in the clustered pool is forced to shut down?
If you shut down a host normally (not forcibly), it’s temporarily removed from quorum calculations until it’s turned back on. However, if you forcibly shut down a host or it loses power, that host still counts towards quorum calculations. For example, if you had a pool of 3 hosts and forcibly shut down 2 of them, the remaining host fences because it no longer has quorum.
Try to always shut down hosts in a clustered pool cleanly. For more information, see Manage your clustered pool.
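For example, a clean shutdown from the xe CLI might look like the following sketch. These are standard xe host commands; evacuation requires capacity elsewhere in the pool for the migrated VMs:

# Disable the host, migrate its running VMs away, then shut it down.
xe host-disable uuid=<host_uuid>
xe host-evacuate uuid=<host_uuid>
xe host-shutdown uuid=<host_uuid>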
Why did all hosts within the clustered pool restart at the same time?
All hosts in an active cluster are considered to have lost quorum when the number of contactable hosts in the pool is less than these values:
- For a pool with an even number of hosts: n/2
- For a pool with an odd number of hosts: (n+1)/2
The letter n indicates the total number of hosts in the clustered pool. For example, a pool of 4 hosts loses quorum when fewer than 2 hosts are contactable, and a pool of 5 hosts loses quorum when fewer than 3 are contactable. For more information about quorum, see Quorum.
In this situation, all hosts self-fence and you see all hosts restarting.
To diagnose why the pool lost quorum, the following information can be useful:
- In XenCenter, check the Notifications section for the time of the issue to see whether self-fencing occurred.
- On the cluster hosts, check /var/opt/xapi-clusterd/boot-times to see whether a reboot occurred at an unexpected time.
- In Crit.log, check whether any self-fencing messages were logged.
- Review the output of the dlm_tool status command for fencing information.

Example dlm_tool status output:

dlm_tool status

cluster nodeid 1 quorate 1 ring seq 8 8
daemon now 4281 fence_pid 0
node 1 M add 3063 rem 0 fail 0 fence 0 at 0 0
node 2 M add 3066 rem 0 fail 0 fence 0 at 0 0

In this example, quorate 1 shows that the node currently holds quorum, and the zero fence counts indicate that no fencing events have been recorded.
When collecting logs for debugging, collect diagnostic information from all hosts in the cluster. In the case where a single host has self-fenced, the other hosts in the cluster are more likely to have useful information.
Collect full server status reports for the hosts in your clustered pool. For more information, see Citrix Hypervisor server status reports.
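You can generate a report from XenCenter, or directly in dom0 on each host. The following is a sketch; the exact command options can vary between releases:

# Generate a server status report bundle on this host.
xenserver-status-report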
Why can’t I recover my clustered pool when I have quorum?
If you have a clustered pool with an even number of hosts, the number of hosts required to achieve quorum is one more than the number of hosts required to retain quorum. For more information about quorum, see Quorum.
If you are in an even-numbered pool and have recovered half of the hosts, you must recover one more host before you can recover the cluster. For example, in a pool of 4 hosts, 2 live hosts are enough to retain quorum, but 3 are needed to regain it after quorum is lost.
Why do I see an Invalid token error when changing the cluster settings?
When updating the configuration of your cluster, you might receive the following error message about an invalid token: "[[\"InternalError\",\"Invalid token\"]]".
You can resolve this issue by completing the following steps:
- (Optional) Back up the current cluster configuration by collecting a server status report that includes the xapi-clusterd and system logs.
- Use XenCenter to detach the GFS2 SR from the clustered pool: in the pool Storage tab, right-click on the GFS2 SR and select Detach….
- On any host in the cluster, run this command to forcibly destroy the cluster:
  xe cluster-pool-force-destroy cluster-uuid=<uuid>
- Use XenCenter to reenable clustering on your pool:
  - From the toolbar, select Pool > Properties.
  - In the Clustering tab, select Enable clustering and choose the network to use for clustering.
  - Click OK to apply your change.
- Use XenCenter to reattach the GFS2 SR to the pool: in the pool Storage tab, right-click on the GFS2 SR and select Repair.