Cope with machine failures
Important:
Citrix Hypervisor 8.2 Cumulative Update 1 becomes End of Life on June 25, 2025. Plan your upgrade to XenServer 8 now to ensure a smooth transition and continued support. For more information, see Upgrade.
If you are using your Citrix Virtual Apps and Desktops license files to license your Citrix Hypervisor 8.2 Cumulative Update 1 hosts, these license files are not compatible with XenServer 8. Before upgrading you must acquire XenServer Premium Edition socket license files to use with XenServer 8. These socket license files are available as an entitlement of the Citrix for Private Cloud, Citrix Universal Hybrid Multi-Cloud, Citrix Universal MSP, and Citrix Platform License subscriptions for running your Citrix workloads. Citrix customers who have not yet transitioned to these new subscriptions can request to participate in a no-cost promotion for 10,000 XenServer Premium Edition socket licenses. For more information, see XenServer.
If you do not get a compatible license for XenServer 8 before upgrading, when you upgrade your hosts they revert to the 90-day Trial Edition. Trial Edition provides the same features as Premium Edition with some limitations. For more information, see XenServer 8 Licensing Overview.
This section provides details of how to recover from various failure scenarios. All failure recovery scenarios require the use of one or more of the backup types listed in Backup.
Member failures
In the absence of HA, master nodes detect the failures of members by receiving regular heartbeat messages. If no heartbeat has been received for 600 seconds, the master assumes the member is dead. There are two ways to recover from this problem:
- Repair the dead host (for example, by physically rebooting it). When the connection to the member is restored, the master marks the member as alive again.
- Shut down the host and instruct the master to forget about the member node using the xe host-forget CLI command. Once the member has been forgotten, all the VMs which were running there are marked as offline and can be restarted on other Citrix Hypervisor servers. It is important to ensure that the Citrix Hypervisor server is actually offline, otherwise VM data corruption might occur.
Do not split your pool into multiple pools of a single host by using xe host-forget. This action might result in them all mapping the same shared storage and corrupting VM data.
Warning:
- If you are going to use the forgotten host as an active host again, perform a fresh installation of the Citrix Hypervisor software.
- Do not use the xe host-forget command if HA is enabled on the pool. Disable HA first, then forget the host, and then re-enable HA.
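The disable, forget, re-enable order in the warning above can be sketched as a small shell function. This is a minimal sketch, not a supported procedure: the function name and both arguments (the UUID of the dead host and the UUID of the heartbeat SR to reuse) are illustrative assumptions, and xe may ask for confirmation before forgetting the host.

```shell
#!/bin/sh
# Sketch: forget a dead member in a pool that has HA enabled.
# forget_member_with_ha is a hypothetical helper name; both arguments
# are placeholders supplied by the caller.
forget_member_with_ha() {
    host_uuid=$1
    heartbeat_sr_uuid=$2

    # HA must be disabled before the host is forgotten.
    xe pool-ha-disable || return 1

    # Only run this once the host is confirmed to be powered off.
    xe host-forget uuid="$host_uuid" || return 1

    # Re-enable HA, reusing the given heartbeat SR.
    xe pool-ha-enable heartbeat-sr-uuids="$heartbeat_sr_uuid"
}

# Example (run on the pool master):
#   forget_member_with_ha <host_uuid> <heartbeat_sr_uuid>
```

The function stops at the first failing step, so HA is not re-enabled if the forget operation fails and the pool state can be inspected first.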
When a member Citrix Hypervisor server fails, there might be VMs still registered in the running state. If you are sure that the member Citrix Hypervisor server is definitely down, use the xe vm-reset-powerstate CLI command to set the power state of the VMs to halted. See vm-reset-powerstate for more details.
Warning:
Incorrect use of this command can lead to data corruption. Only use this command if necessary.
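When several VMs were resident on the failed member, the reset can be looped. The following is a sketch under assumptions: the helper name is hypothetical, and it relies on xe vm-list accepting the resident-on and is-control-domain fields as filters. Use it only after confirming that the host is powered off.

```shell
#!/bin/sh
# Sketch: mark every VM that the database still places on a failed
# host as halted. reset_vms_on_host is a hypothetical helper name.
reset_vms_on_host() {
    host_uuid=$1

    # --minimal prints the matching UUIDs as one comma-separated line.
    # The control domain of the host is excluded from the list.
    vm_uuids=$(xe vm-list resident-on="$host_uuid" \
        is-control-domain=false --minimal) || return 1

    for vm_uuid in $(echo "$vm_uuids" | tr ',' ' '); do
        xe vm-reset-powerstate vm="$vm_uuid" --force
    done
}

# Example (run on the pool master):
#   reset_vms_on_host <failed_host_uuid>
```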
Before you can start VMs on another Citrix Hypervisor server, you must also release the locks on VM storage. Only one host at a time can use each disk in an SR, so the disks must be made accessible to other Citrix Hypervisor servers once a host has failed. To do so, run the following script on the pool master for each SR that contains disks of any affected VMs:
/opt/xensource/sm/resetvdis.py host_UUID SR_UUID master
You need only supply the third string (“master”) if the failed host was the SR master at the time of the crash. (The SR master is the pool master or a Citrix Hypervisor server using local storage.)
Warning:
Be sure that the host is down before running this command. Incorrect use of this command can lead to data corruption.
If you attempt to start a VM on another Citrix Hypervisor server before running the resetvdis.py script, then you receive the following error message: VDI <UUID> already attached RW.
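Because the script must be run once per affected SR, a loop can help. This is a sketch only: the helper name, the yes/no flag, and the RESETVDIS override variable are assumptions introduced for illustration. The script path and its three arguments are the ones given above.

```shell
#!/bin/sh
# Sketch: release VDI locks held by a failed host on several SRs.
# RESETVDIS can be overridden for testing (hypothetical convenience).
RESETVDIS=${RESETVDIS:-/opt/xensource/sm/resetvdis.py}

release_vdi_locks() {
    host_uuid=$1
    was_sr_master=$2   # "yes" if the failed host was the SR master
    shift 2

    # The remaining arguments are the UUIDs of the affected SRs.
    for sr_uuid in "$@"; do
        if [ "$was_sr_master" = "yes" ]; then
            "$RESETVDIS" "$host_uuid" "$sr_uuid" master
        else
            "$RESETVDIS" "$host_uuid" "$sr_uuid"
        fi
    done
}

# Example (run on the pool master, with the failed host powered off):
#   release_vdi_locks <host_uuid> no <sr_uuid_1> <sr_uuid_2>
```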
Master failures
Every member of a resource pool contains all the information necessary to take over the role of master if necessary. When a master node fails, the following sequence of events occurs:
- If HA is enabled, another master is elected automatically.
- If HA is not enabled, each member waits for the master to return.
If the master comes back up at this point, it re-establishes communication with its members, and operation returns to normal.
If the master is dead, choose one of the members and run the command xe pool-emergency-transition-to-master on it. Once it has become the master, run the command xe pool-recover-slaves and the members now point to the new master.
If you repair or replace the server that was the original master, you can simply bring it up, install the Citrix Hypervisor software, and add it to the pool. Because the Citrix Hypervisor servers in a pool are required to be homogeneous, there is no real need to make the replaced server the master.
When a member Citrix Hypervisor server is transitioned to being a master, check that the default pool storage repository is set to an appropriate value. This check can be done using the xe pool-param-list command and verifying that the default-SR parameter is pointing to a valid storage repository.
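The promotion and the default-SR check can be combined as follows. This is a minimal sketch under assumptions: the helper name is hypothetical, it reads the single default-SR value with xe pool-param-get rather than scanning the full pool-param-list output, and it assumes the toolstack is ready to accept commands again right after the transition, which might take a moment in practice.

```shell
#!/bin/sh
# Sketch: promote the local member to master and check default-SR.
# promote_to_master is a hypothetical helper name.
promote_to_master() {
    # Run on the member chosen as the new master.
    xe pool-emergency-transition-to-master || return 1

    # Point the remaining members at this host.
    xe pool-recover-slaves || return 1

    pool_uuid=$(xe pool-list --minimal) || return 1
    default_sr=$(xe pool-param-get uuid="$pool_uuid" \
        param-name=default-SR) || return 1

    # An empty sr-list result means the value matches no known SR.
    if [ -z "$(xe sr-list uuid="$default_sr" --minimal)" ]; then
        echo "default-SR does not point to a known SR" >&2
        return 1
    fi
    echo "default-SR: $default_sr"
}

# If the check fails, a new value can be set with:
#   xe pool-param-set uuid=<pool_uuid> default-SR=<sr_uuid>
```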
Pool failures
In the unfortunate event that your entire resource pool fails, you must recreate the pool database from scratch. Be sure to regularly back up your pool metadata using the xe pool-dump-database CLI command (see pool-dump-database).
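A regular backup could be taken with a small wrapper such as the one below. This is a sketch: the helper name, the backup directory argument, and the timestamped file naming are assumptions. Only the xe pool-dump-database command and its file-name parameter come from the product CLI.

```shell
#!/bin/sh
# Sketch: dump the pool database to a timestamped file.
# backup_pool_database is a hypothetical helper name.
backup_pool_database() {
    backup_dir=$1
    backup_file="$backup_dir/pool-db-$(date +%Y%m%d-%H%M%S).bak"

    xe pool-dump-database file-name="$backup_file" || return 1

    # Print the path so that callers can copy the file off the host.
    echo "$backup_file"
}

# Example:
#   backup_pool_database /var/backups
```

Storing the file away from the pool itself matters here, because the backup is only useful when every host in the pool has failed.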
To restore a completely failed pool:
- Install a fresh set of hosts. Do not pool them up at this stage.
- For the host nominated as the master, restore the pool database from your backup using the xe pool-restore-database command (see pool-restore-database).
- Connect to the master host using XenCenter and ensure that all your shared storage and VMs are available again.
- Perform a pool join operation on the remaining freshly installed member hosts, and start up your VMs on the appropriate hosts.
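The restore step on the nominated master can be sketched as below. The helper name is hypothetical, and the sketch assumes that xe pool-restore-database accepts a dry-run=true parameter for checking the backup before it is applied. Confirm this against the command reference for your release before relying on it.

```shell
#!/bin/sh
# Sketch: check a pool database backup, then restore it.
# restore_pool_database is a hypothetical helper name.
restore_pool_database() {
    backup_file=$1

    # Assumed behavior: dry-run=true validates without applying.
    xe pool-restore-database file-name="$backup_file" dry-run=true \
        || return 1

    # Apply the backup on the host nominated as the master.
    xe pool-restore-database file-name="$backup_file"
}

# Example (run on the host nominated as the master):
#   restore_pool_database /var/backups/pool-db.bak
```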
Cope with failure due to configuration errors
If the physical host machine is operational but the software or host configuration is corrupted:
- Run the following command to restore host software and configuration:
  xe host-restore host=host file-name=hostbackup
- Reboot to the host installation CD and select Restore from backup.
Physical machine failure
If the physical host machine has failed, use the appropriate procedure from the following list to recover.
Warning:
Any VMs running on a previous member (or the previous host) which have failed are still marked as Running in the database. This behavior is for safety. Simultaneously starting a VM on two different hosts would lead to severe disk corruption. If you are sure that the machines (and VMs) are offline you can reset the VM power state to Halted:
xe vm-reset-powerstate vm=vm_uuid --force
VMs can then be restarted using XenCenter or the CLI.
To replace a failed master with a still running member:
- Run the following commands:
  xe pool-emergency-transition-to-master
  xe pool-recover-slaves
- If the commands succeed, restart the VMs.
To restore a pool with all hosts failed:
- Run the command:
  xe pool-restore-database file-name=backup
  Warning:
  This command only succeeds if the target machine has an appropriate number of appropriately named NICs.
- If the target machine has a different view of the storage than the original machine, modify the storage configuration using the pbd-destroy command. Next use the pbd-create command to recreate storage configurations. See pbd commands for documentation of these commands.
- If you have created a storage configuration, use pbd-plug or the Storage > Repair Storage Repository menu item in XenCenter to use the new configuration.
- Restart all VMs.
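The destroy, create, and plug sequence for one host and SR pair might look like the sketch below. The helper name is hypothetical, and the device-config keys in the example are placeholders for an NFS SR. The keys you need depend on the SR type, so treat them as assumptions.

```shell
#!/bin/sh
# Sketch: replace the PBD that connects one host to one SR.
# recreate_pbd is a hypothetical helper name.
recreate_pbd() {
    host_uuid=$1
    sr_uuid=$2
    shift 2   # the remaining arguments are device-config:key=value pairs

    old_pbd=$(xe pbd-list host-uuid="$host_uuid" sr-uuid="$sr_uuid" \
        --minimal) || return 1

    if [ -n "$old_pbd" ]; then
        # Unplugging can fail if the PBD is not attached; carry on.
        xe pbd-unplug uuid="$old_pbd"
        xe pbd-destroy uuid="$old_pbd" || return 1
    fi

    # pbd-create prints the UUID of the new PBD.
    new_pbd=$(xe pbd-create host-uuid="$host_uuid" sr-uuid="$sr_uuid" \
        "$@") || return 1
    xe pbd-plug uuid="$new_pbd"
}

# Example with placeholder NFS settings:
#   recreate_pbd <host_uuid> <sr_uuid> \
#       device-config:server=<nfs_server> \
#       device-config:serverpath=<export_path>
```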
To restore a VM when VM storage is not available:
- Run the following command:
  xe vm-import filename=backup metadata=true
- If the metadata import fails, run the command:
  xe vm-import filename=backup metadata=true --force
  This command attempts to restore the VM metadata on a ‘best effort’ basis.
- Restart all VMs.
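The two import attempts above can be chained so that the forced import runs only when the first one fails. This is a minimal sketch; the helper name is an assumption, and the commands are the ones listed in the steps.

```shell
#!/bin/sh
# Sketch: import VM metadata, falling back to a forced import.
# import_vm_metadata is a hypothetical helper name.
import_vm_metadata() {
    backup_file=$1

    if xe vm-import filename="$backup_file" metadata=true; then
        return 0
    fi

    echo "metadata import failed; retrying with --force" >&2
    # Best-effort restore of the VM metadata.
    xe vm-import filename="$backup_file" metadata=true --force
}

# Example:
#   import_vm_metadata backup
```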