XenServer

Cope with machine failures

May 22, 2023

This section provides details of how to recover from various failure scenarios. All failure recovery scenarios require the use of one or more of the backup types listed in Backup.

Member failures

In the absence of HA, the pool coordinator detects member failures by listening for regular heartbeat messages from each member. If no heartbeat has been received for 600 seconds, the pool coordinator assumes the member is dead. There are two ways to recover from this problem:

Repair the dead host (for example, by physically rebooting it). When the connection to the member is restored, the pool coordinator marks the member as alive again.
Shut down the host and instruct the pool coordinator to forget about the member node using the xe host-forget CLI command. Once the member has been forgotten, all the VMs that were running on it are marked as offline and can be restarted on other XenServer hosts.

Ensure that the XenServer host is actually offline before you forget it. Otherwise, VM data corruption might occur.

Do not split your pool into multiple pools of a single host by using xe host-forget. This action might result in the resulting pools all mapping the same shared storage and corrupting VM data.

Warning:

If you are going to use the forgotten host as an active host again, perform a fresh installation of the XenServer software.

Do not use the xe host-forget command if HA is enabled on the pool. Disable HA first, then forget the host, and then re-enable HA.
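The forget procedure, including the HA precaution, might be sketched as follows. This is a minimal sketch rather than an official tool: the function name forget_host_plan and its arguments are hypothetical, and it only prints the commands so that you can review them before running anything. It assumes that HA is re-enabled with xe pool-ha-enable on the same heartbeat SR that the pool used before.

```shell
# Print the command sequence for forgetting a dead member. Nothing is executed.
# All arguments are placeholders supplied by the operator:
#   $1  UUID of the failed host
#   $2  "true" if HA is currently enabled on the pool (default: false)
#   $3  UUID of the HA heartbeat SR (only used when $2 is "true")
forget_host_plan() (
    host_uuid=${1:-}
    ha_enabled=${2:-false}
    heartbeat_sr=${3:-}

    if [ -z "$host_uuid" ]; then
        echo "usage: forget_host_plan HOST_UUID [true|false] [HEARTBEAT_SR_UUID]" >&2
        exit 1
    fi

    # HA must be disabled before the host is forgotten.
    if [ "$ha_enabled" = "true" ]; then
        echo "xe pool-ha-disable"
    fi

    echo "xe host-forget uuid=$host_uuid"

    # Re-enable HA afterwards on the same heartbeat SR (assumption).
    if [ "$ha_enabled" = "true" ]; then
        echo "xe pool-ha-enable heartbeat-sr-uuids=$heartbeat_sr"
    fi
)
```

Printing the plan instead of executing it keeps the destructive step under operator control: run the printed lines by hand on the pool coordinator only after confirming that the host is offline.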

When a member XenServer host fails, there might be VMs still registered in the running state. If you are sure that the member XenServer host is down, use the xe vm-reset-powerstate CLI command to set the power state of the VMs to halted. See vm-reset-powerstate for more details.

Warning:

Incorrect use of this command can lead to data corruption. Only use this command if necessary.
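When several VMs were resident on the failed host, the reset can be prepared in one pass. The sketch below assumes that you collect the VM UUIDs with xe vm-list using the resident-on, power-state, and is-control-domain filters and the --minimal flag, which prints a comma-separated list. The function name reset_powerstate_plan is hypothetical, and the commands are printed for review rather than executed.

```shell
# Convert the comma-separated UUID list printed by `xe vm-list ... --minimal`
# into one vm-reset-powerstate command per VM. Commands are printed only.
#
# Possible input (run on the pool coordinator, <host_uuid> is a placeholder):
#   xe vm-list resident-on=<host_uuid> power-state=running \
#       is-control-domain=false --minimal
reset_powerstate_plan() (
    vm_uuids=${1:-}
    set -f           # no filename expansion while splitting
    IFS=,            # --minimal output separates UUIDs with commas
    for vm_uuid in $vm_uuids; do
        echo "xe vm-reset-powerstate vm=$vm_uuid --force"
    done
)
```

An empty list produces no output, so the function is safe to call when no VMs were registered on the failed host.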

Before you can start VMs on another XenServer host, you must also release the locks on VM storage. Only one host at a time can use each disk in an SR, so the disks must be made accessible to other XenServer hosts after a host has failed. To do so, run the following script on the pool coordinator for each SR that contains disks of any affected VMs:

/opt/xensource/sm/resetvdis.py host_UUID SR_UUID master

You need only supply the third string (“master”) if the failed host was the SR pool coordinator at the time of the crash. (The SR pool coordinator is the pool coordinator or a XenServer host using local storage.)

Warning:

Be sure that the host is down before running this command. Incorrect use of this command can lead to data corruption.

If you attempt to start a VM on another XenServer host before running the resetvdis.py script, then you receive the following error message: VDI <UUID> already attached RW.
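Because the script must be run once per affected SR, it can help to generate the invocations up front. The following is a minimal sketch with a hypothetical function name, resetvdis_plan. It prints the commands instead of running them, and it applies the same optional "master" argument to every SR listed, so call it separately for SRs where the failed host's role differs.

```shell
# Print one resetvdis.py invocation per affected SR. Nothing is executed.
#   $1      UUID of the failed host
#   $2      "master" if the failed host was the SR pool coordinator, else ""
#   $3...   UUIDs of the SRs that hold disks of the affected VMs
resetvdis_plan() (
    if [ "$#" -lt 3 ]; then
        echo "usage: resetvdis_plan HOST_UUID ROLE SR_UUID..." >&2
        exit 1
    fi
    host_uuid=$1
    role=$2
    shift 2
    for sr_uuid in "$@"; do
        # ${role:+ $role} appends the third argument only when a role was given.
        echo "/opt/xensource/sm/resetvdis.py $host_uuid $sr_uuid${role:+ $role}"
    done
)
```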

Pool coordinator failures

Every member of a resource pool contains all the information necessary to take over the role of pool coordinator if necessary. When a pool coordinator node fails, the following sequence of events occurs:

If HA is enabled, another pool coordinator is elected automatically.
If HA is not enabled, each member waits for the pool coordinator to return.

If the pool coordinator comes back up at this point, it re-establishes communication with its members, and operation returns to normal.

If the pool coordinator is dead, choose one of the members and run the command xe pool-emergency-transition-to-master on it. Once it has become the pool coordinator, run the command xe pool-recover-slaves and the members now point to the new pool coordinator.

If you repair or replace the host that was the original pool coordinator, you can simply bring it up, install the XenServer software, and add it to the pool. Because the XenServer hosts in a pool are required to be homogeneous, there is no real need to make the replaced host the pool coordinator.

When a member XenServer host is transitioned to being a pool coordinator, check that the default pool storage repository is set to an appropriate value. To check, run the xe pool-param-list command and verify that the default-SR parameter points to a valid storage repository.
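The default-SR check can also be scripted. The sketch below is an assumption about how you might do it, not a documented procedure: it compares the value returned by xe pool-param-get (param-name=default-SR) with the comma-separated list printed by xe sr-list --minimal. The function name default_sr_is_valid is hypothetical.

```shell
# Succeed (exit status 0) only if the pool's default SR is a known SR.
#   $1  output of: xe pool-param-get uuid=<pool_uuid> param-name=default-SR
#   $2  output of: xe sr-list --minimal   (comma-separated SR UUIDs)
default_sr_is_valid() (
    default_sr=${1:-}
    sr_list=${2:-}

    # An empty value can never match a real SR.
    [ -n "$default_sr" ] || exit 1

    # Wrap both sides in commas so that only whole UUIDs match.
    case ",$sr_list," in
        *",$default_sr,"*) exit 0 ;;
        *)                 exit 1 ;;
    esac
)
```

If the check fails, the default can be pointed at a valid SR with xe pool-param-set uuid=<pool_uuid> default-SR=<sr_uuid>, where both UUIDs are placeholders for your own values.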

Pool failures

If your entire resource pool fails, you must recreate the pool database from scratch. Be sure to regularly back up your pool metadata using the xe pool-dump-database CLI command (see pool-dump-database).
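A regular backup is easier to keep up if each dump gets a unique, sortable file name. The following sketch builds such a command. The function name pool_dump_plan, the default directory /var/backups/pool, and the file naming scheme are all hypothetical choices. The command is printed, not executed.

```shell
# Print a pool-dump-database command with a timestamped file name.
#   $1  directory for the backup on the machine where xe runs
#       (default is a hypothetical path)
#   $2  timestamp to embed in the file name (default: current time)
pool_dump_plan() (
    backup_dir=${1:-/var/backups/pool}
    stamp=${2:-$(date +%Y%m%d-%H%M%S)}
    echo "xe pool-dump-database file-name=$backup_dir/pool-db-$stamp.bak"
)
```

Store the resulting file somewhere other than the pool's own storage, so that it survives the failure it is meant to recover from.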

To restore a completely failed pool:

Install a fresh set of hosts. Do not pool them up at this stage.
On the host nominated as the pool coordinator, restore the pool database from your backup using the xe pool-restore-database command (see pool-restore-database).
Connect to the pool coordinator by using XenCenter and ensure that all your shared storage and VMs are available again.
Perform a pool join operation on the remaining freshly installed member hosts, and start up your VMs on the appropriate hosts.
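The restore steps above can be laid out as a reviewable plan. This is a sketch under stated assumptions: the function name pool_restore_plan is hypothetical, the dry-run=true pass before the real restore is an optional precaution rather than part of the procedure above, and master-username=root with a <password> placeholder stands in for your own credentials. Nothing is executed.

```shell
# Print the restore sequence for a completely failed pool.
#   $1      path to the pool database backup
#   $2      address of the host nominated as pool coordinator
#   $3...   labels of the remaining freshly installed hosts (used in comments)
pool_restore_plan() (
    if [ "$#" -lt 2 ]; then
        echo "usage: pool_restore_plan BACKUP COORDINATOR_ADDRESS [MEMBER...]" >&2
        exit 1
    fi
    backup=$1
    coordinator=$2
    shift 2

    echo "# on the nominated pool coordinator:"
    echo "xe pool-restore-database file-name=$backup dry-run=true"
    echo "xe pool-restore-database file-name=$backup"

    # Join the remaining hosts only after storage and VMs have been verified.
    for member in "$@"; do
        echo "# on $member:"
        echo "xe pool-join master-address=$coordinator master-username=root master-password=<password>"
    done
)
```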

Cope with failure due to configuration errors

If the physical host machine is operational but the software or host configuration is corrupted:

Run the following command to restore host software and configuration:
```
xe host-restore host=host file-name=hostbackup
```
Reboot to the host installation CD and select Restore from backup.

Physical machine failure

If the physical host machine has failed, use the appropriate procedure from the following list to recover.

Warning:

Any VMs that were running on a previous member (or the previous host) that has failed are still marked as Running in the database. This behavior is deliberate, for safety: simultaneously starting a VM on two different hosts would lead to severe disk corruption. If you are sure that the machines (and VMs) are offline, you can reset the VM power state to Halted:

xe vm-reset-powerstate vm=vm_uuid --force

VMs can then be restarted using XenCenter or the CLI.

To replace a failed pool coordinator with a still running member:

Run the following commands:

xe pool-emergency-transition-to-master
xe pool-recover-slaves

If the commands succeed, restart the VMs.

To restore a pool with all hosts failed:

Run the command:
```
xe pool-restore-database file-name=backup
```
Warning:

This command only succeeds if the target machine has an appropriate number of appropriately named NICs.

If the target machine has a different view of the storage than the original machine, remove the stale storage configuration using the pbd-destroy command, then recreate it using the pbd-create command. See pbd commands for documentation of these commands.
If you have created a new storage configuration, use the pbd-plug command or the Storage > Repair Storage Repository menu item in XenCenter to apply the new configuration.
Restart all VMs.
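The PBD steps might look like the following when written out. This is an illustrative sketch only: the function name pbd_recreate_plan is hypothetical, and the device-config keys depend entirely on the SR type (the server and serverpath keys in the usage comment are those of an NFS SR and appear purely as an example). The commands are printed for review, and values containing spaces are not quoted in the output.

```shell
# Print the commands that replace a stale PBD with a new one.
#   $1      UUID of the stale PBD
#   $2      UUID of the host that must reach the SR
#   $3      UUID of the SR
#   $4...   device-config entries as key=value pairs (SR-type specific)
#
# Example (NFS SR, all values are placeholders):
#   pbd_recreate_plan <pbd_uuid> <host_uuid> <sr_uuid> \
#       server=<nfs_server> serverpath=<export_path>
pbd_recreate_plan() (
    if [ "$#" -lt 3 ]; then
        echo "usage: pbd_recreate_plan PBD_UUID HOST_UUID SR_UUID [key=value...]" >&2
        exit 1
    fi
    old_pbd=$1
    host_uuid=$2
    sr_uuid=$3
    shift 3

    # Prefix every key=value pair with "device-config:".
    cfg=""
    for kv in "$@"; do
        cfg="$cfg device-config:$kv"
    done

    echo "xe pbd-destroy uuid=$old_pbd"
    echo "xe pbd-create host-uuid=$host_uuid sr-uuid=$sr_uuid$cfg"
    echo "xe pbd-plug uuid=<uuid printed by pbd-create>"
)
```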

To restore a VM when VM storage is not available:

Run the following command:

xe vm-import filename=backup metadata=true

If the metadata import fails, run the command:
```
xe vm-import filename=backup metadata=true --force
```
This command attempts to restore the VM metadata on a ‘best effort’ basis.
Restart all VMs.
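The import-then-retry sequence can be wrapped in a small helper. The sketch below is hypothetical (the function name import_vm_metadata and the XE override variable are not part of XenServer). Unlike a printed plan, it actually runs the import, and it falls back to --force only when the first attempt fails.

```shell
# Try a normal metadata import first. Retry with --force only on failure.
#   $1  path to the VM metadata backup
# Set XE to another command name (for example a stub) to exercise the
# logic on a machine without the xe CLI.
import_vm_metadata() (
    backup=${1:-}
    xe_cmd=${XE:-xe}

    if [ -z "$backup" ]; then
        echo "usage: import_vm_metadata BACKUP_FILE" >&2
        exit 1
    fi

    if "$xe_cmd" vm-import filename="$backup" metadata=true; then
        exit 0
    fi

    echo "metadata import failed, retrying with --force" >&2
    "$xe_cmd" vm-import filename="$backup" metadata=true --force
)
```

The function returns the exit status of the last import attempt, so a calling script can tell whether the best-effort restore also failed.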
