On GitHub: SUSE/compute-ha-training
SOLVED
(mostly)
neutron HA is tricky, but outside the scope of this talk.
The control plane on the LHS is HA, but VMs live on the RHS, so what happens if one of the compute nodes blows up? That's the topic of the rest of this talk!
The previous slide suggests there is a problem which needs solving, but does it always need solving?
Compute node HA is a controversial feature, because some people think it's an anti-pattern which does not belong in clouds, whereas other people feel a strong need for it. To understand when it's needed, first we have to understand the different types of workload which people want to run in the cloud.
But what are pets?
… to handle failures at scale, we need to automatically restart VMs somehow.
... otherwise over time, our service becomes more and more degraded, and manually restarting is a waste of time and unreliable due to the human element.
Heat used to support this, but HARestarter has been deprecated since Kilo. Heat is gaining convergence / self-healing capabilities, but nothing concrete is currently planned for instance auto-restarting.
http://docs.openstack.org/developer/heat/ says "templates […] allow some more advanced functionality such as instance high availability […]" but according to Thomas Herve (current Heat PTL) this is no longer supported.
Heat/HA wiki out of date
… we have to resurrect very carefully in order to avoid any zombie pets!
This case is more complex than resurrecting cattle, due to the risk of zombie pets.
A zombie is a VM which appeared dead but didn't actually die properly - it could conflict with its resurrected twin.
So to sum up, my vote is yes, because even cattle need compute node HA.
Also, rather than painful "big bang" migrations to cloud-aware workloads, it's easier to deprecate legacy workloads, let them reach EOL whilst gradually migrating over to next-generation architectures.
This is a controversial topic, but naysayers tend to favour idealism over real-world pragmatism.
If this really is needed functionality, why hasn't it already been done? The answer is that it's actually surprisingly tricky to implement in a reliable manner.
Different cloud operators will want to support different SLAs with different workflows, e.g.
There is no one-size-fits-all solution to compute HA.
Clouds will often scale to many compute nodes
Typical clustering software uses fully connected mesh topology, which doesn't scale to a large number of nodes, e.g. corosync supports a maximum of 32 nodes.
The obvious workarounds are ugly!
Scalability issue solved by pacemaker_remote
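As a minimal sketch (the primitive name and IP are invented for illustration), a compute node joins the cluster as a remote node via the ocf:pacemaker:remote resource agent:
primitive compute1 ocf:pacemaker:remote \
    params server=192.168.124.81 reconnect_interval=60 \
    op monitor interval=20s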
Needs to protect critical data ⇒ requires fencing of either
Needs to handle failure or (temporary) freeze of:
We assume that Pacemaker is reliable, otherwise we're sunk!
First delete any existing role assignments by clicking Remove all.
root@crowbar:~ # tail -f /var/log/crowbar/production.log
root@crowbar:~ # tail -f /var/log/crowbar/chef-client/*.log
Log in to one of the controller nodes, and do:
OCF RA approach is supported in RHEL OSP. Setup is manual; here is a fragment of the installation instructions.
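As a very rough illustration only (this is not the actual RHEL OSP documentation; the agent and parameter names here are assumptions), the manual setup revolves around pcs commands along these lines:
pcs resource create nova-evacuate ocf:openstack:NovaEvacuate \
    auth_url=$OS_AUTH_URL username=$OS_USERNAME \
    password=$OS_PASSWORD tenant_name=$OS_TENANT_NAME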
SUSE's solution is incredibly easy to deploy, as we'll see next!
This is a good time to introduce nova evacuate.
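For reference, a single VM can also be evacuated manually; a minimal sketch of the command (server and target host names are placeholders) is:
nova evacuate testvm targetHost
In older nova CLI releases, adding --on-shared-storage tells nova the instance disk is on shared storage, so it is reused rather than rebuilt from the image.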
Two key areas:
If /var/lib/nova/instances is shared:
Otherwise:
Either way, /var/lib/glance/images should be shared across all controllers (unless using Swift / Ceph)
We're using admin server's NFS server:
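For illustration only (the paths and addresses below are invented, not the actual lab setup), the export and mount might look like this. On the admin server, /etc/exports would contain something like
/var/lib/nova/instances 192.168.124.0/24(rw,no_root_squash,sync)
and each compute node would mount it with
mount -t nfs 192.168.124.10:/var/lib/nova/instances /var/lib/nova/instances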
batch is a subcommand of the crowbar client (typically run on the admin node).
Unattended batch setup of barclamps:
root@crowbar:~ # crowbar batch build my-cloud.yaml
Dump current barclamps as YAML:
root@crowbar:~ # crowbar batch export
- barclamp: pacemaker
  name: services
  attributes:
    stonith:
      mode: libvirt
      libvirt:
        hypervisor_ip: 192.168.217.1
    drbd:
      enabled: true
  deployment:
    elements:
      hawk-server:
        - "@@controller1@@"
        - "@@controller2@@"
      pacemaker-cluster-member:
        - "@@controller1@@"
        - "@@controller2@@"
      pacemaker-remote:
        - "@@compute1@@"
        - "@@compute2@@"
- barclamp: nova
  attributes:
    use_migration: true
    kvm:
      ksm_enabled: true
  deployment:
    elements:
      nova-controller:
        - cluster:cluster1
      nova-compute-kvm:
        - remotes:cluster1
Let's boot a VM to test compute node HA!
Connect to one of the controller nodes, and get image / flavor / net names:
source .openrc
openstack image list
openstack flavor list
neutron net-list
Boot the VM using these ids:
nova boot --image image --flavor flavor --nic net-id=net testvm
Test it's booted:
nova show testvm
Create floating IP:
neutron floatingip-create floatingnet
Get VM IP:
nova list
Get port id:
neutron port-list | grep vmIP
Associate floating IP with VM port:
neutron floatingip-associate floatingipID portID
The VM uses the default security group. Make sure it allows ICMP.
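If it doesn't, ICMP can be allowed with something like:
openstack security group rule create --protocol icmp default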
Ping VM:
ping vmFloatingIP
Ping host where the VM is running:
nova list --fields host,name
ping hostIP
Check log messages for NovaEvacuate workflow:
tail -f /var/log/messages | grep NovaEvacuate
Monitor cluster status:
crm_mon
Log in to the compute node where the VM runs, and type:
pkill -9 -f pacemaker_remoted
This will cause fencing! (Why?)
The Pacemaker cluster now loses connectivity to the compute node, so it has no way of knowing whether the node is actually dead. The only way to safely recover its resources onto another remote is to fence the node first.
NovaEvacuate [...] Initiating evacuation
NovaEvacuate [...] Completed evacuation
nova list --fields host,name
Pacemaker monitors compute nodes via pacemaker_remote.
If compute node failure detected:
compute node is fenced
crm configure show fencing_topology
Find node running fence_compute:
crm resource show fence-nova
fence_compute script:
tells nova server that node is down
updates attribute on compute node to indicate node needs recovery
Log files:
Verify attribute state via:
attrd_updater --query --all --name=evacuate
root@controller1:~ # crm resource show nova-evacuate
resource nova-evacuate is running on: d52-54-77-77-77-02
nova resurrects VM on other node
root@controller2:~ # grep nova-evacuate /var/log/messages
NovaEvacuate [...] Initiating evacuation
NovaEvacuate [...] Completed evacuation
Warning: no retries if resurrection fails!
pacemaker_remote looks after key compute node services.
To sum up: even cattle need compute node HA. It's surprisingly hard to do reliably, but pacemaker_remote scales monitoring to many compute nodes, fencing plus NovaEvacuate recover VMs safely, and Crowbar barclamps make the whole setup easy to deploy.