
About this presentation

  • Press ? for help on navigating these slides
  • Press m for a slide menu
  • Press s for speaker notes

Compute node HA

Hands-On Training

Nürnberg, Wednesday 11th May 2016

Adam Spiers

Senior Software Engineer, SUSE

aspiers@suse.com

Agenda

  • HA in a typical OpenStack cloud today
  • When do we need HA for compute nodes?
  • Architectural challenges
  • Solution in SUSE OpenStack Cloud
  • Failover testing

HA in OpenStack today

Typical HA control plane

  • Automatic restart of controller services
  • Increases uptime of cloud
  • Active / active API services with load balancing
  • DB + MQ either active / active or active / passive

Under the covers

SOLVED

(mostly)

  • HAProxy distributes service requests
  • Pacemaker monitors and controls nodes and services
  • These days, to a large extent this is a solved problem!

neutron HA is tricky, but outside the scope of this talk.

If only the control plane is HA …

The control plane on the LHS is HA, but VMs live on the RHS, so what happens if one of the compute nodes blows up? That's the topic of the rest of this talk!

When is compute HA important?

The previous slide suggests there is a problem which needs solving, but does it always need solving?

Addressing the white elephant in the room

Compute node HA is a controversial feature: some people think it's an anti-pattern which does not belong in clouds, whereas others feel a strong need for it. To understand when it's needed, we first have to understand the different types of workload which people want to run in the cloud.

But what are pets?

Pets vs. cattle

  • Pets are given names like mittens.mycompany.com
  • Each one is unique, lovingly hand-raised and cared for
  • When they get ill, you nurse them back to health
  • Cattle are given names like vm0213.cloud.mycompany.com
  • They are almost identical to other cattle
  • When one gets ill, you get another one
  • Pets are typically given unique names, whereas cattle aren't.
  • This reflects that pets take a lot of work to create and look after, whereas cattle don't.
  • Similarly, when something goes wrong with a pet, you need to invest a lot of effort to fix it, whereas with cattle you just get another one.
  • thanks to CERN for this slide, and Bill Baker for the original terminology

What does that mean in practice?

Pets:

  • Service downtime when a pet dies
  • VM instances often stateful, with mission-critical data
  • Needs automated recovery with data protection

Cattle:

  • Service resilient to instances dying
  • Stateless, or ephemeral (disposable) storage
  • Already ideal for cloud … but automated recovery still needed!

If compute node is hosting cattle …

… to handle failures at scale, we need to automatically restart VMs somehow.

... otherwise our service becomes more and more degraded over time, and restarting instances manually is a waste of time and unreliable due to the human element.

Heat used to support this, but HARestarter has been deprecated since Kilo. Heat is gaining convergence / self-healing capabilities, but nothing concrete is currently planned for instance auto-restarting.

http://docs.openstack.org/developer/heat/ says "templates […] allow some more advanced functionality such as instance high availability […]" but according to Thomas Herve (current Heat PTL) this is no longer supported.

Heat/HA wiki out of date

If compute node is hosting pets …

… we have to resurrect very carefully in order to avoid any zombie pets!

This case is more complex than resurrecting cattle, due to the risk of zombie pets.

A zombie is a VM which appeared dead but didn't actually die properly - it could conflict with its resurrected twin.

Do we really need compute HA in OpenStack?

Why?

  • Compute HA needed for cattle as well as pets
  • Valid reasons for running pets in OpenStack
    • Manageability benefits
    • Want to avoid multiple virtual estates
    • Too expensive to cloudify legacy workloads

So to sum up, my vote is yes, because even cattle need compute node HA.

Also, rather than painful "big bang" migrations to cloud-aware workloads, it's easier to deprecate legacy workloads and let them reach EOL whilst gradually migrating over to next-generation architectures.

This is a controversial topic, but naysayers tend to favour idealism over real world pragmatism.

Architectural challenges

If this really is needed functionality, why hasn't it already been done? The answer is that it's actually surprisingly tricky to implement in a reliable manner.

Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.

  • Protection for pets:
    • per AZ?
    • per project?
    • per pet?
  • If nova-compute fails, VMs are still perfectly healthy but unmanageable
    • Should they be automatically killed? Depends on the workload.

There is no one-size-fits-all solution to compute HA.

Compute plane needs to scale

CERN datacenter © Torkild Retvedt CC-BY-SA 2.0

Clouds will often scale to many compute nodes

  • 100s, or even 1000s

Full mesh clusters don't scale

Typical clustering software uses fully connected mesh topology, which doesn't scale to a large number of nodes, e.g. corosync supports a maximum of 32 nodes.

Addressing Scalability

The obvious workarounds are ugly!

  • Multiple compute clusters introduce unwanted artificial boundaries
  • Clusters inside / between guest VM instances are not OS-agnostic, and require cloud users to modify guest images (installing & configuring cluster software)
  • Cloud is supposed to make things easier not harder!

Common architecture

Scalability issue solved by pacemaker_remote

  • New(-ish) Pacemaker feature
  • Allows core cluster nodes to control "remote" nodes via a pacemaker_remote proxy service (daemon)
  • Can scale to very large numbers
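For illustration, a remote node is defined in the core cluster as a resource using the ocf:pacemaker:remote agent, along these lines (the node name and IP below are placeholders, not taken from this deployment; the Pacemaker barclamp generates the real configuration for you):

# illustrative only - the barclamp generates the real resource definition
crm configure primitive remote-compute1 ocf:pacemaker:remote \
    params server=192.168.124.81 reconnect_interval=60 \
    op monitor interval=20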

Reliability challenges

  • Needs to protect critical data ⇒ requires fencing of either:

    • the storage resource, or
    • the faulty node (a.k.a. STONITH)
  • Needs to handle failure or (temporary) freeze of:

    • Hardware (including various NICs)
    • Kernel
    • Hypervisor services (e.g. libvirt)
    • OpenStack control plane services
      • including resurrection workflow
    • VM
    • Workload inside VM (ideally)

We assume that Pacemaker is reliable, otherwise we're sunk!

Labs data sheet

  • Admin server: crowbar.c$cloud_number
  • Host (hypervisor): blacher.arch.suse.de
  • ssh controller1
  • ssh compute2 etc.

Lab 1: add remotes to Pacemaker cluster

Starting point

  • 2 controllers in HA cluster
  • 3 compute nodes
  • All barclamps deployed!

Pacemaker barclamp clusters, nodes, and roles

First delete any existing role assignments by clicking Remove all.

Pacemaker role assignment

Apply Pacemaker proposal

Check progress of proposal

root@crowbar:~ # tail -f /var/log/crowbar/production.log
root@crowbar:~ # tail -f /var/log/crowbar/chef-client/*.log

Check status of cluster nodes and remotes

Log in to one of the controller nodes and run a cluster status check.
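The exact command isn't shown on the slide; a likely check, given the crm_mon / crm status usage later in these labs, is:

root@controller1:~ # crm status

The pacemaker-remote nodes should show up as online alongside the two controller nodes.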

Compute HA in SUSE OpenStack Cloud

NovaCompute / NovaEvacuate OCF agents

  • Custom OCF Resource Agents (RAs)
    • Pacemaker plugins to manage resources
  • Custom fencing agent (fence_compute) flags host for recovery
  • NovaEvacuate RA polls for flags, and initiates recovery
    • Will keep retrying if recovery not possible
  • NovaCompute RA starts / stops nova-compute
    • Start waits for recovery to complete
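As a rough sketch of how these pieces typically fit together in a Pacemaker configuration (resource names and parameter values below are illustrative, not the configuration generated by the deployment tooling):

# illustrative values only - not the generated configuration
crm configure primitive fence-nova stonith:fence_compute \
    params auth-url="http://controller:5000/v2.0" login=admin passwd=secret tenant-name=admin
crm configure primitive nova-evacuate ocf:openstack:NovaEvacuate \
    params auth_url="http://controller:5000/v2.0" username=admin password=secret tenant_name=admin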

RHEL OSP installation

OCF RA approach is supported in RHEL OSP. Setup is manual; here is a fragment of the installation instructions.

RHEL OSP installation (page 2)

RHEL OSP installation (page 3)

RHEL OSP installation (page 4)

RHEL OSP installation (page 175)

NovaCompute / NovaEvacuate OCF agents

Pros

Cons

  • Known limitations (not bugs):
    • Only handles failure of compute node, not of VMs, or nova-compute
    • Some corner cases still problematic, e.g. if nova fails during recovery

SUSE's solution is incredibly easy to deploy, as we'll see next!

Lab 3: nova setup

Edit Nova proposal

Nova proposal: clusters available

Nova proposal: role assignment

Apply Nova proposal

Check status of nova resources in cluster

Brief interlude: nova evacuate

This is a good time to introduce nova evacuate.

nova's recovery API

  • If we have a compute node failure, after fencing the node, we need to resurrect the VMs in a way which OpenStack is aware of.
  • Luckily nova provides an API for doing this, which is called nova evacuate. So we just call that API and nova takes care of the rest.
  • Without shared storage, simply rebuilds from scratch
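For reference, the same API can also be driven by hand with the nova CLI; a hedged example (instance and host names are placeholders):

# resurrect one instance from its failed (and fenced!) host
nova evacuate testvm
# resurrect everything that was running on a dead host
nova host-evacuate failed-compute-node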

Public Health Warning

nova evacuate does not really mean evacuation!

Think about natural disasters

Not too late to evacuate

Too late to evacuate

nova terminology

nova live-migration

nova evacuate ?!

Public Health Warning

  • In Vancouver, nova developers considered a rename
    • Has not happened yet
    • Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”

Shared storage

Where can we have shared storage?

Two key areas:

  • /var/lib/glance/images on controller nodes
  • /var/lib/nova/instances on compute nodes

When do we need shared storage?

If /var/lib/nova/instances is shared:

  • VM's ephemeral disk will be preserved during recovery

Otherwise:

  • VM disk will be lost
  • recovery will need to rebuild VM from image

Either way, /var/lib/glance/images should be shared across all controllers (unless using Swift / Ceph)

  • otherwise nova might fail to retrieve image from glance

How crowbar batch set up shared storage

We're using admin server's NFS server:

  • Only suitable for testing purposes!
  • In production, use SES / SAN
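Purely as a sketch of the format, the NFS exports for these directories might look something like this (paths and subnet are illustrative; check the real /etc/exports on the admin server, as in the next step):

# illustrative /etc/exports entries
/var/lib/glance/images   192.168.124.0/24(rw,async,no_root_squash,no_subtree_check)
/var/lib/nova/instances  192.168.124.0/24(rw,async,no_root_squash,no_subtree_check)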

Verify setup of shared storage

  • Locate shared directories via nfs_client barclamp
  • Check /etc/exports on admin server
  • Check /etc/fstab on controller / compute nodes
  • Run mount on controller / compute nodes
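For example, the last two checks on a compute node could look like this (path as on the earlier slides):

grep /var/lib/nova/instances /etc/fstab
mount | grep /var/lib/nova/instances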

Intro to crowbar batch

batch is a subcommand of the crowbar client (typically run on the admin node).

crowbar batch

Unattended batch setup of barclamps:

root@crowbar:~ # crowbar batch build my-cloud.yaml

Dump current barclamps as YAML:

root@crowbar:~ # crowbar batch export
  • batch build is useful once you've learned the web UI.
  • batch export is useful for debugging and reproducible deployments.

YAML for Pacemaker remotes

- barclamp: pacemaker
  name: services
  attributes:
    stonith:
      mode: libvirt
      libvirt:
        hypervisor_ip: 192.168.217.1
    drbd:
      enabled: true
  deployment:
    elements:
      hawk-server:
      - "@@controller1@@"
      - "@@controller2@@"
      pacemaker-cluster-member:
      - "@@controller1@@"
      - "@@controller2@@"
      pacemaker-remote:
      - "@@compute1@@"
      - "@@compute2@@"

YAML input for KVM remote nodes

- barclamp: nova
  attributes:
    use_migration: true
    kvm:
      ksm_enabled: true
  deployment:
    elements:
      nova-controller:
      - cluster:cluster1
      nova-compute-kvm:
      - remotes:cluster1

Lab 5: Boot a VM

Boot a VM

Let's boot a VM to test compute node HA!

Connect to one of the controller nodes, and get image / flavor / net names:

source .openrc
openstack image list
openstack flavor list
neutron net-list

Boot the VM using these ids:

nova boot --image image --flavor flavor --nic net-id=net testvm

Test it's booted:

nova show testvm

Assign a floating IP

Create floating IP:

neutron floatingip-create floatingnet

Get VM IP:

nova list

Get port id:

neutron port-list | grep vmIP

Associate floating IP with VM port:

neutron floatingip-associate floatingipID portID

Allow ICMP

The VM uses the default security group. Make sure it allows ICMP; see the example below.
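One way to do that with the neutron CLI used elsewhere in these labs (rule values are the common defaults, not taken from the lab environment):

neutron security-group-rule-create --protocol icmp --direction ingress default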

Set up monitoring

  • Recommended in separate windows/terminals
  • From either of the controller nodes

Ping VM:

ping vmFloatingIP

Ping host where the VM is running:

nova list --fields host,name
ping hostIP

Set up monitoring (part 2)

Check log messages for NovaEvacuate workflow:

tail -f /var/log/messages | grep NovaEvacuate

Monitor cluster status:

crm_mon

Lab 6: test compute node failover

(the exciting bit!)

Simulate compute node failure

Log in to the compute node where the VM runs, and type:

pkill -9 -f pacemaker_remoted

This will cause fencing! (Why?)

The Pacemaker cluster loses connectivity to the compute node, so it has no way of knowing whether the node is really dead. The only way to safely recover its resources onto another remote is to fence the node first.

Verify recovery

  • Ping to the VM is interrupted, then resumed
  • Ping to the compute node is interrupted (then resumed)
  • Log messages show:
    NovaEvacuate [...] Initiating evacuation
    NovaEvacuate [...] Completed evacuation
    
  • crm status shows compute node offline (then back online)
  • Verify compute node was fenced
    • Check /var/log/messages on DC
  • Verify VM moved to another compute node
    nova list --fields host,name
    

Trouble-shooting

Verifying compute node failure detection

Pacemaker monitors compute nodes via pacemaker_remote.

If a compute node failure is detected:

  • the compute node is fenced
    • crm_mon etc. will show the node as unclean / offline
  • Pacemaker invokes fence-nova as a secondary fencing resource
    • see the fencing topology: crm configure show fencing_topology

Find the node running fence_compute:

crm resource show fence-nova

Verifying secondary fencing

fence_compute script:

  • tells the nova server that the node is down
  • updates an attribute on the compute node to indicate that it needs recovery

Log files:

  • /var/log/nova/fence_compute.log
  • /var/log/messages on DC and node running fence-nova

Verify attribute state via:

attrd_updater --query --all --name=evacuate

Verifying compute node failure recovery process

NovaEvacuate spots the attribute and calls nova evacuate:

root@controller1:~ # crm resource show nova-evacuate
resource nova-evacuate is running on: d52-54-77-77-77-02

nova resurrects the VM on another node:

root@controller2:~ # grep nova-evacuate /var/log/messages
NovaEvacuate [...] Initiating evacuation
NovaEvacuate [...] Completed evacuation

Warning: no retries if resurrection fails!

Process failures

pacemaker_remote looks after key compute node services.

  • Exercise: use crmsh on cl-g-nova-compute to find out which services it looks after
  • Try killing a process and see what happens
  • Try stopping a process and see what happens
  • Try breaking a process (e.g. corrupt config file and restart)
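A possible starting point for the exercise (the group name is taken from the slide above; output will vary by deployment):

# show which resources cl-g-nova-compute contains
crm configure show cl-g-nova-compute
# watch the cluster react as you kill / stop / break things
crm_mon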

Questions?

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany
+49 911 740 53 0 (Worldwide)
www.suse.com
Join us on: www.opensuse.org

