
About this presentation

  • Press ? for help on navigating these slides
  • Press m for a slide menu
  • Press s for speaker notes

Compute node HA

Hands-On Training

Nürnberg, Wednesday 11th May 2016

Adam Spiers

Senior Software Engineer, SUSE

aspiers@suse.com

Agenda

  • HA in a typical OpenStack cloud today
  • When do we need HA for compute nodes?
  • Architectural challenges
  • Solution in SUSE OpenStack Cloud
  • Failover testing

HA in OpenStack today

Typical HA control plane

  • Automatic restart of controller services
  • Increases uptime of cloud
  • Active / active API services with load balancing
  • DB + MQ either active / active or active / passive

Under the covers

SOLVED

(mostly)

  • HAProxy distributes service requests
  • Pacemaker monitors and controls nodes and services
  • These days, to a large extent this is a solved problem!

neutron HA is tricky, but outside the scope of this talk.

If only the control plane is HA …

The control plane on the LHS is HA, but VMs live on the RHS, so what happens if one of the compute nodes blows up? That's the topic of the rest of this talk!

When is compute HA important?

The previous slide suggests there is a problem which needs solving, but does it always need solving?

Addressing the white elephant in the room

Compute node HA is a controversial feature: some people think it's an anti-pattern which does not belong in clouds, whereas others feel a strong need for it. To understand when it's needed, we first have to understand the different types of workload which people want to run in the cloud.

But what are pets?

Pets vs. cattle

  • Pets are given names like mittens.mycompany.com
  • Each one is unique, lovingly hand-raised and cared for
  • When they get ill, you nurse them back to health
  • Cattle are given names like vm0213.cloud.mycompany.com
  • They are almost identical to other cattle
  • When one gets ill, you get another one
  • Pets are typically given unique names, whereas cattle aren't.
  • This reflects that pets take a lot of work to create and look after, whereas cattle don't.
  • Similarly, when something goes wrong with a pet, you need to invest a lot of effort to fix it, whereas with cattle you just get another one.
  • thanks to CERN for this slide, and Bill Baker for the original terminology

What does that mean in practice?

Pets:

  • Service downtime when a pet dies
  • VM instances often stateful, with mission-critical data
  • Needs automated recovery with data protection

Cattle:

  • Service resilient to instances dying
  • Stateless, or ephemeral (disposable) storage
  • Already ideal for cloud … but automated recovery still needed!

If compute node is hosting cattle …

… to handle failures at scale, we need to automatically restart VMs somehow.

... otherwise our service becomes more and more degraded over time, and restarting instances manually is a waste of time and unreliable due to the human element.

Heat used to support this, but HARestarter has been deprecated since Kilo. Heat is gaining convergence / self-healing capabilities, but nothing concrete is currently planned for instance auto-restarting.

http://docs.openstack.org/developer/heat/ says "templates […] allow some more advanced functionality such as instance high availability […]" but according to Thomas Herve (current Heat PTL) this is no longer supported.

Heat/HA wiki out of date

If compute node is hosting pets …

… we have to resurrect very carefully in order to avoid any zombie pets!

This case is more complex than resurrecting cattle, due to the risk of zombie pets.

A zombie is a VM which appeared dead but didn't actually die properly - it could conflict with its resurrected twin.

Do we really need compute HA in OpenStack?

Why?

  • Compute HA needed for cattle as well as pets
  • Valid reasons for running pets in OpenStack
    • Manageability benefits
    • Want to avoid multiple virtual estates
    • Too expensive to cloudify legacy workloads

So to sum up, my vote is yes, because even cattle need compute node HA.

Also, rather than painful "big bang" migrations to cloud-aware workloads, it's easier to deprecate legacy workloads and let them reach EOL whilst gradually migrating over to next-generation architectures.

This is a controversial topic, but naysayers tend to favour idealism over real world pragmatism.

Architectural challenges

If this really is needed functionality, why hasn't it already been done? The answer is that it's actually surprisingly tricky to implement in a reliable manner.

Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.

  • Protection for pets:
    • per AZ?
    • per project?
    • per pet?
  • If nova-compute fails, VMs are still perfectly healthy but unmanageable
    • Should they be automatically killed? Depends on the workload.

There is no one-size-fits-all solution to compute HA.

Compute plane needs to scale

CERN datacenter © Torkild Retvedt CC-BY-SA 2.0

Clouds will often scale to many compute nodes

  • 100s, or even 1000s

Full mesh clusters don't scale

Typical clustering software uses fully connected mesh topology, which doesn't scale to a large number of nodes, e.g. corosync supports a maximum of 32 nodes.

Addressing Scalability

The obvious workarounds are ugly!

  • Multiple compute clusters introduce unwanted artificial boundaries
  • Clusters inside / between guest VM instances are not OS-agnostic, and require cloud users to modify guest images (installing & configuring cluster software)
  • Cloud is supposed to make things easier not harder!

Common architecture

Scalability issue solved by pacemaker_remote

  • New(-ish) Pacemaker feature
  • Allows core cluster nodes to control "remote" nodes via a pacemaker_remote proxy service (daemon)
  • Can scale to very large numbers
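For illustration, a remote node is defined in the core cluster as a resource using the ocf:pacemaker:remote agent, along these lines (the node name and IP below are placeholders, not taken from this deployment; the Pacemaker barclamp generates the real configuration for you):

# illustrative only - the barclamp generates the real resource definition
crm configure primitive remote-compute1 ocf:pacemaker:remote \
    params server=192.168.124.81 reconnect_interval=60 \
    op monitor interval=20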

Reliability challenges

  • Needs to protect critical data ⇒ requires fencing of either:

    • the storage resource, or
    • the faulty node (a.k.a. STONITH)
  • Needs to handle failure or (temporary) freeze of:

    • Hardware (including various NICs)
    • Kernel
    • Hypervisor services (e.g. libvirt)
    • OpenStack control plane services
      • including resurrection workflow
    • VM
    • Workload inside VM (ideally)

We assume that Pacemaker is reliable, otherwise we're sunk!

Labs data sheet

  • Admin server: crowbar.c$cloud_number
  • Host (hypervisor): blacher.arch.suse.de
  • ssh controller1
  • ssh compute2 etc.

Lab 1: add remotes to Pacemaker cluster

Starting point

  • 2 controllers in HA cluster
  • 3 compute nodes
  • All barclamps deployed!

Pacemaker barclamp clusters, nodes, and roles

First delete any existing role assignments by clicking Remove all.

Pacemaker role assignment

Apply Pacemaker proposal

Check progress of proposal

root@crowbar:~ # tail -f /var/log/crowbar/production.log
root@crowbar:~ # tail -f /var/log/crowbar/chef-client/*.log

Check status of cluster nodes and remotes

Log in to one of the controller nodes and run a cluster status check.
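The exact command isn't shown on the slide; a likely check, given the crm_mon / crm status usage later in these labs, is:

root@controller1:~ # crm status

The pacemaker-remote nodes should show up as online alongside the two controller nodes.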

Compute HA in SUSE OpenStack Cloud

NovaCompute / NovaEvacuate OCF agents

  • Custom OCF Resource Agents (RAs)
    • Pacemaker plugins to manage resources
  • Custom fencing agent (fence_compute) flags host for recovery
  • NovaEvacuate RA polls for flags, and initiates recovery
    • Will keep retrying if recovery not possible
  • NovaCompute RA starts / stops nova-compute
    • Start waits for recovery to complete
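As a rough sketch of how these pieces typically fit together in a Pacemaker configuration (resource names and parameter values below are illustrative, not the configuration generated by the deployment tooling):

# illustrative values only - not the generated configuration
crm configure primitive fence-nova stonith:fence_compute \
    params auth-url="http://controller:5000/v2.0" login=admin passwd=secret tenant-name=admin
crm configure primitive nova-evacuate ocf:openstack:NovaEvacuate \
    params auth_url="http://controller:5000/v2.0" username=admin password=secret tenant_name=admin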

RHEL OSP installation

OCF RA approach is supported in RHEL OSP. Setup is manual; here is a fragment of the installation instructions.

RHEL OSP installation (page 2)

RHEL OSP installation (page 3)

RHEL OSP installation (page 4)

RHEL OSP installation (page 175)

NovaCompute / NovaEvacuate OCF agents

Pros

Cons

  • Known limitations (not bugs):
    • Only handles failure of compute node, not of VMs, or nova-compute
    • Some corner cases still problematic, e.g. if nova fails during recovery

SUSE's solution is incredibly easy to deploy, as we'll see next!

Lab 3: nova setup

Edit Nova proposal

Nova proposal: clusters available

Nova proposal: role assignment

Apply Nova proposal

Check status of nova resources in cluster

Brief interlude: nova evacuate

This is a good time to introduce nova evacuate.

nova's recovery API

  • If we have a compute node failure, after fencing the node, we need to resurrect the VMs in a way which OpenStack is aware of.
  • Luckily nova provides an API for doing this, which is called nova evacuate. So we just call that API and nova takes care of the rest.
  • Without shared storage, simply rebuilds from scratch
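For reference, the same API can also be driven by hand with the nova CLI; a hedged example (instance and host names are placeholders):

# resurrect one instance from its failed (and fenced!) host
nova evacuate testvm
# resurrect everything that was running on a dead host
nova host-evacuate failed-compute-node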

Public Health Warning

nova evacuate does not really mean evacuation!

Think about natural disasters

Not too late to evacuate

Too late to evacuate

nova terminology

nova live-migration

nova evacuate ?!

Public Health Warning

  • In Vancouver, nova developers considered a rename
    • Has not happened yet
    • Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”

Shared storage

Where can we have shared storage?

Two key areas:

  • /var/lib/glance/images on controller nodes
  • /var/lib/nova/instances on compute nodes

When do we need shared storage?

If /var/lib/nova/instances is shared:

  • VM's ephemeral disk will be preserved during recovery

Otherwise:

  • VM disk will be lost
  • recovery will need to rebuild VM from image

Either way, /var/lib/glance/images should be shared across all controllers (unless using Swift / Ceph)

  • otherwise nova might fail to retrieve image from glance

How crowbar batch set up shared storage

We're using admin server's NFS server:

  • Only suitable for testing purposes!
  • In production, use SES / SAN
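Purely as a sketch of the format, the NFS exports for these directories might look something like this (paths and subnet are illustrative; check the real /etc/exports on the admin server, as in the next step):

# illustrative /etc/exports entries
/var/lib/glance/images   192.168.124.0/24(rw,async,no_root_squash,no_subtree_check)
/var/lib/nova/instances  192.168.124.0/24(rw,async,no_root_squash,no_subtree_check)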

Verify setup of shared storage

  • Locate shared directories via nfs_client barclamp
  • Check /etc/exports on admin server
  • Check /etc/fstab on controller / compute nodes
  • Run mount on controller / compute nodes
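For example, the last two checks on a compute node could look like this (path as on the earlier slides):

grep /var/lib/nova/instances /etc/fstab
mount | grep /var/lib/nova/instances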

Intro to crowbar batch

batch is a subcommand of the crowbar client (typically run on the admin node).

crowbar batch

Unattended batch setup of barclamps:

root@crowbar:~ # crowbar batch build my-cloud.yaml

Dump current barclamps as YAML:

root@crowbar:~ # crowbar batch export
  • batch build is useful once you've learned the web UI.
  • batch export is useful for debugging and reproducible deployments.

YAML for Pacemaker remotes

- barclamp: pacemaker
  name: services
  attributes:
    stonith:
      mode: libvirt
      libvirt:
        hypervisor_ip: 192.168.217.1
    drbd:
      enabled: true
  deployment:
    elements:
      hawk-server:
      - "@@controller1@@"
      - "@@controller2@@"
      pacemaker-cluster-member:
      - "@@controller1@@"
      - "@@controller2@@"
      pacemaker-remote:
      - "@@compute1@@"
      - "@@compute2@@"

YAML input for KVM remote nodes

- barclamp: nova
  attributes:
    use_migration: true
    kvm:
      ksm_enabled: true
  deployment:
    elements:
      nova-controller:
      - cluster:cluster1
      nova-compute-kvm:
      - remotes:cluster1

Lab 5: Boot a VM

Boot a VM

Let's boot a VM to test compute node HA!

Connect to one of the controller nodes, and get image / flavor / net names:

source .openrc
openstack image list
openstack flavor list
neutron net-list

Boot the VM using these ids:

nova boot --image image --flavor flavor --nic net-id=net testvm

Test it's booted:

nova show testvm

Assign a floating IP

Create floating IP:

neutron floatingip-create floatingnet

Get VM IP:

nova list

Get port id:

neutron port-list | grep vmIP

Associate floating IP with VM port:

neutron floatingip-associate floatingipID portID

Allow ICMP

The VM uses the default security group. Make sure it allows ICMP; see the example below.
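One way to do that with the neutron CLI used elsewhere in these labs (rule values are the common defaults, not taken from the lab environment):

neutron security-group-rule-create --protocol icmp --direction ingress default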

Set up monitoring

  • Recommended in separate windows/terminals
  • From either of the controller nodes

Ping VM:

ping vmFloatingIP

Ping host where the VM is running:

nova list --fields host,name
ping hostIP

Set up monitoring (part 2)

Check log messages for NovaEvacuate workflow:

tail -f /var/log/messages | grep NovaEvacuate

Monitor cluster status:

crm_mon

Lab 6: test compute node failover

(the exciting bit!)

Simulate compute node failure

Log in to the compute node where the VM runs, and type:

pkill -9 -f pacemaker_remoted

This will cause fencing! (Why?)

The Pacemaker cluster loses connectivity to the compute node, so it has no way of knowing whether the node is really dead. The only way to safely recover its resources onto another remote is to fence the node first.

Verify recovery

  • Ping to the VM is interrupted, then resumed
  • Ping to the compute node is interrupted (then resumed)
  • Log messages show:
    NovaEvacuate [...] Initiating evacuation
    NovaEvacuate [...] Completed evacuation
    
  • crm status shows compute node offline (then back online)
  • Verify compute node was fenced
    • Check /var/log/messages on DC
  • Verify VM moved to another compute node
    nova list --fields host,name
    

Trouble-shooting

Verifying compute node failure detection

Pacemaker monitors compute nodes via pacemaker_remote.

If a compute node failure is detected:

  • the compute node is fenced
    • crm_mon etc. will show the node as unclean / offline
  • Pacemaker invokes fence-nova as a secondary fencing resource
    • see the fencing topology: crm configure show fencing_topology

Find the node running fence_compute:

crm resource show fence-nova

Verifying secondary fencing

fence_compute script:

  • tells the nova server that the node is down
  • updates an attribute on the compute node to indicate that it needs recovery

Log files:

  • /var/log/nova/fence_compute.log
  • /var/log/messages on DC and node running fence-nova

Verify attribute state via:

attrd_updater --query --all --name=evacuate

Verifying compute node failure recovery process

NovaEvacuate spots the attribute and calls nova evacuate:

root@controller1:~ # crm resource show nova-evacuate
resource nova-evacuate is running on: d52-54-77-77-77-02

nova resurrects the VM on another node:

root@controller2:~ # grep nova-evacuate /var/log/messages
NovaEvacuate [...] Initiating evacuation
NovaEvacuate [...] Completed evacuation

Warning: no retries if resurrection fails!

Process failures

pacemaker_remote looks after key compute node services.

  • Exercise: use crmsh on cl-g-nova-compute to find out which services it looks after
  • Try killing a process and see what happens
  • Try stopping a process and see what happens
  • Try breaking a process (e.g. corrupt config file and restart)
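A possible starting point for the exercise (the group name is taken from the slide above; output will vary by deployment):

# show which resources cl-g-nova-compute contains
crm configure show cl-g-nova-compute
# watch the cluster react as you kill / stop / break things
crm_mon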

Questions?

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany
+49 911 740 53 0 (Worldwide)
www.suse.com
Join us on: www.opensuse.org

