
Presentation on compute HA for OpenStack Summit 2016 in Austin


About this presentation

HA for pets and hypervisors

State of the Nation

OpenStack Summit, Austin, Tuesday 26th April 2016

Adam Spiers

Senior Software Engineer, SUSE

aspiers@suse.com

Dawid Deja

Software Engineer, Intel

dawid.deja@intel.com

Agenda

  • HA in a typical OpenStack cloud today
  • When do we need HA for compute nodes?
  • Architectural challenges
  • Existing solutions
  • Advice on choosing a solution
  • Future work
  • Upstream community

HA in OpenStack today

Typical HA control plane

  • Automatic restart of controller services
  • Increases uptime of cloud
  • Active / active API services with load balancing
  • DB + MQ either active / active or active / passive

Under the covers

  • Recommended by official HA guide
  • keepalived / VRRP often used

SOLVED

(mostly)

  • HAProxy distributes service requests
  • Pacemaker monitors and controls nodes and services
  • These days, to a large extent this is a solved problem!
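
As a rough sketch (resource names and the address are placeholders, and details vary by deployment), the Pacemaker side of such a control plane often boils down to a floating virtual IP with HAProxy under cluster control:

# Floating virtual IP for the public API endpoints:
$ pcs resource create public-vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24

# HAProxy as a cloned resource running on all controllers:
$ pcs resource create haproxy systemd:haproxy --clone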

neutron HA is tricky, but outside the scope of this talk.

If only the control plane is HA …

The control plane on the LHS is HA, but VMs live on the RHS, so what happens if one of the compute nodes blows up? That's the topic of the rest of this talk!

When is compute HA important?

The previous slide suggests there is a problem which needs solving, but does it always need solving?

Addressing the elephant in the room

Compute node HA is a controversial feature: some people think it's an anti-pattern which does not belong in clouds, whereas others feel a strong need for it. To understand when it's needed, we first have to understand the different types of workload which people want to run in the cloud.

But what are pets?

Pets vs. cattle

  • Pets are given names like mittens.mycompany.com
  • Each one is unique, lovingly hand-raised and cared for
  • When they get ill, you nurse them back to health
  • Cattle are given names like vm0213.cloud.mycompany.com
  • They are almost identical to other cattle
  • When one gets ill, you get another one
  • Pets are typically given unique names, whereas cattle aren't.
  • This reflects that pets take a lot of work to create and look after, whereas cattle don't.
  • Similarly, when something goes wrong with a pet, you need to invest a lot of effort to fix it, whereas with cattle you just get another one.
  • thanks to CERN for this slide, and Bill Baker for the original terminology

What does that mean in practice?

  • Pets:
    • Service downtime when a pet dies
    • VM instances often stateful, with mission-critical data
    • Needs automated recovery with data protection
  • Cattle:
    • Service resilient to instances dying
    • Stateless, or ephemeral (disposable) storage
    • Already ideal for cloud … but automated recovery still needed!

If compute node is hosting cattle …

… to handle failures at scale, we need to automatically restart VMs somehow.

... otherwise over time, our service becomes more and more degraded, and manually restarting is a waste of time and unreliable due to the human element.

Heat used to support this, but HARestarter has been deprecated since Kilo. Heat is gaining convergence / self-healing capabilities, but nothing concrete is currently planned for instance auto-restarting.

http://docs.openstack.org/developer/heat/ says "templates […] allow some more advanced functionality such as instance high availability […]" but according to Thomas Herve (current Heat PTL) this is no longer supported.

The Heat/HA wiki page is also out of date.

If compute node is hosting pets …

… we have to resurrect very carefully in order to avoid any zombie pets!

This case is more complex than resurrecting cattle, due to the risk of zombie pets.

A zombie is a VM which appeared dead but didn't actually die properly; it could conflict with its resurrected twin.

Do we really need compute HA in OpenStack?

Why?

  • Compute HA needed for cattle as well as pets
  • Valid reasons for running pets in OpenStack
    • Manageability benefits
    • Want to avoid multiple virtual estates
    • Too expensive to cloudify legacy workloads

So to sum up, my vote is yes, because even cattle need compute node HA.

Also, rather than painful "big bang" migrations to cloud-aware workloads, it's easier to deprecate legacy workloads, let them reach EOL whilst gradually migrating over to next-generation architectures.

This is a controversial topic, but naysayers tend to favour idealism over real-world pragmatism.

Architectural challenges

If this really is needed functionality, why hasn't it already been done? The answer is that it's actually surprisingly tricky to implement in a reliable manner.

Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.

  • Protection for pets:
    • per AZ?
    • per project?
    • per pet?
  • If nova-compute fails, VMs are still perfectly healthy but unmanageable
    • Should they be automatically killed? Depends on the workload.

There is no one-size-fits-all solution to compute HA.

Compute plane needs to scale

CERN datacenter © Torkild Retvedt CC-BY-SA 2.0

Clouds will often scale to many compute nodes

  • 100s, or even 1000s

Full mesh clusters don't scale

Typical clustering software uses fully connected mesh topology, which doesn't scale to a large number of nodes, e.g. corosync supports a maximum of 32 nodes.

Addressing Scalability

The obvious workarounds are ugly!

  • Multiple compute clusters introduce unwanted artificial boundaries
  • Clusters inside / between guest VM instances are not OS-agnostic, and require cloud users to modify guest images (installing & configuring cluster software)
  • Cloud is supposed to make things easier not harder!

Common architecture

Scalability issue solved by pacemaker_remote

  • New(-ish) Pacemaker feature
  • Allows core cluster nodes to control "remote" nodes via a pacemaker_remote proxy service (daemon)
  • Can scale to very large numbers
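
A hedged sketch of the setup (the host name is a placeholder, and this assumes the cluster authkey has already been distributed to /etc/pacemaker/authkey on the compute node):

# On the compute node: run the lightweight proxy daemon instead of the full stack
$ systemctl enable --now pacemaker_remote

# On a core cluster node: manage the compute node as a remote resource
$ pcs resource create compute-1 ocf:pacemaker:remote server=compute-1.example.com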

Reliability challenges

  • Needs to protect critical data ⇒ requires fencing of either:

    • the storage resource, or
    • the faulty node (a.k.a. STONITH)
  • Needs to handle failure or (temporary) freeze of:

    • Hardware (including various NICs)
    • Kernel
    • Hypervisor services (e.g. libvirt)
    • OpenStack control plane services
      • including resurrection workflow
    • VM
    • Workload inside VM (ideally)

We assume that Pacemaker is reliable, otherwise we're sunk!

Brief interlude: nova evacuate

This is a good time to introduce nova evacuate.

nova's recovery API

  • If we have a compute node failure, after fencing the node, we need to resurrect the VMs in a way which OpenStack is aware of.
  • Luckily nova provides an API for doing this, which is called nova evacuate. So we just call that API and nova takes care of the rest.
  • Without shared storage, it simply rebuilds the VM from scratch
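
For illustration only (instance and host names are placeholders), driving this API by hand with the nova CLI looks like:

# Resurrect a single instance from a fenced host onto a healthy one:
$ nova evacuate my-pet-vm new-compute-host

# Or resurrect everything that was running on the failed host:
$ nova host-evacuate --target_host new-compute-host failed-compute-host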

Public Health Warning

nova evacuate does not really mean evacuation!

Think about natural disasters

Not too late to evacuate

Too late to evacuate

nova terminology

nova live-migration

nova evacuate ?!

Public Health Warning

  • In Vancouver, nova developers considered a rename
    • Hasn't happened yet
    • Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”

Existing F/OSS solutions

NovaCompute / NovaEvacuate OCF agents

  • Custom OCF Resource Agents (RAs)
    • Pacemaker plugins to manage resources
  • Custom fencing agent (fence_compute) flags host for recovery
  • NovaEvacuate RA polls for flags, and initiates recovery
    • Will keep retrying if recovery not possible
  • NovaCompute RA starts / stops nova-compute
    • Start waits for recovery to complete

RHEL OSP support

OCF RA approach is supported in RHEL OSP. Setup is manual; here is a fragment of the installation instructions.
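
The original slide showed a screenshot; roughly, and deferring to the upstream documentation, the key commands look like this (with the OS_* credentials adapted to your cloud):

# fence_compute fences the node and flags it for recovery:
$ pcs stonith create fence-nova fence_compute \
    auth-url=$OS_AUTH_URL login=$OS_USERNAME passwd=$OS_PASSWORD \
    tenant-name=$OS_TENANT_NAME record-only=1 --force

# NovaEvacuate polls for flagged hosts and initiates recovery:
$ pcs resource create nova-evacuate ocf:openstack:NovaEvacuate \
    auth_url=$OS_AUTH_URL username=$OS_USERNAME \
    password=$OS_PASSWORD tenant_name=$OS_TENANT_NAME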

NovaCompute / NovaEvacuate OCF agents

Pros

Cons

  • Known limitations (not bugs):
    • Only handles failure of the compute node, not of VMs or of nova-compute
    • Some corner cases still problematic, e.g. if nova fails during recovery

Masakari architecture

  • Similar architectural concept, different code
    • Recovery handled by separate controller service
    • Persists state to database
  • Monitors for 3 types of failure:
    • compute node down
    • nova-compute service down
    • VM down (detected via libvirt)

Masakari installation

Masakari analysis

Pros

  • Monitors VM health (externally)
  • More sophisticated recovery workflows

Cons

  • Looser integration with pacemaker
  • Failing nova-compute service will be disabled
  • Basically only uses Pacemaker as monitoring / fencing service
  • Waits 5 minutes after fencing

Mistral-based solution

Mistral

  • Workflow as a service
  • Lets users create arbitrary workflows
  • Extensible with custom actions
  • Workflow execution may be triggered:
    • by events from ceilometer
    • at a certain time (cloud cron)
    • on demand (API call)

The next solution is based on Mistral. Before explaining it, a word on what Mistral is. As the name suggests, Mistral is a 'workflow as a service' service: it lets you define a set of tasks and connect them into a logical graph, specifying for each task what to do in case of failure or success. Moreover, if the predefined tasks are not enough for you, you can write your own actions and plug them into Mistral. These actions are literally Python classes, so you can do anything inside them. Once a workflow is created, it can be triggered in various ways: by Ceilometer events, on a schedule, or, as used in the Mistral-based instance-HA solution, on demand via an API call.
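
As a purely hypothetical sketch (the workflow name and its input are invented for illustration), an on-demand trigger via the Mistral CLI would look like:

# Start an execution of an (invented) evacuation workflow for a failed host:
$ mistral execution-create evacuate_host '{"host": "failed-compute-1"}'

# Check how the execution is progressing:
$ mistral execution-list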

Mistral-based resurrection workflow

Pros

  • In line with upstream OpenStack strategy
  • Clean, simple approach
  • Potential for integration with Congress for policy-based workflows

Cons

  • Still experimental code; not yet usable by most
  • Mistral resilience WIP

Reuses existing components rather than adding yet another project. Using Congress, we can make different recovery decisions based on failure type, e.g. marking VMs as pets (see below). The main resilience concern is Mistral's own HA, which is still being worked on.

Evacuate workflow

The whole workflow should start with nova's mark-host-down API call if fencing has already happened; the retry loop is bounded rather than repeating forever.
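
For reference (the host name is a placeholder), marking a host down via the nova CLI:

$ nova service-force-down failed-compute-1 nova-compute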

Marking VMs as pets

# Mark an individual VM for evacuation:
$ nova meta very_important_VM set evacuate=true

# Or mark every VM of a given flavor (note the "evacuation:" scope prefix):
$ nova flavor-key very_important_flavor set evacuation:evacuate=true

There are two ways of marking VMs. The prefix in the flavor key is important: without it, when scheduling a VM with 'very_important_flavor', nova-scheduler would try to find a host aggregate with an 'evacuate' capability, and as a result the VM would end up in an ERROR state.

Senlin

  • https://wiki.openstack.org/wiki/Senlin
  • Clustering service for OpenStack
  • Orchestration of collections of similar objects
  • Policies for placement / load-balancing / health / scaling etc.
  • Fencing and resurrection not implemented yet

F/OSS solution functionality comparison

                                          OCF Agents   Masakari   Mistral
Policy
  Support for tagging VM for evacuation   Yes          Yes        Yes
  Customizable actions based on failure   No           No         Planned (via Congress)
Resilience
  Service is self-resilient               Yes          Yes        In progress
  Monitoring of VMs' (external) health    No           Yes        Planned
Recovery
  Uses force-down API                     Yes          No         Planned
  Disable failed nova-compute             No           Yes        Planned
  Fully parallel workflow                 No           No         Yes
  • Left column groups capabilities into 3 categories
  • Policy-based workflows via Congress
  • Two capabilities are currently unique to Masakari and should be carried over to future solutions

Common functionality:

  • Tolerate simultaneous failures in compute / control planes
  • Retry failed evacuations
  • Monitor node and hypervisor health

Proprietary solutions

ZeroStack

  • Presented in Tokyo
  • Proprietary cloud-in-a-box
  • SaaS management portal
  • VM HA coming in next release
  • Adaptive, self-healing approach
  • Every node is dynamically (re-)assigned a role
  • Much harder to lose quorum, since non-voting nodes can be promoted to voting status
  • Needs outgoing TCP port 443 for SaaS portal
  • Node could switch from controller to compute based on demand

AWcloud / China Mobile

Which one should I pick?

This advice is intended to be as impartial as possible, based on pure facts!

Unbiased decision tree

Future work

  • Interlock in Austin with developers and Product Working Group
  • "Best of breed" solution
  • Implement CI integration testing for failure cases
  • Create new specs repository and submit specs
  • nova evacuate API progress reporting could work similarly to nova live-migration progress

Pacemaker lunch meetup!

Wed 12:30pm, Expo Hall 5

Look for table with ClusterLabs sign

Community update

Questions?

Legal Notices and Disclaimers

  • Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
  • No computer system can be absolutely secure.
  • Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
  • Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
  • © 2016 Intel Corporation.

Corporate Headquarters: Maxfeldstrasse 5, 90409 Nuremberg, Germany. +49 911 740 53 0 (Worldwide). www.suse.com. Join us on: www.opensuse.org


About this presentation

You can watch the video of this presentation online. The original session abstract and details are also available.