
Presentation on compute HA for OpenStack Summit 2016 in Austin


About this presentation

HA for pets and hypervisors

State of the Nation

OpenStack Summit, Austin, Tuesday 26th April 2016

Adam Spiers

Senior Software Engineer, SUSE

aspiers@suse.com

Dawid Deja

Software Engineer, Intel

dawid.deja@intel.com

Agenda

  • HA in a typical OpenStack cloud today
  • When do we need HA for compute nodes?
  • Architectural challenges
  • Existing solutions
  • Advice on choosing a solution
  • Future work
  • Upstream community

HA in OpenStack today

Typical HA control plane

  • Automatic restart of controller services
  • Increases uptime of cloud
  • Active / active API services with load balancing
  • DB + MQ either active / active or active / passive

Under the covers

  • Recommended by official HA guide
  • keepalived / VRRP often used

SOLVED

(mostly)

  • HAProxy distributes service requests
  • Pacemaker monitors and controls nodes and services
  • These days, to a large extent this is a solved problem!
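
As a rough sketch (resource names and the address are placeholders, and details vary by deployment), the Pacemaker side of such a control plane often boils down to a floating virtual IP with HAProxy under cluster control:

# Floating virtual IP for the public API endpoints:
$ pcs resource create public-vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24

# HAProxy as a cloned resource running on all controllers:
$ pcs resource create haproxy systemd:haproxy --clone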

neutron HA is tricky, but outside the scope of this talk.

If only the control plane is HA …

The control plane on the LHS is HA, but VMs live on the RHS, so what happens if one of the compute nodes blows up? That's the topic of the rest of this talk!

When is compute HA important?

The previous slide suggests there is a problem which needs solving, but does it always need solving?

Addressing the elephant in the room

Compute node HA is a controversial feature: some people think it's an anti-pattern which does not belong in clouds, whereas others feel a strong need for it. To understand when it's needed, we first have to understand the different types of workload which people want to run in the cloud.

But what are pets?

Pets vs. cattle

  • Pets are given names like mittens.mycompany.com
  • Each one is unique, lovingly hand-raised and cared for
  • When they get ill, you nurse them back to health
  • Cattle are given names like vm0213.cloud.mycompany.com
  • They are almost identical to other cattle
  • When one gets ill, you get another one
  • Pets are typically given unique names, whereas cattle aren't.
  • This reflects that pets take a lot of work to create and look after, whereas cattle don't.
  • Similarly, when something goes wrong with a pet, you need to invest a lot of effort to fix it, whereas with cattle you just get another one.
  • thanks to CERN for this slide, and Bill Baker for the original terminology

What does that mean in practice?

  • Pets:
    • Service downtime when a pet dies
    • VM instances often stateful, with mission-critical data
    • Needs automated recovery with data protection
  • Cattle:
    • Service resilient to instances dying
    • Stateless, or ephemeral (disposable) storage
    • Already ideal for cloud … but automated recovery still needed!

If compute node is hosting cattle …

… to handle failures at scale, we need to automatically restart VMs somehow.

... otherwise over time, our service becomes more and more degraded, and manually restarting is a waste of time and unreliable due to the human element.

Heat used to support this, but HARestarter has been deprecated since Kilo. Heat is gaining convergence / self-healing capabilities, but nothing concrete is currently planned for instance auto-restarting.

http://docs.openstack.org/developer/heat/ says "templates […] allow some more advanced functionality such as instance high availability […]" but according to Thomas Herve (current Heat PTL) this is no longer supported.

The Heat/HA wiki page is also out of date.

If compute node is hosting pets …

… we have to resurrect very carefully in order to avoid any zombie pets!

This case is more complex than resurrecting cattle, due to the risk of zombie pets.

A zombie is a VM which appeared dead but didn't actually die properly; it could conflict with its resurrected twin.

Do we really need compute HA in OpenStack?

Why?

  • Compute HA needed for cattle as well as pets
  • Valid reasons for running pets in OpenStack
    • Manageability benefits
    • Want to avoid multiple virtual estates
    • Too expensive to cloudify legacy workloads

So to sum up, my vote is yes, because even cattle need compute node HA.

Also, rather than painful "big bang" migrations to cloud-aware workloads, it's easier to deprecate legacy workloads, let them reach EOL whilst gradually migrating over to next-generation architectures.

This is a controversial topic, but naysayers tend to favour idealism over real-world pragmatism.

Architectural challenges

If this really is needed functionality, why hasn't it already been done? The answer is that it's actually surprisingly tricky to implement in a reliable manner.

Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.

  • Protection for pets:
    • per AZ?
    • per project?
    • per pet?
  • If nova-compute fails, VMs are still perfectly healthy but unmanageable
    • Should they be automatically killed? Depends on the workload.

There is no one-size-fits-all solution to compute HA.

Compute plane needs to scale

CERN datacenter © Torkild Retvedt CC-BY-SA 2.0

Clouds will often scale to many compute nodes

  • 100s, or even 1000s

Full mesh clusters don't scale

Typical clustering software uses fully connected mesh topology, which doesn't scale to a large number of nodes, e.g. corosync supports a maximum of 32 nodes.

Addressing Scalability

The obvious workarounds are ugly!

  • Multiple compute clusters introduce unwanted artificial boundaries
  • Clusters inside / between guest VM instances are not OS-agnostic, and require cloud users to modify guest images (installing & configuring cluster software)
  • Cloud is supposed to make things easier not harder!

Common architecture

Scalability issue solved by pacemaker_remote

  • New(-ish) Pacemaker feature
  • Allows core cluster nodes to control "remote" nodes via a pacemaker_remote proxy service (daemon)
  • Can scale to very large numbers
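
A hedged sketch of the setup (the host name is a placeholder, and this assumes the cluster authkey has already been distributed to /etc/pacemaker/authkey on the compute node):

# On the compute node: run the lightweight proxy daemon instead of the full stack
$ systemctl enable --now pacemaker_remote

# On a core cluster node: manage the compute node as a remote resource
$ pcs resource create compute-1 ocf:pacemaker:remote server=compute-1.example.com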

Reliability challenges

  • Needs to protect critical data ⇒ requires fencing of either:

    • the storage resource, or
    • the faulty node (a.k.a. STONITH)
  • Needs to handle failure or (temporary) freeze of:

    • Hardware (including various NICs)
    • Kernel
    • Hypervisor services (e.g. libvirt)
    • OpenStack control plane services
      • including resurrection workflow
    • VM
    • Workload inside VM (ideally)

We assume that Pacemaker is reliable, otherwise we're sunk!

Brief interlude: nova evacuate

This is a good time to introduce nova evacuate.

nova's recovery API

  • If we have a compute node failure, after fencing the node, we need to resurrect the VMs in a way which OpenStack is aware of.
  • Luckily nova provides an API for doing this, which is called nova evacuate. So we just call that API and nova takes care of the rest.
  • Without shared storage, it simply rebuilds the VM from scratch
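
For illustration only (instance and host names are placeholders), driving this API by hand with the nova CLI looks like:

# Resurrect a single instance from a fenced host onto a healthy one:
$ nova evacuate my-pet-vm new-compute-host

# Or resurrect everything that was running on the failed host:
$ nova host-evacuate --target_host new-compute-host failed-compute-host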

Public Health Warning

nova evacuate does not really mean evacuation!

Think about natural disasters

Not too late to evacuate

Too late to evacuate

nova terminology

nova live-migration

nova evacuate ?!

Public Health Warning

  • In Vancouver, nova developers considered a rename
    • Hasn't happened yet
    • Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”

Existing F/OSS solutions

NovaCompute / NovaEvacuate OCF agents

  • Custom OCF Resource Agents (RAs)
    • Pacemaker plugins to manage resources
  • Custom fencing agent (fence_compute) flags host for recovery
  • NovaEvacuate RA polls for flags, and initiates recovery
    • Will keep retrying if recovery not possible
  • NovaCompute RA starts / stops nova-compute
    • Start waits for recovery to complete

RHEL OSP support

OCF RA approach is supported in RHEL OSP. Setup is manual; here is a fragment of the installation instructions.
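
The original slide showed a screenshot; roughly, and deferring to the upstream documentation, the key commands look like this (with the OS_* credentials adapted to your cloud):

# fence_compute fences the node and flags it for recovery:
$ pcs stonith create fence-nova fence_compute \
    auth-url=$OS_AUTH_URL login=$OS_USERNAME passwd=$OS_PASSWORD \
    tenant-name=$OS_TENANT_NAME record-only=1 --force

# NovaEvacuate polls for flagged hosts and initiates recovery:
$ pcs resource create nova-evacuate ocf:openstack:NovaEvacuate \
    auth_url=$OS_AUTH_URL username=$OS_USERNAME \
    password=$OS_PASSWORD tenant_name=$OS_TENANT_NAME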

NovaCompute / NovaEvacuate OCF agents

Pros

Cons

  • Known limitations (not bugs):
    • Only handles failure of the compute node, not of VMs or of nova-compute
    • Some corner cases still problematic, e.g. if nova fails during recovery

Masakari architecture

  • Similar architectural concept, different code
    • Recovery handled by separate controller service
    • Persists state to database
  • Monitors for 3 types of failure:
    • compute node down
    • nova-compute service down
    • VM down (detected via libvirt)

Masakari installation

Masakari analysis

Pros

  • Monitors VM health (externally)
  • More sophisticated recovery workflows

Cons

  • Looser integration with pacemaker
  • Failing nova-compute service will be disabled
  • Basically only uses Pacemaker as monitoring / fencing service
  • Waits 5 minutes after fencing

Mistral-based solution

Mistral

  • Workflow as a service
  • Lets users create arbitrary workflows
  • Extensible with custom actions
  • Workflow execution may be triggered:
    • by events from ceilometer
    • at a certain time (cloud cron)
    • on demand (API call)

The next solution is based on Mistral. Before explaining it, a word on what Mistral is. As the name suggests, Mistral is a 'workflow as a service' service: it lets you define a set of tasks and connect them into a logical graph, specifying for each task what to do in case of failure or success. Moreover, if the predefined tasks are not enough for you, you can write your own actions and plug them into Mistral. These actions are literally Python classes, so you can do anything inside them. Once a workflow is created, it can be triggered in various ways: by Ceilometer events, on a schedule, or, as used in the Mistral-based instance-HA solution, on demand via an API call.
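
As a purely hypothetical sketch (the workflow name and its input are invented for illustration), an on-demand trigger via the Mistral CLI would look like:

# Start an execution of an (invented) evacuation workflow for a failed host:
$ mistral execution-create evacuate_host '{"host": "failed-compute-1"}'

# Check how the execution is progressing:
$ mistral execution-list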

Mistral-based resurrection workflow

Pros

  • In line with upstream OpenStack strategy
  • Clean, simple approach
  • Potential for integration with Congress for policy-based workflows

Cons

  • Still experimental code; not yet usable by most
  • Mistral resilience WIP

Reuses existing components rather than adding yet another project. Using Congress, we can make different recovery decisions based on failure type, e.g. marking VMs as pets (see below). The main resilience concern is Mistral's own HA, which is still being worked on.

Evacuate workflow

The whole workflow should start with nova's mark-host-down API call if fencing has already happened; the retry loop is bounded rather than repeating forever.
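
For reference (the host name is a placeholder), marking a host down via the nova CLI:

$ nova service-force-down failed-compute-1 nova-compute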

Marking VMs as pets

# Mark an individual VM for evacuation:
$ nova meta very_important_VM set evacuate=true

# Or mark every VM of a given flavor (note the "evacuation:" scope prefix):
$ nova flavor-key very_important_flavor set evacuation:evacuate=true

There are two ways of marking VMs. The prefix in the flavor key is important: without it, when scheduling a VM with 'very_important_flavor', nova-scheduler would try to find a host aggregate with an 'evacuate' capability, and as a result the VM would end up in an ERROR state.

Senlin

  • https://wiki.openstack.org/wiki/Senlin
  • Clustering service for OpenStack
  • Orchestration of collections of similar objects
  • Policies for placement / load-balancing / health / scaling etc.
  • Fencing and resurrection not implemented yet

F/OSS solution functionality comparison

                                          OCF Agents   Masakari   Mistral
Policy
  Support for tagging VM for evacuation   Yes          Yes        Yes
  Customizable actions based on failure   No           No         Planned (via Congress)
Resilience
  Service is self-resilient               Yes          Yes        In progress
  Monitoring of VMs' (external) health    No           Yes        Planned
Recovery
  Uses force-down API                     Yes          No         Planned
  Disable failed nova-compute             No           Yes        Planned
  Fully parallel workflow                 No           No         Yes
  • Left column groups capabilities into 3 categories
  • Policy-based workflows via Congress
  • Two capabilities are currently unique to Masakari and should be carried over to future solutions

Common functionality:

  • Tolerate simultaneous failures in compute / control planes
  • Retry failed evacuations
  • Monitor node and hypervisor health

Proprietary solutions

ZeroStack

  • Presented in Tokyo
  • Proprietary cloud-in-a-box
  • SaaS management portal
  • VM HA coming in next release
  • Adaptive, self-healing approach
  • Every node is dynamically (re-)assigned a role
  • Much harder to lose quorum, since non-voting nodes can be promoted to voting status
  • Needs outgoing TCP port 443 for SaaS portal
  • Node could switch from controller to compute based on demand

AWcloud / China Mobile

Which one should I pick?

This advice is intended to be as impartial as possible, based on pure facts!

Unbiased decision tree

Future work

  • Interlock in Austin with developers and Product Working Group
  • "Best of breed" solution
  • Implement CI integration testing for failure cases
  • Create new specs repository and submit specs
  • nova evacuate API progress reporting could work similarly to nova live-migration progress

Pacemaker lunch meetup!

Wed 12:30pm, Expo Hall 5

Look for table with ClusterLabs sign

Community update

Questions?

Legal Notices and Disclaimers

  • Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
  • No computer system can be absolutely secure.
  • Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
  • Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
  • © 2016 Intel Corporation.

Corporate Headquarters: Maxfeldstrasse 5, 90409 Nuremberg, Germany. +49 911 740 53 0 (Worldwide). www.suse.com. Join us on: www.opensuse.org


About this presentation

You can watch the video of this presentation online. The original session abstract and details are also available.