

Compute node HA in OpenStack

State of the Nation

Manchester OpenStack Meetup, Wed 13th April 2016

Adam Spiers

Senior Software Engineer, Cloud & HA

aspiers@suse.com

Agenda

  • HA in a typical OpenStack cloud today
  • When do we need HA for compute nodes?
  • Architectural challenges
  • Existing solutions
  • Advice on choosing a solution
  • Future work
  • Upstream community

HA in OpenStack today

Typical HA control plane

  • Increases cloud uptime
  • Automatic restart of OpenStack controller services
  • Active / active API services with load balancing
  • DB + MQ either active / active or active / passive

Under the covers

  • Recommended by official HA guide
  • HAProxy distributes service requests
  • Pacemaker
    • monitoring and control of nodes and services
  • Corosync
    • cluster membership / messaging / quorum / leadership election
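
For flavour, here is a minimal sketch (crm shell syntax; the IP address and resource names are invented for illustration) of how Pacemaker might tie a virtual IP to a set of HAProxy instances:

# Virtual IP that API clients connect to:
crm configure primitive vip-api ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=10s

# Run HAProxy on every controller; keep the VIP with a running copy:
crm configure primitive haproxy systemd:haproxy op monitor interval=10s
crm configure clone cl-haproxy haproxy
crm configure colocation vip-with-haproxy inf: vip-api cl-haproxy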

But what I really want to do is keep my workloads up!

When is compute HA important?

Pets vs. cattle

  • Pets are given names like mittens.mycompany.com
  • Each one is unique, lovingly hand-raised and cared for
  • When they get ill, you spend money nursing them back to health
  • Cattle are given names like vm0213.cloud.mycompany.com
  • They are almost identical to other cattle
  • When one gets ill, you shoot it and get another one
the clue's in the naming

What does that mean in practice?

  • Pets:
    • Service downtime when a pet dies
    • VM instances often stateful, with mission-critical data
    • Need automated recovery with data protection
  • Cattle:
    • Service resilient to instances dying
    • Stateless, or ephemeral (disposable) storage
    • Already ideal for cloud … but can still benefit from automated recovery!

If only the control plane is HA …

LHS (control plane) is HA, but cattle and pets live on the RHS (compute nodes), multiple per host

If compute node is hosting cattle …

automatically resurrect via OpenStack Orchestration (Heat) convergence feature (in theory?)

http://docs.openstack.org/developer/heat/ says "templates […] allow some more advanced functionality such as instance high availability […]" … but the wiki is hopelessly out of date, and HARestarter has been deprecated since Kilo in favour of convergence (see http://specs.openstack.org/openstack/heat-specs/).
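
As a rough sketch, on a Liberty-or-later Heat the convergence engine can be toggled in heat.conf (assumes the crudini tool; the option's default varies by release):

# Enable Heat's convergence engine, then restart the engine service
# (service name varies by distribution):
crudini --set /etc/heat/heat.conf DEFAULT convergence_engine true
systemctl restart openstack-heat-engine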

If compute node is hosting pets …

We have to resurrect very carefully in order to avoid any zombie pets

a zombie is a VM which appeared to be dead but didn't actually die properly; it could conflict with its resurrected twin

Architectural challenges

Reliability challenges

  • Needs to protect critical data ⇒ requires fencing of either:
    • the storage resources, or
    • the faulty node (a.k.a. STONITH; see the sketch after this list)
  • Needs to handle failure or (temporary) freeze of:
    • Hardware (including various NICs)
    • Kernel
    • OpenStack services
    • Hypervisor services (e.g. libvirt)
    • VM
    • Workload inside VM (ideally)
    • Control plane (including resurrection workflow)
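
As a taste of node fencing, a STONITH device for a single host might look roughly like this (crm shell syntax; the IPMI agent name varies between distributions, and all addresses/credentials are placeholders):

crm configure primitive stonith-compute-01 stonith:external/ipmi \
    params hostname=compute-01 ipaddr=10.0.0.101 \
           userid=admin passwd=secret interface=lanplus \
    op monitor interval=60m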

Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.

  • Protection for pets:
    • per AZ?
    • per project?
    • per pet?
  • If nova-compute fails, VMs are still perfectly healthy but unmanageable
    • Should they be automatically killed? Depends on the workload.
There is no one-size-fits-all solution to compute HA.

Scalability

  • Clouds will often scale to many compute nodes
    • 100s, or even 1000s
  • Typical clustering software is peer-to-peer
    • e.g. corosync requires <= 32 nodes
  • The obvious workarounds are ugly!

    • Multiple compute clusters

      • introduces unwanted artificial boundaries
    • Clusters inside / between guest VM instances

      • requires cloud users to modify guest images (installing & configuring cluster software)
      • cluster stacks are not OS-agnostic

    Cloud is supposed to make things easier, not harder!

Brief interlude: nova evacuate

nova evacuate

# nova help evacuate
usage: nova evacuate [--password <password>] [--on-shared-storage]
                     <server> [<host>]

Evacuate server from failed host.

# nova help host-evacuate
usage: nova host-evacuate [--target_host <target_host>] [--on-shared-storage]
                          <host>

Evacuate all instances from failed host.

  • Used by most HA solutions
  • Without shared storage, simply rebuilds from scratch
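
For example (hypothetical instance and host names; --on-shared-storage preserves the instances' disks):

# Resurrect a single instance from a dead host:
nova evacuate --on-shared-storage vm0213

# Resurrect everything that was running on the dead host:
nova host-evacuate --on-shared-storage failed-compute-01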

Public Health Warning

nova evacuate does not really mean evacuation!

Think about earthquakes

Not too late to evacuate

Too late to evacuate

nova terminology

nova live-migration

nova evacuate ?!

Public Health Warning

  • nova evacuate does not do evacuation
  • nova evacuate does resurrection (after releasing dependencies)
  • In Vancouver, nova developers considered a rename
    • Hasn't happened yet
    • Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”

Existing solutions

NovaCompute / NovaEvacuate OCF agents

  • Custom OCF Resource Agents (RAs)
    • Pacemaker plugins to manage resources
  • Used by Red Hat / SUSE, with contributions from Intel
  • Custom fencing agent (fence_compute) flags host for recovery
  • NovaEvacuate RA polls for flags, and initiates recovery
    • Will keep retrying if recovery not possible
  • NovaCompute RA starts / stops nova-compute
    • Start waits for recovery to complete
  • RAs upstream in openstack-resource-agents repo (maintained by me)
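
A rough sketch of the Pacemaker side (crm shell syntax; credentials are placeholders, and parameter names should be checked against the agents' metadata):

# Fencing agent which flags the failed host for recovery:
crm configure primitive fence-nova stonith:fence_compute \
    params auth-url="$OS_AUTH_URL" login="$OS_USERNAME" \
           passwd="$OS_PASSWORD" tenant-name="$OS_TENANT_NAME"

# NovaEvacuate polls for flagged hosts and kicks off nova evacuate:
crm configure primitive nova-evacuate ocf:openstack:NovaEvacuate \
    params auth_url="$OS_AUTH_URL" username="$OS_USERNAME" \
           password="$OS_PASSWORD" tenant_name="$OS_TENANT_NAME" \
    op monitor interval=10s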

NovaCompute / NovaEvacuate OCF agents

Scalability issue solved by pacemaker_remote

  • New(-ish) Pacemaker feature
  • Allows core cluster nodes to control "remote" nodes via a pacemaker_remote proxy service (daemon)
  • Can scale to very large numbers
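
Roughly (exact package and service names vary by distribution; compute-01 and controller1 are placeholders):

# On the compute node: install the cluster auth key and start the proxy:
scp controller1:/etc/pacemaker/authkey /etc/pacemaker/authkey
systemctl enable --now pacemaker_remote

# On a core cluster node: register the compute node as a remote node:
crm configure primitive compute-01 ocf:pacemaker:remote \
    params server=compute-01 op monitor interval=20s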

NovaCompute / NovaEvacuate OCF agents

Pros

  • Ready for production use now
  • Commercial support available
  • Tolerates simultaneous failures in compute / control planes

Cons

  • Known limitations (not bugs):
    • Only handles failure of the compute node, not of VMs or nova-compute
    • Some corner cases still problematic, e.g. if control plane fails during recovery

Masakari

  • https://github.com/ntt-sic/masakari
  • Developed by NTT
  • Similar architectural concept, different code
    • Recovery handled by separate service
    • Persists state to RDBMS
  • Monitors for 3 types of failure:
    • compute node down
    • nova-compute service down
    • VM down (detected via libvirt)
  • Recently switched to pacemaker_remote and SQLAlchemy

Masakari architecture

Mistral-based resurrection workflow

Pros

  • Congruous with upstream OpenStack strategy
  • Potential for integration with Congress for policy-based workflows

Cons

  • Still early stages; not yet usable by most
  • Mistral itself not yet HA (but could be fixed in Newton?)

Reuses components rather than adding yet another project

AWcloud / China Mobile

  • Very different solution
  • Presented in Tokyo
  • Uses Consul / raft / gossip instead of Pacemaker
  • Fencing via IPMI / self-fencing
  • Has some interesting capabilities
    • gossip potentially more resilient than peer-to-peer
    • action matrix: configurable per failure mode
  • But source code not available :-(

Senlin

ZeroStack

  • Presented in Tokyo
  • Proprietary hosted solution
  • Adaptive, self-healing approach
    • Every node is dynamically (re-)assigned a role
      • Could switch from controller to compute based on demand
    • Much harder to lose quorum, since non-voting nodes can be promoted to voting status

Which one should I pick?

Questions to ask

Do you need a vendor-supported, enterprise-ready solution for production clouds right now?

Recommendation: NovaCompute / NovaEvacuate OCF agents

Questions to ask (2)

Are you prepared to support the solution yourself, and invest some engineering effort on integration / DevOps?

Recommendation: masakari

  • Handles more failure cases than OCF RA approach
  • Fairly well tested and documented

Questions to ask (3)

Are you interested in collaborating on experimental technology?

  • mistral
    • One of the most promising approaches for the future
  • senlin

Questions to ask (4)

Do you work for AWcloud or China Mobile?

  • Use your own solution ;-)

Future work

  • Convergence of masakari with Mistral approach
    • Replace masakari process monitoring with Pacemaker
    • Figure out how masakari could harness Mistral
  • Create new specs repository and submit specs
  • Implement CI integration testing for failure cases
  • Interlock in Austin with developers and Product Working Group

Community

Community news

  • openstack-resource-agents project on stackforge
    • maintained by me
  • New #openstack-ha IRC channel on Freenode
    • automatic notifications for activity on HA repositories
  • New topic category on openstack-dev@ mailing list

    Subject: [HA] i can haz pets in my cloud?
    
  • Weekly IRC meetings on Mondays at 9am Europe/London

  • HA guide currently undergoing a revamp
  • Everyone welcome to get involved!

Questions?

Corporate Headquarters: Maxfeldstrasse 5, 90409 Nuremberg, Germany · +49 911 740 53 0 (Worldwide) · www.suse.com · Join us on: www.opensuse.org