

Compute node HA in OpenStack

State of the Nation

Manchester OpenStack Meetup, Wed 13th April 2016

Adam Spiers

Senior Software Engineer, Cloud & HA

aspiers@suse.com

Agenda

  • HA in a typical OpenStack cloud today
  • When do we need HA for compute nodes?
  • Architectural challenges
  • Existing solutions
  • Advice on choosing a solution
  • Future work
  • Upstream community

HA in OpenStack today

Typical HA control plane

  • Increases cloud uptime
  • Automatic restart of OpenStack controller services
  • Active / active API services with load balancing
  • DB + MQ either active / active or active / passive

Under the covers

  • Recommended by official HA guide
  • HAProxy distributes service requests
  • Pacemaker
    • monitoring and control of nodes and services
  • Corosync
    • cluster membership / messaging / quorum / leadership election
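
For flavour, here is a minimal sketch (crm shell syntax; the IP address and resource names are invented for illustration) of how Pacemaker might tie a virtual IP to a set of HAProxy instances:

# Virtual IP that API clients connect to:
crm configure primitive vip-api ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=10s

# Run HAProxy on every controller; keep the VIP with a running copy:
crm configure primitive haproxy systemd:haproxy op monitor interval=10s
crm configure clone cl-haproxy haproxy
crm configure colocation vip-with-haproxy inf: vip-api cl-haproxy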

But what I really want to do is keep my workloads up!

When is compute HA important?

Pets vs. cattle

  • Pets are given names like mittens.mycompany.com
  • Each one is unique, lovingly hand-raised and cared for
  • When they get ill, you spend money nursing them back to health
  • Cattle are given names like vm0213.cloud.mycompany.com
  • They are almost identical to other cattle
  • When one gets ill, you shoot it and get another one
the clue's in the naming

What does that mean in practice?

  • Pets:
    • Service downtime when a pet dies
    • VM instances often stateful, with mission-critical data
    • Need automated recovery with data protection
  • Cattle:
    • Service resilient to instances dying
    • Stateless, or ephemeral (disposable) storage
    • Already ideal for cloud … but can still benefit from automated recovery!

If only the control plane is HA …

LHS (control plane) is HA, but cattle and pets live on the RHS (compute nodes), multiple per host

If compute node is hosting cattle …

automatically resurrect via OpenStack Orchestration (Heat) convergence feature (in theory?)

http://docs.openstack.org/developer/heat/ says "templates […] allow some more advanced functionality such as instance high availability […]" … but the wiki is hopelessly out of date, and HARestarter has been deprecated since Kilo in favour of convergence (see http://specs.openstack.org/openstack/heat-specs/).
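
As a rough sketch, on a Liberty-or-later Heat the convergence engine can be toggled in heat.conf (assumes the crudini tool; the option's default varies by release):

# Enable Heat's convergence engine, then restart the engine service
# (service name varies by distribution):
crudini --set /etc/heat/heat.conf DEFAULT convergence_engine true
systemctl restart openstack-heat-engine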

If compute node is hosting pets …

We have to resurrect very carefully in order to avoid any zombie pets

a zombie is a VM which appeared to be dead but didn't actually die properly; it could conflict with its resurrected twin

Architectural challenges

Reliability challenges

  • Needs to protect critical data ⇒ requires fencing of either:
    • the storage resources, or
    • the faulty node (a.k.a. STONITH; see the sketch after this list)
  • Needs to handle failure or (temporary) freeze of:
    • Hardware (including various NICs)
    • Kernel
    • OpenStack services
    • Hypervisor services (e.g. libvirt)
    • VM
    • Workload inside VM (ideally)
    • Control plane (including resurrection workflow)
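
As a taste of node fencing, a STONITH device for a single host might look roughly like this (crm shell syntax; the IPMI agent name varies between distributions, and all addresses/credentials are placeholders):

crm configure primitive stonith-compute-01 stonith:external/ipmi \
    params hostname=compute-01 ipaddr=10.0.0.101 \
           userid=admin passwd=secret interface=lanplus \
    op monitor interval=60m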

Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.

  • Protection for pets:
    • per AZ?
    • per project?
    • per pet?
  • If nova-compute fails, VMs are still perfectly healthy but unmanageable
    • Should they be automatically killed? Depends on the workload.
There is no one-size-fits-all solution to compute HA.

Scalability

  • Clouds will often scale to many compute nodes
    • 100s, or even 1000s
  • Typical clustering software is peer-to-peer
    • e.g. corosync requires <= 32 nodes
  • The obvious workarounds are ugly!

    • Multiple compute clusters

      • introduces unwanted artificial boundaries
    • Clusters inside / between guest VM instances

      • requires cloud users to modify guest images (installing & configuring cluster software)
      • cluster stacks are not OS-agnostic

    Cloud is supposed to make things easier, not harder!

Brief interlude: nova evacuate

nova evacuate

# nova help evacuate
usage: nova evacuate [--password <password>] [--on-shared-storage]
                     <server> [<host>]

Evacuate server from failed host.

# nova help host-evacuate
usage: nova host-evacuate [--target_host <target_host>] [--on-shared-storage]
                          <host>

Evacuate all instances from failed host.

  • Used by most HA solutions
  • Without shared storage, simply rebuilds from scratch
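
For example (hypothetical instance and host names; --on-shared-storage preserves the instances' disks):

# Resurrect a single instance from a dead host:
nova evacuate --on-shared-storage vm0213

# Resurrect everything that was running on the dead host:
nova host-evacuate --on-shared-storage failed-compute-01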

Public Health Warning

nova evacuate does not really mean evacuation!

Think about earthquakes

Not too late to evacuate

Too late to evacuate

nova terminology

nova live-migration

nova evacuate ?!

Public Health Warning

  • nova evacuate does not do evacuation
  • nova evacuate does resurrection (after releasing dependencies)
  • In Vancouver, nova developers considered a rename
    • Hasn't happened yet
    • Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”

Existing solutions

NovaCompute / NovaEvacuate OCF agents

  • Custom OCF Resource Agents (RAs)
    • Pacemaker plugins to manage resources
  • Used by Red Hat / SUSE, with contributions from Intel
  • Custom fencing agent (fence_compute) flags host for recovery
  • NovaEvacuate RA polls for flags, and initiates recovery
    • Will keep retrying if recovery not possible
  • NovaCompute RA starts / stops nova-compute
    • Start waits for recovery to complete
  • RAs upstream in openstack-resource-agents repo (maintained by me)
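
A rough sketch of the Pacemaker side (crm shell syntax; credentials are placeholders, and parameter names should be checked against the agents' metadata):

# Fencing agent which flags the failed host for recovery:
crm configure primitive fence-nova stonith:fence_compute \
    params auth-url="$OS_AUTH_URL" login="$OS_USERNAME" \
           passwd="$OS_PASSWORD" tenant-name="$OS_TENANT_NAME"

# NovaEvacuate polls for flagged hosts and kicks off nova evacuate:
crm configure primitive nova-evacuate ocf:openstack:NovaEvacuate \
    params auth_url="$OS_AUTH_URL" username="$OS_USERNAME" \
           password="$OS_PASSWORD" tenant_name="$OS_TENANT_NAME" \
    op monitor interval=10s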

NovaCompute / NovaEvacuate OCF agents

Scalability issue solved by pacemaker_remote

  • New(-ish) Pacemaker feature
  • Allows core cluster nodes to control "remote" nodes via a pacemaker_remote proxy service (daemon)
  • Can scale to very large numbers
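
Roughly (exact package and service names vary by distribution; compute-01 and controller1 are placeholders):

# On the compute node: install the cluster auth key and start the proxy:
scp controller1:/etc/pacemaker/authkey /etc/pacemaker/authkey
systemctl enable --now pacemaker_remote

# On a core cluster node: register the compute node as a remote node:
crm configure primitive compute-01 ocf:pacemaker:remote \
    params server=compute-01 op monitor interval=20s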

NovaCompute / NovaEvacuate OCF agents

Pros

  • Ready for production use now
  • Commercial support available
  • Tolerates simultaneous failures in compute / control planes

Cons

  • Known limitations (not bugs):
    • Only handles failure of the compute node, not of VMs or nova-compute
    • Some corner cases still problematic, e.g. if control plane fails during recovery

Masakari

  • https://github.com/ntt-sic/masakari
  • Developed by NTT
  • Similar architectural concept, different code
    • Recovery handled by separate service
    • Persists state to RDBMS
  • Monitors for 3 types of failure:
    • compute node down
    • nova-compute service down
    • VM down (detected via libvirt)
  • Recently switched to pacemaker_remote and SQLAlchemy

Masakari architecture

Mistral-based resurrection workflow

Pros

  • Congruous with upstream OpenStack strategy
  • Potential for integration with Congress for policy-based workflows

Cons

  • Still early stages; not yet usable by most
  • Mistral itself not yet HA (but could be fixed in Newton?)

Reuses components rather than adding yet another project

AWcloud / China Mobile

  • Very different solution
  • Presented in Tokyo
  • Uses Consul / raft / gossip instead of Pacemaker
  • Fencing via IPMI / self-fencing
  • Has some interesting capabilities
    • gossip potentially more resilient than peer-to-peer
    • action matrix: configurable per failure mode
  • But source code not available :-(

Senlin

ZeroStack

  • Presented in Tokyo
  • Proprietary hosted solution
  • Adaptive, self-healing approach
    • Every node is dynamically (re-)assigned a role
      • Could switch from controller to compute based on demand
    • Much harder to lose quorum, since non-voting nodes can be promoted to voting status

Which one should I pick?

Questions to ask

Do you need a vendor-supported, enterprise-ready solution for production clouds right now?

Recommendation: NovaCompute / NovaEvacuate OCF agents

Questions to ask (2)

Are you prepared to support the solution yourself, and invest some engineering effort on integration / DevOps?

Recommendation: masakari

  • Handles more failure cases than OCF RA approach
  • Fairly well tested and documented

Questions to ask (3)

Are you interested in collaborating on experimental technology?

  • mistral
    • One of the most promising approaches for the future
  • senlin

Questions to ask (4)

Do you work for AWcloud or China Mobile?

  • Use your own solution ;-)

Future work

  • Convergence of masakari with Mistral approach
    • Replace masakari process monitoring with Pacemaker
    • Figure out how masakari could harness Mistral
  • Create new specs repository and submit specs
  • Implement CI integration testing for failure cases
  • Interlock in Austin with developers and Product Working Group

Community

Community news

  • openstack-resource-agents project on stackforge
    • maintained by me
  • New #openstack-ha IRC channel on Freenode
    • automatic notifications for activity on HA repositories
  • New topic category on openstack-dev@ mailing list

    Subject: [HA] i can haz pets in my cloud?
    
  • Weekly IRC meetings on Mondays at 9am Europe/London

  • HA guide currently undergoing a revamp
  • Everyone welcome to get involved!

Questions?

Corporate Headquarters: Maxfeldstrasse 5, 90409 Nuremberg, Germany · +49 911 740 53 0 (Worldwide) · www.suse.com · Join us on: www.opensuse.org