HA for pets and hypervisors
State of the Nation
Adam Spiers
Senior Software Engineer, SUSE
Dawid Deja
Software Engineer, Intel
Agenda
- HA in a typical OpenStack cloud today
- When do we need HA for compute nodes?
- Architectural challenges
- Existing solutions
- Advice on choosing a solution
- Future work
- Upstream community
Typical HA control plane
- Automatic restart of controller services
- Increases uptime of cloud
- Active / active API services with load balancing
- DB + MQ either active / active or active / passive
Under the covers
- Recommended by official HA guide
- keepalived / VRRP often used
- HAProxy distributes service requests
- Pacemaker monitors and controls nodes and services
- These days, to a large extent this is a solved problem!
neutron HA is tricky, but out of scope for this talk.
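As a rough sketch, a minimal Pacemaker configuration for a virtual IP plus a cloned HAProxy could look like this in crm shell syntax; the address and resource names are invented for the example:

$ crm configure primitive vip ocf:heartbeat:IPaddr2 \
      params ip=192.0.2.10 cidr_netmask=24 \
      op monitor interval=10s
$ crm configure primitive haproxy systemd:haproxy \
      op monitor interval=15s
$ crm configure clone cl-haproxy haproxy
$ crm configure colocation vip-with-haproxy inf: vip cl-haproxy

Pacemaker keeps the VIP wherever a healthy HAProxy instance is running; keepalived / VRRP is an alternative for just the VIP part.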
If only the control plane is HA …
The control plane on the left-hand side is HA, but VMs live on the right-hand side, on the compute nodes, so what happens if one of those compute nodes blows up? That's the topic of the rest of this talk!
When is compute HA important?
The previous slide suggests there is a problem which needs
solving, but does it always need solving?
Addressing the white elephant in the room
Compute node HA is a controversial feature: some people consider it an anti-pattern which does not belong in clouds, whereas others feel a strong need for it.
To understand when it's needed, first we have to understand
the different types of workload which people want to run in
the cloud.
But what are pets?
Pets vs. cattle
- Pets are given names like mittens.mycompany.com
- Each one is unique, lovingly hand-raised and cared for
- When they get ill, you nurse them back to health
- Cattle are given names like vm0213.cloud.mycompany.com
- They are almost identical to other cattle
- When one gets ill, you get another one
- Pets are typically given unique names, whereas cattle aren't.
- This reflects that pets take a lot of work to create and look after,
whereas cattle don't.
- Similarly, when something goes wrong with a pet, you need to
invest a lot of effort to fix it, whereas with cattle you just get another one.
- thanks to CERN for this slide, and Bill Baker for the original terminology
What does that mean in practice?
Pets:
- Service downtime when a pet dies
- VM instances often stateful, with mission-critical data
- Needs automated recovery with data protection

Cattle:
- Service resilient to instances dying
- Stateless, or ephemeral (disposable) storage
- Already ideal for cloud … but automated recovery still needed!
If compute node is hosting pets …
… we have to resurrect very carefully in order to avoid any zombie pets!
This case is more complex than resurrecting cattle, due to the risk
of zombie pets.
A zombie is a VM which appeared dead but didn't actually die properly -
it could conflict with its resurrected twin.
Do we really need compute HA in OpenStack?
Why?
- Compute HA needed for cattle as well as pets
- Valid reasons for running pets in OpenStack
- Manageability benefits
- Want to avoid multiple virtual estates
- Too expensive to cloudify legacy workloads
So to sum up, my vote is yes, because even cattle need compute node HA.
Also, rather than painful "big bang" migrations to cloud-aware
workloads, it's easier to deprecate legacy workloads, let them reach
EOL whilst gradually migrating over to next-generation architectures.
This is a controversial topic, but naysayers tend to favour idealism over real-world pragmatism.
Architectural challenges
If this really is needed functionality, why hasn't it already been done?
The answer is that it's actually surprisingly tricky to implement in a
reliable manner.
Configurability
Different cloud operators will want to support different SLAs
with different workflows, e.g.
- Protection for pets:
- per AZ?
- per project?
- per pet?
- If nova-compute fails, VMs are still perfectly healthy but unmanageable
- Should they be automatically killed? Depends on the workload.
There is no one-size-fits-all solution to compute HA.
Full mesh clusters don't scale
Typical clustering software uses fully connected mesh topology, which
doesn't scale to a large number of nodes, e.g. corosync supports a
maximum of 32 nodes.
Addressing Scalability
The obvious workarounds are ugly!
- Multiple compute clusters introduce unwanted artificial boundaries
- Clusters inside / between guest VM instances are not OS-agnostic,
and require cloud users to modify guest images (installing & configuring cluster software)
- Cloud is supposed to make things easier, not harder!
Common architecture
Scalability issue solved by pacemaker_remote
- New(-ish) Pacemaker feature
- Allows core cluster nodes to control "remote" nodes via a pacemaker_remote proxy service (daemon)
- Can scale to very large numbers of compute nodes
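As a sketch, adding a compute node as a remote node with crmsh might look like this (hostname and address are placeholders):

$ crm configure primitive compute-1 ocf:pacemaker:remote \
      params server=192.0.2.101 reconnect_interval=60 \
      op monitor interval=20s

The core cluster then monitors compute-1 through its pacemaker_remote daemon, without compute-1 joining the corosync membership, which is what sidesteps the 32-node limit.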
Reliability challenges
We assume that Pacemaker is reliable, otherwise we're sunk!
Brief interlude: nova evacuate
This is a good time to introduce nova evacuate.
nova's recovery API
- If we have a compute node failure, after fencing the node,
we need to resurrect the VMs in a way which OpenStack is aware of.
- Luckily nova provides an API for doing this, which is called
nova evacuate. So we just call that API and nova takes care
of the rest.
- Without shared storage, it simply rebuilds the instance from scratch
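For example, resurrecting a single VM from a failed host looks roughly like this (server and host names are placeholders):

$ nova evacuate my-pet-vm other-compute-host --on-shared-storage

With --on-shared-storage nova reuses the existing disk; without it, the instance is rebuilt from its image.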
Public Health Warning
nova evacuate does not really mean evacuation!
Think about natural disasters
Public Health Warning
- In Vancouver, nova developers considered a rename
- Hasn't happened yet
- Due to impact, seems unlikely to happen any time soon
Whenever you see “evacuate” in a nova-related context,
pretend you saw “resurrect”
NovaCompute / NovaEvacuate OCF agents
- Custom OCF Resource Agents (RAs)
- Pacemaker plugins to manage resources
- Custom fencing agent (fence_compute) flags host for recovery
- NovaEvacuate RA polls for flags, and initiates recovery
  - Will keep retrying if recovery not possible
- NovaCompute RA starts / stops nova-compute
  - Start waits for recovery to complete
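A sketch of the corresponding Pacemaker glue, assuming the agents from the openstack-resource-agents repository (credentials are placeholders; check each agent's metadata for the exact parameter list):

$ crm configure primitive fence-nova stonith:fence_compute \
      params auth-url=http://controller:5000/v2.0 \
             login=admin passwd=secret tenant-name=admin
$ crm configure primitive nova-evacuate ocf:openstack:NovaEvacuate \
      params auth_url=http://controller:5000/v2.0 \
             username=admin password=secret tenant_name=admin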
NovaCompute / NovaEvacuate OCF agents
Pros
Cons
- Known limitations (not bugs):
- Only handles failure of compute node, not of VMs or nova-compute
- Some corner cases still problematic, e.g. if nova fails during recovery
Masakari architecture
- Similar architectural concept, different code
- Recovery handled by separate controller service
- Persists state to database
- Monitors for 3 types of failure:
- compute node down
- nova-compute service down
- VM down (detected via libvirt)
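To give a flavour of the model: each monitor reports a failure as a notification to the controller service, which persists it and drives recovery. A hypothetical example, loosely following the upstream Masakari API (endpoint, port and payload fields are illustrative assumptions):

$ curl -X POST http://controller:15868/v1/notifications \
      -H "Content-Type: application/json" \
      -H "X-Auth-Token: $TOKEN" \
      -d '{"notification": {"type": "COMPUTE_HOST",
                            "hostname": "compute-1",
                            "generated_time": "2016-04-25T10:00:00Z",
                            "payload": {"event": "STOPPED"}}}'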
Masakari analysis
Pros
- Monitors VM health (externally)
- More sophisticated recovery workflows
Cons
- Looser integration with Pacemaker
- Failing nova-compute service will be disabled
- Basically only uses Pacemaker as monitoring / fencing service
- Waits 5 minutes after fencing
Mistral
- Workflow as a service
- Enables users to create arbitrary workflows
- Can be extended with custom actions
- Workflow execution may be triggered by:
- events from ceilometer
- at a certain time (cloud cron)
- on demand (API call)
The next solution is based on Mistral. Before explaining it, let me tell you what Mistral is.
As you have already read, Mistral is a 'workflow as a service' service. Using it, you can define a set of tasks and connect them into a logical graph. For each task, you can define what to do in case of failure or success. Moreover, if the predefined tasks are not enough for you, you can write your own actions and plug them into Mistral. These actions are literally Python classes, so you can do anything inside them.
Once a workflow is created, it can be triggered in various ways: by Ceilometer events, at a certain time, or, as in the Mistral-based instance-HA solution, on demand via the API.
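As a sketch, uploading a (hypothetical) evacuation workflow definition and triggering it on demand would look like this with the Mistral CLI; the workflow name and input are assumptions:

$ mistral workflow-create evacuation.yaml
$ mistral execution-create evacuate_host '{"host": "compute-1"}'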
Mistral-based resurrection workflow
Pros
- In line with upstream OpenStack strategy
- Clean, simple approach
- Potential for integration with Congress for policy-based workflows
Cons
- Still experimental code; not yet usable by most
- Mistral resilience WIP
Reuses components rather than adding yet another project
We can make different decisions based on failure type by using Congress.
Marking VMs as pets is covered on the next slide.
The problem with Mistral HA: Mistral's own resilience is still work in progress.
Evacuate workflow:
- The whole workflow should start with nova mark-host-down if fencing happened beforehand (see the command sketch below)
- The retry loop is bounded; it does not repeat forever
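The mark-host-down step corresponds to nova's force-down API; with the nova CLI (hostname is a placeholder):

$ nova service-force-down compute-1 nova-compute

This tells nova that the host is definitely down, so evacuation is allowed to proceed immediately instead of waiting for the service timeout.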
Marking VMs as pets
$ nova meta very_important_VM set evacuate=true
$ nova flavor-key very_important_flavor set evacuation:evacuate=true
Two ways of marking VMs.
The prefix in the flavor key is important: without it, if we try to schedule a VM with 'very_important_flavor', nova-scheduler would try to find a host aggregate with the 'evacuate' capability, and as a result the VM would end up in an ERROR state.
Senlin
- https://wiki.openstack.org/wiki/Senlin
- Clustering service for OpenStack
- Orchestration of collections of similar objects
- Policies for placement / load-balancing / health / scaling etc.
- Fencing and resurrection not implemented yet
F/OSS solution functionality comparison

                                           OCF Agents   Masakari   Mistral
Policy
  Support for tagging VM for evacuation    Yes          Yes        Yes
  Customizable actions based on failure    No           No         Planned (via Congress)
Resilience
  Service is self-resilient                Yes          Yes        In progress
  Monitoring of VMs' (external) health     No           Yes        Planned
Recovery
  Uses force-down API                      Yes          No         Planned
  Disable failed nova-compute              No           Yes        Planned
  Fully parallel workflow                  No           No         Yes
- Left column groups capabilities into 3 categories
- Policy-based workflows via Congress
- Two capabilities are unique to Masakari and need to be in future solutions
Common functionality:
- Tolerate simultaneous failures in compute / control planes
- Retry failed evacuations
- Monitor node and hypervisor health
ZeroStack
- Presented in Tokyo
- Proprietary cloud-in-a-box
- SaaS management portal
- VM HA coming in next release
- Adaptive, self-healing approach
- Every node is dynamically (re-)assigned a role
- Much harder to lose quorum, since non-voting nodes can be promoted to voting status
- Needs outgoing TCP port 443 for SaaS portal
- Node could switch from controller to compute based on demand
AWcloud / China Mobile
- Very different solution
- Presented in Tokyo
- Uses Consul / raft / gossip instead of Pacemaker
- Fencing via IPMI / self-fencing
- Has some interesting capabilities
- Source code not available
Which one should I pick?
This advice is intended to be as impartial as possible, based on pure facts!
Future work
- Interlock in Austin with developers and Product Working Group
- "Best of breed" solution
- Implement CI integration testing for failure cases
- Create new specs repository and submit specs
- nova evacuate API progress could work similarly to nova live-migration progress
Pacemaker lunch meetup!
Wed 12:30pm, Expo Hall 5
Look for table with ClusterLabs sign
Legal Notices and Disclaimers
- Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
- No computer system can be absolutely secure.
- Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
- Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
- © 2016 Intel Corporation.
About this presentation
You can now watch the video of this presentation online
Here is the original session abstract and details