
Automated Deployment

of a Highly Available

OpenStack Cloud

What You'll Learn in This Tutorial

Why do we want OpenStack HA?

How does SUSE Cloud do it?

This tutorial is specifically about SUSE Cloud. However, you may also be interested in a more general, vendor-agnostic overview of HA in OpenStack Juno and beyond. For that there is an additional talk in the main conference, OpenStack High Availability: Are We There Yet?, on Wednesday, 1530, in room 252 AB. http://sched.co/1qeP6TX

but before that...

Adam Spiers

Professional Cellist

Lapsed Triathlete

... and Engineer at SUSE

Image Credit: The London Tango Orchestra

Florian Haas

Professional Traveler

Amateur Foodie, Photographer & Parent

... and Founder at hastexo

Professional services company with expertise in HA, cloud, storage, virtualization. They provide training, remote and on-site consulting, and emergency troubleshooting.

Why

do we want

High Availability

in

OpenStack?

Everything's distributed, right?

Everything's shared nothing, right?

Any component can always die and we have another, right?

Well,

not quite.

This is a simplified overview of the OpenStack architecture. Various components like Heat, Ceilometer, Trove are omitted for clarity. Image by Ken Pepple.
Even in this simplified architecture, the majority of services rely on shared infrastructure. In particular:

AMQP Bus

RabbitMQ, Qpid

One such example is the Advanced Message Queuing Protocol (AMQP) bus, which OpenStack uses to pass messages between services. These messages are considered volatile and are expected to have an effective life of 30 seconds or less.

Can't do without it

Few OpenStack services can operate without an AMQP exchange they can talk to. However, the only thing we really need to worry about here in recent OpenStack releases is whether a suitable AMQP service is alive, not so much what data is in its bus. Because:

Not stateful

... the data in AMQP services is not treated as persistent or stateful by OpenStack services. It's a bit like UDP: if a message is not delivered, we resend it. Problem solved. That's strikingly different in another problem domain:
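Because of these resend semantics, surviving a broker failure is mostly a matter of pointing services at more than one broker and letting them reconnect. A hedged sketch of the relevant configuration (option names are release- and driver-dependent; hosts are hypothetical):

```ini
# illustrative nova.conf fragment (Havana/Icehouse-era RabbitMQ options;
# exact option names vary by release)
[DEFAULT]
rabbit_hosts = controller1:5672,controller2:5672
rabbit_retry_interval = 1    # seconds between reconnect attempts
rabbit_max_retries = 0       # 0 = keep retrying forever
```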

RDBMS

MySQL, PostgreSQL

Relational database management systems are where we store non-volatile (persistent) data. This is trickier than the message bus problem:

Can't do without it

Again, most OpenStack services need a backend database.

Stateful

But this time, the information in them is inherently stateful, meaning we need not only to protect against loss of service, but also to ensure data persistence across hardware failures.

What does

Infrastructure High Availability

do for us?

This is where infrastructure high availability comes in. It serves not one, but two purposes.

Ensure

service

availability

We of course need to make sure that our critical services are running and responsive.

Ensure

data

availability

And for stateful services, we additionally need to ensure that they can find their data where they need it. This may or may not include replication.

How do vendors approach High Availability?

When we talk about the approach vendors take to providing high availability for OpenStack, we really need to talk about three different things:

Deployment

What does the vendor support/recommend to deploy OpenStack in the first place? (This is normally also their choice of deployment facility for an HA manager, as everything else would be braindead.) It is important to note that the deployment facility is a key differentiator between vendors' OpenStack products. As a general rule, you should always go with what your vendor supports, rather than roll your own or, even worse, deploy OpenStack without automation.

HA Management

Which high availability manager(s) does the vendor support for ensuring service availability?

State management

Data availability

What does the vendor support to ensure state sharing, or data replication, between the backend stores of stateful services?

Crowbar deployment

Pacemaker/HAProxy

Shared Storage/DRBD

So what exactly is

SUSE Cloud?

Let's take a really quick look at what SUSE Cloud is all about.

SUSE's OpenStack-based cloud product

SUSE Cloud is an OpenStack cloud deployment and management solution, including SUSE packaging of OpenStack components and automated deployment and management facilities.

First release:

SUSE Cloud 1.0

Essex-based (2012)

some things in technical preview, e.g. Ceph

SUSE Cloud 2.0

Grizzly-based (2013)

SUSE Cloud 3

Havana-based (Feb 2014)

HA support added

SUSE Cloud 4

Icehouse-based (Aug 2014)

Ceph support added

Based on

SLES 11 SP3

SUSE Cloud

Node roles

In SUSE Cloud, deployment and management of services is centered on the concept of node roles. The concept is not unique to SUSE Cloud; it is a rather common method of abstracting node functionality.
From SUSE Cloud 4 reference architecture, original image at https://www.suse.com/documentation/suse-cloud4/book_cloud_deploy/graphics/cloud_node_structure.png.

Crowbar

Software Deployment and Automation Framework

Individual application units

Barclamps

What were the

Design Goals

for adding HA to Crowbar?

- Guest HA out of scope
- Infrastructure tolerant to admin node failure

Build from scratch

and

upgrading an existing cloud

support existing customers

Flexible allocation of

roles

across potentially multiple clusters

- Not too opinionated about sizing
- Allow growing the cluster later

Automated

configuration

reduce complexity and learning curve

Pacemaker barclamp

Provides

HA library code

for other barclamps

Installs the

Pacemaker

High-availability manager

and web / CLI / desktop UIs. Switch to Crowbar browser tab

STONITH

Configuration mode for STONITH:

STONITH - Configured with STONITH Block Devices (SBD)

SBD

Pre-configured on /dev/sdc
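On SLES, SBD is driven by a sysconfig file. A minimal sketch matching the pre-configured device above (illustrative; exact variable names depend on the sbd package version):

```shell
# /etc/sysconfig/sbd (illustrative)
SBD_DEVICE="/dev/sdc"
SBD_OPTS="-W"    # -W: use the system watchdog
```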

DRBD

Prepare cluster for DRBD: true

Pacemaker GUI:

Setup non-web GUI (hb_gui): true

Initial Pacemaker deployment

What's special about how

SUSE Cloud

uses Pacemaker with Crowbar?

Chef LWRPs for Pacemaker

# LWRP provided by the Pacemaker barclamp: clone an existing
# resource so the service runs on every cluster node
pacemaker_clone "cl-#{service_name}" do
  rsc service_name          # the primitive resource to clone
  action [:create, :start]
end
minimise disruption to existing cookbooks

Chef::Provider::Pacemaker::Service

- Basic idea of usurping management of SysVinit services
- Maintenance mode to deal with restarts triggered by config file changes
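The maintenance-mode approach for config-triggered restarts can be sketched with crmsh (illustrative commands, assuming the crm shell that ships with the SLE HA extension):

```shell
# tell Pacemaker to stop reacting while Chef restarts services
crm configure property maintenance-mode=true
# ... chef-client applies config changes and restarts services ...
crm configure property maintenance-mode=false
```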

DRBD

haproxy

Load Balancer

- Automatic VIP allocation
- Endpoint calculation
- Front-end and back-ends both HA
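A minimal sketch of what a generated haproxy stanza for one API endpoint might look like, with the front-end bound to the Pacemaker-managed VIP (service name, ports, and IPs here are all hypothetical):

```shell
# illustrative haproxy fragment, not the barclamp's actual output
frontend keystone-public
    bind 192.168.124.81:5000          # Pacemaker-managed VIP
    default_backend keystone-public
backend keystone-public
    server controller1 192.168.124.82:5000 check
    server controller2 192.168.124.83:5000 check
```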

Automatic

cluster

configuration

- Quorum
- Fencing shoot-out protection
- SBD auto-configuration
- Installs UIs
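In crmsh terms, the automatic cluster configuration amounts to something like the following (an illustrative sketch; property values and the SBD parameter name vary by version):

```shell
# enable fencing and define quorum behaviour
crm configure property stonith-enabled=true
crm configure property no-quorum-policy=stop
# SBD fencing resource backed by the shared block device
crm configure primitive stonith-sbd stonith:external/sbd \
    params sbd_device="/dev/sdc"
```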

Orchestration

and

Synchronization

flexible role allocation

UI extensions

notifications

Database

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

PostgreSQL

in high-availability mode

High Availability

Storage mode: DRBD

Size to Allocate for DRBD Device: 1

PostgreSQL/DRBD deployment

RabbitMQ

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

RabbitMQ

in high-availability mode

High Availability

Storage mode: DRBD

Size to Allocate for DRBD Device: 1

RabbitMQ/DRBD deployment

Keystone

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

Keystone

under Pacemaker management

Keystone deployment

Glance

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

Glance

under Pacemaker management

Glance deployment

Cinder

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

Cinder

under Pacemaker management

Stores volumes on

Compute nodes

(for purposes of this tutorial)

Type of volume: Local file

Also supports

SAN storage

and

Ceph

Cinder deployment

Neutron

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

Neutron

under Pacemaker management

neutron-l3-agent

OCF RA

neutron-ha-tool.py

- monitor action checks for dead l3-agents
- start action replicates DHCP agents and migrates routers onto healthy agents

OpenStack Juno adds experimental support for DVR and HA L3 agents. It is expected that neutron-ha-tool.py will no longer be necessary once this feature has stabilized (targeted for Kilo).
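Roughly, the tool automates what one could do by hand with the era's neutron CLI (a sketch of the idea, not the tool's actual implementation; agent and router IDs are placeholders):

```shell
# find L3 agents that have gone dead
neutron agent-list
# move each router off a dead agent onto a healthy one
neutron l3-agent-router-remove <dead-agent-id> <router-id>
neutron l3-agent-router-add <live-agent-id> <router-id>
```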

Networking Plugin

ML2 with OVS/GRE

Neutron deployment

Nova

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

Nova

under Pacemaker management

Nova deployment

Horizon

barclamp

Switch back and forth to and from Crowbar browser tab

Installs

Horizon

under Pacemaker management

Horizon deployment

Testing high availability

Retrieve

Horizon URL

from Crowbar

OpenStack Dashboard (admin)

admin/crowbar

Select

openstack

Project

(a.k.a. tenant)

Use as you normally would

Doing

bad things

to services

pkill openstack-keystone

pkill openstack-nova-api

Watch

services

recover automatically

crm_mon

Service recovery

Doing

bad things

to nodes

poweroff -f

echo o > /proc/sysrq-trigger

Watch

services

fail over automatically

crm_mon

Node recovery

What you learned today

Motivation behind OpenStack HA

- Infrastructure relies on shared services
- Some of these services also rely on shared state or data

Vendors' approaches to OpenStack HA

- Canonical: Juju/Pacemaker/Galera/Ceph
- Mirantis: Fuel/Pacemaker/Galera
- Red Hat: Puppet/Pacemaker/Galera

More on this in Florian's talk on Wednesday: OpenStack High Availability: Are We There Yet?, 1530, room 252 AB. http://sched.co/1qeP6TX

SUSE Cloud HA

- Fully automated deployment with Crowbar
- Pacemaker for high availability
- Shared storage, DRBD
- Ability to HA-ify existing deployments