How we fought for OpenStack High Availability




Created by Vladimir Kuklin (Principal Deployment Engineer, Fuel Library Tech Lead) and Sergii Golovatiuk (Senior Deployment Engineer)

Speaker - Sergii Golovatiuk: Let me introduce Vladimir Kuklin, Principal Deployment Engineer at Mirantis and Fuel Technical Lead. Speaker - Vladimir Kuklin: Let me introduce Sergii Golovatiuk, Senior Deployment Engineer at Mirantis. We are going to present how we fought for OpenStack High Availability. This battle is still going on, but we already have some really good results, so we would like to share how we achieve high availability while working on the Mirantis OpenStack and Fuel projects.

Definitions and Scope of this talk

+ failure of a single component at a single point in time

- force majeure events

- uncommon physical destruction (not deterioration) of components

Speaker - Vladimir Kuklin: First of all, before we start talking about OpenStack and its underlying components individually, let's define High Availability and what kind of High Availability we are going to cover in this talk. High Availability is a characteristic of a system that retains a certain level of availability even when the system is exposed to failures. The scope of this talk is retaining availability of an OpenStack cluster in case of failure of one of its underlying components. We do not consider: 1. Simultaneous failure of several components 2. Force majeure and natural disasters 3. Physical destruction of the hardware

OpenStack Architecture

Speaker - Golovatiuk Sergii: Here is a classical diagram of OpenStack and its loosely coupled components. What are the main problems with these components? You have to organize High Availability for all of them, so in practice you need at least 2 copies of each component. However, I would suggest having at least 3 copies to eliminate split-brain scenarios. All components should be ready for High Availability: some of them can be put under the control of an external cluster manager such as Pacemaker or ZooKeeper, while others have their own healing mechanisms.

OpenStack HA Stack

Fault Tolerance of every component
  • Network Connectivity
  • Database - MySQL
  • AMQP - RabbitMQ
  • Memcached
  • Storage - Ceph
  • API Services
  • Neutron/Heat/Ceilometer
Speaker - Vladimir Kuklin: We will not cover High Availability for datacenters; our primary focus is OpenStack controller and compute nodes. Network connectivity: Out of the box we provide several options, such as an active-passive connection or a more advanced LACP (802.3ad) bonded connection. However, we still keep a separate connection for PXE network booting. Database: We started working on DB HA with Galera support and achieved very good results by optimizing management of the Galera cluster with Pacemaker. AMQP: AMQP is a crucial part of the OpenStack architecture; failure of a particular controller should not affect service availability. We had to modify both the RabbitMQ deployment mechanism and the OpenStack messaging code to make it work. Memcached: The main goal with memcached is to retain the ability to serve requests when one of the memcached servers is dead. Storage: For storage availability, one can choose enterprise solutions such as NetApp or EMC, or fault-tolerant software-defined storage; we concentrated our efforts on Ceph. API services: These should be load-balanced, with requests redirected to a node that can actually serve them. Neutron/Heat/Ceilometer: For Neutron agents we also need a High Availability solution that migrates assigned entities and retains network connectivity, and we need to ensure that Heat and Ceilometer agents can be migrated to live controllers.
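As an illustration of the bonded network connectivity mentioned above, here is a minimal LACP (802.3ad) bonding sketch in Debian/Ubuntu /etc/network/interfaces style. Interface names and addresses are placeholders, and the configuration Fuel actually generates differs.

    # /etc/network/interfaces (sketch): aggregate two NICs with LACP
    auto bond0
    iface bond0 inet static
        address 192.168.0.2
        netmask 255.255.255.0
        bond-slaves eth0 eth1      # physical links in the aggregate
        bond-mode 802.3ad          # LACP
        bond-miimon 100            # link monitoring interval, ms
        bond-lacp-rate 1           # fast LACPDUs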

Galera/MySQL

Speaker - Golovatiuk Sergii: Why Galera? Sometimes I hear that Galera is too complex and that DRBD is an easy solution that is enough for database reliability. Firstly, DRBD block replication is not a native mechanism for MySQL. Secondly, you cannot scale out with DRBD. What about master-slave? Master-slave replication is a well-known, proven technology; however, at the moment there is no big difference between Galera and master-slave replication in terms of OpenStack reliability. Here is a classical diagram of the Galera implementation in Fuel. All services communicate with MySQL via HAProxy. High Availability of HAProxy is based on a Virtual IP controlled by Pacemaker. As you can see, HAProxy sends all read/write operations to a single MySQL server. The reason was described by Peter Boros from Percona and Jay Pipes from Mirantis: many OpenStack services use SELECT ... FOR UPDATE SQL queries or have no logic to retry failed SQL transactions.
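For illustration, a sketch of what such a single-writer HAProxy frontend for Galera could look like; the VIP, node addresses, and check port are placeholders, not the exact configuration shipped in Fuel.

    listen mysqld
        bind 192.168.0.10:3306            # VIP managed by Pacemaker
        mode tcp
        balance leastconn
        option httpchk                    # HTTP check answered by a clustercheck service
        # only node-1 takes traffic; the others are hot standbys
        server node-1 192.168.0.11:3306 check port 49000 inter 5s rise 2 fall 3
        server node-2 192.168.0.12:3306 check port 49000 inter 5s rise 2 fall 3 backup
        server node-3 192.168.0.13:3306 check port 49000 inter 5s rise 2 fall 3 backup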

MySQL/Galera improvements

  • MySQL 5.6.11
  • Latest XtraBackup
  • HAProxy + xinetd httpchk
Speaker - Sergii Golovatiuk: The MySQL server was upgraded to 5.6.11 with the latest Galera plugin, which resolved many stability issues. Mysqldump was replaced with xtrabackup from Percona. Xtrabackup does not lock the database during State Snapshot Transfer, and it performs well enough to synchronize really large databases. HAProxy was extended to perform simple checks against the database so that no DB operations are sent to Donor or Desynced servers.
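A common way to wire that check, sketched below, is an xinetd service that runs a clustercheck script returning HTTP 200 only when the local node is Synced; the script path and port are assumptions and may differ from what Fuel ships.

    # /etc/xinetd.d/mysqlchk (sketch)
    service mysqlchk
    {
        disable        = no
        type           = UNLISTED
        socket_type    = stream
        port           = 49000
        wait           = no
        user           = nobody
        server         = /usr/bin/clustercheck   # returns 200 only when wsrep state is Synced
        only_from      = 0.0.0.0/0
        per_source     = UNLIMITED
    }

HAProxy's "option httpchk" then points at this port, so Donor or Desynced nodes are taken out of rotation automatically.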

MySQL/Galera - OCF script

  • Use latest GTID info for master election:
    • From CIB
    • From grastate.dat
  • Start PC with empty gcomm://
  • Clone-based
Speaker - Sergii Golovatiuk: Our previous implementation of the OCF script was fragile and did not reassemble the cluster under many conditions. The new version was rewritten from scratch and allows us to bring the Galera cluster back online without any interruptions. The general idea is to select the Primary Component with the most recent data. The OCF script gets the most recent GTID and keeps the value in the Pacemaker Cluster Information Base (CIB). In case of problems, the OCF script reads the data from the grastate.dat file, which allows it to bootstrap the cluster. Pacemaker uses this data to find the most up-to-date server for the Primary Component. In the scenario where all controllers are down, Pacemaker waits for neighbours for 5 minutes; if the neighbours are stuck on fsck or at a GRUB prompt, Pacemaker starts with all available nodes. In the monitor function, the OCF script detects the cases when a node went out of sync, and it correctly handles the Donor/Desync state, allowing nodes to perform State Snapshot Transfer (SST). The OCF script is currently clone-based, so we do not need to create a primitive for every node.
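As a rough illustration of the GTID-based election (not the actual Fuel OCF code), the agent could record the local commit position in the CIB with standard Pacemaker tools and fall back to grastate.dat when mysqld is down:

    # Illustrative shell sketch: publish this node's last committed position
    get_local_gtid() {
        local gtid
        gtid=$(mysql -N -e "SHOW STATUS LIKE 'wsrep_last_committed'" 2>/dev/null | awk '{print $2}')
        if [ -z "$gtid" ]; then
            # mysqld is down: read the persisted position from grastate.dat
            gtid=$(awk -F: '/^ *seqno:/ {gsub(/ /, ""); print $2}' /var/lib/mysql/grastate.dat)
        fi
        echo "${gtid:--1}"
    }

    # Keep the value in the Cluster Information Base as a node attribute
    crm_attribute --node "$(crm_node -n)" --lifetime reboot --name gtid --update "$(get_local_gtid)"

    # The node with the highest value bootstraps the Primary Component with an empty
    # cluster address (gcomm://); the others join it and sync via IST/SST.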

RabbitMQ

  • Hard to reassemble RabbitMQ cluster
  • Each Rabbit node tries to connect to previous queue Master
Speaker - Vladimir Kuklin: While working on RabbitMQ resilience, we noticed that RabbitMQ server behaviour is not always obvious and clear. For example, if you hard-reset a 3-node RabbitMQ cluster, it usually becomes very vulnerable to race conditions and there is not much you can do about it. In order to automate RabbitMQ reassembly, we created an OCF script that leverages Pacemaker master/slave resources and the notification mechanism for cloned resources. The diagram shows what you need to do to assemble a RabbitMQ cluster under control of Pacemaker: fire up beam processes on the first node; let Pacemaker elect the master (this will be the first node); create a 'master' attribute in the CIB; start the 'master' application on the first node and attach slave nodes to the master; have the periodic status command check that each running RabbitMQ server is connected to the master node; and if the rabbit app cannot start or join for any reason, reset it.
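The join sequence a slave node effectively performs once the master attribute is set boils down to a few rabbitmqctl calls. The sketch below is illustrative; the real OCF script wraps these in many checks and retries, and the node name is a placeholder.

    MASTER_NODE="rabbit@node-1"      # taken from the 'master' attribute in the CIB

    rabbitmqctl stop_app             # stop the rabbit application, keep the beam process running
    rabbitmqctl join_cluster "$MASTER_NODE"
    rabbitmqctl start_app

    # periodic monitor: make sure this node still sees the master in the cluster
    # (OCF_ERR_GENERIC is provided by the OCF shell functions)
    rabbitmqctl cluster_status | grep -q "$MASTER_NODE" || exit "$OCF_ERR_GENERIC"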

AMQP - oslo.messaging

Speaker - Vladimir Kuklin: We started with HAProxy proxying to only one RabbitMQ node. That solution was not the best; we wanted results that allow our controllers to scale out easily. When oslo.messaging was merged, we started using its internal healing mechanism, which makes connections to all RabbitMQ instances specified in the config files and shuffles them to minimize the effect of AMQP failover. But we had to rewrite part of the oslo.messaging code to support AMQP heartbeats and handle connection failure scenarios; the first community implementation was broken, and even the kernel killing connections could not make oslo.messaging fail over. The good news is that we fixed it and pushed the fix to OpenStack Gerrit.
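The relevant oslo.messaging settings from that era look roughly like the sketch below (shown for nova.conf; addresses are placeholders and option names changed in later releases):

    [DEFAULT]
    rabbit_hosts = 192.168.0.11:5672,192.168.0.12:5672,192.168.0.13:5672
    rabbit_ha_queues = True        # mirrored queues on the RabbitMQ side
    rabbit_retry_interval = 1
    rabbit_retry_backoff = 2
    rabbit_max_retries = 0         # keep retrying forever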

API Services and services engines

  • All API requests go through active/active HAProxy
  • Service engines are managed by Pacemaker and OCF scripts
  • HAProxy with the VIP lives in a separate namespace, using veth pairs and proxy ARP
Speaker - Sergii Golovatiuk: There is not much interesting to say about how we balance API requests: we use standard HAProxy with active-active backends, with some parameters tuned to digest production workloads. Service engines, such as the Heat and Ceilometer engines, are managed by dedicated OCF scripts. HAProxy itself runs in a separate network namespace to avoid hanging-connection problems when service endpoint IPs migrate between controllers. To achieve this, we used the common veth + proxy ARP approach, along with NAT rules to retain connectivity with all networking services.
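Conceptually, the namespace setup looks like the shell sketch below: a veth pair links the namespace to the host, the VIP lives inside the namespace, and proxy ARP on the host end answers for it. Names and addresses are illustrative; in Fuel the OCF resources do the equivalent.

    ip netns add haproxy
    ip link add hapr-host type veth peer name hapr-ns
    ip link set hapr-ns netns haproxy

    ip netns exec haproxy ip link set lo up
    ip netns exec haproxy ip link set hapr-ns up
    ip netns exec haproxy ip addr add 192.168.0.10/24 dev hapr-ns   # the VIP

    ip link set hapr-host up
    sysctl -w net.ipv4.conf.hapr-host.proxy_arp=1   # answer ARP for the VIP on the host side
    sysctl -w net.ipv4.ip_forward=1

    ip netns exec haproxy haproxy -f /etc/haproxy/haproxy.cfg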

Keystone

  • tokens in memcached
  • dogpile driver with python-memcached is broken
  • pylibmc is not eventlet-safe
  • Y. Taraday wrote a new driver with a connection pool
  • there are still bugs in python-memcached
Speaker - Sergii Golovatiuk: We strongly believe that temporary data should be kept in key-value stores. What happens if we lose this data? Correct: on the next operation the client or service gets a new token using the standard authentication method. In order to maintain Keystone resilience we added memcached support for tokens, but then we found out that the failure of a controller with a memcached instance may add up to 6 seconds of lag to operations, which makes the cluster unusable. We started looking for another solution and tried pylibmc, which is a nice implementation, but it is not eventlet-safe, so we could not use it. Our developer Yuri Taraday wrote a driver that supports a pool of connections to memcached. Nevertheless, there are still some problems with python-memcached, as it has broken logic for key sharding, which we are working to merge a fix for.
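The corresponding keystone.conf settings from that era look roughly like this sketch; driver paths and option names are release-specific, so treat them as assumptions.

    [token]
    driver = keystone.token.persistence.backends.memcache_pool.Token

    [memcache]
    servers = 192.168.0.11:11211,192.168.0.12:11211,192.168.0.13:11211
    # connection-pool options introduced with the pooled driver
    pool_maxsize = 100
    dead_retry = 30
    socket_timeout = 3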

Neutron

  • API under HAproxy
  • managing Neutron agents
    • clean everything on stop/start actions:
      • Interfaces
      • Child Processes (dnsmasq, metadata proxy)
    • entities rescheduling after migration
    • WIP: API handler and rescheduling using internal Neutron mechanism
Speaker - Vladimir Kuklin: Managing Neutron services is pretty easy when we are talking about the API. However, the devil is in the details when you want to manage the agents safely. First of all, when you start and stop agents, you need to ensure that they do not leave any artifacts that can affect connectivity, such as orphaned interfaces with IP addresses, or child processes. To achieve this, we perform special cleanup actions that destroy previously created interfaces and kill child processes on the nodes. The next thing you need to do is reschedule entities, such as routers for the L3 agent and networks for the DHCP agent, after an agent is migrated to another node; this is necessary, for example, in the failover case. The current version uses Neutron API calls for the L3 and DHCP agents. Our Neutron community team is working on moving this functionality into Neutron core; there is already some code for automatic rescheduling, but it showed issues in the AMQP failover case, and our developers are working to resolve them.
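The rescheduling the OCF script performs via the Neutron API is roughly equivalent to the CLI sketch below (agent IDs are placeholders):

    DEAD_AGENT=<dead-l3-agent-id>
    LIVE_AGENT=<live-l3-agent-id>

    # move every router hosted by the dead agent to a live one
    for router in $(neutron router-list-on-l3-agent "$DEAD_AGENT" | awk '/^\| [0-9a-f]/{print $2}'); do
        neutron l3-agent-router-remove "$DEAD_AGENT" "$router"
        neutron l3-agent-router-add    "$LIVE_AGENT" "$router"
    done

    # cleanup on stop/start: drop leftover namespaces, interfaces and child processes
    neutron-netns-cleanup --force --config-file /etc/neutron/neutron.conf \
                          --config-file /etc/neutron/l3_agent.ini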

Ceph

  • ephemeral storage <=> live migration
  • object and image storage shared
  • share host param for volume service
Speaker - Sergii Golovatiuk: We use a pretty common Ceph architecture, placing monitors on the controller nodes and having a separate role for Ceph OSD nodes. The user can put the Ceph journal on specific block devices for each OSD. Ceph is shared, software-defined storage and can be used as a replacement for proprietary solutions if you want live migration and highly available object/image/volume storage. We had to write a fair amount of code to support live migration with Ceph, as it was not in perfect shape previously. For Cinder HA we had to specify an identical host parameter in the Cinder config on all volume nodes in order to make volumes really shared.
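A sketch of the relevant cinder.conf fragment, identical on every node running cinder-volume; the host value and RBD settings are illustrative rather than the exact ones Fuel writes.

    [DEFAULT]
    host = rbd:volumes                 # same value on all volume nodes, so volumes are not tied to one host
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_user = volumes
    rbd_secret_uuid = <libvirt secret uuid>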

Main deployment components

  • corosync/pacemaker manifests
  • haproxy with conf.d patch + manifests
  • NS contained resources
Speaker - Vladimir Kuklin: In order to get all of the aforementioned components set up and deployed, we added or modified several deployment pieces and modules along with other software. These were: 1. The puppet-corosync module, originally created by Puppet Labs and modified by us. 2. A HAProxy build that supports configuration directory includes (conf.d), so we can inject changes to HAProxy in a granular manner. 3. Modified VIP and HAProxy OCF resources that contain HAProxy and the IP addresses inside dedicated namespaces.

Corosync/pacemaker manifests: part 1

  • not all resource types were supported upstream:
    • constraints (e.g. location) not supported
    • master/slave resources not supported
  • puppet service type provider for pacemaker:
    • parses LRM of alive nodes
    • waits for status change with respect to timeouts
    • handles timeouts depending on defaults or user-specified values
  • [5.1 release] shadow approach broken => moved to XML patches instead
Speaker - Vladimir Kuklin: First of all, we needed to polish some of the corosync module code that existed at the time we forked it: we needed additional support for other Pacemaker resources and entities, such as location constraints and master/slave resources. Then, in order to deploy Pacemaker resources while keeping almost the same Puppet code, we implemented a Pacemaker service provider for Puppet. It parses the output of the Pacemaker Local Resource Managers in the CIB, respecting timeout values and monitor commands. Also, in order to support complex OCF scripts such as the ones for Galera and RabbitMQ, we had to abandon the default upstream 'shadow'-like approach, as it sometimes overwrote cluster attributes during deployment changes, and this desynchronization led to deployment failures. So we leveraged Pacemaker support for XML-diff CIB modification and rewrote all the providers to use it.
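With standard Pacemaker tooling, the XML-patch approach boils down to something like the sketch below; the Puppet providers do the equivalent programmatically.

    cibadmin --query > /tmp/cib-orig.xml          # snapshot of the live CIB
    cp /tmp/cib-orig.xml /tmp/cib-new.xml
    # ... edit /tmp/cib-new.xml: add or change a primitive, constraint, etc. ...

    crm_diff --original /tmp/cib-orig.xml --new /tmp/cib-new.xml > /tmp/cib.patch
    cibadmin --patch --xml-file /tmp/cib.patch    # only the diff is applied, so concurrent
                                                  # attribute updates by other nodes survive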

Corosync/pacemaker manifests: part 2 - stability

  • [5.1 release] asymmetric cluster:
    • services stopped everywhere by default
      • enabled by 0 location constraint ("unbanning")
      • clones started locally
      • primitives started only once
  • tuning of sql drivers
  • short kernel tcp keepalives
Speaker - Vladimir Kuklin: Our initial implementation of the service provider and deployment workflow was not perfect, as it triggered restarts not only for the services on a particular node but globally. So we switched to an asymmetric Pacemaker cluster, which does not start services anywhere by default, and refactored the service provider to perform actions locally, using Pacemaker location constraints for start and stop actions. To make service actions node-local, we altered the behaviour of the status method depending on the type of resource: if the resource is a primitive we check its status globally, and for cloned resources we check the status locally. Also, to make deployment stable enough, we added short kernel TCP keepalives, which kill hanging connections in less than a minute, along with timeout tuning for the SQL drivers.
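In crm shell terms the idea looks like the sketch below; the resource and node names, and the keepalive values, are illustrative rather than the exact values Fuel sets.

    # asymmetric cluster: resources start nowhere unless explicitly allowed
    crm configure property symmetric-cluster=false

    # "unban" the haproxy clone on this node with a zero-score location constraint
    crm configure location loc_p_haproxy_node-1 clone_p_haproxy 0: node-1

    # short kernel TCP keepalives so hanging connections die in well under a minute
    sysctl -w net.ipv4.tcp_keepalive_time=30
    sysctl -w net.ipv4.tcp_keepalive_intvl=5
    sysctl -w net.ipv4.tcp_keepalive_probes=3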

Corosync/pacemaker manifests: part 3 - HA scalability

  • multicast problems => switched to unicast by default
  • need to alter and restart corosync => pacemaker maintenance mode
  • galera transitional sync => limit on parallel controllers
Speaker - Vladimir Kuklin: As soon as we had won the fight for deployment stability, we moved on to scalability, and here is what we faced. The initial feedback from our services team was that most of our customers do not have multicast enabled or correctly configured, which made us switch the deployment to unicast by default. We had to alter the corosync configuration and restart corosync each time we wanted to add a new controller; to make this work, we modified the corosync init scripts to check for Pacemaker and put it into maintenance mode. We also needed to limit the number of controllers deployed in parallel so as not to exhaust the donor nodes and affect the working environment.
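With the corosync 1.x syntax used at the time, switching to unicast means listing every controller explicitly, which is also why adding a controller requires a config change and a corosync restart. Addresses below are placeholders.

    # corosync.conf (sketch): unicast transport instead of multicast
    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.0
            member {
                memberaddr: 192.168.0.11
            }
            member {
                memberaddr: 192.168.0.12
            }
            member {
                memberaddr: 192.168.0.13
            }
        }
    }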

Testing HA and results

Testing

  • fuel-devops - a tiny libvirt-based orchestrator
  • destructive tests:
    • running virtual environments
    • running bare-metal environments
    • performing destructive actions
  • testing that cluster failed over successfully:
    • FUEL OSTF
    • Tempest
    • Rally
Speaker - Sergii Golovatiuk: Testing HA does not come cheap. We do it primarily with our own test suite: we spawn environments using our tiny orchestrator, which uses python-libvirt and other libraries, then perform the actual destructive actions and check whether the cluster can fail over successfully, using Fuel OSTF as well as other test suites.

Results

  • controller reset
  • all controllers reset
  • network partitioning
  • AMQP cluster node failure
  • DB node failure
  • individual OpenStack service failure
Speaker - Vladimir Kuklin: Speaking of results, what we have achieved with HA installations is that we can easily handle single-failure-at-a-time tolerance for the whole OpenStack cluster, whether the failure is a networking problem or a failure of the DB, AMQP, or a particular OpenStack service. And this has been confirmed not only by our tests but also by our most significant partners and customers, including carrier-grade ones, who run extensive and rigorous testing and say that they still cannot break our HA setup.

Current challenges and plans for future

  • multiple writes for Galera
  • oslo.db resilience
  • VRRP and DVR implementations and testing
  • fencing support
  • event-driven failover and evacuations
  • memcached-related fixes
  • ZeroMQ research
  • gate for HA tests
Speaker - Sergii Golovatiuk: Multiple writes: With a Galera cluster, HAProxy still sends all read/write operations to one server. OpenStack components issue SELECT ... FOR UPDATE queries; when two servers perform write operations simultaneously, one of the servers has to revert its own transaction, apply its neighbour's transaction, and then repeat its own DB transaction, and OpenStack services should be aware of such behaviour. We are working with the oslo.db developers to resolve these issues. We also plan to make the OCF script master/slave-based, allowing OpenStack cloud operators to see the status from pcs or crm_mon.

Speaker - Vladimir Kuklin: We want to handle rescheduling triggered by Pacemaker and OCF, along with utilizing the VRRP and DVR mechanisms for L3 agents. Node fencing: We are working on configuring fencing for Pacemaker cluster nodes, along with RabbitMQ cluster member fencing. Event-driven failover: We also want a centralized view of the whole cluster so we can start failover before the actual failure happens; a common example is triggers in conventional monitoring systems. Memcached-related fixes: We need to fix the python-memcached library along with the memcached driver implementation for Horizon. ZeroMQ research: Although some big cloud installations show that ZeroMQ may be a useful replacement for messaging, the current oslo.messaging driver for ZeroMQ is in a poor state, so we are going to research whether it may be applicable for HA-enabled, production-ready installations. HA tests: As we already have a well-working testing framework for highly available deployments, and we are really close to providing the ability for the community Fuel ISO to deploy vanilla OpenStack from particular commits, we are going to add HA test gating to the OpenStack Jenkins to indicate whether a particular commit affects HA.

Links

Speaker - Vladimir Kuklin: You can always check out the wiki page of the Fuel project. The actual deployment is done by the code in the fuel-library sub-project. We also periodically write down all our HA fixes in an etherpad. And you can always contact us in the #fuel-dev channel on Freenode or via the OpenStack mailing list with the [Fuel] prefix in the subject.

Credits

Aleksandr Didenko & Bogdan Dobrelia

Roman Podoliaka & Yuriy Taraday

Sergey Melikyan & Stanislav Lagun

Sergey Vasilenko & Matthew Mosesohn

Dmitry Ilyin & Dmitry Borodaenko

Ryan Moe & Andrew Woodward

Anastasia Urlapova & Tatyana Leontovich

Yegor Kotko & Artem Panchenko

Andrey Sledzinskiy

Questions?

Created by Vladimir Kuklin and Sergii Golovatiuk