How we fought for OpenStack High Availability




Created by Vladimir Kuklin (Principal Deployment Engineer, Fuel Library Tech Lead) and Sergii Golovatiuk (Senior Deployment Engineer)

Speaker - Sergii Golovatiuk: Let me introduce Vladimir Kuklin, Principal Deployment Engineer at Mirantis and Fuel Technical Lead. Speaker - Vladimir Kuklin: Let me introduce Sergii Golovatiuk, Senior Deployment Engineer at Mirantis. We are going to present how we fought for OpenStack High Availability. This battle is still going on, but we already have some really good results, so we would like to share how we achieve high availability while working on the Mirantis OpenStack and Fuel projects.

Definitions and Scope of this talk

+ failure of a single component at a single point in time

- force majeure events

- uncommon physical destruction (not deterioration) of components

Speaker - Vladimir Kuklin: First of all, before we start talking about OpenStack and its underlying components individually, let's define High Availability and what kind of High Availability we are going to cover in this talk. High Availability is a characteristic of a system that retains a certain level of availability even when the system is exposed to failures. The scope of this talk is retaining availability of an OpenStack cluster in case of failure of one of its underlying components. We do not consider: 1. Simultaneous failure of several components 2. Force majeure and natural disasters 3. Physical destruction of the hardware

OpenStack Architecture

Speaker - Golovatiuk Sergii: Here is a classical diagram of OpenStack and its loosely coupled components. What are the main problems with these components? You have to organize High Availability for all of them, so in practice you need at least 2 copies of each component. However, I would suggest having at least 3 copies to eliminate split-brain scenarios. All components should be ready for High Availability: some of them can be put under the control of an external cluster manager such as Pacemaker or ZooKeeper, while others have their own healing mechanisms.

OpenStack HA Stack

Fault Tolerance of every component
  • Network Connectivity
  • Database - MySQL
  • AMQP - RabbitMQ
  • Memcached
  • Storage - Ceph
  • API Services
  • Neutron/Heat/Ceilometer
Speaker - Vladimir Kuklin: We will not cover High Availability for datacenters; our primary focus is OpenStack controller and compute nodes. Network connectivity: Out of the box we provide several options, such as an active-passive connection or a more advanced LACP (802.3ad) bonded connection. However, we still keep a separate connection for PXE network booting. Database: We started working on DB HA with Galera support and achieved very good results by optimizing management of the Galera cluster with Pacemaker. AMQP: AMQP is a crucial part of the OpenStack architecture; failure of a particular controller should not affect service availability. We had to modify both the RabbitMQ deployment mechanism and the OpenStack messaging code to make it work. Memcached: The main goal with memcached is to retain the ability to serve requests when one of the memcached servers is dead. Storage: For storage availability, one can choose enterprise solutions such as NetApp or EMC, or fault-tolerant software-defined storage; we concentrated our efforts on Ceph. API services: These should be load-balanced, with requests redirected to a node that can actually serve them. Neutron/Heat/Ceilometer: For Neutron agents we also need a High Availability solution that migrates assigned entities and retains network connectivity, and we need to ensure that Heat and Ceilometer agents can be migrated to live controllers.
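As an illustration of the bonded network connectivity mentioned above, here is a minimal LACP (802.3ad) bonding sketch in Debian/Ubuntu /etc/network/interfaces style. Interface names and addresses are placeholders, and the configuration Fuel actually generates differs.

    # /etc/network/interfaces (sketch): aggregate two NICs with LACP
    auto bond0
    iface bond0 inet static
        address 192.168.0.2
        netmask 255.255.255.0
        bond-slaves eth0 eth1      # physical links in the aggregate
        bond-mode 802.3ad          # LACP
        bond-miimon 100            # link monitoring interval, ms
        bond-lacp-rate 1           # fast LACPDUs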

Galera/MySQL

Speaker - Golovatiuk Sergii: Why Galera? Sometimes I hear that Galera is too complex and that DRBD is an easy solution that is enough for database reliability. Firstly, DRBD block replication is not a native mechanism for MySQL. Secondly, you cannot scale out with DRBD. What about master-slave? Master-slave replication is a well-known, proven technology; however, at the moment there is no big difference between Galera and master-slave replication in terms of OpenStack reliability. Here is a classical diagram of the Galera implementation in Fuel. All services communicate with MySQL via HAProxy. High Availability of HAProxy is based on a Virtual IP controlled by Pacemaker. As you can see, HAProxy sends all read/write operations to a single MySQL server. The reason was described by Peter Boros from Percona and Jay Pipes from Mirantis: many OpenStack services use SELECT ... FOR UPDATE SQL queries or have no logic to retry failed SQL transactions.
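For illustration, a sketch of what such a single-writer HAProxy frontend for Galera could look like; the VIP, node addresses, and check port are placeholders, not the exact configuration shipped in Fuel.

    listen mysqld
        bind 192.168.0.10:3306            # VIP managed by Pacemaker
        mode tcp
        balance leastconn
        option httpchk                    # HTTP check answered by a clustercheck service
        # only node-1 takes traffic; the others are hot standbys
        server node-1 192.168.0.11:3306 check port 49000 inter 5s rise 2 fall 3
        server node-2 192.168.0.12:3306 check port 49000 inter 5s rise 2 fall 3 backup
        server node-3 192.168.0.13:3306 check port 49000 inter 5s rise 2 fall 3 backup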

MySQL/Galera improvements

  • MySQL 5.6.11
  • Latest XtraBackup
  • HAProxy + xinetd httpchk
Speaker - Sergii Golovatiuk: The MySQL server was upgraded to 5.6.11 with the latest Galera plugin, which resolved many stability issues. Mysqldump was replaced with xtrabackup from Percona. Xtrabackup does not lock the database during State Snapshot Transfer, and it performs well enough to synchronize really large databases. HAProxy was extended to perform simple checks against the database so that no DB operations are sent to Donor or Desynced servers.
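A common way to wire that check, sketched below, is an xinetd service that runs a clustercheck script returning HTTP 200 only when the local node is Synced; the script path and port are assumptions and may differ from what Fuel ships.

    # /etc/xinetd.d/mysqlchk (sketch)
    service mysqlchk
    {
        disable        = no
        type           = UNLISTED
        socket_type    = stream
        port           = 49000
        wait           = no
        user           = nobody
        server         = /usr/bin/clustercheck   # returns 200 only when wsrep state is Synced
        only_from      = 0.0.0.0/0
        per_source     = UNLIMITED
    }

HAProxy's "option httpchk" then points at this port, so Donor or Desynced nodes are taken out of rotation automatically.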

MySQL/Galera - OCF script

  • Use latest GTID info for master election:
    • From CIB
    • From grastate.dat
  • Start PC with empty gcomm://
  • Clone-based
Speaker - Sergii Golovatiuk: Our previous implementation of the OCF script was fragile and did not reassemble the cluster under many conditions. The new version was rewritten from scratch and allows us to bring the Galera cluster back online without any interruptions. The general idea is to select the Primary Component with the most recent data. The OCF script gets the most recent GTID and keeps the value in the Pacemaker Cluster Information Base (CIB). In case of problems, the OCF script reads the data from the grastate.dat file, which allows it to bootstrap the cluster. Pacemaker uses this data to find the most up-to-date server for the Primary Component. In the scenario where all controllers are down, Pacemaker waits for neighbours for 5 minutes; if the neighbours are stuck on fsck or at a GRUB prompt, Pacemaker starts with all available nodes. In the monitor function, the OCF script detects the cases when a node went out of sync, and it correctly handles the Donor/Desync state, allowing nodes to perform State Snapshot Transfer (SST). The OCF script is currently clone-based, so we do not need to create a primitive for every node.
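As a rough illustration of the GTID-based election (not the actual Fuel OCF code), the agent could record the local commit position in the CIB with standard Pacemaker tools and fall back to grastate.dat when mysqld is down:

    # Illustrative shell sketch: publish this node's last committed position
    get_local_gtid() {
        local gtid
        gtid=$(mysql -N -e "SHOW STATUS LIKE 'wsrep_last_committed'" 2>/dev/null | awk '{print $2}')
        if [ -z "$gtid" ]; then
            # mysqld is down: read the persisted position from grastate.dat
            gtid=$(awk -F: '/^ *seqno:/ {gsub(/ /, ""); print $2}' /var/lib/mysql/grastate.dat)
        fi
        echo "${gtid:--1}"
    }

    # Keep the value in the Cluster Information Base as a node attribute
    crm_attribute --node "$(crm_node -n)" --lifetime reboot --name gtid --update "$(get_local_gtid)"

    # The node with the highest value bootstraps the Primary Component with an empty
    # cluster address (gcomm://); the others join it and sync via IST/SST.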

RabbitMQ

  • Hard to reassemble RabbitMQ cluster
  • Each Rabbit node tries to connect to previous queue Master
Speaker - Vladimir Kuklin: While working on RabbitMQ resilience, we noticed that RabbitMQ server behaviour is not always obvious and clear. For example, if you hard-reset a 3-node RabbitMQ cluster, it usually becomes very vulnerable to race conditions and there is not much you can do about it. In order to automate RabbitMQ reassembly, we created an OCF script that leverages Pacemaker master/slave resources and the notification mechanism for cloned resources. The diagram shows what you need to do to assemble a RabbitMQ cluster under control of Pacemaker: fire up beam processes on the first node; let Pacemaker elect the master (this will be the first node); create a 'master' attribute in the CIB; start the 'master' application on the first node and attach slave nodes to the master; have the periodic status command check that each running RabbitMQ server is connected to the master node; and if the rabbit app cannot start or join for any reason, reset it.
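The join sequence a slave node effectively performs once the master attribute is set boils down to a few rabbitmqctl calls. The sketch below is illustrative; the real OCF script wraps these in many checks and retries, and the node name is a placeholder.

    MASTER_NODE="rabbit@node-1"      # taken from the 'master' attribute in the CIB

    rabbitmqctl stop_app             # stop the rabbit application, keep the beam process running
    rabbitmqctl join_cluster "$MASTER_NODE"
    rabbitmqctl start_app

    # periodic monitor: make sure this node still sees the master in the cluster
    # (OCF_ERR_GENERIC is provided by the OCF shell functions)
    rabbitmqctl cluster_status | grep -q "$MASTER_NODE" || exit "$OCF_ERR_GENERIC"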

AMQP - oslo.messaging

Speaker - Vladimir Kuklin: We started with HAProxy proxying to only one RabbitMQ node. That solution was not the best; we wanted results that allow our controllers to scale out easily. When oslo.messaging was merged, we started using its internal healing mechanism, which makes connections to all RabbitMQ instances specified in the config files and shuffles them to minimize the effect of AMQP failover. But we had to rewrite part of the oslo.messaging code to support AMQP heartbeats and handle connection failure scenarios; the first community implementation was broken, and even the kernel killing connections could not make oslo.messaging fail over. The good news is that we fixed it and pushed the fix to OpenStack Gerrit.
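The relevant oslo.messaging settings from that era look roughly like the sketch below (shown for nova.conf; addresses are placeholders and option names changed in later releases):

    [DEFAULT]
    rabbit_hosts = 192.168.0.11:5672,192.168.0.12:5672,192.168.0.13:5672
    rabbit_ha_queues = True        # mirrored queues on the RabbitMQ side
    rabbit_retry_interval = 1
    rabbit_retry_backoff = 2
    rabbit_max_retries = 0         # keep retrying forever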

API Services and services engines

  • All API requests go through active/active HAProxy
  • Service engines are managed by Pacemaker and OCF scripts
  • HAProxy with the VIP lives in a separate namespace, using veth pairs and proxy ARP
Speaker - Sergii Golovatiuk: There is not much interesting to say about how we balance API requests: we use standard HAProxy with active-active backends, with some parameters tuned to digest production workloads. Service engines, such as the Heat and Ceilometer engines, are managed by dedicated OCF scripts. HAProxy itself runs in a separate network namespace to avoid hanging-connection problems when service endpoint IPs migrate between controllers. To achieve this, we used the common veth + proxy ARP approach, along with NAT rules to retain connectivity with all networking services.
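Conceptually, the namespace setup looks like the shell sketch below: a veth pair links the namespace to the host, the VIP lives inside the namespace, and proxy ARP on the host end answers for it. Names and addresses are illustrative; in Fuel the OCF resources do the equivalent.

    ip netns add haproxy
    ip link add hapr-host type veth peer name hapr-ns
    ip link set hapr-ns netns haproxy

    ip netns exec haproxy ip link set lo up
    ip netns exec haproxy ip link set hapr-ns up
    ip netns exec haproxy ip addr add 192.168.0.10/24 dev hapr-ns   # the VIP

    ip link set hapr-host up
    sysctl -w net.ipv4.conf.hapr-host.proxy_arp=1   # answer ARP for the VIP on the host side
    sysctl -w net.ipv4.ip_forward=1

    ip netns exec haproxy haproxy -f /etc/haproxy/haproxy.cfg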

Keystone

  • tokens in memcached
  • dogpile driver with python-memcached is broken
  • pylibmc is not eventlet-safe
  • Y. Taraday wrote a new driver with a connection pool
  • there are still bugs in python-memcached
Speaker - Sergii Golovatiuk: We strongly believe that temporary data should be kept in key-value stores. What happens if we lose this data? Correct: on the next operation the client or service gets a new token using the standard authentication method. In order to maintain Keystone resilience we added memcached support for tokens, but then we found out that the failure of a controller with a memcached instance may add up to 6 seconds of lag to operations, which makes the cluster unusable. We started looking for another solution and tried pylibmc, which is a nice implementation, but it is not eventlet-safe, so we could not use it. Our developer Yuri Taraday wrote a driver that supports a pool of connections to memcached. Nevertheless, there are still some problems with python-memcached, as it has broken logic for key sharding, which we are working to merge a fix for.
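The corresponding keystone.conf settings from that era look roughly like this sketch; driver paths and option names are release-specific, so treat them as assumptions.

    [token]
    driver = keystone.token.persistence.backends.memcache_pool.Token

    [memcache]
    servers = 192.168.0.11:11211,192.168.0.12:11211,192.168.0.13:11211
    # connection-pool options introduced with the pooled driver
    pool_maxsize = 100
    dead_retry = 30
    socket_timeout = 3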

Neutron

  • API under HAproxy
  • managing Neutron agents
    • clean everything on stop/start actions:
      • Interfaces
      • Child Processes (dnsmasq, metadata proxy)
    • entities rescheduling after migration
    • WIP: API handler and rescheduling using internal Neutron mechanism
Speaker - Vladimir Kuklin: Managing Neutron services is pretty easy when we are talking about the API. However, the devil is in the details when you want to manage the agents safely. First of all, when you start and stop agents, you need to ensure that they do not leave any artifacts that can affect connectivity, such as orphaned interfaces with IP addresses, or child processes. To achieve this, we perform special cleanup actions that destroy previously created interfaces and kill child processes on the nodes. The next thing you need to do is reschedule entities, such as routers for the L3 agent and networks for the DHCP agent, after an agent is migrated to another node; this is necessary, for example, in the failover case. The current version uses Neutron API calls for the L3 and DHCP agents. Our Neutron community team is working on moving this functionality into Neutron core; there is already some code for automatic rescheduling, but it showed issues in the AMQP failover case, and our developers are working to resolve them.
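The rescheduling the OCF script performs via the Neutron API is roughly equivalent to the CLI sketch below (agent IDs are placeholders):

    DEAD_AGENT=<dead-l3-agent-id>
    LIVE_AGENT=<live-l3-agent-id>

    # move every router hosted by the dead agent to a live one
    for router in $(neutron router-list-on-l3-agent "$DEAD_AGENT" | awk '/^\| [0-9a-f]/{print $2}'); do
        neutron l3-agent-router-remove "$DEAD_AGENT" "$router"
        neutron l3-agent-router-add    "$LIVE_AGENT" "$router"
    done

    # cleanup on stop/start: drop leftover namespaces, interfaces and child processes
    neutron-netns-cleanup --force --config-file /etc/neutron/neutron.conf \
                          --config-file /etc/neutron/l3_agent.ini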

Ceph

  • ephemeral storage <=> live migration
  • object and image storage shared
  • share host param for volume service
Speaker - Sergii Golovatiuk: We use a pretty common Ceph architecture, placing monitors on the controller nodes and having a separate role for Ceph OSD nodes. The user can put the Ceph journal on specific block devices for each OSD. Ceph is shared, software-defined storage and can be used as a replacement for proprietary solutions if you want live migration and highly available object/image/volume storage. We had to write a fair amount of code to support live migration with Ceph, as it was not in perfect shape previously. For Cinder HA we had to specify an identical host parameter in the Cinder config on all volume nodes in order to make volumes really shared.
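A sketch of the relevant cinder.conf fragment, identical on every node running cinder-volume; the host value and RBD settings are illustrative rather than the exact ones Fuel writes.

    [DEFAULT]
    host = rbd:volumes                 # same value on all volume nodes, so volumes are not tied to one host
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_user = volumes
    rbd_secret_uuid = <libvirt secret uuid>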

Main deployment components

  • corosync/pacemaker manifests
  • haproxy with conf.d patch + manifests
  • NS contained resources
Speaker - Vladimir Kuklin: In order to get all of the aforementioned components set up and deployed, we added or modified several deployment pieces and modules along with other software. These were: 1. The puppet-corosync module, originally created by Puppet Labs and modified by us. 2. A HAProxy build that supports configuration directory includes (conf.d), so we can inject changes to HAProxy in a granular manner. 3. Modified VIP and HAProxy OCF resources that contain HAProxy and the IP addresses inside dedicated namespaces.

Corosync/pacemaker manifests: part 1

  • not all resource types were supported upstream:
    • constraints (e.g. location) not supported
    • master/slave resources not supported
  • puppet service type provider for pacemaker:
    • parses LRM of alive nodes
    • waits for status change with respect to timeouts
    • handles timeouts depending on defaults or user-specified values
  • [5.1 release] shadow approach broken => moved to XML patches instead
Speaker - Vladimir Kuklin: First of all, we needed to polish some of the corosync module code that existed at the time we forked it: we needed additional support for other Pacemaker resources and entities, such as location constraints and master/slave resources. Then, in order to deploy Pacemaker resources while keeping almost the same Puppet code, we implemented a Pacemaker service provider for Puppet. It parses the output of the Pacemaker Local Resource Managers in the CIB, respecting timeout values and monitor commands. Also, in order to support complex OCF scripts such as the ones for Galera and RabbitMQ, we had to abandon the default upstream 'shadow'-like approach, as it sometimes overwrote cluster attributes during deployment changes, and this desynchronization led to deployment failures. So we leveraged Pacemaker support for XML-diff CIB modification and rewrote all the providers to use it.
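With standard Pacemaker tooling, the XML-patch approach boils down to something like the sketch below; the Puppet providers do the equivalent programmatically.

    cibadmin --query > /tmp/cib-orig.xml          # snapshot of the live CIB
    cp /tmp/cib-orig.xml /tmp/cib-new.xml
    # ... edit /tmp/cib-new.xml: add or change a primitive, constraint, etc. ...

    crm_diff --original /tmp/cib-orig.xml --new /tmp/cib-new.xml > /tmp/cib.patch
    cibadmin --patch --xml-file /tmp/cib.patch    # only the diff is applied, so concurrent
                                                  # attribute updates by other nodes survive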

Corosync/pacemaker manifests: part 2 - stability

  • [5.1 release] asymmetric cluster:
    • services stopped everywhere by default
      • enabled by 0 location constraint ("unbanning")
      • clones started locally
      • primitives started only once
  • tuning of sql drivers
  • short kernel tcp keepalives
Speaker - Vladimir Kuklin: Our initial implementation of the service provider and deployment workflow was not perfect, as it triggered restarts not only for the services on a particular node but globally. So we switched to an asymmetric Pacemaker cluster, which does not start services anywhere by default, and refactored the service provider to perform actions locally, using Pacemaker location constraints for start and stop actions. To make service actions node-local, we altered the behaviour of the status method depending on the type of resource: if the resource is a primitive we check its status globally, and for cloned resources we check the status locally. Also, to make deployment stable enough, we added short kernel TCP keepalives, which kill hanging connections in less than a minute, along with timeout tuning for the SQL drivers.
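In crm shell terms the idea looks like the sketch below; the resource and node names, and the keepalive values, are illustrative rather than the exact values Fuel sets.

    # asymmetric cluster: resources start nowhere unless explicitly allowed
    crm configure property symmetric-cluster=false

    # "unban" the haproxy clone on this node with a zero-score location constraint
    crm configure location loc_p_haproxy_node-1 clone_p_haproxy 0: node-1

    # short kernel TCP keepalives so hanging connections die in well under a minute
    sysctl -w net.ipv4.tcp_keepalive_time=30
    sysctl -w net.ipv4.tcp_keepalive_intvl=5
    sysctl -w net.ipv4.tcp_keepalive_probes=3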

Corosync/pacemaker manifests: part 3 - HA scalability

  • multicast problems => switched to unicast by default
  • need to alter and restart corosync => pacemaker maintenance mode
  • galera transitional sync => limit on parallel controllers
Speaker - Vladimir Kuklin: As soon as we had won the fight for deployment stability, we moved on to scalability, and here is what we faced. The initial feedback from our services team was that most of our customers do not have multicast enabled or correctly configured, which made us switch the deployment to unicast by default. We had to alter the corosync configuration and restart corosync each time we wanted to add a new controller; to make this work, we modified the corosync init scripts to check for Pacemaker and put it into maintenance mode. We also needed to limit the number of controllers deployed in parallel so as not to exhaust the donor nodes and affect the working environment.
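With the corosync 1.x syntax used at the time, switching to unicast means listing every controller explicitly, which is also why adding a controller requires a config change and a corosync restart. Addresses below are placeholders.

    # corosync.conf (sketch): unicast transport instead of multicast
    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.0
            member {
                memberaddr: 192.168.0.11
            }
            member {
                memberaddr: 192.168.0.12
            }
            member {
                memberaddr: 192.168.0.13
            }
        }
    }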

Testing HA and results

Testing

  • fuel-devops - a tiny libvirt-based orchestrator
  • destructive tests:
    • running virtual environments
    • running bare-metal environments
    • performing destructive actions
  • testing that cluster failed over successfully:
    • FUEL OSTF
    • Tempest
    • Rally
Speaker - Sergii Golovatiuk: Testing HA does not come cheap. We do it primarily with our own test suite: we spawn environments using our tiny orchestrator, which uses python-libvirt and other libraries, then perform the actual destructive actions and check whether the cluster can fail over successfully, using Fuel OSTF as well as other test suites.

Results

  • controller reset
  • all controllers reset
  • network partitioning
  • AMQP cluster node failure
  • DB node failure
  • individual OpenStack service failure
Speaker - Vladimir Kuklin: Speaking of results, what we have achieved with HA installations is that we can easily handle single-failure-at-a-time tolerance for the whole OpenStack cluster, whether the failure is a networking problem or a failure of the DB, AMQP, or a particular OpenStack service. And this has been confirmed not only by our tests but also by our most significant partners and customers, including carrier-grade ones, who run extensive and rigorous testing and say that they still cannot break our HA setup.

Current challenges and plans for future

  • multiple writes for Galera
  • oslo.db resilience
  • VRRP and DVR implementations and testing
  • fencing support
  • event-driven failover and evacuations
  • memcached-related fixes
  • ZeroMQ research
  • gate for HA tests
Speaker - Sergii Golovatiuk: Multiple writes: With a Galera cluster, HAProxy still sends all read/write operations to one server. OpenStack components issue SELECT ... FOR UPDATE queries; when two servers perform write operations simultaneously, one of the servers has to revert its own transaction, apply its neighbour's transaction, and then repeat its own DB transaction, and OpenStack services should be aware of such behaviour. We are working with the oslo.db developers to resolve these issues. We also plan to make the OCF script master/slave-based, allowing OpenStack cloud operators to see the status from pcs or crm_mon.

Speaker - Vladimir Kuklin: We want to handle rescheduling triggered by Pacemaker and OCF, along with utilizing the VRRP and DVR mechanisms for L3 agents. Node fencing: We are working on configuring fencing for Pacemaker cluster nodes, along with RabbitMQ cluster member fencing. Event-driven failover: We also want a centralized view of the whole cluster so we can start failover before the actual failure happens; a common example is triggers in conventional monitoring systems. Memcached-related fixes: We need to fix the python-memcached library along with the memcached driver implementation for Horizon. ZeroMQ research: Although some big cloud installations show that ZeroMQ may be a useful replacement for messaging, the current oslo.messaging driver for ZeroMQ is in a poor state, so we are going to research whether it may be applicable for HA-enabled, production-ready installations. HA tests: As we already have a well-working testing framework for highly available deployments, and we are really close to providing the ability for the community Fuel ISO to deploy vanilla OpenStack from particular commits, we are going to add HA test gating to the OpenStack Jenkins to indicate whether a particular commit affects HA.

Links

Speaker - Vladimir Kuklin: You can always check out the wiki page of the Fuel project. The actual deployment is done by the code in the fuel-library sub-project. We also periodically write down all our HA fixes in an etherpad. And you can always contact us in the #fuel-dev channel on Freenode or via the OpenStack mailing list with the [Fuel] prefix in the subject.

Credits

Aleksandr Didenko & Bogdan Dobrelia

Roman Podoliaka & Yuriy Taraday

Sergey Melikyan & Stanislav Lagun

Sergey Vasilenko & Matthew Mosesohn

Dmitry Ilyin & Dmitry Borodaenko

Ryan Moe & Andrew Woodward

Anastasia Urlapova & Tatyana Leontovich

Yegor Kotko & Artem Panchenko

Andrey Sledzinskiy

Questions?

Created by Vladimir Kuklin and Sergii Golovatiuk