A few words to start with
This talk covers an approach to running application containers that we think is quite suitable for production use; at least, we've been running it in production for six months.
A lot of people we talked to about this approach found it unusual, though, which always makes me slightly wary that we may have missed some important issues.
So with this talk I'll attempt to demonstrate, to explain, and to hopefully kick off some good hallway discussions about what we've been doing. In other words, if you can poke holes in the ideas described here, please do.

You should know containers
- This talk assumes familiarity with basic Linux container concepts. You should know what a container is, and ideally you will have run a container now and then. It doesn't really matter whether your experience has been with LXC, LXD, Docker or whatever.
- It is not for complete container novices. Not being a container expert is perfectly OK, though; I'm not one either.

You should know high availability
- It also doesn't hurt if you've worked with high availability systems.

LXC containers
LXC containers are a means of cleverly making use of Linux kernel features, such as...

Lightweight virtualization
cgroups
namespaces
... network and process namespaces, and cgroups, to isolate processes from one another. What LXC does is basically set up namespace and cgroup scaffolding for a particular process, and then start that process, which becomes PID 1 in the container.
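As a rough illustration of the underlying kernel mechanism (this uses plain unshare rather than LXC, purely to show the PID 1 effect; it is not what LXC literally does under the hood):
- sudo unshare --fork --pid --mount-proc /bin/bash   # bash becomes PID 1 in a fresh PID namespace
- ps -AHfww                                          # run inside: only bash and ps are visible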

Two container types
Based on what exactly the process is that gets started as the container's PID 1, we can distinguish between two different container types:

System containers
A system container is one where the container manager (like LXC) starts an init process, like SysV init or Upstart or systemd. This is the most common use case. However, it is by no means required.

Application containers
In contrast, an application container is one where LXC starts any other binary, like, say, MySQL or Apache or lighttpd or HAProxy or whatever your container is meant to run.
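For instance, with the LXC userspace tools (the container name "web" and the lighttpd config path are made up; this assumes a container configuration named web already exists), you can start a single binary directly:
- lxc-execute -n web -- lighttpd -D -f /etc/lighttpd/lighttpd.conf   # lxc-init becomes PID 1 and spawns lighttpd
- lxc-attach -n web -- ps -AHfww                                     # the process tree inside is just lxc-init, lighttpd and ps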

What's wrong with system containers?
So what's the problem with running system containers? Well, they're not so much problems as nuisances, but they can be extraordinarily painful nuisances at times.

Resource utilization
So one issue is that in a system container whose only purpose in life is to, say, run an Apache webserver, there are a bunch of processes running in the container that are not Apache: a cron daemon, perhaps Postfix, something to manage the network, and so on. This may not hurt if all you're running is 50 Apache containers, but if you're running 500 or 5,000 or 50,000, then that excess overhead may actually cut into your server density.

Updates
But a much bigger operational problem is updates, specifically security updates. System containers generally run off their own chroot-type root filesystem, so any packages that you install there must be updated. Generally you'd do this with something like unattended-upgrades (on Debian/Ubuntu), or by Puppetizing or Ansiblifying your containers.
Or, and this is the approach generally favored by, for example, the
Docker camp, you don't patch containers at all, but instead you
rebuild them often and redeploy.
Now, put yourself in the shoes of someone who manages maybe 10,000 Apache instances, and the latest bombshell like Heartbleed drops. Now you have to scramble to either patch or rebuild those 10,000, and rather quickly. All for containers that are largely identical anyway; otherwise they wouldn't be containerized.

How can we do better with application containers?
Now there's gotta be a better way, and indeed there is, if we make use of another handy kernel feature.

An in-kernel filesystem merged in Linux 3.18
Linux 3.18 was released in December 2014; OverlayFS had already been available before that in Ubuntu 14.04.

Union mount filesystem
OverlayFS isn't the only union mount filesystem in existence; among its predecessors were UnionFS and AUFS. OverlayFS is just the first one that made it upstream.

OverlayFS
↑
upperdir
↑
lowerdir
In OverlayFS we always deal with three directories directly, and one more that is internal to OverlayFS's workings.
The overlay itself is the union mount at the top of the stack.
The lowerdir is our template or baseline directory.
The upperdir is a directory that stores all the files by which the
union mount differs from the lowerdir.
And finally, there's also a workdir that OverlayFS uses
internally, but that is not exposed to users.
On reads, all data that exists in the upperdir is served from there, and anything that only exists in the lowerdir transparently passes through.

OverlayFS
↓
upperdir
×
lowerdir
Writes, in contrast, never hit the lower directory; they always go to the upper directory.
What this means is that our template lowerdir stays pristine; it is only the upperdir that the OverlayFS mount physically modifies.
- ls -lR lower, upper
- mount -t overlay -o lowerdir=$PWD/lower,upperdir=$PWD/upper,workdir=$PWD/work none $PWD/overlay
- Explain contents
- cd overlay
- mkdir blatch
- touch blatch/blatchfile
- echo hello > bar/barfile
- ls -lR lower, upper, overlay
- umount $PWD/overlay
- Clean out upper
- Mount the overlay again, this time using / as the lowerdir
- ls overlay
- chroot overlay
- ls
- ls /tmp
- ls /root
- cd ..
- mkdir upper/{root,tmp}
- setfattr -n trusted.overlay.opaque -v y upper/{root,tmp}
- mount -o remount $PWD/overlay
- chroot again
- ps -AHfww
- netstat -lntp
- exit
- lxc-start -n bash -d
- lxc-attach -n bash
- ps -AHfww
- netstat -lntp

How does this help with application containers?
So this means that on any host we can have our host filesystem
as the template for any number of containers using the same
OS. Anything that is installed on the host, all containers inherit,
but they all only run exactly the services they need. And if there is
any piece of software that we need to update, we just do so on our
host, and as soon as we remount the overlays and restart each
container, it is immediately updated.
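A minimal sketch of that layout (paths and the container name are made up; lxc.rootfs is the LXC 1.x config key):
- mkdir -p /srv/web/{upper,work,rootfs}
- mount -t overlay -o lowerdir=/,upperdir=/srv/web/upper,workdir=/srv/web/work none /srv/web/rootfs
- echo 'lxc.rootfs = /srv/web/rootfs' >> /var/lib/lxc/web/config   # point the container at the overlay
- lxc-start -n web -d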
But that still leaves us with two problems:
- We have a window of undefined behavior while the upgrade is underway
and containers are still running.
- Having to do this manually doesn't scale; we have to find an appropriate means of automating it.

Pacemaker
Pacemaker can come in here to bail us out.

High availability
Cluster manager
The Pacemaker stack is the default Linux high-availability stack, and one of its qualities is that it is...

Application agnostic
... application agnostic, so that effectively it can be made to manage any cluster resource in a highly available fashion.

Can manage LXC through libvirt or native resource agent
Pacemaker offers support for managing LXC containers not in one, but two ways:
- via libvirt, using the VirtualDomain resource agent, or
- via a native lxc resource agent, using the lxc userspace toolchain (a minimal example of the latter follows below).
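A minimal crmsh sketch of the native approach (resource and container names are made up; the ocf:heartbeat:lxc agent takes the container name and its config file as parameters):
- crm configure primitive p_lxc_web ocf:heartbeat:lxc \
    params container=web config=/var/lib/lxc/web/config \
    op monitor interval=30s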

Can manage filesystems
Pacemaker is perfectly capable of managing filesystem mounts as
well, including OverlayFS mounts (and of course, mounts for regular
local filesystems).
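For instance (again crmsh, with made-up names and paths), an overlay mount can be expressed with the ocf:heartbeat:Filesystem agent:
- crm configure primitive p_fs_web ocf:heartbeat:Filesystem \
    params device=none directory=/srv/web/rootfs fstype=overlay \
    options="lowerdir=/,upperdir=/srv/web/upper,workdir=/srv/web/work" \
    op monitor interval=20s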

Can manage storage & data replication
And finally, we can also use Pacemaker to enable and disable access to storage and data replication facilities.
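One possible sketch with DRBD (names made up; this assumes a DRBD resource called "web" is already configured on both nodes):
- crm configure primitive p_drbd_web ocf:linbit:drbd \
    params drbd_resource=web \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
- crm configure ms ms_drbd_web p_drbd_web \
    meta master-max=1 clone-max=2 notify=true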
So how can we use all that to our advantage?

We use two hosts which are identically configured as far as packages go. Then we define one DRBD device for every container that we run, and that device gets a Pacemaker-managed filesystem. Pacemaker also manages the overlay mounts and the actual containers.
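Tying that together per container might look roughly like this in crmsh (made-up names, building on the sketches above, plus a hypothetical p_fs_web_drbd primitive that mounts the DRBD device holding the upperdir):
- crm configure group g_web p_fs_web_drbd p_fs_web p_lxc_web
- crm configure colocation c_web_on_drbd inf: g_web ms_drbd_web:Master
- crm configure order o_web_after_drbd inf: ms_drbd_web:promote g_web:start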
So that means that on a host that is in standby (doesn't run any
cluster resources), we can simply run apt-get dist-upgrade. That
installs all pending patches and security fixes, all while leaving the
existing containers untouched. Then we fail everything over. If
anything goes wrong, we still have the original system that we can
fail back to. Otherwise, we just repeat the same process on the other
box.
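In crmsh terms, that manual round-trip could look roughly like this (node names alice and bob are made up):
- crm node standby alice            # resources move to bob
- ssh alice apt-get dist-upgrade    # patch the now-idle node
- crm node online alice
- crm node standby bob              # fail over onto the freshly patched alice
- ssh bob apt-get dist-upgrade
- crm node online bob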
We can nicely automate this with DPkg::Pre-Invoke and DPkg::Post-Invoke hooks in our APT configuration, and initiate all package installations and updates from Ansible. You could also go all the way and auto-update with unattended-upgrades, but there is a tiny chance that APT could get invoked within the failover window on the other node, so the Pre-Invoke check would have to be a little more involved.
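A very rough sketch of such a hook, say in /etc/apt/apt.conf.d/99standby (the file name is made up, and whether a blanket node standby on every dpkg run is appropriate is exactly the policy question just mentioned):
  DPkg::Pre-Invoke  { "crm node standby"; };
  DPkg::Post-Invoke { "crm node online"; };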
This is a simple failover. We just put one node in standby, watch what it's doing, and take it out of standby again.

LXC
+ OverlayFS
+ APT
+ Pacemaker
= Awesome
But where to go from here?
libvirt-lxc
systemd-nspawn
systemd
+ OverlayFS
+ APT
+ Ceph
+ fleet
= Moar Awesome?
http://hastexo.github.io/lceu2015