A few words to start with
This talk covers an approach to running application containers that we think is quite suitable for production use; at least, we've been running it in production for six months.
A lot of people we talked to about this approach found it unusual, though, which always makes me slightly wary that we may have missed some important issues.
So with this talk I'll attempt to demonstrate, to explain, and to hopefully kick off some good hallway discussions about what we've been doing. In other words, if you can poke holes in the ideas described here, please do.

You should know containers
- This talk assumes familiarity with basic Linux container concepts. You should know what a container is, and ideally you will have run a container now and then. It doesn't really matter whether your experience has been with LXC, LXD, Docker or whatever.
- It is not for complete container novices. Not being a container expert is perfectly OK, though; I'm not one either.

You should know high availability
- It also doesn't hurt if you've worked with high availability systems.

LXC containers
LXC containers are a means of cleverly making use of Linux kernel features, such as...

Lightweight virtualization
cgroups
namespaces
... network and process namespaces, and cgroups, to isolate processes from one another. What LXC does is basically set up namespace and cgroup scaffolding for a particular process, and then start that process, which becomes PID 1 in the container.
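As a rough illustration of the underlying kernel mechanism (this uses plain unshare rather than LXC, purely to show the PID 1 effect; it is not what LXC literally does under the hood):
- sudo unshare --fork --pid --mount-proc /bin/bash   # bash becomes PID 1 in a fresh PID namespace
- ps -AHfww                                          # run inside: only bash and ps are visible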

Two container types
Based on what exactly the process is that gets started as the container's PID 1, we can distinguish between two different container types:

System containers
A system container is one where the container manager (like LXC) starts an init process, like SysV init or Upstart or systemd. This is the most common use case. However, it is by no means required.

Application containers
In contrast, an application container is one where LXC starts any other binary, like, say, MySQL or Apache or lighttpd or HAProxy or whatever your container is meant to run.
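For instance, with the LXC userspace tools (the container name "web" and the lighttpd config path are made up; this assumes a container configuration named web already exists), you can start a single binary directly:
- lxc-execute -n web -- lighttpd -D -f /etc/lighttpd/lighttpd.conf   # lxc-init becomes PID 1 and spawns lighttpd
- lxc-attach -n web -- ps -AHfww                                     # the process tree inside is just lxc-init, lighttpd and ps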

What's wrong with system containers?
So what's the problem with running system containers? Well, they're not so much problems as nuisances, but they can be extraordinarily painful nuisances at times.

Resource utilization
So one issue is that in a system container whose only purpose in life is to, say, run an Apache webserver, there are a bunch of processes running in the container that are not Apache: a cron daemon, perhaps Postfix, something to manage the network, and so on. This may not hurt if all you're running is 50 Apache containers, but if you're running 500 or 5,000 or 50,000, then that excess overhead may actually cut into your server density.

Updates
But a much bigger operational problem is updates, specifically security updates. System containers generally run off their own chroot-type root filesystem, so any packages that you install there must be updated. Generally you'd do this with something like unattended-upgrades (on Debian/Ubuntu), or by Puppetizing or Ansiblifying your containers.
Or, and this is the approach generally favored by, for example, the
Docker camp, you don't patch containers at all, but instead you
rebuild them often and redeploy.
Now, put yourself in the shoes of someone who manages maybe 10,000 Apache instances, and the latest bombshell like Heartbleed drops. Now you have to scramble to either patch or rebuild those 10,000, and rather quickly. All for containers that are largely identical anyway; otherwise they wouldn't be containerized.

How can we do better with application containers?
Now there's gotta be a better way, and indeed there is, if we make use of another handy kernel feature.

An in-kernel filesystem merged in Linux 3.18
Linux 3.18 was released in December 2014; OverlayFS had already been available before that in Ubuntu 14.04.

Union mount filesystem
OverlayFS isn't the only union mount filesystem in existence; among its predecessors were UnionFS and AUFS. OverlayFS is just the first one that made it upstream.

OverlayFS
↑
upperdir
↑
lowerdir
In OverlayFS we always deal with three directories directly, and one more that is internal to OverlayFS's workings.
The overlay itself is the union mount at the top of the stack.
The lowerdir is our template or baseline directory.
The upperdir is a directory that stores all the files by which the
union mount differs from the lowerdir.
And finally, there's also a workdir that OverlayFS uses
internally, but that is not exposed to users.
On reads, all data that exists in the upperdir is served from there, and anything that only exists in the lowerdir transparently passes through.

OverlayFS
↓
upperdir
×
lowerdir
Writes, in contrast, never hit the lower directory; they always go to the upper directory.
What this means is that our template lowerdir stays pristine; it is only the upperdir that the OverlayFS mount physically modifies.
- ls -lR lower, upper
- mount -t overlay -o lowerdir=$PWD/lower,upperdir=$PWD/upper,workdir=$PWD/work none $PWD/overlay
- Explain contents
- cd overlay
- mkdir blatch
- touch blatch/blatchfile
- echo hello > bar/barfile
- ls -lR lower, upper, overlay
- umount $PWD/overlay
- Clean out upper
- Mount the overlay again, this time using / as the lowerdir
- ls overlay
- chroot overlay
- ls
- ls /tmp
- ls /root
- cd ..
- mkdir upper/{root,tmp}
- setfattr -n trusted.overlay.opaque -v y upper/{root,tmp}
- mount -o remount $PWD/overlay
- chroot again
- ps -AHfww
- netstat -lntp
- exit
- lxc-start -n bash -d
- lxc-attach -n bash
- ps -AHfww
- netstat -lntp

How does this help with application containers?
So this means that on any host we can have our host filesystem
as the template for any number of containers using the same
OS. Anything that is installed on the host, all containers inherit,
but they all only run exactly the services they need. And if there is
any piece of software that we need to update, we just do so on our
host, and as soon as we remount the overlays and restart each
container, it is immediately updated.
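A minimal sketch of that layout (paths and the container name are made up; lxc.rootfs is the LXC 1.x config key):
- mkdir -p /srv/web/{upper,work,rootfs}
- mount -t overlay -o lowerdir=/,upperdir=/srv/web/upper,workdir=/srv/web/work none /srv/web/rootfs
- echo 'lxc.rootfs = /srv/web/rootfs' >> /var/lib/lxc/web/config   # point the container at the overlay
- lxc-start -n web -d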
But that still leaves us with two problems:
- We have a window of undefined behavior while the upgrade is underway
and containers are still running.
- Having to do this manually doesn't scale; we have to find an appropriate means of automating it.

Pacemaker
Pacemaker can come in here to bail us out.

High availability
Cluster manager
The Pacemaker stack is the default Linux high-availability stack, and one of its qualities is that it is...

Application agnostic
... application agnostic, so that effectively it can be made to manage any cluster resource in a highly available fashion.

Can manage LXC through libvirt or native resource agent
Pacemaker offers support for managing LXC containers not in one, but two ways:
- via libvirt, using the VirtualDomain resource agent, or
- via a native lxc resource agent, using the lxc userspace toolchain (a minimal example of the latter follows below).
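A minimal crmsh sketch of the native approach (resource and container names are made up; the ocf:heartbeat:lxc agent takes the container name and its config file as parameters):
- crm configure primitive p_lxc_web ocf:heartbeat:lxc \
    params container=web config=/var/lib/lxc/web/config \
    op monitor interval=30s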

Can manage filesystems
Pacemaker is perfectly capable of managing filesystem mounts as
well, including OverlayFS mounts (and of course, mounts for regular
local filesystems).
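For instance (again crmsh, with made-up names and paths), an overlay mount can be expressed with the ocf:heartbeat:Filesystem agent:
- crm configure primitive p_fs_web ocf:heartbeat:Filesystem \
    params device=none directory=/srv/web/rootfs fstype=overlay \
    options="lowerdir=/,upperdir=/srv/web/upper,workdir=/srv/web/work" \
    op monitor interval=20s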

Can manage storage & data replication
And finally, we can also use Pacemaker to enable and disable access to storage and data replication facilities.
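One possible sketch with DRBD (names made up; this assumes a DRBD resource called "web" is already configured on both nodes):
- crm configure primitive p_drbd_web ocf:linbit:drbd \
    params drbd_resource=web \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
- crm configure ms ms_drbd_web p_drbd_web \
    meta master-max=1 clone-max=2 notify=true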
So how can we use all that to our advantage?

We use two hosts which are identically configured as far as packages go. Then we define one DRBD device for every container that we run, and that device gets a Pacemaker-managed filesystem. Pacemaker also manages the overlay mounts and the actual containers.
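Tying that together per container might look roughly like this in crmsh (made-up names, building on the sketches above, plus a hypothetical p_fs_web_drbd primitive that mounts the DRBD device holding the upperdir):
- crm configure group g_web p_fs_web_drbd p_fs_web p_lxc_web
- crm configure colocation c_web_on_drbd inf: g_web ms_drbd_web:Master
- crm configure order o_web_after_drbd inf: ms_drbd_web:promote g_web:start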
So that means that on a host that is in standby (doesn't run any
cluster resources), we can simply run apt-get dist-upgrade. That
installs all pending patches and security fixes, all while leaving the
existing containers untouched. Then we fail everything over. If
anything goes wrong, we still have the original system that we can
fail back to. Otherwise, we just repeat the same process on the other
box.
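In crmsh terms, that manual round-trip could look roughly like this (node names alice and bob are made up):
- crm node standby alice            # resources move to bob
- ssh alice apt-get dist-upgrade    # patch the now-idle node
- crm node online alice
- crm node standby bob              # fail over onto the freshly patched alice
- ssh bob apt-get dist-upgrade
- crm node online bob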
We can nicely automate this with DPkg::Pre-Invoke and DPkg::Post-Invoke hooks in our APT configuration, and initiate all package installations and updates from Ansible. You could also go all the way and auto-update with unattended-upgrades, but there is a tiny chance that APT could get invoked within the failover window on the other node, so the Pre-Invoke check would have to be a little more involved.
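A very rough sketch of such a hook, say in /etc/apt/apt.conf.d/99standby (the file name is made up, and whether a blanket node standby on every dpkg run is appropriate is exactly the policy question just mentioned):
  DPkg::Pre-Invoke  { "crm node standby"; };
  DPkg::Post-Invoke { "crm node online"; };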
This is a simple failover. We just put one node in standby, watch what it's doing, and take it out of standby again.

LXC
+ OverlayFS
+ APT
+ Pacemaker
= Awesome
But where to go from here?
libvirt-lxc
systemd-nspawn
systemd
+ OverlayFS
+ APT
+ Ceph
+ fleet
= Moar Awesome?
http://hastexo.github.io/lceu2015