Have Data, Want Scale
Indefinitely
Exploring Ceph
Florian Haas, @hastexo
Every one of us has certain prized possessions. Even though we all
realize that the truly important things in life are immaterial, we all
have things that we are proud to own and that we would have trouble
giving up.
For most people, those are things like ...
... this road bicycle, and for many people
it will be their house, or their stereo system, or maybe their best
cooking knife. Strangely enough, though, we seem to be relatively indifferent
to the things that are the truly important parts of said prized
possessions.
For example, for a car, the most important parts, the things that
truly determine whether the car can be driven in the first place, and
the things that can also cause potentially fatal damage upon failure,
are the tires.
And for a house, the part that determines its stability and its
use for shelter the most is arguably the foundation.
However, people seem to be much more keen to show off their ...
Source: https://flic.kr/p/83BQ5Z
... front porch or living room or their patio, rather than, say,
the concrete in the foundation, or the rebar in that concrete.
And unsurprisingly, IT is much the same way. What we're proud
of, what we like to show off, are, say, cool visualization libraries ...
Source: https://flic.kr/p/79R15s
... or fancy mobile apps or social networks, or maybe ...
... fancy distributed systems solving the mysteries of the
universe.
Source: CERN, Belmiro Moreira
Yet what's crucially important, what determines whether these
systems are actually useful or whether they are meaningless shells, is
data. And it's how we store that data that makes a huge difference to
the success of those systems.
Now in recent history, in the last 10-15 years perhaps, ...
Source: https://flic.kr/p/6CtYcz
... the storage industry had maintained a strange
kind of immunity against that revolution. While vendors adopted Linux
as a platform they would support, storage systems were still rife with
vendor lock-in. Vendors would essentially sell you overpriced
refrigerators, and if you so much as wanted to make one refrigerator
talk to another, then the typical reaction from tech people and sales
people alike would be that ...
Source: https://flic.kr/p/6TT1b3
... they'd laugh at you. In fact,
they'd even laugh at you if you wanted to kick out the...
Source: https://flic.kr/p/etMgGc
... came Ceph.
Ceph is a distributed storage system designed to provide block,
file and object storage on a software-defined platform running on
commodity hardware, providing automatic scale-out and high
availability on a Petabyte scale using open replication protocols, all
under a free software license.
Now clearly, that's way too many ...
... buzzwords to be useful, so let's dissect what Ceph is
about, where it came from, and how it's significant and useful. I can
speak my mind freely here, because I run an independent company that
-- while we do work with Ceph very frequently -- is not affiliated
with any business entity that is related to Ceph.
Now the original motivation behind Ceph -- back in 2005 -- was
to provide a distributed filesystem, akin to the then-popular Lustre
filesystem, but without its shortcomings. In short, it was meant to be ...
Lustre - suck = Ceph
... "Lustre without the suck".
It originally came out of a research project (and PhD thesis) at
the University of California, Santa Cruz (UCSC). The doctoral
candidate was Sage Weil; the research was partially funded by the US
Department of Energy. Sage continues to serve as Ceph's lead developer
to this day, even though Ceph has grown a massive developer community
in the interim.
At the core of Ceph ...
... lies the notion of a low-level object. An
object is a chunk of data of arbitrary size, with an arbitrary number
of key-value attributes attached to it.
And collectively, that object store is known as ...
Reliable
Autonomic
Distributed
Object
Store
RADOS
... RADOS.
Crucially, the operations that can be performed on any
object in the object store are very simple: they boil down to GET, PUT and DELETE
operations.
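To make the object model concrete, here is a minimal in-memory sketch. It is a toy, not Ceph code: the names `StoredObject` and `ToyObjectStore` are made up for illustration, and nothing here is distributed or replicated. It only shows how little the interface needs: a name, a chunk of bytes, some key-value attributes, and three operations.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A chunk of bytes of any size plus free-form key-value attributes."""
    data: bytes
    attrs: dict = field(default_factory=dict)


class ToyObjectStore:
    """Hypothetical flat namespace: object name -> object.

    There are no sector offsets, permission bits or ownership here.
    """

    def __init__(self):
        self._objects = {}

    def put(self, name, data, **attrs):
        # PUT: store (or overwrite) the whole object under its name.
        self._objects[name] = StoredObject(data, dict(attrs))

    def get(self, name):
        # GET: return the object; raises KeyError if it does not exist.
        return self._objects[name]

    def delete(self, name):
        # DELETE: drop the object; raises KeyError if it does not exist.
        del self._objects[name]


store = ToyObjectStore()
store.put("greeting", b"hello world", lang="en")
obj = store.get("greeting")
print(obj.data, obj.attrs)
store.delete("greeting")
```

Because the interface is this small, the store is free to decide where each object lives, which is what the rest of the talk builds on.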
Objects do not concern themselves with sector
offsets (like block devices), nor with file metadata like permission
bits or file ownership. Why? Because those interfaces complicate what
Ceph is built to shine at:
Distribution
Let's talk about distribution first. If you want to be able to
scale a datastore to the tune of petabytes or exabytes, then vertical
scalability (scale-up) is not an option; you will have to go with
horizontal scalability (scale-out).
However, if you distribute data by recording, in a central location,
where each piece was written, then every lookup for every read and write
needs to go through that central location, which by definition becomes
a single point of failure and a bottleneck. This, incidentally, is
exactly the suck in Lustre that Ceph was designed to fix.
So the only way to build this the right way is to devise an
algorithm where the placement of any object is computed, rather than
looked up, and that algorithm is then known to all components in the
environment (servers and clients alike).
Controlled
Replication
Under
Scalable
Hashing
CRUSH
CRUSH is that algorithm, and it's at the heart of Ceph. Every
Ceph client and server component is aware of CRUSH, and can
computationally derive the location of any object in the system. This
eliminates the need for a central lookup facility and opens the system
for massive scalability.
Replication
CRUSH includes automatic, synchronous replication, where
each object is stored in the cluster not once, but a configurable
number of times, and where these replicas are distributed according to
an operator-defined policy, such that the system can, for example,
keep copies of each object in separate failure domains.
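The idea of computing placement instead of looking it up can be sketched in a few lines. This is not CRUSH itself: rendezvous (highest-random-weight) hashing stands in as a much simpler technique, and the cluster map, node names and rack names below are invented for illustration. What it shares with the description above is that any party holding the same map computes the same answer, and that replicas land in separate failure domains.

```python
import hashlib

# Hypothetical cluster map: storage node -> failure domain (here, a rack).
CLUSTER = {
    "osd.0": "rack-a", "osd.1": "rack-a",
    "osd.2": "rack-b", "osd.3": "rack-b",
    "osd.4": "rack-c", "osd.5": "rack-c",
}


def score(obj_name, node):
    """Deterministic pseudo-random score for an (object, node) pair."""
    digest = hashlib.sha256(f"{obj_name}/{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj_name, cluster=CLUSTER, replicas=3):
    """Pick up to `replicas` nodes for an object, at most one per failure domain."""
    # Rank every node by its score for this object, highest first.
    ranked = sorted(cluster, key=lambda node: score(obj_name, node), reverse=True)
    chosen, used_domains = [], set()
    for node in ranked:
        domain = cluster[node]
        if domain not in used_domains:
            chosen.append(node)
            used_domains.add(domain)
        if len(chosen) == replicas:
            break
    return chosen


# Clients and servers run the same function and agree without asking anyone.
print(place("greeting"))
```

No directory service is consulted at any point; the only shared state is the cluster map and the function.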
Now if you want to use Ceph, as an application developer, ...
librados
libradospp
python-ceph
phprados
... Ceph provides a set of CRUSH-aware libraries, which can be used
to interact with RADOS directly. This enables application developers
to store data in a reliable, distributed, highly available object
store without having to worry how exactly that data is stored.
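As a rough illustration of what that looks like from Python, here is a sketch using the `rados` module from Ceph's Python bindings. The configuration path, the pool name `data`, the object name and the attribute name are assumptions chosen for the example; adjust them to your cluster. The helper `roundtrip` is hypothetical and exists only to group the calls.

```python
def roundtrip(ioctx, name, payload):
    """Write an object, tag it with an attribute, read both back, then remove it."""
    ioctx.write_full(name, payload)            # PUT: replace the whole object
    ioctx.set_xattr(name, "lang", b"en")       # attach a key-value attribute
    data = ioctx.read(name)                    # GET: read the object's data
    attr = ioctx.get_xattr(name, "lang")
    ioctx.remove_object(name)                  # DELETE
    return data, attr


def main():
    # Imported here so the module can be loaded without the bindings installed.
    import rados

    # Assumed config path and pool name; both depend on your deployment.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")
        try:
            print(roundtrip(ioctx, "greeting", b"hello world"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Calling `main()` on a host that can reach a running cluster performs the round trip. Note that the application code never says which server holds the object.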
More importantly though, many of these applications have already
been built, so you can use higher level abstractions on the client
side.
RADOS
Block
Device
RBD
This is a block abstraction layer on top of the distributed object
store.
This means that while you interact with RBD as if it were a
block device, all I/O to and from that device is translated,
transparently, into I/O operations on RADOS objects -- which means
you get distribution and replication for free. Write a block to the
block device, it becomes one or several RADOS objects,
transparently distributed and replicated across the cluster.
RADOS Gateway
RADOS Gateway enables RESTful HTTP/S access to objects in the
store using the S3 or Swift protocol.
Again, you interact with a well-known client protocol and,
transparently and without your intervention, your data is automatically
distributed and replicated across the cluster.
CephFS
CephFS finally is what Lustre always wanted to be: a
horizontally scalable filesystem with built-in high
availability. Crucially though, CephFS is implemented as a client
layer, again on top of RADOS.
All it has to do is translate POSIX filesystem access into RADOS
object I/O, and it gets distribution and replication for free.
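The data side of that translation can be pictured as simple arithmetic. The sketch below assumes a fixed object size of 4 MiB and a hypothetical naming scheme (file identifier plus chunk index); both are illustrative choices, not a statement of how CephFS lays out files. It also ignores everything a real filesystem needs for metadata, such as names, directories and permissions.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed fixed object size (4 MiB)


def object_name(inode, index):
    """Hypothetical naming scheme: file identifier plus chunk index, in hex."""
    return f"{inode:x}.{index:08x}"


def map_range(inode, offset, length, object_size=OBJECT_SIZE):
    """Translate a byte range of a file into (object name, offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        # Which object does this offset fall into, and where inside it?
        index, within = divmod(offset, object_size)
        # Take bytes up to the end of this object or the end of the request.
        chunk = min(object_size - within, end - offset)
        pieces.append((object_name(inode, index), within, chunk))
        offset += chunk
    return pieces


# A 6 MiB read starting 3 MiB into the file touches three objects.
for piece in map_range(inode=0x10000000000, offset=3 * 2**20, length=6 * 2**20):
    print(piece)
```

Each resulting piece is an ordinary object read or write, so placement and replication need no special handling at this layer.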
Ceph is readily available to a multitude of applications today.
Qemu/KVM
Ceph RBD is directly integrated with Qemu/KVM as a storage
driver, enabling you to run virtual machines directly off Ceph RBD
volumes.
As such, it is fully available as persistent and ephemeral block
storage in OpenStack.
CloudStack
OpenNebula
Eucalyptus
Ceph is likewise integrated with, and supported by, other
open-source cloud platforms like Apache CloudStack, OpenNebula or
Eucalyptus.
Deploying Ceph is not rocket science thanks to the ceph-deploy
utility, which in this example allowed us to deploy a simple Ceph
cluster in under 4 minutes. Deployment facilities for Puppet, Chef,
Ansible and SaltStack are readily available. More information is
available at http://ceph.com. So go forth and run it, play with it,
join the mailing lists, get on IRC, find me there.