Have Data, Want Scale

Indefinitely

Exploring Ceph

Florian Haas, @hastexo

Every one of us has certain prized possessions. Even though we all realize that the truly important things in life are immaterial, we all have things that we are proud to own and that we would have trouble giving up. For most people, those are things like ...

... maybe a fancy car, and for me it happens to be ... Source: https://flic.kr/p/kbSGkT

... this road bicycle, and for many people it will be their house, or their stereo system, or maybe their best cooking knife. Strangely enough, though, we seem to be relatively indifferent to the things that are the truly important parts of said prized possessions. For example, for a car, the most important parts, the things that truly determine whether the car can be driven in the first place, and the things that can also cause potentially fatal damage upon failure, are the tires.

And yet, people who like cars typically really don't show off their tires that much. Source: https://flic.kr/p/63pmXg

And for a house, the part that determines its stability and its use for shelter the most is arguably the foundation. However, people seem to be much more keen to show off their ... Source: https://flic.kr/p/83BQ5Z

... front porch or living room or their patio, rather than, say, the concrete in the foundation, or the rebar in that concrete. And unsurprisingly, IT is much the same way. What we're proud of, what we like to show off, are, say, cool visualization libraries ... Source: https://flic.kr/p/79R15s

... or fancy mobile apps or social networks, or maybe ...

... fancy distributed systems solving the mysteries of the universe. Source: CERN, Belmiro Moreira

Storage systems, in contrast ... Source: https://flic.kr/p/6TT1b3

... well, people usually find them somewhat boring. Source: https://flic.kr/p/jWZBsL

Yet what's crucially important, what determines whether these systems are actually useful or whether they are meaningless shells, is data. And it's how we store that data that makes a huge difference to the success of those systems. Now in recent history, in the last 10-15 years perhaps, ... Source: https://flic.kr/p/6CtYcz

... we've been swept up in a revolution in all of IT,... Source: https://flic.kr/p/jSmB5

... and that revolution is called open source... Source: http://commons.wikimedia.org/wiki/File:Open_Source_Initiative_keyhole.svg

...or free software, depending on your personal conviction. But strangely... Source: http://en.wikipedia.org/wiki/File:Heckert_GNU_white.svg

... the storage industry had maintained a strange kind of immunity against that revolution. While vendors adopted Linux as a platform they would support, storage systems were still rife with vendor lock-in. Vendors would essentially sell you overpriced refrigerators, and if you so much as wanted to make one refrigerator talk to another, then the typical reaction from tech people and sales people alike would be that ... Source: https://flic.kr/p/6TT1b3

... they'd laugh at you. In fact, they'd even laugh at you if you wanted to kick out the... Source: https://flic.kr/p/etMgGc

... fibre channel switch vendor they preferred for one that you wanted. And then... Source: https://flic.kr/p/dtbmdi

... came Ceph. Ceph is a distributed storage system designed to provide block, file, and object storage on a software-defined platform running on commodity hardware, providing automatic scale-out and high availability at petabyte scale using open replication protocols, all under a free software license. Now clearly, that's way too many ...

... buzzwords to be useful, so let's dissect what Ceph is about, where it came from, and how it's significant and useful. I can speak my mind freely here, because I run an independent company that -- while we do work with Ceph very frequently -- is not affiliated with any business entity that is related to Ceph. Now the original motivation behind Ceph -- back in 2005 -- was to provide a distributed filesystem, akin to the then-popular Lustre filesystem, but without its shortcomings. In short, it was meant to be ...

Lustre - suck = Ceph

... "Lustre without the suck".

It originally came out of a research project (and PhD thesis) at the University of California, Santa Cruz (UCSC). The doctoral candidate was Sage Weil; the research was partially funded by the US Department of Energy. Sage continues to serve as Ceph's lead developer to this day, even though Ceph has grown a massive developer community in the interim. At the core of Ceph ...

... lies the notion of a low-level object. An object is a chunk of data of arbitrary size, with an arbitrary number of key-value attributes attached to it. And collectively, that object store is known as...

Reliable Autonomic Distributed Object Store (RADOS)

... RADOS.

Crucially, the operations that can be performed on any object in the object store are very simple: they boil down to GET, PUT and DELETE operations. Objects do not concern themselves with sector offsets (like block devices), nor with file metadata like permission bits or file ownership. Why? Because those interfaces complicate what Ceph is built to shine at:

Distribution

Replication
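To make the object model described above a little more tangible, here is a minimal in-memory sketch of a flat object namespace: each object is a chunk of bytes plus key-value attributes, and the only operations are PUT, GET and DELETE. The class and method names (`ToyObjectStore`, `put`, `get`, `delete`) are invented for illustration and are not part of any Ceph API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A chunk of bytes plus arbitrary key-value attributes."""
    data: bytes
    attrs: dict[str, bytes] = field(default_factory=dict)


class ToyObjectStore:
    """In-memory stand-in for a flat object namespace (hypothetical, not Ceph)."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, name: str, data: bytes, **attrs: bytes) -> None:
        # Whole-object write: no sector offsets, no permission bits.
        self._objects[name] = StoredObject(data, dict(attrs))

    def get(self, name: str) -> StoredObject:
        # Raises KeyError if the object does not exist.
        return self._objects[name]

    def delete(self, name: str) -> None:
        del self._objects[name]
```

Because the interface is this small, nothing in it ties an object to a particular disk or server, which is what leaves room for the two properties discussed next.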

Distribution

Let's talk about distribution first. If you want to be able to scale a datastore to the tune of petabytes or exabytes, then vertical scalability (scale-up) is not an option; you will have to go with horizontal scalability (scale-out). However, if you keep track of where everything was written in one central location, then every lookup for every read and write needs to go through that central location, which by definition becomes a single point of failure and a bottleneck. This, incidentally, is exactly the suck in Lustre that Ceph was designed to fix. So the only way to build this the right way is to devise an algorithm where the placement of any object is computed, rather than looked up, and that algorithm is then known to all components in the environment (servers and clients alike).

Controlled Replication Under Scalable Hashing (CRUSH)

CRUSH is that algorithm, and it's at the heart of Ceph. Every Ceph client and server component is aware of CRUSH, and can computationally derive the location of any object in the system. This eliminates the need for a central lookup facility and opens the system for massive scalability.
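The idea of computing a location instead of looking it up can be sketched in a few lines. What follows is not CRUSH itself; it is rendezvous (highest-random-weight) hashing, a much simpler technique that shares the key property: anyone who knows the cluster map and the object name arrives at the same answer, with no central directory. The cluster map, the node and rack names, and the `place` function are all assumptions made up for this example.

```python
import hashlib

# Hypothetical cluster map: storage node -> failure domain (here, a rack).
CLUSTER_MAP = {
    "node-0": "rack-a", "node-1": "rack-a",
    "node-2": "rack-b", "node-3": "rack-b",
    "node-4": "rack-c", "node-5": "rack-c",
}


def _score(object_name: str, node: str) -> int:
    """Deterministic pseudo-random score for an (object, node) pair."""
    digest = hashlib.sha256(f"{object_name}/{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, replicas: int = 3,
          cluster_map: dict[str, str] = CLUSTER_MAP) -> list[str]:
    """Compute the nodes holding an object, at most one replica per rack."""
    ranked = sorted(cluster_map,
                    key=lambda node: _score(object_name, node),
                    reverse=True)
    chosen: list[str] = []
    used_racks: set[str] = set()
    for node in ranked:
        rack = cluster_map[node]
        if rack in used_racks:
            continue  # keep replicas in separate failure domains
        chosen.append(node)
        used_racks.add(rack)
        if len(chosen) == replicas:
            break
    return chosen
```

A client and a server calling `place("some-object")` with the same map get the same list of nodes, so neither has to ask a third party where the data lives. The real algorithm handles weights, hierarchies and map changes far more carefully than this sketch does.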

Replication

CRUSH includes automatic, synchronous replication, where each object is stored in the cluster not once, but a configurable number of times, and where these replicas are distributed according to an operator-defined policy, such that the system can, for example, keep copies of each object in separate failure domains. Now if you want to use Ceph, as an application developer,...

librados

libradospp

python-ceph

phprados

... Ceph provides a set of CRUSH-aware libraries, which can be used to interact with RADOS directly. This enables application developers to store data in a reliable, distributed, highly available object store without having to worry about how exactly that data is stored. More importantly though, many of these applications have already been built, so you can use higher level abstractions on the client side.
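As a rough illustration of what talking to RADOS directly looks like, here is a small sketch using the `rados` Python binding. The configuration path, the notion of passing in a pool name, the attribute name `owner`, and the helper names `open_pool` and `roundtrip` are assumptions for this example; you need a running cluster and suitable credentials for `open_pool` to succeed.

```python
def open_pool(pool: str, conffile: str = "/etc/ceph/ceph.conf"):
    """Connect to a cluster and open an I/O context on one pool."""
    import rados  # Python binding for librados; imported lazily on purpose

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    return cluster, cluster.open_ioctx(pool)


def roundtrip(ioctx, name: str, data: bytes, owner: bytes) -> bytes:
    """PUT an object with one attribute, GET it back, then DELETE it."""
    ioctx.write_full(name, data)           # replace the whole object
    ioctx.set_xattr(name, "owner", owner)  # attach a key-value attribute
    stored = ioctx.read(name, len(data))   # read back from offset 0
    ioctx.remove_object(name)
    return stored
```

Against a real cluster you would call `cluster, ioctx = open_pool("mypool")`, then `roundtrip(ioctx, "greeting", b"hello", b"me")`, and finish with `ioctx.close()` and `cluster.shutdown()`. Note that the code never says which server or disk the object should land on.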

RADOS Block Device (RBD)

This is a block abstraction layer on top of the distributed object store.

This means that while you interact with RBD as if it were a block device, all I/O to and from that device is translated, transparently, into I/O operations on RADOS objects -- which means you get distribution and replication for free. Write a block to the block device, it becomes one or several RADOS objects, transparently distributed and replicated across the cluster.
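The translation from block I/O to object I/O is, at its simplest, just arithmetic. The sketch below cuts a byte range on a virtual block device into per-object extents. It assumes fixed-size 4 MiB objects and an invented naming scheme (`<image>.<object number in hex>`); the function name `block_io_to_objects` is hypothetical, and real images add striping options and naming details that are left out here.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def block_io_to_objects(image: str, offset: int, length: int,
                        object_size: int = OBJECT_SIZE):
    """Split a byte range on a virtual block device into object extents.

    Returns a list of (object_name, offset_in_object, length) tuples.
    """
    extents = []
    end = offset + length
    while offset < end:
        index = offset // object_size   # which object this byte falls into
        within = offset % object_size   # position inside that object
        chunk = min(object_size - within, end - offset)
        extents.append((f"{image}.{index:016x}", within, chunk))
        offset += chunk
    return extents
```

A write that straddles an object boundary simply becomes two object writes, and each of those objects is then placed and replicated like any other object in the store.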

RADOS Gateway

RADOS Gateway enables RESTful HTTP(S) access to objects in the store using the S3 or Swift protocol.

Again, you interact with a well-known client protocol and transparently, without your intervention, your data is automatically distributed and replicated across the cluster.
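Because the gateway speaks S3, an ordinary S3 client library should work once it is pointed at the gateway's endpoint. Here is a hedged sketch using the third-party `boto3` library; the endpoint URL, credentials, bucket and key are placeholders you would supply yourself, and the helper names `make_s3_client` and `put_and_get` are made up for this example.

```python
def make_s3_client(endpoint_url: str, access_key: str, secret_key: str):
    """Build an S3 client that talks to a custom endpoint instead of AWS."""
    import boto3  # third-party AWS SDK; imported lazily on purpose

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def put_and_get(s3, bucket: str, key: str, body: bytes) -> bytes:
    """Create a bucket, store one object in it, and read the object back."""
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

The application code is plain S3; nothing in it knows or cares that the objects end up in RADOS.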

CephFS

CephFS, finally, is what Lustre always wanted to be: a horizontally scalable filesystem with built-in high availability. Crucially though, CephFS is implemented as a client layer, again on top of RADOS.

All it has to do is translate POSIX filesystem access into RADOS object I/O, and it gets distribution and replication for free. Ceph is readily available to a multitude of applications today.

Qemu/KVM

Ceph RBD is directly integrated with Qemu/KVM as a storage driver, enabling you to run virtual machines directly off Ceph RBD volumes.

As such, it is fully available as persistent and ephemeral block storage in OpenStack.

CloudStack

OpenNebula

Eucalyptus

Ceph is likewise integrated with, and supported by, other open-source cloud platforms like Apache CloudStack, OpenNebula or Eucalyptus.

Deploying Ceph is not rocket science thanks to the ceph-deploy utility, which in this example allowed us to deploy a simple Ceph cluster in under 4 minutes. Deployment facilities for Puppet, Chef, Ansible and SaltStack are readily available. More information is available at http://ceph.com. So go forth and run it, play with it, join the mailing lists, get on IRC, find me there.

Image credits

pestoverde, mroach, CERN, MoDOT Photos, sonjalovas, Strangers of London, derfian, jev55, Brad Bergeron, dvanzuijlekom

Visualization credit

kerryrodden

These slides

http://fghaas.github.io/driving-it-2014

https://github.com/fghaas/driving-it-2014

https://www.hastexo.com/contact
