Ceph performance

demystified

Benchmarks, Tools, and the Metrics That Matter

- Introductions. - We partnered with Inktank before it was even called that (and now it's Red Hat!). - We've done many Ceph-related projects, including building clusters and optimizing them. One of the questions we get asked most often is: how do I make head or tail of Ceph performance? - ...and that is why this talk exists. - I'll tell you about the benchmarks and tools you can use to appraise your Ceph performance, and about the metrics that really matter.

ceph -s

HEALTH_OK

(Credit: Aidan Jones, CC-BY-SA, source: https://flic.kr/p/5Ksc9t) - This is how I feel when I finish deploying my cluster and get HEALTH_OK. - Whether using ceph-deploy, or Puppet Ceph Deploy, or Ansible, etc, and deploying potentially hundreds of OSDs over dozens of nodes, the first thing I want to get is this.
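
For reference, these are the quick checks you can run from any node with an admin keyring:

ceph -s              # overall status: health, mon quorum, OSD and PG states
ceph health detail   # if it's not HEALTH_OK, list the specific problems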

But does it

perform?

- The question arises: does it actually perform? - Does it actually do what I expect it to do given the architecture and hardware available? - When analyzing storage performance, whether Ceph or otherwise, there are multiple dimensions we need to look at.

Latency

- For example, one of the dimensions we typically look at is latency. - What is latency? Take the smallest piece of data you can write to (or read from) your device: how long does that single operation take? - Latency is important for applications such as databases.

Throughput

- Another metric is throughput. - Entirely different from latency: it's not about how long the smallest possible write takes, but about the total volume of data you can write to or read from the device over a given time span. - This is important if you're, for example, serving streaming media, or running any application that stores or fetches large amounts of data at a time.

IOPS

- IOPS are more or less the reciprocal of latency (see the quick arithmetic below). - Instead of asking how long a single small write takes, you ask how many IO operations you can get per second. - Which of these metrics matters most for your application? That depends on your use case. :) - If you're building a streaming media application, you'll want read throughput. - If your application needs to jump between different pieces of data quickly, you care about latency. - If you're building a general-use Ceph cluster, you'll need to find a balance. - In addition to these multiple dimensions, there are also multiple components you'll want to benchmark.
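
A quick back-of-the-envelope check of that reciprocal relationship (the numbers here are purely illustrative):

IOPS ≈ 1 / latency (at queue depth 1)
5 ms per small write (typical spinner)  ->  ~200 IOPS
0.1 ms per small write (typical SSD)    ->  ~10,000 IOPS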

RADOS

- You might be building an application that uses the Object Store, RADOS, directly; if so, you'll be interested in its raw performance. The rados CLI sketched below is the quickest way to poke at it. - But you might also be interested in the layers that sit on top of it, which the next few slides cover.
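
A minimal example of talking to RADOS directly; the pool and object names are made up:

rados -p testpool put myobject ./somefile    # write an object
rados -p testpool get myobject /tmp/out      # read it back
rados -p testpool ls                         # list objects in the pool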

RBD

- If you're building a virtualization platform, as is the case with OpenStack, CloudStack, or plain KVM, you'll use Ceph for volume storage, and you're going to be interested specifically in RBD performance. - RBD is a client layer built on top of RADOS, so you need to take the performance of both into account.
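
A minimal sketch of that stack, with made-up pool and image names: create an image with the rbd tool, then let QEMU (built with rbd support) talk to it via librbd.

rbd create volumes/vm-disk-1 --size 10240    # 10GB image in the "volumes" pool
qemu-img info rbd:volumes/vm-disk-1          # QEMU reaches it through librbd, no kernel mapping needed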

radosgw

- If your clients primarily use the RESTful API to interact with the Object Store, you'll be very interested in radosgw performance. - As with RBD, you'll need to analyze the performance of both the underlying Object Store and the radosgw servers themselves: how many you have, how Apache or nginx is performing, how FastCGI (or whatever else you're using as a frontend) is performing.
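
As an illustration, once a radosgw user has been created with radosgw-admin, any S3-compatible client can be pointed at the gateway; the endpoint and bucket name here are placeholders:

s3cmd --host=rgw.example.com --host-bucket=rgw.example.com mb s3://bench-bucket
s3cmd --host=rgw.example.com --host-bucket=rgw.example.com put bigfile s3://bench-bucket/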

CephFS

- Finally, if you're running CephFS, another client layer on top of RADOS, you're also going to be interested in POSIX filesystem performance, whether you're using the Linux kernel client, ceph-fuse, libcephfs, Samba, or something else. - Once again, you need to consider multiple performance dimensions (latency, IOPS, throughput), but you also need to appraise your Ceph performance in its multiple layers (RADOS, RBD, radosgw, CephFS).
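
For reference, mounting is a one-liner with either the kernel client or ceph-fuse; the monitor address and secret file are placeholders:

mount -t ceph mon1.example.com:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
ceph-fuse -m mon1.example.com:6789 /mnt/cephfs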

Block

- At the very lowest layer, of crucial importance is the performance of the block devices to which data is actually being written, whether these are spinners, SSDs, Fusion-io drives, or anything else.

Network

- And of course, another item of great importance when measuring overall performance is your network throughput and latency. - This includes the client network, used by clients to interact with MONs and OSDs: critical because Ceph allows any client to talk to multiple OSDs directly and simultaneously. - The network connectivity between OSDs is also important, because Ceph puts a lot of the replication intelligence into the OSDs themselves. - Once again: multiple dimensions, multiple application workloads, and multiple layers, including the lowest level, notably block device and network performance.

Ceph cluster

Benchmarks

- When benchmarking Ceph (and this is where it gets challenging), you need to take into account all these dimensions, use cases, and layers.

- Overview: what can influence your Ceph performance, and at what level. - Lowest level: the block devices that store the data. Your OSD can never be any faster than its underlying block device. Typically there is a raw block device for the journal, and the OSD file store on a filesystem. - Journal performance: as a cost/performance tradeoff, you'll usually see relatively fast but small journal devices versus slower but larger file stores. Per server, you'll see a maximum of 12-18 spinners as file stores, with a smaller number of SSDs (2-3) hosting the journals. Journal performance is essentially the streaming performance of the device. - File store performance: the filesystem influences Ceph performance. There are 3 filesystems to choose from: BTRFS, XFS, and ext4 (most people will use XFS, as it tends to outperform ext4, and BTRFS is still not production-ready). All have tunables. - OSD daemon performance: it uses the file store and journal, but also a fair amount of CPU and memory. There are also many tunables here, which are critical to performance (see the illustrative ceph.conf snippet below). - Network performance: the client (or public) network, and the cluster network (for replication and backfilling). - Top layer: whatever sits on the client libraries, such as RBD itself, QEMU with the RBD driver, CephFS, or, if going through radosgw, Swift and S3 clients. - Luckily, there are good tools to evaluate the system from bottom to top.
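
Purely as an illustration of where those knobs live (the option names are real, but the values are examples, not recommendations), the journal and file store tunables go in ceph.conf under the [osd] section:

[osd]
    osd journal size = 10240              # MB, when the journal lives in a file
    filestore max sync interval = 5       # seconds between journal flushes to the file store
    filestore queue max ops = 500
    osd op threads = 4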

Block device benchmarks

(simple)

dd if=/dev/zero of=/dev/sdh1 bs=1G count=1 oflag=direct

dd if=/dev/zero of=/dev/sdh1 bs=512 count=1000 oflag=direct
- One of the things you should do is benchmark the performance of the underlying block devices BEFORE actually deploying the cluster, as some benchmarks are destructive (such as this really simple one). - Both commands write some data to a block device. The top one is a simple micro-benchmark for throughput: it writes a full gigabyte with O_DIRECT (oflag=sync or oflag=dsync are alternatives that also force data to the device). Some such flag is necessary, otherwise you're just measuring your page cache in RAM. - The bottom one is a super-simple latency test: the smallest amount of data you can write. For a spinner that's typically 512 bytes; for SSDs, 4K (see the variant below). - If the block device has a cache, make sure to overwhelm it with the amount of data you're writing.
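
For an SSD, the same latency micro-benchmark would use a 4K block size; the device name is just an example, and this is still destructive:

dd if=/dev/zero of=/dev/sdh1 bs=4k count=1000 oflag=direct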

- Here, on a live Ceph cluster, in the OSD directory, I'm writing 1GB to the journal device directly (a symlink to a partition, as you can see), in direct mode. 2GB/s is a little high, though. - Next, we test SSD latency: 0.03 s / 1000 writes = 0.03 ms per write. - Next, we write directly to the spinner, but we need to overwhelm its cache (with 2GB) to get the real value.

Block device benchmarks

(advanced)

fio --size=100m \
    --ioengine=libaio \
    --invalidate=1 \
    --direct=1 \
    --numjobs=10 \
    --rw=write \
    --name=fiojob \
    --blocksize_range=4K-512k \
    --iodepth=1
- You can do the same benchmark in a more elaborate fashion using fio. - It's a universal benchmark tool, maintained by Jens Axboe, who also maintains the Linux kernel's block layer. - In this case, we're trying to duplicate the typical write load of a Ceph OSD journal: async IO (the libaio engine), direct IO, randomized block sizes, writing 100MB in each of 10 jobs (1GB total).
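
Note that as written the job writes to files in the current directory; to target a raw block device instead (destructively), you'd append something like the following, the device name being an example:

--filename=/dev/sdh1    # add to the job above to hit the raw device rather than files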

- We get 6.3GB/s aggregate bandwidth, which is cool. - You can play around with the parameters as I'm doing here. Usually, a spinner will give you 15-80 MB/s of bandwidth, while an SSD will get you something in the 300-400 MB/s range. - Don't expect Ceph performance to match journal performance, though! OSDs will periodically flush journals to disk (of course), and when that happens, you're constrained by spinner performance.

Network benchmarks

netperf -t TCP_STREAM -H <host>
- I don't have a screencast for this, as most of you are probably familiar with netperf. - Ceph connections use regular TCP sockets, so you could also dd into netcat, for instance.
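
A crude dd-into-netcat throughput test along those lines (the port is arbitrary, and netcat listen syntax varies slightly between flavors):

nc -l 5001 > /dev/null                               # on the receiving node
dd if=/dev/zero bs=1M count=1024 | nc <host> 5001    # on the sending node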

OSD benchmark

ceph tell osd.X bench
- At this point we have benchmarks for block and network level, and now we can actually benchmark the OSD. - This is really useful because it comes bundled with Ceph.

- What comes out is something like this. - I always run benchmarks, as you've seen, at least 3 times and take an average, because you can run into hidden caches or something to that effect (see the loop below). - By the way, this is a non-destructive benchmark, so you can use it on a running cluster (though it might impact that OSD's performance temporarily). - (If you're feeling naughty, run ceph tell osd.* bench on a cluster you don't maintain. ;) - This runs locally on the OSD. - It's useful for catching configuration errors, such as mounting XFS without inode64, which will impact performance.
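
For example, to get those three runs on a single OSD (the OSD id is an example):

for i in 1 2 3; do ceph tell osd.0 bench; done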

rados benchmark

rados bench -p bench 30 write

Do this on a throwaway pool!

- This one DOES take the network into account. You tell rados to test writing or reading, and for how long, from an actual client (such as an OpenStack compute node). - It creates a bunch of RADOS objects, and while you can use any pool for this, it's recommended to create a throwaway one. Although the benchmark removes the objects it creates, if you interrupt it with SIGINT they'll remain (a full create/run/cleanup cycle is sketched below).

- This is what it looks like. - Another reason to use a separate pool is that you can play with the number of PGs and see how they affect performance. - You can see the expected performance on a 10 Gigabit network, which is about 1.1GB/s. - You can also tune the write size to get a better latency test; with RADOS, for instance, you'll probably want to set the writes to 4MB chunks.
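
A sketch of the full cycle, with the pool name, PG count, and block size as examples: create the throwaway pool, run a smaller-block write test, clean up the leftover objects, and drop the pool.

ceph osd pool create bench 128                  # throwaway pool with 128 PGs
rados bench -p bench 30 write -b 4096           # 30 seconds of 4K writes for a latency-oriented run
rados -p bench cleanup                          # remove leftover benchmark objects
ceph osd pool delete bench bench --yes-i-really-really-mean-it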

fio RBD benchmarks

fio --size=10G \
    --ioengine=rbd \
    --invalidate=0 \
    --direct=1 \
    --numjobs=10 \
    --rw=write \
    --name=fiojob \
    --blocksize_range=4K-512k \
    --iodepth=1 \
    --pool=bench \
    --rbdname=fio-test
- Now we can move up the stack and benchmark one of the popular Ceph use cases: volume storage for virtualization. Luckily, recent versions of fio include a librbd backend engine, so you don't have to jump through kernel-mapping and mounting hoops. - You can see the standard options, such as direct IO, and all the regular benchmark types are available (read, write, random read, random write, streaming write, etc). - The RBD-specific options are the pool and the RBD image you want to use. fio won't create the image for you (see below)!
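
Creating the target image beforehand, with the pool and name matching the job above (the size is just an example):

rbd create bench/fio-test --size 102400    # 100GB image in the "bench" pool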

- Writing 100GB in total (10GB per job across 10 jobs), we get an aggregate bandwidth of 927MB/s. - That's about 10% under raw RADOS, but a pretty good result for RBD. - Results fluctuate a bit depending on the benchmark type. - Of note is that Ceph performs particularly well in randwrite benchmarks: it doesn't mind that the IO is all over the place, because it hits OSDs all over the place anyway. - This runs against 200 OSDs on 4 physical nodes, all-SSD.

rbd bench-write

rbd bench-write -p bench rbd-test \
    --io-threads=10 \
    --io-size 4096 \
    --io-total 10GB \
    --io-pattern rand
- Recently a new option for the "rbd" tool came out: "bench-write". You can use it in the same spirit as fio with the librbd engine.

Mind the cache!

- 900MB/s is pretty good, but we have a wonderful writeback and writethrough cache in RBD. - We can measure it with fio.

- The first test is without caching, and we get roughly 900MB/s. - We get the same with the cache enabled (rbd cache = true), though. Why? - Ceph has a feature where the client operates in writethrough mode until it receives the first "flush" from the upper layers: only once a flush comes in can Ceph tell that the guest is capable of sending them, and only then does it safely switch to writeback caching. - A relatively recent version of QEMU with virtio will send flushes, but not a VM running an ancient kernel such as 2.6.32, or something like Windows; if you're not using virtio, you're also not going to see flushes. In those cases, Ceph defaults to protecting your data (i.e., writethrough only). - To test this in fio, you need to set --fsync=10 (send an fsync after every 10 IOs), because the RBD fio engine translates fsync into a "flush": now we get over 1GB/s, and we can also see the flushes in the client log (see the sketch below). - Further benchmarking tools: rest-bench for radosgw, fio on CephFS, IOzone, and bonnie++ (though the results of the last two are reportedly difficult to interpret).
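
A sketch of the two pieces involved; the cache settings are real Ceph client options (values illustrative), and --fsync is the fio flag that generates the flushes:

# ceph.conf on the client:
[client]
    rbd cache = true
    rbd cache writethrough until flush = true

# fio job: add --fsync=10 so that every 10th IO sends a flush down through librbd
fio ... --ioengine=rbd --fsync=10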

Please

share, copy, adapt, remix!

https://github.com/arbrandes/vault2015