Ceph performance learnings (long read)
We have been using Ceph since the 0.7x releases back in 2013, starting when we were fed up with the open source iSCSI implementations and longing to provide our customers with a more elastic, manageable, and scalable solution. Ceph has generally fulfilled its promises from a functionality perspective. However, if you have been following this blog or have searched for Ceph troubles on Google, you will likely have seen our previous posts.
Aside from early software stability issues, we had to invest a good amount of manpower (and nerves) into learning how to make Ceph perform acceptably and how all the pieces (hard drives, SSDs, RAID controllers, 1- and 10Gbit networking, CPU and RAM consumption, Ceph configuration, Qemu drivers, ...) fit together. Today, I'd like to present our learnings from both a technical and a methodical point of view. Especially the methodical aspects should be seen against the background of running a production cluster for a comparatively long time by now, going through version upgrades, hardware changes, and so on. Even if you won't be bitten by the specific issues of the 0.7x series, the methods may prove useful to avoid navigating into troublesome waters. No promises, though. :)
It's alive - you need to take care of it
Your Ceph installation is a living being: we thought we could go ahead, read the manual, understand it, install, configure, and be done. Looking back, we expected a "no hassles" software, which Ceph delivers on some levels (like the flexibility with which we can now host VM images) but which requires a lot of attention on other levels. The cause is that Ceph is a multi-variate system that keeps changing: Ceph developers invent new ways to improve the system that you need to handle operationally when upgrading, hardware dies, new hardware comes in, your performance requirements change, your storage requirements change, configuration recommendations change, you get a better understanding of the system, and on and on ... You might be tempted to think: I can fix this by investing more in the "pre-live" stage, but complex systems do not lend themselves to this thinking. You will get things wrong, you will learn new things, and then you adapt and move on.
Stick with the defaults and recommended values
A substantial amount of the trouble and work we caused for ourselves came from trying to "improve" things by being really smart about tuning. It left us cleaning up mistakes that we could have avoided by simply sticking with the defaults and recommended values. The Python community has a saying about metaclasses that seems to apply to Ceph tuning as well: if you don't really know, then you shouldn't do it, and if you do know, then you don't need it. Yes, obviously this is an overstatement; however, it carries even more weight when combined with the next learning:
"Hin und her macht Taschen leer"
I could not find a translation for this perfectly fitting German stock investment motto - if you have one, I'd love to hear it! As a rough English translation I can offer: back and forth makes you go broke. Ceph wants to be lazy. Storage wants to sit around (hey, it's called storage, not "Energizer Bunny"). When you keep changing parameters in your cluster, you will find that Ceph needs to redistribute your data. This causes additional load on the drives and reduces the IOs and bandwidth available to clients. It also takes time and requires your personnel to pay attention. Combined with not sticking to the defaults and recommended values, this cost us many, many late nights, gave our customers uncomfortable performance, and even caused outages when we had to perform disruptive operations on our VMs. These three learnings will guide us in the future to avoid frustration and to stop running unnecessarily into unpredictable situations. From here, let's move on to the more technical items.
Pools are for grouping OSDs - not for managing multi-tenancy
We run a multi-tenant environment and thought it would be a good idea to group the data from our different customers together in pools. This turned out to be a big misunderstanding: pools are used to create groups of OSDs that adhere to one specific policy of placing your data. If you need different policies, you create different pools. Those policies can be complicated and even use the same OSDs, but pools are not intended for "arbitrary" grouping and won't scale, because of ...
Controlling the number of PGs in your cluster is important
Initially, when we started with Ceph, there was a rule about choosing the number of placement groups based on the amount of data (in a pool). The general idea is that placement groups allow Ceph to efficiently assign objects as groups to one or more OSDs. The number of PGs in your cluster represents a tradeoff between performance and safety: too few PGs and your data won't be distributed evenly and won't leverage your equipment properly; too many and your OSDs will struggle to manage them with acceptable CPU usage, increasing latency for all client operations. The current rule of thumb is 150-300 PGs per OSD in your cluster; small clusters with fewer than 50 OSDs have some more static recommendations. In our environment, running with almost 200 pools, we added PGs based on the original recommendation by looking at the amount of data in the pools. This left us with about 15,000 PGs on about 12 OSDs. As you can see, that is about 4-8 times the recommendation, and we ended up with high access latency. After increasing the number of OSDs and moving to a single pool with fewer PGs, we immediately saw much improved latency, which directly showed up as a drop in average response time for various sites on our platform from 550 ms to around 470 ms. More than 10% improvement just from sticking with the defaults. Again: we spent a lot of effort to get into a bad situation and had to spend even more to get out of it again ...
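For illustration, here is the usual back-of-the-envelope calculation as a minimal Python sketch. The target value and the power-of-two rounding are assumptions on my part, so double-check them against the Ceph documentation (or the pgcalc tool) for your release:

```python
# Rough PG sizing sketch -- not an official Ceph tool, just the usual
# napkin formula: aim for a target number of PG replicas per OSD,
# divide by the replication factor, and round up to a power of two.

def suggest_pg_num(num_osds, replicas=3, target_pgs_per_osd=200):
    """Suggest pg_num for a single pool holding (almost) all data."""
    raw = num_osds * target_pgs_per_osd / replicas
    pg_num = 1
    while pg_num < raw:          # round up to the next power of two
        pg_num *= 2
    return pg_num

# Example: 12 OSDs with 3 replicas -> 1024 PGs, a far cry from the
# ~15,000 PGs we had accumulated across almost 200 pools.
print(suggest_pg_num(12))
```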
One OSD per platter/drive
We always knew about this recommendation but started seeing weird behaviour with various Avago RAID controllers and thought it would be wise to try various combinations of RAID 5, RAID 0, and others. However, we always ended up with experiments in our development environment that showed improvements we could not replicate in production. I have spent way too many hours interacting with the infamous MegaCLI, so please, to save your sanity: stick with one OSD per platter/drive until Sage or someone else tells you not to.
SSD caching helps, but ...
Earlier this year we had trouble providing enough IOs to our clients. With a cluster of around 50 7.2k SATA drives it is easy to calculate a rough upper limit. If you're used to SSD performance in your MacBook, brace yourself for this number: 50 SATA drives will give you roughly 5,000 IOs. In our case that's 6 servers with a capital investment of more than 30,000 €. The physics of spinning disks are not on your side. You could improve this somewhat with 10k or 15k SAS drives, but the price per GiB will only go up. As we're running with Avago RAID controllers, we evaluated a tip we got from our friends at filoo.de: CacheCade 2.0. This is a software upgrade for current controllers that allows the controller to use an SSD (or an SSD RAID) as a large read/write cache for your slower disks. The good news: it's relatively cheap and can be added to a running environment without too much hassle. However, it's quite opaque to operate and, again, MegaCLI will drive you insane. Also, depending on your cluster size and access patterns, it may take CacheCade a few days to settle at a balanced performance level. It's not the most impressive change we've deployed so far, but it helped in a dire situation, so it might be worthwhile to know about. In the future, if we consider SSD caching for spinning-disk storage, we will likely look at an open solution like bcache or whatever is around then. CacheCade was fine for us as it was a drop-in addition and we did not have to touch the cluster's configuration much.
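If you like, here is that napkin math as a tiny Python sketch. The per-drive figure is an assumption (a 7.2k SATA drive manages very roughly 100 random IOs per second), and the write ceiling additionally accounts for the replication factor:

```python
# Napkin math for the IO ceiling of a spinning-disk cluster (a sketch,
# not a benchmark). Assumption: ~100 random IOs per second per 7.2k drive.

def cluster_io_ceiling(num_drives, iops_per_drive=100, replicas=3):
    reads = num_drives * iops_per_drive
    # Every client write hits `replicas` drives, so writes top out lower.
    writes = reads // replicas
    return reads, writes

reads, writes = cluster_io_ceiling(50)
print("~%d read IOs/s, ~%d write IOs/s before any journal/cache help"
      % (reads, writes))
```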
OSD latency: jemalloc
Allocating memory to work with your data is something that Ceph does a lot. When researching latency issues with our OSDs, we picked up on the discussion around which memory allocator Ceph works best with. At the moment the recommendation for best performance is jemalloc, which uses a little more memory but is generally faster. People have done in-depth testing that shows the details and edge cases.
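To verify that your OSDs are actually running with the allocator you think they are, a small sanity check like the following sketch can help. It simply scans /proc on an OSD host (needs root), and the process-name matching is an assumption about your setup:

```python
# Sanity check: which allocator did each running ceph-osd actually load?
# Scans /proc/<pid>/maps for jemalloc/tcmalloc; run as root on an OSD host.
import glob

def report_osd_allocators():
    for comm_path in glob.glob("/proc/[0-9]*/comm"):
        pid = comm_path.split("/")[2]
        try:
            with open(comm_path) as f:
                if f.read().strip() != "ceph-osd":
                    continue
            with open("/proc/%s/maps" % pid) as f:
                maps = f.read()
        except OSError:
            continue  # process exited or permission denied
        if "jemalloc" in maps:
            allocator = "jemalloc"
        elif "tcmalloc" in maps:
            allocator = "tcmalloc"
        else:
            allocator = "glibc malloc (most likely)"
        print("ceph-osd pid %s: %s" % (pid, allocator))

if __name__ == "__main__":
    report_osd_allocators()
```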
RAID controllers in general
We've been running LSI/Avago (or PERC when we used DELL) RAID controllers for a long time. We had a few 3ware controllers at some point, but my overall experience is that these are annoying pieces of hardware and none of the vendors can be bothered to create good software or documentation. In the future we'll try to either run with on-board controllers or simple HBAs; the money spent on RAID controllers is just wasted. To improve the reliability of the system disks, we'll likely employ Linux software RAID instead, as it has improved a lot compared to 10 years ago and I actually found it nice to work with compared to all the commercial RAID options.
Separate data paths: NVMe SSDs for journals
At some point we were struggling with journal performance and had a strong indication that our RAID controllers were not able to cope with high-IO SSD traffic concurrently with slow SATA HDD traffic. As mentioned before, controllers are annoyingly hard to debug, and we found that moving the journals to PCIe NVMe cards got that off our table. They may not be hot-pluggable, but with Ceph's redundancy you'll find it easy to power off a machine to perform maintenance when needed.
One slow OSD can bring your cluster to a halt
As Ceph distributes data evenly (according to your policy) throughout your cluster, all your data will be spread everywhere. This means there's some part of your VMs (or other data) on every OSD. No problem in general. But if an OSD starts to respond sluggishly, this spreads over to all clients when they access data on that OSD. If an OSD becomes utterly unresponsive, Ceph will remove it from the cluster and restore any missing data from the remaining copies. However...
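When hunting for that one sluggish OSD, "ceph osd perf" is a good starting point. Here is a rough sketch of how one could look at it programmatically; the JSON field names are assumptions based on Hammer-era output and may well differ in your release:

```python
# Sketch: flag OSDs whose journal commit / filestore apply latency sticks out,
# based on `ceph osd perf`. Field names are assumptions (Hammer-era output).
import json
import subprocess

THRESHOLD_MS = 100  # arbitrary; pick a value that fits your hardware

def report_slow_osds():
    raw = subprocess.check_output(["ceph", "osd", "perf", "--format", "json"])
    data = json.loads(raw.decode())
    for info in data.get("osd_perf_infos", []):
        stats = info["perf_stats"]
        worst = max(stats.get("commit_latency_ms", 0),
                    stats.get("apply_latency_ms", 0))
        if worst >= THRESHOLD_MS:
            print("osd.%s: %s ms -- look at this one first" % (info["id"], worst))

if __name__ == "__main__":
    report_slow_osds()
```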
Flapping adds insult to injury
Ceph can decide to remove an acting OSD from the cluster to avoid things getting stuck. However, it can also decide to take it back in. If an OSD misbehaves in a "weird" way, this results in the OSD joining and leaving the cluster over and over, which may create a feedback loop of stress: the OSD is slow and gets kicked out, Ceph starts recovery which adds stress to the cluster, the OSD behaves fine again and rejoins, the recovery now includes the temporarily missing OSD, which is under more stress than usual and may respond even more sluggishly than before, so it gets kicked out again, which creates more recovery traffic, and so on and so on ... If something is broken, make sure it either gets fixed completely or stays completely broken. Partially working OSDs are a bane.
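In practice that means sidelining the flapping OSD yourself instead of letting Ceph argue with it. A minimal sketch follows (the OSD id is made up); whether you also want to set the cluster-wide "noup" flag during the incident is a judgement call, since it keeps healthy restarted OSDs from rejoining as well:

```python
# Sketch: permanently sideline a flapping OSD so recovery runs exactly once.
import subprocess

def sideline_osd(osd_id):
    # Mark the OSD "out" so its data gets re-replicated elsewhere.
    subprocess.check_call(["ceph", "osd", "out", str(osd_id)])
    # Optionally keep *any* down OSD from rejoining while you investigate
    # (cluster-wide flag -- remember to clear it with `ceph osd unset noup`):
    # subprocess.check_call(["ceph", "osd", "set", "noup"])
    print("osd.%d marked out; now stop or repair the daemon on its host" % osd_id)

sideline_osd(12)  # hypothetical OSD id
```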
Slow requests may indicate slow clients
Ceph's internal health monitoring has a helpful indicator for whether client requests are being processed in a timely fashion. When things get out of control, Ceph will tell you that a number of requests have been in progress for more than 30 seconds (the default threshold), which means that some poor VM has been waiting for its read or write request for that long. Usually this indicates that your disks or some other component in your Ceph cluster is saturated. However, a part that may get ignored when debugging this are the clients: your Qemu instances talking to Ceph are part of the cluster, too. If those behave slowly, e.g. due to a saturated or broken NIC, requests may take a long time simply because the client started a write but did not manage to send its data.
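A quick way to rule the clients in or out is to glance at the NIC error and drop counters on the KVM hosts before staring at the OSDs for another hour. A small sketch reading the standard Linux sysfs counters (the interface name is an assumption):

```python
# Sketch: check a client host's NIC counters for errors and drops.
from pathlib import Path

def nic_counters(iface="eth0"):
    stats = Path("/sys/class/net") / iface / "statistics"
    for name in ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped"):
        value = int((stats / name).read_text())
        flag = "  <-- suspicious" if value else ""
        print("%s %s: %d%s" % (iface, name, value, flag))

if __name__ == "__main__":
    nic_counters()
```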
Use 10Gbit Ethernet
10Gbit networking is a lot more expensive than 1Gbit networking. Depending on the pricing you are used to (we try to buy reasonably priced gear without being cheap, and we avoid things tagged "enterprise"), 10G ports may cost 5-10 times (or even more) what you're used to with 1Gbit. 1Gbit will run fine for a while; however, the official recommendations include running 10Gbit, and if you can afford it, I'd spend the money in this order of decreasing priority and increasing cost (a rough traffic calculation follows after the list):
- 10Gbit on your interconnects if you have multiple racks/switches
- 10Gbit for connecting the OSD servers (frontend and backend combined)
- 10Gbit for connecting your clients
- 2 x 10Gbit for providing separate frontend/backend access to your OSDs
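To see why the OSD links deserve priority, here is the replication math as a tiny sketch; the client write rate is an assumed number:

```python
# Sketch: why the OSD (backend) network fills up first. Every client write
# is forwarded by the primary OSD to (replicas - 1) peers, and recovery or
# backfill traffic comes on top. Numbers below are assumptions.

def backend_replication_traffic(client_writes_mib_s, replicas=3):
    return client_writes_mib_s * (replicas - 1)

one_gbit_mib_s = 1000 / 8          # ~125 MiB/s usable at the very best
client_writes = 80                 # assumed aggregate client writes in MiB/s
print("~%d MiB/s of replication traffic on a link that tops out around %d MiB/s"
      % (backend_replication_traffic(client_writes), one_gbit_mib_s))
```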
Metrics with collectd + influxdb + grafana
To reason about your cluster's performance it's important to quickly see various parameters at their current and past values. The simplest thing you can run is "watch ceph -s", which gives you the most basic statistics. However, we found it a huge improvement to gather metrics with collectd's ceph plugin (written by Ceph developers, with an almost insane amount of data points), store them in InfluxDB, and have ready-made dashboards in Grafana. Whenever I want to quickly assess the situation I look at this dashboard: [gallery ids="3174,3175" type="slideshow" link="none"] I also use it as a companion whenever I run maintenance activities, to quickly see whether the situation is getting better or worse by my doing.
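If Grafana is not at hand, you can also pull numbers straight from InfluxDB's HTTP API. A sketch for illustration only: the host, database, and measurement names below are made up and depend entirely on how your collectd data lands in InfluxDB:

```python
# Sketch: query recent Ceph latency data from InfluxDB (1.x HTTP API).
# Host, database and measurement names are placeholders for illustration.
import json
import urllib.parse
import urllib.request

INFLUX_URL = "http://influxdb.example.com:8086/query"        # hypothetical host
QUERY = ('SELECT mean("value") FROM "ceph_osd_latency" '      # hypothetical measurement
         'WHERE time > now() - 15m GROUP BY time(1m)')

params = urllib.parse.urlencode({"db": "ceph", "q": QUERY})
with urllib.request.urlopen(INFLUX_URL + "?" + params) as response:
    print(json.dumps(json.load(response), indent=2))
```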
Next up: SSD-based VM pool and 10Gbit KVM access
These days we're busy adding an SSD-only storage tier to our cluster so that customers can choose substantially higher performance for individual VMs by placing them on SSDs. Additionally, we're rolling out the first KVM hosts with 10Gbit access to our storage network. Together, both will bring much improved performance, as I've seen in our development environment: more than 5k IOs read/write and 600 MiB/s of IO bandwidth for a single VM disk, without our cluster breaking too much of a sweat. A great next step. We're also looking forward to deploying the next long-term release, named "Jewel", which will bring a single-queue approach for scheduling OSD activity. That should make overall performance much more predictable when the cluster is busy with client IO, maintenance activities, and recovery at the same time.
Conclusion
I hope this long read was worth your time. I want to explicitly state that we're happy running Ceph: our first impression back in 2013 was that it was properly designed from the beginning (PhD thesis, yeah!) and the community and the traction are there to grow it and keep investing in it based on the community's experiences. Thanks to everyone working hard on a true open solution: should one of you Ceph guys ever be around the Leipzig or Stuttgart area in Germany, shoot us an email and we'll gladly buy beer, whisky, or any other beverage that you might fancy. EDIT 2016-10-14: A commenter asked for our dashboard configuration. Here's an export from our Grafana instance. It will likely not work out of the box, as it depends on various transformations to achieve useful metric names and tags in InfluxDB, but maybe it helps to inspire you for your own dashboard.