Improving Ceph OSD start-up behaviour with vmtouch
We have a love/hate relationship with Ceph. On one hand, it is probably the best open-source distributed storage around. On the other hand, it repeatedly exhibits unexpected behaviour under high load. Since you rightly expect Flying Circus VMs to perform evenly, this is something we keep revisiting. In the following article, I will describe an improvement we have applied to a common pain point: I/O hangs during OSD restarts.
Restarting an OSD (Object Storage Daemon) places additional load on its backing disks. Business growth at the Flying Circus has led to increasing storage I/O demand. While growth is generally a good thing, it has brought our main Ceph cluster near its throughput limit several times. This is dangerous: the storage cluster runs fine as long as nothing special happens, but once something unusual does happen, it goes over the tipping point and performance becomes shaky. We are attacking the problem of insufficient headroom from several sides at once. The first is to upgrade hardware continuously so that the storage cluster keeps up with demand. In addition, it pays to defuse situations which are likely to turn a highly loaded cluster into an overloaded one. I will focus on OSD restarts here; other conditions, like all VMs producing I/O load spikes at the same time, are worth an article of their own.
Problem anatomy
OSD restarts are so critical because, in my opinion, there is a weakness in Ceph's design. Let me explain. A Ceph cluster has a number of Object Storage Daemons. OSDs manage access to replicated on-disk objects. Disk objects are pooled in Placement Groups (PGs) that share the same distribution and replication properties (if this is all confusing to you, Ceph's architecture overview would be a good read). When an OSD goes down, it gets marked "out" by the cluster monitors, and I/O requests are serviced by other OSDs that hold copies of the affected PGs. When the OSD restarts, it gets marked "in" again and now needs to check all of its PGs for updates it may have missed during its downtime.

Here comes the problem: a newly started OSD hits the disk hard, because it checks PGs for staleness, recovers from missed updates, and receives client requests (it has already been marked "in") all at the same time. If there is little headroom in I/O throughput to begin with, the disks cannot keep up and performance starts to drop. There have even been reports on the mailing list that, in extreme cases, this effect can bring a whole cluster down: an OSD becomes so unresponsive that it gets marked "out" by the cluster monitors even though it is still running; once client requests are routed elsewhere, it starts to respond again and gets marked "in". Note that other, formerly unaffected OSDs also kick off increased disk activity as part of the recovery process. If a cluster is near its I/O limit, this makes further OSDs unresponsive as well. The result is a cluster whose OSDs are repeatedly switching between "in" and "out" states. A cluster in such a state is unable to service client requests and may not even recover without administrator intervention. Luckily, we have never experienced a situation like this. But even in its weaker form, I/O hangs on OSD restarts affect VM performance and should at least be reduced, if not avoided.

What can we do about this? A Ceph developer confirmed that the design should be fixed. But is there anything we can do to reduce the impact in the meantime? The obvious thing is to make sure a storage cluster always has an I/O throughput reserve. Another vector is to reduce I/O contention during the critical phase when a freshly started OSD is simultaneously checking PGs and receiving client traffic.
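As an aside, the state changes described above can be watched with the standard Ceph command-line tools. The following calls are shown purely as an illustration of where to look, not as part of the fix:

# Standard Ceph CLI calls for watching OSD and PG states
# (output details vary between Ceph releases):
ceph -s         # overall cluster status, including peering/degraded PG counts
ceph osd stat   # how many OSDs are currently up and in
ceph pg stat    # summary of PG states, e.g. the number of peering PGs
ceph -w         # follow the cluster log to watch OSDs being marked out and in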
Reducing disk load during OSD starts
OSDs show predictable disk access patterns during start-up. We can use knowledge of these access patterns to read files into the kernel cache in advance, so they no longer have to be fetched from disk during the critical phase. This means fewer seeks and better performance. We examined OSD behaviour with Brendan Gregg's opensnoop.
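The trace below was captured with an invocation roughly like the following; this is a sketch assuming the perf-tools variant of opensnoop, where a plain argument filters on filenames containing that string (other variants use different options):

# Trace open() calls whose filename contains "ceph-7" (perf-tools opensnoop).
./opensnoop ceph-7

A typical OSD start-up looks like this: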
Tracing open()s issued for filenames containing "ceph-7".
osd 0x4 /srv/ceph/osd/ceph-7/magic
osd 0x4 /srv/ceph/osd/ceph-7/whoami
osd 0x4 /srv/ceph/osd/ceph-7/ceph_fsid
osd 0x4 /srv/ceph/osd/ceph-7/fsid
osd 0xb /srv/ceph/osd/ceph-7/fsid
osd 0xb /srv/ceph/osd/ceph-7/fsid
osd 0xc /srv/ceph/osd/ceph-7/store_version
osd 0xc /srv/ceph/osd/ceph-7/superblock
osd 0xc /srv/ceph/osd/ceph-7
osd 0xd /srv/ceph/osd/ceph-7/fiemap_test
osd 0xd /srv/ceph/osd/ceph-7/xattr_test
osd 0xd /srv/ceph/osd/ceph-7/current
osd 0xe /srv/ceph/osd/ceph-7/current/commit_op_seq
osd 0xf /srv/ceph/osd/ceph-7/current/omap/LOG
osd 0x10 /srv/ceph/osd/ceph-7/current/omap/LOCK
osd 0x11 /srv/ceph/osd/ceph-7/current/omap/CURRENT
osd 0x11 /srv/ceph/osd/ceph-7/current/omap/MANIFEST-040577
At first, the OSD reads various metadata files; nothing special here. But later on, the trace becomes more interesting:
osd 0x13 /srv/ceph/osd/ceph-7/current/omap/040588.ldb
osd 0x13 /srv/ceph/osd/ceph-7/current/omap/040580.ldb
osd 0x13 /srv/ceph/osd/ceph-7/current/omap/040225.ldb
osd 0x16 /srv/ceph/osd/ceph-7/current/410.cf_head
osd 0x16 /srv/ceph/osd/ceph-7/current/482.b9_head
osd 0x16 /srv/ceph/osd/ceph-7/current/482.90_head
Our OSD first opens files belonging to the Object Map database (omap/*.ldb). The Object Map keeps a record of which storage objects are located where in the cluster. Afterwards, the named objects are opened (*_head) and reconciled with other replicas. This pattern repeats for a while: first omap files are opened, then the corresponding objects:
osd 0x16 /srv/ceph/osd/ceph-7/current/omap/040228.ldb
osd 0x16 /srv/ceph/osd/ceph-7/current/410.1e8_head
osd 0x16 /srv/ceph/osd/ceph-7/current/482.122_head
osd 0x16 /srv/ceph/osd/ceph-7/current/410.1ea_head
[...]
osd 0x16 /srv/ceph/osd/ceph-7/current/omap/040229.ldb
osd 0x16 /srv/ceph/osd/ceph-7/current/410.266_head
osd 0x16 /srv/ceph/osd/ceph-7/current/410.395_head
osd 0x16 /srv/ceph/osd/ceph-7/current/610.41_head
[...]
osd 0x16 /srv/ceph/osd/ceph-7/current/omap/040230.ldb
osd 0x16 /srv/ceph/osd/ceph-7/current/410.26b_head
osd 0x16 /srv/ceph/osd/ceph-7/current/410.31c_head
osd 0x16 /srv/ceph/osd/ceph-7/current/482.e1_head
[...]
We cannot predict which objects need to be read, but we do know that all omap database files will have to be opened sooner or later. Enter vmtouch. This little utility reads files into the kernel cache and locks them in memory, so that subsequent I/O operations are served from the cache. This is exactly what we need here: we lock all omap database files into memory before starting an OSD (a sketch of how this can be wired up follows after the trace below). Now the access pattern looks like this:
vmtouch 0x6 /srv/ceph/osd/ceph-7/current/omap/040424.ldb
vmtouch 0x7 /srv/ceph/osd/ceph-7/current/omap/040558.ldb
vmtouch 0x9 /srv/ceph/osd/ceph-7/current/omap/040559.ldb
vmtouch 0xb /srv/ceph/osd/ceph-7/current/omap/040120.ldb
vmtouch 0xc /srv/ceph/osd/ceph-7/current/omap/040102.ldb
[...]
osd 0x16 /srv/ceph/osd/ceph-7/current/410.cf_head
osd 0x16 /srv/ceph/osd/ceph-7/current/482.b9_head
osd 0x16 /srv/ceph/osd/ceph-7/current/482.90_head
osd 0x16 /srv/ceph/osd/ceph-7/current/482.9e_head
osd 0x16 /srv/ceph/osd/ceph-7/current/410.f_head
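The prefetching step itself boils down to a single vmtouch call issued before the OSD daemon is started. The following is only a sketch, not our actual start-up script; the path follows the example OSD above, and -l and -d are vmtouch's documented options for locking and daemonizing:

# Fault the omap database of OSD 7 into the page cache and pin it there.
# -l locks the mapped files with mlock(2), which also reads them in;
# -d leaves a daemonized vmtouch running so the lock outlives this shell.
vmtouch -l -d /srv/ceph/osd/ceph-7/current/omap/
# ...then start the OSD as usual, e.g. via the init system.

Since the pages stay locked only as long as that vmtouch process lives, stopping the daemon once the OSD has settled releases the memory again.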
Disk seeks are reduced once the OSD has been marked "in". But is the effect large enough to make a difference?
Measurements
A good indicator to measure is how many PGs are in the so-called "peering" state, and for how long. Peering is an internal Ceph state for PGs that are in the process of being reconciled and have not yet determined how much recovery is needed; client requests to those PGs are temporarily not serviced. Ideally, peering passes so quickly that it is hardly noticeable. But if a cluster is overloaded, peering states persist and requests stall. To verify that our vmtouch trick reduces peering time, I stressed a lab cluster with artificial load slightly below its I/O throughput limit. Then I turned an OSD off, waited for a minute so that it missed a certain amount of writes, and turned it back on. From the logs, I plotted the number of PGs in peering state over time. The impact score is the integral under that curve, i.e. the sum of the number of peering PGs multiplied by the peering durations (sketched below). In the first experiment, an OSD is restarted the normal way (without vmtouch). For more than 30 seconds, quite a number of PGs are in peering state and cause the storage cluster to appear slow. In the second experiment, I use vmtouch to prefetch the omap database files.
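For completeness, the impact score is nothing more than a discrete integral over the log data. A throwaway sketch of that computation, assuming a hypothetical two-column extract peering.tsv with a Unix timestamp and the number of peering PGs per sample (not the format of our actual tooling):

# Sum (number of peering PGs) x (time until the next sample), in PG-seconds.
awk 'NR > 1 { score += prev_n * ($1 - prev_t) }
     { prev_t = $1; prev_n = $2 }
     END { printf "impact score: %d PG-seconds\n", score }' peering.tsv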
While peering states have not vanished completely, we see greatly improved behaviour. Peering starts a bit earlier (omap files are already loaded) and the impact score is more than an order of magnitude smaller. I think this is quite an impressive result.
Conclusion
Preloading files along a known access pattern helps quite a bit. While it is not a complete solution for making our Ceph cluster more robust, it takes the edge off a common pain point. In the long run, we would like to see Ceph's design improved so that the underlying cause goes away.