How we are improving our infrastructure for better reliability and scalability
In 2016 we saw substantial growth in our infrastructure after we changed our pricing model to make resources more affordable and introduced NixOS-based 64-bit virtual machines. Unfortunately, we failed to adapt our infrastructure to this growth in good time and thus, ultimately, failed to provide our customers with the reliability and performance that they rightfully expect.
We have taken the time to review all the incidents we experienced this year and decided to invest further in two critical areas of our infrastructure: networking and storage.
Networking
Earlier this year, after a long period of research, we discovered Brocade's VDX offering. Taking advantage of their flexible subscription option, we replaced our transitional 10 Gigabit infrastructure with Brocade VDX 6740T switches, which have been working fantastically for about two months now. However, we are still running a 1GE infrastructure based on HP ProCurve switches for some of our VLANs. This situation currently has multiple drawbacks:
- we do not yet have full active-active redundancy for all servers and all networks
- we still have a mixed-vendor environment that has its own risks (as we have seen with our outages earlier this year)
- we do not benefit from the reliability features that Brocade offers (like persistent logging, full meshing)
- we require too many cables per server and have a much too complicated switch configuration
Our plan for the next weeks and months is to:
- provide two redundant 10GE Brocade switches per rack, so that every server gets redundant access to the network
- remove existing 1GE connections
- move all our networks to tagged VLAN configurations
For you as a customer, this will be visible as a faster network in all VMs with a much lower risk of incidents (due to component failure or operating errors) than before.
Storage
Our storage has not been able to provide the performance and resilience that we want to deliver to you. Specifically, it struggled to keep up with our increased load. We have learned several things that shaped our new roadmap:
- For Ceph to show its strength in horizontal scaling, we need not only many disks but also more servers, to reduce the impact of an individual server failure.
- Filling our existing HDD pool according to its storage capacity has dramatically lowered the IOPS available per customer. We need to provide a lot more IOPS, manage them more systematically, and communicate more transparently what can be expected.
- With the advent of high-capacity, medium-endurance SSD technology, we are finally at a point where HDD technology is turning from the mainstream default choice into a niche solution for low-performance, high-capacity tasks like test environments, archives, and backups.
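To illustrate why the number of servers matters, here is a rough back-of-the-envelope sketch. It assumes that data and load are spread evenly across identical storage hosts, which is a simplification, and the function name is purely illustrative. The host counts 6 and 10 are used as example sizes.

```python
def host_failure_impact(hosts: int) -> tuple[float, float]:
    """Estimate the effect of losing one storage host.

    Returns (fraction of cluster capacity lost,
             relative extra load on each surviving host).
    Assumes identical hosts with evenly distributed data and load.
    """
    if hosts < 2:
        raise ValueError("need at least two hosts to survive a failure")
    lost = 1 / hosts          # share of disks that disappear
    extra = 1 / (hosts - 1)   # survivors split the failed host's share
    return lost, extra


for n in (6, 10):
    lost, extra = host_failure_impact(n)
    print(f"{n} hosts: {lost:.1%} of capacity lost, "
          f"{extra:.1%} extra load per surviving host")
```

Under these assumptions, a single failure among 6 hosts removes about a sixth of the cluster and adds 20% load to each survivor, while among 10 hosts it removes a tenth and adds roughly 11%.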
Our next steps for our storage cluster are:
- Add more capacity and IOPS to our existing cluster by extending the HDD pool with SSDs.
- Grow our cluster from 6 storage hosts to 10 to further improve available IOPS and reduce the impact of a host outage.
- Move from n+1 redundancy to n+2 redundancy to allow for more complex failure scenarios.
- Introduce stricter IOPS management for VMs and communicate the rules around it. We have already started to add limits defensively to counter the worst impact. At the moment we are discussing options that will allow you to choose from different storage classes with different performance characteristics. Very likely we will tie the absolute IOPS limit of each VM to the total storage size that it uses. This reflects the physical reality of how IOPS can be added to a cluster.
- Update our Ceph installation to the next long-term supported version, "Jewel", which will have much more controlled IOPS behaviour to avoid negative impact from maintenance operations on customer traffic.
- Implement more fine-grained and much stricter capacity management in our inventory system that reflects cluster status, effective usage, and booked capacities, to avoid over-subscription, which results in undetected loss of redundancy and in performance penalties.
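As a minimal sketch of two of the ideas above, the following shows how an IOPS limit could scale with a VM's storage size and how booked capacity could be checked against usable capacity. All names and numbers here (`IOPS_PER_GIB`, `MIN_IOPS`, `Pool`, the number of copies) are hypothetical placeholders for illustration, not our actual limits or inventory model. It also assumes simple replication, where each piece of data is stored as a fixed number of full copies.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only, not actual limits.
IOPS_PER_GIB = 10   # IOPS granted per GiB of storage a VM uses
MIN_IOPS = 250      # floor so that very small VMs remain usable


def iops_limit(disk_gib: int) -> int:
    """IOPS limit that grows linearly with the VM's storage size."""
    if disk_gib < 0:
        raise ValueError("disk size must not be negative")
    return max(MIN_IOPS, disk_gib * IOPS_PER_GIB)


@dataclass
class Pool:
    raw_gib: int      # total raw disk capacity in the pool
    copies: int       # full copies kept of each piece of data (assumed model)
    booked_gib: int   # capacity promised to VMs

    def usable_gib(self) -> int:
        # Every stored GiB consumes `copies` GiB of raw capacity.
        return self.raw_gib // self.copies

    def oversubscribed(self) -> bool:
        # More booked than can be stored at the intended redundancy.
        return self.booked_gib > self.usable_gib()


print(iops_limit(10), iops_limit(100))
print(Pool(raw_gib=3000, copies=3, booked_gib=1100).oversubscribed())
```

Tying the limit to size means that adding disks to serve a larger VM also adds the IOPS that VM is allowed to consume, and the over-subscription check flags a pool before redundancy has to be silently reduced.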
These measures will be implemented in the coming days, weeks, and months and will ultimately lead to a more robust, more reliable, and more scalable infrastructure. We will announce individual steps that require maintenance on our status page. We are also interested in hearing your feedback, specifically about your performance and capacity requirements. Let us know by sending an email to support@flyingcircus.io.