Disaster recovery

This disaster recovery plan provides an overview of potential disasters and how the Flying Circus systems and personnel are prepared to deal with them.

For each scenario we give:

  • the measures we take to prevent the scenario

  • the recovery action

  • the recovery time objective and recovery point objective

Terminology

RTO

Recovery time objective. The planned time needed between discovering a disaster and restoring the service.

RPO

Recovery point objective. The point in time to which data will be available after recovery. Given as “time before the disaster”.

Note

If recovery actions are neither self-service nor automatic, a 1-hour response time is included to notify the standby support technician.

Hardware errors

Loss of active network component

Disaster prevention

We deploy hot-standby routers and hot-standby switches.

Disaster recovery

Swap faulty component with standby component. This happens automatically for routers and manually for switches.

Depending on the affected services, redundancy in higher-level components (storage, virtualisation) may allow faster recovery times.

RTO for hot-standby routers: less than 15 seconds

RTO for switch port failures or complete failures: 1 hour

RPO: n/a

Loss of VM server

Disaster prevention

We buy professional hardware. We use redundant power supplies. OS disks are not made redundant: their failure does not impact VM operations, and affected hosts will be evacuated if needed.

Disaster recovery

Migrate or restart virtual machines from the failed host on spare hosts.
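The exact orchestration is platform-specific; as an illustration only, the following sketch restarts a VM on a spare host using the libvirt Python bindings, assuming KVM/libvirt hosts whose VM disks live on the shared storage cluster (host URI, VM definition path, and names are hypothetical placeholders):

    # Illustration only: host URI and XML path are hypothetical placeholders.
    # Assumes KVM/libvirt hosts with VM disks on the shared storage cluster, so a
    # VM from a failed host can simply be defined and started on a spare host.
    import libvirt

    SPARE_HOST = "qemu+ssh://spare-host.example/system"  # hypothetical spare host
    DOMAIN_XML = "/srv/vm-definitions/example-vm.xml"    # hypothetical saved VM definition

    def restart_on_spare(uri: str, xml_path: str) -> None:
        with open(xml_path) as f:
            xml = f.read()
        conn = libvirt.open(uri)
        dom = conn.defineXML(xml)  # register the VM on the spare host
        dom.create()               # boot it; disks are reachable via the storage cluster
        conn.close()

    if __name__ == "__main__":
        restart_on_spare(SPARE_HOST, DOMAIN_XML)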

RTO: within customer-specific SLA + 15 minutes

RPO: 0

Loss of storage servers covered by redundancy

Disaster prevention

We store all virtual machine images on a distributed storage system (Ceph) with n+2 redundancy. Loss of a single server can be masked transparently.

We can lose multiple storage servers, depending on the capacity of our cluster. We expect to be able to lose at least 2 servers in total without impacting service or data availability. A simultaneous failure of 2 servers may cause intermittent service outages while recovering to an n+1 redundancy state.
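In Ceph terms, the n+2 redundancy described above corresponds to replicated (not erasure-coded) pools keeping three copies of every object. As a hedged illustration only, assuming the standard ceph CLI and admin access, a small sketch to verify that all pools still carry the expected replica count:

    # Illustration only: checks replica counts via the ceph CLI (assumes admin access).
    # size >= 3 corresponds to the n+2 redundancy described above.
    import json
    import subprocess

    def pool_sizes() -> dict:
        pools = json.loads(subprocess.check_output(
            ["ceph", "osd", "pool", "ls", "--format", "json"]))
        return {
            pool: json.loads(subprocess.check_output(
                ["ceph", "osd", "pool", "get", pool, "size", "--format", "json"]))["size"]
            for pool in pools
        }

    if __name__ == "__main__":
        for pool, size in pool_sizes().items():
            print(f"{pool}: size={size}" + ("" if size >= 3 else "  <-- below n+2"))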

Disaster recovery

Ceph performs automatic recovery. Reduced I/O performance may be experienced during this period on virtual machines.
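Recovery progress can be followed with the regular Ceph status output; a minimal monitoring sketch, again assuming admin access to the cluster (the exact JSON layout varies slightly between Ceph releases):

    # Illustration only: polls "ceph status" until the cluster reports HEALTH_OK again.
    import json
    import subprocess
    import time

    def wait_for_recovery(poll_seconds: int = 30) -> None:
        while True:
            status = json.loads(subprocess.check_output(
                ["ceph", "status", "--format", "json"]))
            health = status["health"]["status"]            # e.g. HEALTH_WARN during recovery
            degraded = status["pgmap"].get("degraded_objects", 0)
            print(f"health={health} degraded_objects={degraded}")
            if health == "HEALTH_OK":
                return
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        wait_for_recovery()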

RTO: 5 minutes

RPO: 0

Loss of storage servers exceeding redundancy

Disaster prevention

This is a multi-layered issue. A loss of redundancy beyond the automatic repair abilities requires specific manual diagnostics and decision-making.

Customers wanting to exceed this may choose to keep an offsite backup as well as an emergency operations setup with our secondary data center.

Disaster recovery

Restore virtual machines from backup.

RTO: 4 hours + 5 hours per TiB of VM storage

RPO: 24 hours / 1 hour [1]
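To make the RTO formula above concrete, a small worked example (4 hours base plus 5 hours per TiB of VM storage to restore):

    # Worked example of the RTO stated above: 4 h base + 5 h per TiB restored.
    def restore_rto_hours(vm_storage_tib: float) -> float:
        return 4 + 5 * vm_storage_tib

    if __name__ == "__main__":
        for tib in (0.5, 1, 2):
            print(f"{tib} TiB -> {restore_rto_hours(tib)} h")  # 6.5, 9 and 14 hours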

Loss of server rack

Disaster prevention

The most likely scenario for losing a server rack is overheating and fire. We thus pack racks loosely to balance density and airflow and avoid overheating. Also, the data center operator employs a smoke detection system that allows for early detection and fire prevention.

Customers wanting to exceed this may choose to keep an offsite backup as well as an emergency operations setup with our secondary data center.

Disaster recovery

Buy and install new hardware and provision it in a new rack in the data center.

RTO: 2 weeks

RPO: not available

Force majeure

Loss of power in the data center

Disaster prevention

Require redundant power lines, UPS backup, and diesel generators in the data center.

Customers wanting to exceed this may choose to keep an offsite backup as well as an emergency operations setup with our secondary data center.

Disaster recovery

Data center personnel restore power.

RTO: n/a, covered by 3rd party 99.99% SLA

RPO: n/a

Loss of data center

Disaster prevention

Our data center implements a variety of security measures certified through the ISO 27000 family.

RZOB: http://www.kamp.de/kamp-rechenzentrum/sicherheit.html

Disaster recovery

Evaluate recovery of the data center, if possible together with the data center operator.

Alternatively, find a new data center and rebuild the infrastructure.

Customers wanting to exceed this may choose to keep an offsite backup as well as an emergency operations setup with our secondary data center.

RTO: n/a (24h for backup data center operations)

RPO: n/a (depending on backup frequency)

Software errors

Filesystem corruption

Disaster prevention

We use mature file systems in our storage cluster, in our backup solutions, and within the VMs to reduce the risk of inconsistencies under failure scenarios.

Disaster recovery

Restore the file system or missing files from backups; recreate backups in case of file system errors on the backup systems.

RTO: depends on SLA [2]

RPO: 1 day/1 hour [1]

Configuration errors

Disaster prevention

Leverage automated, repeatable, and version-controlled configuration systems.

Disaster recovery

Roll back configuration changes and restore backups if data is lost.
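With configuration kept in version control, rolling back typically means reverting the offending commit. A minimal sketch, assuming a git-managed configuration repository (repository path and commit id are hypothetical placeholders):

    # Illustration only: reverts a faulty change in a git-managed configuration repository.
    # Repository path and commit id are hypothetical; the actual configuration system may differ.
    import subprocess

    CONFIG_REPO = "/etc/config-repo"  # hypothetical checkout of the configuration
    BAD_COMMIT = "abc1234"            # placeholder id of the faulty commit

    def roll_back(repo: str, commit: str) -> None:
        # Creates a new commit that undoes the faulty change, keeping history intact.
        subprocess.check_call(["git", "-C", repo, "revert", "--no-edit", commit])

    if __name__ == "__main__":
        roll_back(CONFIG_REPO, BAD_COMMIT)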

RTO: depends on SLA [2]

RPO for reversible configuration changes: 4 hours

RPO for restore: 1 day/1 hour [1]

Application errors

Disaster prevention

Leverage automated, repeatable, and version-controlled application deployment. Leverage fully separated test/staging/production environments.

Disaster recovery

Re-install application and restore backups if data is lost.

RTO: depends on SLA [2]

RPO for reinstallation: 4 hours

RPO for restore: 1 day/1 hour [1]

User errors

Accidental single file deletion

Disaster prevention

Performing backups.

Disaster recovery

Restore deleted file from backup.
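The exact procedure depends on the backup system in use; as a generic sketch only, assuming the relevant backup is available as a read-only snapshot directory mounted on the host (all paths are hypothetical placeholders):

    # Generic illustration only: copies a deleted file back from a mounted backup snapshot.
    # All paths are hypothetical; the platform's actual backup tooling is not shown here.
    import shutil
    from pathlib import Path

    SNAPSHOT_ROOT = Path("/mnt/backup/daily-latest")  # hypothetical snapshot mount point
    TARGET_ROOT = Path("/")

    def restore_file(relative_path: str) -> None:
        source = SNAPSHOT_ROOT / relative_path
        target = TARGET_ROOT / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # preserves timestamps and permission bits
        print(f"restored {target} from {source}")

    if __name__ == "__main__":
        restore_file("srv/www/index.html")  # hypothetical example file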

RTO: depends on SLA [2]

RPO: 1 day/1 hour [1]

Accidental database/directory tree deletion

Disaster prevention

Restricting root access and performing backups.

Disaster recovery

Restore deleted files from backup.

RTO: depends on SLA [2]

RPO: 1 day/1 hour [1]