Any system that serves the public or the enterprise these days must be built to expect the unexpected. No system is perfect, and at some point something will happen that renders a system inoperative – a fire, a hurricane, an earthquake, human error – the list goes on. Because systems can fail in so many different ways, they need to be designed with the expectation that failure will occur.

There are two related, but often confused, topics in system architecture that mitigate failure: high availability (HA) and disaster recovery (DR). High availability, simply put, is eliminating single points of failure; disaster recovery is the process of getting a system back to an operational state after it has been rendered inoperative. In essence, disaster recovery picks up where high availability fails, so let’s look at HA first.

High Availability

As mentioned, high availability is about eliminating single points of failure, so it implies redundancy. There are basically three kinds of redundancy implemented in most systems: hardware, software, and environmental.

Hardware redundancy was one of the first ways that high availability was introduced into computing. Before most apps were connected to the internet, they served enterprises on a LAN. These servers didn’t need the scale that modern applications do, where there may be thousands of simultaneous connections with 24/7 demand. These applications did, however, supply business-critical data, so they needed hardware that was fault tolerant. Single points of failure were eliminated by manufacturers building servers that had:

  • Redundant storage with RAID or similar technology, which ensured that data was written to and read from multiple physical disks. This prevented both data loss and downtime (a minimal mirroring sketch follows this list).
  • Redundant power, typically in the form of multiple power supplies, enabled admins to connect servers to independent power sources so servers could remain powered on if there was a power loss from one source.
  • Error correction, such as ECC RAM, that enabled memory errors to be detected and corrected before they led to data corruption.
  • Redundant networking, such as multiple NICs connected to independent networks, to ensure that a server remained reachable in the event of a network failure.
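
To make the storage bullet concrete, here is a purely illustrative Python sketch of RAID-1-style mirroring: every write goes to every disk in the set, and a read can be served from any surviving copy, so the loss of one device causes neither data loss nor downtime. The `Disk` class is a hypothetical stand-in for a real block device.

```python
class Disk:
    """Hypothetical stand-in for a physical disk (here just an in-memory dict)."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_id, data):
        if self.failed:
            raise IOError(f"{self.name} is offline")
        self.blocks[block_id] = data

    def read(self, block_id):
        if self.failed:
            raise IOError(f"{self.name} is offline")
        return self.blocks[block_id]


class MirroredVolume:
    """RAID-1-style mirror: writes go to every disk, reads fall back to any healthy copy."""
    def __init__(self, disks):
        self.disks = disks

    def write(self, block_id, data):
        for disk in self.disks:
            if not disk.failed:
                disk.write(block_id, data)

    def read(self, block_id):
        for disk in self.disks:
            try:
                return disk.read(block_id)
            except IOError:
                continue  # try the next mirror
        raise IOError("all mirrors failed")


volume = MirroredVolume([Disk("disk0"), Disk("disk1")])
volume.write(0, b"business-critical data")
volume.disks[0].failed = True   # simulate a disk failure
print(volume.read(0))           # still served from the surviving mirror
```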

Software redundancy soon followed suit. Application designers worked to ensure that applications themselves could tolerate failures in a system, be it hardware failure, configuration error, or any number of other problems that could take down part of the software. A few of the ways this has been accomplished include:

  • Clustering technologies, such as database clusters, that spread workloads across multiple servers.
  • Statelessness in applications for rapid scaling and easy-to-configure high availability.
  • Load balancing with application monitoring by way of health probes, which allows incoming requests to be routed only to healthy application nodes and raises events so failures can be handled proactively (see the sketch after this list).
  • Self-healing systems that move workloads around or allocate additional capacity when failures are detected.
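
As a rough illustration of the load-balancing bullet above, the following Python sketch routes requests round-robin across a pool of nodes but skips any node whose health probe fails. The `probe` callables here are hypothetical; in a real system they would be HTTP health-check endpoints or similar.

```python
import itertools

class Node:
    def __init__(self, name, probe):
        self.name = name
        self.probe = probe          # callable returning True if the node is healthy

    def is_healthy(self):
        try:
            return self.probe()
        except Exception:
            return False            # a probe error counts as unhealthy


class LoadBalancer:
    """Round-robin load balancer that only routes to nodes passing their health probe."""
    def __init__(self, nodes):
        self.nodes = nodes
        self._cycle = itertools.cycle(nodes)

    def route(self, request):
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node.is_healthy():
                return f"{node.name} handled {request!r}"
        raise RuntimeError("no healthy nodes available")  # raise an event / alert an operator


# Hypothetical probes: node 'b' is currently failing its health check.
lb = LoadBalancer([Node("a", lambda: True), Node("b", lambda: False), Node("c", lambda: True)])
for i in range(4):
    print(lb.route(f"request-{i}"))  # requests only reach nodes 'a' and 'c'
```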

With the rise of cloud computing, cloud providers have taken high availability to a whole new level, adding large-scale environmental redundancy with:

  • Hardware redundancy at the server-rack level within a data center, with discrete networking, power, and storage, which allows users to spread workloads across racks to mitigate single points of failure. Azure calls these “fault domains”.
  • Data center redundancy within a geographic region, typically referred to as “availability zones”, which allows users to run applications in separate data centers that are located geographically close to one another (a placement sketch follows this list).
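
The Python sketch below shows the basic placement idea behind both bullets: instances are spread evenly across a set of failure domains (racks, or availability zones), so losing any single domain takes out only a fraction of the capacity. The zone names are hypothetical.

```python
from collections import defaultdict

def spread(instances, domains):
    """Assign instances round-robin across failure domains (fault domains or availability zones)."""
    placement = defaultdict(list)
    for i, instance in enumerate(instances):
        placement[domains[i % len(domains)]].append(instance)
    return dict(placement)

# Hypothetical zone names; the same idea applies to fault domains within a data center.
zones = ["zone-1", "zone-2", "zone-3"]
placement = spread([f"web-{n}" for n in range(6)], zones)
print(placement)
# Losing any one zone leaves two thirds of the web tier still serving traffic.
```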

All these domains (hardware, software, and environmental) seek to solve the same basic problem by eliminating single points of failure. The result is high service level agreements (SLAs), often 99.99% or better, which limit unplanned downtime to less than roughly 10 seconds in a given 24-hour period.
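
The downtime figure follows directly from the SLA percentage; a quick calculation in Python for a hypothetical “four nines” SLA:

```python
sla = 0.9999                       # hypothetical 99.99% availability SLA
seconds_per_day = 24 * 60 * 60
allowed_downtime = seconds_per_day * (1 - sla)
print(f"{allowed_downtime:.2f} seconds of unplanned downtime per day")  # 8.64 seconds
```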

Disaster Recovery

Disaster recovery picks up where high availability fails. Disaster recovery can be as simple as restoring from a backup, but it can also be very complex, depending on two factors: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO).

A Recovery Time Objective is the maximum amount of time that a system can be down before it is recovered to an operational state. For some systems, the RTO can be measured in hours or even days, but for more mission-critical systems it is typically measured in seconds.

A Recovery Point Objective is the amount of data loss, measured in time, that is tolerable in a disaster. For some systems, losing a day’s worth of data might be acceptable, while for others the tolerance might be mere minutes or seconds. The length of RTOs and RPOs has profound implications for how disaster recovery plans are implemented.
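
As a simple way to reason about these two numbers, the following Python sketch checks a hypothetical recovery plan against RTO and RPO targets: the worst-case data loss is bounded by the interval between backups (or replication lag), and the worst-case downtime by the time it takes to detect the failure plus the time to restore service.

```python
def plan_meets_objectives(backup_interval_s, detect_s, restore_s, rpo_s, rto_s):
    """Worst-case data loss is one full backup interval; worst-case downtime is detection + restore."""
    worst_case_data_loss = backup_interval_s
    worst_case_downtime = detect_s + restore_s
    return worst_case_data_loss <= rpo_s and worst_case_downtime <= rto_s

# Hypothetical numbers: daily backups and a 4-hour restore, against a 1-hour RPO and 8-hour RTO.
print(plan_meets_objectives(backup_interval_s=86_400, detect_s=900, restore_s=14_400,
                            rpo_s=3_600, rto_s=28_800))  # False: daily backups blow the 1-hour RPO
```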

Short RTOs and RPOs require that a system implement active data replication between the primary and recovery systems (such as database log shipping) and maintain failover systems in a ready (“hot-hot”) or near-ready (“hot-warm”) state to take over in the event of a disaster. Likewise, the trigger for a disaster recovery failover is typically automated.
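
A bare-bones version of such an automated failover trigger might look like the Python sketch below: the primary’s health probe is polled, and after a few consecutive failures the warm standby is promoted. The `check_primary` and `promote_standby` callables are hypothetical hooks for whatever replication and traffic-routing tooling is actually in use.

```python
import time

def failover_monitor(check_primary, promote_standby, failures_to_trigger=3, interval_s=10):
    """Promote the standby after a run of consecutive failed health checks on the primary."""
    consecutive_failures = 0
    while True:
        healthy = False
        try:
            healthy = check_primary()
        except Exception:
            pass                                   # treat probe errors as failures
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failures_to_trigger:
            promote_standby()                      # flip traffic to the hot-warm replica
            return
        time.sleep(interval_s)

# Hypothetical usage (check_primary and promote_standby are placeholders):
# failover_monitor(check_primary=lambda: probe_db("db-primary"),
#                  promote_standby=lambda: promote("db-replica-1"))
```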

For longer RTOs and RPOs, restoring systems from daily backups might be enough to meet the objectives. These backups might cover application servers, databases, or both, and the restore process may be manual, automated, or a mix of the two. Whenever backups are used to restore systems to an operational state, this is typically referred to as a “hot-cold” configuration. In any case, recovering a hot-cold configuration takes significantly longer than hot-warm or hot-hot.

One of the biggest factors that prevents organizations from implementing high availability and short RTOs and RPOs is cost. Where HA is concerned, more redundancy requires more resources, which translates into higher costs. Similarly, short RTOs and RPOs require that standby capacity be available to handle a failover, which also translates into higher costs. There is always a balancing act between cost and system downtime, and sometimes the cost of HA, short RTOs, and short RPOs is not worth it for an application, while for others it is necessary no matter what the cost may be.

Fundamentally, high availability and disaster recovery are aimed at the same problem: keeping systems in an operational state. The main difference is that HA is intended to handle problems while a system is running, while DR is intended to handle problems after a system has failed. Regardless of how highly available a system is, though, any production application, no matter how trivial, needs at minimum some sort of disaster recovery plan in place.