Disaster recovery - freeCodeCamp.org

How to Use Playbooks to Execute an Incident Recovery Plan

David Clinton — Wed, 31 May 2023 01:28:09 +0000

A playbook is the official, formal written record that describes policies and processes that will reliably produce a working deployment of an organization's resource stack. When it comes to generating predictable results, the playbook is the plan.

I'll describe all the key elements of a good playbook in just a moment. But it's important to emphasize that a playbook on its own is more or less useless unless your team is able to read it and convert it into real-world results.

To do that you'll need to make sure every relevant member of your team is completely familiar with their roles and how they'll be expected to carry them out. That'll require you to distribute copies of the plan and ensure that everyone gets the training they'll need to perform perfectly when the time arrives.

This article comes from my Introduction to Cybersecurity course, which is also available in a video version.

How to Define Playbook Scope

At any rate, a good plan begins with clear definitions:

Where can you find up-to-date and clean copies of the source code?
Where should your production environment be hosted? In a public cloud like AWS? On-premises?
What is the infrastructure supposed to accomplish?
What's the scope of your operation: what scale of hardware resources will it require?

A playbook should also clearly define the policies that must be followed through the rebuilding process:

How is organizational data to be protected?
What decisions must be made only by senior company officers?
Are there restrictions on what software and third-party solutions can be used, or on the countries they can be acquired from?
Are there stack components that must remain local, or can everything live in the cloud?

How to Define the Tools

Perhaps the core of any playbook is the section addressing the software and deployment tools and procedures that you'll use at every stage of your workflow.

This section should include the complete code for the scripts handling moving resources from code to deployment, along with links to all the software code in use, and instructions for authenticating to the services you'll be using.

How to Define the Participants

IT deployments are performed by people. But which people?

Who do you speak to who has access to a credit card so you can purchase needed resources?
Who has access to the key codebases and online accounts you'll need?
Who's responsible for testing and signing off before code is pushed to production?
What if that person isn't available?

Each and every role relating to the project you're documenting needs to be defined, and the person responsible must be identified – along with current contact information.

Beyond operational contacts, the playbook should also include a complete company communications directory. If you're paying someone a paycheck each month, the odds are they'll be expected to perform some important function during a recovery. So you'll want a reliable source for contact information – preferably containing multiple contact endpoints for each person.
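The idea of defined roles with fallbacks and multiple contact endpoints can be sketched as a small lookup. This is only an illustration: the role title, names, phone numbers, and the `available` flag are hypothetical placeholders, not a real directory format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    endpoints: list[str]      # several ways to reach the same person
    available: bool = True    # hypothetical flag, updated during an incident


@dataclass
class Role:
    title: str
    # Ordered from the primary owner down to the last-resort fallback.
    contacts: list[Contact] = field(default_factory=list)


def first_available(role: Role) -> Contact | None:
    """Return the first reachable person for a role, or None if nobody is."""
    for contact in role.contacts:
        if contact.available:
            return contact
    return None


# Placeholder data for illustration only.
release_signoff = Role(
    title="Production sign-off",
    contacts=[
        Contact("Primary Owner", ["+1-555-0100", "primary@example.com"], available=False),
        Contact("Backup Owner", ["+1-555-0101", "backup@example.com"]),
    ],
)

person = first_available(release_signoff)
print(person.name if person else "No one available: escalate")
```

Keeping the contacts ordered answers the "what if that person isn't available?" question directly: the next entry in the list is the fallback.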

How to Document Your Recovery

Recovery operations can be chaotic. Even so, it's critically important to keep log records for every step: before, during, and after recovery. So log generation and storage should also be part of your playbook.

Even if you don't have the time to read them right now, they'll be invaluable later as you try to review events and figure out exactly what happened. The existence of accurate and reliable logs and other records might actually be legally mandated.

Any code review and application testing you would normally incorporate in your deployment lifecycles should be included in your recovery playbook. After all, bugs and failures aren't going to be any more fun after a crisis than they were before it. The actual code for all the scripts that would normally power your testing should be included here, too.

How to Keep Your Playbook Current

Finally, your playbook should be regularly updated to reflect changes to your application and its supporting environment. Naturally, you want to keep all details up-to-date, including changes to the personnel responsible for specific roles, along with their correct contact information.

A complete playbook created for a relatively complex operation can easily run into the hundreds of pages. When you add the task of coordinating the actions of all the many individuals who will be involved in your recovery, the whole thing might feel a bit hard to manage. Unfortunately, you just have to do this: there's really no alternative.

How to Automate Your Recovery

Well, there's almost no alternative. Remember how I told you that you should include complete operations scripts and links to your code base in the playbook? Do you think our playbook could be convinced to play itself? Why not?

Think about it. Orchestration tools like Ansible or Terraform – or cloud-specific tools like Amazon's CloudFormation – allow you to very closely define every layer of your infrastructure in a format that can be invoked and launched with a single command.

In theory at least, there's no reason why you couldn't build your playbook as an actual script, complete with commands to pull software repos, launch complex virtual networks and compute instances, and route DNS domains. That would be a fantastic example of the power of infrastructure as code.
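A playbook-as-script could look roughly like the sketch below. It is a minimal illustration, not a production tool: the repository URL, the `site.yml` file name, and the step list are hypothetical placeholders, and the runner defaults to a dry run so nothing is executed by accident. It also keeps a timestamped log of each step, in line with the documentation requirement described earlier.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Step:
    name: str
    command: list[str]


# Placeholder steps: the repository URL and file names are hypothetical.
RECOVERY_STEPS = [
    Step("Fetch application code", ["git", "clone", "https://example.com/org/app.git"]),
    Step("Provision infrastructure", ["terraform", "apply", "-auto-approve"]),
    Step("Configure servers", ["ansible-playbook", "site.yml"]),
]


def run_playbook(steps, dry_run=True):
    """Run each step in order, stop at the first failure, and return a log."""
    log = []
    for step in steps:
        entry = {
            "step": step.name,
            "command": " ".join(step.command),
            "started": datetime.now(timezone.utc).isoformat(),
        }
        if dry_run:
            entry["status"] = "skipped (dry run)"
        else:
            try:
                result = subprocess.run(step.command, capture_output=True, text=True)
                entry["status"] = "ok" if result.returncode == 0 else "failed"
                entry["output"] = result.stdout + result.stderr
            except OSError as error:  # for example, the tool is not installed
                entry["status"] = "failed"
                entry["output"] = str(error)
        log.append(entry)
        if entry["status"] == "failed":
            break  # later steps depend on earlier ones
    return log


for record in run_playbook(RECOVERY_STEPS, dry_run=True):
    print(record["started"], record["step"], "->", record["status"])
```

Calling `run_playbook(RECOVERY_STEPS, dry_run=False)` would actually execute the commands, so that switch belongs behind whatever approval process your playbook defines.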

How to Test Your Playbook

While we're still on the topic of plans and playbooks, I should add one more very important note. If you're going to go to all the trouble of researching and then writing a playbook, you don't want to discover in the middle of a crisis that your plans don't actually work.

The safe assumption is that nothing in technology will work unless it's been carefully and repeatedly tested in advance. That's true of recovery playbooks, and it's just as true of backups: until you've successfully restored a backup archive into a real environment, you should assume it'll fail.
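One way to act on that assumption is to restore every backup into a scratch location and compare it with the source. The sketch below does this for a tar archive using only the Python standard library. The file names and sample data are throwaway placeholders, and a real test would restore into an environment that resembles production rather than a temporary directory.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def restore_matches_source(archive: Path, source: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to the source."""
    # Use the safer "data" extraction filter on Python versions that support it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch, **extra)
        return checksums(Path(scratch)) == checksums(source)


# Demonstration with throwaway data standing in for a real backup.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "data"
    source.mkdir()
    (source / "orders.csv").write_text("id,total\n1,19.99\n")

    archive = Path(workdir) / "backup.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to the data directory
        tar.add(source, arcname=".")

    print("restore verified:", restore_matches_source(archive, source))
```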

With what you've seen about the scope, tools, documentation, updates and automation for playbooks, you're all set to get to work creating your own. Well, don't let me get in your way!

This article comes from my Introduction to Cybersecurity course. And there's much more technology goodness available at bootstrap-it.com

High Availability vs Fault Tolerance vs Disaster Recovery – Explained with an Analogy

Daniel Adetunji — Mon, 07 Nov 2022 17:10:24 +0000

High availability, fault tolerance and disaster recovery are important things to consider when designing a system.

These terms are sometimes used interchangeably by architects and developers. They are not, however, the same thing – and understanding the differences can save you many headaches, as well as time and money.

This article will go through the differences between the three terms and explain how you can implement them in AWS.

Highly Available vs Fault Tolerant vs Disaster Recovery

A highly available system is one that aims to be online as often as possible. While downtime can still occur in a highly available system, the aim of high availability is to limit the duration of the downtime, not to completely eliminate it.

A fault tolerant system is one that can operate through a fault without any downtime. Fault tolerance aims to avoid downtime completely.

In a complete system failure however, high availability and fault tolerance are not enough. Disaster recovery describes how the system can continue to operate when the cushion of high availability and fault tolerance disappears in a system wide failure.

What Does High Availability Mean?

First, let's describe what high availability is not. High availability does not mean that the system never fails or never experiences downtime. A highly available system is simply one that aims to be online as often as possible.

Imagine we have a pizza restaurant that is open 24 hours a day, 365 days a year. If that restaurant only has one chef, then its availability (that is, its ability to process orders) will not be 100%. This is because a single chef can only work for about eight hours a day with a one-hour break, or effectively seven hours a day, seven days a week.

The chef can therefore only work for 49 hours in a week out of a possible 168 hours. This restaurant has an availability of 29%.

A low availability restaurant

This is of course not a high enough availability for a restaurant that wants to be open for 24 hours in a day throughout the year.

So how do we get a higher availability for the restaurant? Hire more chefs. If we have four chefs working six hour shifts in a day for seven days in a week, this gives us a theoretical availability of 100%.

A higher availability restaurant

This 100% availability is only theoretical because it assumes no chef misses work in an entire year. This is a poor assumption as chefs can get sick, their cars can break down on the way to work, or they may have to leave work early to pick up their kids.

Let's say all this chef downtime adds up to five hours in a year. This gives you an availability of 99.94%.
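The restaurant figures above can be reproduced with a few lines of arithmetic. The function name is just an illustrative choice.

```python
HOURS_PER_WEEK = 7 * 24      # 168
HOURS_PER_YEAR = 365 * 24    # 8760


def availability_percent(uptime_hours, total_hours):
    """Share of the total time during which orders can be processed."""
    return 100 * uptime_hours / total_hours


# One chef: seven effective hours a day, seven days a week.
one_chef = availability_percent(7 * 7, HOURS_PER_WEEK)

# Four chefs covering every hour, minus five hours of missed work per year.
four_chefs = availability_percent(HOURS_PER_YEAR - 5, HOURS_PER_YEAR)

print(f"One chef: {one_chef:.0f}%")                             # One chef: 29%
print(f"Four chefs, five hours missed: {four_chefs:.2f}%")      # 99.94%
```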

How can you make the restaurant even more available? Hire standby chefs that are ready to come to the restaurant at a moment's notice. But this comes at a steep price since you have to pay these chefs to wait until they are needed.

What these standby chefs give you is the ability to quickly recover from not having enough chefs to meet customer orders. You can never have 100% availability because of the constraints of reality. You can only approach an availability of 100% at an increasingly steep price.

What is Availability in a System?

Availability is the probability that a system will be able to respond to a request.

Note that high availability has nothing to say about the quality of the pizzas or how quickly they are delivered. High availability is simply concerned with the ability of the restaurant to respond to pizza orders from customers.

The major cloud providers typically have SLAs that describe the availability of a system.

Take a blob storage system, for example. AWS S3 Standard is designed for 99.99% availability, although the figure its SLA formally commits to is lower. Azure Blob Storage and Google Cloud Storage publish figures in a similar range.

What exactly does 99.99% availability mean? It means that, over a year, the system is expected to be online 99.99% of the time. An uptime of 99.99% equals a downtime of 0.01%. This is equivalent to a downtime of approximately 53 minutes, just under an hour for an entire year.

How about an availability of 99.9%? Such a system would have a downtime of 0.1% which is 8.8 hours in a year.
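Converting an availability percentage into yearly downtime is simple arithmetic, sketched here for a 365-day year. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 (ignoring leap years)


def downtime_minutes_per_year(availability_percent):
    """Minutes per year a system may be down at a given availability."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")
```

Each extra nine cuts the downtime allowance by a factor of ten: roughly 526 minutes, then 53 minutes, then 5 minutes per year.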

While 99.9% availability may seem high, for a bank processing payments, an air traffic control system, or any other critical system, that amount of downtime may simply be unacceptable.

What is the right amount of availability you should target? That depends on the requirements of the system you are building.

You are of course constrained by the availability SLAs of the cloud providers, so there is limited flexibility in achieving say 99.999% availability for a blob storage system, for example. And, the higher the availability you want to achieve, the more expensive and complex the solution becomes.

What Does Fault Tolerance Mean?

If a failure within a system occurs, can the system continue to operate without any disruption? If it can, then the system is fault tolerant.

So what is the difference between high availability and fault tolerance? With a highly available system, failures that cause downtime will occur, but rarely. The system is also able to recover from such failures. But when the system is down, it cannot respond to requests.

In a fault tolerant system, the system can continue to operate in spite of a failure.

Let's use the pizza restaurant as an example again. If the restaurant experiences a power outage, then no number of chefs in the kitchen or on standby will help with making pizzas for customers, since the ovens need a power supply.

A backup generator that kicks in immediately when a power loss is experienced makes the restaurant fault tolerant.

Another good example of this is a commercial aircraft powered by jet engines. These aircraft are built to be fault tolerant so in the event that one engine fails, the aircraft can continue to fly and land without disruption or having to fix the failed engine in flight.

Helicopters or single engine aircraft, on the other hand, are not fault tolerant. A failure of the engine means the aircraft cannot fly. Such failures are usually catastrophic and partly explain the higher rate of helicopter and single engine aircraft crashes compared to dual engine aircraft.

An aircraft with two engines is fault tolerant

Helicopters and single engine aircraft are not fault tolerant

What Does Disaster Recovery Mean?

If the scale of the system failure is so large that the high availability and fault tolerance of the system are effectively neutralised, can the system continue to operate?

Let's go back to the restaurant example. If a fire, flood, or any other disaster befalls your pizza restaurant, how can you continue to make pizzas for your customers?

This is a somewhat facetious example since in the event of a fire, worrying about customer orders is not the main priority – but the logic of the example still holds.

In this instance, high availability is of no help. Having an infinite number of chefs in the kitchen or on standby still means no pizzas for customers if the restaurant is engulfed in flames.

Fault tolerance is also of no help. A backup generator is useless for the appliances it is meant to power if they have been destroyed.

The only way the system (restaurant) can continue to operate is by routing orders to another nearby restaurant unaffected by the fire. Disaster recovery is a proactive plan of action that details how to recover after a disaster has happened.

Bringing it All Together

Now, let's look at a single architecture that is simultaneously highly available, fault tolerant, and has built-in disaster recovery.

All in one - high availability, fault tolerance, and disaster recovery in a single architecture

The architecture above shows a multi-availability zone (AZ) Relational Database Service (Amazon RDS) deployment. It shows an RDS database with a standby instance in a separate AZ, a single read-only replica, and an S3 bucket used to store backups of the database on a daily basis.

RDS is a fully managed database-as-a-service offering from AWS, where AWS manages the underlying hardware, software, and application of the database. The AWS documentation has more information on RDS and availability zones.

Now let's dissect how this system would work and how the design ensures it is highly available, fault tolerant and can recover from a disaster.

How High Availability is Achieved

The primary RDS instance in AZ A synchronously replicates its data to the standby instance in AZ B.

With synchronous replication, the primary instance waits until the standby has received the latest write operation before the transaction is recorded as successful. This ensures that both databases have identical information – that is, they are consistent, admittedly at the expense of increased transaction latency.

The primary and standby instances are in an active-passive configuration. Of the two, only the primary receives read and write requests. The job of the standby is simply to take over as the primary in the event of a failure of the primary instance.

The maximum time you can tolerate for a recovery is called the Recovery Time Objective (RTO), and the time it takes to fail over from the primary to the standby instance has to fit within it. In this case, the failover time for RDS in a multi-AZ configuration is currently between one and two minutes.

The standby instance has one purpose: to increase the availability of the system. If the primary instance fails, or if the entire AZ A goes down, the standby instance in a separate AZ will be promoted to the primary. This failover process takes 1-2 minutes. That is 1-2 minutes of downtime.

Recall that high availability is not about preventing downtime, but simply reducing it. Without a standby instance, there is a high probability that downtime will exceed the 1-2 minutes it takes to recover with a standby instance.

Note that the standby instance does not help with fault tolerance, since the failure of the primary will still lead to downtime.

How Fault Tolerance is Achieved

To eliminate downtime, you need a configuration that involves no failover. This is a job for read-only replicas. These are asynchronously replicated copies of the primary instance. Writes are only made to the primary instance. Read replicas are, as the name implies, read only.

Such an approach is ideal for read-heavy applications, since read replicas can remove the additional burden of read requests from the primary instance.

In asynchronous replication, writes to a primary instance do not wait for a response from the read-only replica before the transaction is recorded as a success. This means that, for a time, data across the primary and read replica may not be identical (but rather, inconsistent) after a write to the primary.

This eventual consistency (a topic for another article) is a drawback of asynchronous replication. The benefit of asynchronous replication is that it does not wait for the read replica to respond before the transaction is recorded as a success.

This is important because if the read replica is down or there is a network failure, the primary can still accept subsequent writes without waiting for the read replica to confirm that the previous write was successfully replicated.

The architecture above has two replicas: one synchronous and the other asynchronous. If all replicas are synchronous, then a failure in the standby replica or the read only replica, or even a network failure, brings the entire cluster down. This is a fragile design that exposes the entire system to failure if a single component fails. Having some replicas that are synchronous and others that are asynchronous improves the fault tolerance of the system.
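The difference between the two replication modes can be illustrated with a deliberately simplified in-memory model. This is a toy sketch, not how RDS is implemented: the `Node` and `Primary` classes, the `pending` queue, and the strict "reject the write if the standby is unreachable" rule are all assumptions made for illustration.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Primary(Node):
    def __init__(self, standby, read_replica):
        super().__init__("primary")
        self.standby = standby            # synchronous replica
        self.read_replica = read_replica  # asynchronous replica
        self.pending = []                 # writes not yet applied to the read replica

    def write(self, key, value):
        # Synchronous: the write only succeeds once the standby has it.
        if not self.standby.up:
            raise RuntimeError("standby unreachable: write rejected")
        self.data[key] = value
        self.standby.data[key] = value
        # Asynchronous: queue the change and return without waiting.
        self.pending.append((key, value))
        return "committed"

    def ship_pending(self):
        """Apply queued writes to the read replica once it is reachable."""
        if self.read_replica.up:
            for key, value in self.pending:
                self.read_replica.data[key] = value
            self.pending.clear()


standby, replica = Node("standby"), Node("read-replica")
primary = Primary(standby, replica)

replica.up = False                         # the asynchronous replica goes offline
print(primary.write("order-1", "pizza"))   # still commits
print(replica.data)                        # replica is behind: {}
replica.up = True
primary.ship_pending()
print(replica.data)                        # replica has caught up
```

In this model a read replica outage never blocks writes, while a standby outage does, which is the trade-off between consistency and fault tolerance described above.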

Where else does fault tolerance come in? Like an aircraft with two jet engines that provide thrust, a read replica and a primary can work together simultaneously. The primary instance processes writes and the read replica responds to read requests.

Failure of the primary instance has no effect on the read replica's ability to respond to read requests. There is no downtime for reads since only the read replica responds to reads.

How about writes? The read replica can be promoted to a primary, although with RDS, this is currently a manual process.

How Disaster Recovery is Achieved

With the architecture above, you can handle disaster recovery in two ways. There is no constraint to limit disaster recovery to only one approach, so you can use both at the same time. And in fact, the more approaches you have, the better, since this provides extra redundancy.

Ultimately, you should weigh all this against cost, as implementing disaster recovery strategies can be expensive.

The first method is through automatic backups. Backups are taken from the standby instance, preventing performance degradation of the primary instance that has to serve writes (and reads if not configured with a read replica). Since there is synchronous replication between the primary and the standby, we have a guarantee that the standby is an up-to-date copy of the primary, so it's the ideal place to take backups from.

With RDS, backups are taken on a fixed schedule once a day (specified by you) and stored in an S3 bucket. Since this is an entirely separate component, any RDS-related system-wide failures will not affect the durability of the backups.

With backups, a loss of the primary, standby, and read-replicas does not equal a permanent loss of data. Backups can then be used to restore the database to a new DB instance.

The second method is to promote the read-only replica to a standalone instance if the primary instance fails. The read replica can be configured in another AWS region. This way, if there is a disaster on a regional scale where multiple AZs are down, a cross regional read replica will ensure that another instance is available in a different AWS region to serve read and write requests.

This is analogous to diverting orders to another restaurant in the event of a fire.
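Promoting a read replica can be scripted rather than clicked through in the console. The sketch below assumes the boto3 library and its RDS client's `promote_read_replica` call. The region and instance identifier in the usage comment are hypothetical placeholders, and the function takes the client as a parameter so it can be exercised without AWS credentials.

```python
def promote_replica(rds_client, replica_id):
    """Ask RDS to turn a read replica into a standalone, writable instance.

    Returns the instance identifier and the status RDS reports for it.
    """
    response = rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    instance = response["DBInstance"]
    return instance["DBInstanceIdentifier"], instance["DBInstanceStatus"]


# Example usage (requires the boto3 package and AWS credentials).
# The region and instance name below are placeholders:
#
#   import boto3
#   client = boto3.client("rds", region_name="eu-west-1")
#   print(promote_replica(client, "orders-db-replica"))
```

Promotion is not instantaneous, so a recovery script would still need to wait for the instance to become available and then repoint the application at it.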

How different components improve the availability, fault tolerance and disaster recovery of a solution

Wrapping Up

Availability is measured in percentages - the larger the number, the more available the system is (hence less downtime).

Very few systems aim for 100% availability, although pacemakers are a notable exception. An availability of 99.999% means a downtime of 0.001%, which is about 5 minutes of downtime in a year. This tends to be the upper limit for most software systems.

Aiming for higher levels of availability above this is increasingly complicated, expensive, and often unnecessary. This is especially true when you consider that the software system you are building relies on infrastructure like the power grid and internet service providers, which may have lower availability levels.

Fault tolerance, on the other hand, cannot be measured. Your design is either fault tolerant or it is not. Similarly, disaster recovery cannot be measured. You either have a plan of action that precisely outlines how your system can recover from a disaster or you do not.

Knowing the difference between high availability, fault tolerance, and disaster recovery is important. It ensures you are building the correct architecture based on customer needs.

Over-engineering a solution by providing disaster recovery when all that is required is high availability or fault tolerance is often an expensive and complex exercise.

On the other hand, under-engineering a solution by only providing high availability when fault tolerance is required can lead to severe consequences for some critical systems that cannot afford any downtime.

How to Create a Disaster Recovery Plan for your IT Team

David Clinton — Thu, 14 May 2020 13:00:00 +0000

You know the old joke: there are two kinds of companies, those that have been hit with an IT disaster, and those that don't yet realize they've been hit with an IT disaster.

But what they all have in common is that there are plenty more disasters to come. So ask yourself whether you're ready for the next one.

This article, which is based on my Pluralsight course, Linux System Maintenance and Troubleshooting, is intended to start you thinking about what building an effective protocol will take.

What you need to have in place

It all begins with the business continuity plan (BCP). This is a formal plan that's meant to define the procedures an organization would use to ensure survival in the event of an emergency.

BCPs will generally include sub-plans to secure the immediate safety of employees and customers, work to restore previously-designated critical operations as soon as possible and, eventually, to restore full normal operations.

In addition, an effective BCP will also include two sub-plans that are specific to IT operations: the incident management protocol and disaster recovery plan.

The disaster recovery plan (DRP) aims to protect an organization's IT infrastructure in the event of a disaster. Its primary goals are to minimize damage and to restore functionality as quickly as possible.

The reason we call this a "plan" is because it simply won't work without serious prior preparation. Infrastructure protection, threat detection, and corrective protocols are critical parts of the plan.

An Incident Management Plan (IMP) is meant to address the specific threat of cyber attacks against IT infrastructure. Its goals are to minimize damage and remove the threat.

As you can easily tell, there will be some overlap between your DRP and IMP. But the key focus of disaster recovery is to get your infrastructure back on its feet, while incident management is much more closely aligned with the world of IT security.

For the rest of this short article we're going to look at what goes into creating incident management and disaster recovery plans, and how to ensure that your plan is sound and will, when executed, actually work.

Developing an Incident Management Protocol

Since incident management is going to be your first response to trouble, we'll begin there.

The first indication that there's trouble can come from a user who notices that something's not right with the system. Or, if you've done a particularly good job configuring your infrastructure, it could also come to you in the form of an automated alert triggered by monitoring software.

When that alert comes in, it'll be the job of the technician or admin on call to decide how it's going to be handled and who has to handle it.

Escalation can happen through a direct phone call or email, a ticket submitted through a collaboration tool like Jira, or by using a purpose-built Security Information and Event Management (SIEM) tool.

Again, though, the more smart automation you build into the process, the faster and more efficient it's likely to be.

Whoever ends up with the ultimate responsibility will coordinate efforts to definitively diagnose and resolve the problem. Ideally, where necessary, such coordination will include admins, developers, and other key stakeholders to ensure you've got all the resources you'll need to address the problem.

When it's all over, once you've confirmed the problem is resolved, you'll want to close the incident by assessing what went wrong and what went right, how your response could have been better, and how you can rework things to reduce the risk of a repeat of the incident.

But what does all this have to do with IT administration? Well, responsible IT managers must be able to build resiliency into their infrastructure.

That will mean spending serious time fine-tuning your monitoring systems so they'll catch and alert you to real problems while generating as few false positives as possible.

And it'll probably also involve intelligently automating logging and intrusion detection systems and generally getting a good idea of how things are supposed to look.
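One common way to cut down on false positives is to alert only when a metric stays above its threshold for several consecutive samples, so a single spike doesn't page anyone. Here is a minimal sketch of that idea. The class name, the 90% threshold, the three-sample window, and the CPU readings are all made up for illustration.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value):
        self.recent.append(value)
        return (
            len(self.recent) == self.recent.maxlen
            and all(sample > self.threshold for sample in self.recent)
        )


# Hypothetical CPU readings (percent), sampled once a minute.
cpu_alert = SustainedThresholdAlert(threshold=90, window=3)
readings = [95, 40, 92, 93, 97, 50]
fired = [cpu_alert.observe(reading) for reading in readings]
print(fired)  # [False, False, False, False, True, False]
```

The isolated 95% spike is ignored, while three high readings in a row trigger the alert. The trade-off is a slower reaction to real problems, so the window size needs tuning against your own workloads.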

Developing a Disaster Recovery Plan

Disaster recovery planning requires you to:

Define exactly what recovery means
Identify the resources that achieving recovery will require
Convert those observations into a formal plan format
Communicate the plan to the players who will one day have to carry it out

What does recovery mean? It's when your poor, stricken infrastructure has returned to the shape it was in the moment before disaster hit.

What you'll need to get you back to that point can be defined by establishing a Recovery Time Objective (RTO) and Recovery Point Objective (RPO) that fits your organization's needs.

A Recovery Time Objective represents the maximum number of minutes, hours, or days that your organization could survive an IT service outage. So your recovery plan will need to incorporate that hard deadline into its protocols.

Of course that means you'll need to have team members available to make it into the office even in the small hours of the night quickly enough to make a difference.

But it also means, say, that if your RTO is six hours, but restoring critical data from your backups would take a minimum of eight hours just to handle the transfer, then you'll have to rethink those numbers before signing off on the plan.
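That sanity check is easy to put into numbers. The sketch below estimates how long the data transfer alone would take and compares it with the RTO. The backup size and throughput are placeholder figures chosen to reproduce the eight-hour example, and the helper names are illustrative.

```python
def transfer_hours(data_gb, throughput_mb_per_s):
    """Hours needed just to move the backup data at a sustained throughput."""
    seconds = (data_gb * 1000) / throughput_mb_per_s
    return seconds / 3600


def fits_rto(data_gb, throughput_mb_per_s, rto_hours, other_steps_hours=0.0):
    """True if the transfer plus any other recovery steps fit inside the RTO."""
    total = transfer_hours(data_gb, throughput_mb_per_s) + other_steps_hours
    return total <= rto_hours


# Placeholder figures: 2,880 GB of backups over a link sustaining 100 MB/s.
hours = transfer_hours(2880, 100)
print(f"Transfer alone: {hours:.1f} hours")                       # 8.0 hours
print("Fits a six-hour RTO:", fits_rto(2880, 100, rto_hours=6))   # False
```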

A Recovery Point Objective is the amount of transaction data your organization could afford to lose during an outage and survive.

To illustrate, an e-commerce website that normally processes 25 transactions each minute could, perhaps, afford to issue apologies and refunds to 30 minutes' worth of angry customers wondering why their credit cards were billed but their electric train sets weren't delivered. Refunding more than 30 minutes' worth, however, could deplete your financial reserves to the point that you're no longer viable.
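In numbers, that example works out as follows. The helper names are illustrative, and the backup intervals are hypothetical values used to show how an RPO constrains your backup schedule.

```python
def transactions_at_risk(transactions_per_minute, minutes_since_last_backup):
    """Transactions that would be lost if the outage hit right now."""
    return transactions_per_minute * minutes_since_last_backup


def meets_rpo(backup_interval_minutes, rpo_minutes):
    """Worst case, everything since the previous backup is lost."""
    return backup_interval_minutes <= rpo_minutes


print(transactions_at_risk(25, 30))                              # 750
print(meets_rpo(backup_interval_minutes=60, rpo_minutes=30))     # False
print(meets_rpo(backup_interval_minutes=15, rpo_minutes=30))     # True
```

With a 30-minute RPO, hourly backups are not enough: in the worst case an outage would cost a full hour of transactions.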

In any case, calculating accurate and reliable RTOs and RPOs is how you set the limits within which your recovery plan will have to operate. Or, in other words, you'll have defined what recovery means.

Now what about resources? By which I mean the data backups and, when necessary, the physical equipment you'll need to get your application back on its feet.

To make that work you'll have to decide on an infrastructure backup system. Whether you choose to go with incremental or differential, on-site or off-site, and single or multiple media types, you'll have to map out exactly how the recovery will go and whether or not it'll meet your RTO and RPO limits.

Of course there's no end of really bad things that can happen to make those plans utterly useless. What if your local server facility just burns down? What if it's lost to some kind of political upheaval or widespread power disruption?

Even if you've conscientiously maintained up-to-date data backups off-site, what good will they do you if your hardware effectively no longer exists?

Thinking about all those horrors can make preparing a cloud-based backup protocol using platforms like AWS and Azure sound mighty attractive. The big public clouds have the resources to distribute their infrastructure widely enough that it's virtually impossible for the whole thing to ever go down.

So you could, for instance, maintain a reliably replicated data store on a public cloud platform that mirrors your main deployment. You could also design an infrastructure template that could be loaded up with your backup data and then launched on demand to take over in the event of an outage. Because nothing in that on-demand design is kept running until it's actually needed, it can take a good few minutes to bring it up to speed.

A warm standby recovery design might maintain your data running 24/7 on a minimal number of virtual servers. In an emergency, you can hit the switch and the platform's auto scaling will fire up all the instances you'll need.

You could set the scaling to kick in when triggered by an alert from your primary system. The public cloud presents endless possibilities, but they all require planning and preparation.

A solid disaster recovery plan must be effectively communicated long before crunch time. Practically speaking, that means it'll all be written up, printed, and distributed to each of the key players who will carry out the plan.

That's not to say it ends there: those players will of course have to actually read the thing and, ideally, engage in realistic simulations until they're confident they can make it work under pressure.

What goes in this book?

An enumeration of all the stuff that could go wrong and bring down your system
An inventory of exactly what you've got running in your server room and what would be needed to replace it
The information you'll need to access and restore backed up data
An up-to-date contact list of the people who will be responsible for every aspect of the plan
The exact sequence of the tasks and events that will make up the recovery

That's a lot of detail. But it's barely a drop in the bucket when compared with the total amount of preparation and plain old hard work that goes into creating a real-world recovery plan.

But for now, the key takeaway from this article is simply to keep all this in mind. Why? Because the next time you sit down to configure a monitoring package or administration framework, you'll think about incident management protocols and disaster recovery plans and wonder how you should include them in your configuration.

There's much more administration goodness in the form of books, courses, and articles available at bootstrap-it.com.