It’s tempting to believe that your cloud resources are protected from outages and disasters simply because they’re not on-premises, but the truth is that a cloud service interruption could happen at any time. Whether the outage is caused by a natural disaster, a software bug, or a critical network failure, your business needs to have a plan in place for restoring access to your cloud services. Robust cloud disaster recovery planning will ensure you’re able to minimize the business impact of an outage by restoring access to critical cloud services, applications, and data quickly and securely.
7 Key Cloud Disaster Recovery Protocols
You likely already have a traditional disaster recovery (DR) plan for your on-prem infrastructure, which may involve backing up services to a cloud provider. This is the scenario most people think of when they hear “cloud disaster recovery.” However, as you transition your resources to the cloud, you need a new DR plan that accounts for the unique challenges of a cloud-based infrastructure. These key protocols will help you design and implement a cloud disaster recovery plan that will get your business-critical services back up and running when you need them.
1. Defining Your DR Goals
Before you can develop a cloud disaster recovery plan, you must first define the goals of that plan. There are two metrics you should consider as you analyze your disaster recovery needs:
- Recovery Time Objective (RTO) – This is the maximum acceptable amount of time that your cloud services can be offline. Your RTO may be determined by your service level agreement (SLA) with your clients, or by the specific needs of your business.
- Recovery Point Objective (RPO) – This is the maximum amount of cloud data, measured in time, that your business can acceptably lose due to an outage. For example, data that isn’t modified very frequently can tolerate a higher RPO because few changes are likely to be lost between backups. However, for critical data that is constantly accessed and updated, your RPO will need to be much lower.
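To make the RPO definition concrete, here is a minimal sketch of how a backup schedule could be checked against it. It assumes periodic backups that complete quickly, so the worst-case data loss is roughly one backup interval. The function name and the sample values are illustrative, not taken from any particular tool.

```python
from datetime import timedelta


def backup_interval_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Return True if a periodic backup schedule can satisfy the RPO.

    Assumption: the worst case is an outage that strikes just before the
    next backup runs, so up to one full interval of changes is lost.
    """
    return backup_interval <= rpo


# Hypothetical example: hourly backups against two different RPO targets.
hourly = timedelta(hours=1)
print(backup_interval_meets_rpo(hourly, rpo=timedelta(hours=4)))     # rarely modified data
print(backup_interval_meets_rpo(hourly, rpo=timedelta(minutes=15)))  # constantly updated data
```

Under these assumptions, the hourly schedule is enough for the four-hour target but not for the fifteen-minute one, which would call for more frequent backups or continuous replication.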
2. Cloud Monitoring
Another key component of your DR plan will be cloud monitoring. You need a high level of visibility into your cloud infrastructure, services, and applications so you can detect issues as soon as possible and take whatever measures are necessary to prevent or mitigate an outage. Automated cloud monitoring solutions are critical for achieving a low RTO and RPO, because the time it takes to notice an outage counts against both. Depending on your cloud architecture and backup, failover, and recovery plans, you’ll likely need a cloud-agnostic solution so you can manage multiple clouds from one centralized location.
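As a rough illustration of automated detection, the sketch below fires an alert after a number of consecutive failed health checks. The function name and the threshold are assumptions made for this example; a real monitoring product will have its own rules and configuration.

```python
from typing import Iterable, Optional


def first_alert_index(check_results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the health check at which an alert would fire.

    check_results holds one boolean per check (True = healthy). An alert
    fires once `threshold` checks in a row have failed. Returns None if
    that never happens.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(check_results):
        # A healthy check resets the streak; a failed one extends it.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return index
    return None


# Hypothetical check history: the service fails from the second check onward.
history = [True, False, False, False, True]
print(first_alert_index(history, threshold=3))
```

The threshold is a trade-off. A higher value reduces false alarms from brief blips, but every extra check you wait adds to the detection delay, and that delay eats into your RTO.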
3. Offsite Backups
It should go without saying at this point, but offsite backups are crucial to any disaster recovery plan, whether it’s on-premises or cloud-based. For your cloud services, data, and applications, you have two options for offsite backups: on-prem, or multi-cloud (backing up from one cloud to another). There are advantages and disadvantages to both approaches, so you’ll need to analyze your business needs to determine which one works best for your cloud architecture. You will want to ensure that your application spans regions and that data is replicated across regions as well, while also keeping compliance regulations in mind. For example, the GDPR restricts transfers of EU residents’ personal data outside the European Economic Area unless appropriate safeguards are in place, so many organizations choose to keep that data in EU regions. This will impact which regions you can use for DR.
Cloud-agnostic centralized management tools and containerization strategies make it much easier to manage a multi-cloud architecture and disaster recovery plan now than it used to be, but many organizations still prefer to keep their cloud backups in a data center for greater control and security.
4. Software Recovery
You need to ensure that your critical cloud software can be restored in its recovery location and run without errors. This also means that you must patch, update, and deploy your applications in both production and backup environments simultaneously so you can provide a seamless experience for your users if your cloud applications fail over. You also need to ensure the platforms that your apps run on are up to date and patched. For example, if your app is running on EC2, its instances should be launched from the latest patched AMI. You should do the same with your other cloud services; for example, your databases need to be updated simultaneously so that, upon a failover event, your users don’t encounter any out-of-date or missing data.
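One way to catch environments that have fallen out of step is to compare what is deployed in each. The sketch below assumes you can export a simple mapping of component name to version for production and for the DR environment. The component names and versions are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    production: Dict[str, str], dr: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return components whose versions differ between production and DR.

    Each value is a (production_version, dr_version) pair. None means the
    component is missing from that environment.
    """
    drift = {}
    for name in sorted(production.keys() | dr.keys()):
        prod_version = production.get(name)
        dr_version = dr.get(name)
        if prod_version != dr_version:
            drift[name] = (prod_version, dr_version)
    return drift


# Hypothetical inventories exported from each environment.
production = {"web": "2.4.1", "db": "14.9", "worker": "1.0.0"}
dr = {"web": "2.4.0", "db": "14.9"}

for component, (prod_v, dr_v) in find_drift(production, dr).items():
    print(f"{component}: production={prod_v} dr={dr_v}")
```

Running a check like this on a schedule, or as a gate in your deployment pipeline, can surface drift long before a failover exposes it.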
5. Security and Compliance
In addition to making sure your backup applications, services, and data match your production cloud environment—as well as developing a strong automation strategy that ensures all environments are consistent and updated regularly—you also need to match the security and compliance controls. For instance, you’ll need to replicate your user access controls with a solution that allows you to centrally manage user permissions across your production and DR clouds.
In addition, you need to ensure that your cloud backups meet compliance requirements; for example, the GDPR requires appropriate technical measures to protect personal data and names encryption as one such measure. This means your DR cloud provider should hold any relevant certifications and have policies and procedures in place to maintain adequate cloud data privacy and portability.
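Compliance checks on backups can also be automated. The following is a minimal sketch, assuming an internal policy that every backup must be encrypted and stored in an approved set of regions. The record fields, the backup names, and the policy itself are hypothetical. Your actual obligations depend on the regulations that apply to you.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class BackupRecord:
    # Hypothetical inventory entry describing one stored backup.
    name: str
    region: str
    encrypted: bool


def compliance_violations(
    backups: Iterable[BackupRecord], allowed_regions: Set[str]
) -> List[str]:
    """List every way the given backups break the assumed policy."""
    violations = []
    for backup in backups:
        if not backup.encrypted:
            violations.append(f"{backup.name}: not encrypted")
        if backup.region not in allowed_regions:
            violations.append(f"{backup.name}: region {backup.region} not allowed")
    return violations


# Illustrative data: region identifiers follow AWS naming, used only as examples.
backups = [
    BackupRecord("orders-db", "eu-west-1", encrypted=True),
    BackupRecord("user-uploads", "us-east-1", encrypted=False),
]
for problem in compliance_violations(backups, allowed_regions={"eu-west-1", "eu-central-1"}):
    print(problem)
```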
6. User Training
Any of your staff who are responsible for cloud disaster recovery tasks need adequate training to ensure they’re prepared to take action when an outage event occurs. Clear communication across all teams and channels is essential, and this training should be reinforced with test scenarios, which will be discussed further in the next section.
Additionally, if your end-users need to change anything about how they use your cloud services during a failover event, they should be trained on how to do so ahead of time. This will not only improve the end-user experience, but will also minimize the amount of support your teams need to provide during a disaster recovery scenario, freeing them up to assist in restoring your cloud production environment if necessary.
7. DR Testing
The final key to a successful cloud disaster recovery plan is testing. You need to QA your disaster recovery plan and test whether the controls, protocols, and backups you’ve implemented will actually work during an outage. Some of the most important things to test for include:
- The replication of user access permissions to your cloud backups, and whether users are able to log in and perform their tasks in the DR environment.
- The security controls protecting your DR environment and whether they can pass a penetration test.
- Whether or not you’re able to meet your RTO and RPO, and what may be preventing you from honoring your SLA.
- Whether your DR site can accommodate the increased load when users who were unable to access your system during the outage are able to access it again.
You can test each of these things individually, but ideally you should conduct a live simulation of a disaster event. This will allow you to see your cloud DR plan in action and ensure that your people know how to enact it, and determine where your weaknesses are so you can shore up your defenses before a real disaster. For more complex cloud environments and disaster recovery plans, you should conduct multiple tests simulating different types of outages and disasters with varying levels of severity, which you can accomplish with the help of chaos testing tools such as Netflix’s Chaos Monkey.
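When you run several simulations, it helps to record the outcome of each one and compare it to your targets. The sketch below assumes each drill produces a measured recovery time and a measured data-loss window. The scenario names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class DrillResult:
    # Hypothetical measurements captured during one DR simulation.
    scenario: str
    recovery_time: timedelta  # how long services were offline
    data_loss: timedelta      # age of the newest data that was restored


def failed_drills(
    results: Iterable[DrillResult], rto: timedelta, rpo: timedelta
) -> List[str]:
    """Return the scenarios that missed the RTO, the RPO, or both."""
    return [
        result.scenario
        for result in results
        if result.recovery_time > rto or result.data_loss > rpo
    ]


# Illustrative drill results measured against a 1-hour RTO and 15-minute RPO.
results = [
    DrillResult("single instance failure", timedelta(minutes=5), timedelta(0)),
    DrillResult("region outage", timedelta(minutes=90), timedelta(minutes=10)),
    DrillResult("database corruption", timedelta(minutes=40), timedelta(minutes=30)),
]
print(failed_drills(results, rto=timedelta(hours=1), rpo=timedelta(minutes=15)))
```

Each scenario in the returned list points to a part of the plan that needs work before a real disaster tests it for you.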
Achieving Your Cloud Disaster Recovery Goals
Cloud disaster recovery follows many of the same principles as traditional DR, but with some added challenges and complexities. Having a robust cloud disaster recovery plan in place following the key protocols listed above will ensure you’re prepared to swiftly take action in an outage and get your cloud services, data, and applications back up and running.