Cloud Disaster Recovery: Five Steps to Avoid Risk and Protect Your Data

Don't assume your infrastructure is automatically disaster-proof and redundant just because it's in the cloud.
Because many hosting providers maintain multiple data centers, IT managers often assume that disaster recovery (DR) is inherent in the architecture, or that DR is simply not an issue. The recent five-day service outage that knocked thousands of Web sites off-line was a wake-up call: IT managers need to perform the same DR due diligence for cloud services as they would for their in-house infrastructure.

1. Assessing the Risks

Cloud data centers can fall victim to a range of disasters. Some are natural, such as floods, tornadoes or earthquakes. Others are man-made, such as terror attacks. A data center can fail for technical reasons, or even because the data center provider goes out of business. Some disasters strike a single site, others an area or an entire region. IT managers must consider not only the full range of possible risks to a cloud provider, but also how they would complete a recovery in each circumstance.

2. Determining Requirements

IT organizations must classify their recovery requirements in terms of two measures. The Recovery Point Objective (RPO) is the maximum window of recent data, measured back from the moment of failure, whose loss can be tolerated. The Recovery Time Objective (RTO) is the maximum tolerable time for recovering the data and bringing the application back online.
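As a rough illustration of how these two objectives can be checked against a protection scheme, the sketch below compares a backup interval and an estimated restore time with an application's RPO and RTO. The class, function and application names are hypothetical, and it assumes the worst-case data loss is roughly equal to the interval between backups.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRequirement:
    """Recovery targets for one application (names are illustrative)."""
    name: str
    rpo_hours: float  # maximum tolerable window of lost data
    rto_hours: float  # maximum tolerable time to restore service


def meets_requirement(req, backup_interval_hours, estimated_restore_hours):
    """Check a protection scheme against the RPO and RTO.

    Assumption: data written since the last backup is lost in a disaster,
    so worst-case loss is about one backup interval.
    """
    return {
        "rpo_met": backup_interval_hours <= req.rpo_hours,
        "rto_met": estimated_restore_hours <= req.rto_hours,
    }


if __name__ == "__main__":
    orders = RecoveryRequirement("orders-db", rpo_hours=1, rto_hours=4)
    # A daily backup cannot satisfy a one-hour RPO, even if restores are fast.
    print(meets_requirement(orders, backup_interval_hours=24,
                            estimated_restore_hours=2))
```

A check like this makes it obvious when a daily backup is being relied on for an application that can only tolerate an hour of data loss.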

RPO and RTO requirements are driven by the cost of downtime. That cost can include lost revenue, lost employee productivity, and damage to customer goodwill and reputation. Tangible financial losses are the easiest to quantify and correlate most directly with the cost of mitigation. Loss of customer goodwill and reputation is less tangible, but just as important.
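The tangible part of that cost can be estimated with simple arithmetic. The sketch below is one possible model, not a standard formula: it adds lost revenue to lost productivity for the duration of an outage. The function name, parameters and all figures are hypothetical, and intangible losses such as goodwill are deliberately left out.

```python
def tangible_downtime_cost(outage_hours, revenue_per_hour,
                           affected_employees, cost_per_employee_hour):
    """Estimate tangible outage cost as lost revenue plus lost productivity.

    Assumption: revenue and productivity are lost at a constant hourly rate
    for the whole outage. Goodwill and reputation are not modeled.
    """
    lost_revenue = outage_hours * revenue_per_hour
    lost_productivity = (outage_hours * affected_employees
                         * cost_per_employee_hour)
    return lost_revenue + lost_productivity


if __name__ == "__main__":
    # Hypothetical figures: an 8-hour outage, 5,000 per hour in revenue,
    # 40 idle employees at a loaded cost of 50 per hour.
    print(tangible_downtime_cost(8, 5000, 40, 50))
```

Comparing this figure for different outage lengths with the price of each DR option gives a first-pass view of how aggressive the RPO and RTO need to be.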

3. Understanding DR Options

A DR site less than 10 miles from the primary data center is insufficient to guard against area disasters and pandemics, because it is possible that neither the data center nor the DR site would be accessible. A common guideline calls for 90 miles of separation.
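Site separation is easy to check from coordinates. The following sketch uses the haversine great-circle formula to estimate the distance between two sites and compares it with the 90-mile guideline. The function names and the coordinates in the usage lines are made up for illustration, and straight-line distance is only a proxy: it says nothing about shared power grids, flood plains or network paths.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def separation_ok(primary, dr_site, minimum_miles=90.0):
    """True if the two (lat, lon) sites are at least minimum_miles apart."""
    return great_circle_miles(*primary, *dr_site) >= minimum_miles


if __name__ == "__main__":
    primary = (40.0, -75.0)      # hypothetical primary data center
    nearby_dr = (40.1, -75.0)    # roughly 7 miles north
    distant_dr = (42.0, -75.0)   # roughly 138 miles north
    print(separation_ok(primary, nearby_dr), separation_ok(primary, distant_dr))
```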

Asynchronous site-to-site data replication moves data off-site to disk drives rather than to tape, which can significantly reduce recovery time. Most backup/recovery applications can back up to disk using compression and/or de-duplication technology to reduce the size of the backup image. Change-only backup methods, which transmit only the data that has changed since the previous backup, minimize transmission bandwidth and data storage requirements. Virtual tape libraries (VTLs) are specialized storage devices that can further automate the DR process.
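To make the change-only idea concrete, here is a minimal sketch that detects which files have changed since the previous run by comparing SHA-256 digests against a saved manifest. It is a simplification under stated assumptions: it works at whole-file granularity, ignores deletions, and the function and manifest names are hypothetical. Commercial backup products typically track changes at the block level and add compression and de-duplication on top.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(root, previous_manifest):
    """Find files under root that are new or modified since the last run.

    previous_manifest maps relative path -> digest from the previous run.
    Returns (list of changed paths, new manifest). Only the changed paths
    would need to be sent to the DR site.
    """
    current_manifest = {}
    changed = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        key = str(path.relative_to(root))
        digest = file_digest(path)
        current_manifest[key] = digest
        if previous_manifest.get(key) != digest:
            changed.append(path)
    return changed, current_manifest
```

On the first run the manifest is empty, so every file is reported; later runs report only files whose contents differ, which is what keeps bandwidth and storage requirements low.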

Synchronous data replication ensures that all data entered or changed is written to the second site at the same time as it is written to the primary. This is typically the most expensive off-site replication option, but may be justified for some critical applications. Synchronous replication delivers a near-zero RPO, and an RTO limited to the time that it takes to declare a disaster, restart the application from the second site and re-establish communication.

4. Auditing Cloud Providers

IT organizations should also understand the data protection solutions offered by the provider. Most offer daily backup to disk, with some also offering periodic tape backup. However, these backups are usually on-site; off-site tape transfer is rarely included in the base service. While on-site backups can help recovery from data corruption and inadvertent data deletion, and allow point-in-time restores, they provide little protection from disasters.

IT managers should compare service providers' DR documentation to their own requirements, just as they would for their own data center. Elements to audit include the location of the data center, possible events that could compromise it, the availability of power and communications, the data center's relationship to recovery destinations, data center hardening features and the vendor's DR contingencies. Make sure your cloud provider has a process for simulating and testing your DR solution and ensuring it performs as promised.

5. Implementing and Managing

IT environments evolve as applications are added, terms of service change and cloud vendors get acquired, so it's important to continually test your cloud disaster recovery solution. A rolling quarterly DR test on a subset of applications may be sufficient, as long as most or all of your systems are also tested annually. Plan a comprehensive annual audit of your cloud DR solution to ensure it meets your evolving needs.
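One simple way to arrange a rolling schedule is to deal the application list across the four quarters so that each application is tested once a year. The sketch below does this round-robin; the function name and application names are hypothetical, and a real schedule would likely weight critical applications for more frequent testing.

```python
def rolling_quarterly_schedule(applications, quarters=4):
    """Assign each application to one quarter, round-robin.

    Every application appears exactly once, so all are tested within a year.
    """
    schedule = {quarter: [] for quarter in range(1, quarters + 1)}
    for index, app in enumerate(sorted(applications)):
        schedule[index % quarters + 1].append(app)
    return schedule


if __name__ == "__main__":
    apps = ["billing", "crm", "email", "intranet", "wiki"]
    for quarter, tested in rolling_quarterly_schedule(apps).items():
        print(f"Q{quarter}: {', '.join(tested)}")
```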

IT infrastructure can and does fail, regardless of whether it's in-house or in the cloud. The onus is on IT managers to have the right disaster recovery solution in place to avoid the serious damage an outage can cause their business.
