Disaster Recovery


Introduction

Disaster recovery (DR) is the practice of returning a system to operating status after a site-wide disaster. DR enables business continuity after significant data center failures that high availability features cannot cover. Computer systems generally support DR in two ways: backups and replication. Backups are full or partial copies of data from the master cluster that are stored on separate media. Replication, also known as mirroring, continuously copies data from the master cluster to geographically remote instances of the system (“replicas” or “mirrors”). For production deployments, mirroring is the preferred strategy for DR. With either method, a copy of the data is available to restore and thus recover from the disaster. DR with backups involves restoring the saved data into an alternate cluster and enabling that cluster as the new master. DR with mirroring entails activating the mirror, which already has the data loaded, as the new master cluster. (Note that “replication” is also used to refer to the copying of data within a cluster in a data center to eliminate single points of failure and enable high availability.)

In a related area, some systems support point-in-time snapshots, also known as checkpoints, to allow rolling data back to a prior state. This feature is generally used to recover from data corruption due to application or user error. For more information, please see the MapR Snapshots tech brief.

DR requires planning to determine two objectives. The recovery point objective (RPO) is a planned estimate of how much data the organization can afford to lose in case of a disaster. In other words, it is a measure of potential data loss. The recovery time objective (RTO) is the amount of time the organization can be on hold while the system is being recovered. It is a measure of potential downtime. These two objectives make DR a sliding scale, so organizations must decide how much cost and effort to apply to limiting data loss and downtime. Lower RPO and RTO values give greater protection against data loss and downtime, but take more resources to implement. Backups tend to be the much cheaper option, but result in both higher RPO and higher RTO. Mirroring is more expensive due to the redundant hardware in the remote mirrors, but lowers the risk of data loss.

Disaster Recovery in the MapR Converged Data Platform

The MapR Converged Data Platform includes backup and mirroring capabilities to protect against data loss after a site-wide disaster. MapR is the only big data platform that provides built-in, enterprise-grade DR for files, databases, and events. MapR was built to address real-world DR scenarios where lost data and downtime result in lost revenue, lost productivity, and/or failed opportunities.

To create backups, administrators first take a snapshot of the MapR cluster at the volume level. The snapshot includes all data in the volume, including files, MapR-DB database tables and documents, and MapR Streams topics. The snapshot completes in a few seconds and represents a consistent view of the data. This means that, unlike in other big data platforms, the state of the snapshot does not change after it is taken. The snapshot can then be written to another medium as a backup.

In other big data platforms, snapshots might change over time, depending on the state of open files when the snapshot was taken. Also, partially written files won’t be captured when the snapshot is taken, making it difficult to create an accurate backup.

To create remote replicas, MapR provides two features that enable DR for different use cases: mirroring, and table and stream replication.

MapR mirroring is used to create remote mirrors of files. Mirroring supports the following characteristics that are critical for proper DR deployments:

  • Scheduled. Using the browser-based MapR Control System (MCS), administrators can schedule how often mirrors are updated. A higher update frequency leads to a lower RPO.
  • Incremental. Only deltas are transferred from the master cluster to the replicas. If only an 8 KB block is updated at the master cluster, then only that block will be transferred in the next mirroring job.
  • Efficient. Transferred data is compressed, and sent asynchronously and in parallel, and does not significantly impact system performance.
  • Consistent. Prior to creating remote mirrors, a snapshot is automatically taken to ensure a remote mirror of a consistent, known state of the master. Checksums are run to ensure integrity.
  • Atomic. Changes on the mirror are made only after all data has been received for a given mirroring operation.
  • Flexible. Multiple mirroring topologies are supported, including cascaded and one-to-many mirroring.
  • Resilient. Should there be a network partition during a mirroring operation, the system periodically retries the connection and resumes once the network is restored.
  • Secure. Configurable over-the-wire encryption prevents network eavesdropping on the mirrored data.
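
The "Incremental" property above can be illustrated with a short sketch. This is not MapR's implementation; it is a minimal Python model that assumes fixed 8 KB blocks and uses SHA-256 digests to decide which blocks differ between two snapshots. The function names block_digests and changed_blocks are hypothetical, and the sketch ignores truncation (a newer snapshot that is shorter than the older one).

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 KB, the block size used in the example above


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks in `new` that differ from `old` or are new."""
    old_digests = block_digests(old, block_size)
    new_digests = block_digests(new, block_size)
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or digest != old_digests[i]
    ]


# Illustrative data: a four-block volume in which only the third block changes.
snapshot_prev = bytes(4 * BLOCK_SIZE)
snapshot_curr = bytearray(snapshot_prev)
snapshot_curr[2 * BLOCK_SIZE] = 0xFF

delta = changed_blocks(snapshot_prev, bytes(snapshot_curr))
print(delta)  # only these block indexes would need to be transferred
```

In this model a single changed byte costs one 8 KB block of transfer rather than a full copy of the volume, which is the point of incremental mirroring.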

Table and stream replication is the (near) real-time mechanism for replicating the data in MapR-DB database tables and the data in MapR Streams topics. Since database and topic updates tend to occur much more frequently, rapidly, and granularly than file updates, this feature is required to minimize the differential between the master data and the replicas. Table and stream replication has the following advantages:

  • Immediate. Every database or stream update at the master cluster is immediately transferred to the remote replica. This enables a very low RPO.
  • Efficient. Transferred data is compressed, and sent asynchronously and in parallel, and does not significantly impact system performance.
  • Multi-master. For global deployments that share common data, multi-master support lets geographically dispersed user groups perform both reads and writes on the data, and all distributed replicas will be synchronized.
  • Resilient. Should there be a network partition during a replication operation, the system periodically retries the connection and resumes once the network is restored.
  • Secure. Configurable over-the-wire encryption prevents network eavesdropping on the replicated data.

MapR Disaster Recovery Implementation

Once you’ve determined your DR strategy, and thus your RPO and RTO requirements, you can leverage MapR features to support that strategy. Assuming you have a business-critical environment, this discussion will skip the backup option and instead focus on mirroring and table and stream replication. In most big data deployments, especially on MapR, a combination of files, database tables, and streaming will be used, so using both features will enable a robust DR implementation.

Achieving Low RPO with Scheduled Mirroring

For files in your MapR cluster, use mirroring on a scheduled basis to ensure remote mirrors frequently get the latest updates. The window of potential data loss depends on how frequently your mirroring operations are scheduled.
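
As a rough way to reason about that window, the sketch below estimates the worst-case data loss for a given schedule. It is a simplified model, not a MapR formula: it assumes mirroring runs start at a fixed interval, each run takes a fixed transfer time that is no longer than the interval, and the mirror reflects the snapshot taken at the start of the last completed run. The function names worst_case_rpo and meets_rpo are hypothetical.

```python
from datetime import timedelta


def worst_case_rpo(interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Upper bound on the data-loss window under the stated assumptions.

    Just before a run finishes, the mirror still holds the snapshot from the
    start of the previous run, which is `interval + transfer_time` old.
    """
    return interval + transfer_time


def meets_rpo(interval: timedelta, transfer_time: timedelta,
              rpo_target: timedelta) -> bool:
    """Check whether a mirroring schedule satisfies an RPO target."""
    return worst_case_rpo(interval, transfer_time) <= rpo_target


# Illustrative numbers only: a 30-minute RPO target and 5-minute transfers.
target = timedelta(minutes=30)
transfer = timedelta(minutes=5)

print(meets_rpo(timedelta(minutes=15), transfer, target))  # 15-minute schedule
print(meets_rpo(timedelta(hours=1), transfer, target))     # hourly schedule
```

Under these assumptions the schedule interval alone understates the exposure, so it is worth leaving headroom for transfer time when choosing a schedule.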

For an extra level of DR protection, such as guarding against multiple data center failures, use a mirroring topology that creates multiple remote copies, such as a cascaded mirror chain. Cascaded mirror chains are also useful for delivering mirror updates more efficiently. For example, if your master cluster is in New York, and you want to mirror to Sydney and Singapore, it would make sense to mirror from New York to Sydney, and then have a separate mirror chain from Sydney to Singapore.
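
The difference between a cascaded chain and one-to-many mirroring can be sketched as a small graph model. This is only an illustration: the topology is a plain Python dictionary mapping each source cluster to its direct mirrors, the cluster names are taken from the example above, and mirror_hops is a hypothetical helper that counts how many mirroring hops separate each cluster from the master.

```python
from collections import deque


def mirror_hops(topology: dict[str, list[str]], master: str) -> dict[str, int]:
    """Breadth-first walk that returns the number of hops from the master."""
    hops = {master: 0}
    queue = deque([master])
    while queue:
        source = queue.popleft()
        for target in topology.get(source, []):
            if target not in hops:
                hops[target] = hops[source] + 1
                queue.append(target)
    return hops


# Cascaded chain: New York -> Sydney -> Singapore.
cascaded = {"new-york": ["sydney"], "sydney": ["singapore"]}

# One-to-many: New York mirrors directly to both remote sites.
one_to_many = {"new-york": ["sydney", "singapore"]}

print(mirror_hops(cascaded, "new-york"))
print(mirror_hops(one_to_many, "new-york"))
```

In the cascaded layout the master sends each update once, and Sydney forwards it to Singapore. The likely trade-off is that a cluster two hops away receives updates only after the intermediate mirror has them, so its copy may lag further behind the master.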

Achieving Low RPO with Table and Stream Replication

With database tables and MapR Streams topics, you automatically get low RPO since table and stream replication continuously transfers all database and topic updates to the remote clusters. This ensures that the master data and its replicas are closely synchronized. The window of potential data loss is never more than a few seconds.

Achieving Low RTO with MapR Promotable Mirrors

MapR remote mirrors are initially read-only to prevent inadvertent writes to the replica that would result in inconsistency between master and mirror. But should a disaster occur, the mirror needs to be enabled as the (temporary) master cluster. The Promotable Mirrors feature lets you quickly activate (or “promote”) a mirror into a read/write state, thus enabling it for use as the new master cluster. This means that the bulk of the RTO time will entail redirecting users at the network or application level to the new master cluster.
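
The read-only and promoted states described above can be modeled in a few lines. This is a conceptual sketch only, not the MapR API: the MirrorVolume class, its methods, and the volume name are hypothetical, and a dictionary stands in for the volume's data.

```python
class MirrorVolume:
    """Toy model of a mirror that is read-only until it is promoted."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.read_write = False  # mirrors start out read-only
        self.data: dict[str, str] = {}

    def apply_mirror_update(self, updates: dict[str, str]) -> None:
        """Accept data from the master; only valid while still a mirror."""
        if self.read_write:
            raise RuntimeError(f"{self.name} was promoted and no longer mirrors")
        self.data.update(updates)

    def write(self, key: str, value: str) -> None:
        """Client write; rejected until the mirror has been promoted."""
        if not self.read_write:
            raise PermissionError(f"{self.name} is a read-only mirror")
        self.data[key] = value

    def promote(self) -> None:
        """Switch the mirror to read/write so it can act as the new master."""
        self.read_write = True


mirror = MirrorVolume("sales-mirror")
mirror.apply_mirror_update({"order-1": "shipped"})

try:
    mirror.write("order-2", "pending")
except PermissionError as err:
    print(err)  # writes are refused before promotion

mirror.promote()
mirror.write("order-2", "pending")  # accepted after promotion
print(sorted(mirror.data))
```

Because the mirror already holds the data when it is promoted, the state change itself is quick in this model, which matches the point that most of the RTO goes to redirecting users.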

Achieving Low RTO with Table and Stream Replication

Since table and stream replication ensures tight synchronization of the master database tables and MapR Streams topics with the replica tables and topics, and those replica tables and topics are already read/write enabled, no additional effort is required to activate a replica as the master. This means that as above, the bulk of the RTO time will entail redirecting users at the network or application level to the new master cluster.

Conclusion

When running a production deployment for big data, some of the same business continuity practices that you’ve applied in your existing data architecture must be used. Should you face a site-wide disaster, you want to make sure you have a strategy in place to minimize data loss and downtime. With the MapR Converged Data Platform, you get the enterprise-grade disaster recovery capabilities that you would expect from any production-grade software system. MapR lets you define low recovery point objectives and recovery time objectives to meet your business requirements, while also minimizing the administrative overhead to achieve those objectives.


