The ability to create and manage snapshots is an essential feature of enterprise-grade storage systems, and it is increasingly seen as critical for big data systems as well. A snapshot captures the state of the storage system at an exact point in time and is used to fully recover data when it is lost.
Snapshots, within the Apache™ Hadoop® context, are useful in both storage and compute. Some of the prominent uses of MapR Snapshots include:
- Rollback from Errors
Operational and analytical applications manipulate data in the distributed file system on behalf of users or administrators. Application-level errors, or even inadvertent user errors, can delete data or modify it in unexpected ways. In such cases, snapshots can be used to recover to a known, well-defined state.
- Hot Backups
With snapshots, it is possible to create backups on the fly, often required for auditing or compliance reasons.
- Model Training
Machine-learning frameworks such as Mahout can use snapshots to enable a reproducible and auditable model training process. Snapshots allow the training process to work against a preserved image of the training data from a precise moment in time. In most cases the use of snapshots requires no additional storage space and can be completed in less than a second.
- Managing Real-time Data Analysis
By using snapshots, query engines such as Apache Drill can produce consistent point-in-time summaries of constantly updated data sources such as sensor data or social media streams. Using a snapshot for such analyses allows very precise comparisons across multiple ever-changing data sources without stopping real-time data ingestion.
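The model-training and analysis use cases above rely on reading from a frozen image of the data. In MapR, each snapshot is exposed as a read-only directory under a volume's hidden `.snapshot` path; the cluster, volume, and snapshot names below are hypothetical examples:

```shell
# Read the frozen image of a volume over NFS (paths are examples).
ls /mapr/my.cluster.com/training-data/.snapshot/run-2014-06-01/

# Or point a Hadoop job at the snapshot path directly.
hadoop fs -ls /training-data/.snapshot/run-2014-06-01/
```

Because the snapshot path is just a directory, no application changes are needed to train or analyze against it.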
In MapR, a snapshot is a read-only image of a volume at a specific point in time. You can create a snapshot manually or automate the process with a schedule. A snapshot takes almost no time to create, and initially uses no disk space, because it stores only the incremental changes needed to roll the volume back to the state at the time the snapshot was created.
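Creating a snapshot manually is a single administrative command; a sketch, with hypothetical volume and snapshot names (consult the `maprcli` documentation for your version):

```shell
# Create a read-only snapshot of the volume "projects" (names are examples).
maprcli volume snapshot create -volume projects -snapshotname projects-2014-06-01

# List the volume's snapshots, and remove one when no longer needed.
maprcli volume snapshot list -volume projects
maprcli volume snapshot remove -volume projects -snapshotname projects-2014-06-01
```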
MapR Snapshots are implemented using a redirect-on-write method, which provides protection without duplicating the data. In other words, you can take a snapshot of a 1 PB cluster in seconds with no additional data storage.
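The redirect-on-write idea can be illustrated with a toy sketch (this is a conceptual model, not MapR's actual on-disk structure): a snapshot freezes the current block map without copying data, and subsequent writes redirect the live map to new blocks, leaving the frozen map untouched.

```python
# Toy illustration of redirect-on-write. A "volume" maps block numbers to
# data; a snapshot is a frozen reference to the block map, so taking one
# copies no data. New writes redirect the live map to fresh blocks.

class Volume:
    def __init__(self):
        self.block_map = {}   # block number -> data (live view)
        self.snapshots = {}   # snapshot name -> frozen block map

    def write(self, block_no, data):
        # Redirect-on-write: build a new live map pointing at the new block;
        # any snapshot still references the old map (a real system would
        # update a tree in place rather than copy a dict).
        self.block_map = dict(self.block_map)
        self.block_map[block_no] = data

    def snapshot(self, name):
        # Capture the current map; no block data is duplicated.
        self.snapshots[name] = self.block_map

    def read(self, block_no, snapshot=None):
        view = self.snapshots[snapshot] if snapshot else self.block_map
        return view.get(block_no)

vol = Volume()
vol.write(0, b"v1")
vol.snapshot("before")
vol.write(0, b"v2")
print(vol.read(0))                     # b'v2' -- live view sees the new write
print(vol.read(0, snapshot="before"))  # b'v1' -- snapshot is unchanged
```

The key property this models is that snapshot creation cost is independent of data size, which is why a petabyte-scale volume can be snapshotted in seconds.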
Another important fact is that MapR implements snapshots directly in the storage layer, efficiently and fast. Any application that stores data in MapR benefits from snapshots out of the box. Moreover, because MapR Snapshots are atomic and consistent, applications see exactly the data as it existed at the moment the snapshot was taken. This is not true of other Hadoop distributions, as explained in the next section.
Snapshots are intended to provide point-in-time recovery, that is, to provide the ability to recover the data to a precise and consistent state in the past. MapR Snapshots do just that. HDFS and HBase snapshots, in contrast, do not provide consistency, and lack many other important capabilities.
- HDFS Snapshots
HDFS is an append-only file system, so implementing snapshots intuitively appears easy. However, the separation of data and metadata in HDFS, combined with the NameNode being a bottleneck in the system, makes consistent snapshots difficult or impossible to implement. As a result, HDFS snapshots took years to implement, are not consistent, and hence do not work with applications that were not specifically designed to support HDFS snapshots and their limitations.
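For reference, the HDFS snapshot workflow discussed here looks roughly like this; the directory and snapshot names are examples:

```shell
# An administrator must first mark a directory as snapshottable.
hdfs dfsadmin -allowSnapshot /data

# Create a named snapshot; it appears under the hidden .snapshot directory.
hdfs dfs -createSnapshot /data snap1
hdfs dfs -ls /data/.snapshot/snap1

# Delete the snapshot when no longer needed.
hdfs dfs -deleteSnapshot /data snap1
```

Note that this operates per directory rather than cluster-wide, and, as described below, it captures metadata only.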
Applications must be made snapshot-aware by calling a new HDFS API that sends up-to-date file-length information to the NameNode (SuperSync/SuperFlush). Designing such an application to work correctly is hard, since using SuperSync across a cluster can overwhelm the NameNode, causing the entire cluster to fail or bringing other processes to a halt. Moreover, applications that use snapshots must avoid modifying files while a snapshot is being created.
HDFS snapshots capture only metadata (on the NameNode), so they do not work correctly while files are being written. The NameNode cannot sustain even 1,000 metadata updates per second, so HDFS deliberately avoids sending the file length to the NameNode on every hsync/hflush.
The effect of this implementation is that files being written continue to change inside the snapshot. The snapshot is therefore not actually a snapshot: it can contain data written after it was taken, or fail to contain data that was written and flushed before it was taken.
- HBase Snapshots
Because of HBase's storage semantics, HBase snapshots cannot rely on the underlying HDFS snapshots and must be implemented separately. This is unlike MapR, where a single snapshot capability applies to all data in the cluster. HBase snapshots are also not consistent, because each RegionServer snapshots its own data at a different time. They exhibit the same problems as HDFS snapshots: they can contain transactions committed after the snapshot and miss transactions committed before it.
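For comparison, HBase table snapshots are taken from the HBase shell, separately from any HDFS snapshot; the table and snapshot names below are examples:

```shell
# Run HBase shell commands non-interactively (names are examples).
hbase shell <<'EOF'
snapshot 'mytable', 'mytable-snap'
list_snapshots
clone_snapshot 'mytable-snap', 'mytable-restored'
EOF
```

This mechanism works per table, and, as noted above, offers no cross-RegionServer consistency guarantee.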
Enterprise data storage solutions have offered consistent snapshots for years, and only MapR offers the same for Hadoop. MapR provides snapshots as first-class citizens of the distributed file system, enabling any kind of application to benefit from them out of the box, with no special application modifications. Last but not least, MapR Snapshots are battle-tested, having been used by customers in production across different verticals since 2011.