Snapshots come up in most technical discussions of enterprise-grade applications, and Hadoop applications are no exception. In this installment of the What Are The Facts video series, Jack Norris talks with industry analyst Donnie Berkholz to cut to the core of the snapshot issues in the context of Hadoop. Donnie is from RedMonk, a respected analyst firm and the first and only analyst firm focused on developers.
You can watch the video here:
Following is an excerpt from their conversation.
Jack Norris: Let's start first with what are snapshots?
Donnie Berkholz: Snapshots are a view of data, a window into what the data was at a certain point in time. There are a lot of reasons to use snapshots: recovering from user error or data corruption, taking backups, and even giving developers a sandbox to play around in, so you can create a snapshot and start working with it without destroying your main data set.
Jack Norris: I've actually heard that referred to as data versioning: teams can run different algorithms against a consistent set of data.
Donnie Berkholz: Exactly – there's a lot of value to the idea of applying version control to your data as well as your code, so that the two can stay in sync.
Jack Norris: How exactly do snapshots work?
Donnie Berkholz: The important thing to know about snapshots is that if they're going to work right, you have to really nail everything at a specific point in time. You can't wait for files to open and close and things to change, because you might end up with inconsistency across different files and across different nodes in the cluster. If you want the snapshot to work the way people expect it to work, you have to make sure you get the timestamps right across every file you're snapshotting.
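To make the point-in-time idea concrete, here is a minimal toy sketch in Python. All names here are illustrative only, not any real Hadoop or MapR API: the point is simply that a snapshot freezes every key in one atomic step, so writes made afterward never leak into it.

```python
import copy

class SnapshotStore:
    """Toy in-memory store illustrating point-in-time snapshots.

    A snapshot copies the whole store in a single step, so it reflects
    exactly one instant in time (hypothetical sketch, not a real API).
    """

    def __init__(self):
        self._data = {}
        self._snapshots = {}

    def write(self, key, value):
        self._data[key] = value

    def snapshot(self, name):
        # Freeze the entire store atomically; nothing lags behind.
        self._snapshots[name] = copy.deepcopy(self._data)

    def read_snapshot(self, name):
        return self._snapshots[name]


store = SnapshotStore()
store.write("a.log", "line 1")
store.snapshot("t0")
store.write("a.log", "line 2")   # mutation after the snapshot

print(store.read_snapshot("t0")["a.log"])  # prints "line 1", not "line 2"
```

Because the copy happens in one step, the snapshot `t0` stays internally consistent no matter how the live data changes afterward.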
Jack Norris: You mentioned consistency – what happens if the snapshots are inconsistent or are from different times?
Donnie Berkholz: If they're inconsistent, you have no idea what you're looking at any more. You're trying to do some analysis, and you've got data from different points in time. You think you're looking at a certain time when you're trying to do some analytics, for example, but you're not. You're looking at a wide spread of times, and so, any results you get from that might be meaningless.
Jack Norris: What about for certain application behavior?
Donnie Berkholz: Application behavior is very similar: if you're working with an inconsistent data set, applications can be surprised and unable to deal with it. In some cases, they may even crash.
Jack Norris: With respect to snapshots and Hadoop, let's look at... what are the facts? How do they work?
Donnie Berkholz: Snapshots in Hadoop today, in open-source Hadoop, work by capturing the state of the file when the file is closed. If files are kept open, and a snapshot is taken, the open files may be inconsistent with the closed ones on the same cluster.
Jack Norris: Can you define that in terms of point-in-time consistency?
Donnie Berkholz: In terms of having a snapshot of a certain point in time, what you're getting is a snapshot of the closed files at that point in time, and the open ones may lag behind by minutes or even hours.
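The difference between close-time and point-in-time capture can be sketched with a small toy model in Python. Everything here is hypothetical and for illustration only (it is not how either system is actually implemented): each file tracks its live contents and the contents as of its last close, and a close-time snapshot sees only the stale last-closed state of files that are still open.

```python
class File:
    """Toy file that remembers its state at the last close."""

    def __init__(self, contents):
        self.contents = contents          # live state
        self.closed_contents = contents   # state as of the last close
        self.is_open = False

    def open_for_append(self):
        self.is_open = True

    def append(self, text):
        self.contents += text

    def close(self):
        self.closed_contents = self.contents
        self.is_open = False


def close_time_snapshot(files):
    # Open files contribute their last-closed (possibly stale) data,
    # so the snapshot mixes data from different moments.
    return {name: f.closed_contents for name, f in files.items()}


def point_in_time_snapshot(files):
    # Every file contributes its state at this instant, open or closed.
    return {name: f.contents for name, f in files.items()}


files = {"a": File("a0"), "b": File("b0")}
files["b"].open_for_append()
files["b"].append("+new")        # written but file "b" is still open

print(close_time_snapshot(files))     # prints {'a': 'a0', 'b': 'b0'}
print(point_in_time_snapshot(files))  # prints {'a': 'a0', 'b': 'b0+new'}
```

In the close-time snapshot, file "b" lags behind the live data, which is exactly the inconsistency described above.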
Jack Norris: All right. How do snapshots work in other systems?
Donnie Berkholz: The typical approach to snapshots (it's used broadly across backup software, for example) is taking a snapshot of the data as it exists on the file system at that moment and making sure it's consistent across the entire file system.
Jack Norris: I guess this would be a good time to point out that's exactly how MapR works. MapR's snapshots work across a cluster, are instantaneous and consistent across all of the files regardless of their open or closed state.
Donnie Berkholz: Sounds like a good approach.
Jack Norris: Thank you for joining us, and the next time someone mentions point-in-time snapshots and Hadoop, make sure you ask: “What are the facts?”
What Are the Facts is a series of short videos where we talk to industry analysts and examine claims and issues surrounding Hadoop.