MapR-FS vs. HDFS: The 5-Minute Guide to Understanding Their Differences – Whiteboard Walkthrough

In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, explains the architectural differences between MapR-FS (the MapR File System) and the Hadoop Distributed File System (HDFS).

Here's the transcription:

Hello, I'm Dale Kim of MapR Technologies, and welcome to my Whiteboard Walkthrough. In this episode, I'd like to compare MapR-FS (the MapR File System) with the Hadoop Distributed File System (HDFS). MapR-FS, of course, is a core component of the MapR Converged Data Platform.

First, I'd like to talk about the similarities. Both technologies support the HDFS API, so any application that talks to that API will work in either environment. They're both distributed file systems, meaning they run on individual nodes that are tied together in a cluster, so you have one big cohesive unit for processing data. In any distributed environment, you need replication between the nodes to create copies of data; should one node go down, you still have a copy to ensure business continuity. Finally, both can take advantage of commodity hardware as well as locally attached disks, so you get the benefit of low cost along with the performance of having the disks attached directly to the servers.
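To make that API portability concrete, here is a minimal sketch in Java using the standard Hadoop FileSystem API. The namenode address, the maprfs URI, and the file path are placeholders; the point is that the same client code runs against either file system, with only the URI changing.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsApiPortability {
        public static void main(String[] args) throws Exception {
            // Only the URI changes between environments:
            // "hdfs://namenode:8020" targets HDFS, "maprfs:///" targets MapR-FS.
            String uri = args.length > 0 ? args[0] : "hdfs://namenode:8020";
            FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());

            Path file = new Path("/tmp/portability-demo.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("written through the common HDFS API\n");
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
        }
    }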

Let me talk about some of the architectural differences. If you're familiar with the HDFS architecture, you'll know about the NameNode concept: a separate server process that tracks the locations of files within your cluster. MapR-FS has no such concept, because that information is embedded within the data nodes themselves, distributed across the cluster. The difference is that with a NameNode, you generally need a separate server dedicated to that function. Historically, the NameNode has been a single point of failure, and while the community has done a great job of addressing that for high availability, it still requires a fair bit of configuration to set up. If scalability is a concern for you, where you want to handle millions or even hundreds of millions of files within the same cluster, the configuration becomes even more complex, and you end up with a number of servers dedicated to handling file system metadata.
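To illustrate the metadata role the NameNode plays, here is a small Java sketch, again against the common Hadoop API, that asks where a file's blocks live. On HDFS this request is answered by the NameNode process; on MapR-FS the equivalent lookup is served from metadata distributed across the data nodes. The file path is whatever you pass as the first argument.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationLookup {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0]));

            // On HDFS, this metadata request goes to the NameNode;
            // on MapR-FS, it is answered from metadata on the data nodes.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }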

The second architectural difference I want to talk about is that MapR-FS is written in native code and talks directly to disk. We've built a lot of optimizations into how it writes to disk that help with performance and scalability. Contrast that with HDFS, which is written in Java: it necessarily runs in the JVM and then talks to a Linux file system before it talks to the disks, so you have a few extra layers there that impact performance and scalability.

Now when it comes to read/write capabilities, MapR-FS fully supports the random read/write access you would expect from any file system: given a file, you can read from or write to any offset within it. Because of HDFS's batch-processing roots, it was really only designed for an append-only model, where, if a file already exists, you can add more data at the end. If you want to make any changes earlier in the file, you essentially have to rewrite the entire file with the change incorporated.
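Here is a hedged sketch of that difference in Java. The HDFS half uses the append call, which only adds bytes at the end of an existing file (and assumes appends are enabled on the cluster); the MapR-FS half assumes the file system is exposed at a POSIX mount path such as /mapr/<cluster>/ (a placeholder), where an ordinary RandomAccessFile can seek and overwrite in place.

    import java.io.RandomAccessFile;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadWriteSemantics {
        public static void main(String[] args) throws Exception {
            // HDFS: append-only. You can add bytes at the end of an existing
            // file (if appends are enabled), but there is no call to overwrite
            // bytes in the middle; an earlier change means rewriting the file.
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataOutputStream out = fs.append(new Path("/tmp/events.log"))) {
                out.writeBytes("appended record\n");
            }

            // A random read/write file system (e.g. MapR-FS through an NFS or
            // POSIX mount; the mount path below is a placeholder) lets you
            // seek to any offset and update in place:
            try (RandomAccessFile f = new RandomAccessFile(
                    "/mapr/my.cluster.com/tmp/events.log", "rw")) {
                f.seek(0);                        // jump to any offset
                f.writeBytes("updated header\n"); // overwrite in place
            }
        }
    }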

Both technologies support the Network File System protocol, or NFS, which is a great way to mount these file systems onto another server, but they handle it quite differently. MapR-FS has native NFS gateway support, so you can mount MapR-FS as if it were standard network-attached storage. It's a bit different with HDFS, for two notable reasons. The first is the append-only paradigm I mentioned. The second is that NFS doesn't guarantee that packets arrive in order over the network, so writes have to be reordered before they can be committed to an append-only file system. With the NFS gateway on HDFS, data is first written to temporary space on the Linux file system, which of course supports full read/write access. Once the file is completely written, it goes back through the gateway, which then writes it into HDFS. So you can imagine there's quite a bit of work involved in importing data over NFS into HDFS. With MapR-FS, by contrast, the NFS capabilities are fully real-time, so you can run any application that has file system requirements directly on MapR-FS.
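As a small illustration of what real-time NFS buys you: with MapR-FS mounted over NFS, cluster storage looks like an ordinary local directory, so a stock tool or a few lines of standard-library code can write into the cluster directly, with no staging step and no Hadoop client involved. Both paths below are placeholders.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class NfsIngest {
        public static void main(String[] args) throws Exception {
            Path source = Paths.get("/var/log/app/events.log");
            // A MapR-FS cluster mounted over NFS appears as a local directory.
            Path clusterDir = Paths.get("/mapr/my.cluster.com/ingest");

            // Plain java.nio file operations land directly in the cluster;
            // there is no temporary staging area as with the HDFS NFS gateway.
            Files.createDirectories(clusterDir);
            Files.copy(source, clusterDir.resolve("events.log"),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }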

Now for any environment where you have edge nodes running high-performance computing, or any other workload that needs fast and perhaps encrypted communication with your file system, we have what's known as the POSIX Client. It's an add-on technology that uses the NFS protocol so that your high-performance applications can write into MapR-FS. It compresses the data, encrypts it, and talks to the MapR-FS cluster in parallel, so you get strong performance and security, and you can run any type of mission-critical workload while using MapR-FS as network-attached storage.

Hopefully that gives you a good sense of the architectural differences between MapR-FS and HDFS. If you have any questions or comments, feel free to comment below. Of course, if you have thoughts on other topics you'd like us to cover, please comment on that as well. I hope you enjoyed this walkthrough. Thanks for watching!


