MapR NFS vs. HDFS NFS – What would you use? – Whiteboard Walkthrough

In this week's Whiteboard Walkthrough, Bruce Penn, Sales Engineer at MapR, explains how NFS access works on MapR-FS vs. HDFS, and helps you decide what you would use in production.

Here's the unedited transcription: 

Hello, my name is Bruce Penn and I'm part of the Bay Area sales team for MapR. Today I'm going to discuss the difference between MapR NFS and HDFS NFS, get into the different architectures, and explain why the MapR architecture is superior. It's more robust, it's more efficient, and it's actually used by many enterprises, by pretty much 100% of our customers.

What I have here in red is the MapR architecture, and what's in green is the HDFS architecture for NFS. On the left-hand side, you'll see that I have application servers. That could be a web server or an app server of any kind sending over log files, so there's a lot of log file generation that gets pushed into Hadoop. I have an edge node, which is very common, where you'll run the NFS gateway as well as have security set up so that you can't log in directly to the cluster from a client node; you have to go through your edge node. Then, of course, there are the cluster nodes, and if you notice over here, MapR is just all data nodes. With HDFS, they have a name node. MapR does not; it's not necessary.

Let's get into the value of MapR NFS and, again, why it's superior. Let's go through a use case where I have an application server and I'm going to push a 5GB log file into the cluster via the edge node. With MapR, the app server can just mount the edge server and write its data into a path that starts with /mapr, then a cluster name, then a path. It's very easy for these application and web servers to just say, "I'm going to write to a path." As that log file is written in, the NFS gateway basically becomes a pass-through. It converts the NFS protocol to the MapR file system protocol, compresses the file, chunks it up, and sends it over to the data nodes. This 5 gig could ultimately be maybe only 2 gig as it's pushed into the cluster. It's very clean, very scalable, and, again, used by almost, let's just say, 100% of MapR customers.
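To make the "just write to a path" point concrete, here is a minimal Python sketch of what an application server might do once the NFS export is mounted. It is an illustration under assumptions, not MapR's tooling: the cluster name, the log path, and the helper names cluster_path and ship_log are hypothetical, and a temporary directory stands in for the real mount point so the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def cluster_path(mount_root, cluster_name, relative_path):
    """Build a <mount root>/<cluster name>/<path> destination.

    relative_path must not start with "/" or it would replace the root.
    """
    return Path(mount_root) / cluster_name / relative_path


def ship_log(local_log, dest):
    """Copy a log file to the mounted path using ordinary file I/O."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(local_log, "rb") as src, open(dest, "wb") as out:
        # Stream in 1 MiB pieces so a multi-gigabyte log is never held in memory.
        shutil.copyfileobj(src, out, length=1024 * 1024)
    return dest


def main():
    # Assumption: a temporary directory plays the role of the /mapr mount.
    with tempfile.TemporaryDirectory() as tmp:
        mount_root = Path(tmp) / "mapr"
        local_log = Path(tmp) / "access.log"
        local_log.write_bytes(b"GET /index.html 200\n" * 3)

        # Hypothetical cluster name and log location.
        dest = cluster_path(mount_root, "my.cluster.com", "logs/web01/access.log")
        ship_log(local_log, dest)
        print("wrote", dest.relative_to(tmp))


if __name__ == "__main__":
    main()
```

The application side needs no Hadoop client library in this sketch; it only uses standard file operations against a mounted path, which is the convenience the talk is describing.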

Let's compare that with HDFS NFS. As you can see, it's a similar NFS client pushing in a 5GB file, and you have the HDFS NFS gateway on the edge node. The issue is that NFS requires a read/write file system: when data is written over the NFS protocol, it can arrive out of order, so it needs a file system that can reorder that data, and that can only be done by a read/write file system, which MapR is. HDFS is an append-only file system. So as the data comes into the NFS gateway, the gateway has to write that 5GB file down to the local Linux file system, which is read/write; oftentimes that's ext4 or XFS.

Now I've got this extra hop where I have to write the data down. I've got to make sure there's enough space in /tmp, because that's actually where the gateway writes it. You've got to make sure you've got plenty of room in your /tmp, so it's taking up extra storage just to use NFS. That data gets written down, gets reordered, and then, behind the scenes, a Hadoop put is used to push that data into the cluster.
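The reordering problem can be sketched in a few lines of Python. This is a toy model of the flow described in the talk, not the actual gateway code: AppendOnlyFile is a hypothetical stand-in for an append-only store, and the function names are made up for illustration. Writes arrive as (offset, data) pairs in the wrong order, get placed by offset in a local read/write file, and are then pushed sequentially.

```python
import os
import tempfile


def stage_out_of_order(writes, staging_path):
    """Apply (offset, data) writes in arrival order to a local read/write file."""
    with open(staging_path, "wb") as f:
        for offset, data in writes:
            f.seek(offset)  # random access: only possible on a read/write file system
            f.write(data)


class AppendOnlyFile:
    """Toy append-only store: data can only be added at the end."""

    def __init__(self):
        self._chunks = []

    def append(self, data):
        self._chunks.append(bytes(data))

    def contents(self):
        return b"".join(self._chunks)


def push_sequentially(staging_path, target, chunk_size=4):
    """Read the staged file front to back and append it to the target."""
    with open(staging_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            target.append(chunk)


if __name__ == "__main__":
    # The second half of the file arrives before the first half.
    arrivals = [(6, b"world"), (0, b"hello ")]

    naive = AppendOnlyFile()
    for _, data in arrivals:
        naive.append(data)  # appending in arrival order scrambles the file
    print("append in arrival order:", naive.contents())

    with tempfile.TemporaryDirectory() as tmp:
        staging = os.path.join(tmp, "staged.bin")
        stage_out_of_order(arrivals, staging)
        target = AppendOnlyFile()
        push_sequentially(staging, target)
        print("staged, then appended:  ", target.contents())
```

The staging file is the extra hop: every byte is written once locally and then read back and written again to the append-only target.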

What are the problems with this? The problem is scalability. If you have hundreds of users or hundreds of application servers or clients writing log files in, and it doesn't even have to be hundreds, it can be tens, the gateway is constantly writing down to the local file system. You've got to make sure there's enough space. It's going to take time. It's not going to be real time or near real time, like the MapR NFS server is. It's just not scalable, and it's just not used in production today.
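A rough back-of-the-envelope calculation shows how the staging space adds up. The 5GB file size comes from the example in the talk; the client counts of 10 and 100 are illustrative stand-ins for "tens" and "hundreds", and the assumption that every upload is in flight at the same moment is a worst case, not a measurement.

```python
def peak_staging_gb(concurrent_uploads, file_size_gb):
    """Worst-case local staging space if every upload is staged at once."""
    return concurrent_uploads * file_size_gb


if __name__ == "__main__":
    for clients in (10, 100):  # illustrative "tens" and "hundreds" of clients
        needed = peak_staging_gb(clients, 5)
        print(f"{clients} clients x 5 GB each: up to {needed} GB of staging space")
```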

Another point to be made is that the MapR NFS server was part of MapR 1.0. It was designed in from the ground up, again, because MapR is a fully read/write file system. The HDFS NFS gateway was added several years later as an add-on to catch up to MapR, but as you can see, the architectures are very different. MapR's architecture for NFS is far superior, and that's why so many MapR customers use it. It greatly improves productivity, cuts application development time, and simplifies pushing data into a MapR cluster, whereas the NFS gateway for HDFS is very rarely used and can really only be used in small environments, not true enterprise environments.

Thank you very much for listening to this Whiteboard Walkthrough.
