If you’ve had a chance to work with Hadoop or Spark a little, you probably already know that HDFS doesn't support full random read-writes or many other capabilities typically required in a production-ready file system. Attempting to use HDFS as an enterprise filestore for landing and moving data through transformation pipelines means trying to get various hacks to work or employing additional nodes acting as gateways, getting you only halfway there.
This post *isn’t* about how MapR solves the problem or how it offers a much easier way of landing data than our competitors (even though it does). Instead, I’ll walk through a concrete example that illustrates the benefits of MapR-FS in a way you can relate to your own workflows. It also serves as a quick "how-to" for another way to leverage all that big storage in a MapR cluster, whether you already have one running or are considering it.
Using a MapR Cluster as a Datastore for VMware ESXi
Gartner estimates that around 75% of server workloads are now virtualized. If you have VMware in your datacenter, you might be aware that it supports the use of datastores over NFS (as well as iSCSI, but let’s leave that for another article). Supporting this in a big environment is often a multi-faceted, complex effort: you have to plan ahead for backups, keep enough space and cycles for snapshots, and handle both the known and unknown scale-out behaviors.
Virtual machine files and their associated disks (one or more .vmdk files) also tend to be large and consume a lot of space. VM "sprawl" and the sheer space required to host large environments can compound the problem.
What if you could put all that scalable, fast, transparently compressed, replicated space in a MapR cluster to work by using it as a datastore in ESXi? Of course you can! You can even configure volumes and policies that make it pretty easy to dedicate a portion or all of the storage to your virtualized store.
One last thing before we jump in: this particular example involves VMware, but if you are a Docker user, check out the recent announcements around our platform’s unique ability to provide a data services layer for Docker containers.
Let's start hosting some virtual machines.
By the way, don't try this with just any Hadoop cluster, as you'll quickly find it doesn't work!
In this example we'll use the following components:
- A 10-node Hadoop cluster running MapR and YARN. The cluster has NFS enabled.
- An ESXi server running ESXi-6.0.0-2494585-standard
- The individual node hardware consists of HP ProLiant DL380 G6 servers with 12 cores per node (Xeon X5670), six 7200 RPM disks, and 128 GB of RAM per node.
This is a snapshot of the setup we used in our lab. The same procedure will most likely work with any version of ESXi that supports NFS and any recent (4.x or later) MapR version. The same goes for the hardware.
Video Demo and Tutorial
Watch the video below for a live example of how this works, or follow the steps in the next section for the quick version.
1) In the MapR MCS console, configure one or more volumes (by selecting Volumes on the left side, then New Volume) to hold the VM data.
2) Ensure that NFS services are configured on at least one node. Consult the MapR documentation for the setup steps.
3) In the VMware vSphere Client (or web interface), select the ESXi server where you want to mount the datastore.
4) Select the ‘Configuration’ tab, then ‘Add Storage’ on the far right side.
5) Select ‘NFS’, then enter the name or address of the MapR server running NFS services. In this example the server is 10.200.1.101. For ‘Folder’, enter the mount point of the volume you just created.
You’re done! The datastore will now appear in the main list of datastores.
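If you prefer scripting to clicking, the same steps can be approximated from the command line. The sketch below only builds the commands and prints them; it doesn't run anything. It assumes `maprcli volume create` on a cluster node for step 1 and `esxcli storage nfs add` on the ESXi host for steps 3 to 5. The volume name `vmstore`, the cluster name `my.cluster.com`, and the datastore label `mapr-vmstore` are hypothetical placeholders, and the `/mapr/<cluster>` export path is an assumption about the cluster's NFS configuration, so check them against your own setup.

```python
import shlex


def maprcli_volume_create(name, mount_path):
    """Step 1: command to create a MapR volume mounted at mount_path."""
    return ["maprcli", "volume", "create", "-name", name, "-path", mount_path]


def nfs_export_path(cluster_name, mount_path):
    """Path of the volume as seen over NFS.

    Assumes the cluster is exported under /mapr/<cluster_name>;
    adjust this if your NFS exports are configured differently.
    """
    return "/mapr/{}/{}".format(cluster_name, mount_path.strip("/"))


def esxcli_nfs_add(nfs_host, share, datastore_name):
    """Steps 3-5: command to mount the NFS share as an ESXi datastore."""
    return [
        "esxcli", "storage", "nfs", "add",
        "--host={}".format(nfs_host),
        "--share={}".format(share),
        "--volume-name={}".format(datastore_name),
    ]


# Hypothetical names; 10.200.1.101 is the NFS node from the example above.
share = nfs_export_path("my.cluster.com", "/vmstore")
print(shlex.join(maprcli_volume_create("vmstore", "/vmstore")))
print(shlex.join(esxcli_nfs_add("10.200.1.101", share, "mapr-vmstore")))
```

Run the first printed command on a cluster node and the second on the ESXi host (over SSH, for instance). Building the commands as argument lists keeps quoting safe if you later pass them to `subprocess.run`.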
You can now use the distributed storage in the MapR cluster as you would any other NFS-mounted datastore (for both read-only and random read-write applications).
This example highlights the benefits of having an underlying random read/write capable filesystem for your data platform: you can leverage it in ways that are highly compatible with your existing environment and applications. Having MapR-FS at the center of the architecture allows you to harness distributed storage capacity and enterprise filesystem capabilities as part of your virtualized infrastructure or enable additional services on top of your cluster.
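To make "random read/write" concrete, here is a minimal sketch of the kind of access a virtual disk depends on: overwriting bytes in the middle of an existing file and reading them back. It uses only the Python standard library. The helper names are hypothetical, and the directory you pass in is assumed to be the NFS-mounted volume (for example, a hypothetical `/mapr/my.cluster.com/vmstore`).

```python
import os
import tempfile


def random_write_roundtrip(path, offset, payload):
    """Overwrite bytes in place at `offset`, then read them back."""
    with open(path, "r+b") as f:      # open for update without truncating
        f.seek(offset)
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())          # push the write through to storage
        f.seek(offset)
        return f.read(len(payload))


def check_random_rw(directory, size=4096, offset=1024, payload=b"vmdk-block"):
    """Create a scratch file in `directory`, rewrite its middle, clean up.

    Returns True if the bytes read back match and the file size is unchanged.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * size)     # pre-allocate a zero-filled file
        readback = random_write_roundtrip(path, offset, payload)
        return readback == payload and os.path.getsize(path) == size
    finally:
        os.remove(path)
```

Calling `check_random_rw("/mapr/my.cluster.com/vmstore")` from a machine with the volume mounted should return `True` on a filesystem that supports in-place updates. A store that only allows appends can't satisfy this access pattern, which is why it matters for hosting .vmdk files.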
With a few simple steps you immediately get access to a huge amount of ready-to-use space that can be efficiently snapshotted, compressed and even mirrored to other locations as part of a DR strategy. In this article, I didn't touch upon other uses of the data (in an analytical pipeline, for example) but many are possible. If you have use cases like this, leave a comment for us below.
Give this a run in your own VMware environment with these next steps:
- Fire up the quick installer for a free, unlimited production use version of the MapR Converged Data Platform. Take a look here to compare editions of the platform.
- For a single-node example, download the sandbox (for either VMware or VirtualBox) and have a single-node Hadoop cluster running in just a few minutes.