How to Use MapR Volumes to Manage Your Big Data

This is the second post in a series about the MapR Command Line Interface. The first post showed you how to use the command line to review your cluster nodes, see which services were running, and manage them. This second post will introduce you to the notion of MapR Volumes, how to create and modify them, and how they can help you manage your big data.

What Are MapR Volumes?

MapR Volumes are well documented in our online docs and elsewhere, but here’s a brief description of what they are and how they are incredibly helpful.

Volumes are a key management tool that makes organizing and managing data in a Hadoop cluster incredibly easy. This tool is unique to the MapR File System and does not exist in any other Hadoop distribution. Think of a “volume” not as a physical entity, but as a logical entity that allows you to manage data from a system administration perspective. I like to think of it as a directory with superpowers that let you manage a whole set of features in your “directory.” Volumes allow you to manage replication factors, ownership, and quotas, and they even give you the ability to mirror and snapshot your data. On a more advanced level, you can use volumes to isolate data to specific nodes, which makes it possible to separate “cold” and “hot” data in terms of compute and storage resources.
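As a quick taste of what that looks like in practice, here is a minimal sketch of snapshotting a volume from the command line. The volume name “myVolume” is just a placeholder (we will create real volumes later in this post), and the exact options can vary by release, so treat this as illustrative rather than definitive:

# Take a point-in-time snapshot of a volume (the volume name is hypothetical)
[mapr@ip-10-0-10-56 ~]$ maprcli volume snapshot create -volume myVolume -snapshotname myVolume-2016-01-26

# List existing snapshots on the cluster
[mapr@ip-10-0-10-56 ~]$ maprcli volume snapshot list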

Let’s take a look at some concrete examples that demonstrate the advantages of this “super directory”:

The first step is to think about how you want to lay out your data. If you have tens or hundreds of developers working on individual projects and generating their own data, you may want to create a volume for each user in order to manage them. However, if you have one team that works on new projects every week or month, then you can create a volume per project instead of per user. But what if you’re ingesting terabytes of data on a daily basis, and you only process one day’s worth of data? Then you can create a new volume every day to store that day’s data (see the sketch below). The idea here is that volumes can be used however you want, but it is important to think about how you want to organize and lay out your data first, because this will pay off big time as you expand your big data platform over the years.
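Jumping ahead slightly to the commands covered in the next section, a volume-per-day layout is easy to script. This is only a sketch: it assumes a parent /data directory already exists in the cluster, and the volume and path names are purely illustrative:

# Create and mount a date-stamped volume for today's ingest (names and paths are hypothetical)
[mapr@ip-10-0-10-56 ~]$ today=$(date +%Y-%m-%d)
[mapr@ip-10-0-10-56 ~]$ maprcli volume create -name data.$today -path /data/$today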

Create and Manage Volumes Using “maprcli volume”

A MapR cluster comes with certain system volumes out of the box, so even if you haven’t explicitly created volumes, there are some pre-existing ones that you can quickly view via the maprcli (or MCS). Using the same trick from Part I of this series, we can see the options of our maprcli command. Right now we just want to see the existing system volume names, so all we need in our query is the “volumename” column:

[mapr@ip-10-0-10-56 ~]$ maprcli volume list -columns volumename
volumename
mapr.cldb.internal
mapr.cluster.root
mapr.configuration
mapr.hbase
mapr.ip-10-0-10-56.us-west-1.compute.internal.local.audit
mapr.ip-10-0-10-56.us-west-1.compute.internal.local.logs
mapr.ip-10-0-10-56.us-west-1.compute.internal.local.mapred
mapr.ip-10-0-10-56.us-west-1.compute.internal.local.metrics
mapr.ip-10-0-10-57.us-west-1.compute.internal.local.audit
mapr.ip-10-0-10-57.us-west-1.compute.internal.local.logs
mapr.ip-10-0-10-57.us-west-1.compute.internal.local.mapred
mapr.ip-10-0-10-57.us-west-1.compute.internal.local.metrics
mapr.ip-10-0-10-58.us-west-1.compute.internal.local.audit
mapr.ip-10-0-10-58.us-west-1.compute.internal.local.logs
mapr.ip-10-0-10-58.us-west-1.compute.internal.local.mapred
mapr.ip-10-0-10-58.us-west-1.compute.internal.local.metrics
mapr.metrics
mapr.opt
mapr.resourcemanager.volume
mapr.tmp
mapr.var
users

Note: You don’t need to understand the exact purpose of each of these volumes, but they were all created by the MapR File System for a reason; unless you have an advanced understanding of MapR-FS, do not delete or modify them.
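That said, it is perfectly safe to look at them. As a hedged example, the “maprcli volume info” command prints the full set of properties for a single volume without changing anything (the output is quite long, so it is omitted here):

# Inspect a system volume read-only; this does not modify it
[mapr@ip-10-0-10-56 ~]$ maprcli volume info -name mapr.cluster.root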

Now that we can see the system volumes, let’s create our own new ones. As mentioned earlier, these volumes can control a lot of different things, which means that there can be a lot of possible configuration parameters. However, to just create a volume, all you need is a volume name. Let’s name it “volumeOne”:

[mapr@ip-10-0-10-56 ~]$ maprcli volume create -name volumeOne

MapR will not show you any output in the terminal if the volume is created successfully, but if there is an error, it will let you know. For example, running the same command a second time will fail, because a volume with that name already exists:

[mapr@ip-10-0-10-56 ~]$ maprcli volume create -name volumeOne
ERROR (17) -  Volume Name volumeOne, Already In Use

Running the first command (maprcli volume list -columns volumename) will show us all the volumes, including the new volumeOne, but where was this volume created? Currently, this volume cannot be found in the directory tree of the file system, so it is inaccessible. In order to ingest, access, or play with data in a volume, it is necessary to mount it in the MapR File System. Let’s mount volumeOne at the topmost root (“/”) directory, using the same name:

[mapr@ip-10-0-10-56 ~]$ maprcli volume mount -name volumeOne -path /volumeOne

Now we can quickly find volumeOne in the MapR-FS:

[mapr@ip-10-0-10-56 ~]$ hadoop fs -ls /
Found 7 items
drwxr-xr-x   - mapr mapr          0 2016-01-11 19:58 /apps
drwxr-xr-x   - mapr mapr          0 2016-01-11 19:58 /hbase
drwxr-xr-x   - mapr mapr          0 2016-01-11 20:01 /opt
drwxrwxrwx   - mapr mapr          3 2016-01-16 17:10 /tmp
drwxr-xr-x   - mapr mapr          2 2016-01-11 20:12 /user
drwxr-xr-x   - mapr mapr          1 2016-01-11 19:59 /var
drwxr-xr-x   - mapr mapr          0 2016-01-25 19:10 /volumeOne

As you probably guessed, the system volumes we saw earlier are actually mapped to “directories” that MapR has created out of the box. We can see these mappings by adding the mount directory (mountdir) column to the volume list:

[mapr@ip-10-0-10-56 ~]$ maprcli volume list -columns volumename,mountdir
mountdir                                                          volumename
                                                                  mapr.cldb.internal
/                                                                 mapr.cluster.root
/var/mapr/configuration                                           mapr.configuration
/hbase                                                            mapr.hbase
/var/mapr/local/ip-10-0-10-56.us-west-1.compute.internal/audit    mapr.ip-10-0-10-56.us-west-1.compute.internal.local.audit
/var/mapr/local/ip-10-0-10-56.us-west-1.compute.internal/logs     mapr.ip-10-0-10-56.us-west-1.compute.internal.local.logs
/var/mapr/local/ip-10-0-10-56.us-west-1.compute.internal/mapred   mapr.ip-10-0-10-56.us-west-1.compute.internal.local.mapred
/var/mapr/local/ip-10-0-10-56.us-west-1.compute.internal/metrics  mapr.ip-10-0-10-56.us-west-1.compute.internal.local.metrics
/var/mapr/local/ip-10-0-10-57.us-west-1.compute.internal/audit    mapr.ip-10-0-10-57.us-west-1.compute.internal.local.audit
/var/mapr/local/ip-10-0-10-57.us-west-1.compute.internal/logs     mapr.ip-10-0-10-57.us-west-1.compute.internal.local.logs
/var/mapr/local/ip-10-0-10-57.us-west-1.compute.internal/mapred   mapr.ip-10-0-10-57.us-west-1.compute.internal.local.mapred
/var/mapr/local/ip-10-0-10-57.us-west-1.compute.internal/metrics  mapr.ip-10-0-10-57.us-west-1.compute.internal.local.metrics
/var/mapr/local/ip-10-0-10-58.us-west-1.compute.internal/audit    mapr.ip-10-0-10-58.us-west-1.compute.internal.local.audit
/var/mapr/local/ip-10-0-10-58.us-west-1.compute.internal/logs     mapr.ip-10-0-10-58.us-west-1.compute.internal.local.logs
/var/mapr/local/ip-10-0-10-58.us-west-1.compute.internal/mapred   mapr.ip-10-0-10-58.us-west-1.compute.internal.local.mapred
/var/mapr/local/ip-10-0-10-58.us-west-1.compute.internal/metrics  mapr.ip-10-0-10-58.us-west-1.compute.internal.local.metrics
/var/mapr/metrics                                                 mapr.metrics
/opt                                                              mapr.opt
/var/mapr/cluster/yarn                                            mapr.resourcemanager.volume
/tmp                                                              mapr.tmp
/var/mapr                                                         mapr.var
/user                                                             users
/volumeOne                                                        volumeOne

Let’s download a sample dataset (the airline on-time performance data from http://stat-computing.org/dataexpo/2009/the-data.html) to test how volumes can help manage data:

[mapr@ip-10-0-10-56 mapr]$ cd /mapr/MapR-5.0.0-RHEL-6.5/volumeOne/

[mapr@ip-10-0-10-56 volumeOne]$ mkdir flights_dataset

[mapr@ip-10-0-10-56 volumeOne]$ cd flights_dataset

[mapr@ip-10-0-10-56 flights_dataset]$ for year in 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008; do wget http://stat-computing.org/dataexpo/2009/$year.csv.bz2; done

[mapr@ip-10-0-10-56 flights_dataset]$ bzip2 -d /mapr/MapR-5.0.0-RHEL-6.5/volumeOne/flights_dataset/*.csv.bz2

[mapr@ip-10-0-10-56 volumeOne]$ maprcli volume list -columns logicalUsed,volumename,mountdir -filter volumename==volume*
mountdir    logicalUsed  volumename
/volumeOne  11491        volumeOne

We now have about 11.5 GB of raw data in the Hadoop file system, but MapR compresses data whenever it can, so the space actually used in the volume is only around 5 GB. If you’ve worked with other Hadoop distributions, you may have noticed that we interacted with the distributed file system just as if it were an extension of the Linux file system. We used all the same commands, without any special Hadoop APIs. Additionally, you may have noticed that our volumeOne volume was treated just the same way as a directory: we created a directory inside it, and volumeOne appears in the path just like any other directory. These are incredibly powerful and indispensable advantages, thanks to the architecture behind MapR-FS.
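If you want to see the compression savings for yourself, put the logical and physical usage columns side by side. This sketch reuses the same volume list columns that appear elsewhere in this post (logicalUsed and used, both reported in MB); your numbers will vary:

# Compare logical (pre-compression) and actual (post-compression) usage
[mapr@ip-10-0-10-56 ~]$ maprcli volume list -columns volumename,logicalUsed,used -filter volumename==volume*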

So far, the only thing we have done with volumeOne is mount it and ingest data; we haven’t modified any of its properties. Now let’s create a couple more volumes and reorganize our data: we’ll create a “projects” parent volume and mount the other volumes under /projects. Additionally, we will set quotas and a higher replication factor on the second volume.

# Create a new volume that will hold volumeOne and volumeTwo

[mapr@ip-10-0-10-56 ~]$ maprcli volume create -name projects -path /projects

# In order to rearrange the data, we will unmount and remount volumeOne

[mapr@ip-10-0-10-56 ~]$ maprcli volume unmount -name volumeOne
[mapr@ip-10-0-10-56 ~]$ maprcli volume mount -name volumeOne -path /projects/volumeOne

# Create volumeTwo with replication = 4, an advisory quota of 3 GB, and a hard quota of 5 GB

[mapr@ip-10-0-10-56 ~]$ maprcli volume create -name volumeTwo -path /projects/volumeTwo -replication 4 -advisoryquota 3G -quota 5G

# List the volumes we created and the relevant properties

[mapr@ip-10-0-10-56 ~]$ maprcli volume list -columns used,volumename,mountdir,advisoryquota,quota,numreplicas -filter [volumename=="volume*||projects"]
quota  mountdir             numreplicas  used  volumename  advisoryquota
0      /projects            3            0     projects    0
0      /projects/volumeOne  3            5296  volumeOne   0
5120   /projects/volumeTwo  4            0     volumeTwo   3072

# Confirm the mount paths of each volume using “hadoop fs” and standard Linux commands via NFS:

[mapr@ip-10-0-10-56 projects]$ ls -l /mapr/MapR-5.0.0-RHEL-6.5/projects/
total 1
drwxr-xr-x. 3 mapr mapr 1 Jan 25 21:18 volumeOne
drwxr-xr-x. 2 mapr mapr 0 Jan 26 00:55 volumeTwo

[mapr@ip-10-0-10-56 projects]$ hadoop fs -ls /projects
Found 2 items
drwxr-xr-x   - mapr mapr          1 2016-01-25 21:18 /projects/volumeOne
drwxr-xr-x   - mapr mapr          0 2016-01-26 00:55 /projects/volumeTwo

Just to change things up, the replication factor of volumeTwo was set to 4, its advisory quota to 3 GB, and its hard quota to 5 GB. As you can see from volumeOne and projects, there are no quotas by default, and replication is set to 3. To test the advisory quota, a little over 3 GB of data was copied into volumeTwo, and sure enough, the MCS displayed an alarm (which can be configured to email the system administrator), and the maprcli reflects it as well:

[mapr@ip-10-0-10-56 ~]$ maprcli volume list -columns used,volumename,mountdir,advisoryquota,quota,numreplicas -filter  [volumename=="volume*||projects"]
quota  mountdir             numreplicas  used  volumename  advisoryquota
0      /projects            3            0     projects    0
0      /projects/volumeOne  3            5296  volumeOne   0
5120   /projects/volumeTwo  4            3174  volumeTwo   3072

[mapr@ip-10-0-10-56 ~]$ maprcli alarm list
1            Volume usage exceeded advisory quota. Used: 3.09 GB Advisory Quota : 3 GB  volumeTwo  VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED   1453789162871

As we can see, the advisory quota raises an alarm, but it doesn’t stop data from being written to volumeTwo. The alarm is, however, a good heads-up that it’s time to clean up the data or request a larger quota (if that makes sense). If we push a little more data into volumeTwo, we soon hit the hard quota: the file being copied when the limit is reached is allowed to finish, but no further data can be ingested:

[mapr@ip-10-0-10-56 projects]$ touch /mapr/MapR-5.0.0-RHEL-6.5/projects/volumeTwo/touch1
touch: cannot touch `/mapr/MapR-5.0.0-RHEL-6.5/projects/volumeTwo/touch1': Disk quota exceeded

[mapr@ip-10-0-10-56 projects]$ hadoop fs -touchz /projects/volumeTwo/touch1
2016-01-26 01:42:02,7923 ERROR Client fs/client/fileclient/cc/client.cc:1240 Thread: 19563 Create failed for file touch1, error Disk quota exceeded(122) fid 2127.16.2
touchz: Create failed for file: /projects/volumeTwo/touch1, error: Disk quota exceeded (122)
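If the quota genuinely needs to grow rather than the data needing a cleanup, the limits can be raised in place with “maprcli volume modify”; no unmount or data movement is required. Consider this a hedged sketch; the new values below are purely illustrative:

# Raise the advisory and hard quotas on volumeTwo (new limits are illustrative)
[mapr@ip-10-0-10-56 ~]$ maprcli volume modify -name volumeTwo -advisoryquota 6G -quota 10G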

In this second part of the blog series about the MapR Command Line, you learned what MapR volumes are and how they can be used to manage data in your big data platform. You also learned the actual commands needed to create, mount, and modify the characteristics of these volumes.

To learn more about advanced volume usage via the mapr command line, please stay tuned for the next blog post.

Do you have any questions about how to use the MapR Command Line? Ask them in the comments section below.
