RHadoop and MapR technical brief now available
If you are a data analyst or statistician familiar with the R programming language and you want to use Hadoop to run MapReduce jobs or access HBase tables, Revolution Analytics has created RHadoop to make your life easier. You can keep using your existing R programs and add MapReduce and HBase functionality to them. You get all the statistical analysis capabilities of your R environment together with the enterprise-grade, massively scalable distributed computing provided by MapR’s Hadoop distribution.
RHadoop is a collection of three R packages that let you run MapReduce jobs entirely from within R and give you access to Hadoop files and HBase tables. Revolution Analytics has a great tutorial on running MapReduce jobs from within R, but it isn’t entirely clear what needs to be installed on client systems versus MapR cluster nodes for each of the RHadoop packages. See our paper “RHadoop and MapR” for full details and step-by-step setup instructions, but in a nutshell:
The rmr2 package uses Hadoop streaming to invoke R on individual tasktracker nodes, so R and the rmr2 package need to be installed on the client machine from which you run R as well as on all the tasktracker nodes in your MapR cluster. Once they are installed, just set the environment variables that point to the hadoop command and the Hadoop streaming jar, and you can run R MapReduce jobs on your MapR cluster.
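As a rough illustration of that setup, here is a minimal sketch of a first rmr2 job. The two file paths are assumptions and will differ with your MapR version and install location; HADOOP_CMD and HADOOP_STREAMING are the environment variables rmr2 reads.

```r
# ASSUMED paths: replace with the hadoop command and streaming jar
# actually installed on your client machine.
Sys.setenv(HADOOP_CMD = "/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar")

library(rmr2)

# Write a small vector to the cluster's file system.
ints <- to.dfs(1:10)

# Map-only job: emit each value as the key and its square as the value.
squares <- mapreduce(
  input = ints,
  map = function(k, v) keyval(v, v^2)
)

# Pull the result back into the local R session.
from.dfs(squares)
```

If the job fails on the cluster but the client-side calls succeed, a missing R or rmr2 installation on one of the tasktracker nodes is a likely place to look first.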
The rhdfs package provides a client interface to files on your MapR cluster through the hadoop command. Unlike rmr2, it only needs to be installed on the client machine where you are running R. This machine does need to have the MapR client software installed and be configured to access your cluster. As long as you can run “hadoop fs” commands from the shell, you can use rhdfs. Note that unlike other Hadoop distributions, MapR allows you to mount directories from your Hadoop cluster right on your client machine. If you do that, you can just access your Hadoop files from R like any other local file and you can bypass rhdfs entirely.
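The two access routes described above might look like the following sketch. The HADOOP_CMD path, the mount point, and the file name are hypothetical placeholders; only the rhdfs function names come from the package itself.

```r
# Route 1: go through rhdfs, which shells out to the hadoop command.
# ASSUMED path: point this at your MapR client's hadoop command.
Sys.setenv(HADOOP_CMD = "/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop")

library(rhdfs)
hdfs.init()    # must be called once before other rhdfs functions
hdfs.ls("/")   # list the root directory of the cluster file system

# Route 2: if the cluster is mounted on the client, skip rhdfs and use
# ordinary R file functions. The mount point and file are hypothetical.
sales <- read.csv("/mapr/my.cluster.com/user/analyst/sales.csv")
head(sales)
```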
The rhbase package accesses HBase via the HBase Thrift server, which is included in the MapR HBase distribution. The rhbase package is a Thrift client that sends requests to the Thrift server and receives its responses. The Thrift server listens for rhbase’s Thrift requests and in turn uses the HBase HTable Java class to access HBase. For an R developer, this is all transparent! For simplicity, rhbase defaults to using a local Thrift server on the machine where R and rhbase are installed. This is a client machine where you would run the HBase shell. Since rhbase is a client-side technology, it only needs to be installed on the client system that will access the MapR HBase cluster. Nothing additional needs to be installed on your HBase cluster nodes.
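A minimal sketch of connecting from R, assuming a Thrift server is already running on the client machine and listening on its usual default port (commonly 9090):

```r
library(rhbase)

# With no arguments, hb.init() connects to a Thrift server on the local
# machine. This assumes that server has already been started.
hb.init()

# List the HBase tables visible through the Thrift server.
hb.list.tables()
```

If your Thrift server runs on another host, hb.init() also accepts host and port arguments. Check the rhbase documentation for the version you have installed before relying on them.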
Click here to download the “RHadoop and MapR” paper and start harnessing the combined power of R and Hadoop.