Accumulo (http://accumulo.apache.org/) is a popular BigTable-like framework created by the NSA and recently open-sourced as an Apache project. Today MapR is excited to announce that we have completed testing of Accumulo on the MapR distribution. Users of Accumulo get all the benefits of MapR’s advanced Hadoop distribution.
Accumulo users inherit the same strong dependability of the MapR platform that HBase™ users have enjoyed for a long time. For example, snapshots provide point-in-time recovery in the event of user and application errors, while mirrors provide disaster recovery and backup for Accumulo-based applications.
In brief, here are the MapR-specific steps to install Accumulo on your existing MapR cluster. We expect you to follow the official Accumulo installation documentation; here we document only the MapR-specific pieces. After the summary, we’ll go into more detail about each of the steps.
To run Accumulo on MapR, make sure you have current versions of both Accumulo and MapR, because a few fixes are needed:
- Accumulo 1.4.0 – this includes Accumulo fix ACCUMULO-476.
- MapR version 1.2.7 – this will be released soon. Until then, if you are interested in a test version of the fix, send me email directly and I’ll send a test fix to you with instructions.
Now perform the following steps.
1. Download Accumulo 1.4.0 or later and untar the tar file under /opt. This will create /opt/accumulo-1.4.0.
2. Create a MapR volume for Accumulo and mount it under /accumulo. You can use the UI or execute this command:
    maprcli volume create -name project.accumulo -path /accumulo
3. Disable compression for Accumulo data using this command:
    hadoop mfs -setcompression off /accumulo
4. Create an Accumulo-specific “shadow tree” for Hadoop so we can disable read/write caching:
    cd /opt/accumulo-1.4.0
    mkdir -p hadoop/hadoop-0.20.2
    cd hadoop/hadoop-0.20.2
    # link every entry of the real Hadoop tree into the shadow tree
    ln -s /opt/mapr/hadoop/hadoop-0.20.2/* .
    # replace the conf link with a real directory of links
    rm conf
    mkdir conf
    cd conf
    ln -s /opt/mapr/hadoop/hadoop-0.20.2/conf/* .
    # replace the core-site.xml link with a real, editable copy
    cp core-site.xml t
    mv t core-site.xml
5. Edit /opt/accumulo-1.4.0/hadoop/hadoop-0.20.2/conf/core-site.xml and add this:
    <property>
      <name>fs.mapr.readbuffering</name>
      <value>false</value>
    </property>
    <property>
      <name>fs.mapr.aggregate.writes</name>
      <value>false</value>
    </property>
6. Edit warden.conf to leave space for Accumulo (refer to the Accumulo docs for the amount needed). In this example we assume 2GB is needed, so max changes from 750 to 2750 and min from 256 to 2256:
    service.command.os.heapsize.max=2750
    service.command.os.heapsize.min=2256
7. Create appropriate initial configuration files in /opt/accumulo-1.4.0/conf by following the Accumulo installation instructions for copying from the examples.
8. Edit accumulo-env.sh. Follow the Accumulo documentation for this, but take note of these two MapR-related settings:
    HADOOP_HOME=/opt/accumulo-1.4.0/hadoop/hadoop-0.20.2/
    ZOOKEEPER_HOME=/opt/accumulo-1.4.0/hadoop/hadoop-0.20.2/lib
9. Edit accumulo-site.xml. Edit the stanza for Zookeeper (the Zookeeper location information can be found in warden.conf):
    <property>
      <name>instance.zookeeper.host</name>
      <value>host1:5181,host2:5181,host3:5181</value>
      <description>comma separated list of zookeeper servers</description>
    </property>
   Add this stanza to change the Accumulo tablet server port:
    <property>
      <name>tserver.port.client</name>
      <value>9996</value>
    </property>
At this point, you can complete the installation and configuration of Accumulo on a single node in a MapR cluster by continuing to follow the Accumulo installation instructions. Perform the same steps on each additional node where you run Accumulo.
We hope that you find running Accumulo on MapR to be an excellent pairing of the many enterprise features of MapR with the function and power of Accumulo. Please let us know what you think – we are listening.
We at MapR Technologies would like to thank the following individuals who generously gave their time to this work: Todd Stavish, who started the initial port and blogged about it here: http://t.co/RJJ8Ht4B; Eric Newton (Accumulo committer); and Keith Turner (Accumulo committer). Eric and Keith were very helpful in explaining the Accumulo test framework so that we could validate Accumulo properly. Both were also invaluable in tracking down some issues that initially led to test failures.
Now that we’ve shown you the steps to configure Accumulo and MapR, we are going to explain the reasoning behind these settings. For each of the notable steps above, we provide more detailed information here.
Step #2: Create a volume for Accumulo
We recommend creating a volume for Accumulo data and mounting it at /accumulo, which is the default Accumulo location. By creating a volume just for Accumulo, you’ll be able to use MapR’s volume management features (such as snapshots, mirroring, and quotas) with Accumulo data. For example, you can schedule snapshots of your Accumulo data and mirror that same data to another location for added protection.
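As a rough illustration of the snapshot idea, an ad hoc snapshot of the Accumulo volume could be taken with a small wrapper like the one below. The volume name project.accumulo comes from the install steps; the function name, the MAPRCLI override variable, and the timestamped snapshot naming scheme are assumptions made for this sketch and are not part of MapR.

```shell
#!/bin/sh
# Sketch: take an ad hoc, timestamped snapshot of the Accumulo volume.
# MAPRCLI is a hypothetical override (for example MAPRCLI=echo) for a dry run.
MAPRCLI=${MAPRCLI:-maprcli}

snapshot_accumulo_volume() {
    # Default to the volume name used in the install steps.
    volume=${1:-project.accumulo}
    # The accumulo-<timestamp> naming scheme is an assumption of this sketch.
    stamp=$(date +%Y%m%d-%H%M%S)
    "$MAPRCLI" volume snapshot create -volume "$volume" -snapshotname "accumulo-$stamp"
}
```

Setting MAPRCLI=echo before calling the function prints the command instead of running it, which is a cheap way to review it first.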
Step #3: Disable compression
Recall that by default MapR transparently compresses all data, which greatly reduces storage requirements and in general improves performance by reducing I/O requirements. However, for some database-like applications such as HBase™, that transparent compression can lead to high CPU load and should be disabled. Accumulo is no different.
Steps #4 and #5: Create an Accumulo specific “shadow tree” for Hadoop
As with HBase™, MapR’s transparent read caching and write aggregation are not appropriate here, since database-like systems tend to have fairly random data access profiles (as opposed to sequential access). Therefore, read caching and write aggregation should be disabled by setting the properties shown earlier.
Ordinarily these properties would be set in a localized site configuration file such as accumulo-site.xml since the total Hadoop configuration is normally a merger of the core-site.xml file and every subsidiary configuration file. This is what we do with HBase™ (in fact you’ll see those exact same properties set in hbase-site.xml in MapR). Unfortunately, Accumulo does not pass properties found in accumulo-site.xml to the parent Apache Hadoop configuration layers. As a result, we have to do something a bit cleverer. Essentially we will create a custom Hadoop core-site.xml that is used only by Accumulo – we want the rest of your cluster to be able to take advantage of MapR’s transparent caching and aggregation.
To ensure that the Hadoop runtime used by Accumulo picks up this Accumulo-specific core-site.xml file, there are two options. In the first option, you would edit the Accumulo classpath settings to remove the normal Hadoop conf directory ($HADOOP_HOME/conf) from the classpath. These settings are in the accumulo-site.xml file and are specified by the general.classpaths property. You could then copy the modified core-site.xml file to the Accumulo conf directory and rely on the usual Hadoop classpath-based search for core-site.xml to pick up the correct file. Note that other Hadoop conf files are not in the Accumulo conf directory, so you may have to copy them as well, depending on the scenario.
The second approach (and the one we chose) is to create a shadow Hadoop tree that is identical to the real MapR-defined Hadoop tree but has this custom configuration file. We use symbolic links to make this work, following the steps shown earlier. Notice that everything in that tree is just a symbolic link to the real Hadoop tree except for the core-site.xml file.
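The shadow-tree steps can also be wrapped in a function, which is convenient when the same layout has to be created on several nodes. This is only a sketch of the same idea as the manual steps: the function name and its two parameters are hypothetical, and it assumes a fresh, empty destination directory.

```shell
#!/bin/sh
# Sketch: build a shadow Hadoop tree of symlinks with a private core-site.xml.
# Usage: build_shadow_tree REAL_HADOOP_DIR SHADOW_DIR
# REAL_HADOOP_DIR should be an absolute path so the links resolve from anywhere.
build_shadow_tree() {
    real=$1
    shadow=$2
    mkdir -p "$shadow/conf" || return 1

    # Link every top-level entry except conf, which is a real directory.
    for entry in "$real"/*; do
        [ -e "$entry" ] || continue
        name=$(basename "$entry")
        [ "$name" = conf ] && continue
        ln -s "$entry" "$shadow/$name" || return 1
    done

    # Inside conf, link everything except core-site.xml, which is copied.
    for entry in "$real"/conf/*; do
        [ -e "$entry" ] || continue
        name=$(basename "$entry")
        if [ "$name" = core-site.xml ]; then
            cp "$entry" "$shadow/conf/$name" || return 1
        else
            ln -s "$entry" "$shadow/conf/$name" || return 1
        fi
    done
    return 0
}

# Example, using the paths from the install steps:
# build_shadow_tree /opt/mapr/hadoop/hadoop-0.20.2 /opt/accumulo-1.4.0/hadoop/hadoop-0.20.2
```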
Once we have the shadow tree with only a real core-site.xml file (everything else being links), we simply edit core-site.xml as described earlier to add the two properties. As is usual with Hadoop, make sure you perform these steps on every node that is running an Accumulo server component – or just copy the changes.
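After editing, it can be worth confirming on each node that both properties really are set to false in the shadow copy. The helper below is a minimal sketch using plain text matching with awk rather than a real XML parser, so it does not understand XML comments or whitespace inside the value tags; both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: print the <value> that follows a given <name> in a Hadoop XML config.
# Usage: property_value FILE PROPERTY_NAME
property_value() {
    awk -v name="$2" '
        {
            line = $0
            if (!found) {
                i = index(line, "<name>" name "</name>")
                if (i == 0) next
                found = 1
                line = substr(line, i)   # ignore anything before the name
            }
            if (match(line, /<value>[^<]*<\/value>/)) {
                # 7 is the length of "<value>", 15 covers both tags
                print substr(line, RSTART + 7, RLENGTH - 15)
                exit
            }
        }
    ' "$1"
}

# Succeeds only if both MapR properties are set to false in FILE.
caching_disabled() {
    [ "$(property_value "$1" fs.mapr.readbuffering)" = false ] &&
    [ "$(property_value "$1" fs.mapr.aggregate.writes)" = false ]
}

# Example:
# caching_disabled /opt/accumulo-1.4.0/hadoop/hadoop-0.20.2/conf/core-site.xml || echo "check core-site.xml"
```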
Note that a side effect of either approach is that if you edit the real core-site.xml file those changes are not visible to Accumulo. You’ll need to define a process to ensure that the two copies of core-site.xml are consistent.
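One possible process, sketched here rather than prescribed, is to regenerate the Accumulo copy from the real file whenever the real file changes, re-adding the two properties each time. The function name is hypothetical, and the sketch assumes the real core-site.xml contains a closing </configuration> tag and does not itself define the two fs.mapr properties.

```shell
#!/bin/sh
# Sketch: rebuild the Accumulo copy of core-site.xml from the cluster copy.
# Usage: regen_core_site REAL_CORE_SITE SHADOW_CORE_SITE
regen_core_site() {
    out_tmp="$2.tmp.$$"
    if awk '
        # Insert the two properties just before the first closing tag.
        !inserted && (i = index($0, "</configuration>")) {
            print substr($0, 1, i - 1)
            print "<property> <name>fs.mapr.readbuffering</name> <value>false</value> </property>"
            print "<property> <name>fs.mapr.aggregate.writes</name> <value>false</value> </property>"
            print substr($0, i)
            inserted = 1
            next
        }
        { print }
        # Fail if no closing tag was found, so a bad file is never installed.
        END { exit (inserted ? 0 : 1) }
    ' "$1" > "$out_tmp"; then
        mv "$out_tmp" "$2"
    else
        rm -f "$out_tmp"
        return 1
    fi
}

# Example:
# regen_core_site /opt/mapr/hadoop/hadoop-0.20.2/conf/core-site.xml \
#     /opt/accumulo-1.4.0/hadoop/hadoop-0.20.2/conf/core-site.xml
```

Writing to a temporary file and renaming it means Accumulo never sees a half-written configuration file.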
Step #6: Edit warden.conf to leave space for Accumulo
We need to ensure that Warden is aware of the resources used by Accumulo. Recall that MapR’s Warden automatically starts, stops, and manages the resources used by the various Hadoop components. In particular, the Warden takes the expected memory utilization of each component into account when allocating memory. Since the Warden does not know about Accumulo, we need to change the warden.conf file on each node running Accumulo so that the Warden leaves sufficient memory for it. If you look in the warden.conf file, you’ll see a number of settings related to heap size for each service. Since today there is no explicit option for Accumulo, we instead tell the Warden that the operating system is consuming additional memory that it cannot use.
If you look at the default warden.conf, you’ll see these properties with these values:
service.command.os.heapsize.max=750
service.command.os.heapsize.min=256
Increase the max and min values to take into account the expected memory usage of the Accumulo servers on nodes that will run them. For example, if you expect that the Accumulo server processes will consume 2GB of memory, you would change the values to this:
service.command.os.heapsize.max=2750
service.command.os.heapsize.min=2256
We do not claim to be experts in Accumulo sizing. You will need to determine for yourself which Accumulo processes run on each node and how much memory they are expected to use, and then update warden.conf accordingly.
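The arithmetic is simply the existing value plus the memory reserved for Accumulo. As a sketch, the hypothetical helper below prints a copy of warden.conf with both OS heap settings raised by a given amount, in the same units the file already uses (2000 in the example above). It writes to standard output rather than editing the file in place, so the result can be reviewed first.

```shell
#!/bin/sh
# Sketch: print warden.conf with the OS heap settings raised by EXTRA.
# Usage: bump_os_heap WARDEN_CONF EXTRA
bump_os_heap() {
    awk -F= -v extra="$2" '
        $1 == "service.command.os.heapsize.max" ||
        $1 == "service.command.os.heapsize.min" {
            print $1 "=" ($2 + extra)
            next
        }
        { print }   # every other line passes through unchanged
    ' "$1"
}

# Example:
# bump_os_heap warden.conf 2000 > warden.conf.new
```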
Step #8: Edit accumulo-env.sh
The Accumulo documentation explains the meaning of the values in this file and the need to change them. We just want to point out two things that are relevant to MapR.
First, since we’ve created a shadow Hadoop tree, you’ll want to point Accumulo to that tree rather than the cluster default under /opt/mapr/hadoop. Second, MapR includes a Zookeeper client on every node, and that is all Accumulo actually requires. There is no need for a separate Zookeeper install. As such, the Zookeeper “home” is really just the location of the Zookeeper client library: /opt/accumulo-1.4.0/hadoop/hadoop-0.20.2/lib.
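A quick sanity check before starting Accumulo is to confirm that the directory used as ZOOKEEPER_HOME actually contains a Zookeeper client jar. The sketch below assumes the jar file name starts with "zookeeper" and ends in ".jar"; that pattern and the function name are assumptions, so adjust them to whatever your lib directory contains.

```shell
#!/bin/sh
# Sketch: succeed if DIR contains at least one zookeeper*.jar file.
# Usage: has_zookeeper_jar DIR
has_zookeeper_jar() {
    for jar in "$1"/zookeeper*.jar; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$jar" ] && return 0
    done
    return 1
}

# Example:
# has_zookeeper_jar /opt/accumulo-1.4.0/hadoop/hadoop-0.20.2/lib || echo "no Zookeeper jar found"
```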
Step #9: Edit accumulo-site.xml
The Accumulo documentation explains the meaning of the values in this file and the need to change them. We just point out how to determine those values with MapR.
The Zookeeper endpoints are defined in warden.conf and can easily be found by looking for the value of the zookeeper.servers property. In addition, the default port for tablet servers in Accumulo conflicts with a MapR default port – therefore we recommend changing the tablet server port.
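For example, the value can be pulled out of warden.conf with a one-line sed command and pasted into the instance.zookeeper.host property. The function name below is hypothetical, and the default path /opt/mapr/conf/warden.conf is an assumption about where the file lives on your nodes; pass the path explicitly if it differs.

```shell
#!/bin/sh
# Sketch: print the value of zookeeper.servers from warden.conf.
# Usage: zk_servers [WARDEN_CONF]
zk_servers() {
    # The default path is an assumption; override it with the first argument.
    sed -n 's/^zookeeper\.servers=//p' "${1:-/opt/mapr/conf/warden.conf}"
}

# Example:
# echo "<value>$(zk_servers)</value>"
```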