From Zero to Cluster-Ready

How to get your hardware ready for a MapR installation

Getting a MapR cluster up and running can be simple! The most important part of the process is to make sure that your hardware is set up properly. In this short blog post, we'll go over a few things that will help you get started with confidence.
Earlier this year, we updated the Installation Guide to simplify the installation procedure. If you are used to installing packages on Linux, you will notice that the installation process looks familiar. It might be tempting to forge ahead on your own, referring to the documentation only when you get stuck—but you will be much more successful if you follow the documentation carefully, performing each step in order. There are dependencies that you might miss otherwise.

Disks and Memory

It's important to make sure that your hardware has plenty of memory and plenty of disks, to get the most performance possible out of your cluster. 32 GB of RAM per node is a good baseline. MapR manages disks directly, without an intervening layer, aggregating them to provide performance by running more spindles in parallel. To let MapR do this job, you should avoid using RAID or LVM, and instead just provide JBOD (Just a Bunch of Disks) on each node.
MapR can handle a lot of disks—typically, 12 disks per node. Normally, one of the disks is used for the Linux operating system and MapR services, and the remaining disks are allocated for MapR storage.

Users and Groups

Every node should have the same Linux users and groups, with the same UIDs and GIDs, and the same login credentials. One easy way to accomplish this is with LDAP. MapR supports PAM, which means that you can integrate with LDAP or other authentication mechanisms (see PAM Configuration). You should create one user under which MapR services will run—mapr is a good username for this MapR user.
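One way to verify that UIDs and GIDs match is to collect the account database from each node and compare the entries. The following is a minimal sketch, not a MapR tool: it assumes you have already gathered /etc/passwd-style text from each node (for LDAP-backed accounts, the output of `getent passwd` uses the same format). The node names, sample entries, and function names are hypothetical.

```python
def parse_passwd(text):
    """Map username -> (uid, gid) from /etc/passwd-style text."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines
        entries[fields[0]] = (int(fields[2]), int(fields[3]))
    return entries


def find_id_mismatches(passwd_by_node, usernames):
    """Return {user: {node: (uid, gid) or None}} for users whose IDs
    differ between nodes or who are missing on some node."""
    parsed = {node: parse_passwd(text) for node, text in passwd_by_node.items()}
    mismatches = {}
    for user in usernames:
        seen = {node: entries.get(user) for node, entries in parsed.items()}
        if len(set(seen.values())) > 1 or None in seen.values():
            mismatches[user] = seen
    return mismatches


# Hypothetical sample data: the mapr UID differs on node2.
passwd_by_node = {
    "node1": "root:x:0:0:root:/root:/bin/bash\n"
             "mapr:x:5000:5000:MapR:/home/mapr:/bin/bash\n",
    "node2": "root:x:0:0:root:/root:/bin/bash\n"
             "mapr:x:5001:5000:MapR:/home/mapr:/bin/bash\n",
}
result = find_id_mismatches(passwd_by_node, ["root", "mapr"])
for user, per_node in result.items():
    print(user, per_node)
```

Any user reported here should be fixed before installation, since the same username with different IDs on different nodes breaks the consistency requirement described above.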

Network

MapR can use multiple NICs on each node to provide high network bandwidth. In fact, you should provide enough NICs that the total network bandwidth equals half the disk I/O bandwidth. Make sure that your network switches can handle that much throughput. You can use several fast NICs for high-bandwidth data transfer, and reserve one slower NIC for administrators to ssh into each node for maintenance purposes. To accomplish this, set the MAPR_SUBNETS environment variable to tell MapR which NICs to use for data transfer. All the data transfer NICs should be the same speed; if you use different speeds, all the NICs will operate at the lowest speed. If MAPR_SUBNETS is not set, MapR uses all NICs present on the node. For more information, see Designating NICs for MapR.
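The sizing rule above is simple arithmetic, sketched below. The per-disk throughput (100 MB/s) and the disk count are illustrative assumptions, not MapR figures, so substitute measurements from your own hardware. The second function assumes that MAPR_SUBNETS holds a comma-separated list of subnets in CIDR notation; confirm the exact format in Designating NICs for MapR. The function names and addresses are hypothetical.

```python
import ipaddress
import math


def nics_needed(data_disks, disk_mb_per_s, nic_gbit_per_s):
    """NICs required for network bandwidth equal to half the disk I/O bandwidth."""
    target_mb_per_s = data_disks * disk_mb_per_s / 2
    nic_mb_per_s = nic_gbit_per_s * 1000 / 8  # line rate, ignoring protocol overhead
    return max(1, math.ceil(target_mb_per_s / nic_mb_per_s))


def data_transfer_addresses(addresses, mapr_subnets):
    """Keep only the node addresses that fall inside the given subnets.

    Assumption: mapr_subnets is a comma-separated list of CIDR blocks.
    """
    networks = [ipaddress.ip_network(s.strip())
                for s in mapr_subnets.split(",") if s.strip()]
    return [a for a in addresses
            if any(ipaddress.ip_address(a) in n for n in networks)]


# Assumed figures: 11 data disks at roughly 100 MB/s each.
one_gbe = nics_needed(11, 100, 1)
ten_gbe = nics_needed(11, 100, 10)
print(one_gbe, ten_gbe)

# Hypothetical node with two data NICs and one admin NIC.
selected = data_transfer_addresses(
    ["10.10.15.7", "10.10.16.7", "192.168.1.7"],
    "10.10.15.0/24,10.10.16.0/24",
)
print(selected)
```

With these assumed figures, eleven data disks need five 1 GbE NICs or a single 10 GbE NIC, and the admin address on 192.168.1.0/24 is left out of data transfer.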

Operating System and Software

MapR works with Red Hat, CentOS, Ubuntu, and SUSE Linux. However, be careful not to mix different Linux distributions within the cluster! Use Sun Java JDK 1.6 or 1.7, or OpenJDK 1.6. If you plan to use MapR NFS, you should uninstall or disable the stock Linux NFS server, as it will interfere. It should go without saying: don’t run unnecessary, unrelated software on your cluster, or performance will suffer as MapR competes for resources.

Hostnames and Name Resolution

MapR requires forward and reverse name resolution between all nodes. This means that every node needs a unique hostname. Choose your hostnames wisely, so that you don’t have to change them once the cluster is running; changing hostnames later invites confusion and can cause real problems. To fully utilize MapR’s ability to report disk status, you’ll need passwordless ssh set up from each node that runs the MapR webserver to all other nodes.
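A quick way to test name resolution is to look up each hostname, then look up the resulting address and confirm that it maps back to the same name. The sketch below uses Python's standard socket module. The resolver functions are injectable so that the check can be exercised without a network, and the hostnames and addresses shown are hypothetical. It assumes you list nodes by the same name that reverse lookup returns (typically the fully qualified name).

```python
import socket


def check_name_resolution(hostnames, forward=None, reverse=None):
    """Return {hostname: problem description} for hosts that fail
    forward lookup, reverse lookup, or the round trip."""
    if forward is None:
        forward = socket.gethostbyname
    if reverse is None:
        reverse = lambda ip: socket.gethostbyaddr(ip)[0]

    problems = {}
    for host in hostnames:
        try:
            ip = forward(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[host] = f"forward lookup failed: {exc}"
            continue
        try:
            name = reverse(ip)
        except OSError as exc:  # socket.herror is a subclass of OSError
            problems[host] = f"reverse lookup failed for {ip}: {exc}"
            continue
        if name.lower() != host.lower():
            problems[host] = f"{ip} resolves back to {name}"
    return problems


# Hypothetical lookup tables standing in for DNS in this demonstration.
fake_dns = {"node1.example.com": "10.0.0.1", "node2.example.com": "10.0.0.2"}
fake_ptr = {"10.0.0.1": "node1.example.com", "10.0.0.2": "wrong.example.com"}


def fake_forward(host):
    if host not in fake_dns:
        raise OSError("unknown host")
    return fake_dns[host]


def fake_reverse(ip):
    if ip not in fake_ptr:
        raise OSError("no PTR record")
    return fake_ptr[ip]


problems = check_name_resolution(
    ["node1.example.com", "node2.example.com", "node3.example.com"],
    forward=fake_forward,
    reverse=fake_reverse,
)
for host, problem in problems.items():
    print(host, "->", problem)
```

To check a real cluster, call `check_name_resolution` with your node list and no resolver arguments, and run it on every node, since each node must be able to resolve all the others.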

Time Synchronization

It’s important that all nodes are time-synchronized; this helps promote data consistency and ensures that logs are useful in case you need to do any troubleshooting. Use NTP on all nodes to keep their clocks in sync. For best results, choose one node to sync to an external NTP server and then sync all other nodes to that one. That way, if there is any trouble connecting to the external NTP server, at least all of the nodes will still be in sync.
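The hub-and-spoke arrangement described above can be expressed as a small helper that emits the `server` lines for each node's NTP configuration. This is only a sketch under assumptions: it assumes ntpd-style `server <host> iburst` lines, and the hostnames, the external server names, and the function name are hypothetical. A complete configuration needs more than these lines.

```python
def ntp_server_lines(node, hub, external_servers):
    """Return the `server` lines for one node's NTP configuration.

    The hub node syncs to the external servers; every other node
    syncs only to the hub.
    """
    upstream = external_servers if node == hub else [hub]
    return [f"server {name} iburst" for name in upstream]


# Hypothetical cluster: node1 is the hub.
nodes = ["node1.example.com", "node2.example.com", "node3.example.com"]
hub = nodes[0]
external = ["0.pool.ntp.org", "1.pool.ntp.org"]

config = {node: ntp_server_lines(node, hub, external) for node in nodes}
for node, lines in config.items():
    print(node)
    for line in lines:
        print("   ", line)
```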

MySQL

If you are planning to use MapR Metrics (see Job Metrics), you will need to install MySQL somewhere (it does not have to be on a cluster node). If you are planning to use Hive, you should consider installing MySQL to use as a metastore. The default Apache Derby metastore only allows one connection at a time, so you’ll need MySQL if you plan to support multiple concurrent connections.
These tips should give you an idea of how to prepare your nodes as you install a MapR cluster. If you follow all the steps in the Installation Guide in order, your nodes will be ready for the next step: installing MapR services.
