Launching a MapR Cluster on Google Compute Engine


The MapR Distribution for Apache™ Hadoop® adds enterprise-grade features to the Hadoop platform that make Hadoop easier to use and more dependable. The MapR Distribution for Hadoop is fully integrated with the Google Compute Engine (GCE) framework, allowing customers to deploy a MapR cluster with ready access to Google's cloud infrastructure. MapR provides network file system (NFS) and open database connectivity (ODBC) interfaces, a comprehensive management suite, and automatic compression. MapR provides high availability with a No NameNode architecture, and data protection with snapshots, disaster recovery, and cross-cluster mirroring.

Before You Start: Prerequisites

These instructions assume you meet the following prerequisites:

  • You have an active Google Cloud Platform account.
  • You have a client machine with the gcutil client installed and in your $PATH environment variable.
  • You have access to a GCE project where you can add instances.
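As a quick sanity check, you can confirm that gcutil is installed and authorized for your project before launching anything. The command below is illustrative (it assumes a project ID of MyProject); it should list any existing instances without errors:

# gcutil listinstances --project MyProject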

Scripts Required

Deploying a MapR cluster within GCE relies on a launch script and its supporting scripts, which you can download from MapR. The launch script is referred to below as <launch script>.

Launching a MapR Cluster on GCE

Invoke the launch script from the directory where it is installed:

# ./<launch script> --project <project ID> --cluster <cluster name> --mapr-version <version number> --config-file <config file> --image <image name> --machine-type <type> --zone <zone> --license-file <path_to_license>
Parameter Description
--project The GCE project ID of the project where you want the cluster to be deployed. Note that the GCE project ID, the GCE project's name, and the cluster's name are all distinct.
--cluster The name of the new cluster. This is a MapR-specific property.
--mapr-version The version of the MapR distribution for Hadoop to install. The default version is 3.0.1. Other supported versions include 2.1.2 and 2.1.3.
--config-file The location of a configuration file that determines the allocation of cluster roles to the nodes in the cluster. See The GCE Configuration File for more information.
--image The OS image to use on the nodes. Legal values can be found through your GCE console.
--machine-type Defines the hardware resources of the nodes in the cluster. Legal values take the form n1-<class>-<cores>, where <class> is highmem (6.5 GB of memory per CPU), highcpu (1.8 GB of memory per CPU), or standard (3.75 GB of memory per CPU), and <cores> is the number of CPU cores on the node. Legal values for <cores> are 2, 4, or 8 for all machine types; if the machine type is standard, 1 is also a legal number of CPU cores. Two other machine definitions are available: f1-micro, with 1 CPU and 0.6 GB of memory, and g1-small, with 1 CPU and 1.7 GB of memory. To use ephemeral disks, append -d to your machine type definition. For example, the machine type definition n1-standard-4-d specifies a 4-core machine with 15 GB of memory that includes ephemeral disks.
--persistent-disks Optional: Use persistent disks if you're not using a "-d" machine type. Specifies the number and size of persistent disks for each node in the format mxn, where m is the number of disks and n is the size in GB. For example, the value 4x128 specifies four 128 GB disks. While you can specify any number of disks of any capacity within the limits of your quota, more than 8 disks will not provide significant advantages in the GCE environment.
--zone The GCE zone for the virtual instances. Zones include us-central1-a, us-central1-b, us-central2-a, europe-west1-a, and europe-west1-b.
--license-file Optional: The path to a trial MapR license file.

About Ephemeral Disks: Ephemeral disks do not maintain data after the instances have been shut down for an extended period of time.

Here is an example of a fully defined launch operation:

# ./<launch script> --project MyProject --cluster MapRonGCE --mapr-version 3.0.1 --config-file myrolesfile.txt --machine-type n1-highmem-4-d --image debian-7-wheezy-v20130926 --zone us-central1-a
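If you prefer persistent disks over ephemeral storage, the launch might look like the following sketch instead. It assumes a machine type without the -d suffix and uses the 4x128 disk layout described above for --persistent-disks:

# ./<launch script> --project MyProject --cluster MapRonGCE --mapr-version 3.0.1 --config-file myrolesfile.txt --machine-type n1-highmem-4 --image debian-7-wheezy-v20130926 --zone us-central1-a --persistent-disks 4x128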

The GCE Configuration File

The configuration file that you pass to the launch script describes the allocation of cluster roles to the nodes in the cluster. The configuration file uses the following format, with one entry per line:

node<n>:<package>,<package>,...

Each entry consists of two elements, separated by a colon:

  • An indexed identifier for the node in the cluster
  • A comma-delimited list of packages to be installed on that node

Nodes in a MapR cluster can assume the following roles:

  • cldb
  • zookeeper
  • fileserver
  • tasktracker
  • jobtracker
  • nfs
  • webserver
  • metrics

For more information about roles, see the main MapR documentation regarding planning service layout on a cluster.

Sample M3 Configuration File

This sample configuration file sets up a typical M3-licensed three-node cluster.

node0:zookeeper,cldb,fileserver,tasktracker,nfs,webserver
node1:fileserver,tasktracker
node2:fileserver,tasktracker,jobtracker

Sample M5 Configuration File

This sample configuration file sets up a typical M5-licensed five-node cluster to illustrate MapR's high-availability features, such as redundant CLDB nodes, redundant JobTracker nodes, and redundant NFS servers.

node1:zookeeper,cldb,fileserver,tasktracker,nfs,webserver,metrics
node2:zookeeper,cldb,fileserver,tasktracker,nfs
node3:zookeeper,fileserver,tasktracker,nfs,webserver,metrics
node4:fileserver,tasktracker,nfs,jobtracker,metrics
node5:fileserver,tasktracker,nfs,jobtracker,metrics

Licensing: Install the M5 trial license after installing the cluster to enable the High Availability features.
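If you did not pass --license-file at launch time, one way to install the trial license afterward is with maprcli from a cluster node. This is a sketch; it assumes you have copied the license file to /tmp/license.txt on that node:

# maprcli license add -license /tmp/license.txt -is_file true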

For more examples of cluster designs, see the MapR documentation.

Using SSH to Access Nodes

You can use the gcutil ssh command to log in to the nodes on your cluster. Use the following command to access a node of the cluster launched above:
# gcutil ssh --project MyProject --zone us-central1-a MapRonGCE-node1
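Once logged in, you can check that the expected services are running on each node. This is an illustrative check, assuming the MapR installation has completed:

# maprcli node list -columns svc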
