Bootstrap Apache Drill on Amazon EMR

I’m very pleased to announce the release of a custom EMR bootstrap action to deploy Apache Drill on a MapR cluster. MapR is the only commercial Hadoop distribution available for Amazon’s Elastic MapReduce service (EMR), and this addition allows EMR users to easily deploy and evaluate the powerful Drill query engine.

The bootstrap action is available at: s3://maprtech-emr/scripts/mapr_drill_bootstrap.sh. It can be invoked as part of a GUI-launched MapR-EMR cluster by simply adding a “Custom action” to your selection of any MapR cluster (as illustrated in the excerpt of the larger EMR launch panel GUI below):

[Figure: Amazon EMR and Hadoop GUI]

Use the “Configure and add” button to specify the correct location of the script (s3://maprtech-emr/scripts/mapr_drill_bootstrap.sh). No arguments are necessary; the script always installs the latest version of the mapr-drill package.

Users who prefer to launch clusters using Amazon’s aws command-line tool can add the action via the --bootstrap-actions argument to the “aws emr create-cluster” command (see the documentation at http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html).
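As a rough illustration of the command-line route, the sketch below assembles an “aws emr create-cluster” call that includes the Drill bootstrap action. Only the script path comes from this post. The cluster name, AMI version, MapR edition, instance type and count, key pair name, and the action name “InstallDrill” are all placeholder assumptions, and the right MapR options depend on your CLI and EMR versions, so check them against the create-cluster reference before launching. The script prints the command by default and only runs it when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: every value except the bootstrap script path is a placeholder.
DRILL_BOOTSTRAP="s3://maprtech-emr/scripts/mapr_drill_bootstrap.sh"

# Build the argument list first so it can be reviewed before anything is launched.
set -- aws emr create-cluster \
  --name "mapr-drill-demo" \
  --ami-version "3.3.2" \
  --applications "Name=MapR,Args=--edition,m3" \
  --instance-type "m3.xlarge" \
  --instance-count 3 \
  --ec2-attributes "KeyName=my-key-pair" \
  --bootstrap-actions "Path=${DRILL_BOOTSTRAP},Name=InstallDrill"

CMD_LINE="$*"

# Dry run by default; set RUN=1 to actually create (and pay for) the cluster.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
else
  echo "$CMD_LINE"
fi
```

Printing the command first makes it easy to confirm that the bootstrap path is correct before any instances are started.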

Upon successful completion of the cluster launch, the Drill software will be installed on all nodes. Users can access the Drill query engine in one of two ways:

  1. The sqlline tool, executed from any node in the EMR cluster
  2. The Drill control console at http://<cluster_master_node>:8047

NOTE: The default EC2 security group for the EMR cluster (usually named “ElasticMapReduce-master”) will NOT allow traffic on port 8047. Users will want to explicitly edit the security group associated with the Master node and enable inbound traffic for that port in order to access the Drill control console.
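For those who would rather open the port from the command line than from the EC2 console, here is a minimal sketch using “aws ec2 authorize-security-group-ingress”. The group name is the usual default mentioned above. The CIDR is a documentation placeholder that you should replace with your own address, and clusters in a non-default VPC will likely need --group-id in place of --group-name. As written, the script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: the CIDR below is a documentation placeholder; use your own address.
SG_NAME="ElasticMapReduce-master"
DRILL_PORT=8047
MY_CIDR="203.0.113.10/32"

# Allow inbound TCP traffic to the Drill control console from one address only.
set -- aws ec2 authorize-security-group-ingress \
  --group-name "$SG_NAME" \
  --protocol tcp \
  --port "$DRILL_PORT" \
  --cidr "$MY_CIDR"

SG_CMD="$*"

# Dry run by default; set RUN=1 to apply the rule.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
else
  echo "$SG_CMD"
fi
```

Restricting the rule to a single /32 address keeps the console from being exposed to the whole internet.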

To get started quickly, simply ssh into the master node of the cluster and execute the command:

    sqlline -u jdbc:drill:

This will invoke the sqlline command tool and enable Drill queries against any data in the cluster file system. The cluster is also configured to access some pre-staged data in an Amazon S3 bucket (s3://mapr-public-files/). Sample queries against that data have been saved as /home/hadoop/dquery1.sql and /home/hadoop/dviews.sql on the master node. Running those queries is as simple as

    sqlline> !run dquery1.sql
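The saved queries can probably also be run without an interactive session. The sketch below assumes that the sqlline version installed by the bootstrap action accepts a --run=<file> option; if yours does not, fall back to !run at the sqlline prompt as shown above. The ran and skipped counters are illustrative names, not part of the installation, and the files are taken in the order they are listed in this post.

```shell
#!/bin/sh
# Sketch only: assumes the installed sqlline accepts --run=<file>.
ran=0
skipped=0

for q in /home/hadoop/dquery1.sql /home/hadoop/dviews.sql; do
  # Only attempt the run where sqlline and the query file are both present.
  if command -v sqlline >/dev/null 2>&1 && [ -f "$q" ]; then
    sqlline -u jdbc:drill: --run="$q"
    ran=$((ran + 1))
  else
    echo "skipping $q (sqlline or the file was not found on this host)"
    skipped=$((skipped + 1))
  fi
done

echo "ran=$ran skipped=$skipped"
```

On a machine other than the cluster master node, both files are simply reported as skipped.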

For more information on Apache Drill on MapR, please see the overview and discussion at https://www.mapr.com/drill. There are some interesting examples and an in-depth discussion about configuring storage plug-ins to access your data.

Details on Amazon’s Elastic MapReduce service and how to plan your cluster (including a discussion of the MapR differentiators) can be found at: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan.html

If you are interested in getting hands-on experience with Apache Drill, there is a tutorial available on AWS Test Drive as well.  
