Google Cloud Storage Connector for Hadoop: A Quick Start Guide

Like most commercial cloud platforms, Google Cloud offers a range of storage options. The most common are persistent disk volumes attached to Virtual Machine instances and object store buckets accessed via the Google Cloud Storage APIs. Until recently, disk volumes were the only supported storage for Hadoop deployments in the Google Cloud. That situation changed for the better with the release of the Google Cloud Storage Connector for Hadoop.

The connector enables a Hadoop cluster to access Google Cloud Storage buckets via the standard Hadoop File System interface. Users can then work with data in those buckets just as they would with data ingested directly into the cluster.
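
Once the connector is configured (the steps below walk through that), a gs:// URI behaves like any other Hadoop filesystem path. Here is a minimal sketch, assuming a bucket named gsdefault and a local file to stage there (both hypothetical):

    # List the contents of a bucket through the Hadoop FileSystem interface
    hadoop fs -ls gs://gsdefault/

    # Copy a local file into the bucket, then read it back
    hadoop fs -put /tmp/sample.log gs://gsdefault/staging/sample.log
    hadoop fs -cat gs://gsdefault/staging/sample.log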

Integrating the connector with the MapR Distribution for Apache Hadoop follows the standard procedure, and brings with it all the operational and performance advantages of the top-ranked Hadoop distribution (see the Forrester Wave report and the Google MinuteSort record).

The following steps will get you started quickly with the connector.

For basic cluster deployment, use the MapR setup scripts available on GitHub (https://github.com/mapr/gce). The deployed instances will be authorized to access the Google Cloud Storage buckets within your account. Then use the configure-gcs.sh script, available in the same repository, to configure the cluster; the script must be executed on every node.

Here are the steps (a scripted version of steps 2 through 5 follows the list):

  1. Use launch-mapr-cluster.sh to deploy a MapR cluster in the Google Cloud.
  2. Create a default Google Storage bucket in your Google Cloud console. The configure-gcs.sh script assumes a bucket name of gsdefault, but any name will do.
  3. Copy configure-gcs.sh to each node in the cluster.
    ( gcutil push <node> configure-gcs.sh /tmp )
  4. Execute the script as the root user on each node.
    ( sudo /tmp/configure-gcs.sh )
  5. Confirm the configuration on any node.
    ( hadoop fs -ls gs://  or  hadoop fs -ls gs://<bucket>/ )
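
If you prefer to script steps 2 through 5 from a workstation with gcutil and gsutil installed, the sketch below shows one way to do it. The node names and bucket name are placeholders; substitute your own:

    # Create the default bucket (any name works; gsdefault matches the script's default)
    gsutil mb gs://gsdefault

    # Push the configuration script to every cluster node and run it as root
    for node in mapr-node-1 mapr-node-2 mapr-node-3; do
        gcutil push $node configure-gcs.sh /tmp
        gcutil ssh $node "sudo /tmp/configure-gcs.sh"
    done

    # Confirm from one of the nodes that the bucket is visible
    gcutil ssh mapr-node-1 "hadoop fs -ls gs://gsdefault/"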

After you complete these steps, your cluster will have full access to any data in your Google Cloud Storage buckets. You can access that data from the Hadoop command line, from a custom MapReduce job, or even as Hive tables for structured queries.
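
As a rough illustration of those access paths, the commands below run the stock wordcount example against bucket data and expose the same bucket as a Hive external table. The jar path, bucket name, and table layout are assumptions; adjust them for your cluster:

    # Run a MapReduce job that reads from and writes to the bucket
    hadoop jar /path/to/hadoop-examples.jar wordcount \
        gs://gsdefault/input/ gs://gsdefault/wordcount-output/

    # Query bucket data with Hive by pointing an external table at a gs:// location
    hive -e "CREATE EXTERNAL TABLE weblogs (ip STRING, ts STRING, url STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
             LOCATION 'gs://gsdefault/weblogs/';"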
