Using Drill Programmatically from Python, R and Perl

One of the challenges of connecting to multiple data sources in Hadoop is managing the proliferation of modules that must be loaded into an application to access them -- each with its own implementation to learn, code libraries to manage, and caveats to consider. There is (very often) more than one way to connect to the same data, in the same database, within the same language.

A lot of this goes away if you use Apache Drill, which presents a unified, schema-free SQL interface over ODBC to all of the common data sources in Hadoop: files in HDFS, Hive tables, NoSQL databases, Amazon S3 buckets, and more. If you’re a Tableau user, Drill provides a quick and easy way to run SQL queries on Hadoop-scale data with a distributed query engine.

But wait… did you know you can programmatically do the same thing from within an application or script, without a BI tool? Yes, you can! In this post we will look at three different ways to do it, so you can pick the example that suits your favorite language: Python via pyodbc, R via RODBC, and Perl via DBD::ODBC. From within applications built in these languages (and others -- this is just a sampling), Drill enables you to use SQL on data that’s ordinarily too large to manipulate directly with a script, connect to multiple Hadoop sources through one interface, and easily navigate semi-structured data without building and maintaining a schema.

Let’s look at how to do this using the pre-loaded example data on the MapR Drill sandbox. The overall view of this data is described here. These examples will make use of a subset of what’s available in the sandbox to keep things concise:

  • A Hive table called ‘orders’, which contains the order ID, customer ID, products, date, and other details about each order.

  • A CSV file, ‘customers.all.csv’, which contains information about each customer, keyed by customer ID (a quick way to preview both sources is sketched below).
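
Neither source needs a schema defined up front. Here’s a minimal sketch of that preview using pyodbc -- it assumes the ODBC setup described in the next section is already in place and uses the ‘drill64’ DSN configured there:

import pyodbc

# connect through the DSN defined in ~/.odbc.ini (see the setup section below)
conn = pyodbc.connect("DSN=drill64", autocommit=True)
cursor = conn.cursor()

# peek at a few rows of the Hive 'orders' table
for row in cursor.execute("SELECT * FROM hive.orders LIMIT 3"):
    print(row)

# the CSV has no registered schema; Drill exposes each line as a 'columns' array
for row in cursor.execute("SELECT columns[0], columns[1] FROM dfs.`root`.`customers.all.csv` LIMIT 3"):
    print(row)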

Prerequisite Steps: Setting up ODBC on Linux

In this example I will use ODBC and scripts running on an Ubuntu Linux 14.04.1 LTS desktop machine; the steps will be similar for other flavors. If you’re running scripts on a Mac, the steps are a little different; consult the Drill ODBC Mac page for up-to-date procedures on getting ODBC running. The good news is that once the configuration of unixODBC is complete, the connection to Drill is available to all of the languages we’ll use without any additional steps.

  1. A single node or cluster running Drill is assumed. Consult these pages to get your own sandbox running in a few minutes, or see the quick-start documentation on the Apache site for installing Drill on a Hadoop cluster.

  2. On the machine where you are running Python, R or Perl scripts, install the Drill ODBC driver following the instructions here. You can install the downloaded RPM file with ‘rpm -i’, adding the --nodeps flag if you get a warning about /bin/sh being a dependency.

  3. Copy the ODBC configuration files from the driver ‘Setup’ directory into your home directory. For example, assuming your driver was installed in /opt/mapr, this shell snippet will copy the files:

    for f in $(ls /opt/mapr/drillodbc/Setup/); do cp /opt/mapr/drillodbc/Setup/$f ~/.$f; done

    Follow the remaining instructions here for configuring the driver. At a minimum, you will need to edit the file ~/.odbc.ini to configure either a direct mode connection (directly to a Drillbit) or a cluster mode connection with your ZooKeeper quorum details.

    I used the following ~/.odbc.ini file for the subsequent examples, which sets up a DSN (Data Source Name) called ‘drill64’. This is important because the DSN we pass in each script must match the name in this file.

    In this case the data source is configured to connect to the host ‘maprdemo’ for ZooKeeper. For ZKClusterID you can usually use the name yourcluster-drillbits. If you’re using the sandbox, the value for ZKClusterID should be mapr_demo_com-drillbits as shown below.

    It’s a good idea to add an entry to /etc/hosts for ‘maprdemo’ to make sure the name resolves.

    [ODBC]
    Trace=no
    
    [ODBC Data Sources]
    drill64=MapR Drill ODBC Driver 64-bit

    [drill64]
    # This key is not necessary and is only to give a description of the data source.
    Description=MapR Drill ODBC Driver (64-bit) DSN
     
    # Driver: The location where the ODBC driver is installed to.
    Driver=/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so
     
    ConnectionType=ZooKeeper
    ZKQuorum=maprdemo:5181
    ZKClusterID=mapr_demo_com-drillbits
    AuthenticationType=No Authentication
    Catalog=DRILL
    Schema=
    

  4. Install a Linux ODBC driver manager on the machine running the scripts, if one is not already present. On Ubuntu, you can do this with ‘apt-get install unixodbc-dev’. On other systems, use your preferred package manager.
  5. If you get a ‘file not found’ error from unixODBC, depending on your Linux configuration you may need to set the absolute path to your libodbcinst library by editing the file ~/.mapr.drillodbc.ini and setting:

    ODBCInstLib=/usr/lib/x86_64-linux-gnu/libodbcinst.so.1.0.0
    and ensure that LD_LIBRARY_PATH includes the library by adding the following line to your ~/.bashrc:
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mapr/drillodbc/lib/64:/usr/lib
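
    At this point, a quick way to confirm the whole chain (driver, unixODBC and DSN) is working is to run a trivial query against Drill’s sys.version table. Here’s a minimal sketch using pyodbc (it assumes the ‘drill64’ DSN above and that the pyodbc module is installed; any of the languages below would work equally well):

    import pyodbc

    # open a connection against the DSN defined in ~/.odbc.ini
    conn = pyodbc.connect("DSN=drill64", autocommit=True)
    cursor = conn.cursor()

    # a trivial query that exercises the Drillbit without touching any data source
    cursor.execute("SELECT * FROM sys.version")
    print(cursor.fetchone())
    conn.close()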
    
    You should be ready to go. Let’s start with Python.

Connecting to Drill in Python and Querying Multiple Hadoop Sources

Using Drill from within Python scripts opens up a new world of data analysis capabilities by coupling the distributed query power of Drill with all of the open source Python modules and frameworks available, like NumPy and pandas. The following code snippet connects to Drill and runs a query on the Hive table ‘orders’, joined with a CSV-formatted customer database. Let’s say we want to see all customers with orders totaling more than $3,000 so we can send them a special offer:

import pyodbc
from pandas import Series
 
# initialize the connection
conn = pyodbc.connect("DSN=drill64", autocommit=True)
cursor = conn.cursor()
 
# setup the query and run it
s = "SELECT c.columns[1] as fullname, SUM(o.order_total) as ordertotal " \
   "FROM hive.orders o INNER JOIN " \
   "dfs.`root`.`customers.all.csv` c " \
   "ON CAST(o.cust_id as BIGINT) = CAST(c.columns[0] as BIGINT)" \
   "GROUP BY c.columns[1] ORDER BY ordertotal desc"
 
# fetch and display filtered output
cursor.execute(s)
b = Series(dict(cursor.fetchall()))
print(b[b > 3000])

You can see that in this example we created a pandas Series from the output of Drill, which was the result of a distributed query across the data sources. Both sources were referenced as “tables” in our SQL syntax, and we then did a quick filter on the output. Notice we didn’t have to add any modules to talk to Hadoop data sources -- Drill did it all for us, and we even did a join across more than one source.
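
If you’d rather have the full result set as a DataFrame instead of a Series, pandas can read straight from the same pyodbc connection. Here’s a minimal sketch along the same lines (it assumes the same ‘drill64’ DSN and sandbox data; newer pandas versions may warn that the connection isn’t a SQLAlchemy engine, which is harmless for a simple read like this):

import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=drill64", autocommit=True)

# let pandas run the Drill query and build the DataFrame for us
query = ("SELECT c.columns[2] as state, SUM(o.order_total) as ordertotal "
         "FROM hive.orders o INNER JOIN dfs.`root`.`customers.all.csv` c "
         "ON CAST(o.cust_id as BIGINT) = CAST(c.columns[0] as BIGINT) "
         "GROUP BY c.columns[2]")
df = pd.read_sql(query, conn)

# sort and show the five states with the highest total order value
print(df.sort_values("ordertotal", ascending=False).head(5))

From there, the usual pandas tooling (grouping, plotting, merging with other frames) applies, even though the heavy lifting of the join happened inside Drill.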

Connecting to Drill in R

Let’s see how to do the same thing in R. We’ll use the RODBC package and install it directly from within the R interpreter (trust me, this is easier than installing it by hand):

> install.packages("RODBC") 

This should run a configure script, then download, build, and install the package system-wide (if run as root). Using the following script (run with ‘Rscript’), we run a similar query and import the results into an R data frame.

library(RODBC)
 
# initialize the connection
ch <- odbcConnect("drill64")
 
# run the query
df <- sqlQuery(ch, paste("SELECT columns[2] as state, COUNT(*) as ",
                        "countbystate from `dfs`.`root`.`customers.all.csv` ",
                        " GROUP BY columns[2] ORDER BY countbystate DESC"))
df$perctotal <- df$countbystate / sum(df$countbystate) * 100
 
print(paste0("Total: ", sum(df$countbystate)))
print(head(df, 10))
 
# close the connection so we don't get a warning at the end
odbcClose(ch)

In this example, we pulled specific columns from our customer database, then added a calculated column called ‘perctotal’ that shows the percentage of customers from each state. Running this script should produce a list of the top ten states where our customers live.

Connecting to Drill in Perl

Last but not least, let’s look at the same example in Perl, using the module DBD::ODBC from CPAN. It will take roughly the same amount of code… plus or minus a few semicolons and dollar signs. Install this module on your system with the following command (as root):

# cpan -i DBD::ODBC

We can then connect to the data source from within Perl and fetch rows in a similar fashion. This query will return total sales by member level (basic, silver, gold) and gender.

use strict;
use warnings;
use DBI;
 
my $dbh = DBI->connect('dbi:ODBC:DSN=drill64');
 
# setup the query and run it
my $results = $dbh->prepare(
    "SELECT c.columns[6] as level," .
    "c.columns[3] as gender," .
    "SUM(o.order_total) as ordertotal " .
    "FROM hive.orders o INNER JOIN " .
    "dfs.`root`.`customers.all.csv` c " .
    "ON CAST(o.cust_id as BIGINT) = CAST(c.columns[0] as BIGINT)" .
    "GROUP BY c.columns[6], c.columns[3] ORDER BY ordertotal desc");
 
$results->execute();
 
# fetch and display the value
while (my @row = $results->fetchrow_array) {
       print "@row\n";
}

As in the other examples, this is a synchronous version where the script waits for the result. Other ways of doing this (including asynchronous queries) are documented on the CPAN page above, but this should give you the general idea of how to get connected and run queries.

Conclusion and Next Steps

You’ll notice that in all of these examples I connected to a few different Hadoop data sources (a CSV file in HDFS and a Hive table), loaded only one package/library in each language, and wrote a single query across the sources as tables. This shows how Drill simplifies and enhances connecting to data in Hadoop, and why it’s a time-saver when analyzing data at scale from applications and scripts.

For use as a reference, all three of these code examples are available on GitHub here and can be run against the sandbox VM or your own cluster. These aren’t the only languages that have ODBC bindings and can connect to Drill (there are many more!), but hopefully these examples help get you started accessing Drill from within an application or script.
