Apache Drill 101: Data Analysis for (Almost) Everyone

Big Data analysis is intimidating to some people. They assume you need a background in statistics, deep technical knowledge, and other complex skills. But you don’t need to be a data scientist to extract insights and value from Big Data with Hadoop and Apache Drill.

Drill uses standard ANSI SQL to query a variety of structured and semi-structured data types. Users can therefore leverage their existing SQL skills, and preferred business intelligence (BI) tools, to run queries. And since Drill can combine data from multiple sources on the fly in a single query, even a non-expert user can dig deep into big data. Drill can run on a laptop and scale from a single node up to a large cluster of servers, enabling data querying in its native formats. Plus, there’s no need to define schemas, which allows for self-service data exploration.

This walkthrough is intended for those who have limited or no experience with Drill. It assumes some familiarity with running SQL queries, however. The following information can be utilized in corporate training programs or distributed to users to help them build confidence in working with Drill.

This is the first of a three-part series. Each article in the series will build on the basic skill sets detailed here. The material below shows how to use Drill to perform familiar SQL queries on structured data.

The two subsequent guides will show how to use Drill to:

The Drill Sandbox and Material

Even if Drill is already deployed on your network, you may find that users (and IT staff) are more comfortable if beginners learn Drill in a sandbox environment. The easiest way to do this is with MapR Sandbox with Apache Drill, a free and fully functional single-node cluster.

To access the materials and data to follow along with the steps in this blog, register for the Drill Essentials course and download the content from Course Materials sheet. You will also find a link to download the Drill Sandbox in there.

Introducing Drill

For the purposes of this walkthrough, we are all employees of The Big Office Supply Company, a large retail business. The Big Office Supply Company uses data analysis to better understand their customers’ activities and purchasing habits.

The Big Office Supply Company marketing department wants to begin a new sales promotion. We’re going to help them decide when and where to focus their promotional efforts. To do this, we will determine:

  • The top grossing month of sales
  • The rank of countries based on their gross sales in that month
  • And the top 10 products based on total sales volume

Preparing to Analyze Data

Our IT Department has offloaded its customer data from a transactional Oracle system into a Hive table on a Hadoop cluster.

This table, called “orders,” includes columns for the order ID number, the month the order was placed, the customer ID of the person who placed the order, the country where the customer lives, the product ID of the order, and the total amount of the order.

We’ll use Drill to query this table to find the information the marketing department needs.

Querying with Drill

As the slide above shows, you begin with Query 1 to determine the top month of gross sales.

  • SELECT `month`, SUM(order_total) as sales
  • FROM hive.orders
  • GROUP BY `month`
  • ORDER BY sales desc;

Query 2 provides the sales total in the specified month, broken down by state.

  • SELECT `month`, `state`, SUM(order_total) as sales
  • FROM hive.orders
  • WHERE `month`='June’
  • GROUP BY `month`, `state`
  • ORDER BY sales desc;

Query 3 determines the top 10 products based on volume of sales.

  • SELECT `prod_id`, SUM(order_total) as sales
  • FROM hive.orders
  • GROUP BY `prod_id`
  • ORDER BY 2 desc limit 10;

We now know what products are most popular, and when customers in which states are most likely to make purchases. This information will help The Big Office Supply Company marketing department determine which products to promote at what time of year, and enable them to target the campaign to specific states.

As seen here, running queries on Drill is extremely easy and intuitive for a SQL user. Drill also works with BI tools if you prefer to use it that way.

Stay tuned for part 2 of this series.

no

Streaming Data Architecture:

New Designs Using Apache Kafka and MapR Streams

 

 

 

Download for free