This is the second in our three-part series focused on building basic skill sets for use in data analysis. The material is intended for those who have no prior, or very limited, experience with Apache Drill, but do have some familiarity with running SQL queries.
The following information can be utilized in corporate training programs, or provided as handouts to users, to help them better understand the possibilities provided by Drill. In Part 1, we walked through a basic process of querying data that is accessible to any user with basic SQL skills. In Part 2, we break down the steps involved in running a more complex query, working with semi-structured data.
Even if users are not expected to perform more sophisticated queries independently, this material will help them to understand the process, spark ideas about how data can be used, and enable them to have more productive conversations with their data analysis team members.
The Drill Sandbox
Even if Drill is already deployed on your network, you may find that users (and IT staff) are more comfortable if beginners learn Drill in a sandbox environment. The easiest way to do this is with MapR Sandbox with Apache Drill, a free and fully functional single-node cluster.
To access the materials and data to follow along with the steps in this blog, register (if you have not already done so for part 1 of this series) for the Drill Essentials course and download the content from Course Materials sheet. You will also find a link to download the Drill Sandbox in there.
Synopsis of Part 1: Data Analysis for (Almost) Everyone
We are all, for the purposes of this walkthrough, employees of The Big Office Supply Company, a large retail business. Our marketing department wants to begin a new sales promotion. We analyzed data to help them decide when and where to focus their promotional efforts. We have determined the top grossing month of sales, the rank of countries based on their gross sales in that month, and the top 10 products based on total sales volume.
Digging Deeper with Drill
Our marketing department is happy to know the optimum times and locations to launch a product promotion. But our data analysts feel that customer purchase information is, by itself, not enough information to allow The Big Office Supply Company to confidently launch a marketing campaign.
We’re going to look at the data we’ve collected regarding customers who visit our website, to gain a better understanding of who they are and what they want.
When people view product pages on our website, we collect click data in a nested JSON format and log files as flat text.
- Click data lets us understand how people are moving through our website – what links they are clicking on to find the information they need. It also shows us where people might be disengaging with our site, and whether our marketing messages and sales funnels are working.
- “Nested” indicates data has a complex hierarchical structure.
- Log files maintain a history of requests, which may include IP addresses, date/time, specific page requests, HTTP code, bytes served, referrers, browser type, etc.
We also collect all user product search requests, which are regularly converted into parquet format.
- Parquet is a column-oriented storage format designed to improve query speed.
Analyzing the Data
Our data analysts want to query all of the files mentioned above for additional insights that will enhance the effectiveness of our marketing promotion.
- First, they look at the search data to see which items were most frequently searched for over the last year to determine our most popular products. We will then target those items in our promotion.
- Next, they will use the click and log data to look at who viewed our site but did not buy the product they looked at.
- Since we know they have interest in that specific product, we can target our promotion to that item and expect a higher rate of return on our marketing efforts.
Additionally, our data analysts take a look at what products people are purchasing.
When coupled with information from our products database, we can provide targeted coupons for upgrades or consumables associated with those products.
How Drill Simplifies Querying Semi-Structured Data
Usually, analysts would not be able to use SQL to work with these types of semi-structured files.
- Structured data is neatly organized in tables, further broken down into columns and rows, in a relational database.
- Unstructured data are files such as photos, graphics, videos, webpages, and PDFs that are difficult to organize in a structured format.
- Semi-structured data is somewhere in between. JSON, Parquet fall under this category. Metadata tags are typically used to identify the data, but it isn’t organized in a rigid structure. This a common format for big data.
Most data query tools don’t work well with semi-structured data, relying on the structural organization provided by a schema to locate information. To query semi-structured data, an analyst would need to enlist a data engineer to help them write a MapReduce program, or perform a lengthy ETL process to flatten the data and create a centralized schema for it.
Drill makes querying semi-structured data extremely easy.
- Drill can discover the schema of the data on the fly, allowing our data analysts to query the data without creating any upfront schema definitions.
- Drill’s flexible data model makes it easy to query and manipulate nested data without having to flatten it.
- Data analysts can use Drill to query JSON and parquet files using the standard ANSI SQL that they are familiar with, just like they did with our structured database.
Part 3 of this three-part series will explore how Drill discovers the schema of data on the fly.