Drill-ing on Amazon

Over 40 developers gathered recently at OSCON in Portland, OR for an Apache Drill hands-on workshop to learn what Drill is and how it can be used, and to jump in and try it out. Jacques Nadeau, Drill committer and MapR engineer, and Ted Dunning, Drill project champion and MapR Chief Application Architect, guided the workshop participants. In case you couldn’t make it, we thought we’d share what the participants experienced.

The workshop started with an explanation of Drill’s purpose: to provide a way to easily query a variety of big data sources and types using true SQL or non-SQL queries. Drill is designed for ease of access for ad hoc, interactive queries, particularly in the 100 ms to 20 min time range. Drill also offers multiple internal integration APIs and provides a showcase for a variety of high-performance query technologies.

A tour of the Drill code followed, covering the architecture of the logical plan and SQL parser, value vectors and the goal of the execution engine. Drill’s benefits were also discussed, including its flexibility with respect to different sources of input data, its layered API and a design that lets you avoid deserialization.
Then everyone rolled up their sleeves and got to do some Drill-ing themselves, working with Drill running on several Amazon EC2 instances. Participants downloaded and compiled Drill, ran SQL queries, inspected logical plans and generally kicked all kinds of tires with the goal of gaining an understanding of how SQL queries are transformed into a logical plan. The logical plan is a key step in diving in to see how Drill handles a query, since the logical plan embodies the basic data flow expressed by the query.
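
To make the idea of a logical plan as a data flow a bit more concrete, here is a minimal toy sketch in Python. It is not Drill’s actual logical plan format: the operator names (`scan`, `filter`, `project`), the plan layout, the `run_plan` helper and the sample `orders` table are all hypothetical, chosen only to show how a simple SQL query can be read as a chain of steps that rows flow through.

```python
# Toy illustration only: the operator names and plan layout below are
# hypothetical and are NOT Drill's actual logical plan format.

# A rough data-flow reading of:
#   SELECT name, amount FROM orders WHERE amount > 100
logical_plan = [
    {"op": "scan", "source": "orders"},
    {"op": "filter", "predicate": lambda row: row["amount"] > 100},
    {"op": "project", "columns": ["name", "amount"]},
]


def run_plan(plan, tables):
    """Push rows through each operator in order, mimicking a data flow."""
    rows = []
    for step in plan:
        if step["op"] == "scan":
            # Read every row from the named source.
            rows = list(tables[step["source"]])
        elif step["op"] == "filter":
            # Keep only the rows that satisfy the predicate.
            rows = [r for r in rows if step["predicate"](r)]
        elif step["op"] == "project":
            # Keep only the requested columns.
            rows = [{c: r[c] for c in step["columns"]} for r in rows]
        else:
            raise ValueError(f"unknown operator: {step['op']}")
    return rows


# Hypothetical sample data standing in for a real data source.
tables = {
    "orders": [
        {"id": 1, "name": "ann", "amount": 250},
        {"id": 2, "name": "bob", "amount": 40},
        {"id": 3, "name": "cy", "amount": 120},
    ]
}

result = run_plan(logical_plan, tables)
print(result)
```

The point of the sketch is only that the plan describes what flows where, and says nothing yet about how the work is split up or which machine does it.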

Figure: Overview of Apache Drill Architecture
Going a bit deeper, participants learned about the physical plan. They explored how the cost-based optimizer in Drill transforms the logical plan into the physical plan that incorporates knowledge of how the plan can be parallelized as well as special capabilities of different data sources.

The discussion of the physical plan naturally led to a description by Jacques of the way that Drill handles execution of plans internally. Drill has a data structure called ValueVector that allows data to be moved very efficiently between processes without any serialization overhead. Jacques also described how dynamic code generation is done in Drill so that data can be processed using pure native code.
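
As a rough illustration of why a columnar buffer can be handed around without a serialization step, here is a small Python sketch. It is an assumption-laden toy, not Drill’s ValueVector API: the class name `ToyIntVector`, its methods and the little-endian 32-bit layout are all made up for this example.

```python
import struct


class ToyIntVector:
    """Hypothetical fixed-width column: one contiguous buffer of 32-bit ints.

    This is NOT Drill's ValueVector class, just a sketch of the idea.
    """

    WIDTH = 4  # bytes per value

    def __init__(self, values):
        self.count = len(values)
        self.buffer = bytearray(self.count * self.WIDTH)
        # Write all values into the buffer in one call (little-endian ints).
        struct.pack_into(f"<{self.count}i", self.buffer, 0, *values)

    def get(self, index):
        """Read one value straight out of the buffer by offset."""
        if not 0 <= index < self.count:
            raise IndexError(index)
        return struct.unpack_from("<i", self.buffer, index * self.WIDTH)[0]

    def view(self):
        """Expose the same bytes to a consumer without copying them."""
        return memoryview(self.buffer)


amounts = ToyIntVector([250, 40, 120])

# A downstream consumer works on the very same bytes; nothing is re-encoded.
shared = amounts.view()
total = sum(struct.unpack_from(f"<{amounts.count}i", shared, 0))
print(total)
```

Because every value sits at a known offset in one flat buffer, a consumer that knows the layout can read the bytes as they are, which is the general property the paragraph above describes.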

How can you find out more about Drill?
You’ve got several avenues to learn more about Apache Drill, depending on whether you’re interested in technical aspects of Drill development or in the ways Apache Drill can be used as a business solution.
To get the code used at the OSCON Apache Drill workshop, go to GitHub:

Join the Bay Area Apache Drill User Group and attend meet-ups for technical discussions of Drill development:

Also check out the Apache Drill website, which has a diagram of the architecture, useful links to code and links to the mailing lists. Consider subscribing to the developer mailing list to join the community or, if you only want to observe for now, just follow the current thread on the list.

Follow @ApacheDrill on Twitter:

About Apache Drill
Apache Drill is an open-source collaboration being developed by engineers from MapR, Pentaho, Twitter and Microsoft, some supported by their companies and some working on their own.
