Over 40 developers gathered recently at OSCON in Portland, OR for an Apache Drill hands-on workshop to learn what Drill is, how it can be used, and to jump in and try it out. Jacques Nadeau, Drill committer and MapR engineer, and Ted Dunning, Drill project champion and MapR Chief Application Architect, guided the workshop participants. We thought that if you couldn't make it, we'd share what the participants experienced.
The workshop started with an explanation of Drill's purpose: to provide a way to easily query a variety of big data sources and types using SQL or non-SQL queries. Drill is designed for ease of access for ad hoc, interactive queries, particularly in the 100 ms to 20 min time range. Drill also offers multiple internal integration APIs and provides a showcase for a variety of high-performance query technologies.
A tour of the Drill code followed, covering the architecture of the logical plan and the SQL parser, value vectors, and the goals of the execution engine. Drill's benefits were also discussed, including its flexibility with respect to different sources of input data, its layered API, and how the design allows you to avoid deserialization.
Then everyone rolled up their sleeves and got to do some Drill-ing themselves, working with Drill running on several Amazon EC2 instances. Participants downloaded and compiled Drill, ran SQL queries, inspected logical plans and generally kicked all kinds of tires with the goal of gaining an understanding of how SQL queries are transformed into a logical plan. The logical plan is a key step in diving in to see how Drill handles a query, since the logical plan embodies the basic data flow expressed by the query.
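To make that idea concrete, here is a minimal sketch of a query as a dataflow of logical operators. This is not Drill's actual logical-plan format (the operator names and data are invented for illustration); it only shows the general shape a query takes once it becomes a plan: a scan feeding a filter feeding a projection.

```python
# Conceptual sketch: a SQL query expressed as a chain of logical operators.
# This is NOT Drill's plan representation; names and data are illustrative.
#
# The query:  SELECT name FROM employees WHERE age > 30

rows = [
    {"name": "amy", "age": 45},
    {"name": "bob", "age": 25},
    {"name": "cat", "age": 31},
]

def scan(source):
    # Read records from a data source.
    yield from source

def filter_op(records, predicate):
    # Keep only rows matching the WHERE clause.
    return (r for r in records if predicate(r))

def project(records, columns):
    # Keep only the SELECTed columns.
    return ({c: r[c] for c in columns} for r in records)

# The logical plan is the composition scan -> filter -> project:
plan = project(filter_op(scan(rows), lambda r: r["age"] > 30), ["name"])
print(list(plan))  # [{'name': 'amy'}, {'name': 'cat'}]
```

Reading a plan this way, as a pipeline of operators over a stream of records, is what makes it possible to inspect how any given SQL query will actually flow through the engine.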
Going a bit deeper, participants learned about the physical plan. They explored how the cost-based optimizer in Drill transforms the logical plan into a physical plan that incorporates knowledge of how the plan can be parallelized, as well as the special capabilities of different data sources.
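The essence of cost-based planning can be sketched in a few lines: estimate the cost of each physical alternative and pick the cheapest. The cost formulas and numbers below are invented for illustration and do not reflect Drill's internals; they simply show why the same logical plan can yield different physical plans for different inputs.

```python
# Conceptual sketch of cost-based physical planning. The cost model here
# is invented for illustration; it is not Drill's actual optimizer.

def cost_single_scan(total_rows, cost_per_row=1.0):
    # One worker reads everything sequentially.
    return total_rows * cost_per_row

def cost_parallel_scan(total_rows, partitions, cost_per_row=1.0, startup=100.0):
    # Work is split across partitions, but each parallel plan
    # pays a fixed startup/coordination cost.
    return (total_rows / partitions) * cost_per_row + startup

def choose_plan(total_rows, partitions):
    candidates = {
        "single-scan": cost_single_scan(total_rows),
        "parallel-scan": cost_parallel_scan(total_rows, partitions),
    }
    return min(candidates, key=candidates.get)

print(choose_plan(100, 8))        # tiny input: startup cost dominates -> single-scan
print(choose_plan(1_000_000, 8))  # large input: parallelism wins -> parallel-scan
```

A real optimizer weighs many more factors (data locality, source capabilities, network transfer), but the selection mechanism is the same: enumerate alternatives, estimate costs, keep the minimum.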
The discussion of the physical plan naturally led to a description by Jacques of the way that Drill handles execution of plans internally. Drill has a data structure called ValueVector that allows data to be moved very efficiently between processes without any serialization overhead. Jacques also described how dynamic code generation is done in Drill so that data can be processed using pure native code.
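The columnar idea behind a value vector can be illustrated with Python's standard `array` module. This mirrors the general technique only, not Drill's actual ValueVector classes: each column lives in one contiguous typed buffer, so it can cross a process boundary as raw bytes with no per-record encoding or decoding step.

```python
# Conceptual sketch of the columnar "value vector" idea. This shows the
# general technique, not Drill's actual ValueVector implementation.
from array import array

# Row-oriented storage: a list of per-row objects, which would need
# serialization (and deserialization) to cross a process boundary.
rows = [{"id": i, "score": i * 2} for i in range(5)]

# Column-oriented storage: each column is one contiguous typed buffer.
id_vector = array("q", (r["id"] for r in rows))        # 64-bit ints
score_vector = array("q", (r["score"] for r in rows))

# Because a vector is a flat buffer, it can be handed to another process
# (or written to the wire) as raw bytes, with no per-record encoding step.
raw = score_vector.tobytes()
received = array("q")
received.frombytes(raw)

print(list(received))  # [0, 2, 4, 6, 8]
```

Processing columns in tight loops over flat buffers is also what makes the generated native code fast: the data layout is predictable, so there is no object traversal or serialization in the hot path.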
How can you find out more about Drill?
You’ve got several avenues to learn more about Apache Drill, depending on whether you’re interested in technical aspects of Drill development or in the ways Apache Drill can be used as a business solution.
Also check out the Apache Drill website, which has a diagram of the architecture, useful links to code, and links to the mailing lists. Consider subscribing to the developer mailing list to join the community, or just follow the current thread on the developer mailing list if you only want to observe for now.
You can also follow @ApacheDrill on Twitter.
About Apache Drill
Apache Drill is an open-source collaboration being developed by engineers from MapR, Pentaho, Twitter and Microsoft, some supported by their companies and some working on their own.
Apache Drill and Apache Mahout Committer, Big Data Consultant at MapR
Ellen Friedman is a consultant and commentator on big data topics. Active in open source, Ellen is committer for Apache Drill and Apache Mahout projects and co-author of many books on working with data in the Hadoop ecosystem. She has a PhD in biochemistry, years of experience as a research scientist and has written about a wide range of technical topics including biology, oceanography and the genetics of learning and memory.
Ellen thinks rabbits are funny, so she helped design magic-themed cartoons in the book "A Rabbit Under the Hat."