A packed room greeted Apache Drill project champion Ted Dunning last Wednesday when he spoke on the CU campus to the Boulder/Denver Big Data group. Apache Drill is a new, truly open source project being developed by an international community. Ted described the overall architecture of Drill and explained how this design will provide flexibility in terms of which data sources can be queried and what syntax a query can take.
One example of Drill’s flexible approach is the way it deals with schemas. Drill can query data sources that have a rich, well-defined schema, but a schema is not required. Many data sources do not have rigid schemas. More importantly, schemas in modern data are not constant: they change frequently in some cases, and the risk and cost of rewriting existing data sets mean that older data is rarely migrated to match. Furthermore, for the very sparse and wide rows found in some HBase applications, any schema would be too large to represent (imagine a table with one column per web site).
The architecture of Apache Drill is designed to support queries against data whose schema is not known in advance. Drill users will also be able to choose whether to define the schema explicitly or let the system discover it automatically, which can make managing schema evolution much simpler. Drill handles semi-structured data that lacks a strong schema by reading records in batches and inferring an operational schema on the fly from the data read. This approach can cost some performance relative to strongly structured columnar storage, but there are definitely times when flexible schemas are important to support.
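To make the idea of batch-wise schema inference concrete, here is a minimal Python sketch. It is not Drill code and does not reflect Drill's actual implementation; the function names (`batches`, `infer_schema`, `scan`), the batch size, and the rule that conflicting types widen to "any" are all assumptions chosen for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, Any]
Schema = Dict[str, str]


def batches(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def infer_schema(batch: List[Record]) -> Schema:
    """Map each field seen in the batch to a type name.

    Hypothetical rule: a field seen with more than one type widens to "any".
    """
    schema: Schema = {}
    for record in batch:
        for field, value in record.items():
            type_name = type(value).__name__
            if schema.get(field, type_name) != type_name:
                type_name = "any"
            schema[field] = type_name
    return schema


def scan(records: Iterable[Record], size: int = 2) -> Iterator[Tuple[Schema, List[Record]]]:
    """Pair every batch with the schema inferred from that batch alone."""
    for batch in batches(records, size):
        yield infer_schema(batch), batch


# Made-up records: the third one adds a field and changes a type.
records = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "c", "clicks": "n/a", "referrer": "example.com"},
]
schemas = [schema for schema, _ in scan(records)]
```

In this sketch each batch carries its own operational schema, so a new field or a changed type in later records shows up as a different schema for a later batch instead of an error. The extra pass over every batch also hints at where the performance cost mentioned above comes from.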
The Boulder/Denver Big Data group was a particularly enthusiastic and interactive audience, asking great questions and moving the discussion on Drill development forward. What really seemed to strike home with the audience was Ted’s description of how Apache communities form and how they build a project. The exact final shape of Drill is not pre-defined, as the project is community driven. Even at this stage of active development, the project is open to change. Ted invited attendees to get involved, for example, by going to the Apache Drill website and adding comments to the logical plan document or contributing use-case definitions.
Ted also mentioned that Drill has now passed a mini-milestone. Drill recently had its first public demonstration of the reference interpreter. At this stage of its development, the reference interpreter lacks a SQL parser, query optimizer, execution planner and distributed execution engine, so it was demonstrated more as an element of functional documentation than as production software. Just last week, however, at a customer site visit, Ted realized that the reference interpreter could already be used, roughly in its current state, as a component of a useful system. Drill combined with MapReduce could help manage the complexity of life-cycle version control for feature variables in a production machine learning system. Why does Drill make this situation easier? Because its logical plan syntax is designed for manipulating data flows automatically, and because of Drill’s architectural flexibility.
This mini-milestone is the beginning of a huge shift in the state of Drill from full of promise to full of practical utility. There is much yet to do, but turning points like this are important to mark.
For the slides from this Boulder/Denver Big Data meetup presentation, see http://www.mapr.com/company/events/boulder-denver-big-data-2-13-13
To keep abreast of Apache Drill news, follow @ApacheDrill on Twitter and visit the Apache Drill home page and wiki at http://incubator.apache.org/drill/