M.C. Srivas, CTO and co-founder of MapR Technologies, recently spoke at the Munich Hadoop User Group about the Apache Drill project. The following is a blog post from HUG Muenchen, originally published on the comSysto blog.
A deep dive into Apache Drill - fast interactive SQL on Hadoop
What is your personal role within the Drill project?
M.C. Srivas: I work with the Drill team to understand performance, figure out some of the architectural issues, and basically play around with it. But the project is mainly run by Jacques Nadeau, and he is a really great guy to work with.
What is the philosophy behind Apache Drill?
M.C. Srivas: Two main things: First, Drill is designed to be completely extensible, and second, Drill is designed to do things that the underlying data storage may not be capable of doing, yet it can exploit the power of the data storage when it is indeed capable.
How does Drill compare to Hive, Impala & Shark?
M.C. Srivas: Drill implements full ANSI SQL 2003, with some really cool extensions. Hive, Impala and Shark are all implementations of the Hive query language, which differs from ANSI SQL.
What makes Apache Drill special?
M.C. Srivas: Apache Drill is the first time anyone has tried to handle semi-structured data in a meaningful manner within the SQL language. It is also the first time that SQL can handle self-describing data without requiring a metadata manager, so an analyst can query data without first defining a schema. The raw data can be processed directly, without ETL. The toughest challenge was to detect the schema automatically, and to compensate when the schema changes during the query itself.
How are the different data sources reflected in the query language?
M.C. Srivas: The source of the data is included directly in the FROM clause instead of using connectors. Other interesting innovations include tokenizing the directory structure in the data file pathnames, so those tokens can be used in the query.
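To make the idea of directory tokens concrete, here is a minimal Python sketch, not Drill's actual code. It assumes the tokens are exposed under names like `dir0`, `dir1` (the naming used in Drill's documentation for directory columns); the function `directory_tokens` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def directory_tokens(root, file_path):
    """Map each directory level between `root` and the file to a named token.

    Hypothetical helper for illustration only: it mimics the idea of
    turning path components into queryable columns (dir0, dir1, ...).
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    # Every component except the final file name becomes one token.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}


# Example: a log file stored under year/month directories.
tokens = directory_tokens("/logs", "/logs/2014/05/events.json")
print(tokens)
```

With tokens like these available as columns, a query could filter on the year or month directory without that value being stored inside the data files themselves.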
How does Drill handle nested data?
M.C. Srivas: Drill introduces a FLATTEN clause to promote nested data to the top level, where it can be queried. Drill also borrowed an idea from Google’s Dremel and BigQuery to query inside the nested data, by implementing the WITHIN RECORD clause in the FROM clause.
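As a rough illustration of what promoting nested data to the top level means, the following Python sketch flattens a repeated field into one row per element. It is only a conceptual model under our own assumptions: the function name `flatten`, the record layout, and the field names are hypothetical, and Drill's real operator works on its internal columnar data rather than on Python dictionaries.

```python
def flatten(records, field):
    """Yield one output row per element of the repeated `field`.

    All other fields of the parent record are copied into each row,
    so the nested values end up at the top level next to them.
    """
    for record in records:
        for element in record.get(field, []):
            row = {k: v for k, v in record.items() if k != field}
            row[field] = element
            yield row


# Hypothetical nested input: each customer has a list of items.
orders = [
    {"customer": "alice", "items": ["pen", "ink"]},
    {"customer": "bob", "items": ["paper"]},
]

rows = list(flatten(orders, "items"))
for row in rows:
    print(row)
```

After flattening, each item sits in its own row and can be filtered, grouped, or joined like any ordinary column value.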
You mentioned changing schemas while the query is running. How does Drill manage these?
M.C. Srivas: Drill does all its work in 256K boundaries. If a schema change is detected within the last 256K of data, it will first emit whatever it has computed so far, and then reconfigure its operators to the new schema and continue execution.
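The behaviour described above can be sketched in a few lines of Python. This is a simplified model, not Drill's implementation: it assumes a schema is just the set of field names and value types, it counts records instead of using the 256K boundary mentioned in the answer, and `run_in_batches` is a hypothetical name.

```python
def run_in_batches(records, batch_size=4):
    """Group records into batches, closing a batch early on a schema change.

    Each yielded pair is (schema, batch). A downstream operator could
    reconfigure itself whenever the schema differs from the previous one.
    """
    batch, schema = [], None
    for record in records:
        # Assumption: a schema is the sorted (field name, type name) pairs.
        record_schema = tuple(
            sorted((key, type(value).__name__) for key, value in record.items())
        )
        if batch and (record_schema != schema or len(batch) == batch_size):
            # Emit whatever has been collected so far before continuing.
            yield schema, batch
            batch = []
        schema = record_schema
        batch.append(record)
    if batch:
        yield schema, batch


# Hypothetical input where field "a" changes type and a new field "b" appears.
records = [{"a": 1}, {"a": 2}, {"a": "x"}, {"a": "y", "b": 1.0}]
batches = list(run_in_batches(records))
for batch_schema, batch in batches:
    print(batch_schema, batch)
```

The point of the sketch is the ordering: results for the old schema are emitted first, and only then does processing continue under the new schema.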
How is MapR involved in the Drill project?
M.C. Srivas: MapR kicked off the Drill project about 18 months ago, and now has almost 20 engineers working full time on it. But the project itself is much larger than MapR and there are several companies and individuals involved in the project. Including the folks at MapR, there are about 35-40 people actively working on Drill.
Does Drill take advantage of any of MapR’s special features?
M.C. Srivas: No, because it’s not possible to do so. The special features of MapR are all administrative improvements and do not modify the API.
Finally: What’s your impression of the Munich Hadoop User Group?
M.C. Srivas: I think there’s always a lot of interest in Hadoop and Big Data in general in Munich. There are many companies in Munich that are doing Hadoop projects. I am very grateful to comSysto for sponsoring and organizing the HUG regularly. comSysto is a great company and the people and the management are really terrific to work with.