This blog post covers the Drill features that our beta customers found most exciting and important, and also discusses some noteworthy features that the Drill community implemented based on our feedback.
Features that our beta customers loved about Drill include:
Getting Started with Drill is Extremely Easy: You can get started with Drill by installing it on your Hadoop cluster, by downloading the MapR Sandbox for Drill, or simply by downloading the Drill binaries onto your laptop and running Drill in embedded mode. Most beta users found the laptop option the most effective. This was especially true for BI users who wanted to learn Drill without first having to be introduced to Hadoop. You can find more information on the different options at the end of the blog.
Improving Data Pipelining Processes: Drill prompted our end users to rethink their day-to-day data pipelining processes and activities. The BI teams were accustomed to relying on IT to get the heavier ETL work done prior to any analytical work, and had established certain processes and informal SLAs to do so. This dependency only got worse with the onset of big data. Given that Drill does not require schema definitions upfront, the BI groups now feel much more empowered to explore new datasets on their own. Meanwhile, the IT teams are also welcoming Drill, because they can now focus on their larger data governance, reliability and security objectives and avoid being a bottleneck in the data analytics process.
Seamless Connectivity to Existing BI Tools: Given that Drill supports standard ODBC connectivity (thanks to Simba for keeping the drivers up-to-date), users were able to plug their existing BI tools into Drill with just a few clicks. Drill also supports ANSI SQL semantics, which was equally exciting because there was no learning curve on the BI user’s end.
We received a lot of product feedback from our beta customers. MapR duly submitted these requests to the community, which has already released Drill 0.7 with a number of feature enhancements and bug fixes. You can learn more about the releases here. At a high level, however, certain core technology themes and requirements surfaced over the last few months.
Need to Standardize on an Efficient Data Format
One of the most frequently asked questions during the beta program was, “What is the most efficient format to store analytics-bound data on Hadoop?” For those who come from the columnar storage world, the answer may not come as a surprise. In our experience, the Parquet data format is currently the most advanced and flexible data format for SQL analytics on Hadoop. It is binary and columnar, it has efficient compression characteristics, and it allows for JSON-like flexible semantics. Drill leverages the Parquet format to the maximum extent possible for query optimization, and has extended it to include new features such as support for scalar and complex data types. More work is happening on this front; you can read more about Parquet on Drill here.
Automatic Partition Pruning
Partition pruning emerged as a key requirement that Drill users wanted as soon as possible. As you may already know, partition pruning can substantially improve performance by limiting the datasets that are scanned during a query. Drill automatically treats data stored in different directories as partitions, and ensures that the query optimizer scans only the relevant ones when the appropriate filters are applied. You can learn more about this feature here.
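To make the idea concrete, here is a minimal Python sketch of directory-based pruning. It only illustrates the concept and is not how Drill implements it: the file listing, the function names, and the equality-only filter are all assumptions made for this example. The dir0, dir1 column names follow the naming Drill uses for directory levels.

```python
from typing import Dict, List

# Hypothetical file listing: one subdirectory per year under a "logs" root.
FILES = [
    "logs/2013/jan.json",
    "logs/2013/feb.json",
    "logs/2014/jan.json",
    "logs/2014/feb.json",
]


def partition_columns(path: str, root: str) -> Dict[str, str]:
    """Map each directory level below root to dir0, dir1, ... (file name excluded)."""
    relative = path[len(root):].strip("/")
    directories = relative.split("/")[:-1]
    return {f"dir{i}": name for i, name in enumerate(directories)}


def prune(files: List[str], root: str, filters: Dict[str, str]) -> List[str]:
    """Keep only the files whose directory-derived columns match every filter."""
    selected = []
    for path in files:
        columns = partition_columns(path, root)
        if all(columns.get(key) == value for key, value in filters.items()):
            selected.append(path)
    return selected


# A filter on the first directory level means the 2013 files are never opened.
to_scan = prune(FILES, "logs", {"dir0": "2014"})
print(to_scan)
```

The point of the sketch is that the filter is evaluated against directory names alone, so files in non-matching directories are dropped before any of their contents are read.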
KVGEN and FLATTEN
Working with nested data can be tricky, especially when you are used to tabular rows and columns in the relational world. Based on beta customer feedback, and to bring the nested world closer to the relational world, the Drill community developed two functions that come in extremely handy for developers and analysts. The first is KVGEN, which generates key-value pairs out of any arbitrary “map” in your dataset. You can think of a deep hierarchical map as a simple collection of key-value pairs that is still maintained in its hierarchical structure. The second is FLATTEN, which was added as the next step: it brings all of these key-value pairs to one level as individual records. Once you flatten your data, you can continue to use regular SQL constructs such as WHERE clauses, joins, counts, averages, and so on to analyze the data easily.
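The two steps described above can be mimicked in a few lines of Python. This is only a sketch of the semantics, not Drill's implementation: the kvgen and flatten helpers, the sample record, and its field names are hypothetical and exist purely for illustration.

```python
from typing import Any, Dict, List


def kvgen(mapping: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a map with arbitrary keys into a list of {key, value} pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(record: Dict[str, Any], column: str) -> List[Dict[str, Any]]:
    """Emit one record per element of a list column, repeating the other columns."""
    rows = []
    for element in record[column]:
        row = dict(record)  # shallow copy so the other columns carry over
        row[column] = element
        rows.append(row)
    return rows


# Hypothetical nested record: page names are data, not a fixed schema.
record = {"user": "alice", "clicks": {"home": 3, "search": 5, "cart": 1}}

# Step 1: generate key-value pairs from the map.
with_pairs = {"user": record["user"], "clicks": kvgen(record["clicks"])}

# Step 2: flatten the pairs into individual records.
rows = flatten(with_pairs, "clicks")

# Step 3: ordinary filtering now works, much like a WHERE clause.
busy_pages = [r["clicks"]["key"] for r in rows if r["clicks"]["value"] >= 3]
print(busy_pages)
```

After the second step each page sits in its own record alongside the user, which is why relational-style filtering and aggregation become straightforward.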
These were just a few highlights. As we head towards GA, we will keep you posted on newer insights we gain from end users. If you are already using Drill in your environments, feel free to share your experiences here.
Here are some links you will find useful to get started with Drill: