Drill into Your Big Data Today with Apache Drill

Apache Drill has been gaining significant user adoption and community momentum since its initial Beta availability in September 2014. The generally available version of Drill, Drill 1.0, was released in May 2015, and numerous customers have deployed and used Drill in production since then. In this blog post, I will briefly summarize the key capabilities that customers find most valuable in Drill. I'll also cover common use cases where Drill is deployed, as well as resources for getting started with Drill.

Why Drill is compelling for customers

1) Drill provides SQL access on any type of data, with extreme flexibility and ease of use

With Drill, you can query data in files, a Hive data warehouse, HBase tables, or even non-Hadoop storage systems in just a few minutes, and you can combine data from these sources on the fly. There's no need to define and maintain central metadata definitions. Drill queries data in situ and discovers schema on the fly. Along with comprehensive SQL support built on an advanced SQL parser (Apache Calcite), Drill provides SQL extensions to natively query and manipulate complex data types, such as the arrays and maps commonly seen in newer data sources (website clicks, social media, and sensor data) in big data environments. Drill also comes with ODBC/JDBC drivers, so it can be plugged into BI tools such as Tableau and MicroStrategy easily for wide usage across the organization.
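To make this concrete, here is a minimal sketch of what such a query could look like when submitted as JSON to Drill's REST query endpoint (POST /query.json on the Drill web port). The file path, the Hive table name, and the field names are hypothetical, and the sketch only builds the request body; it does not contact a Drill cluster.

```python
import json

# Hypothetical file path, Hive table, and field names, for illustration only.
# The query reads a raw JSON file in place, flattens a nested array with
# Drill's FLATTEN function, and joins the result with a Hive table.
SQL = (
    "SELECT c.user_id, FLATTEN(c.events) AS event, o.total "
    "FROM dfs.`/data/clicks.json` c "
    "JOIN hive.orders o ON c.user_id = o.user_id "
    "LIMIT 10"
)


def build_query_body(sql: str) -> str:
    """Return the JSON request body for a Drill SQL query."""
    return json.dumps({"queryType": "SQL", "query": sql})


body = build_query_body(SQL)
print(body)
```

Note that the query names the file directly, with no table definition created beforehand; this assumes the dfs and hive storage plugins are enabled under those names.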

2) Drill provides low latency performance at scale

Drill is a distributed, columnar SQL query engine built from the ground up for complex data. It doesn't use MapReduce, Tez, or Spark. Drill can be deployed on a single node or scaled horizontally to tens, hundreds, or thousands of nodes, depending on the number of users to be supported, the performance SLAs to be met, and the amount of data that needs processing. Along with scale, Drill is built for performance. The in-memory columnar execution engine, designed for optimistic processing of short queries, is combined with advanced and pluggable optimizations, including partition pruning, pushdown operators, and rule-based and cost-based query rewrite capabilities. These capabilities make Drill a powerful interactive tool in the big data ecosystem.
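As a rough illustration of why partition pruning matters, the toy sketch below skips whole directories whose partition value cannot match the query filter, so only the relevant files are scanned. This is not Drill's implementation; the directory layout and the function name are made up for the example.

```python
from typing import Dict, List

# Toy directory layout (hypothetical): partition value (year) -> files.
PARTITIONS: Dict[int, List[str]] = {
    2013: ["logs/2013/a.json"],
    2014: ["logs/2014/a.json", "logs/2014/b.json"],
    2015: ["logs/2015/a.json"],
}


def prune(partitions: Dict[int, List[str]], wanted_year: int) -> List[str]:
    """Return only the files in partitions that match the filter value."""
    return [
        path
        for year, files in partitions.items()
        if year == wanted_year
        for path in files
    ]


# A filter on year = 2014 touches two files instead of all four.
files_to_scan = prune(PARTITIONS, 2014)
print(files_to_scan)
```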

3) Drill provides a granular and de-centralized security model

Views in Drill typically serve as management units that provide granular row-level and column-level access control on Hadoop data. Unlike other SQL technologies/tools, Drill views are de-centralized entities, maintained simply as files on the file system (users can choose the file system location in which to create a view as part of the query). This means that views can be secured using file system permissions, without any need to stand up a separate security repository for managing permissions.

Additionally, Drill supports user impersonation, so the identity of the specific user can be used to access these views, instead of having system or process users access the data, which is not acceptable in several user environments. Drill also offers powerful ownership-chaining capabilities that control how many levels of nested views a given user can access, so organizations can strike a balance between self-service data exploration and controlled governance.
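As a minimal sketch of the view-based model described above, the helper below assembles a CREATE OR REPLACE VIEW statement that exposes only selected columns and rows. The helper itself, the workspace, the view name, the source path, and the column names are all hypothetical; only the CREATE OR REPLACE VIEW syntax comes from Drill.

```python
from typing import Sequence


def create_view_sql(workspace: str, name: str, columns: Sequence[str],
                    source: str, row_filter: str) -> str:
    """Build a Drill view that limits both columns and rows."""
    column_list = ", ".join(columns)
    return (
        f"CREATE OR REPLACE VIEW {workspace}.`{name}` AS "
        f"SELECT {column_list} FROM {source} WHERE {row_filter}"
    )


# Hypothetical workspace, view name, source file, and columns.
stmt = create_view_sql(
    workspace="dfs.tmp",
    name="emea_orders",
    columns=["order_id", "amount"],
    source="dfs.`/data/orders.json`",
    row_filter="region = 'EMEA'",
)
print(stmt)
```

Drill stores the resulting view definition as a file in the chosen workspace, so read access to the view can then be managed with ordinary file system permissions on that file.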

Use cases for Drill

At a broad level, the use case for Drill is to provide self-service BI/ad hoc queries on the data stored in a Hadoop data lake/data hub. Several sub use cases exist under this umbrella, and below are some common usage patterns we see among customers using Drill in their environments. Note that customers often run a mix of these use cases simultaneously, depending on their data processing and reporting requirements.

  • Raw data exploration: Data typically arrives in a Hadoop cluster in raw formats such as text and JSON. The goal is to make it available for queries to end users, analysts, data scientists, and other SQL experts as quickly as possible in a self-service fashion. This is the most powerful and lowest-barrier entry point we have seen customers use to get started with Drill. Drill brings light to these large raw datasets (which are sometimes ignored because of the complexity and cost involved in processing them), instantly opening up new types of BI use cases such as ad hoc proofs of concept and queries, new product development, data discovery for building models, data exploration, and data quality reporting.
  • Low latency queries on Hive tables: In this use case, data arriving in a Hadoop cluster from a variety of data sources (often offloads from traditional systems) is first modeled, pre-processed, and transformed using Hive ETL jobs. The goal is to open up the datasets stored in Hive for BI/ad hoc queries. This is the standard use case, and almost all of the SQL-on-Hadoop tools are focused on solving it. Drill offers strong value for this use case with its ANSI SQL capabilities, deep integration with Hive that allows reuse of Hive assets (such as file formats, UDFs, and metadata definitions), and large performance gains over queries run via Hive.
  • Operational analytics on HBase/MapR-DB: In this use case, HBase/MapR-DB is used as an operational data store/data hub for wide, sparse, often dynamic datasets that require frequent updates. With its ability to discover schema on the fly from NoSQL data sources in real time, and comprehensive SQL function support to read/interpret a variety of data types and encodings, Drill serves as a natural tool to query the data in these systems.  
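To illustrate the last point about data types and encodings: HBase stores row keys and cell values as raw bytes, so a query has to state how to interpret them. In Drill this is typically done with the CONVERT_FROM function and an encoding name such as 'UTF8' or 'INT_BE'. The sketch below mimics those two decodings in plain Python on a made-up row; it does not call Drill, and the row contents are hypothetical.

```python
def decode_utf8(raw: bytes) -> str:
    """Interpret raw bytes as UTF-8 text."""
    return raw.decode("utf-8")


def decode_int_be(raw: bytes) -> int:
    """Interpret raw bytes as a signed big-endian integer."""
    return int.from_bytes(raw, byteorder="big", signed=True)


# Hypothetical HBase-style row: every value is an opaque byte string.
row = {
    "row_key": b"student-17",
    "age": (42).to_bytes(4, byteorder="big", signed=True),
}

decoded = {
    "row_key": decode_utf8(row["row_key"]),
    "age": decode_int_be(row["age"]),
}
print(decoded)
```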

Product progress

The Drill community is making rapid progress on the product with iterative releases. Soon after the core foundation was delivered in the GA release, Drill 1.1 followed in July 2015 (refer to the release notes), building on the feature set to support the above use cases, with continued improvements in SQL support, performance, scale, and enterprise manageability. There are more exciting enhancements in the Drill 1.2 release for you to check out as well.

How to get started with Drill

For full documentation, please refer to http://drill.apache.org/docs. Additional resources can be found at http://mapr.com/apachedrill.

Do you have any questions about Apache Drill? Ask them in the comments section below.
