Apache Drill: Introduction, Differentiation and Use Cases – Webinar Q & A Follow Up

We recently wrapped up a webinar series for a global audience on the topic of “Apache Drill: Introduction, Differentiation and Use Cases,” which proved to be highly interactive and engaging. The webinar provided a quick introduction to Drill, covered key Drill differentiators for SQL specialists and business analysts, and provided an overview of new Hadoop use cases that were uncovered during the Drill Beta at MapR.

If you missed the webinar, you can watch the replay here.

Here’s a summary of the questions that were asked during these webinars, along with the answers.

Q: How does Drill compare to Impala?

A: At a broad level, Drill, Impala, Hive and Spark SQL all fit into the SQL-on-Hadoop category. What differentiates Drill is that it allows data exploration on datasets without requiring schema definitions upfront in the Hive metastore. Whether your data is in text files, JSON files, or other file formats, Drill is built to work with schema that is dynamic, as well as data that is complex. Compared with Impala, Drill handles nested data better and can work with data that has no predefined schema. From a performance standpoint, both Drill and Impala target interactive use cases, so both are optimized for performance and SLAs.
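To make the schema-free exploration more concrete, here is a minimal sketch of submitting a query against a raw JSON file. It assumes Drill's REST endpoint is reachable at `localhost:8047` and that a file exists at the hypothetical path `/data/logs/clicks.json` under the `dfs` storage plugin; the field names are also made up for illustration. The code only builds the request, so it can be inspected without a running drillbit.

```python
import json
import urllib.request

# Assumption: a drillbit's web/REST port is reachable at this address.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) an HTTP POST that submits a SQL query to Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical file and fields: the JSON file is queried in place,
# with no table or schema defined beforehand.
sql = (
    "SELECT t.customer.name AS customer_name, t.action "
    "FROM dfs.`/data/logs/clicks.json` t LIMIT 10"
)
request = build_drill_request(sql)

# To run it against a live drillbit, uncomment:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Note that the nested field `customer.name` is addressed directly in the SELECT list; nothing about the file's structure was declared first.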

Q: Where does Drill metadata reside?

A: Drill has the ability to discover schema on-the-fly, so there is no central, persistent store for metadata. That is the beauty of it: Drill is the query engine on top of the data sources; it can understand their metadata, but it doesn’t require any centralized metadata repository. Some of the work that you would do, such as creating views or creating tables, simply gets persisted back to the file system. There is no metadata repository that you create and manage just for Drill.
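As a rough illustration of what discovering schema on-the-fly means, the sketch below infers field paths and observed types by reading the records themselves instead of consulting a catalog. This is a conceptual toy, not Drill's actual implementation; the function names and sample records are hypothetical.

```python
import json


def discover_schema(json_lines):
    """Infer field paths and the types seen for each, purely from the data."""
    schema = {}
    for line in json_lines:
        _collect(json.loads(line), "", schema)
    # Sort the type names so the result is stable and easy to compare.
    return {path: sorted(types) for path, types in schema.items()}


def _collect(value, prefix, schema):
    """Walk nested objects, recording a dotted path for every leaf value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            _collect(child, path, schema)
    else:
        schema.setdefault(prefix, set()).add(type(value).__name__)


# Hypothetical records: the second one adds fields and changes the type of "id".
records = [
    '{"id": 1, "user": {"name": "ann"}}',
    '{"id": "2", "user": {"name": "bob", "age": 30}, "tags": ["a"]}',
]
schema = discover_schema(records)
```

The second record introduces new fields and a different type for `id`, and the inferred schema simply grows to accommodate them. No metadata had to be registered anywhere before reading.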

Q: What is the recommended configuration of memory for each drillbit process on the data nodes?

A: Drill is an in-memory execution engine. It is designed for performance and for short queries. So the more memory you can make available to Drill, the more of a speedup you can expect to see. At the same time, you can control how much memory Drill uses by specifying it as part of the configuration. The drillbit default limit is 8G, but you can change it in the configuration file according to your needs.

Q: Does Drill support updates and deletes to the Hive tables?

A: To reiterate, Drill has the ability to integrate with the Hive metastore, so you can use Drill to run low-latency queries on any Hive tables and views that you have defined there. At this point, the focus for Drill is mainly analytics (queries), so you’re able to read the data. Updates and deletes are on the roadmap, but they are not available in the product today.

Q: Can Drill be used to query ElasticSearch or Solr?

A: Drill has a very flexible, extensible storage plugin interface. As I mentioned, today we have plugins for file system, HBase, MapR-DB, MongoDB, and Cassandra. There’s a lot of interest in the community to go beyond just the Hadoop types of data sources. While there is flexibility to add that, there are no out-of-the-box plugins today for ElasticSearch or Solr.

Q: Does Drill support windowing and SQL functions?

A: Drill supports a number of SQL functions. At this point, windowing is not yet available, but it’s on the roadmap and we are working on it. You should expect to see windowing in the later part of 2015.

Q: How does Drill manage security?

A: We didn’t have a chance to cover this in detail, but Drill has the ability to create views. Views are the key mechanism that you would use to provide granular access control to users. First of all, there is authentication: you can configure an authentication module, and Drill will validate users against whichever module is configured. In terms of authorization and security permissions, you would do that through views. Views are simply files on the file system, so whatever file system security model you have also applies to views. It is a very decentralized permission model. You don’t need to define or manage a centralized repository of security permissions; it’s all done through the file system.
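The view-based approach can be sketched as follows. This is a minimal illustration, assuming a writable `dfs.tmp` workspace; the view name, column names, source path and helper function are all hypothetical. The sketch only generates the SQL statement that exposes a subset of columns.

```python
def create_view_sql(workspace, view_name, columns, source):
    """Return a CREATE VIEW statement exposing only the listed columns."""
    column_list = ", ".join(f"`{c}`" for c in columns)
    return (
        f"CREATE OR REPLACE VIEW {workspace}.`{view_name}` AS "
        f"SELECT {column_list} FROM {source}"
    )


# Hypothetical example: expose order totals but leave out any customer details.
statement = create_view_sql(
    workspace="dfs.tmp",
    view_name="orders_public",
    columns=["order_id", "order_total"],
    source="dfs.`/data/orders.json`",
)
```

Because the resulting view is persisted as a file in the workspace directory, ordinary file system permissions on that file decide who can query through it, while the underlying data file can remain restricted.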

Q: Does Drill provide advantages in querying relational tables that are unwieldy using standard SQL queries? Queries that take many hours, for example?

A: I believe the question here is, “If I have relational data at very large scale, and the relational system I have struggles to handle the SQL queries, can Drill handle that?” The answer is yes. Drill is designed for scale: you can grow the cluster by adding more drillbits, so it can certainly handle relational queries at scale. Another aspect to note is that Drill is optimized for short queries, so if you have, for example, 18-hour queries, Drill does not provide the kind of fault tolerance that MapReduce does. Fault tolerance comes with a cost. At each stage, you have to read the data in, process it, and write the output for the next step back to disk, so there is a cost associated with checkpointing and fault tolerance. Drill and similar SQL-on-Hadoop engines that focus on interactive performance are optimistic execution engines: they assume failures are rare within the span of a query and skip that per-stage overhead. However, if you believe your systems can sustain workloads that run for several hours, Drill can certainly handle that. There are no limitations in terms of being able to handle large queries or large-scale data.

Q: Your Tableau demo looks great. Can I do the same thing with Excel?

A: Yes, you can do that. A number of BI tools, including Excel, Tableau, MicroStrategy and SAP Lumira, can be connected through JDBC and ODBC, so you can do the same demo with any of them.

Q: Would Excel generate subqueries based on drag-and-drop, or does it pull the entire data into the local cache and then analyze it? We faced this problem with Excel on Hive ODBC.

A: In this particular case, I’m not quite sure how Excel behaves, but most BI tools let you run live queries and also bring the data in-memory. You can do both with Tableau, so I think you can do the same thing with Excel, using its BI functionality. You can bring the data into memory and slice-and-dice it there, or you can run live queries against Drill through ODBC.

Q: Do you have any benchmark results vs. Impala or Stinger or Spark SQL?

A: Drill has a very different value proposition in terms of time to insight. For any SQL-on-Hadoop system, performance is not just about query performance; it’s also about data management. To get value from the data in Hadoop or NoSQL systems using Impala, Stinger or Spark SQL, it could take users weeks to months before they even get to the data, because of the modeling step. With Drill, you can get to the data in just minutes and start getting value from it. From a pure query performance standpoint, Drill is at parity with several of these systems. We are planning to publish benchmarks in the next couple of months.

Q: What is the easiest way to get started with Drill?

A: For more examples on how to use Drill, download the Apache Drill sandbox and try out the sandbox tutorial. Refer to the Apache Drill website for additional information.

Learn More: 

MicroStrategy recently certified its Analytics Platform with Apache Drill. Learn more about the integration and its value for business analysts in the upcoming webinar on April 15th: Harnessing the Power of Multi-structured, Fast-changing Big Data. Register Here