Editor's Note: In this week's Whiteboard Walkthrough, Tomer Shiran, PMC member and Apache Drill committer, walks you through the deployment of Apache Drill with different storage systems and the connection with BI tools.
Here's the transcription:
Let's talk a little bit about how you would deploy Apache Drill. Of course, you could download it and run it on your laptop, and that's a great way to get started, but once you're ready to run it at scale and query lots of data, you would want to run it on a cluster of Linux boxes. Drill allows you to run it in a co-located mode, where the Drillbit processes (these are the Drill daemons that do the actual work) run on the same nodes as the underlying storage system.
In this case, in a Hadoop cluster, you would have a DataNode and a Drillbit on every node. Drill makes an effort to achieve data locality: to the extent possible, the system tries to make sure that the fragment of the query running on a specific node is processing data that sits on that same node. If you were, for example, running Drill in order to query data in a MongoDB cluster, then in this picture you just replace the DataNodes with the MongoDB processes that you have in a Mongo cluster. Basically, it's the exact same concept.
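To make the data locality idea concrete, here is a toy sketch of locality-aware assignment. It is not Drill's actual planner or scheduler: the function `assign_fragments`, the block names, and the node names are all hypothetical, and the rule it uses (prefer the least-loaded node that holds a local copy, otherwise fall back to any node) is only an assumption chosen for illustration.

```python
from collections import defaultdict


def assign_fragments(block_locations, drillbit_nodes):
    """Toy locality-aware assignment (illustrative only, not Drill's algorithm).

    block_locations: dict mapping a block id to the nodes holding a copy of it.
    drillbit_nodes:  set of nodes that run a Drillbit.
    Returns a dict mapping each block id to the node chosen to process it.
    """
    load = defaultdict(int)  # number of blocks already assigned to each node
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        # Prefer nodes that both run a Drillbit and hold the data locally.
        local = [node for node in replicas if node in drillbit_nodes]
        # No local copy available: any Drillbit will do, and it reads remotely.
        candidates = local if local else sorted(drillbit_nodes)
        # Among the candidates, pick the least-loaded node (ties broken by name).
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


blocks = {
    "block-0": ["node1", "node2"],
    "block-1": ["node2", "node3"],
    "block-2": ["node9"],  # held only on a node without a Drillbit
}
plan = assign_fragments(blocks, {"node1", "node2", "node3"})
print(plan)
```

In this sketch, the first two blocks are processed on nodes that hold them locally, while the third has no co-located Drillbit and is handed to whichever node is least busy.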
Now, sometimes you may want to query data that doesn't sit on the cluster, and actually can't sit on the same physical cluster where Drill is running. For example, think of something like Amazon S3 as a remote storage system whose implementation details aren't important, but Drill has to be able to talk to it remotely. In that case, you have just the Drillbit running on every node in the cluster, and these Drillbits communicate directly with the S3 storage through the standard S3 APIs. That's how Drill operates.
Now, when a client wants to submit a query, or in general interact with a cluster, it can talk to any of these Drillbit processes. This is a symmetric architecture, and the Drillbit process to which a client submits the query is what we call the Foreman. It doesn't do a lot of work, but it does do the query planning. The actual execution of the query runs in parallel on the entire cluster. The client could be an ODBC or a JDBC client. That would be a BI application, something like Tableau or QlikView or MicroStrategy or Spotfire, or even Excel. That goes through the ODBC driver on the client side, so here you would have the ODBC driver and then a tool like Tableau. The ODBC driver then interacts with one of the Drillbits, and it doesn't really matter which one. It can talk to any of them.
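As a small illustration of the symmetric architecture, the sketch below picks an arbitrary Drillbit and builds a direct-connection JDBC URL for it; whichever one is chosen would act as the Foreman for that client's queries. The host names and the helper `drill_jdbc_url` are hypothetical, and the URL form `jdbc:drill:drillbit=<host>:<port>` with user port 31010 is assumed here, so check it against the driver version you actually use.

```python
import random


def drill_jdbc_url(drillbits, port=31010, rng=random):
    """Build a direct-connection JDBC URL for one arbitrarily chosen Drillbit.

    The architecture is symmetric, so any Drillbit can accept the query and
    act as the Foreman. The port default is an assumption for this sketch.
    """
    host = rng.choice(drillbits)
    return f"jdbc:drill:drillbit={host}:{port}"


# Hypothetical host names for a three-node cluster.
cluster = ["drill1.example.com", "drill2.example.com", "drill3.example.com"]
url = drill_jdbc_url(cluster)
print(url)
```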
Then, you could also have REST clients. There's a REST API for Drill, and your custom applications in Python or Java or any other language can use it as well. Let's say you wanted to build custom dashboards, for example; you can use the REST interface to talk to the Drill cluster. That's the high-level description of how a Drill cluster is deployed and how the client communicates with the cluster.
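For the REST path, a custom Python application might look roughly like the sketch below. It assumes the Drillbit's web port is 8047 and that queries are submitted as a JSON POST to `/query.json` with `queryType` and `query` fields; the host name and the helper names `build_query_request` and `run_query` are hypothetical. Only the standard library is used.

```python
import json
import urllib.request


def build_query_request(host, sql, port=8047):
    """Prepare (but do not send) a REST request that submits a SQL query.

    Assumes the Drillbit REST endpoint is /query.json on the given port.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(host, sql):
    """Send the query to one Drillbit and return the result rows.

    Assumes the JSON response carries the result set under a "rows" key.
    """
    with urllib.request.urlopen(build_query_request(host, sql)) as response:
        return json.load(response)["rows"]


# Build a request against a hypothetical Drillbit; nothing is sent here.
request = build_query_request("drill1.example.com", "SELECT 1")
print(request.get_method(), request.full_url)
```

A dashboard backend could call `run_query` against any Drillbit in the cluster, since each one can accept the query.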