The Future of Data Warehouses and SQL-on-Hadoop

Q & A from “The Future of Hadoop Analytics: Total Data Warehouses and Self-Service Data Exploration” Webinar

The recent MapR webinar titled “The Future of Hadoop Analytics: Total Data Warehouses and Self-Service Data Exploration” proved to be a highly informative, in-depth look at the future of data warehouses and how SQL-on-Hadoop technologies will play a pivotal role in those settings. Matt Aslett, Research Director for 451 Research, along with Apache Drill architect Jacques Nadeau, discussed what lies ahead for enterprise data warehouse architects and BI users in 2015 and beyond.

If you missed the webinar, you can watch the replay here. Following are the answers to the questions that were asked at the end of the webinar:

Q: How is Apache Drill different from other SQL-on-Hadoop systems?

Jacques: Traditional SQL-on-Hadoop systems have been very focused on trying to make Hadoop look like an RDBMS. You start by defining a schema, and then you query with a SQL-like language. Many of them do not fully embrace SQL as a standard. At the core of it, they’re trying to make Hadoop look like a database again.

That’s where Drill is very different. Drill is focused on providing an additional level of flexibility so that users can interact with new types of data without making them look like a database first. If you think about the chart that Matt showed earlier: some users are replacing their existing data warehouses, while others are adding Hadoop for new use cases. By and large, traditional SQL-on-Hadoop is really focused on replacing your existing data warehouse. While Drill is fully capable of supporting that use case, it is also focused on enabling many new use cases that you traditionally couldn’t tackle without a lot of IT involvement.

Matt: And the way in which Hadoop is going to complement the existing analytic databases is not just by being another platform for storing more data, but for opening up new approaches to engaging with that data. Certainly from our analytics coverage, we see a lot of interest in self-service analytics and self-service integration. Hadoop and schema-on-read will be key to opening up those opportunities to get additional value from that data.

Jacques: The key to Hadoop, like any parallel system, is the ability to scale out as needed in order to support a certain quantity of data at a certain speed. Drill sidesteps MapReduce altogether; it maintains its own execution engine, which allows it to achieve very low latency. That being said, if you have one computer and seven terabytes of data, you’re probably not going to get your answer back very quickly, so sizing is also a consideration. The key difference is that Drill allows you to analyze semi-structured data without additional work. Drill can do this because, even if you aren’t aware of them, there are typically patterns inside the data. Drill does a good job of analyzing the data as it comes through and generating compiled code at runtime to process it very quickly, just as if the schema had been known ahead of time; it then simply recompiles as it sees changes in the schema. Drill not only provides users with the highest level of flexibility, it also focuses on delivering performance at the highest levels, and it is on par with the fastest of the other SQL-on-Hadoop solutions.
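As an illustration of that schema-free model, here is a minimal sketch of a Drill query against a raw JSON file (the path and field names are hypothetical, and it assumes the default dfs storage plugin is enabled):

    SELECT t.userId, t.geo.country AS country
    FROM dfs.`/data/clickstream/2015-01.json` t
    WHERE t.geo.country = 'US'
    LIMIT 10;

Note that no schema definition or table registration precedes the query; Drill discovers the structure, including the nested geo record, as it reads the file.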

Q: Do you have a query plan optimizer?

Jacques: We have an advanced parallel query optimizer that allows Drill to make smart decisions about where to place the work as well as how to plan the query. With any kind of Hadoop system, where you’re going to have minimal statistics, the optimizer can’t do quite as well as if you had full statistics on the data. However, Drill can generate fuller statistics and provide those to the optimizer as well, in order to do more advanced levels of optimization. Beyond that, Drill provides a substantial amount of extensibility, so you can customize your query workloads to make sure they run fast and really control how a query will be executed.
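You can also inspect the optimizer’s decisions directly using Drill’s EXPLAIN syntax; for example (the Parquet file path here is hypothetical):

    EXPLAIN PLAN FOR
    SELECT region, COUNT(*) AS cnt
    FROM dfs.`/data/sales.parquet`
    GROUP BY region;

The output shows the physical plan, including how the work is parallelized across the cluster.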

Q: What are the use cases for Apache Drill?

Jacques: It runs the gamut. There are people who are saying, “I just want to take a bunch of extra data that I can’t fit into my data warehouse, and I want to analyze that with SQL.” So that’s the traditional way to replace or offload the data warehouse; Drill is very well suited for that type of workload.

But it was also designed for situations where you have new types of data that you need to interact with quickly and draw conclusions from. That could mean interacting with JSON without manipulating it first, or working with a NoSQL solution like HBase. Drill allows you to query those systems using standard SQL, but with extensions so that you can interact with the data in a reasonable way.
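One concrete example of those extensions is the FLATTEN function for nested JSON. A small sketch, with a hypothetical file of orders that each contain an array of line items:

    SELECT o.orderId, FLATTEN(o.items) AS item
    FROM dfs.`/data/orders.json` o;

Each element of the items array becomes its own row, which can then be filtered, aggregated, and joined like any other column.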

Q. Are there different types of connectors for Apache Drill?

Jacques: Drill provides standard JDBC and ODBC drivers. It also provides lower-level, schemaless APIs in both C++ and Java, as well as a CLI (command-line interface) and a REST interface.

Q. Does Apache Drill store any info in its own DB?

Jacques: No. Drill is designed as a distributed query layer. One of the strengths of Hadoop is the loose coupling between systems; taking a monolithic approach to one of these problems generally causes problems for the other tools you want to use to interact with that data. The key thing changing in Hadoop is that you may have data stored in a certain way and want six different systems to interact with it, so locking it into one particular system is not the ideal approach. Apache Drill is data agnostic; it’s more optimized for certain data formats than others, but its focus is to support all of the different formats that exist, including other systems’ native formats.
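In practice, that means the same SQL works against different file formats in place. A sketch with hypothetical paths, reading a Parquet file and a CSV file directly (Drill exposes the fields of delimited text files through its columns array):

    SELECT * FROM dfs.`/data/events.parquet` LIMIT 5;

    SELECT columns[0] AS id, columns[1] AS name
    FROM dfs.`/data/users.csv`
    LIMIT 5;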

Q. Can Apache Drill also use an in-memory system as one of its sources?

Jacques: Yes. Initially, we support a certain set of data sources, which does not include an in-memory source. That being said, we have shown in numerous cases that writing a new storage plugin connector for a new type of data source is very straightforward. We are also looking at supporting REST APIs as well as certain socket interfaces as direct interfaces for Drill. Since we’re an open source community, if you’re interested in something, you can always come to the community and suggest a feature; we can look at it together and work on it.

Q. Does Apache Drill integrate with other tools such as Spring?

Jacques: Right now, Drill is primarily focused on analytical workloads, not operational workloads (inserts, updates, and deletes). It’s fully SQL-compliant for analytical workloads, but it doesn’t yet support operational workloads. That’s something that will be coming in the next 6-12 months. At that time, we will probably look at integrating Drill with Spring, ORMs, and those types of tools. Generally speaking, for analytical workloads, that’s not the most common interface.

Q. What about subquery support?

Jacques: Yes, Drill supports correlated subqueries with EXISTS, IN, and so on, using the standard syntax for all of those. One of the things Drill did as an open source project was pick up a very mature SQL parser with very strong coverage of the SQL specification. This is a foundational principle of Drill: it lets you interact with it using standard SQL. We won’t deviate the way HiveQL does; we are focused on standard SQL and making it work the way it should.
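A brief sketch of what that looks like, using a correlated EXISTS subquery over two hypothetical JSON files:

    SELECT e.name
    FROM dfs.`/data/employees.json` e
    WHERE EXISTS (
        SELECT 1
        FROM dfs.`/data/departments.json` d
        WHERE d.deptId = e.deptId
          AND d.budget > 1000000
    );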

Q. Does Drill support distributed joins?

Jacques: Yes. Anything that’s a data table within Drill can be queried across. Drill exposes things in terms of databases and schemas, but unlike most data systems, it supports analysis and joins across any of those databases and schemas in the same query.
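For example, a single query can join a Hive table to a raw JSON file through two different storage plugins. A sketch, assuming a Hive plugin registered under the name hive and hypothetical table and file names:

    SELECT c.name, SUM(o.amount) AS total
    FROM hive.orders o
    JOIN dfs.`/data/customers.json` c
      ON o.custId = c.custId
    GROUP BY c.name;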

Q: With all of these new approaches, what does it mean in terms of the importance of Hive?

Matt: You talked about how Drill can enable you to do things that you can’t do with Hive. From our perspective, Hive is going to continue to be important; obviously there are a lot of people involved in Hadoop projects who are using Hive already and have applications that depend on it. But the new technologies and products enable them to take different approaches for newer applications. Hive will become one of many approaches to SQL-on-Hadoop rather than the dominant approach.

Jacques: I would second that. I think Hive is critical to many of our customers, and it will continue to be an important player in the space. The reality is that each of these tools will find use cases where it’s really, really good, as well as use cases where it’s not. Five years ago, people asked the same question about Cassandra and HBase: why would you have both? What we’ve found over the last five years is that both tools became specialized for certain sub-use cases. I think that will continue to happen. Hive has an extraordinarily rich history and a very large feature base, and it’s actually much broader than most of the other open source solutions, but it also carries a legacy of complicated inner workings that were built organically.

The pieces are all going to fit together; the strength of the Hadoop ecosystem is that there doesn’t have to be one answer, and only one answer. Depending on particular use cases, and on people’s comfort with a particular type of technology, all of these projects will continue to have a place. There will probably be more specialization over the next five years with things like Spark SQL, much of it toward workloads that involve machine learning and more complicated semi-interactive workflows. I call those workflows semi-interactive only because you have to write a bunch of code before you can run them, which means few-second response times are a little less important. Hive is known for its extreme scale, in part because it relies heavily on MapReduce for execution, but that architecture also leaves it with some legacy around serving large numbers of users. So each of these tools will see a huge amount of specialization.

Q: How do you do transformation with Hadoop?

Matt: You can certainly write the transformation job yourself, but more commonly we see that people use existing ETL tools. Most of those now support Hadoop as the underlying engine, as well as the source and the target for ETL jobs.
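For lighter transformations, Drill itself can reshape and rewrite data with CREATE TABLE AS (CTAS), which by default writes query results back out as Parquet. A minimal sketch, assuming the writable dfs.tmp workspace and a hypothetical source file:

    CREATE TABLE dfs.tmp.clean_events AS
    SELECT r.userId,
           CAST(r.amount AS DOUBLE) AS amount,
           r.geo.country AS country
    FROM dfs.`/data/raw_events.json` r;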

Q: How do you get started with Apache Drill?

Jacques: We provide the MapR Sandbox with Apache Drill, a VM that you can download to play with the tools and begin to understand how they all work together. As part of that, we provide a large number of tutorials so you can walk through a use case and better understand which technologies will solve your particular problem.
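Once the sandbox is up, a common first query in the Drill tutorials reads the sample employee.json file that ships on Drill’s classpath via the cp storage plugin:

    SELECT * FROM cp.`employee.json` LIMIT 5;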

Want to learn more? Check out these resources on MapR and Apache Drill:

Download the MapR Sandbox with Apache Drill: https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill

Download Apache Drill: http://incubator.apache.org/drill/

Ask questions: drill-user@incubator.apache.org

Contribute: http://github.com/apache/incubator-drill/
