SQL has become really hot – Why? Customers are looking for interactive performance in big data solutions with streamlined work flow and flexibility in their choices. Being able to use SQL effectively on Hadoop and other big data systems is a big step toward meeting that goal.
One reason for this need is that tools talk SQL, but previously big data solutions did not. SQL was developed at IBM in the 1970s by Donald Chamberlin and Raymond Boyce, building on Ted Codd's relational model, because people needed a standardized way to access and use data from relational databases. That need is still there, and it is more important than ever because so many systems have been coded to produce standard SQL. With modern scalable systems like Hadoop and with the addition of non-relational databases, however, standard transactional SQL was no longer a good fit, so the systems grew apart. That mismatch potentially meant cumbersome, expensive and often elaborate workarounds to reconcile the widespread need for SQL compatibility with the low-cost advantage of Hadoop-based big data systems.
MapR Technologies is addressing these problems in a number of ways. MapR provides a broad level of support both through its own big data platform and through its contributions to the open source project, Apache Drill.
There are as many as 30 new products and open-source projects that attempt to address the need for SQL or SQL-like capabilities on Hadoop, including Apache Hive, Impala from Cloudera, open source Apache Drill, and Lingual, an open source SQL solution for MapReduce and Hadoop built on Cascading. MapR’s data platform supports these and more of the important big data SQL or SQL-like solutions.
What is Apache Drill and why is MapR investing so much in it?
Customers want a whole spectrum of capabilities, and Apache Drill is designed to make it easy to connect a wide range of analytic tools and data sources while introducing new technologies.
Many other SQL on Hadoop projects are re-implementing things that were invented in the small data world, trying to adapt them to big data needs. They are addressing a real need, but they are fundamentally rear-view mirror type projects. Apache Drill, in contrast, is a vehicle for introducing new technologies into this problem space. While inspired by Google’s Dremel project, Apache Drill reaches beyond that and is being designed with new capabilities.
Apache Drill provides interactive ad hoc query capabilities for accessing large data stores. Speed is a key feature of Drill, as it is designed to handle petabytes of data with low-latency responses.
The most important aspect of Drill is that it is not resolving problems of the past 5 to 10 years but instead is going forward to build a new technology that addresses current needs and anticipates those of the next 5 years.
Drill’s highly flexible architecture is designed to provide key technologies that include:
- Optional schemas
- Support for nested data formats (such as JSON, Protobuf, Parquet)
- Columnar in-memory storage and execution
- Full standard ANSI SQL:2003 query capability
- Advanced cost-based optimizer
- Highly extensible architecture to provide the widest benefit for multiple communities (for example, to extend capabilities to non-SQL Apache Pig or to build machine learning primitives that could be integrated into Drill to give Apache Mahout an advanced execution engine)
- YARN integration
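To make the first two points above concrete, a schema-optional engine like Drill can run SQL directly against a raw nested JSON file, with no table definition or ETL step first. The sketch below is illustrative only: the `dfs` file-system storage plugin is standard Drill, but the file path and field names are hypothetical.

```sql
-- Illustrative sketch: query a raw JSON file in place, no schema declared.
-- `dfs` is Drill's file-system storage plugin; the path and the nested
-- fields (session.browser, geo.city) are hypothetical examples.
SELECT t.geo.city      AS city,
       t.session.browser AS browser,
       COUNT(*)          AS clicks
FROM dfs.`/data/logs/clicks.json` t
GROUP BY t.geo.city, t.session.browser
ORDER BY clicks DESC
LIMIT 10;
```

The dotted paths (`t.geo.city`) reach into nested JSON objects, and the structure is discovered at read time rather than declared up front, which is what makes the schema-optional design practical for data whose layout evolves.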
The community-driven aspect of the open source Apache Drill project is important. In addition to the support provided by MapR, Apache Drill contributors come from a variety of locations and companies, including Pentaho, Oracle, and VMware among others. Drill developers have been working collaboratively to produce a large amount of code, preparing for the alpha release.
With these new technologies to connect traditional tools with modern Hadoop-based systems, we are moving into an exciting period of big data analytics and machine learning at scale.
This blog was originally published on The Hive.