Structured SQL or Mongo-like Flexibility? With Hadoop, You Can Have Both

The advancement in SQL development for Hadoop is making it possible for SQL professionals to apply their skills and SQL tools of choice to Big Data projects. Depending on their use case, SQL pros can choose from the Apache projects Hive and Drill, Apache-licensed Impala, and proprietary technologies such as Hadapt, HAWQ, and Splice Machine. Hive, the most mature of these technologies, is widely used and best suited for long-running queries. The other technologies are very new to the marketplace.

At a “Battle of the Titans” panel covering a number of these SQL technologies, the discussion centered on whether these SQL implementations were complete and whether they complied with ANSI-standard SQL. HAWQ backers have been beating the drum about ANSI compliance, and they were called out on stage for not being 100% ANSI-compliant. That is interesting, since HAWQ should be a relatively complete implementation: it was based on a 12+ year old MPP analytic database.

However, the “SQL completeness” argument shouldn’t be the key criterion for judging these old and new SQL technologies, because completeness gaps are largely a matter of time. Although it can be annoying when the platform rejects SQL you generate, those gaps should not be all that difficult to close given enough engineering time. Further, is SQL reporting on structured data all you really want or need? If so, wouldn’t you be better off with a relational database to begin with? The value Hadoop can provide a company needs to be much more than ANSI SQL.

SQL requires data structure. SQL understands tables of columns, tuples (unordered sets of known values with names), and attributes. The difficulty is that traditionally we structure the data centrally, as is the case with the Hive Metastore (or HCatalog), the metadata store for Hive that is also used by Drill and Impala. Centrally structuring data causes initial delays in gaining value from new data sources or in reacting to changes in structure: altering the schema and reloading the data for a referential set can take significant administration effort and delays gaining value from the data. Centralized database structures are well understood; technologies like the Oracle Data Dictionary Catalog can basically be described as tables describing tables. A centralized structure also aids in building high-quality optimizers and query builders. But even in the 1980s, companies could not keep up with the rate of new data source ingest and schema change while building and administering that centralized structure. In the 2010s, this problem has gone through the roof.
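To make the centralized approach concrete, here is a minimal HiveQL sketch; the table, columns, and path are hypothetical. Note that a new field arriving from the source requires a DDL change, and often a reload, before queries can see it:

    -- Centrally defining structure in the Hive Metastore
    -- (table, column, and path names are illustrative).
    CREATE EXTERNAL TABLE web_clicks (
      user_id    BIGINT,
      url        STRING,
      click_time TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/web_clicks';

    -- When the source starts emitting a new field, the schema must be
    -- altered centrally before any query can reference it.
    ALTER TABLE web_clicks ADD COLUMNS (referrer STRING);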

MongoDB is an example of a wildly popular technology that takes a different approach, one in which the data is self-describing. Applications that generate data publish the structure along with the data using standard interchange formats such as JSON, Parquet, and Apache Avro. A self-describing data structure eliminates the delay in gaining access to new data sources, or to data whose structure has changed.
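By contrast, here is a hedged sketch of querying self-describing data directly with Apache Drill; the file path and field names are illustrative, and dfs is Drill’s file-system storage plugin:

    -- No DDL needed: Drill discovers the structure from the data itself.
    -- Assume /data/clicks.json holds records such as:
    --   {"user_id": 42, "url": "/home", "referrer": "/search"}
    SELECT user_id, url, referrer
    FROM dfs.`/data/clicks.json`
    WHERE referrer IS NOT NULL;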

Apache Drill is taking a more aggressive, but more impactful, approach to SQL on Hadoop and to managing the metadata. Rather than creating yet another SQL-on-Hadoop (“YASH”, anyone? ☺) technology focused primarily on leveraging centralized structure, the Drill project is making a broader engineering investment to marry the agility of “MongoDB-like” decentralized structure with the existing investments in HCatalog/Hive Metastore.

Apache Drill can take advantage of both centralized and self-describing (i.e., decentralized) structures. A centralized structure defined in the Hive Metastore can be used by Drill as well as by Hive and Impala, so structuring investments made in the Hive Metastore are shared by all of these technologies. Drill, however, can uniquely execute queries and joins across data structured by both methods.
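As an illustration, a single Drill query can join a table defined in the Hive Metastore (reached through Drill’s hive storage plugin) with a raw JSON file; all names here are hypothetical:

    -- Joining centrally structured and self-describing data in one query:
    -- hive.users comes from the Hive Metastore, while the JSON file is
    -- read directly and describes itself.
    SELECT u.user_id, u.name, c.url
    FROM hive.users u
    JOIN dfs.`/data/clicks.json` c
      ON u.user_id = c.user_id;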

For long-running queries, Hive remains the right choice. As your interactive query needs expand, consider Apache Drill, which takes a unique approach that combines familiar SQL with innovative Big Data advancements. Thanks to these advancements in SQL development for Hadoop, you can take advantage of both traditional SQL and self-describing structures in your Big Data projects.

Centralized structure or self-describing structure—why not choose both?

