This is the third post in a four-part series based on the Radiant Advisors white paper “Driving the Next Generation Data Architecture with Hadoop Adoption,” which examines the emergence of Hadoop as an operational data platform. (Here are part 1 and part 2 of the series.)
For the past few decades, SQL has been the standard for working with and managing data. It dominates the enterprise and is used for everything from operational workloads and reporting to analytics. This standard will continue on Hadoop. Major initiatives have moved Hadoop from its batch-oriented roots toward interactive capabilities, with improved performance coming from both faster SQL engines and distributed in-memory engines.
However, most applications rely on vendor-specific RDBMS variations and on inconsistent adoption of the ANSI standard. In the past, Hadoop offered original Hive users only a basic subset of the ANSI SQL 1992 or 1999 standard. To close this gap, development roadmaps for SQL-on-Hadoop engines are increasing ANSI SQL completeness, with a priority on the analytic operations from later ANSI standards that can maximize the benefits of SQL-on-Hadoop for both analysts and applications. For the growing amounts of structured data inside of Hadoop, SQL-on-Hadoop will need to: 1) run on all of the data in the cluster without limiting the number of nodes it can use, and 2) aim for SQL version consistency.
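To make "later ANSI standard analytic operations" concrete, here is a minimal sketch of a window function, a feature standardized in SQL:2003. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL-on-Hadoop engine (it requires SQLite 3.25 or newer). The sales table, its columns, and the sample rows are hypothetical and not taken from the white paper.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
SALES = [
    ("east", "ann", 300),
    ("east", "bob", 500),
    ("west", "cat", 200),
    ("west", "dan", 400),
    ("west", "eve", 100),
]


def rank_sales_by_region(rows):
    """Rank each rep within their region using a standard window function."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        # RANK() OVER (...) is the kind of analytic operation that early
        # SQL subsets lacked; it needs no self-join or subquery.
        query = """
            SELECT region, rep, amount,
                   RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
            FROM sales
            ORDER BY region, rnk
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


ranked = rank_sales_by_region(SALES)
for row in ranked:
    print(row)
```

An engine that supports only an older SQL subset would typically need a self-join or a correlated subquery to produce the same ranking, which is one reason tools that generate SQL automatically depend on this level of completeness.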
To avoid the vendor-specific file formats of relational data engines such as DB2, Oracle, and SQL Server, a standard file format is desired to serve both the SQL-on-Hadoop engines and the other Hadoop compute engines. Text files take the most disk space, while columnar formats have improved both compression and analytic performance. One feature of the optimized row-column (ORC) format that has gained acceptance is the ability to balance rows for transactions and columns for reporting. Also, since its inception nearly two years ago, Apache Parquet has emerged as a way to enable efficient reusability across multiple Hadoop engines on the platform. This signifies a major shift, as companies are increasingly adopting this latest file format standard.
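One reason columnar layouts help analytic queries is that an aggregate over a single field only has to read that field's bytes. The sketch below illustrates the idea with a toy binary layout built with Python's struct module. It is not the ORC or Parquet format, and the record fields (order_id, amount, country) and their values are made up for illustration.

```python
import struct

# Hypothetical records: (order_id, amount, country_code). Values are made up.
RECORDS = [(i, float(i % 7), "US" if i % 2 == 0 else "DE") for i in range(1000)]

# Little-endian, unpadded: int64, float64, 2-byte string = 18 bytes per row.
ROW_FORMAT = "<qd2s"


def to_row_layout(records):
    """Store whole records back to back, as a row-oriented file would."""
    return b"".join(
        struct.pack(ROW_FORMAT, oid, amt, cc.encode("ascii"))
        for oid, amt, cc in records
    )


def to_column_layout(records):
    """Store each field contiguously, as a columnar file would."""
    n = len(records)
    return {
        "order_id": struct.pack(f"<{n}q", *(r[0] for r in records)),
        "amount": struct.pack(f"<{n}d", *(r[1] for r in records)),
        "country": b"".join(r[2].encode("ascii") for r in records),
    }


def sum_amount_from_rows(blob):
    """Sum the amount field; every byte of every row must be scanned."""
    total = sum(amount for _, amount, _ in struct.iter_unpack(ROW_FORMAT, blob))
    return total, len(blob)


def sum_amount_from_columns(columns):
    """Sum the amount field; only the amount column is scanned."""
    blob = columns["amount"]
    total = sum(value for (value,) in struct.iter_unpack("<d", blob))
    return total, len(blob)


row_total, row_bytes = sum_amount_from_rows(to_row_layout(RECORDS))
col_total, col_bytes = sum_amount_from_columns(to_column_layout(RECORDS))
print(f"row layout:    total={row_total}, bytes scanned={row_bytes}")
print(f"column layout: total={col_total}, bytes scanned={col_bytes}")
```

Both layouts return the same total, but the columnar version scans 8 bytes per record instead of 18. Real columnar formats add encoding and compression on top of this, since values within a single column tend to resemble one another.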
These days, organizations can choose from a wide range of SQL-on-Hadoop engines. Within the big data landscape, there are several different approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.
What are your thoughts about SQL-on-Hadoop? Share your thoughts in the comment section below.