Industry's First Schema-free SQL Engine - Apache Drill 1.0 is Now Generally Available

Today, we are extremely excited and proud to announce the general availability (GA) of Apache Drill 1.0, as part of the MapR Distribution. Congratulations to the Drill community on this significant milestone and achievement!

Incubated as an Apache project in September 2012, Drill started with an ambitious goal: to provide a low-latency SQL engine for the modern big data era by combining the familiarity of relational databases with the scale and agility of Hadoop/NoSQL systems. After nearly two and a half years of solid engineering effort, and backed by a strong community of ~50 contributors and thousands of users and customers across various industries, Apache Drill has evolved quickly since the Beta release in September 2014 to become the most flexible SQL query engine in the big data ecosystem. Built from the ground up to support interactive queries on a variety of complex/multi-structured datasets at TB/PB scale, Drill opens up new frontiers for the innovation required to make Hadoop and big data accessible to a broader set of users in a self-service fashion, leveraging the ANSI SQL skill sets and tools already abundant in an organization’s BI/analytics infrastructure.

A look back at the journey

Here is a quick snapshot of the momentum of the Apache Drill project, along with some notable milestones along the way. The project has been on the fast track in the nine months since the developer preview in August 2014, delivering seven significant iterative releases. Each release added new features and, most importantly, improved the stability, scale, and performance required for broader enterprise deployments. Overall, 2200+ JIRAs have been resolved in this effort. Thanks to this progress, numerous customers have experienced the value of Drill and have rolled it into production. Drill also graduated to an Apache Top-Level Project during the journey, and is recognized as the top-rated SQL-on-Hadoop technology by industry analysts.

Now what is Drill all about?

Drill is all about providing flexibility without compromising performance. There are two core aspects critical to achieving this:

  • First, Drill is a schema-free query engine. Put simply, Drill is like an x-ray into data. It lets you discover the structure of data from a wide variety of sources and data types on the fly, and lets you analyze it without time-consuming, expensive schema management and flattening of data.
  • Second, Drill is a scale-out, columnar execution engine designed for low-latency queries on petabytes of data. Depending on the number of users an organization needs to support, the amount of data to query, and the SLAs to meet, Drill can scale from a single node to thousands of nodes.
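To make the schema-free idea in the first point more concrete, here is a minimal Python sketch of discovering structure from raw JSON records as they are read, with no schema declared up front. It is only a conceptual illustration and not Drill's implementation; the function names `discover_schema` and `_walk` and the sample records are assumptions made for this example.

```python
import json

def _walk(value, path, schema):
    """Record the type of every leaf value under its dotted field path."""
    if isinstance(value, dict):
        for key, child in value.items():
            _walk(child, f"{path}.{key}" if path else key, schema)
    elif isinstance(value, list):
        # Array elements share one path, marked with "[]"
        for child in value:
            _walk(child, path + "[]", schema)
    else:
        schema.setdefault(path, set()).add(type(value).__name__)

def discover_schema(records):
    """Hypothetical helper: map each field path to the set of type names seen."""
    schema = {}
    for record in records:
        _walk(record, "", schema)
    return schema

# Hypothetical sample data: two records with different shapes
lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "tags": ["x", "y"], "geo": {"lat": 1.5}}',
]
schema = discover_schema(json.loads(line) for line in lines)
for field_path in sorted(schema):
    print(field_path, sorted(schema[field_path]))
```

The records do not share a common set of fields, and the nested `geo` map and `tags` array are handled without any flattening step beforehand, which is the property the first point describes.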

    With Drill, users can now get to their data in minutes rather than endure weeks or months of data preparation/ETL cycles, and they can open up new, complex/multi-structured data that they couldn’t get to before, all by leveraging the ANSI SQL skill sets and BI tools already available in the organization.

    Here’s a quote from an industry analyst that we believe precisely summarizes the value of Drill:

    “Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”
    - Andrew Brust, Gigaom Research, January 2015

    Features at a glance

    Here is a brief list of Apache Drill features available in the 1.0 release:

    1. Schema discovery on-the-fly: Support for direct queries on data without schema definitions in a Hive metastore or any other central repository.
    2. Wide data source access: Support for querying and combining data from a variety of file formats such as text, JSON, and Parquet, as well as from HBase and Hive tables. Extensible to any non-Hadoop data sources via a storage plugin API.
    3. Complex data: Support for nested datatypes such as Maps and Arrays. Built-in SQL extensions to query and operate on nested data types (such as Flatten, KVGen, Repeated_Count, Convert_To/Convert_From functions).
    4. SQL support: ANSI SQL 2003 syntax rather than HiveQL. All core SQL query functionality is available, including Joins, Filters, Aggregates, Sort, Union (All), Having, With, Distinct, Explain plans, Create or Replace Table/View As, and SQL data types, along with in-memory and on-disk joins and aggregates.
    5. User-defined functions: Powerful Java API to build simple and complex custom UDF/UDAFs; ability to reuse Hive UDFs as part of Drill queries.
    6. BI tool integration: JDBC/ODBC drivers to integrate with tools such as Tableau, MicroStrategy, QlikView, and TIBCO Spotfire, as well as other BI/analytics tools.
    7. Decentralized and granular security: PAM-based authentication model, row-level and column-level security controls using file system-based Drill views, user impersonation, and ownership chaining.
    8. Performance and optimizations: Distributed scale-out query processing, shredded columnar execution engine, parallel optimizer, statistics & CBO, projection and partition pruning, filter pushdown.
    9. Query profiles and tuning: Visual interface to view query profiles and diagnostics, logical and physical plans, sessions options to influence query plans, query auditing.
    10. Drill explorer: Easy-to-use visual interface as part of Drill ODBC driver. Allows browsing the data available via Drill, seeing the structure, and creating logical views.
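As a rough illustration of the nested-data functions named in item 3, the following Python sketch mimics what Flatten and KVGen do conceptually: Flatten turns one row holding an array into one row per array element, and KVGen turns a map into a list of key/value pairs. This is a sketch under stated assumptions, not Drill code; the Python functions `flatten` and `kvgen` and the sample rows are hypothetical stand-ins, and in Drill these operations are written in SQL.

```python
def flatten(rows, field):
    """Hypothetical analogue of Flatten: one output row per array element,
    with the remaining columns copied onto each output row."""
    for row in rows:
        for element in row.get(field, []):
            yield {**row, field: element}

def kvgen(mapping):
    """Hypothetical analogue of KVGen: a map becomes a list of key/value pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]

# Hypothetical sample data
rows = [
    {"name": "alice", "orders": [10, 20]},
    {"name": "bob", "orders": [30]},
]
flat = list(flatten(rows, "orders"))
pairs = kvgen({"clicks": 4, "views": 9})

for row in flat:
    print(row)
print(pairs)
```

Combining the two is useful when map keys are themselves data (for example, dates or IDs used as field names): generating key/value pairs first and then flattening them yields ordinary rows that standard SQL filters and aggregates can work with.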

    For improvements specific to the Drill 1.0 release, refer to the Apache release blog post here.

    Use cases

    Drill expands the spectrum of BI use cases by providing the ability to get value from all of the raw datasets available in an organization, wherever they are. The ability to explore and ask ad hoc questions of full-fidelity data, in its native format as it comes in, is what sets Drill apart from traditional SQL technologies, which solve only part of the puzzle by working only with centrally structured data. The BI/analytics use cases that Drill enables include self-service raw data exploration and complex IoT/JSON data analytics, as well as ad hoc queries on Hadoop-powered enterprise data hubs.

    The road ahead

    The 1.0 GA release is just the beginning of the next phase of the journey. With the solid foundation laid by this release, the Drill community is planning to add new features in a variety of areas such as JSON, complex data functions, new file formats, and SQL. The project will also continue its momentum of iterative releases every 4-6 weeks. For a detailed roadmap that shows what’s coming in 2015, please refer to this blog post.

    Resources

    Getting started with Drill is extremely easy, and there are numerous resources available to help. Here are some useful links:

    Congratulations again to the Drill community on this accomplishment. We’re looking forward to continued game-changing innovation that is shaping the future of scalable big data access in enterprises.
