Apache Drill 1.6 and the MapR Converged Data Platform - The Emergence of a New Generation Stack for “JSON-Enabled” Big Data Analytics

Today we are very excited to announce the release of Apache Drill 1.6 on the MapR Converged Data Platform. Drill has been on the path of rapid iterative releases for one and a half years now, gathering amazing traction with customers and OSS community users on the way. The community has delivered 13 releases since the beta launch in September 2014, and they will continue to move forward at a fast pace.

Here are the highlights of the Drill 1.6 release.

  • New storage plugin for MapR-DB (document database)
  • Enhanced stability and scale with a new and improved memory allocator
  • Enhanced query  performance
    • Query planning speedups via early application of partition pruning
    • Faster query planning on Hive table queries
    • Optimized reading of Parquet metadata cache
    • Row count-based pruning to speed up Limit N queries
    • Optimized Tableau experience with limit 0 performance improvements
  • End-to-end security from BI tool to MapR/Drill with client impersonation (this is supported only with JDBC;  a new ODBC driver with this feature will  follow shortly).
  • New SQL Window function frame syntax
    • RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    • RANGE BETWEEN CURRENT ROW AND CURRENT ROW
  • JDK 1.8 support
  • Many more bug fixes and enhancements

As you can see from the above list, the Drill 1.6 release is another critical stepping stone in driving ANSI SQL support and performance/scale to the next phase.

At MapR, we are also very excited about this release due to the unique value it brings to you by combining Drill with the recently announced MapR-DB document database capabilities as part of the MapR 5.1 release.  A new MapR-DB document database format plugin is introduced in Drill 1.6, which enables you to query JSON tables in MapR-DB directly. This means that no ETL and no transformation is required at any layer. This combination gives you end-to-end flexibility when it comes to JSON in order to store, update, and query the data in its natural form and fidelity at Hadoop scale in global environments using familiar ANSI SQL capabilities. The result is that you now have operational analytics capabilities, and most importantly, are able to quickly adapt to changes to data models in the underlying applications. This agility means you’ll realize much faster time to value.

Drill is designed with JSON and schemaless/semi-structured data at its core, and it already is able to query and manipulate raw JSON files in MapR-FS. This is now extended to MapR-DB for operational and fast-changing data. We will be extending the power of Drill very soon to MapR Streams, a global publish-subscribe messaging framework, so SQL can be used to query real-time streaming events. Essentially, Drill becomes the unified, high- performance, and flexible SQL access layer across files, tables and streams in the MapR Converged Data Platform.

You can try out the Drill+MapR-DB JSON capabilities for yourself by downloading the MapR Converged Data Platform community edition from https://www.mapr.com/products/hadoop-download

Let me demonstrate a few Drill queries on MapR-DB JSON tables. For these examples, I am using a Yelp dataset.

The first step is to ingest the JSON data into MapR-DB, which you can do using the OJAI API or using an out-of-box importJSON utility that comes with MapR.

  • Using OJAI API

    • Bulk load from file

Table table = MapRDB.getTable(“/maprdb/json/yelp");
RecordStream jsonStream = Json.newDocumentStream(fs, "/business.json");
table.insertOrReplace(jsonStream);
  • Single record inserts from apps
Table table = MapRDB.createTable(“/my_app/user_profiles”);
Document record = MapRDB.newDocument()
                 .set("firstName", "John")
                 .set("lastName", "Doe")
                 .set("age", 50);
table.insert("jdoe", record);
  • Through ImportJSON utility
mapr importJSON -src /maprdb/json_files/yelp/business -dst /maprdb/json/yelp/business  -mapreduce false -bulkload false -idfield business_id

From the MapR administration console, you can see the MapR-DB JSON tables. MapR-DB now has multiple data models:  binary-like HBase, and JSON-like MongoDB, and you can configure the “type” on a per table level. 

Let’s quickly ensure that the data is indeed loaded into DB correctly using the MapR-DB shell. This is not a prerequisite to query with Drill, but I am using it to simply demonstrate different ways of interacting with MapR-DB.

maprdb root:> find '/maprdb/json/yelp/business' --limit 1
{"_id":"--1emggGHgoG6ipd_RMb-g","business_id":"--1emggGHgoG6ipd_RMb-g","full_address":"3280 S Decatur Blvd\nWestside\nLas Vegas, NV 89102","hours":{},"open":true,"categories":["Food","Convenience Stores"],"city":"Las Vegas","review_count":4,"name":"Sinclair","neighborhoods":["Westside"],"longitude":-115.2072382,"state":"NV","stars":4,"latitude":36.1305306,"attributes":{"Parking":{"garage":false,"street":false,"validated":false,"lot":true,"valet":false},"Accepts Credit Cards":true,"Price Range":1},"type":"business"}

1 document(s) found.

maprdb root:> findbyid /maprdb/json/yelp/business --id "--1emggGHgoG6ipd_RMb-g"
{"_id":"--1emggGHgoG6ipd_RMb-g","business_id":"--1emggGHgoG6ipd_RMb-g","full_address":"3280 S Decatur Blvd\nWestside\nLas Vegas, NV 89102","hours":{},"open":true,"categories":["Food","Convenience Stores"],"city":"Las Vegas","review_count":4,"name":"Sinclair","neighborhoods":["Westside"],"longitude":-115.2072382,"state":"NV","stars":4,"latitude":36.1305306,"attributes":{"Parking":{"garage":false,"street":false,"validated":false,"lot":true,"valet":false},"Accepts Credit Cards":true,"Price Range":1},"type":"business"}

Now the data looks like it’s loaded in the DB, so let’s query this through Drill.

The first step to query DB data from Drill is to configure a storage plugin from the Drill web UI.

Below is the sample storage plugin configuration. Note that the JSON tables in MapR-DB are just another storage format on the MapR File System, so you simply create a format plugin in Drill as shown below.

"yelp_maprdb": {
     "location": "/maprdb/json/yelp",
     "writable": true,
     "defaultInputFormat": "maprdb"
   },
.

Let’s check the structure of the business table using a simple Drill query.

0: jdbc:drill:drillbit=10.10.103.32> SELECT * FROM mfs.yelp_maprdb.business LIMIT 1;
+-----+------------+-------------+------------+------+--------------+-------+----------+-----------+------+---------------+------+--------------+-------+-------+------+
| _id | attributes | business_id | categories | city | full_address | hours | latitude | longitude | name | neighborhoods | open | review_count | stars | state | type |
+-----+------------+-------------+------------+------+--------------+-------+----------+-----------+------+---------------+------+--------------+-------+-------+------+
| --1emggGHgoG6ipd_RMb-g | {"Accepts Credit Cards":true,"Parking":{"garage":false,"lot":true,"street":false,"valet":false,"validated":false},"Price Range":1.0,"Ambience":{},"Good For":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} | --1emggGHgoG6ipd_RMb-g | ["Food","Convenience Stores"] | Las Vegas | 3280 S Decatur Blvd Westside
Las Vegas, NV 89102 | {"Friday":{},"Monday":{},"Saturday":{},"Sunday":{},"Thursday":{},"Tuesday":{},"Wednesday":{}} | 36.1305306 | -115.2072382 | Sinclair | ["Westside"] | true | 4.0 | 4.0 | NV | business |
+-----+------------+-------------+------------+------+--------------+-------+----------+-----------+------+---------------+------+--------------+-------+-------+------+

Let’s see how many records there are in the business table.

0: jdbc:drill:drillbit=10.10.103.32> SELECT count(*) FROM mfs.yelp_maprdb.business;
+---------+
| EXPR$0  |
+---------+
| 42153   |
+---------+

Drill allows you to use its full ANSI SQL power directly on MapR-DB JSON tables, including all operators and functions.

Here is an example of using a SQL window function on the JSON tables.

0: jdbc:drill:drillbit=10.10.103.32> WITH X
. . . . . . . . . . . . . . . . . .> AS
. . . . . . . . . . . . . . . . . .> (SELECT name, city, review_count,
. . . . . . . . . . . . . . . . . .> RANK() OVER (PARTITION BY city ORDER BY review_count DESC)
                                    AS review_rank
. . . . . . . . . . . . . . . . . .> FROM mfs.yelp_maprdb.business)
. . . . . . . . . . . . . . . . . .> SELECT X.name, X.city, X.review_count
. . . . . . . . . . . . . . . . . .> FROM X
. . . . . . . . . . . . . . . . . .> WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10;
+-------------------------------------------+-------------+---------------+
|                   name                                     |    city     | review_count  |
+-------------------------------------------+-------------+---------------+
| Mon Ami Gabi                                           | Las Vegas    | 4084.0        |
| Studio B                                                     | Henderson   | 1336.0        | 
| Phoenix Sky Harbor International Airport  | Phoenix        | 1325.0        |
| Four Peaks Brewing Co                             | Tempe          | 1110.0        |
| The Mission                                                | Scottsdale    | 783.0         |
| Joe's Farm Grill                                          | Gilbert           | 770.0         |
| The Old Fashioned                                     | Madison        | 619.0         |
| Cornish Pasty Company                             | Mesa            | 578.0         |
| SanTan Brewing Company                         | Chandler       | 469.0         |
| Yard House                                                 | Glendale       | 321.0         |
+-------------------------------------------+-------------+---------------+

You can query nested JSON data directly from DB without flattening it, and you can use Drill’s power of complex data capabilities and all ANSI SQL extensions.

0: jdbc:drill:drillbit=10.10.103.32> SELECT name, stars, b.hours.Friday friday, categories
. . . . . . . . . . . . . . . . . .>   FROM mfs.yelp_maprdb.business b
. . . . . . . . . . . . . . . . . .>   WHERE b.hours.Friday.`open` < '22:00' AND
. . . . . . . . . . . . . . . . . .>                  b.hours.Friday.`close` > '22:00' AND
. . . . . . . . . . . . . . . . . .>                  city = 'Las Vegas'
. . . . . . . . . . . . . . . . . .>   ORDER BY stars DESC 
. . . . . . . . . . . . . . . . . .>   limit 5;
+------------------------------------+--------+-----------------------------------+------------------------------------------------------------------------------------------+
|                name                | stars  |              friday               |                                        categories                                        |
+------------------------------------+--------+-----------------------------------+------------------------------------------------------------------------------------------+
| Peachy Keen Unions By Angie Kelly  | 5.0    | {"close":"23:30","open":"06:00"}  | ["Officiants","Wedding Planning","Event Planning & Services"]                            |
| Concierge Makeup                   | 5.0    | {"close":"23:30","open":"00:30"}  | ["Makeup Artists","Beauty & Spas"]                                                       |
| Hermès Encore                      | 5.0    | {"close":"23:00","open":"10:00"}  | ["Shopping"]                                                                             |
| Hague Quality Water                | 5.0    | {"close":"23:00","open":"06:00"}  | ["Plumbing","Home Services","Contractors"]                                               |
| The Sci-Fi Center                  | 5.0    | {"close":"23:00","open":"11:00"}  | ["Comic Books","Arts & Entertainment","Cinema","Shopping","Books, Mags, Music & Video"]  |
+------------------------------------+--------+-----------------------------------+------------------------------------------------------------------------------------------+

A few of my earlier blogs covered how Drill works with JSON, and all these queries can be now tried on MapR-DB directly. Drill does a variety of optimizations (such as projection pushdown, filter pushdown)  to make sure the queries are faster. This is a significant performance benefit of storing data in MapR-DB compared to files.

Here is a quick example on how pushdown works. Let’s take the same nested data query that has a number of filter predicates.

SELECT name, stars, b.hours.Friday friday, categories FROM mfs.yelp_maprdb.business b
WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND city = 'Las Vegas' ORDER BY stars DESC LIMIT  5

To see how Drill optimizes the query, you can look into the EXPLAIN PLAN.  As highlighted below, Drill does projection pushdown and retries only name, starts, hours, and categories columns from DB rather than all columns. It also does filter pushdown by passing all the filters to the underlying DB, rather than bringing it to Drill and processing it.

There are many additional exciting features in Drill 1.6. Download the MapR release and try it out!

How to get started with Drill:

For full documentation, please refer to http://drill.apache.org/docs. Additional resources can be found at http://mapr.com/apachedrill

If you have any additional questions about Drill 1.6, please ask them in the comments section below.

no

CTA_Inside

Streaming Data Architecture: New Designs Using Apache Kafka and MapR Streams
Life happens as a continuous flow of events (a stream). Ted Dunning and Ellen Friedman describe new designs for streaming data architecture that help you get real-time insights and greatly improve the efficiency of your organization.

Streaming Data Architecture:

New Designs Using Apache Kafka and MapR Streams

 

 

 

Download for free