Earlier this year, I published a series of posts on the deployment of Apache Drill to Azure. While the steps covered in those posts work, I’d like to speed up the process significantly. With the MapR Converged Data Platform available in the Azure Marketplace, I can have a Drill-enabled MapR cluster up and running much faster and with much less effort.
Apache Drill Blog Posts
Automatic replication of MapR-DB data to Elasticsearch is useful for many environments, and I want to share information about a specific customer deployment I worked on recently. Their use case is related to log security analytics and is centered around using Drill for running interactive queries on aggregated data.
In this week's Whiteboard Walkthrough video, Neeraja Rentachintala, Senior Director of Product Management at MapR Technologies, explains how Apache Drill's query optimization delivers interactive performance for low-latency SQL queries on very large data sets when working with familiar BI tools such as Tableau, MicroStrategy, or QlikView, and covers techniques for successful optimization with Drill in production. Neeraja describes Drill optimization capabilities based on Apache Calcite, including projection pruning, filter pushdown, partition pruning, cost-based optimization, and metadata caching.
One of the challenges when working with streams is the transitory nature of their data. Many applications require data to be persisted far beyond the point at which said data has any practical value to streaming analytics.
In this Whiteboard Walkthrough, Parth Chandra, Chair of the PMC for the Apache Drill project and member of the MapR engineering team, describes how the Apache Drill SQL query engine reads data in Parquet format and shares some best practices for getting maximum performance from Parquet.
A very common use case for the MapR Converged Data Platform is collecting and analyzing data from a variety of sources, including traditional relational databases. Until recently, data engineers would build an ETL pipeline that periodically walks the relational database and loads the data into files on the MapR cluster, then perform batch analytics on that data.
Apache Drill is an engine that can connect to many different data sources and provide a SQL interface to them. It's not just a wanna-be SQL interface that trips over anything complex - it's a hugely functional one, including support for many built-in functions as well as windowing functions. Whilst it can connect to standard data sources that you'd be able to query with SQL anyway, like Oracle or MySQL, it can also work with flat files such as CSV or JSON, as well as Avro and Parquet formats.
Today we are excited to announce the availability of Drill 1.8 on the MapR Converged Data Platform. As part of the Apache Drill community, we continue to deliver iterative releases of Drill, providing significant feature enhancements along with enterprise readiness improvements based on feedback from a variety of customer deployments.
Apache Drill enables querying with SQL against a multitude of data sources, including JSON files, Parquet and Avro, Hive tables, RDBMS, and more. MapR has released an ODBC driver for it, and I thought it'd be neat to get it to work with OBIEE. It evidently does work for OBIEE running on Windows, but I wanted to be able to use it on my standard environment, Linux.
In this week’s Whiteboard Walkthrough, Vinay Bhat, Solution Architect at MapR Technologies, takes you step-by-step through a widespread big data use case: data warehouse offload and building an interactive analytics application using Apache Spark and Apache Drill. Vinay explains how the MapR Converged Data Platform provides unique capabilities to make this process easy and efficient, including support for multi-tenancy.
In this week’s Whiteboard Walkthrough, Neeraja Rentachintala, Senior Director of Product Management at MapR Technologies, gives an overview of how open source Apache Drill achieves low latency for interactive SQL queries carried out on large datasets. With Drill, you can use familiar ANSI SQL BI tools, such as Tableau or MicroStrategy, plus do exploration directly on big data.
One question we hear from customers is how to determine the degree of parallelism being used for the various operators in a query. We’ll address this question, and the best practice that grew out of it, in the rest of this blog post.
In this blog post, I’ll describe how to install Apache Drill on the MapR Sandbox for Hadoop, resulting in a "super" sandbox environment that essentially provides the best of both worlds—a fully-functional, single-node MapR/Hadoop/Spark deployment with Apache Drill.
The power of SQL for business analytics is a given, but the challenge in big data settings is that SQL is normally a static language that assumes pre-defined, fixed and well-known schema. SQL also needs flat data structures. It has been assumed that you need fixed schema for performance.
Drill is a fantastic tool for querying JSON data. But Drill isn’t magical, and sometimes it runs into some data that it can’t quite handle (yet). This post walks through an example of such a scenario, and how you might work through the issue using a little bit of Python code.
A few months ago, I created the first XML plugin for Apache Drill. The idea behind the plugin is simple: Since Apache Drill already has great support for JSON, why not convert the XML documents to JSON, and feed the information into the JSON driver for further processing and presentation in Apache Drill?
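The XML-to-JSON idea can be illustrated outside of Drill with a few lines of Python. This is only a minimal sketch of the general conversion, not the plugin's actual code: the function names `element_to_dict` and `xml_to_json` and the `#text` key are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert one XML element into plain dicts, lists, and strings."""
    node = dict(elem.attrib)  # attributes become ordinary keys
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag is collected into a list of values.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text  # leaf element: keep only its text
    if text:
        node["#text"] = text  # hypothetical key for mixed content
    return node


def xml_to_json(xml_string):
    """Return a JSON string that a JSON reader could consume."""
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: element_to_dict(root)})
```

Once the document is in this shape, nested elements can be addressed like any other nested JSON field.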
In this article we will explore what it means to have a converged data platform for building and delivering business applications. Our sample application creates blog articles for a personal website.
Today we are very excited to announce the release of Apache Drill 1.6 on the MapR Converged Data Platform. Drill has been on the path of rapid iterative releases for one and a half years now, gathering amazing traction with customers and OSS community users on the way.
During the early days of developing Apache Drill, the Drill team realized the need for an efficient way to represent complex, columnar data in memory. Projects like Protobuf provided an efficient way to represent data that had a predefined schema for transmission over the network, and the Apache Parquet project had implemented an efficient way to represent complex columnar data on disk.
Today we are excited to announce that Apache Drill 1.4 is now available on the MapR Distribution. Drill 1.4 is a production-ready and supported version on MapR. It can be downloaded from here, and you can find the 1.4 release notes here.
Apache Drill has a hidden gem: an easy-to-use REST interface. This API can be used to query, profile, and configure the Drill engine.
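As a rough sketch of what a query through the REST interface looks like, the Python below posts a SQL statement to the `/query.json` endpoint. It assumes a Drillbit whose web server listens on `localhost:8047` without authentication; the helper names `build_query_request` and `run_query` are made up for this example.

```python
import json
import urllib.request


def build_query_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's /query.json."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url="http://localhost:8047"):
    """Send the query and return the decoded JSON response.

    Assumes a reachable Drillbit at base_url (an assumption of this sketch).
    """
    with urllib.request.urlopen(build_query_request(sql, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a running Drillbit, `run_query("SELECT * FROM sys.version")` should return the result set as a Python dictionary.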
SQL engines for Hadoop differ in their approach and functionality. My focus for this blog post is to compare and contrast the functions and performance of Apache Spark and Apache Drill and discuss their expected use cases.
In this blog post, I would like to briefly introduce the new analytics capabilities added to Drill, namely ANSI SQL-compliant analytic and window functions, and show how to get started with them.
It’s difficult to describe what a real breach looks like, but you will know it when you see it. To identify a potential breach, we assess the amount of activity of accounts later experiencing fraud at each merchant and then visualize the results.
A very common use case when working with Hadoop is to store and query simple files (such as CSV or TSV), and then to convert these files into a more efficient format such as Apache Parquet in order to achieve better performance and efficient storage.
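In Drill, this kind of conversion is usually expressed as a `CREATE TABLE ... AS SELECT` statement with the session's `store.format` option set to `parquet`. The Python helper below only assembles those statements as strings; the function name, the file path, and the column names in the usage line are hypothetical, and the sketch assumes a headerless CSV that Drill exposes through its `columns` array.

```python
def csv_to_parquet_statements(csv_path, target_table, columns):
    """Return the Drill SQL statements for a CSV-to-Parquet conversion.

    csv_path, target_table, and columns are illustrative inputs; every
    field read from the CSV file arrives as text, so real queries would
    normally CAST each column to a proper type.
    """
    select_list = ",\n       ".join(
        f"columns[{i}] AS `{name}`" for i, name in enumerate(columns)
    )
    return [
        # Tell Drill to write new tables in Parquet format.
        "ALTER SESSION SET `store.format` = 'parquet'",
        # Create the Parquet table from the CSV file.
        f"CREATE TABLE {target_table} AS\nSELECT {select_list}\nFROM dfs.`{csv_path}`",
    ]
```

For example, `csv_to_parquet_statements("/data/trips.csv", "dfs.tmp.`trips`", ["id", "city"])` yields two statements that can be run in order from any Drill client.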
Apache Drill allows users to explore any type of data using ANSI SQL. This is great, but Drill goes even further than that and allows you to create custom functions to extend the query engine. These custom functions have all the performance of any of the Drill primitive operations, but allowing that performance makes writing these functions a little trickier than you might expect.
I’m very pleased to announce the release of a custom EMR bootstrap action to deploy Apache Drill on a MapR cluster. MapR is the only commercial Hadoop distribution available for Amazon’s Elastic MapReduce service (EMR), and this addition allows EMR users to easily deploy and evaluate the powerful Drill query engine.
In part one of this series, Drilling into Healthy Choices, we explored using Drill to create Parquet tables as well as configuring Drill to read data formats that are not very standard. In part two of this series we are going to utilize this same database to think beyond traditional database design.
Drill is a SQL engine for (almost) everything, from simple tabular data to semi-structured data to even the most complex structured JSON data. In this two-part series we will explore what Apache Drill can do and how it enables us to rethink database design to make everyone's life easier.
Drill offers life-changing ways to simplify connecting to Hadoop-scale data in an application or script. OK, maybe not life-changing, but still pretty cool. In this post we will look at how to do it in your language of choice.
Did you know you can run Apache Drill on your laptop? This is great news for business analysts who need to explore complex and semi-structured data. Let's look at a particular example.
JReport is an embeddable BI solution that empowers users to create reports, dashboards, and data analysis. JReport accesses data from Hadoop, such as the MapR Distribution, through Apache Drill, as well as from other big data and transactional data sources. By visualizing data through Drill, users can perform their own reporting and data discovery for agile, on-the-fly decision-making.
This is the third and final entry in our three-part series focused on building basic skill sets for use in data analysis. The series is aimed at those who have some familiarity with using SQL to query data but limited or no experience with Apache Drill.
Today, we are extremely excited and proud to announce the general availability (GA) of Apache Drill 1.0, as part of the MapR Distribution. Congratulations to the Drill community on this significant milestone and achievement!
This is the second in our three-part series focused on building basic skill sets for use in data analysis. The material is intended for those who have no prior, or very limited, experience with Apache Drill, but do have some familiarity with running SQL queries.
In this week's Whiteboard Walkthrough, Tomer Shiran, PMC member and Apache Drill committer, walks you through the history of the non-relational datastore and why Apache Drill is so important for this type of technology.
In this post, I’ll show you how to build a simple real-time dashboard using Spark on MapR.
Today, the Apache Drill community announced the release of Drill 0.9, and MapR is very excited to package this release as part of the MapR Distribution including Hadoop.
Data across the enterprise are typically stored in silos belonging to different business divisions and even to different projects within the same division. These silos may be further segmented by services/products and functions. Silos (which stifle data-sharing and innovation) are often identified as a primary impediment (both practically and culturally) to business progress and thus they may be the cause of numerous difficulties.
Twitter, as we all know, is a powerful social media platform that can be used to harness incredibly useful information about products, brands and customer experience. This blog will explain how to: 1) quickly configure an environment to stream Twitter data (filtered on keywords and languages) using Apache Flume, 2) analyze the data in native JSON format with SQL using Apache Drill, and 3) run interactive reports and analysis using MicroStrategy.
Since its Beta release in September '14, Apache Drill, the most flexible SQL-on-Hadoop technology, has been making great strides in both product progress and community adoption. With four significant iterative releases (0.5, 0.6, 0.7, 0.8) in less than six months, thousands of downloads from the MapR website, nearly 1500 message threads in the Apache Drill user email alias, and an active open source community, Drill is well on its way to becoming generally available in the Q2 '15 time frame.
We recently wrapped up a webinar series for a global audience on the topic of “Apache Drill: Introduction, Differentiation and Use Cases” that proved to be highly interactive and engaging. The webinar provided a quick introduction to Drill, covered key Drill differentiators for SQL specialists and business analysts, and provided an overview of new Hadoop use cases that were uncovered during the Drill Beta at MapR.
The value of Apache Drill becomes apparent when integrated with powerful analytics and BI platforms. Today, MicroStrategy announced that Apache Drill is certified with the MicroStrategy Analytics Enterprise Platform™. MicroStrategy Analytics Enterprise connected to Apache Drill allows users to explore multiple data formats instantly on Hadoop enabling direct access to semi-structured data, without having to rely on IT teams for schema creation.
This is part two of the MapR - Apache Drill beta blog. Part one of the series, which you can read here, covers the different use cases we uncovered during the Drill Beta program at MapR. This blog delves into the Drill features that our beta customers felt were exciting and important for them, and also discusses some noteworthy features that the Drill community implemented based on some of our feedback. Features that our beta customers loved about Drill include how easy it is to get started, improved data pipelining processes, and seamless connectivity to existing BI tools.
Today’s data is dynamic and application-driven. A new era of business applications, driven by industry trends such as web, social, mobile, and IoT, is generating datasets with new data types and new data models. These applications are iterative, and the associated data models are typically semi-structured, schema-less, and constantly evolving: semi-structured in that an element can be complex or nested, schema-less in that fields can vary from row to row, and constantly evolving in that fields are added and removed frequently to meet business requirements. In other words, modern datasets are not only about volume and velocity, but also about variety and variability.
This is a two-part series that covers what we have learned so far in our ongoing Apache Drill beta program at MapR. Part one covers the use cases we are uncovering from our beta customer usage and interactions, and the second part will cover the new product features we have implemented thus far, based on customer feedback. The Apache Drill beta program was a great opportunity for our team to validate the power of Drill as the first schema-less SQL query engine that allows enterprise SQL users (BI developers, business analysts and others) to start harnessing Hadoop data without undergoing any learning curve.
SQL will become one of the most prolific use cases in the Hadoop ecosystem, according to Forrester Research. Apache Drill is an open source SQL query engine for big data exploration. REST services and clients have emerged as popular technologies on the Internet. Apache HBase is a hugely popular Hadoop NoSQL database. In this blog post, I will discuss combining all of these technologies: SQL, Hadoop, Drill, REST with JSON, NoSQL, and HBase, by showing how to use the Drill REST API to query HBase and Hive. I will also share a simple jQuery client that uses the Drill REST API, with JSON as the data exchange, to provide a basic user interface.
After being promoted to a top-level project earlier this month, Apache Drill has reached yet another milestone. Jacques Nadeau, Apache Drill PMC Chair, recently announced on the Drill blog that the community has released Drill 0.7. This release contains 228 resolved JIRAs and numerous enhancements, including more freedom - Drill will now work on EC2, since there is no more dependency on UDP/Multicast.
At the recent SAP TechED && d-code event, we were excited to see what SAP is doing in terms of their major initiatives and how SAP (and MapR) will be able to help organizations around the world achieve simplicity while embracing the new trends shaping our industry: cloud, mobility, big data, and the Internet of Things. Apache Hadoop is a key part of SAP’s overall big data strategy, and we believe we’re very much aligned, both in terms of technology and strategy, with SAP’s key initiatives. How about an example you can put to use right away? This new demo shows the integration of Apache Drill and SAP Lumira, a self-service, data visualization application for business users.
Apache Drill is one of the fastest growing open source projects, with the community making rapid progress with monthly releases. The latest release of Drill 0.6 is another important milestone for the project and builds on the product with key enhancements, including the ability to do SQL queries directly on MongoDB (along with file system, HBase, and Hive sources that are already supported today), as well as a number of performance and SQL improvements.
Customer feedback is a valuable tool for every business, and one of the primary ways to get quality feedback is through surveys. However, asking customers to fill out lengthy surveys with 15+ questions will often result in a very low response rate. Most customers are not willing to take a long survey, and the ones who do often regret it after the first couple of questions.
The recent MapR webinar titled “The Future of Hadoop Analytics: Total Data Warehouses and Self-Service Data Exploration” proved to be a highly informative, in-depth look at the future of data warehouses and how SQL-on-Hadoop technologies will play a pivotal role in those settings. Matt Aslett, Research Director for 451 Research, along with Apache Drill architect Jacques Nadeau, discussed what lies ahead for enterprise data warehouse architects and BI users in 2015 and beyond.
Since Apache Drill 0.4 was released in August for experimentation on the MapR Distribution, there has been tremendous interest in the customer and partner community on the promise and potential of Drill to unlock the new types of data in their Hadoop/NoSQL systems for interactive analysis throughout the organization. Today we're excited to announce Apache Drill 0.5.
The September release of the Apache open source packages in MapR is now available for customers. The September updates to the Apache Open Source packages in the MapR Distribution are part of the MapR 4.0.1 major release. Details about the MapR 4.0.1 release can be found here.
Here are the top highlights of this month’s release:
At the Big Data Everywhere conference held in Israel, Atzmon Hen-Tov, Vice President of R&D of Pontis, and Lior Schachter, Director of Cloud Technology and Platform Group Manager of Pontis, gave an informative talk titled “Data on the Move: Transitioning from a Legacy Architecture to a Big Data Platform.” The five-phase, two-year migration of their operational and analytical functions to MapR resulted in a true, real-time operational analytics environment on Hadoop.
Getting back to basics, MapR CTO and co-Founder M.C. Srivas provides a brief introduction to Hadoop, and explains where it fits on the “dumb data” to “very smart data” spectrum. After watching this video, you’ll have a better understanding of Hadoop, and how MapR has taken the best innovations from both ends of the data spectrum to develop the leading Hadoop technology for big data deployments.
A few key points made in the video include:
Congratulations to the Apache Drill community on reaching a big milestone. Apache Drill 0.4.0—a developer preview—has just been released. This is the first in a series of monthly builds the project team will deliver as it drives towards Beta and GA milestones.
Let’s take a brief look at why Apache Drill matters and its key features.
The latest monthly release of the Apache open source packages in MapR is now available for customers. The release includes updates to several OSS packages including Hive, HBase, Oozie, Hue and Sqoop. Here are some of the highlights of the release:
With our recent announcement of HP Vertica’s deployment onto MapR, we have already been flooded with questions about the integration.
This was originally posted on The HIVE on May 12, 2014.
Recently I happened to observe martial arts agility training at my son’s Taekwondo school. The ability to move quickly, change direction and still be coordinated enough to throw an effective strike or kick is the key to many martial arts, including Taekwondo.
SQL-on-Hadoop just got easier this morning. Working together with the HP Vertica team, we are excited to announce general availability of the HP Vertica Analytics Platform running on the MapR Distribution for Apache Hadoop.
MapR recently hosted the first Apache Drill hackathon, with nearly forty people in attendance who helped push Drill toward its first beta release. It was great to see people from companies such as Visa, Cisco, LinkedIn and Hortonworks come together to harden and enhance the Apache Drill project.
The hackathon participants worked on many different aspects of Apache Drill. Over the next few weeks, these features will be incorporated into mainline. Here’s a preview of what we worked on, coming soon to a master near you:
It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.
Today we are very excited to announce early access of the new HP Vertica Analytics Platform on MapR at the O’Reilly Strata Conference: Making Data Work. This solution tightly integrates HP Vertica’s high-performance analytic platform directly on the MapR Enterprise-Grade Distribution for Hadoop with no “connectors” required. We wanted to provide some additional details on this integration and why this is important for customers.
Drill will be used by analysts and developers who are doing interactive analysis of large-scale datasets. It is intended for fast, ad hoc queries across multiple data sources and formats. Previously that required writing Java programs, which is neither ad hoc nor fast. Drill plans to change that.