Debugging a real-life distributed application can be a pretty daunting task. Most common Google searches don't turn out to be very useful, at least at first. In this blog post, I will give a fairly detailed account of how we managed to accelerate an Apache Kafka/Spark Streaming/Apache Ignite application by almost 10x and turn a development prototype into a useful, stable streaming application that eventually exceeded the performance goals set for it.
In a typical Hive installation with metadata stored in MySQL, the database password is kept in clear text in a configuration file. This presents a few risks: unauthorized access could destroy or modify Hive metadata and disrupt workflows, and a malicious user could alter Hive permissions or damage metadata.
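For context, the exposure typically looks like this in hive-site.xml (these are the standard Hive metastore connection properties; the host name and credentials below are placeholders):

```xml
<!-- hive-site.xml: metastore connection settings (values are placeholders) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>cleartext-password-here</value>  <!-- readable by anyone with file access -->
</property>
```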
In the wide column data model of MapR-DB, each row is stored as a row key, column family, column qualifier, value, and timestamp. In the current version, the row key is the only field that is indexed, which fits the common pattern of queries based on the row key.
In this blog post, I would like to briefly introduce the new analytics capabilities added to Drill, namely ANSI SQL-compliant analytic and window functions, and how to get started with them.
Apache Drill allows users to explore any type of data using ANSI SQL. This is great, but Drill goes even further than that and allows you to create custom functions to extend the query engine. These custom functions have all the performance of any of the Drill primitive operations, but achieving that performance makes writing these functions a little trickier than you might expect.
This article describes the new Hive transaction feature introduced in Hive 1.0. This new feature adds initial support for the four traits of database transactions (atomicity, consistency, isolation, and durability) at the row level. With this new feature, you can add new rows in Hive while another application reads rows from the same partition without interference.
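Enabling these transactions typically requires a few server-side settings. As a rough sketch (property names taken from Hive's transactions configuration; values are illustrative and should be checked against your Hive version):

```xml
<!-- hive-site.xml: minimal settings typically needed for Hive transactions -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>  <!-- run the compactor on this metastore instance -->
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```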
In part one of this series, Drilling into Healthy Choices, we explored using Drill to create Parquet tables, as well as configuring Drill to read non-standard data formats. In part two, we are going to use the same database to think beyond traditional database design.
This is the third and final entry in our three-part series focused on building basic skill sets for use in data analysis. The series is aimed at those who have some familiarity with using SQL to query data but limited or no experience with Apache Drill.
Today, we are extremely excited and proud to announce the general availability (GA) of Apache Drill 1.0, as part of the MapR Distribution. Congratulations to the Drill community on this significant milestone and achievement!
This is the second in our three-part series focused on building basic skill sets for use in data analysis. The material is intended for those who have no prior, or very limited, experience with Apache Drill, but do have some familiarity with running SQL queries.
In this week's Whiteboard Walkthrough, Tomer Shiran, PMC member and Apache Drill committer, walks you through the history of the non-relational datastore and why Apache Drill is so important for this type of technology.
Today, the Apache Drill community announced the release of Drill 0.9, and MapR is very excited to package this release as part of the MapR Distribution including Hadoop.
Data across the enterprise are typically stored in silos belonging to different business divisions, or even to different projects within the same division. These silos may be further segmented by services, products, and functions. Silos, which stifle data sharing and innovation, are often identified as a primary impediment (both practical and cultural) to business progress, and they can be the cause of numerous difficulties.
Since its beta release in September '14, Apache Drill, the most flexible SQL-on-Hadoop technology, has been making great strides in terms of both product progress and community adoption. With four significant iterative releases (0.5, 0.6, 0.7, 0.8) in less than six months, thousands of downloads from the MapR website, nearly 1,500 message threads in the Apache Drill user email alias, and an active open source community, Drill is well on its way to becoming generally available in the Q2 '15 time frame.
Hive has been using ZooKeeper as its distributed lock manager to support concurrency in HiveServer2. The ZooKeeper-based lock manager works fine in a small-scale environment. However, as more and more users move from HiveServer to HiveServer2 and start to create a large number of concurrent sessions, problems can arise. The major problem is that the number of open connections between HiveServer2 and ZooKeeper keeps rising until ZooKeeper's server-side connection limit is hit. At that point, ZooKeeper starts rejecting new connections, and all ZooKeeper-dependent flows become unusable.
SQL will become one of the most prolific use cases in the Hadoop ecosystem, according to Forrester Research. Apache Drill is an open source SQL query engine for big data exploration. REST services and clients have emerged as popular technologies on the Internet. Apache HBase is a hugely popular Hadoop NoSQL database. In this blog post, I will discuss combining all of these technologies: SQL, Hadoop, Drill, REST with JSON, NoSQL, and HBase, by showing how to use the Drill REST API to query HBase and Hive. I will also share a simple jQuery client that uses the Drill REST API, with JSON as the data exchange, to provide a basic user interface.
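As a rough sketch of the REST side (the `/query.json` endpoint and port 8047 are Drill's defaults; the host name and the HBase table in the example are placeholders):

```shell
# Submit a SQL query to Drill's REST API and get the results back as JSON.
# DRILL_HOST is a placeholder -- point it at any drillbit in your cluster.
DRILL_HOST=${DRILL_HOST:-localhost:8047}

drill_payload() {
  # Build the JSON body the /query.json endpoint expects.
  printf '{"queryType":"SQL","query":"%s"}' "$1"
}

# Example call (requires a running drillbit, so it is shown commented out;
# the table name is illustrative):
# curl -s -X POST -H "Content-Type: application/json" \
#      -d "$(drill_payload 'SELECT * FROM hbase.customers LIMIT 10')" \
#      "http://${DRILL_HOST}/query.json"
```

The same POST, issued from jQuery with `$.ajax`, is all a browser client needs to drive Drill.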
Over the last few releases, the options for how you store data in Hive have advanced in many ways. In this post, let’s take a look at how to go about determining what Hive table storage format would be best for the data you are using. Starting with a basic table, we’ll look at creating duplicate tables for each of the storage format options, and then comparing queries and data compression. Just keep in mind that the goal of this post is to talk about ways of comparing table formats and compression options, not to define the fastest Hive setup for all things data. After all, the fun is in figuring out the Hive table storage format for your own Hive project, and not just reading about mine.
Nearly one year ago the Hadoop community began to embrace Apache Spark as a powerful batch processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this trend, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292 which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.
There are many great examples out there for using the Hive shell, as well as examples of ways to automate many of the animals in our Hadoop zoo. However, if you’re just getting started, or need something fast that won’t stay around long, then all you need to do is throw a few lines of code together with some existing programs in order to avoid re-inventing the workflow. In this blog post, I’ll share a few quick tips on using the Hive shell inside scripts. We’ll take a look at a simple script that needs to pull an item or count, and then look at two ways to use the Hive shell to get an answer.
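A minimal sketch of the idea, assuming the `hive` binary is on your PATH (the function name and table name are placeholders of my own):

```shell
# Minimal sketch: pull a count out of Hive from inside a script.
# HIVE_CMD defaults to the hive binary on PATH; the table name is a placeholder.
get_row_count() {
  # -S (silent) suppresses Hive's progress logging; -e runs a single query
  # string, so only the result lands on stdout for the script to capture.
  "${HIVE_CMD:-hive}" -S -e "SELECT COUNT(*) FROM $1;"
}

# Usage inside a script:
# count=$(get_row_count web_logs)
# echo "web_logs has ${count} rows"
```

The alternative covered in the post is piping a query file into the shell (`hive -S -f query.hql`), which works the same way for longer queries.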
The November release of the Apache open source packages in MapR was made available for customers earlier this month. We are excited to deliver some major upgrades to existing packages.
Here are the highlights:
Apache Drill is one of the fastest growing open source projects, with the community making rapid progress with monthly releases. The latest release of Drill 0.6 is another important milestone for the project and builds on the product with key enhancements, including the ability to do SQL queries directly on MongoDB (along with file system, HBase, and Hive sources that are already supported today), as well as a number of performance and SQL improvements.
The recent MapR webinar titled “The Future of Hadoop Analytics: Total Data Warehouses and Self-Service Data Exploration” proved to be a highly informative, in-depth look at the future of data warehouses and how SQL-on-Hadoop technologies will play a pivotal role in those settings. Matt Aslett, Research Director for 451 Research, along with Apache Drill architect Jacques Nadeau, discussed what lies ahead for enterprise data warehouse architects and BI users in 2015 and beyond.
While big data security analytics promises to deliver great insights in the battle against cyber threats, the concept and the tools are still maturing. In this blog, I’ll simplify the topic of adopting security in Hadoop by showing you how to encrypt traffic between Hue and Hive.
Since Apache Drill 0.4 was released in August for experimentation on the MapR Distribution, there has been tremendous interest in the customer and partner community on the promise and potential of Drill to unlock the new types of data in their Hadoop/NoSQL systems for interactive analysis throughout the organization. Today we're excited to announce Apache Drill 0.5.
At the Big Data Everywhere conference held in Israel, Atzmon Hen-Tov, Vice President of R&D of Pontis, and Lior Schachter, Director of Cloud Technology and Platform Group Manager of Pontis, gave an informative talk titled “Data on the Move: Transitioning from a Legacy Architecture to a Big Data Platform.” The five-phase, two-year migration of their operational and analytical functions to MapR resulted in a true, real-time operational analytics environment on Hadoop.
The MapR Distribution including Apache™ Hadoop® employs drivers from Simba Technologies to connect to client ODBC and JDBC applications, allowing you to access data on MapR from tools like Tableau with ODBC or SQuirreL with JDBC. This post will walk you through the steps to set up and connect your Apache Hive instance to both an ODBC and a JDBC application running on your laptop or other client machine. Although you may already have your own Hive cluster set up, this post focuses on the MapR Sandbox for Hadoop virtual machine (VM).
Getting back to basics, MapR CTO and co-Founder M.C. Srivas provides a brief introduction to Hadoop, and explains where it fits on the “dumb data” to “very smart data” spectrum. After watching this video, you’ll have a better understanding of Hadoop, and how MapR has taken the best innovations from both ends of the data spectrum to develop the leading Hadoop technology for big data deployments.
A few key points made in the video include:
The latest monthly release of the Apache open source packages in MapR is now available for customers. The release includes updates to several OSS packages including Hive, HBase, Oozie, Hue and Sqoop. Here are some of the highlights of the release:
With our recent announcement of HP Vertica’s deployment onto MapR, we have already been flooded with questions about the integration.
This was originally posted on The HIVE on May 12, 2014.
Recently I happened to observe martial arts agility training at my son’s Taekwondo school. The ability to move quickly, change direction and still be coordinated enough to throw an effective strike or kick is the key to many martial arts, including Taekwondo.
SQL-on-Hadoop just got easier this morning. Working together with the HP Vertica team, we are excited to announce general availability of the HP Vertica Analytics Platform running on the MapR Distribution for Apache Hadoop.
On the heels of the recent Spark stack inclusion announcement, here is some more fresh powder (For non-skiers, that’s fresh snow on a mountain).
MapR Distribution of Apache Hadoop: 4.0.0 Beta
In this blog I will show you how to set up authentication for HiveServer2 (HS2) using a pluggable authentication module (PAM). Once configured, all HS2 clients (JDBC and ODBC) will require a valid username and password to connect; a validation error is thrown if invalid credentials are passed. This authentication doesn’t apply to the Hive CLI (command-line interface), as it doesn’t go through HS2. Please remember that HS2 authentication only controls connections to Hive, not access to the underlying data.
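A hedged sketch of the relevant hive-site.xml settings (property names per HiveServer2's standard authentication configuration; the PAM service names below are examples and must match services actually defined on your server):

```xml
<!-- hive-site.xml: enable PAM authentication for HiveServer2 -->
<property>
  <name>hive.server2.authentication</name>
  <value>PAM</value>
</property>
<property>
  <name>hive.server2.authentication.pam.services</name>
  <!-- comma-separated PAM service names; these are examples -->
  <value>login,sshd</value>
</property>
```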
Today we are very excited to announce early access of the new HP Vertica Analytics Platform on MapR at the O’Reilly Strata Conference: Making Data Work. This solution tightly integrates HP Vertica’s high-performance analytic platform directly on the MapR Enterprise-Grade Distribution for Hadoop with no “connectors” required. We wanted to provide some additional details on this integration and why this is important for customers.
It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.
This blog explains how to achieve replication in Hive when the metastore is stored in MySQL. MySQL has built-in replication, which can be used in conjunction with remote mirroring to replicate Hive tables. Hive itself has no replication capability, but the same result can be achieved using mirror volumes in MapR.
ODBC has been the flagship API for SQL ever since it was first developed by Microsoft and Simba Technologies in 1992. An acronym for Open DataBase Connectivity, ODBC is the standard API used by popular applications like Excel, Crystal Reports, MicroStrategy and Tableau to connect to SQL databases.