In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, describes a valuable capability of Apache Flink stream processing: grouping events together that were observed to occur within a configurable window of time, the event time.
Open Source Software Blog Posts
Today we are excited to announce the availability of Drill 1.8 on the MapR Converged Data Platform. As part of the Apache Drill community, we continue to deliver iterative releases of Drill, providing significant feature enhancements along with enterprise readiness improvements based on feedback from a variety of customer deployments.
Elasticsearch and Kibana are widely used in the market today for data analytics; however, security is one aspect that was not initially built in to the product. Since data is the lifeline of any organization today, it becomes essential that Elasticsearch and Kibana be “secured.” In this blog post, we will be looking at one of the ways in which authentication, authorization, and encryption can be implemented for them.
In the big data enterprise ecosystem, there are always new choices when it comes to analytics and data science. Apache incubates so many projects that people are always confused as to how to go about choosing an appropriate ecosystem project. In the data science pipeline, ad-hoc query is an important aspect, which gives users the ability to run different queries that will lead to exploratory statistics that will help them understand their data.
Sooner or later, if you eyeball enough data sets, you will encounter some that look like a graph, or are best represented a graph. Whether it's social media, computer networks, or interactions between machines, graph representations are often a straightforward choice for representing relationships among one or more entities.
As a data analyst that primarily used Apache Pig in the past, I eventually needed to program more challenging jobs that required the use of Apache Spark, a more advanced and flexible language. At first, Spark may look a bit intimidating, but this blog post will show that the transition to Spark (especially PySpark) is quite easy.
In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, explains the architectural differences between MapR-FS or the MapR File System, and the Hadoop Distributed File System (HDFS).
There are many options for monitoring the performance and health of a MapR cluster. In this post, I will present the lesser-known method for monitoring the CLDB using the Java Management Extensions (JMX).
We have experimented with on a 5 node MapR 5.1 cluster running Spark 1.5.2 and will share our experience, difficulties, and solutions on this blog post.
Sqoop is a popular data transfer tool for Hadoop. Sqoop allows easy import and export of data from structured data stores like relational databases, enterprise data warehouses, and NoSQL datastores. Sqoop also integrates with Hadoop-based systems such as Hive, HBase, and Oozie.
The distributed computation world has seen a massive shift in the last decade. Apache Hadoop showed up on the scene and brought with it new ways to handle distributed computation at scale. It wasn’t the easiest to work with, and the APIs were far from perfect, but they worked.
Spark 1.6 is now in Developer Preview on the MapR Converged Data Platform. In this blog post, I’ll share a few details on what Spark 1.6 brings to the table and what you should care about.
During the early days of developing Apache Drill, the Drill team realized the need for an efficient way to represent complex, columnar data in memory. Projects like Protobuf provided an efficient way to represent data that had a predefined schema for transmission over the network, and the Apache Parquet project had implemented an efficient way to represent complex columnar data on disk.
Two blogs came out recently that share some very interesting perspectives on the blurring lines between architectures and implementation of different data services, ranging from file systems to databases to publish/subscribe streaming services.
Today we are excited to announce that Apache Drill 1.4 is now available on the MapR Distribution. Drill 1.4 is a production-ready and supported version on MapR and can be downloaded from here and the find the 1.4 release notes here
Apache Drill has a hidden gem: an easy to use REST interface. This API can be used to Query, Profile and Configure Drill engine.
The Julia programming language was created in 2009 by Jeff Bezanson, Stefan Karpinski, and Viral B Shah. It was broadly announced in 2012 and has had a growing community of contributors and users ever since.
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.
This is a tale of two siloed clusters. The first cluster is an Apache Hadoop cluster. This is an island whose resources are completely isolated to Hadoop and its processes. The second cluster is the description I give to all resources that are not a part of the Hadoop cluster.
In this blog post, we’ll go into detail about our solution for providing YARN clusters on shared infrastructure which have the same security and performance.
In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and will include examples. Spark can request two resources in YARN: CPU and memory.
In this blog post, I’ll give you an in-depth look at the HBase architecture and its main benefits over NoSQL data store solutions. Be sure and read the first blog post in this series, titled “HBase and MapR-DB: Designed for Distribution, Scale, and Speed.”
In this blog post, I will discuss best practices for YARN resource management. The fundamental idea of MRv2(YARN) is to split up the two major functionalities—resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM).
Teradata Connector for Hadoop (TDCH) is a key component of Teradata’s Unified Data Architecture for moving data between Teradata and Hadoop. TDCH invokes a mapreduce job on the Hadoop cluster to push/pull data to/from Teradata databases, with each mapper moving a portion of the data, in parallel across all nodes, for very fast transfers.
Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
In this tutorial, you’ll learn how you can deploy your MapR clusters with just one click on your private cloud infrastructure.
The following program illustrates a table load tool, which is a great utility program that can be used for batching puts into a HBase/MapR-DB table. The program creates a simple HBase table with a single column within a column family, and inserts 100,000 rows in a batch fashion.
Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.
Our latest product updates are now out, with some interesting new features on Hue and Pig. This month’s release includes updates for Hue, Oozie, Spark and Pig. Here are the highlights of the release:
As part of working at MapR, we live and breathe Apache Hadoop. And we use Hadoop to help customers solve difficult business problems that would be intractable otherwise. Last year, about six months after shipping our first version of Hadoop 2.x with YARN, multiple customers asked us to consider working with Apache Mesos. Our early response was that of curiosity. Why are multiple customers asking us to work with Mesos when we just released YARN?
In this week's Whiteboard Walkthrough, Jim Scott, Director of Enterprise Strategy and Architecture at MapR, explains the differences between Apache Mesos and YARN, and why one may or may not be better in global resource management than the other.
Hive has been using ZooKeeper as distributed lock manager to support concurrency in HiveServer2. The ZooKeeper-based lock manager works fine in a small scale environment. However, as more and more users move to HiveServer2 from HiveServer and start to create a large number of concurrent sessions, problems can arise. The major problem is that the number of open connections between Hiveserver2 and ZooKeeper keeps rising until the connection limit is hit from the ZooKeeper server side. At that point, ZooKeeper starts rejecting new connections, and all ZooKeeper-dependent flows become unusable.
In this blog post, we compare MapReduce v1 to MapReduce v2 (YARN), and describe the MapReduce Job Execution framework. We also take a detailed look at how jobs are executed and managed in YARN and how YARN differs from MapReduce v1. To begin, a user runs a MapReduce program on the client node which instantiates a Job client object. Next, the Job client submits the job to the JobTracker. Then the job tracker creates a set of map and reduce tasks which get sent to the appropriate task trackers. The task tracker launches a child process which in turns runs the map or reduce task. Finally the task continuously updates the task tracker with status and counters and writes its output to its context.
The January 2015 release of the Apache open source packages in MapR was made available for customers a few days ago. A number of packages including Hadoop Core, Hue, Flume, Storm, Hive, HttpFS and HBase were updated. Release highlights include: Upgrade to Hadoop 2.5.1, Hive Updates, Upgrade to HBase 0.98.7, Apache Storm Update. Please refer to our release notes for more details.
In this week's Whiteboard Walkthrough, Jim Scott, Director of Enterprise Strategy and Architecture at MapR, walks you through HBase key design with OpenTSDB.
One of the important things to keep in mind with HBase is that it is a linearly-scaling, column-oriented key value store. Now in order to get linearly-scalable functionality out of HBase, you have to be very cognizant of the key design. This means you don't want to create what's called hot spots, and you want to prevent things like sequential writes from occurring. So what I've done is I've pre-drawn this diagram for you to show you that if you were to write sequentially, the keys, what happens in HBase is that when you're writing keys one through five, they're all going to land on the first server.
SQL will become one of the most prolific use cases in the Hadoop ecosystem, according to Forrester Research. Apache Drill is an open source SQL query engine for big data exploration. REST services and clients have emerged as popular technologies on the Internet. Apache HBase is a hugely popular Hadoop NoSQL database. In this blog post, I will discuss combining all of these technologies: SQL, Hadoop, Drill, REST with JSON, NoSQL, and HBase, by showing how to use the Drill REST API to query HBase and Hive. I will also share a simple jQuery client that uses the Drill REST API, with JSON as the data exchange, to provide a basic user interface.
The November release of the Apache open source packages in MapR was made available for customers earlier this month. We are excited to deliver some major upgrades to existing packages.
Here are the highlights:
Hadoop and distributed data processing are being increasingly adopted. Everybody is in the race to process more data faster and at the same time allow multiple applications running within the same cluster. Not surprisingly, new challenges and growing pains are emerging, especially in multi-tenant production environments.
The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable and reliable and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.
The September release of the Apache open source packages in MapR is now available for customers. The September updates to the Apache Open Source packages in the MapR Distribution are part of the MapR 4.0.1 major release. Details about the MapR 4.0.1 release can be found here.
Here are the top highlights of this month’s release:
A core-differentiating component of the MapR Distribution including Apache™ Hadoop® is the MapR File System, also known as MapR-FS. MapR-FS was architected from its very inception to enable truly enterprise-grade Hadoop by providing significantly better performance, reliability, efficiency, maintainability, and ease of use compared to the default Hadoop Distributed Files System (HDFS).
The August release of the Apache open source packages in the MapR Distribution is now available for customers. The release includes updates to several packages, including Flume, Oozie, HBase, Hive and AsyncHBase.
I have a background in Open Data and will say that this area has a lot in common with Open Source software. One of the core tenets is that of freedom. No single organization, independent of its size or accumulated brain power, can ever anticipate all things that are possible; be that with the data or be it with regards to code.
The latest updates for the Apache open source projects in the MapR Distribution for Hadoop are now out. This release includes patches for Hive version 0.11 and 0.13, Httpfs, Pig, along with the 0.9 (pre 1.0) release of Mahout. The new additions to Mahout 0.9 are described in the earlier blog Advances in Apache Mahout: Highlights for the 0.9 Release.
Here are details of the latest release:
The latest monthly release of the Apache open source packages in MapR is now available for customers. The release includes updates to several OSS packages including Hive, HBase, Oozie, Hue and Sqoop. Here are some of the highlights of the release:
Large clusters that store enterprise big data for the long run, while exposing that data to a variety of workloads at the same time, are turning out to be the preferred deployment option for Hadoop. This model makes it easy for businesses to avoid data silos and progressively build a full suite of big data applications over time.
Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there’s been plenty of hype about it in the past several months. In the latest webinar from the Data Science Central webinar series, titled “Let Spark Fly: Advantages and Use Cases for Spark on Hadoop,” we cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal.
On the heels of the recent Spark stack inclusion announcement, here is some more fresh powder (For non-skiers, that’s fresh snow on a mountain).
MapR Distribution of Apache Hadoop: 4.0.0 Beta
Big changes are underway for the open source machine learning project Apache Mahout, and there’s a lot of excitement over this new work. Mahout is a library of very scalable machine learning algorithms that is part of the MapR Distribution for Hadoop. Now these new sweeping changes will make Mahout run enormously faster and much easier to use.
As many of you may recall, YARN was first released in October 2013 and gathered a lot of buzz in the later part of the year. On February 20th, the community voted to release the next version of YARN with Hadoop 2.3.0. We are delighted at the progress that has been made with YARN in particular. Here are the Hadoop 2.3.0 release notes from the Apache Hadoop website.
Scalable machine learning for Apache Hadoop-based systems got a boost recently when the Apache Mahout PMC approved release of the 0.9 version of Mahout. This release is the second in less than a year, and it’s another step toward a stable, mature scalable machine learning library. The open source Apache Mahout community has been very active in the last year, with new releases, active discussions on the user and developer mailing lists, new publications and engagement via Twitter.
It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.
What is Storm?
Advantage #1: You can easily mix and match modes without having to move data assets back and forth.
This is a tutorial of the MapR Enterprise Grade distribution of Hadoop Command Line Interface.
Before we start, just in case let's refresh the MapR Architecture.
The MapR Architecture consists of the following services or daemons
Running a large HBase™ cluster smoothly with minimum downtime is a skill which requires a deep understanding of how HBase™ works. When a disaster strikes, you find yourself digging into HBase™ code and/or mailing lists to understand what went wrong, determine how to recover from the current mess and most importantly figure out what can be done to prevent the same thing from happening again. Apart from the inconvenience downtime, a service crash can also lead to inconsistencies in HBase™ meta tables.
Blog Sign Up
Sign up and get the top posts from each week delivered to your inbox every Friday!