Open Source Software Blog Posts

Posted on September 21, 2016 by Ellen Friedman

In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, describes a valuable capability of Apache Flink stream processing: grouping together events that occurred within a configurable window of time, based on event time.
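
Event-time windowing is easy to picture with a tiny, language-agnostic sketch. The snippet below is not Flink code, just an illustration of the idea: events are grouped by the timestamp they carry (event time) rather than by when they arrive, using a hypothetical 60-second window.

```python
# Conceptual sketch only (not Flink): bucket events into fixed windows by the
# timestamp carried inside each event (event time), not by arrival time.
from collections import defaultdict

WINDOW_SECONDS = 60  # the configurable window size

def window_start(event_time):
    # Align an event timestamp to the start of its window.
    return event_time - (event_time % WINDOW_SECONDS)

# Example events: (event_time_in_seconds, payload); values are made up.
events = [(100, "a"), (130, "b"), (185, "c"), (90, "d")]

windows = defaultdict(list)
for event_time, payload in events:
    windows[window_start(event_time)].append(payload)

for start in sorted(windows):
    print("window [%d, %d): %s" % (start, start + WINDOW_SECONDS, windows[start]))
```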

Posted on September 14, 2016 by Neeraja Rentachintala

Today we are excited to announce the availability of Drill 1.8 on the MapR Converged Data Platform. As part of the Apache Drill community, we continue to deliver iterative releases of Drill, providing significant feature enhancements along with enterprise readiness improvements based on feedback from a variety of customer deployments.

Posted on September 12, 2016 by Vikash Selvin

Elasticsearch and Kibana are widely used in the market today for data analytics; however, security is one aspect that was not initially built into these products. Since data is the lifeline of any organization today, it becomes essential that Elasticsearch and Kibana be “secured.” In this blog post, we will be looking at one of the ways in which authentication, authorization, and encryption can be implemented for them.

Posted on August 4, 2016 by Dong Meng

In the big data enterprise ecosystem, there are always new choices when it comes to analytics and data science. Apache incubates so many projects that people are often unsure how to choose an appropriate ecosystem project. In the data science pipeline, ad-hoc querying is an important step: it gives users the ability to run exploratory queries whose results help them understand their data.

Posted on July 14, 2016 by Nick Amato

Sooner or later, if you eyeball enough data sets, you will encounter some that look like a graph, or are best represented as a graph. Whether it's social media, computer networks, or interactions between machines, graph representations are often a straightforward choice for representing relationships among entities.

Posted on July 13, 2016 by Philippe Cuzey

As a data analyst who primarily used Apache Pig in the past, I eventually needed to program more challenging jobs that required the use of Apache Spark, a more advanced and flexible framework. At first, Spark may look a bit intimidating, but this blog post will show that the transition to Spark (especially PySpark) is quite easy.
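
To make the transition concrete, here is a minimal sketch of how a typical Pig LOAD / FILTER / GROUP pipeline might look in PySpark. The input path and field layout are hypothetical, and the snippet targets the RDD API that Spark 1.x shipped with.

```python
# Hypothetical example: a Pig-style LOAD / FILTER / GROUP expressed with
# PySpark RDD operations (paths and field positions are made up).
from pyspark import SparkContext

sc = SparkContext(appName="pig-to-pyspark")

# Pig: logs = LOAD '/data/logs' USING PigStorage('\t') AS (user, action, bytes);
logs = sc.textFile("/data/logs").map(lambda line: line.split("\t"))

# Pig: errors = FILTER logs BY action == 'error';
errors = logs.filter(lambda fields: fields[1] == "error")

# Pig: grouped = GROUP errors BY user;
#      counts  = FOREACH grouped GENERATE group, COUNT(errors);
counts = errors.map(lambda fields: (fields[0], 1)).reduceByKey(lambda a, b: a + b)

print(counts.take(10))
```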

Posted on June 1, 2016 by Dale Kim

In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, explains the architectural differences between MapR-FS (the MapR File System) and the Hadoop Distributed File System (HDFS).

Posted on April 29, 2016 by Mathieu Dumoulin

There are many options for monitoring the performance and health of a MapR cluster. In this post, I will present a lesser-known method for monitoring the CLDB using Java Management Extensions (JMX).

Posted on April 27, 2016 by Mathieu Dumoulin

We experimented on a 5-node MapR 5.1 cluster running Spark 1.5.2, and in this blog post we will share our experience, difficulties, and solutions.

Posted on April 15, 2016 by Nick Amato

This is a great example of Pig and Hive in action. The data set used is publicly available, making it a tutorial you can easily work through on your own.

Posted on April 12, 2016 by Kostas Tzoumas

In this post, we focus on a seemingly simple, extremely widespread, but surprisingly difficult (in fact, an unsolved) problem in practice: counting in streams.

Posted on March 18, 2016 by Venkat Gunnu

Sqoop is a popular data transfer tool for Hadoop. Sqoop allows easy import and export of data from structured data stores like relational databases, enterprise data warehouses, and NoSQL datastores. Sqoop also integrates with Hadoop-based systems such as Hive, HBase, and Oozie.

Posted on March 8, 2016 by Jim Scott

The distributed computation world has seen a massive shift in the last decade. Apache Hadoop showed up on the scene and brought with it new ways to handle distributed computation at scale. It wasn’t the easiest to work with, and the APIs were far from perfect, but they worked.

Posted on February 17, 2016 by Sameer Nori

Spark 1.6 is now in Developer Preview on the MapR Converged Data Platform. In this blog post, I’ll share a few details on what Spark 1.6 brings to the table and what you should care about.

Posted on February 17, 2016 by Parth Chandra

During the early days of developing Apache Drill, the Drill team realized the need for an efficient way to represent complex, columnar data in memory. Projects like Protobuf provided an efficient way to represent data that had a predefined schema for transmission over the network, and the Apache Parquet project had implemented an efficient way to represent complex columnar data on disk.

Posted on February 2, 2016 by Will Ochandarena

Two blogs came out recently that share some very interesting perspectives on the blurring lines between architectures and implementation of different data services, ranging from file systems to databases to publish/subscribe streaming services.

Posted on January 13, 2016 by Neeraja Rentachintala

Today we are excited to announce that Apache Drill 1.4 is now available on the MapR Distribution. Drill 1.4 is a production-ready and supported version on MapR; it can be downloaded from here, and you can find the 1.4 release notes here.

Posted on December 22, 2015 by Tugdual Grall

Apache Drill has a hidden gem: an easy-to-use REST interface. This API can be used to query, profile, and configure the Drill engine.
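
As a rough illustration of how the REST interface can be used, the sketch below submits a SQL query to a Drill web endpoint from Python. It assumes a drillbit with the web UI on localhost:8047 and uses one of Drill's bundled sample tables; adjust the host and query for a real cluster.

```python
# Sketch: submit a SQL query through Drill's REST API and print the rows.
# Assumes a drillbit web endpoint at localhost:8047; the query is an example.
import requests

payload = {
    "queryType": "SQL",
    "query": "SELECT * FROM cp.`employee.json` LIMIT 5",
}

resp = requests.post("http://localhost:8047/query.json", json=payload)
resp.raise_for_status()

result = resp.json()
for row in result.get("rows", []):
    print(row)
```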

Posted on December 21, 2015 by Tony Kelman

The Julia programming language was created in 2009 by Jeff Bezanson, Stefan Karpinski, and Viral B. Shah. It was broadly announced in 2012 and has had a growing community of contributors and users ever since.

Posted on November 19, 2015 by Paul Curtis

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.

Posted on November 12, 2015 by Jim Scott

This is a tale of two siloed clusters. The first cluster is an Apache Hadoop cluster. This is an island whose resources are completely isolated to Hadoop and its processes. The second cluster is the description I give to all resources that are not a part of the Hadoop cluster.

Posted on November 5, 2015 by Mitra Kaseebhotla

In this blog post, we’ll go into detail about our solution for providing YARN clusters on shared infrastructure without compromising security or performance.

Posted on September 11, 2015 by Hao Zhu

In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and will include examples. Spark can request two resources in YARN: CPU and memory.
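
As a quick illustration of those two resources, here is a sketch of how a Spark 1.x application might request CPU and memory from YARN through SparkConf. The numbers are placeholders, not recommendations, and yarn-cluster mode would normally be selected through spark-submit rather than in code.

```python
# Illustrative sketch: asking YARN for CPU and memory in a Spark 1.x job.
# All values below are example settings, not tuning advice.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("yarn-resource-demo")
        .setMaster("yarn-client")                        # yarn-cluster is usually set via spark-submit
        .set("spark.executor.instances", "4")            # number of executors to request
        .set("spark.executor.cores", "2")                # CPU cores per executor
        .set("spark.executor.memory", "2g")              # heap memory per executor
        .set("spark.yarn.executor.memoryOverhead", "384"))  # extra off-heap memory (MB)

sc = SparkContext(conf=conf)
```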

Posted on September 4, 2015 by Carol McDonald

This post will help you get started using Apache Spark Streaming with HBase on the MapR Sandbox. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.
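
For orientation, here is a minimal Spark Streaming sketch in Python showing the basic pattern the tutorial builds on: create a StreamingContext, consume a stream, and process each micro-batch. The socket source, host, and port are placeholders, and the HBase write step from the tutorial is omitted here.

```python
# Minimal Spark Streaming (Spark 1.x) sketch: word counts over 2-second batches.
# Socket source and port are placeholders; the HBase sink is not shown.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, 2)  # 2-second batch interval

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```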

Posted on August 7, 2015 by Carol McDonald

In this blog post, I’ll give you an in-depth look at the HBase architecture and its main benefits over other NoSQL data store solutions. Be sure to read the first blog post in this series, titled “HBase and MapR-DB: Designed for Distribution, Scale, and Speed.”

Posted on August 6, 2015 by Carol McDonald

In this blog post, I’ll discuss how HBase schema is different from traditional relational schema modeling, and I’ll also provide you with some guidelines for proper HBase schema design.

Posted on July 24, 2015 by Hao Zhu

In this blog post, I will discuss best practices for YARN resource management. The fundamental idea of MRv2 (YARN) is to split the two major functionalities, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

Posted on July 8, 2015 by Andy Lerner

Teradata Connector for Hadoop (TDCH) is a key component of Teradata’s Unified Data Architecture for moving data between Teradata and Hadoop. TDCH invokes a MapReduce job on the Hadoop cluster to push/pull data to/from Teradata databases, with each mapper moving a portion of the data in parallel across all nodes, for very fast transfers.

Posted on June 26, 2015 by Carol McDonald

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.

Posted on May 5, 2015 by Abizer Adenwala

In this tutorial, you’ll learn how you can deploy your MapR clusters with just one click on your private cloud infrastructure.

Posted on April 22, 2015 by Terry He

The following program illustrates a table load tool, a utility that can be used for batching puts into an HBase/MapR-DB table. The program creates a simple HBase table with a single column within a column family, and inserts 100,000 rows in a batch fashion.
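
The tool described above is written against the Java HBase client. For readers who prefer Python, the sketch below shows the same batched-put idea using the happybase library, which talks to HBase over a Thrift gateway. The table name, column family, and host are assumptions, not the values used in the original program.

```python
# Sketch of the batched-put idea in Python via happybase (requires an HBase
# Thrift gateway). Table, column family, and host names are examples only.
import happybase

connection = happybase.Connection("localhost")           # Thrift server host
connection.create_table("load_test", {"cf": dict()})     # one column family
table = connection.table("load_test")

# Buffer puts client-side and flush automatically every 1,000 mutations.
with table.batch(batch_size=1000) as batch:
    for i in range(100000):
        batch.put(("row-%08d" % i).encode(), {b"cf:value": str(i).encode()})
```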

Posted on April 9, 2015 by Carol McDonald

Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.

Posted on March 17, 2015 by Nitin Bandugula

Our latest product updates are now out, with some interesting new features on Hue and Pig. This month’s release includes updates for Hue, Oozie, Spark and Pig. Here are the highlights of the release:

Posted on February 11, 2015 by Anoop Dawar

As part of working at MapR, we live and breathe Apache Hadoop. And we use Hadoop to help customers solve difficult business problems that would be intractable otherwise. Last year, about six months after shipping our first version of Hadoop 2.x with YARN, multiple customers asked us to consider working with Apache Mesos. Our early response was that of curiosity. Why are multiple customers asking us to work with Mesos when we just released YARN?

Posted on February 11, 2015 by Jim Scott

In this week's Whiteboard Walkthrough, Jim Scott, Director of Enterprise Strategy and Architecture at MapR, explains the differences between Apache Mesos and YARN, and why one may or may not be better in global resource management than the other.

Posted on February 6, 2015 by Na Yang

Hive has been using ZooKeeper as a distributed lock manager to support concurrency in HiveServer2. The ZooKeeper-based lock manager works fine in a small-scale environment. However, as more and more users move to HiveServer2 from HiveServer and start to create a large number of concurrent sessions, problems can arise. The major problem is that the number of open connections between HiveServer2 and ZooKeeper keeps rising until the connection limit is hit on the ZooKeeper server side. At that point, ZooKeeper starts rejecting new connections, and all ZooKeeper-dependent flows become unusable.

Posted on February 3, 2015 by James Casaletto

In this blog post, we compare MapReduce v1 to MapReduce v2 (YARN), and describe the MapReduce job execution framework. We also take a detailed look at how jobs are executed and managed in YARN and how YARN differs from MapReduce v1. To begin, a user runs a MapReduce program on the client node, which instantiates a job client object. Next, the job client submits the job to the JobTracker. Then the JobTracker creates a set of map and reduce tasks, which get sent to the appropriate TaskTrackers. The TaskTracker launches a child process, which in turn runs the map or reduce task. Finally, the task continuously updates the TaskTracker with status and counters, and writes its output to its context.

Posted on January 28, 2015 by Nitin Bandugula

The January 2015 release of the Apache open source packages in MapR was made available for customers a few days ago. A number of packages, including Hadoop Core, Hue, Flume, Storm, Hive, HttpFS, and HBase, were updated. Release highlights include an upgrade to Hadoop 2.5.1, Hive updates, an upgrade to HBase 0.98.7, and an Apache Storm update. Please refer to our release notes for more details.

Posted on January 21, 2015 by Jim Scott

In this week's Whiteboard Walkthrough, Jim Scott, Director of Enterprise Strategy and Architecture at MapR, walks you through HBase key design with OpenTSDB. 

One of the important things to keep in mind with HBase is that it is a linearly scaling, column-oriented key-value store. In order to get linearly scalable functionality out of HBase, you have to be very cognizant of the key design. This means you don't want to create what are called hot spots, and you want to prevent things like sequential writes from occurring. So I've pre-drawn this diagram to show you what happens in HBase if you write keys sequentially: when you're writing keys one through five, they're all going to land on the first server.
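
A common way to avoid the sequential-write hot spot described above is to salt the row key with a small hashed prefix so consecutive keys land in different key ranges. The sketch below illustrates the general technique in Python; it is not the OpenTSDB key layout from the walkthrough, and the bucket count is an arbitrary example.

```python
# Illustrative sketch: salting row keys with a hash prefix so sequential IDs
# spread across regions instead of hot-spotting one server. Not OpenTSDB's
# actual key layout; the bucket count is an arbitrary example.
import hashlib

NUM_BUCKETS = 8  # roughly how many regions/servers to spread writes across

def salted_key(sequential_id):
    bucket = int(hashlib.md5(str(sequential_id).encode()).hexdigest(), 16) % NUM_BUCKETS
    return "%02d-%010d" % (bucket, sequential_id)

for i in range(1, 6):
    print(salted_key(i))  # keys 1..5 now fall into different key ranges
```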

Posted on January 6, 2015 by Carol McDonald

SQL will become one of the most prolific use cases in the Hadoop ecosystem, according to Forrester Research. Apache Drill is an open source SQL query engine for big data exploration. REST services and clients have emerged as popular technologies on the Internet. Apache HBase is a hugely popular Hadoop NoSQL database. In this blog post, I will discuss combining all of these technologies: SQL, Hadoop, Drill, REST with JSON, NoSQL, and HBase, by showing how to use the Drill REST API to query HBase and Hive. I will also share a simple jQuery client that uses the Drill REST API, with JSON as the data exchange, to provide a basic user interface.

Posted on December 1, 2014 by Nitin Bandugula

The November release of the Apache open source packages in MapR was made available for customers earlier this month. We are excited to deliver some major upgrades to existing packages.

Here are the highlights:

Posted on October 29, 2014 by Yuliya Feldman

Hadoop and distributed data processing are being increasingly adopted. Everybody is in a race to process more data faster while allowing multiple applications to run within the same cluster. Not surprisingly, new challenges and growing pains are emerging, especially in multi-tenant production environments.

Posted on September 29, 2014 by Nitin Bandugula

The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable, and reliable, and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.

Posted on September 16, 2014 by Nitin Bandugula

The September release of the Apache open source packages in MapR is now available for customers. The September updates to the Apache Open Source packages in the MapR Distribution are part of the MapR 4.0.1 major release. Details about the MapR 4.0.1 release can be found here.

Here are the top highlights of this month’s release:

Posted on September 5, 2014 by Pat Farrel

Combining a search engine with Mahout has created a recommender that is extremely fast and scalable, and seamlessly blends results using collaborative filtering data and metadata. In the first post we described creating a co-occurrence indicator matrix for a recommender. In this follow-up post, we dive deeper into the performance and quality of the recommendations.

Posted on August 12, 2014 by Pat Farrel

There are big changes happening in Apache Mahout. For years it’s been the go-to machine learning library for Hadoop. It contained most of the best-in-class algorithms for scalable machine learning, which means clustering, classification, and recommendation. But it was written for Hadoop and MapReduce. Today a number of new parallel execution engines show great promise in speeding calculations by as much as 10-100x (Spark, H2O, Flink). That means instead of buying 10 computers for a cluster, a single one may do. That should get your manager’s attention.

Posted on August 11, 2014 by Nitin Bandugula

The August release of the Apache open source packages in the MapR Distribution is now available for customers. The release includes updates to several packages, including Flume, Oozie, HBase, Hive and AsyncHBase.

Posted on August 11, 2014 by Bruce Penn

A core-differentiating component of the MapR Distribution including Apache™ Hadoop® is the MapR File System, also known as MapR-FS. MapR-FS was architected from its very inception to enable truly enterprise-grade Hadoop by providing significantly better performance, reliability, efficiency, maintainability, and ease of use compared to the default Hadoop Distributed File System (HDFS).

Posted on July 7, 2014 by Michael Hausenblas

I have a background in Open Data and will say that this area has a lot in common with Open Source software. One of the core tenets is that of freedom. No single organization, independent of its size or accumulated brain power, can ever anticipate all the things that are possible, be it with the data or with regard to the code.

Posted on July 6, 2014 by Michele Nemschoff

M.C. Srivas, CTO and Co-Founder of MapR Technologies, recently spoke at the Munich Hadoop User Group about the Apache Drill project. The following is a blog post from HUG Muenchen, originally published on the comSysto blog.

 

A deep dive into Apache Drill - fast interactive SQL on Hadoop

Posted on July 3, 2014 by Nitin Bandugula

The latest updates for the Apache open source projects in the MapR Distribution for Hadoop are now out. This release includes patches for Hive versions 0.11 and 0.13, HttpFS, and Pig, along with the 0.9 (pre-1.0) release of Mahout. The new additions to Mahout 0.9 are described in the earlier blog post Advances in Apache Mahout: Highlights for the 0.9 Release.

Here are details of the latest release:

Posted on June 20, 2014 by Nitin Bandugula

The latest monthly release of the Apache open source packages in MapR is now available for customers. The release includes updates to several OSS packages including Hive, HBase, Oozie, Hue and Sqoop. Here are some of the highlights of the release:

Posted on June 13, 2014 by Nitin Bandugula

Large clusters that store enterprise big data for the long run, while exposing that data to a variety of workloads at the same time, are turning out to be the preferred deployment option for Hadoop. This model makes it easy for businesses to avoid data silos and progressively build a full suite of big data applications over time.  

Posted on June 12, 2014 by Keys Botzum

Apache Accumulo is a popular BigTable-like framework created by the NSA and open-sourced as an Apache project. We’ve previously blogged about using Accumulo 1.4 with MapR, and thought now was a good time to update the post with the latest versions of Accumulo and MapR.

Posted on May 6, 2014 by Michele Nemschoff

Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there’s been plenty of hype about it in the past several months. In the latest webinar from the Data Science Central webinar series, titled “Let Spark Fly: Advantages and Use Cases for Spark on Hadoop,” we cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal.

Posted on April 11, 2014 by Anoop Dawar

On the heels of the recent Spark stack inclusion announcement, here is some more fresh powder (for non-skiers, that's fresh snow on a mountain).

MapR Distribution of Apache Hadoop: 4.0.0 Beta

Posted on April 4, 2014 by Karen Whipple

Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. The latest webinar from the Amazon Web Services Partner webinar series, titled “Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS,” showed examples of how to use Amazon EMR with the MapR Distribution for Apache Hadoop, and outlined the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.

Posted on March 31, 2014 by Ellen Friedman

Big changes are underway for the open source machine learning project Apache Mahout, and there’s a lot of excitement over this new work. Mahout is a library of very scalable machine learning algorithms that is part of the MapR Distribution for Hadoop. Now these new sweeping changes will make Mahout run enormously faster and much easier to use.

Posted on March 3, 2014 by Ellen Friedman

Does it make sense for me to have a car? If so, which one is the best choice for my needs: a gasoline, hybrid, or electric? And should I buy or lease? In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself.

Posted on February 26, 2014 by Anoop Dawar

As many of you may recall, YARN was first released in October 2013 and gathered a lot of buzz in the later part of the year. On February 20th, the community voted to release the next version of YARN with Hadoop 2.3.0. We are delighted at the progress that has been made with YARN in particular. Here are the Hadoop 2.3.0 release notes from the Apache Hadoop website.

Posted on February 19, 2014 by Ellen Friedman

Scalable machine learning for Apache Hadoop-based systems got a boost recently when the Apache Mahout PMC approved release of the 0.9 version of Mahout. This release is the second in less than a year, and it’s another step toward a stable, mature scalable machine learning library. The open source Apache Mahout community has been very active in the last year, with new releases, active discussions on the user and developer mailing lists, new publications and engagement via Twitter.

Posted on February 11, 2014 by Anoop Dawar

It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.

YARN 

Posted on October 4, 2013 by Ellen Friedman

The well-known, open source project Storm is in the process of moving into the Apache Foundation group of open source software projects. This is a big step for Storm and for the community developing this already well-respected software.

What is Storm?

Posted on September 25, 2013 by Nitin Bandugula

As part of the latest MapR M7 release for NoSQL and Apache Hadoop, MapR conducted benchmark tests to measure and validate its Apache HBase™ application performance. The MapR M7 Edition dramatically boosted the performance of applications originally written for HBase, delivering upwards of 10x better throughput while eliminating latency spikes.

Posted on August 29, 2013 by Ted Dunning

Machine learning with the open source project Apache Mahout just got better with the much anticipated new Mahout version 0.8, released on July 25, 2013. It’s leaner, with less-used features removed and some powerful new ones added, including improved recommendation and a super-fast new clustering algorithm.

Posted on August 21, 2013 by Ted Dunning
We are often asked by potential customers if Apache Mahout™ integrates well with the MapR M7 Edition. The quick answer is, “Yes!” Mahout itself is extremely portable, and it easily connects with M7 where appropriate. The advantage of running Mahout on MapR has more to do with development simplicity, speed and reproducibility.

Advantage #1: You can easily mix and match modes without having to move data assets back and forth.

Posted on May 9, 2013 by Nitin Bandugula

NoSQL databases are becoming increasingly popular for analyzing big data. There are very few NoSQL solutions, however, that provide the combination of scalability, reliability and data consistency required in a mission-critical application.

Posted on March 27, 2013 by Carlos Morillo

This is a tutorial on the command line interface of the MapR enterprise-grade distribution for Hadoop.

Before we start, let's quickly review the MapR Architecture.

MapR Architecture

The MapR Architecture consists of the following services or daemons

Posted on November 13, 2012 by Nitin Bandugula

Apache HBase is a NoSQL database solution for large key-value based data sets that provides scale and strong consistency, combined with MapReduce functionality over Hadoop. About half of Hadoop users today deploy Apache HBase for their NoSQL operations.

Posted on October 18, 2012 by Aditya Kishore

Running a large HBase™ cluster smoothly with minimum downtime is a skill which requires a deep understanding of how HBase™ works. When disaster strikes, you find yourself digging into HBase™ code and/or mailing lists to understand what went wrong, determine how to recover from the current mess, and, most importantly, figure out what can be done to prevent the same thing from happening again. Apart from the inconvenience of downtime, a service crash can also lead to inconsistencies in HBase™ meta tables.
