Offloading cold or unused data and ETL workloads from a data warehouse to Hadoop/big data platforms is a very common starting point for enterprises beginning their big data journey. Platforms like Hadoop provide an economical way to store data and do bulk processing of large data sets; hence, it’s not surprising that cost is the primary driver for this initial use case.
Apache Hadoop Blog Posts
If you’ve been keeping tabs on all the great product enhancements that have been coming out of MapR, you will know that the 5.2 version of the MapR Converged Data Platform went GA this summer. It takes a few cycles to make the platform available on the AWS marketplace, largely due to the testing efforts required.
In this Whiteboard Walkthrough, MapR’s Chief Application Architect, Ted Dunning, explains the move from state to flow and shows how it works in a financial services example. Ted describes the revolution underway in moving from a traditional system with multiple programs built around a shared database to a new flow-based system that instead uses a shared state queue in the form of a message stream built with technology such as Apache Kafka or MapR Streams. This new architecture lets decisions be made locally and supports a micro-services style approach.
Today we are excited to announce the availability of Drill 1.8 on the MapR Converged Data Platform. As part of the Apache Drill community, we continue to deliver iterative releases of Drill, providing significant feature enhancements along with enterprise readiness improvements based on feedback from a variety of customer deployments.
Elasticsearch and Kibana are widely used in the market today for data analytics; however, security is one aspect that was not initially built into the products. Since data is the lifeline of any organization today, it becomes essential that Elasticsearch and Kibana be “secured.” In this blog post, we will be looking at one of the ways in which authentication, authorization, and encryption can be implemented for them.
Mitesh Shah, Senior Product Manager for Security and Data Governance at MapR, describes an important concept in big data security: vulnerability management, a key layer in the trust model. He explains what the MapR Converged Data Platform provides and what the customer’s role is in maintaining a secure environment.
Apache PredictionIO is an open source machine learning server. In this article, we integrate Apache PredictionIO with the MapR Converged Data Platform 5.1 as a backend. Specifically, we use MapR-DB (1.1.1) for event data storage, Elasticsearch for metadata storage, and MapR-FS for model data storage.
MapR Streams and MapR-DB are both very exciting developments in the MapR Converged Data Platform. In this blog post, I’m going to show you how to get Ruby code to natively interact with MapR-DB and MapR Streams.
As a data analyst who primarily used Apache Pig in the past, I eventually needed to program more challenging jobs that required the use of Apache Spark, a more advanced and flexible framework. At first, Spark may look a bit intimidating, but this blog post will show that the transition to Spark (especially PySpark) is quite easy.
In this blog post, I’ll describe how to install Apache Drill on the MapR Sandbox for Hadoop, resulting in a "super" sandbox environment that essentially provides the best of both worlds: a fully functional, single-node MapR/Hadoop/Spark deployment with Apache Drill.
This post describes, step by step, how to deploy Mesos, Marathon, Docker, and Spark on a MapR cluster and how to run various jobs as well as Docker containers using this deployment.
MapR-FS provides some very useful capabilities for data management and access control. These features can and should be applied to user home directories. A user in a MapR cluster has a lot of capability at their fingertips. They can create files, two styles of NoSQL tables, and pub/sub messaging streams with many thousands of topics.
In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, explains the architectural differences between MapR-FS (the MapR File System) and the Hadoop Distributed File System (HDFS).
If you’ve had a chance to work with Hadoop or Spark a little, you probably already know that HDFS doesn't support full random read-writes or many other capabilities typically required in a production-ready file system.
There are many options for monitoring the performance and health of a MapR cluster. In this post, I will present a lesser-known method: monitoring the CLDB using the Java Management Extensions (JMX).
In this week's Whiteboard Walkthrough, Mitesh Shah, Product Management at MapR, describes how you can track administrative operations and data accesses in your MapR Converged Data Platform in a seamless and efficient way with the built-in auditing feature.
In 2015, MapR shipped three significant core releases: 4.0.2 in January, 4.1 in April, and 5.0 and the GA version of Apache Drill in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR: the converged data platform release (AKA MapR 5.1).
The distributed computation world has seen a massive shift in the last decade. Apache Hadoop showed up on the scene and brought with it new ways to handle distributed computation at scale. It wasn’t the easiest to work with, and the APIs were far from perfect, but they worked.
In this article, I am going to show you how to use the impersonation feature to create and access a file in MapR-FS. In this example, we will run a Java program as the “mapr” superuser that will run operations on behalf of the “user01” user.
This is the second post in a series about the MapR Command Line Interface. The first post gave you an idea of how to use the command line to review your cluster nodes, what services were running, and how to manage them.
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.
In this blog post, we’ll go into detail about our solution for providing YARN clusters on shared infrastructure which have the same security and performance.
In this week's Whiteboard Walkthrough, Aditya Kishore, engineer on the MapR-DB team, explains how to use the OJAI API to insert, search, and update the document database.
With MapR version 5.0 being released recently, MapR customers got yet another powerful feature at no additional licensing cost: auditing! In this two-part blog post, I’ll describe various use cases for auditing as well as instructions for how to deploy these cases in your MapR environment.
It’s difficult to describe what a real breach looks like, but you will know it when you see it. To identify a potential breach, we assess the amount of activity of accounts later experiencing fraud at each merchant and then visualize the results.
In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and include examples. Spark can request two resources in YARN: CPU and memory.
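As a rough illustration of the memory side of that request, the sketch below estimates the container size YARN might grant for one Spark executor. It assumes the commonly documented rule that Spark adds an overhead of max(384 MB, a fraction of executor memory) and that YARN rounds requests up to a scheduler increment. The function name, the 10% factor, and the 1024 MB increment are illustrative assumptions; actual defaults vary by Spark and YARN version and by scheduler.

```python
import math


def estimate_executor_request(executor_memory_mb, executor_cores,
                              overhead_factor=0.10,    # assumed; differs across Spark versions
                              min_overhead_mb=384,     # assumed floor for the overhead
                              yarn_increment_mb=1024): # assumed YARN rounding increment
    """Estimate the YARN resources requested for a single Spark executor."""
    # Off-heap overhead that Spark adds on top of the executor heap.
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    requested_mb = executor_memory_mb + overhead_mb
    # YARN rounds the request up to the next multiple of its increment.
    granted_mb = math.ceil(requested_mb / yarn_increment_mb) * yarn_increment_mb
    return {"memory_mb": granted_mb, "vcores": executor_cores}


estimate = estimate_executor_request(executor_memory_mb=4096, executor_cores=2)
print(estimate)
```

Under these assumptions, a 4 GB executor ends up occupying a noticeably larger container than its heap size alone suggests, which is worth keeping in mind when sizing the number of executors per node.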
Here at MapR, developer productivity is critical to us. In order to keep our pace of innovation high and give customers more choice and flexibility in Apache Hadoop and other open source projects we ship with the MapR Distribution for Hadoop, we apply DevOps methodologies as widely as we can. One critical piece of this is ensuring we can rapidly test our builds to ensure quality in the codebase. Automation is key here, which is what allows us to integrate all the latest innovations across multiple releases from the community in our Hadoop distribution.
In this post, I’ll show you what happens behind the scenes, from the time a user fires any job from a Client/Edge node or Cluster node, until the time the job actually gets submitted to the JobTracker for execution.
In this blog post, I will discuss best practices for YARN resource management. The fundamental idea of MRv2 (YARN) is to split up the two major functionalities, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
Teradata Connector for Hadoop (TDCH) is a key component of Teradata’s Unified Data Architecture for moving data between Teradata and Hadoop. TDCH invokes a MapReduce job on the Hadoop cluster to push/pull data to/from Teradata databases, with each mapper moving a portion of the data, in parallel across all nodes, for very fast transfers.
Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMSs in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema as it does with an RDBMS, making it ideal for storing unstructured or semi-structured data.
In this week's Whiteboard Walkthrough, Jim Scott, Director of Enterprise Strategy and Architecture at MapR, explains the differences between Apache Mesos and YARN, and why one may or may not be better in global resource management than the other.
SQL will become one of the most prolific use cases in the Hadoop ecosystem, according to Forrester Research. Apache Drill is an open source SQL query engine for big data exploration. REST services and clients have emerged as popular technologies on the Internet. Apache HBase is a hugely popular Hadoop NoSQL database. In this blog post, I will discuss combining all of these technologies: SQL, Hadoop, Drill, REST with JSON, NoSQL, and HBase, by showing how to use the Drill REST API to query HBase and Hive. I will also share a simple jQuery client that uses the Drill REST API, with JSON as the data exchange, to provide a basic user interface.
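For a sense of what a call to the Drill REST API looks like, here is a minimal Python sketch (the post itself uses a jQuery client). It assumes a Drillbit reachable on localhost at Drill's default web port 8047 and posts a SQL statement as JSON to the `/query.json` endpoint. The host, the helper names, and the `customers` table are hypothetical placeholders, and `hbase` is assumed to be the name of an enabled HBase storage plugin.

```python
import json
import urllib.request


def build_drill_request(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for Drill's /query.json endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_drill_query(sql, **kwargs):
    """Send the query to a running Drillbit and return the decoded JSON response."""
    with urllib.request.urlopen(build_drill_request(sql, **kwargs)) as response:
        return json.load(response)


# Hypothetical query against an HBase table exposed through Drill.
request = build_drill_request("SELECT * FROM hbase.`customers` LIMIT 5")
print(request.full_url, request.data.decode("utf-8"))
```

Calling `run_drill_query(...)` with the same SQL string would actually submit the query; it is kept separate here so that the request can be inspected without a running cluster.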
In this week's Whiteboard Walkthrough, James Casaletto walks you through how to configure the network for the MapR Hadoop Sandbox. Whether you use VirtualBox, VMware Fusion, VMware Player, or pretty much any hypervisor on your laptop to support your MapR Sandbox, you'll need to configure the network. There are essentially three different settings that you can use to configure the network for your Sandbox. One is NAT, one is host-only, and one is bridged.
With Hadoop becoming more prominent in customer environments, one of the frequent questions we hear from users is which storage format to use to persist data in Hadoop. The data format selection is a critical decision, especially as Hadoop evolves from being about cheap storage to a pervasive query and analytics platform. In this blog, I want to briefly describe self-describing data formats, how they are gaining a lot of interest as a new management paradigm to consumerize Hadoop data in organizations, and the work we have been doing as part of the Parquet community to evolve Parquet into a fully self-describing format.
The community version of Apache Hadoop 2.6 was released recently. Some of the cool new things that are part of the Hadoop 2.6 release include changes to YARN to support rolling upgrades, where the ResourceManager and the NodeManager will now preserve state information. Further highlights include label-based scheduling for YARN (with code contributions from an existing MapR feature) along with an alpha feature for running YARN applications natively in Docker containers.
One of the challenges with Hadoop is getting value out of it without having to learn all the new skillsets that you need to truly harness Hadoop’s power. The reality of using the MapR Distribution including Hadoop is… you don’t have to know Hadoop to use Hadoop! I recently came up against this again and thought I would throw it out there and hopefully make someone’s journey to their first Hadoop job a no-brainer.
One of the most recent and widely used functional programming languages is Scala. It is used in some of the Hadoop ecosystem components, such as Apache Spark and Apache Kafka. So it would be really useful to be able to develop Scala applications that use Hadoop and the ecosystem projects. In this post, I am going to show you how to run a Scala application that accesses Hadoop data.
The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable, and reliable, and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.
The September release of the Apache open source packages in MapR is now available for customers. The September updates to the Apache Open Source packages in the MapR Distribution are part of the MapR 4.0.1 major release. Details about the MapR 4.0.1 release can be found here.
Here are the top highlights of this month’s release:
Why do this? There are many use cases for time series data, and they usually require handling a decent data ingest rate. Rates of more than 10,000 points per second are common and rates of 1 million points per second are not quite as common, but not outrageously high either.
I have a background in Open Data and will say that this area has a lot in common with Open Source software. One of the core tenets is that of freedom. No single organization, regardless of its size or accumulated brain power, can ever anticipate all the things that are possible, be it with regard to data or with regard to code.
Snapshots come up in most technical discussions of enterprise-grade applications. Hadoop applications are no exception. Jack Norris talks with Industry Analyst Donnie Berkholz about snapshots to help cut to the core of what the issues are in the context of Hadoop on the What Are The Facts video series.
NFS is the Network File System. It has been part of Linux and the broader Unix ecosystem for decades and has long been used both in enterprise environments to share files and in customized environments like high performance computing.
With our recent announcement of HP Vertica’s deployment onto MapR, we have already been flooded with questions about the integration.
SQL-on-Hadoop just got easier this morning. Working together with the HP Vertica team, we are excited to announce general availability of the HP Vertica Analytics Platform running on the MapR Distribution for Apache Hadoop.
MapR recently hosted the first Apache Drill hackathon, with nearly forty people in attendance who helped push Drill toward its first beta release. It was great to see people from companies such as Visa, Cisco, LinkedIn and Hortonworks come together to harden and enhance the Apache Drill project.
The hackathon participants worked on many different aspects of Apache Drill. Over the next few weeks, these features will be incorporated into mainline. Here’s a preview of what we worked on, coming soon to a master near you:
On the heels of the recent Spark stack inclusion announcement, here is some more fresh powder (for non-skiers, that’s fresh snow on a mountain).
MapR Distribution of Apache Hadoop: 4.0.0 Beta
We are excited to announce the availability of the installation guide for RHadoop on MapR, titled “RHadoop and MapR: Accessing Enterprise-Grade Hadoop from R”. This highly detailed installation guide explains how to install and use RHadoop with MapR and R on RedHat Enterprise Linux.
As many of you may recall, YARN was first released in October 2013 and gathered a lot of buzz in the later part of the year. On February 20th, the community voted to release the next version of YARN with Hadoop 2.3.0. We are delighted at the progress that has been made with YARN in particular. Here are the Hadoop 2.3.0 release notes from the Apache Hadoop website.
It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.
Today we are very excited to announce early access of the new HP Vertica Analytics Platform on MapR at the O’Reilly Strata Conference: Making Data Work. This solution tightly integrates HP Vertica’s high-performance analytic platform directly on the MapR Enterprise-Grade Distribution for Hadoop with no “connectors” required. We wanted to provide some additional details on this integration and why this is important for customers.
We, at Flux7 Labs, a solutions company, help customers maximize performance/$. To help our customers make the right decisions we constantly research and evaluate the solutions available to customers and thereby build and strengthen our internal knowledge. As part of this research process, we evaluated the most common Hadoop distributions on various metrics. The distributions we tested were from Intel, Cloudera, Hortonworks, and MapR. This testing was done independently on all the distributions.
The advancement in SQL development for Hadoop is making it possible for SQL professionals to apply their skills and SQL tools of choice to Big Data projects. Based on their use case, SQL pros can choose from Apache projects Hive and Drill, Apache-licensed Impala, and proprietary technologies such as Hadapt, HAWQ, and Splice Machine. Hive, the most mature of these technologies, is widely used and best suited for long running queries. The other technologies are very new to the marketplace.
Mike Gualtieri, Principal Analyst with Forrester Research, joined us for a webinar titled Productionizing Hadoop: Seven Architectural Best Practices. Following the webinar, Mike answered a number of questions from participants, including a question about productionizing Hadoop.
Q: How do you move from the test phase to productionizing Hadoop? How do you put it into practice?
Hadoop provides a compelling distributed platform for processing massive amounts of data in parallel using the Map/Reduce framework and the Hadoop distributed file system. A Java API allows developers to express the processing in terms of a map phase and a reduce phase, where both phases use key/value pairs or key/value list pairs as input/output.
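The map and reduce contract described above can be sketched with a word count in plain Python. This is a single-process illustration of the key/value flow only, not the Hadoop Java API; the function names and the in-memory shuffle step are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(_offset, line):
    """Map: take one (offset, line) pair and emit a (word, 1) pair per word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group mapped values by key, producing key -> list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: take a (key, list of values) pair and emit a (key, total) pair."""
    yield key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory input."""
    mapped = (pair
              for offset, line in enumerate(lines)
              for pair in map_phase(offset, line))
    grouped = shuffle(mapped)
    return dict(result
                for key in sorted(grouped)
                for result in reduce_phase(key, grouped[key]))


counts = run_job(["to be or not to be", "to do"])
print(counts)
```

In a real cluster the framework performs the shuffle across nodes, which is why the map output and the reduce input take the key/value and key/value-list shapes shown here.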
“The fact is, there is an entire industry building products atop Apache open source code – and that is the point of having Apache license its projects and provide the other services it does for the open source community…
RHadoop and MapR technical brief now available
If you are a data analyst or statistician familiar with the R programming language and you want to use Hadoop to run MapReduce jobs or access HBase tables, Revolution Analytics has created RHadoop to make your life easier. You can use all of your existing R programs and add MapReduce and HBase functionality. You get all the statistical analysis capabilities of your R environment with the enterprise grade, massively scalable, distributed compute provided by MapR's Hadoop distribution.
Eclipse is a popular development tool and we thought it would be helpful to share some tips on using Eclipse with MapR to write MapReduce programs. The following notes describe how to enable Eclipse as your development IDE for Hadoop MapReduce jobs using an existing MapR cluster.