A very common use case for the MapR Converged Data Platform is collecting and analyzing data from a variety of sources, including traditional relational databases. Until recently, data engineers would build an ETL pipeline that periodically walks the relational database and loads the data into files on the MapR cluster, then perform batch analytics on that data.
MapR Platform Blog Posts
Today we are excited to announce the availability of Drill 1.8 on the MapR Converged Data Platform. As part of the Apache Drill community, we continue to deliver iterative releases of Drill, providing significant feature enhancements along with enterprise readiness improvements based on feedback from a variety of customer deployments.
Elasticsearch and Kibana are widely used in the market today for data analytics; however, security is one aspect that was not initially built in to the product. Since data is the lifeline of any organization today, it becomes essential that Elasticsearch and Kibana be “secured.” In this blog post, we will be looking at one of the ways in which authentication, authorization, and encryption can be implemented for them.
This post will help you get started using Apache Spark Streaming for consuming and publishing messages with MapR Streams and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. MapR Streams is a distributed messaging system for streaming event data at scale. MapR Streams enables producers and consumers to exchange events in real time via the Apache Kafka 0.9 API.
Mitesh Shah, Senior Product Manager for Security and Data Governance at MapR, describes an important concept in big data security: vulnerability management, a key layer in the trust model. He explains what is provided by the MapR Converged Data Platform and what the role is of the customer in maintaining a secure environment.
Apache PredicitonIO is an open source machine learning server. In this article, we integrate Apache PredictionIO with the MapR Converged Data Platform 5.1 as a backend. Specifically, we use MapR-DB (1.1.1) for event data storage, ElasticSearch for metadata storage, and MapR-FS for model data storage.
Building a robust, responsive, secure data service for healthcare is tricky. For starters, healthcare data lends itself to multiple models: Document representation for patient profile views or updates; Graph representation to query relationships between patients, providers, and medications; Search representation for advanced lookups. This post will describe how stream-first architectures can solve these challenges, and look at how this has been implemented at Liaison Technologies.
In this week’s Whiteboard Walkthrough, Vinay Bhat, Solution Architect at MapR Technologies, takes you step-by-step through a widespread big data use case: data warehouse offload and building an interactive analytics application using Apache Spark and Apache Drill. Vinay explains how the MapR Converged Data Platform provides unique capabilities to make this process easy and efficient, including support for multi-tenancy.
PySpark is a Spark API that allows you to interact with Spark through the Python shell. If you have a Python programming background, this is an excellent way to get introduced to Spark data types and parallel programming. PySpark is a particularly flexible tool for exploratory big data analysis because it integrates with the rest of the Python data analysis ecosystem, including pandas (DataFrames), NumPy (arrays) and Matplotlib (visualization).
In the big data enterprise ecosystem, there are always new choices when it comes to analytics and data science. Apache incubates so many projects that people are always confused as to how to go about choosing an appropriate ecosystem project. In the data science pipeline, ad-hoc query is an important aspect, which gives users the ability to run different queries that will lead to exploratory statistics that will help them understand their data.
In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, explains how to use savepoints, a unique feature in Apache Flink stream processing, to let you reprocess data, do bug fixes, deal with upgrades, and do A/B testing.
MapR Streams and MapR-DB are both very exciting developments in the MapR Converged Data Platform. In this blog post, I’m going to show you how to get Ruby code to natively interact with MapR-DB and MapR Streams.
Sooner or later, if you eyeball enough data sets, you will encounter some that look like a graph, or are best represented a graph. Whether it's social media, computer networks, or interactions between machines, graph representations are often a straightforward choice for representing relationships among one or more entities.
In this week’s Whiteboard Walkthrough, Prashant Rathi, Senior Product Manager at MapR, describes the architecture for fine-grained monitoring in the MapR converged data platform from collection to storage and visualization from a variety of data sources as part of the Spyglass Initiative.
In this blog post, I’ll describe how to install Apache Drill on the MapR Sandbox for Hadoop, resulting in a "super" sandbox environment that essentially provides the best of both worlds—a fully-functional, single-node MapR/Hadoop/Spark deployment with Apache Drill.
This post describes step by step on how to deploy Mesos, Marathon, Docker and Spark on a MapR cluster and run various jobs as well as Docker containers using this deployment.
MapR-FS provides some very useful capabilities for data management and access control. These features can and should be applied to user home directories. A user in a MapR cluster has a lot of capability at their fingertips. They can create files, two styles of NoSQL tables, and pub/sub messaging streams with many thousands of topics.
In this week's Whiteboard Walkthrough, Terry He, Software Engineer at MapR, walks you through how to tune MapR Streams running an application with Apache Flink for optimal performance.
The power of SQL for business analytics is a given, but the challenge in big data settings is that SQL is normally a static language that assumes pre-defined, fixed and well-known schema. SQL also needs flat data structures. It has been assumed that you need fixed schema for performance.
In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, explains the architectural differences between MapR-FS or the MapR File System, and the Hadoop Distributed File System (HDFS).
If you’ve had a chance to work with Hadoop or Spark a little, you probably already know that HDFS doesn't support full random read-writes or many other capabilities typically required in a production-ready file system.
In this post we are going to discuss building a real time solution for credit card fraud detection.
There are many options for monitoring the performance and health of a MapR cluster. In this post, I will present the lesser-known method for monitoring the CLDB using the Java Management Extensions (JMX).
We have experimented with on a 5 node MapR 5.1 cluster running Spark 1.5.2 and will share our experience, difficulties, and solutions on this blog post.
This post will show how to integrate Apache Spark Streaming, MapR-DB, and MapR Streams for fast, event-driven applications.
In this article we will explore what it means to have a converged data platform for building and delivering business applications. This sample application will be to create blog articles for a personal website.
Today we are very excited to announce the release of Apache Drill 1.6 on the MapR Converged Data Platform. Drill has been on the path of rapid iterative releases for one and a half years now, gathering amazing traction with customers and OSS community users on the way.
In this week's Whiteboard Walkthrough, Mitesh Shah, Product Management at MapR, describes how you can track administrative operations and data accesses in your MapR Converged Data Platform in a seamless and efficient way with the built-in auditing feature.
MapR Streams is a new distributed messaging system for streaming event data at scale, and it’s integrated into the MapR converged platform. MapR Streams uses the Apache Kafka API, so if you’re already familiar with Kafka, you’ll find it particularly easy to get started with MapR Streams.
In this week's Whiteboard Walkthrough, Mitesh Shah, Product Management at MapR, describes how you can make sure you aren’t opening more access permissions to your sensitive data in Hadoop than you intended, using File Access Control Expressions in MapR.
In 2015, MapR shipped three significant core releases : 4.0.2 in January, 4.1 in April, 5.0 and the GA version of Apache Drill in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR—the converged data platform release (AKA, MapR 5.1).
In this article, I am going to show you how to use the impersonation feature to create and access a file in MapR-FS. In this example, we will run a Java program as the “mapr” superuser that will run operations on behalf of the “user01” user.
Spark 1.6 is now in Developer Preview on the MapR Converged Data Platform. In this blog post, I’ll share a few details on what Spark 1.6 brings to the table and what you should care about.
This is the second post in a series about the MapR Command Line Interface. The first post gave you an idea of how to use the command line to review your cluster nodes, what services were running, and how to manage them.
Two blogs came out recently that share some very interesting perspectives on the blurring lines between architectures and implementation of different data services, ranging from file systems to databases to publish/subscribe streaming services.
Are you ready to start streaming all the events in your business? What happens to your streaming solution when you outgrow your single data center? What happens when you are at a company that is already running multiple data centers and you need to implement streaming across data centers?
In the wide column data model of MapR-DB, all rows are stored by a row key, column family, column qualifier, value, and timestamps. In the current version, the row key is the only field that is indexed, which fits the common pattern of queries based on the row key.
XGBoost is a library that is designed for boosted (tree) algorithms. It has become a popular machine learning framework among data science practitioners, especially on Kaggle, which is a platform for data prediction competitions where researchers post their data and statisticians and data miners compete to produce the best models.
In this week's Whiteboard Walkthrough, Will Ochandarena, Director of Product Management at MapR, explains how we are able to build the MapR Streams capabilities that differentiate us from similar products in the market.
In this week's Whiteboard Walkthrough, Mansi Shah, Senior Staff Engineer at MapR, talks about MapR Streams, a global publish-subscribe event streaming system for big data. Mansi will discuss its architecture and how it lets you deliver your data globally and reliably.
In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.
In this week's Whiteboard Walkthrough, MC Srivas, MapR Co-Founder, walks you through the MapR Converged Data Platform that has been in the making for the last 6 years and is now finally complete with MapR Streams.
In this week's Whiteboard Walkthrough, Bruce Penn, Sales Engineer at MapR, explains how NFS access works on MapR-FS vs. HDFS, and helps you decide what you would use in production.
This blog describes how to get an instance of the MapR-DB Document Database Developer Preview image running on Amazon AWS using one of the pre-configured AMI images supplied by MapR. With this AMI, you can start writing JSON-based applications on MapR-DB using the open source Open JSON Application Interface, or OJAI.
Handling large JSON-based data sets in Hadoop or Spark can be a project unto itself. Endless hours toiling away into obscurity with complicated transformations, extractions, handling the nuances of database connectors, and flattening ‘till the cows come home is the name of the game.
In this week's Whiteboard Walkthrough, Aditya Kishore, engineer on the MapR-DB team, explains how to use the OJAI API to insert, search, and update the document database.
To start with, the audit log files generated by MapR Auditing include the type of action performed on the filesystem or MapR-DB table, the date & time of the action performed and the specific details on the file or table being part of the activity.
In this week's Whiteboard Walkthrough, Bharat Baddepudi, engineer on the MapR-DB team, explains how documents in MapR-DB are inserted and updated.
With MapR version 5.0 being released recently, MapR customers got yet another powerful feature at no additional licensing costs: Auditing! In this two-folded blog post, I’ll describe various use cases for auditing as well as instructions for how to deploy these cases in your MapR environment.
MapR developed OJAI (the Open JSON Application Interface) which provides native integration of JSON-like document processing in Hadoop-style scale-out clusters.
In this week's Whiteboard Walkthrough, MC Srivas, MapR CTO and Co-Founder, explains the innovation and vision behind MapR-DB and how project Kudu stacks up to the MapR Data Platform.
In this week's Whiteboard Walkthrough, Anurag Choudhary, Engineer on the MapR-DB team, explains how horizontal scaling in MapR-DB works and how hot spotting is automatically avoided.
Here at MapR, developer productivity is critical to us. In order to keep our pace of innovation high and give customers more choice and flexibility in Apache Hadoop and other open source projects we ship with the MapR Distribution for Hadoop, we apply DevOps methodologies as widely as we can. One critical piece of this is ensuring we can rapidly test our builds to ensure quality in the codebase. Automation is key here, which is what allows us to integrate all the latest innovations across multiple releases from the community in our Hadoop distribution.
Native languages like C/C++ provide a tighter control on memory and performance characteristics of the application. A well written C++ program that has intimate knowledge of the memory access patterns and the architecture of the machine will run several times faster than Java. For these reasons, a lot of enterprise customers with massive scalability and performance requirements tend to use C/C++ in their server applications in comparison to Java.
In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, talks about the architectural differences between HDFS and MapR-FS that boil down to three numbers.
In part one of this series, Drilling into Healthy Choices we explored using Drill to create Parquet tables as well as configuring Drill to read data formats that are not very standard. In part two of this series we are going to utilize this same database to think beyond traditional database design.
Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
In this post, I’ll show you how to build a simple real-time dashboard using Spark on MapR.
In this post, we will discuss how to use the MapR Control System (MCS) to monitor MRv1 jobs. We will also see how to manage and display jobs, history, and logs using the command line interface.
A number of people have been claiming lately that interactive responses to queries constitute real-time processing. For instance, Mike Olson has been quoted saying that interactive queries are what is needed for real-time processing. I like to start with something more like the wikipedia definition of real-time computing instead. Wikipedia defines real time as a response before a deadline. A relaxed form of this is stream processing, where response is per record ASAP, but with no clear latency deadline.
In this week's Whiteboard Walkthrough, Abizer Adenwala, Technical Support Engineer at MapR, walks you through what a storage pool is, why disks are striped, reasons disk would be marked as failed, what happens when a disk is marked failed, what to watch out for before reformatting/re-adding disk back, and what is the best path to recover from disk failure.
In this week's Whiteboard Walkthrough, James Casaletto walks you through how to configure the network for the MapR Hadoop Sandbox. Whether you use VirtualBox, VMware Fusion, VMware Player, or pretty much any hypervisor on your laptop to support your MapR Sandbox, you'll need to configure the network. There's essentially three different settings that you can use to configure the network for your Sandbox. One is NAT, one is host-only, and one is bridged.
The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable and reliable and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.
Why do this? There are many use cases for time series data, and they usually require handling a decent data ingest rate. Rates of more than 10,000 points per second are common and rates of 1 million points per second are not quite as common, but not outrageously high either.
The MapR Distribution including Apache™ Hadoop® employs drivers from Simba Technologies to connect to client ODBC and JDBC applications allowing you to access data on MapR from tools like Tableau with ODBC or SQuirreL with JDBC. This post will walk you through the steps to set up and connect your Apache Hive instance to both an ODBC and JDBC application running on your laptop or other client machine. Although you may already have your own Hive cluster set up, this post focuses on the MapR Sandbox for Hadoop virtual machine (VM).
A core-differentiating component of the MapR Distribution including Apache™ Hadoop® is the MapR File System, also known as MapR-FS. MapR-FS was architected from its very inception to enable truly enterprise-grade Hadoop by providing significantly better performance, reliability, efficiency, maintainability, and ease of use compared to the default Hadoop Distributed Files System (HDFS).
NFS is the Network File System. It's been part of Linux and the broader Unix ecosystem for decades and been used for a long time in both enterprise environments to share files as well as in customized environments like high performance computing.
The second publication in the O’Reilly Practical Machine Learning series, subtitled A New Look at Anomaly Detection by Ted Dunning and me, is being released this week. In the previous book, which focused on practical approaches to recommendation, we started with the idea that everyone thinks “I want a pony”. Here in the second book, what we want is to find the outlier, the zebra in a herd of ponies, the fish swimming against the school of fish, the rare event.
Hadoop provides a compelling distributed platform for processing massive amounts of data in parallel using the Map/Reduce framework and the Hadoop distributed file system. A JAVA API allows developers to express the processing in terms of a map phase and a reduce phase, where both phases use key/value pairs or key/value list pairs as input/output.
Running MapReduce jobs on ingested data is traditionally batch-oriented: the data must be first transferred to a local file system accessible to the Hadoop cluster, then copied into HDFS with Flume or the “hadoop fs” command. Only once the transfers are complete can MapReduce be run on the ingested files.
Profiling Java applications can be accomplished with many tools, such as the built-in HPROF JVM native agent library for profiling heap and CPU usage. In the world of Hadoop and MapReduce, there are a number of properties you can set to enable profiling of your mapper and reducer code.
With MapR’s enterprise-grade distribution of Hadoop, there are 3 unique features that make this task of profiling MapReduce code easier. They are:
Blog Sign Up
Sign up and get the top posts from each week delivered to your inbox every Friday!