In this week's Whiteboard Walkthrough, Ellen Friedman, Solutions Consultant at MapR, describes what happens when certain fundamental big data capabilities are engineered together as a part of the same technology. This brief overview compares the converged data platform as a foundation for big data projects versus building solutions on a base of separate pieces.
Here's the transcription:
Hello, I'm Ellen Friedman, a Solutions Consultant for MapR Technologies, and a committer for the Apache Drill and Apache Mahout projects. I'm here today to talk to you about the MapR Converged Data Platform and what that means for how you work with data.
Let's start with some of the underlying ideas. Ten years ago, the distributed file system was a new and exciting technology, in the form of the Hadoop Distributed File System, or HDFS. Some people also felt the need to work with a NoSQL database; Apache HBase is a popular one. People who are familiar with these open source systems are now also realizing that data from continuous events, or streaming data, is becoming increasingly important. So there's growing interest in an open source tool such as Apache Kafka, which works as a message transport to deliver that streaming data onto the cluster on which you build applications and store your data. These are some of the fundamental capabilities. Obviously, there are a lot of other open source projects around as well. People who run these sorts of systems have gotten used to the idea of running these fundamental capabilities on separate clusters. And that creates some problems, especially as you try to move toward a production-grade system.
MapR has taken a very different approach. They feel that these fundamental capabilities are important in order to provide a platform on which you build your applications and run the variety of open source tools of interest to you, but that it's better to have these capabilities integrated together into one system, one platform, so that it can actually support what you do in production.
That's what they've done. They've engineered a system that brings these fundamental capabilities together. There's the distributed file system (MapR-FS) and a NoSQL database (MapR-DB), which supports two different styles of API: the Apache HBase API and a JSON-style API called OJAI, which is also open source. For delivering streaming data, the streaming transport, MapR Streams is again integrated into the same system, and it supports the Apache Kafka API.
Now, they've done more than just draw a circle around these separate capabilities. They have actually engineered them together into a shared cluster with a shared global namespace. In terms of how you work with data, this means you're working with these systems alongside all of the other open source tools that you might be running on top of them. That might be Apache Spark or, if you're using streaming data, Spark Streaming. You might be using Apache Flink for streaming data, which has been tested to run on the MapR system, although it doesn't ship with MapR yet. You might be running Apache Drill for standard SQL queries in this space, or Hive, or whatever systems you want to run on top of this fundamental platform. But since they're all in the same shared cluster, you're operating under the same security and the same system administration.
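As a rough illustration of the shared namespace idea described above, here is a toy Python sketch. It is not MapR code and calls no MapR API: the `Namespace`, `File`, `Table`, and `Stream` classes, the paths, and the user names are all hypothetical, chosen only to show files, tables, and streams living under one path hierarchy and being governed by one access rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, TypeVar

T = TypeVar("T")


@dataclass
class File:
    data: bytes = b""


@dataclass
class Table:
    rows: Dict[str, dict] = field(default_factory=dict)


@dataclass
class Stream:
    topics: Dict[str, List[str]] = field(default_factory=dict)


class Namespace:
    """Hypothetical toy model: one path hierarchy for every kind of object."""

    def __init__(self) -> None:
        self._objects: Dict[str, object] = {}
        self._readers: Dict[str, Set[str]] = {}

    def create(self, path: str, obj: T) -> T:
        # Files, tables, and streams all share the same path space.
        if path in self._objects:
            raise ValueError(f"{path} already exists")
        self._objects[path] = obj
        return obj

    def grant_read(self, prefix: str, user: str) -> None:
        # One rule on a path prefix covers every object type beneath it.
        self._readers.setdefault(prefix, set()).add(user)

    def can_read(self, path: str, user: str) -> bool:
        return any(
            path.startswith(prefix) and user in users
            for prefix, users in self._readers.items()
        )

    def list(self, prefix: str) -> List[str]:
        return sorted(p for p in self._objects if p.startswith(prefix))


ns = Namespace()
ns.create("/apps/sensors/raw.log", File(b"t=1 temp=20.5\n"))
table = ns.create("/apps/sensors/readings", Table())
stream = ns.create("/apps/sensors/events", Stream())

table.rows["sensor-1"] = {"temp": 20.5}
stream.topics.setdefault("alerts", []).append("sensor-1 ok")

# A single grant applies to the file, the table, and the stream alike.
ns.grant_read("/apps/sensors/", "analyst")
paths = ns.list("/apps/sensors/")
print(paths)
print([ns.can_read(p, "analyst") for p in paths])
```

In this sketch, a single `grant_read` call on the `/apps/sensors/` prefix covers all three object types. That is the kind of shared security and administration the talk contrasts with running separate clusters, each with its own settings.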
This can be very important for things such as being able to control data locality, a feature that MapR does very well. It is also very important for multi-tenant work. Of course, an enterprise-grade system needs to provide efficient mirroring across different data centers, as well as absolutely consistent snapshots. Other capabilities include table replication and, something that's really unique to MapR, the ability to do geo-distributed replication of message streams across different data centers.
This is just a small portion of the kind of capability that you have. The key message here is that you're running all of this on the MapR Converged Data Platform. This is a platform built not just from components that communicate well with each other, but from components that are actually converged together. It's the fundamental platform on which you build your applications, run the open source tools of interest (which run very well on this platform), and which you use as you move from development into production.
I hope this description has helped you understand what the converged platform can actually do for you. If you have questions or comments, please provide those in the box. Thank you very much for listening.