Moving a data analysis platform from a “submit the job and wait” model to a “make things happen in real time” one isn’t easy. If it were, the world wouldn’t spend so much time talking about it. The challenges are numerous: every component, from the ingest of data, to the landing spots, to the pipelines, to the database, to the supporting platform, to the reporting of results, has to support this type of model, or else batching gets stuck somewhere in the middle. Then there are the architecture questions: Storm or Spark Streaming? Single cluster or multiple ones? What’s the DR scenario? Do I need this MirrorMaker thing? Even your particular choice of programming language and development tools can have a far-reaching effect on the process.
The publish-subscribe concept in streaming has the potential to unleash new ways of getting answers from data faster, with added agility that shortens the development process for implementing new ideas and innovations. Event streaming, as a capability of the underlying platform, can even support existing batch workloads as they are moved to align with real-time and interactive requirements. This is why streaming is so powerful and desirable -- files and directories give way to abstract offsets into a continuous data stream, which simplifies applications, and analytics consumers can be written to subscribe to only the data they need, while the entire stream is replicated and persisted according to policies.
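To make the offset idea concrete, here is a minimal in-memory sketch of publish-subscribe with per-consumer offsets. It is only an illustration: the `Stream` and `Consumer` classes and their methods are hypothetical names invented for this example, not the MapR Streams or Kafka API, and replication and persistence are left out entirely.

```python
from collections import defaultdict


class Stream:
    """Toy in-memory stream: one append-only log per topic (hypothetical)."""

    def __init__(self):
        self._logs = defaultdict(list)  # topic name -> ordered list of messages

    def publish(self, topic, message):
        """Append a message and return its offset within the topic."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset):
        """Return every message in the topic at or after the given offset."""
        return self._logs[topic][offset:]


class Consumer:
    """Tracks its own read position for each topic it subscribes to."""

    def __init__(self, stream, topics):
        self._stream = stream
        self._offsets = {topic: 0 for topic in topics}

    def poll(self):
        """Fetch unread messages from subscribed topics and advance offsets."""
        batch = {}
        for topic, offset in self._offsets.items():
            messages = self._stream.read(topic, offset)
            self._offsets[topic] = offset + len(messages)
            batch[topic] = messages
        return batch


stream = Stream()
stream.publish("sensors", "temp=21")
stream.publish("billing", "invoice=17")
stream.publish("sensors", "temp=22")

# This consumer subscribes to "sensors" only and never sees "billing".
analytics = Consumer(stream, ["sensors"])
first_batch = analytics.poll()
stream.publish("sensors", "temp=23")
second_batch = analytics.poll()  # only the message published since the last poll
```

The consumer never deals with files or directories; it only remembers how far into each topic it has read, which is what lets a new consumer be added without touching the producers.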
Designing and implementing a streaming data architecture today often starts with a collection of off-the-shelf software packages, each with its own resource requirements and underlying assumptions about how it should be deployed. This can mean deploying a separate cluster to support streaming, or multiple clusters to support mirroring and backup requirements.
With the release of MapR 5.1, our vision for a platform that can handle everything in real time is complete, making this “off the shelf” approach obsolete. MapR Streams is the first streaming system built into a fully converged platform and, along with MapR-FS and MapR-DB, is the first to enable customers to deploy IoT-scale reliability and failover capabilities across a globally distributed enterprise.
Recently we teamed up with a couple of smart folks from Enterprise Strategy Group (ESG) who work with a lot of enterprise vendors and know the space well. We asked them to audit some specific tests of how well MapR Streams scales to handle a huge number of messages, just to see how far we could push the platform in a repeatable, controlled environment. Under supervision from the ESG team, we also tried to “break it” with the well-known Jepsen toolkit, which is very good at causing these types of platforms to lose data. The results? Here’s a preview video:
We tested MapR Streams at a level of throughput, in messages/sec, that we haven’t seen anywhere else, using a few different workloads and profiles. As part of running this benchmark, the MapR engineering team developed a streaming tool called Rubix (to be open sourced soon), which simulates producers and consumers with configurable behavior profiles, message sizes and thread counts.
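Rubix itself is not shown in this article, so the following is not its code. It is a minimal sketch of the general idea of a profile-driven load simulator: a hypothetical `Profile` describes how many producer and consumer threads to run and how large each message is, and `run_profile` drives them against an in-process queue that stands in for a stream.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical workload profile; field names are illustrative only."""
    producers: int = 2
    consumers: int = 2
    message_size: int = 100          # bytes per message
    messages_per_producer: int = 1000


def run_profile(profile):
    """Run producers and consumers concurrently; return (sent, received)."""
    channel = queue.Queue()          # stands in for the stream under test
    payload = b"x" * profile.message_size
    received = [0] * profile.consumers  # one counter per consumer thread

    def produce():
        for _ in range(profile.messages_per_producer):
            channel.put(payload)

    def consume(index):
        while True:
            item = channel.get()
            if item is None:         # sentinel: no more messages will arrive
                return
            received[index] += 1

    producer_threads = [threading.Thread(target=produce)
                        for _ in range(profile.producers)]
    consumer_threads = [threading.Thread(target=consume, args=(i,))
                        for i in range(profile.consumers)]
    for thread in producer_threads + consumer_threads:
        thread.start()
    for thread in producer_threads:
        thread.join()
    # All payloads are queued; send one sentinel per consumer to stop it.
    for _ in consumer_threads:
        channel.put(None)
    for thread in consumer_threads:
        thread.join()

    sent = profile.producers * profile.messages_per_producer
    return sent, sum(received)
```

A real benchmark would also time the run and vary how aggressively producers and consumers behave, as profiles with different producer and consumer behavior would; this sketch only checks that every message sent is also received.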
Here’s a summary of the scalability tests -- in the chart below, ‘RF’ refers to Replication Factor, the number of copies of each message stored in the cluster.
This was a five-node test (for details on each machine, consult the link to the full report at the bottom of this article), which achieved nearly 7M messages/sec at RF=1, and around 6M messages/sec at RF=3. The speeds for the consumer-only test were higher, and we included other test profiles called Tango and Slacker, which varied the behaviors of producers and consumers running simultaneously to simulate a real-world workflow. Considering the state of streaming today and where other public benchmarks stand, there’s no doubt that these are very impressive speeds, and no application changes are required to take advantage of them. MapR Streams implements the Kafka 0.9 API, which was used for all of the tests.
On the reliability side, we ran the full Jepsen/Kafka ISR tests, which all passed with no data loss, and added some of our own, which failed over specific MapR services to simulate some worst-case scenarios. All of these additional tests also passed with no data loss, showing a very high level of resiliency even while moving and replicating data at high speeds.
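As a rough illustration of what “no data loss” means in a test like this, the sketch below compares the messages a producer was told were written with the messages later read back. It is not Jepsen’s actual checker, and the function and field names are hypothetical; it only shows the basic comparison, assuming each message carries a unique ID.

```python
def check_no_data_loss(acknowledged, consumed):
    """Compare acknowledged writes against what consumers read back.

    acknowledged: IDs of messages the system confirmed as written.
    consumed: IDs of messages read back after the faults were injected.
    """
    acked = set(acknowledged)
    seen = set(consumed)
    # A message is lost if the system acknowledged it but it never came back.
    lost = sorted(acked - seen)
    # Messages read back without an acknowledgement are not a loss: the write
    # may have succeeded even though the confirmation never reached the client.
    unacknowledged = sorted(seen - acked)
    return {"ok": not lost, "lost": lost, "unacknowledged": unacknowledged}


# Example: message 3 was acknowledged but never read back, so the check fails.
result = check_no_data_loss(acknowledged=[1, 2, 3], consumed=[1, 2, 4])
```

A passing run is one where the `lost` list is empty even though services were killed or failed over while the writes were in flight.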
The conclusion? MapR Streams is clearly a system that’s built for scale, and this latest ESG Lab Review shows the high levels of performance it can reach in a globally replicated environment, making it uniquely capable of handling IoT-scale streams in a converged platform.