The first Kafka Summit was recently held in San Francisco. While the size of the conference was relatively small at 600 attendees, it was encouraging to see the variety of companies that are embracing real-time data pipelines. It’s no longer just silicon valley web companies like LinkedIn and Uber, attendees came from a variety of verticals including financial services, healthcare, hospitality, logistics, retail, and pharma.
I attended several sessions given by Kafka users, and talked to many more at the booth and in the hallway. The general consensus is the Kafka model and API is well suited to building horizontally-scalable, asynchronous data pipelines for data integration and stream processing. That said, the companies that are operating these systems at scale – billions of events per day or multiple data centers – described a consistent set of challenges.
Earlier this year we launched MapR Streams, a publish-subscribe event streaming system built on top of the MapR platform, exposing the Kafka API. By leveraging the strong MapR foundation that supports our distributed file system (MapR-FS) and database (MapR-DB), both of which effortlessly scale to petabytes of data and thousands of nodes across multiple data centers, MapR Streams inherently avoids several of the challenges described by the companies operating it at scale. In this blog I’d like to highlight a few of these challenges and how MapR Streams overcomes them.
The most universal pain point I heard had to do with how Kafka balances topic partitions between cluster nodes nodes. Because Kafka assumes all partitions are equal in terms of size and throughput, a common occurrence is for multiple “heavy” partitions to be placed on the same node, resulting in hot spotting and storage imbalances. To overcome this these companies devote a lot of resources to monitoring, and manually intervene each time an issue is found to migrate partitions between nodes.
Rather than pin partitions to a single node, MapR Streams splits partitions into smaller linked objects, called “partitionlets”, that are spread among the nodes in the cluster. As data is written to the cluster, the active partitionlets (those handling new data) are dynamically balanced according to load, minimizing hotspotting.
Infinite Topic Persistence
Another common issue that was talked about was handling long-term persistence of streaming data. As mentioned above, Kafka partitions are pinned to a single node, meaning they can’t outgrow the storage capacity of that node. Because of this, a common design pattern is to shovel streaming data from Kafka into HDFS for long-term persistence, also known as the Lamba architecture. This creates huge complications around reprocessing of old data (either for new use cases or fixing bugs on existing streaming apps), as it forces companies to write two versions of each app – one for processing from Kafka, the other from HDFS.
The MapR partitionlet approach described above also eliminates this issue, as it allows all historical data to be stored in a single system – MapR Streams. This allows streaming apps, new or existing, to simply “scrollback to 0” and reprocess months or even years of historical data.
LinkedIn gave a presentation to a packed room called “More Clusters, More Problems” on designing for multi-datacenter. They listed several design challenges, to name a few -
- Establishing tiered cluster types (local and aggregate) to avoid forming topology loops that infinitely replicate data.
- Ensuring topic names are globally unique due to lack of namespacing.
- Ensuring identical partition configuration on both ends to prevent re-ordering.
- No ability to handle failover of applications in a disaster recovery scenario, as message offsets aren’t synchronized between clusters.
By building global replication into the platform, MapR Streams overcomes all of the challenges above and more. Full details of how this works are best found in this whiteboard walkthrough.