Two blog posts came out recently that share some very interesting perspectives on the blurring lines between the architectures and implementations of different data services, ranging from file systems to databases to publish/subscribe streaming services. The timing couldn’t be better, because MapR has recently announced our latest wave of convergence, and we’ve learned some key lessons along the way on how to get it right. Before we get to that, let’s review what the two posts had to say.
In the first blog post, Kafka and Confluent, Curt Monash writes from a conversation with Jay Kreps, co-founder and CEO at Confluent: “Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.” Of course this was just a minor point of the blog, but I’m sure it raised a few eyebrows. The second, Kudu as a More Flexible And Reliable Kafka, prototypes a publish/subscribe system implemented on a columnar database in order to achieve greater reliability (through strongly consistent writes) and more flexibility (through modifiable messages and supporting many more topics).
So, to summarize, one blog post describes a file system built on a publish/subscribe system, and the other describes a publish/subscribe system built on a database. Interesting, right? If you think for a minute about the architecture of these different systems, it isn’t hard to understand why people are thinking this way, because many of the things these systems need overlap, such as:
- High-throughput writes (or puts, or produces) and reads (or gets, or consumes)
- Reliable persistence of data to multiple nodes, so data isn’t lost if failures occur
- Detecting and handling node failures with minimal disruption
- Rebalancing data as nodes are added to avoid hotspots
- Scaling, both in number of nodes and in number of objects
Once all of these problems are solved in one system, it is extremely tempting to build other systems on top of it to leverage the technology. There is just one problem: impedance mismatch. Typically, when a system is designed, the architecture is over-fit to the type of data being served. Some services get optimized for random read/write, others for sequential access. Some are designed for availability, and others for consistency. Volumes have been written about the pitfalls of building queues on databases alone.
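To make the mismatch concrete, here is a minimal sketch of a queue stacked on a database table. It is a toy built on Python's standard sqlite3 module, and the table layout and the `produce`/`consume` helpers are hypothetical names chosen for illustration, not taken from either blog post.

```python
import sqlite3


def make_queue(conn):
    # Hypothetical schema: one row per message plus a "consumed" flag.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "consumed INTEGER NOT NULL DEFAULT 0)"
    )


def produce(conn, payload):
    # Producing is a plain insert, which a database handles well.
    with conn:
        conn.execute("INSERT INTO messages (payload) VALUES (?)", (payload,))


def consume(conn):
    # Consuming needs a lookup AND an in-place update inside one transaction.
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM messages "
            "WHERE consumed = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE messages SET consumed = 1 WHERE id = ?", (row[0],))
        return row[1]


conn = sqlite3.connect(":memory:")
make_queue(conn)
for text in ("a", "b", "c"):
    produce(conn, text)

first = consume(conn)
second = consume(conn)
# Consumed rows are still stored until some separate cleanup job removes them.
leftover = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```

In this sketch every consume turns a sequential read into a query plus a random write, and consumed messages linger in the table until something deletes them. A log-oriented service would only advance a reader offset, which is the kind of assumption gap the paragraph above describes.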
How MapR Approached Convergence
So how did we avoid this impedance mismatch when developing the MapR Converged Data Platform? Rather than trying to stack data services on top of each other, we spent our first years as a company developing a robust, patented, extremely fast container-based architecture. Because of this architecture, we could build in purpose-optimized data structures and datapaths for files, database tables, and streams, achieving:
- Scalability of billions of files, tables, or topics
- Throughput of up to 10 GB per second per node
- Global, real-time availability of data
- Consistency across services – data availability, unified security policy
Note the difference here. We didn't build a file system and then build a queue on it, or build a database and then build a queue on it. We built a storage platform that supports the common core attributes and then built multiple different higher-level components on that common platform. The result is that you get the best of both worlds: one converged platform, without the painful tradeoffs.
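The shape of that design, as opposed to stacking, can be sketched in a few lines. This is a toy in-memory illustration only, not MapR's implementation: the `Container`, `Table`, and `Stream` classes and their methods are hypothetical, and the "replication" is just a list of dictionaries.

```python
class Container:
    """Hypothetical shared core: replicates every write to all replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Any surviving replica can serve the read.
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        raise KeyError(key)


class Table:
    """Random-access service: keyed puts and gets."""

    def __init__(self, container, name):
        self.container, self.name = container, name

    def put(self, row_key, value):
        self.container.write(("table", self.name, row_key), value)

    def get(self, row_key):
        return self.container.read(("table", self.name, row_key))


class Stream:
    """Sequential service: append-only log addressed by offset."""

    def __init__(self, container, name):
        self.container, self.name = container, name
        self.next_offset = 0

    def append(self, message):
        offset = self.next_offset
        self.container.write(("stream", self.name, offset), message)
        self.next_offset += 1
        return offset

    def read_from(self, offset):
        return [
            self.container.read(("stream", self.name, i))
            for i in range(offset, self.next_offset)
        ]


core = Container()
users = Table(core, "users")
events = Stream(core, "events")
users.put("u1", "Ada")
events.append("login")
events.append("logout")
core.replicas[0].clear()  # simulate losing one replica
```

Here `Table` and `Stream` are siblings that each talk to the shared core directly, so neither inherits the other's access-pattern assumptions, and both still answer reads after a replica is lost. A real platform would give each service its own optimized data structures and datapaths rather than the single key-value call used in this sketch.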
Where to Go From Here
Let’s wrap up with some advice for building data architectures:
- Do: Incorporate real-time streaming concepts into your applications. Once you make the shift in mindset to streaming-first, you’ll find your data pipelines get much simpler, and you can make use of the data in many new ways.
- Do: Deploy the right tool for the job. This doesn’t necessarily mean deploying one-off systems for each type of data service you need. You’ll spend way too much time doing the infrastructure work of moving data around, securing each individual system, monitoring, upgrades, and more.
- Don’t: Try to implement one data service on top of another, even if it does most of what you think you need. You’ll only learn after it goes into production which assumptions made by the underlying system don’t apply to the stacked system.