In some circles today there is a sort of ‘Hadoop vs. RDBMS’ debate going on. Often the discussion casts Hadoop as the heir apparent of the data processing world, with RDBMS cast as your father’s Oldsmobile. This debate is somewhat misdirected, and it could lead organizations away from the strategy they really should be following: productive coexistence, not a simple matter of replacing A with B.
First, let’s talk about what these two technologies are and are not. Hadoop is not really a database, although it is often used like one. At its core it is a distributed file system (HDFS) paired with a batch processing framework (MapReduce), built to store and process monumental volumes of data, pretty much regardless of format. It is not, however, a good choice for querying that data in real time.
A typical RDBMS, or relational database management system, is the undisputed champion of real-time queries of structured data, whether that data sits in a transactional system or a data warehouse. That makes it ideal for online transaction processing, or OLTP. Serious businesses of all stripes rely on this kind of functionality to transact very important business.
Those are the basics. Peeling back the onion reveals other differences that strengthen the case for a Hadoop-RDBMS coexistence strategy. RDBMS has the backing of the biggest names in the software industry, and as such has fostered a base of IT talent probably second to none. RDBMS integrate very well with other systems, and represent a very mature technology with venerable, 40-year-old roots. RDBMS are baked into the very fabric of just about every mid- to large-sized IT organization in the world. Believe it: RDBMS aren’t going away any time soon, nor should they.
Open source Hadoop is roughly a decade old, though most really serious development has occurred in the last five years. And in some ways, it does represent the future of data. In most organizations, data volumes are doubling about every two years. Most of that growth is unstructured or semi-structured data. Unstructured data is to RDBMS what oil is to water. Mix the two and you gum up the works. But if unstructured data is where data growth is going, then data processing must follow the same path, right?
And that is where Hadoop shines. It is purpose-built to handle enormous volumes of unstructured data. It can scale in a way that harmonizes with the hockey-stick growth of unstructured data, which a typical RDBMS cannot. Where RDBMS are usually run on pricey commercial servers, Hadoop is just fine running on commodity hardware. And Hadoop splits each job into tasks that run across many nodes and keeps replicated copies of the data, so the failure of a single node does not sink the whole job.
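To make the idea of splitting work among nodes a little more concrete, here is a minimal sketch of the map-and-reduce pattern in plain Python. It is not Hadoop's API, and nothing here is distributed: the function names (`map_partition`, `merge_counts`, `word_count`) and the sample data are made up for illustration, and a simple loop stands in for the cluster.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(partitions):
    # On a real cluster each partition could be handled by a different node;
    # this sketch just loops over them in one process.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


# Hypothetical input, already split into three partitions.
partitions = [
    ["hadoop stores data", "hadoop scales out"],
    ["rdbms stores rows"],
    ["data grows fast"],
]
totals = word_count(partitions)
```

Because each partition is counted on its own, a partition whose worker fails can simply be counted again without touching the others, which is roughly why this style of processing tolerates node failures.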
For batch processing, Hadoop gets the nod over RDBMS. But consider that one of the many sources of data in batch processing may well be an RDBMS. Moving archive data off an RDBMS can reduce storage costs, because it is cheaper to store archived data on commodity-based Hadoop infrastructure. Similarly, data aggregated in Hadoop may well need to migrate over to an RDBMS.
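As a rough illustration of moving archive data off an RDBMS, the sketch below exports old rows to CSV and then deletes them from the table. Everything specific in it is an assumption: SQLite stands in for the production database, an in-memory buffer stands in for a file that would later be loaded into Hadoop, and the `orders` table, its columns, and the `archive_old_orders` helper are hypothetical.

```python
import csv
import io
import sqlite3


def archive_old_orders(conn, cutoff, out):
    """Copy rows older than `cutoff` to `out` as CSV, then delete them from the table."""
    rows = conn.execute(
        "SELECT id, placed_on, amount FROM orders WHERE placed_on < ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    writer = csv.writer(out)
    writer.writerow(["id", "placed_on", "amount"])
    writer.writerows(rows)
    # Delete only after the export has been written.
    with conn:
        conn.execute("DELETE FROM orders WHERE placed_on < ?", (cutoff,))
    return len(rows)


# Hypothetical sample data; dates are ISO strings so they compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_on TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2012-03-01", 19.5), (2, "2013-07-15", 42.0), (3, "2015-01-09", 7.25)],
)
conn.commit()

buffer = io.StringIO()
archived = archive_old_orders(conn, "2014-01-01", buffer)
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In practice the export step is usually handled by a dedicated bulk-transfer tool rather than hand-written code, but the shape of the job is the same: select by age, write out, then remove from the relational store.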
The bottom line is that the right path to follow when it comes to Hadoop and RDBMS is clearly one of coexistence. They don’t do the same vital tasks equally well. They do different tasks very well. Your job is to parcel out the processing load in accordance with the strengths of each. If anyone advises you to throw out your RDBMS when you have a real use for it, they probably don’t understand the problem it is meant to solve.