Dawn of the Planet of the In-Hadoop Databases

I was recently at the offices of Trace3 (a MapR partner) in Irvine, CA, to give a talk at the OC Big Data Meetup. Well-organized by its gracious host, “Big Data Joe Rossi,” this meetup appears to have a very enthusiastic following that continues to grow each month. Fortunately, my presentation did not seem to curtail that growth.

[Photos from the OC Big Data Meetup: my pic of the audience from the stage (a “stagie”?), and the other half of the audience. Quite a great crowd, as you can see.]

Anyway, my talk was entitled “Hadoop and NoSQL Joining Forces,” and it was about the trend toward tighter integration between Hadoop and NoSQL databases. You might know that both of these classes of technologies are great for managing big data. You might also know that the two are often used together in the same environment. This makes sense, since big data covers a variety of distinct workloads, and these technologies are designed for different requirements. Hadoop is ideal for large-scale analytics, especially jobs that take a huge processing task and break it into many smaller tasks, using what we in the biz call “MapReduce.” NoSQL (which, as a reluctant corporate suck-up, I begrudgingly pronounce noh-SEE-kwuhl, while the cool kids call it noh-ESS-cue-ELL) is ideal for interactive, real-time reads and writes of small data elements, especially in workloads that require fast lookups over large volumes of disparate data.
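To make that contrast concrete, here’s a minimal, illustrative sketch in Java; the class, table, and column names are invented for the example. The mapper is the kind of divide-and-conquer batch work Hadoop is built for (a matching reducer would simply sum the counts), while the HBase lookup at the end is the kind of millisecond, single-row read a NoSQL database is built for.

    // Illustrative sketch only: class, table, and column names are made up.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TwoWorkloads {

      // Hadoop-style analytics: one huge job broken into many small map tasks.
      // Each task tokenizes its slice of the input and emits (word, 1) pairs;
      // a reducer (not shown) would sum the 1s into per-word counts.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // NoSQL-style operational access: a single, fast point lookup in HBase.
      public static String lookupEmail(String userId) throws IOException {
        HTable users = new HTable(HBaseConfiguration.create(), "users");
        try {
          Result row = users.get(new Get(Bytes.toBytes(userId)));
          return Bytes.toString(
              row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email")));
        } finally {
          users.close();
        }
      }
    }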

When Hadoop and NoSQL work together, they are often deployed in separate clusters. Conventional wisdom says this is the right way to deploy a big data solution. You simply copy data over from the operational side to the analytics side in batch, and everyone’s happy. And as a bonus, you might run computations on the analytics side and copy the processed data back to the operational side. So far so good, right?
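Just for illustration, a bare-bones version of that batch copy might look something like the sketch below (the cluster address, table, and column names are hypothetical). In practice you’d likely reach for a purpose-built tool such as DistCp or HBase’s CopyTable, but the shape of the work is the same: scan everything on the operational side, write it to the analytics side, and repeat on a schedule.

    // Illustration only: a nightly batch copy from an operational HBase table
    // to the analytics cluster's HDFS. Cluster address, table, and column
    // names are invented for this sketch.
    import java.io.PrintWriter;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NightlyOrderCopy {
      public static void main(String[] args) throws Exception {
        // Operational side: the live HBase table.
        HTable orders = new HTable(HBaseConfiguration.create(), "orders");
        // Analytics side: a separate cluster's file system.
        FileSystem analyticsFs = FileSystem.get(
            URI.create("hdfs://analytics-cluster:8020"), new Configuration());
        Path out = new Path("/staging/orders/" + System.currentTimeMillis());

        ResultScanner scanner = orders.getScanner(new Scan());  // full-table batch read
        PrintWriter writer = new PrintWriter(analyticsFs.create(out));
        try {
          for (Result row : scanner) {
            writer.println(Bytes.toString(row.getRow()) + "\t"
                + Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("total"))));
          }
        } finally {
          writer.close();
          scanner.close();
          orders.close();
        }
      }
    }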

Well, some of the more cynical among us consider “conventional wisdom” to be a euphemism for “nobody really thought that one through.” But fortunately, some organizations did think this through and combined analytical and operational functions into an integrated platform. Google certainly did with their BigTable work. The folks who built Apache HBase did as well.

So why do this? You get real-time operational analytics, which means you can run your analytics jobs on the same nodes as your live/real-time data (i.e., “close to the data”). The business analysts who want faster actionable insights will care the most. There are technical benefits as well. A single cluster means you don’t have to copy data across your network. There’s no unnecessary duplication of data, and no completely disparate implementations for data governance, data protection, administration, transformation, etc. People who seek low-risk deployments—enterprise architects, systems administrators, and folks of that ilk—will appreciate all of this.
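As a rough sketch of what “close to the data” can look like, the hypothetical job below uses HBase’s TableMapReduceUtil to scan the live table in place and aggregate it, with no export step and no second cluster involved (the table name, row-key format, and output path are assumptions for the example).

    // Hypothetical example: count events per user by scanning the live
    // "events" table in place, instead of exporting it to another cluster.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EventsPerUser {

      public static class EventMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
          // Row keys are assumed to look like "userId#timestamp".
          String userId = Bytes.toString(rowKey.get()).split("#")[0];
          context.write(new Text(userId), ONE);
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "events-per-user");
        job.setJarByClass(EventsPerUser.class);

        Scan scan = new Scan();
        scan.setCaching(500);          // larger scanner batches for a full-table pass
        scan.setCacheBlocks(false);    // don't pollute the block cache from MapReduce

        // The map tasks read the live table directly; no export, no second cluster.
        TableMapReduceUtil.initTableMapperJob(
            "events", scan, EventMapper.class, Text.class, IntWritable.class, job);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/analytics/events-per-user"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }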

Let’s take a look strictly from a scalable storage angle. Think about the two primary models for storage today: direct-attached storage (DAS) and networked storage (NAS or SAN, i.e., network-attached storage or a storage area network). Sure, I know I shouldn’t exclude the public cloud, but let me restrict this discussion to on-premises deployments. Databases want the speed of DAS, and you want its low cost. But you also want the scalability, manageability, and enterprise readiness of networked storage. Does Hadoop give you the best of both worlds? Even if all Hadoop gave you was the economics of DAS and the distributed scalability of NAS, you’d be in a good place. And with a certain Hadoop distribution (i.e., MapR), you get the performance, manageability, and enterprise-grade features at the storage layer, along with all the goodness that Hadoop gives you.

By the way, HBase is not alone in the in-Hadoop world. MapR-DB, which runs HBase applications, joins the party. As does Apache Accumulo, supported by Sqrrl, a MapR partner. Other MapR technology partners are believers as well, including Splice Machine and HP Vertica. There are more, and I’m certain this list will continue to grow.

It takes a bit of forward-thinking vision to build a database that runs on Hadoop, but it might be fairly commonplace in the months and years to come. I think this is where the integration of Hadoop and NoSQL is heading, and I believe enterprises seeking big data solutions will be the big winners. (Forgive me if that last statement sounded a bit corny… I just wanted to end this thing on an up note for you.)

