Google, Hadoop and You: M.C. Srivas Provides an Overview

Getting back to basics, MapR CTO and co-founder M.C. Srivas gives a brief introduction to Hadoop and explains where it fits on the spectrum from “dumb data” to “very smart data.” After watching this video, you’ll have a better understanding of Hadoop and of how MapR has taken the best innovations from both ends of that spectrum to develop the leading Hadoop technology for big data deployments.

A few key points made in the video include:

  1. There is a wide spectrum of data types in the world today:
    • Very dumb data - At the far end of the spectrum is very dumb data, such as a data block, the smallest unit of data a database uses. The only operations are reading and writing a block.
    • Dumb data - Next come “somewhat” dumb types of data such as files. Here data appears as byte streams instead of blocks. Operations on this type of data are open/read/write/close, over protocols such as NFS and CIFS.
    • Somewhat smart data (OLTP) - Towards the smarter end of the spectrum are databases. This data consists of rows, columns, and tables, with known relationships between them. Protocols include various forms of SQL.
    • Very smart data (OLAP) - This kind of data has very strong internal relationships and supports operations such as aggregates, pivot tables, and windowing functions.
  2. Not all data lives at one end of the spectrum or the other. Email is an example of data that sits right in the middle. It has some structure (folders, to, from, etc.) and some relationships with other data, such as the address book, other emails, and replies. To implement a reliable email system, however, you need to combine smart (OLTP) data with dumb (flat file) data. Google faced this problem before everyone else. They wanted to combine flat files with OLTP-style processing, and their solution was to develop MapReduce and the Google File System (GFS). Hadoop grew out of these two technologies and resides in the middle of the data spectrum: it can process both database records and flat files to run business operations that depend on a wide range of data types.
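
To make the MapReduce idea above concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API and not code from the video; the function names, the sample email lines, and the sample rows are all hypothetical. It only illustrates how one map/shuffle/reduce pipeline can consume both flat-file lines and database-style records, using the email example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical "dumb" input: raw lines as they might appear in a flat mail file.
flat_lines = [
    "From: alice@example.com",
    "Subject: hello",
    "From: bob@example.com",
    "From: alice@example.com",
]

# Hypothetical "smart" input: rows as they might come from a database table.
message_rows = [
    {"sender": "alice@example.com", "folder": "inbox"},
    {"sender": "carol@example.com", "folder": "archive"},
]


def map_flat_line(line):
    """Mapper for flat-file input: emit (sender, 1) for each From: header."""
    prefix = "From: "
    if line.startswith(prefix):
        yield line[len(prefix):].strip(), 1


def map_row(row):
    """Mapper for record input: emit (sender, 1) for each row."""
    yield row["sender"], 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Run both mappers, merge their output, then shuffle and reduce once.
mapped = chain(
    chain.from_iterable(map_flat_line(line) for line in flat_lines),
    chain.from_iterable(map_row(row) for row in message_rows),
)
message_counts = reduce_counts(shuffle(mapped))
```

In a real cluster the map and reduce steps would run in parallel across many machines over files stored in a distributed file system; this sketch keeps everything in memory so the data flow is easy to follow.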

MapR took the best technologies from both sides of the data spectrum. From the dumb data side, we took features such as data backup, disaster recovery, and full HA and brought them to Hadoop. From the smart data side, we brought in high performance, ANSI SQL, and transactions. The result? The best Hadoop processing environment available today.


