5 Google Projects That Changed Big Data Forever

“Google is living a few years in the future and sends the rest of us messages,” Doug Cutting, Hadoop founder

Because of the nature of its business, Google has long been a pioneer in embracing both the challenges and opportunities of big data. Google has had to solve the same challenges that many companies face; the difference is the sheer scale of the problem. It has often had to invent entirely new approaches to meet the needs of its business.

Over the past decade, Google has developed many custom solutions to support its own products and services. It has documented many of these internal solutions in white papers, and many have inspired open source projects that are now the foundation of the Hadoop ecosystem. This post outlines five foundational Google projects that have changed the big data landscape forever.

1.  Google MapReduce:  Apache Hadoop

One of Google’s first challenges was to figure out how to index the exploding volume of content on the web. To solve this, Google invented a new style of data processing known as MapReduce to manage large-scale data processing across large clusters of commodity servers. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
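The map and reduce steps described above can be illustrated with a minimal, single-process sketch in Python. It counts words across a few documents. The names `run_mapreduce`, `map_word_count` and `reduce_word_count` are hypothetical and chosen for illustration; a real framework such as Hadoop would spread the same steps across many machines and handle failures, which this sketch does not attempt.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one (document id, text) pair into intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: List[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Toy driver: run map, group by intermediate key, then run reduce."""
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


# Hypothetical input documents, used only for illustration.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(word_counts)
```

Because each call to the map function sees only one input pair, and each call to the reduce function sees only one key, a framework is free to run those calls on different machines in parallel.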

Google published a white paper describing the MapReduce framework in 2004. A year later, Doug Cutting and Mike Cafarella created Apache Hadoop. Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a huge variety of tasks, all of which share the challenge of handling the variety, volume and velocity of structured and unstructured data. Hadoop is becoming the go-to framework for large-scale, data-intensive deployments.

2.  Google Bigtable:  Apache HBase

With web search, Google needed to be able to quickly access huge amounts of data distributed across a wide array of servers. Google developed Bigtable as a distributed storage system for managing structured data. It was designed to scale to petabytes of data across thousands of commodity servers. Many Google projects, such as personalized search, Google Earth, Google Analytics and Google Finance, store data in Bigtable. These applications have very different needs in terms of data size and latency (batch vs. real-time). Bigtable’s technology was described in a white paper in 2006.

Based on this white paper, the Apache community created Apache HBase, an open source counterpart to Bigtable built on top of the Hadoop core. MapR has taken this one step further by adhering to the HBase APIs while building a better and faster implementation, including the full security support that enterprises need.

3.  Google “Borg”:  Apache Mesos

Google’s cluster management system, known as Borg, is a way of managing the resources of an entire data center as if they were one giant compute node. Borg provides a central brain for controlling tasks across the company’s data centers. Rather than building a separate cluster of servers for each software system, Google can build a single cluster that does several different types of work at the same time. Borg sends tasks wherever it can find free computing resources.
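The idea of sending tasks wherever free resources exist can be shown with a toy scheduler in Python. This is only a sketch under stated assumptions: it uses a simple first-fit rule over CPU counts, which is a stand-in chosen for brevity and not Borg's actual placement algorithm. The `Machine` class, the `place_task` function and the task names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Machine:
    """One server in the shared pool (hypothetical model: CPUs only)."""
    name: str
    free_cpus: int
    tasks: List[str] = field(default_factory=list)


def place_task(machines: List[Machine], task_name: str, cpus_needed: int) -> Optional[str]:
    """First-fit placement: use the first machine with enough free CPUs.

    Returns the chosen machine's name, or None if no machine has room.
    """
    for machine in machines:
        if machine.free_cpus >= cpus_needed:
            machine.free_cpus -= cpus_needed
            machine.tasks.append(task_name)
            return machine.name
    return None


# One shared cluster runs several kinds of work at the same time.
cluster = [Machine("m1", free_cpus=4), Machine("m2", free_cpus=8)]
placements = {
    "web-frontend": place_task(cluster, "web-frontend", 3),
    "batch-index": place_task(cluster, "batch-index", 6),
    "cache": place_task(cluster, "cache", 1),
    "huge-job": place_task(cluster, "huge-job", 16),
}
print(placements)
```

Here a serving task, a batch task and a cache share the same two machines, and the oversized job is left unplaced because no machine has enough free capacity.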

Following a similar approach to Borg, engineers at UC Berkeley developed Apache Mesos, an open source cluster manager that reduces the complexity of running applications on a shared pool of servers. Currently running at Twitter and Airbnb, Mesos enables enterprises to manage their data center as if it were a single compute node.

4.  Google Chubby:  Apache ZooKeeper

With Google’s multiple online services splitting tasks into tiny pieces and spreading them across a vast network of machines, Google needed a way of controlling access to those machines. So they developed Chubby, a distributed lock service intended for coarse-grained synchronization of activities within Google’s distributed systems. The primary goals included reliability, availability to a moderately large set of clients, and easy-to-understand semantics.
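A coarse-grained lock service of this kind can be pictured with a small in-memory sketch in Python. It is an illustration only: the `ToyLockService` class, its methods and the lock name `index-master` are hypothetical, and everything runs in a single process. A real service such as Chubby replicates its state across several servers to stay available, which this sketch does not attempt.

```python
import threading
from typing import Dict, Optional


class ToyLockService:
    """In-memory stand-in for a lock service: named locks, one holder each."""

    def __init__(self) -> None:
        self._guard = threading.Lock()      # protects the table below
        self._holders: Dict[str, str] = {}  # lock name -> client id

    def try_acquire(self, lock_name: str, client_id: str) -> bool:
        """Grant the lock if it is free; never block the caller."""
        with self._guard:
            if lock_name in self._holders:
                return False
            self._holders[lock_name] = client_id
            return True

    def release(self, lock_name: str, client_id: str) -> bool:
        """Release the lock, but only if this client actually holds it."""
        with self._guard:
            if self._holders.get(lock_name) != client_id:
                return False
            del self._holders[lock_name]
            return True

    def holder(self, lock_name: str) -> Optional[str]:
        """Report which client holds the lock, or None if it is free."""
        with self._guard:
            return self._holders.get(lock_name)


# Two workers compete to become the single master for a job.
service = ToyLockService()
first = service.try_acquire("index-master", "worker-a")
second = service.try_acquire("index-master", "worker-b")
print(first, second, service.holder("index-master"))
```

Whichever worker acquires the lock first acts as master and typically holds it for a long time, which is what makes the synchronization coarse-grained. The other worker can ask the service who the current holder is.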

Apache ZooKeeper is an open source project inspired by Google’s 2006 Chubby white paper. ZooKeeper, however, is more of a general-purpose coordination and metadata service than a pure lock service. It underpins Apache HBase and many other distributed, highly available systems. ZooKeeper is in a league of its own and is often regarded as one of the most solid and mature open source projects, which makes it easy to use and to integrate into any system requiring this type of functionality.

5.  Google Dremel:  Apache Drill

Google Dremel provides a way to query big data interactively across thousands of servers. The system can run queries over multiple petabytes of data in seconds. Dremel has been in use inside Google since 2006 and has thousands of Google users. Google released a white paper on Dremel in 2010, and multiple projects have since sought to replicate some of its functionality, but only Apache Drill is taking the concepts to the next level.

Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. The datasets associated with modern big data applications evolve rapidly, are often self-describing and can include complex data formats such as JSON and Parquet. Apache Drill is built from the ground up to provide low-latency queries natively on such rapidly evolving, multi-structured datasets at scale. Drill supports a multitude of file formats and data sources, and allows queries that span data sources using an ANSI SQL:2003 engine.

These are just a few examples of the ways Google has set the stage for the big data revolution. There is no doubt that Google will continue to lead the way in inventing the next big data game changers. MapR is proud to partner with the innovators at Google Capital, and together we'll continue to lead the market in innovating and in bringing big data solutions to enterprises around the world.

If you’re interested in learning more about the future of big data and Hadoop, go here.

