Walmart: Harvesting Value from Big Data with Hadoop & NoSQL

When it comes to Walmart, big data meets big retail in an impressive way. Not only is Walmart an industry leader in global ecommerce and brick-and-mortar retail, they’re also a leader in the use of Hadoop-based technologies to implement their new data-driven approach to business. Since writing the short O’Reilly book Real World Hadoop earlier this year, I’ve been interested in going beyond what advantages Hadoop and NoSQL theoretically offer to looking at how these technologies play out in real-world settings. Walmart’s approach is a great example of Hadoop-based technologies used successfully in production.

There’s naturally a high level of interest in how they’re doing this – and that interest was readily apparent in the way the audience packed the room beyond capacity for a Walmart presentation at Strata + Hadoop World in New York on October 1, 2015. Like so many others, I was curious to hear how they are extracting value from big data. I was fortunate to be inside the room; I heard there were people crowded outside as well. Walmart’s global ecommerce chief technology officer and senior vice president, Jeremy King, described “how they make data work” through a transformation that began in 2012 with their first Hadoop cluster. Now they work with tens of petabytes of detailed, highly valuable transactional data for their 245 million customers who shop online or in thousands of stores. @WalmartLabs, an accelerator in Silicon Valley, spearheaded this effort.

Jeremy King of @WalmartLabs presenting a talk at Strata + Hadoop World (image © E. Friedman)

The overall goal is to optimize the shopping experience of Walmart customers, whether they are in a store, browsing a website, or on the move using mobile devices. The solution has involved redesigning global websites and building innovative applications that personalize the customer experience while increasing the efficiency of logistics. What is needed to do this is a large-scale system that gives internal users real-time access to data collected from a wide range of sources and centralized for more effective use. That’s where Hadoop and NoSQL technologies come in.

Developers for Walmart have built a variety of innovative applications, including personalized search and targeted recommendations. One useful application is Savings Catcher, which lets customers scan receipts to see whether they got the best price – if not, an adjustment is made in order to meet the Walmart promise. Another push has been to “mobilize the store”. Each store has its own layout that requires up-to-date information about every item in terms of availability and location. The aim is to bring all this to mobile applications, making shopping easier for Walmart customers.
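
To make that concrete, here is a minimal sketch of the price-comparison step a receipt-scanning feature like this implies. The function name, data shapes, and adjustment rule are all assumptions for the sake of illustration; King did not describe the actual implementation.

```python
# Hypothetical sketch of "Savings Catcher"-style logic: compare what a
# customer paid against the lowest advertised competitor price and compute
# the adjustment owed. All names and data shapes are illustrative.

def savings_due(receipt_items, competitor_prices):
    """receipt_items: list of (item_id, price_paid) tuples from a scanned
    receipt. competitor_prices: dict of item_id -> lowest advertised price.
    Returns the total credit owed to the customer."""
    total = 0.0
    for item_id, price_paid in receipt_items:
        best_price = competitor_prices.get(item_id)
        if best_price is not None and best_price < price_paid:
            # Credit the difference to honor the price promise.
            total += price_paid - best_price
    return round(total, 2)

# Example: the customer paid $3.49 for an item advertised elsewhere at $2.99.
print(savings_due([("sku-123", 3.49)], {"sku-123": 2.99}))  # 0.5
```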

How is this accomplished? King did not name any vendor, but he emphasized two key aspects of the approach Walmart has taken: provide widespread data access that allows data democracy and build a big data system that never shuts down.

The idea behind data democracy is to get rid of unnecessary bureaucracy that can block development. By giving broad data access, developers can take advantage of one of the strengths of big data approaches – the powerful benefit of combining data from multiple sources. Point-of-sale data, inventory and logistics information, competitive intelligence, trends revealed through social media, and transactional details for individual customers are all part of the rich collection of data resources. Fragmented data sets are pulled together into a centralized organization. In addition, developers can test their applications at scale, a very important characteristic in a well-designed workflow.

Big Data Democracy: give developers access to a full data set at scale. This shortens the pipeline between valuable innovation and effective implementation.

Bypassing a cumbersome approval process for internal developers really paid off in the development of an application known as Walmart Exchange. Two engineers from the internal team realized that they could marry advertising data with customer transactional data, and in just a few hours they built a rough version of an application that lets third-party advertisers see which ads are actually working. Without this data access, development might have taken months.
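
As a rough illustration of the kind of join those two engineers could have prototyped, here is a PySpark sketch that attributes purchases to ad impressions. The paths, column names, and seven-day attribution window are assumptions of mine; the talk gave no implementation details.

```python
# Illustrative only: join ad-impression data with (tokenized) transaction
# data to count conversions per campaign. Schemas and the attribution rule
# are assumptions, not Walmart's actual design.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ad-effectiveness-sketch").getOrCreate()

ads = spark.read.parquet("/data/ad_impressions")   # customer_token, campaign_id, shown_at
txns = spark.read.parquet("/data/transactions")    # customer_token, purchased_at, total

# Naive attribution: a conversion is any purchase by the same (tokenized)
# customer within 7 days of seeing an ad for the campaign.
conversions = (
    ads.join(txns, "customer_token")
       .where(F.col("purchased_at").between(
           F.col("shown_at"),
           F.col("shown_at") + F.expr("INTERVAL 7 DAYS")))
       .groupBy("campaign_id")
       .agg(F.countDistinct("customer_token").alias("converting_customers"))
)
conversions.show()
```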

Another important decision in data democracy is to introduce developers to each other so that resources are used effectively. By watching for situations in which queries overlap, it’s often possible for different developers to work together on some aspect of what they are doing, thus avoiding duplicated work.

Of course, in order to safely provide widespread access, King pointed out that it’s first necessary to cleanse data of personally identifiable information (PII). A specially firewalled Hadoop cluster houses data with PII, but few people outside of customer support and billing need to see it. For the data set made widely available to developers, data is carefully tokenized or summarized. Take the example of a full history of transactional data for a particular customer. Developers generally just need to know that these transactions belong to one individual – they don’t need to know the person’s name, email address, phone number or billing information. King also advised that if tokenization is to be undertaken, it is much easier to do so from the start.
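
As an illustration of what tokenization like this can look like, here is a minimal Python sketch using a keyed hash: the same email always maps to the same token, so developers can group transactions by customer without ever seeing the address. The key handling here is deliberately simplified, and the scheme is my assumption, not Walmart’s actual approach.

```python
# Keyed-hash tokenization sketch: replace a direct identifier with a stable,
# non-reversible token. Simplified for illustration; in practice the key
# would live in a secrets manager, not in source code.
import hashlib
import hmac

SECRET_KEY = b"example-key-kept-outside-source-control"  # assumption

def tokenize(pii_value: str) -> str:
    """Derive a stable token from a PII field such as an email address."""
    return hmac.new(SECRET_KEY, pii_value.encode("utf-8"),
                    hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "item": "sku-123", "total": 19.99}
safe_record = {**record, "email": tokenize(record["email"])}
# safe_record still lets developers tell that two transactions belong to the
# same customer, but the raw email never reaches the shared cluster.
```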

I also found the other key point that King emphasized – the need for the Hadoop platform to run reliably on a 24/7 basis – to be of particular interest. It illustrates the need for a realistic fit between technology and SLAs on a project-by-project basis. This strong emphasis on reliability and availability for Walmart reminded me of a very different use case that I discussed in the Real World Hadoop book, India’s society-changing Aadhaar project. Aadhaar provides a unique identification for every person in India, and authentication of this ID requires sub-second response times and access from anywhere in India, at any time. For the authentication phase, Aadhaar moved to a NoSQL technology integrated into the Hadoop distribution, in this case MapR-DB, because this technology showed that it could meet the extreme requirement for 24/7 availability and reliability.
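
To give a feel for that workload, here is a hedged sketch of the single-row point read an authentication check like Aadhaar’s boils down to. Since MapR-DB exposes an HBase-compatible API, this uses the happybase client; the host, table name, row-key scheme, and column layout are all assumptions for illustration.

```python
# Illustrative point read against an HBase-compatible store such as MapR-DB.
# Host, table, row-key scheme, and column family are hypothetical.
import hmac

import happybase

connection = happybase.Connection("maprdb-gateway.example.com")  # assumed host
table = connection.table("id_records")

def authenticate(uid: str, presented_hash: bytes) -> bool:
    """Single-row lookup keyed on the ID number: the access pattern a NoSQL
    store must serve in sub-second time, from anywhere, around the clock."""
    row = table.row(uid.encode("utf-8"), columns=[b"auth:biometric_hash"])
    stored = row.get(b"auth:biometric_hash")
    return stored is not None and hmac.compare_digest(stored, presented_hash)
```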

I think the lesson in both cases is to think of effective performance as more than a sprint. It’s not just about how fast a particular query can run, for instance, but also about how much work gets done at the end of weeks and months. Reliability, availability and your own project’s tolerance for any downtime are issues to consider as you build a big data system that is a good match for your needs.

Extreme Availability: For the projects addressed by @WalmartLabs, the big data technologies involved must be able to meet the requirement to never shut down.

King closed his Strata presentation by reminding the audience that once you’ve appropriately cleansed data of PII and built a reliable system that avoids bureaucracy and provides developers with widespread access, you should just “…hire smart people and let them run.”

You can find out more on the @WalmartLabs website.

Resources on related topics:

These O’Reilly ebooks are offered for free download as a courtesy by MapR:

  1. Real World Hadoop by Ted Dunning & Ellen Friedman (February 2015)
  2. Sharing Big Data Safely: Managing Data Security by Ted Dunning & Ellen Friedman (September 2015)
  3. Practical Machine Learning: Innovations in Recommendation by Ted Dunning & Ellen Friedman (February 2014)  

