Delivering the Promise of Hadoop with Data Wrangling

The following is a guest blog post from Sean Kandel, CTO and co-founder of Trifacta, a MapR partner.

It’s no secret that enterprises are increasingly adopting Hadoop for a variety of analytic purposes. The Hadoop software stack introduces entirely new economics for storing and processing data at scale. It also allows organizations unparalleled flexibility in how they’re able to leverage data of all shapes and sizes to uncover insights about their business. However, the process of getting analytic value out of Hadoop often proceeds more slowly than users might expect, largely because of the under-appreciated challenges associated with getting the data ready for analysis in the first place.

In some ways, these challenges posed by Hadoop are side-effects of the platform's greatest strength. Because Hadoop accepts practically any kind of data, it stores information in far more diverse formats than what is typically found in the tidy rows and columns of a traditional database. Some good examples are machine-generated and log data, written out in a number of different formats, such as JSON, Avro and ORC.
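To make that diversity concrete, here's a rough sketch of what simply reading a few of those formats looks like in PySpark. The paths here are purely for illustration, and reading Avro assumes the spark-avro package is available on the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-survey").getOrCreate()

# Illustrative paths; each format needs its own reader.
clicks  = spark.read.json("/data/raw/clickstream/")            # newline-delimited JSON logs
events  = spark.read.format("avro").load("/data/raw/events/")  # assumes the spark-avro package
metrics = spark.read.orc("/data/warehouse/metrics/")           # columnar ORC files

# Even "just looking" at the data means inspecting three differently inferred schemas.
for df in (clicks, events, metrics):
    df.printSchema()
```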

This is a challenge both for organizations trying to expand the use of current Hadoop implementations and for companies in the process of deploying their first Hadoop cluster and set of use cases. In traditional analysis environments involving a data warehouse, the preparation process is rigid and slow but well-established: I talk to customers all the time who have teams of people solely focused on data discovery and cleaning for the data warehouse. More organizations are moving to Hadoop in an effort to escape the rigidity associated with data warehouses and to perform new types of analysis.

The overwhelming majority of data preparation work in Hadoop is currently done by writing code in languages and tools like Hive, Pig, or Python. That means there is a high technical barrier for individuals to participate in the preparation process, and even the most proficient technologists are forced into the painful process of manually creating new scripts for each inbound dataset or use case. To provide a little more context for the work that makes up "data wrangling," let's take a look at the typical stages of this process (a rough sketch of one such script follows the list):

  • Discovering what is actually in each of the data sets to determine how they can be used as part of an analysis.
  • Assessing the data quickly to ensure its fit and accuracy for the analysis and to avoid erroneous analytic results downstream.
  • Shaping the data to ensure that it fits the requirements and is at the appropriate level of aggregation for the specific analysis being performed.
  • Enriching the content of an individual data set by joining multiple data sources together to provide additional context, or by creating derived fields with calculations that highlight business opportunities or gaps.
  • Distilling the data into the format required by the analysis or the downstream analytics tool being used.
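
To make these stages concrete, here's a rough, simplified sketch of what this work looks like today as a hand-written PySpark script. The paths, column names, and thresholds are invented for illustration; a real script would be rewritten for every new source or question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wrangle-orders").getOrCreate()

# Discover: load the raw file and inspect what is actually in it.
orders = spark.read.json("/data/raw/orders/")     # illustrative path
orders.printSchema()

# Assess: check for quality problems that would skew the analysis downstream.
orders.select(
    F.count("*").alias("rows"),
    F.sum(F.col("amount").isNull().cast("int")).alias("null_amounts"),
).show()

# Shape: drop bad records and cast fields to the types the analysis expects.
clean = (orders
         .filter(F.col("amount").isNotNull())
         .withColumn("amount", F.col("amount").cast("double"))
         .withColumn("order_date", F.to_date("order_ts")))

# Enrich: join in a second source for context and derive a new field.
customers = spark.read.orc("/data/warehouse/customers/")   # illustrative path
enriched = (clean.join(customers, "customer_id")
                 .withColumn("is_large_order", F.col("amount") > 500))

# Distill: aggregate to the level the downstream tool needs and write it out.
(enriched.groupBy("region", "order_date")
         .agg(F.sum("amount").alias("revenue"),
              F.count("*").alias("orders"))
         .write.mode("overwrite").parquet("/data/analytics/daily_revenue/"))
```

Each new dataset or analytic question typically means another one-off script along these lines, which is exactly the bottleneck described above.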

Unfortunately, even in the most experienced hands, these steps are almost never sequential. Data preparation work is usually painful and inefficient, requiring analysts to tediously repeat steps over and over before they can make any progress. The delays that result all too often threaten the ability of a Hadoop installation to deliver on its initial promise: robust, flexible analysis for data of all shapes and sizes.

Awareness of this issue is increasing as more and more enterprises adopt Hadoop. Indeed, many of them are discovering that data preparation is actually the most time-consuming portion of any data analysis project, taking up as much as 80% of the overall time.

The overall complexity of this process led our team at Trifacta to develop an entirely new approach to handling data preparation. It's an approach that enables front-line business users to do the work themselves, instead of being forced to turn to a technical resource who usually lacks context about both the data and the business issues influencing the analysis. To learn more, visit Trifacta's site or our partner page on MapR's site.

About Trifacta: Trifacta, the pioneer in data transformation, significantly enhances the value of an enterprise's big data by enabling users to easily transform raw, complex data into clean and structured formats for analysis. Visit Trifacta's website for more information.

