The d’Artagnan of Hadoop (Spoiler Alert: Data Governance for Hadoop)

“Dad, why do they call it ‘The Three Musketeers’ when it’s all about d’Artagnan?” asked my son after we finished watching the movie. He had a point: d’Artagnan is the true hero of the story, and without him there would have been no adventures.

This brings me to big data. As in the story of the Three Musketeers, the true heroes are not the first three Vs – Volume (undoubtedly, the portly Porthos), Velocity (fast-talking and fast-thinking Aramis) and Variety (Athos – a man of many moods and faces). Rather, it is the fourth V – Veracity (d’Artagnan) – that saves the day.

Another interesting point is that at the beginning of the story, d’Artagnan as an aspiring Musketeer is not taken seriously by Porthos, Athos, and Aramis. However, d’Artagnan not only proves his value right away, but he is the one who thwarts Cardinal Richelieu’s conspiracy. Similarly, in the early days of Hadoop, the other three Vs overshadowed Veracity, but now Veracity (i.e., Data Governance) is becoming the “d’Artagnan of Hadoop.”

So what does Data Governance mean for Hadoop?

First of all, let’s acknowledge that organizations already have data governance policies in place and have implemented them in enterprise databases, business applications, and end-user tools. But what about Hadoop? Does data governance apply to Hadoop?

Let’s take an example. A key use case for Hadoop is exploratory analytics, where data engineers and data scientists explore, wrangle, and analyze data in a data lake to uncover new insight and business value. During that process, internal and external data are mashed together to build predictive models as well as visualize data. But simply loading data in Hadoop and putting self-service tools in the hands of users could violate multiple data governance policies.

For instance, data governance policies related to secure access control could be violated if permissions for what users can see and do with the data aren’t enforced. This is particularly risky given that Hadoop allows for so much data to be consolidated and analyzed in one place.  The latter is a tremendous benefit of Hadoop, but there is a need to balance ubiquitous self-service with security.

Furthermore, consider regulatory compliance. For example, HIPAA (the Health Insurance Portability and Accountability Act) demands that sensitive data (e.g., personally identifiable information) be protected: the data has to be encrypted or masked (e.g., protecting Social Security numbers). This means that sensitive data going into Hadoop must be identified and protected before users can get to it. During the exploratory analytics process, for example, users shouldn’t be able to see sensitive data, or derive it by deducing identifiable information from multiple data sets.
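To make the masking requirement concrete, here is a minimal Python sketch of masking Social Security numbers in a record before it is exposed to users. It is an illustration under stated assumptions, not a compliance-grade implementation: the function names (mask_ssn, mask_record), the field names, and the assumption that SSNs appear in the 123-45-6789 format are all hypothetical and are not taken from any product mentioned in this post.

```python
import re

# Assumption: SSNs are written as three digits, two digits, four digits,
# separated by hyphens. Real data will need broader detection rules.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Mask every SSN-like token, keeping only the last four digits."""
    return SSN_PATTERN.sub(r"XXX-XX-\1", text)


def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with the named fields masked.

    Fields not listed in sensitive_fields are left untouched.
    """
    masked = dict(record)
    for name in sensitive_fields:
        value = masked.get(name)
        if value is not None:
            masked[name] = mask_ssn(str(value))
    return masked


# Hypothetical record, for illustration only.
row = {"name": "Jane Doe", "ssn": "123-45-6789", "notes": "called re: 987-65-4321"}
safe_row = mask_record(row, {"ssn", "notes"})
```

Note that masking individual fields does not by itself prevent users from deducing identities by joining multiple data sets; that risk needs separate controls.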

Another regulation is Basel III, which focuses on risk data aggregation. This requires financial data to be consolidated in a single view of risk exposure for use in financial reporting. Such consolidation has to be “certified,” which means that an organization has to show what the data means (metadata), where the data came from and what was done to it (i.e., data lineage), who did it (i.e., authorized user or owner), and that it is trustworthy (i.e., data quality). So, doing this kind of consolidation in Hadoop has to be auditable.
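As a rough illustration of what “auditable” could mean in practice, the Python sketch below records, for each derived data set, where the data came from, what was done to it, who did it, and whether a quality check was run, and then traces a consolidated view back to its sources. The class names (LineageEvent, AuditLog), the field names, and the data set names are hypothetical; this is a sketch of the idea, not the mechanism used by any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    dataset: str           # the data set that was produced
    sources: list          # where the data came from
    operation: str         # what was done to it
    user: str              # who did it
    quality_checked: bool  # whether a data quality check was run
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class AuditLog:
    """In-memory log; a real system would persist events durably."""

    def __init__(self):
        self._events = []

    def record(self, event: LineageEvent) -> None:
        self._events.append(event)

    def trace(self, dataset: str) -> list:
        """Collect every event upstream of the given data set."""
        seen, stack, result = set(), [dataset], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            for event in self._events:
                if event.dataset == current:
                    result.append(event)
                    stack.extend(event.sources)
        return result


# Hypothetical pipeline: raw trades -> cleaned trades -> consolidated risk view.
log = AuditLog()
log.record(LineageEvent("cleaned_trades", ["raw_trades"], "deduplicate", "alice", True))
log.record(LineageEvent("risk_view", ["cleaned_trades"], "aggregate", "bob", True))
history = log.trace("risk_view")
```

An auditor could then inspect history to see each step, its author, and whether quality was checked along the way.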

In summary, there are key technical capabilities that are needed in Hadoop in order to comply with data governance policies. A Gartner post provides a good summary of these capabilities:

“Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.” - Gartner Says Beware of the Data Lake Fallacy

As pointed out, metadata, data lineage, and data quality are critical capabilities you need to have in order to comply with regulatory and data governance policies. However, today it is challenging to implement these capabilities in Hadoop. Auditing data lineage requires custom coding and stitching to instrument the data pipeline across different programs and tools; this approach has a high total cost of ownership (TCO). Data quality can be assessed and data can be cleansed using a data prep tool or by hand in R, for instance, but only one file at a time. Considering that a data lake can contain millions of fields of data, this approach also has a high TCO and isn’t very scalable. Last but not least, defining business and compliance metadata is a manual process of tagging each field with a definition and annotation. This approach is time consuming because of the sheer size of the data lake, as well as the fact that people have to ask around to find someone who knows what individual fields mean. Moreover, this approach lacks a governed process to create an approved business glossary, without which it is not possible to trust the meaning of the data.
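To give a sense of what a field-level data quality assessment involves, here is a minimal Python sketch that profiles each field of a CSV file for missing and distinct values. The function name profile_fields, the chosen metrics, and the sample data are assumptions made for illustration. It also assumes well-formed CSV input, and it runs on a single file, which is exactly the one-file-at-a-time limitation described above.

```python
import csv
import io
from collections import defaultdict


def profile_fields(rows):
    """Compute simple per-field quality statistics.

    rows is any iterable of dicts mapping field name to string value,
    such as a csv.DictReader.
    """
    stats = defaultdict(lambda: {"total": 0, "missing": 0, "distinct": set()})
    for row in rows:
        for name, value in row.items():
            field_stats = stats[name]
            field_stats["total"] += 1
            if value is None or value.strip() == "":
                field_stats["missing"] += 1
            else:
                field_stats["distinct"].add(value)
    return {
        name: {
            "total": s["total"],
            "missing_ratio": s["missing"] / s["total"],
            "distinct": len(s["distinct"]),
        }
        for name, s in stats.items()
    }


# Hypothetical sample data; in practice this would be a file in the data lake.
sample = "id,region\n1,EMEA\n2,\n3,EMEA\n"
profile = profile_fields(csv.DictReader(io.StringIO(sample)))
```

Here the region field has one missing value out of three and a single distinct value, the kind of signal a data steward might use to flag a field for review.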

MapR and Waterline Data are partnering to bring Data Governance to Hadoop. The solution automates the discovery of technical, business and compliance metadata across the data lake and enables self-service with data governance:

  • Achieve Compliance. Automatically discover business and compliance metadata, audit history and data lineage, and provide secure self-service to the data
  • Build Your Metadata and Business Glossary. Automated field-level metadata discovery, business glossary, and ontology crowdsourcing
  • Ensure Data Quality. Field-level data quality assessment and inspection

In conclusion, to quote King Louis: “This world is an uncertain realm, filled with danger…  But there are those… who dedicate their lives to truth, honor, and freedom. These men are known as Musketeers. Rise, d’Artagnan, and join them.”

Learn more about how Waterline Data will help you find, understand, and govern data in Hadoop.

Download the free Waterline Data on MapR sandbox and tutorials.

