Gigaom Research recently released a new report titled “Extending Hadoop Towards the Data Lake.” According to the report, early data lake adopters are integrating Hadoop into organizational workflows and are addressing challenges around the cleanliness, validity, and protection of their data. The research yielded several key findings:
- As Hadoop continues to expand its capabilities, its potential as a basis for data lakes becomes more compelling.
- Operational workloads place very different demands on IT infrastructure than analytical batch-processing workloads do.
- A Hadoop-based data lake augments rather than replaces existing IT systems such as the enterprise data warehouse.
As suggested above, the report notes that if the Hadoop-based data lake is to become a key enterprise resource, it must be able to handle operational workloads, in which far smaller volumes of data are both read from and written to the file system. This validates the claim we’ve been making for a couple of years now: Hadoop and NoSQL must work tightly together for many emerging use cases. The report specifically calls out several technologies that enable operational processing in a Hadoop-based data lake, including our own MapR-DB, as well as other technologies MapR supports, such as Apache HBase and Apache Spark.
A key message in the report is that Hadoop is best suited to augmenting existing systems rather than replacing what you already have. Rather than redoing all of the processes, technologies, and best practices you’ve already put in place, look for the specific tasks Hadoop can take on that add the most value to your enterprise architecture.
The Gigaom report also lists security issues and the occasional failures and restarts of open-source Hadoop as real challenges for organizations, research labs, and university classrooms. Gigaom praised MapR for “placing an early emphasis on hardening its Hadoop distribution, replacing the file system and improving security and reliability throughout the product.”
Finally, when deploying a data lake, there’s always the risk of haphazardly throwing more and more data and workloads into it. This might result from early successes of your data lake and a growing trust in it, or from simple data dumping. The end result could be pollution in your lake that prevents you from getting the value you initially sought. Gigaom reports that the major Hadoop vendors recognize these risks, and suggests that you talk to the vendors to better understand how their products can help you keep your data lake pollution-free.