I recently wrote an article about how to get a Big Data initiative back on track. In the comments section, a user challenged how such an initiative differs from a traditional analytics project. That’s a good question, and the answer is not immediately obvious to most. The key differentiators are not Hadoop, NoSQL, large datasets or any of the usual suspects. The difference is that it affects all organizational datasets, the long-term flow of data, and how data is processed and stored.
A lot of people dislike the term Big Data, especially since marketing has affixed it to everything under the sun in order to sell it. A good way to highlight the difference between Big Data and garden-variety analytics projects is to think of it as a long-term and comprehensive change to organizational data. Tools like Hadoop permit you to store and process all your data. Exploiting that capability to the fullest means rethinking how you manage all your data. You could call them Strategic Data initiatives instead of Big Data initiatives.
At Big Data Partnership, we engage with clients from various contexts. For example, some come with a specific challenge like offloading ETL from expensive RDBMS appliances to reduce time and cost and to improve reporting. On the back of that conversation, additional desires emerge, like deeper analytics or intelligent recommendation engines. And as the conversation continues, more and more data needs to be made available in a platform like Hadoop to service the demand. Eventually, the point is reached where an organisation has to take a step back, look at its data flow, and realise that it makes no sense to keep shipping large datasets back and forth. It would be better to have a central store, a point of truth.
Large organisations occasionally come with exactly this request as the principal requirement. They have fragmented data stores from acquisitions and organic growth. They have to cope with various data marts and data warehouses, legacy systems with curious data formats, and a tradition of deleting data. When these systems were installed, storing data was expensive, so they mutate the little data they keep in irreversible and often untracked ways. In some industries like finance, this is particularly undesirable, but all industries can benefit from changing these historically grown practices.
Big Data initiatives extend beyond the immediate use cases on a longer time scale. They are strategic data initiatives that impact data governance, storage and processing organization-wide. Such an effort is not easy, and it takes a well-planned, phased approach with enough flexibility to handle the unknowns that will inevitably surface during execution in such a complex environment.
The first step is to identify the organization’s raw data at its periphery. This data, usually events, transactions, or logs, should be treated as immutable and stored unprocessed. This essential point future-proofs your investment. The idea is not new and has been part of the Lambda Architecture, a somewhat related architecture pattern. The collected raw data enables an organization to recover from disasters, processing errors, human mistakes, and bugs, and it provides future value. At any time, you can run what-if scenarios or apply retrospectively improved algorithms that leverage the data better or exploit previously unused signals.
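The pattern can be sketched in a few lines of Python: raw events are written append-only and never mutated, while derived views are recomputed from scratch whenever the algorithm improves. This is a minimal illustration, not a production design; all names and the JSON-lines layout are assumptions for the sketch.

```python
import json
import tempfile
from pathlib import Path

class RawEventStore:
    """Append-only store for immutable raw events (hypothetical sketch)."""

    def __init__(self, directory):
        self.path = Path(directory) / "events.jsonl"

    def append(self, event):
        # Append-only: events are written once and never mutated.
        with self.path.open("a") as f:
            f.write(json.dumps(event) + "\n")

    def scan(self):
        # Replay every raw event, e.g. for a what-if scenario or a
        # retrospectively improved algorithm.
        with self.path.open() as f:
            for line in f:
                yield json.loads(line)

def total_per_user(store):
    # A derived view; it can be recomputed at any time from the raw events,
    # so a bug here never corrupts the underlying data.
    totals = {}
    for event in store.scan():
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    return totals

store = RawEventStore(tempfile.mkdtemp())
store.append({"user": "alice", "amount": 10})
store.append({"user": "alice", "amount": 5})
store.append({"user": "bob", "amount": 7})
print(total_per_user(store))  # {'alice': 15, 'bob': 7}
```

Because the raw events survive unmodified, replacing `total_per_user` with a smarter aggregation later only requires another full scan, which is exactly the batch-recomputation idea the Lambda Architecture builds on.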
The second aspect is to centralise your data storage and processing on a core platform like Hadoop. The dataset is likely large and has gravity and inertia: smaller datasets will increasingly be collocated with it so they can be joined easily, and processing all of the data is best done in place. The additional benefit is the establishment of a trusted master dataset. A core issue with fragmented IT landscapes is overlapping information held by departments and systems that disagree with one another. It stifles an organisation and its decision makers with uncertainty and errors.
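The value of collocation is easy to demonstrate: once records from two formerly separate systems sit side by side, a plain join exposes exactly where they disagree. The sketch below assumes two hypothetical departmental datasets keyed by customer id; the names and fields are invented for illustration.

```python
# Two departmental copies of customer data, now collocated on one platform
# (all names and fields are hypothetical).
crm = {
    "c1": {"email": "a@example.com"},
    "c2": {"email": "b@example.com"},
}
billing = {
    "c1": {"email": "a@example.com"},
    "c2": {"email": "b@old-domain.com"},  # stale copy in the billing system
}

def conflicting_records(left, right, field):
    # Inner join on customer id; report records where the systems disagree.
    return {
        cid: (left[cid][field], right[cid][field])
        for cid in left.keys() & right.keys()
        if left[cid][field] != right[cid][field]
    }

print(conflicting_records(crm, billing, "email"))
# {'c2': ('b@example.com', 'b@old-domain.com')}
```

Surfacing and resolving such conflicts is what turns a central store into a trusted master dataset rather than just another copy.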
So when we talk about Big Data initiatives, we quickly end up talking about strategic data and how the organisation and its approach to data are about to change. The effort, risk and benefits are on a different scale compared to an analytics project, and as such, it is important to get it right the first time. This can be a significant challenge for organisations, since they are dealing with new technologies and patterns of which they have little in-house knowledge.
Luckily, governance and security are rapidly evolving topics in the Big Data ecosystem. Today, we have a growing number of tools available, e.g., to provide compliance with HIPAA or PCI. Increasingly, the risk is failing to plan and adopt now and being left behind by others in the industry. An example is a financial company which has been doing an excellent job in its sub-vertical, a settled market, and for a very long time stored offline only the minimal amount of data required by regulations. It and everyone else in the market ignored years of rich datasets which, when stored and correctly analysed, could have tremendous value to its clients as entirely new products. In a competitive market, acting on such a realisation unlocks untapped potential for growth.