About 13 years ago, Doug Laney of the META Group (now Gartner) wrote an amazing report that showed both great insight and great foresight. The paper’s title was “3D Data Management: Controlling Data Volume, Velocity, and Variety.” The 3 V’s of big data were born on that day—February 6, 2001. My only not-so-serious quibble with the paper is that he should have started the title this way: “3V Data Management…” Nevertheless, from that point forward, the big data game was officially on!
Since Doug Laney’s landmark white paper, and especially during the past two years, the big data hype has ratcheted up, frequently around “the 3 V’s” of volume, velocity, and variety. For example, an article in summer 2012 (that has received nearly 10,000 page views to date) again addressed these three challenge areas: “The 3 V’s That Define Big Data.” That this article has been viewed by so many is an indication that many of us at that time (and since) were trying to figure out what this V-based characterization of big data was all about. I was especially pleased to see that the first petascale project mentioned in the 2012 article is the big data-producing astronomy project LSST (Large Synoptic Survey Telescope), a project that I have been supporting in different roles for the past 10 years!
Sidebar: Here is a short description of LSST, to illustrate its three V big data challenges in context. LSST will generate 30 terabytes of image data per night, every night, for 10 years. By the end of the 10-year sky survey, the final image archive will reach 100-200 petabytes, along with a 20- to 40-petabyte queryable database of astronomical source information (i.e., features extracted from the many millions of images obtained during that decade). Volume is a big deal here, but the data velocity may be the biggest challenge. In particular, LSST’s superb camera will obtain one 3-gigapixel (6-gigabyte) image every 20 seconds, every night, for those 10 years. Within 60 seconds, the image (actually a pair of images, to remove instrumental artifacts) needs to be processed, and all objects in the image pair that have changed in any way (via movement or flux variation) must be reported to the worldwide astronomical community. It is anticipated that each image pair (every 40 seconds) will generate several thousand alerts, which corresponds to about one to ten million alerts being sent out via individual email and text messages (just kidding!) every single night for 10 years. That is fast, high-velocity information! The ability to mine, characterize, classify, and respond to this rapid avalanche of alerts will be an enormous challenge to the researchers seeking to make astronomical discoveries from this data fire hose. Furthermore, each of the 50 billion objects in the LSST survey will be observed roughly 1000 times over the 10-year project duration, and each observation (40-second image pair) will yield about 200 unique scientific features measured for each object. Consequently, the final completed database of 50 billion astronomical objects will contain approximately 200,000 dimensions of information per object (roughly 1000 observations times 200 features per observation)! That is a forceful example of high-variety big data!
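The sidebar's figures can be sanity-checked with a little arithmetic. What follows is only a rough Python sketch, not an official LSST calculation. The article supplies the 40-second image-pair cadence, the 30 terabytes per night, the 10-year survey, and the 1000 observations and 200 features per object. Everything else is an assumption made for illustration: a 10-hour observing night, 365 observing nights per year, decimal units (1 petabyte = 1000 terabytes), and a reading of "several thousand alerts" as anywhere between 1,000 and 10,000 per image pair.

```python
# Back-of-the-envelope check of the LSST figures quoted in the sidebar.
# Values marked ASSUMPTION are illustrative and do not come from the article.

SECONDS_PER_IMAGE_PAIR = 40              # from the text: one image pair every 40 seconds
NIGHTLY_VOLUME_TB = 30                   # from the text: 30 terabytes per night
SURVEY_YEARS = 10                        # from the text: 10-year survey
OBSERVATIONS_PER_OBJECT = 1000           # from the text: roughly 1000 observations
FEATURES_PER_OBSERVATION = 200           # from the text: about 200 features each

NIGHT_HOURS = 10                         # ASSUMPTION: length of an observing night
NIGHTS_PER_YEAR = 365                    # ASSUMPTION: taking "every night" literally
ALERTS_PER_PAIR_RANGE = (1_000, 10_000)  # ASSUMPTION: bounds for "several thousand"


def image_pairs_per_night(night_hours=NIGHT_HOURS):
    """Number of 40-second image pairs that fit into one observing night."""
    return night_hours * 3600 // SECONDS_PER_IMAGE_PAIR


def nightly_alert_range():
    """(low, high) estimate of the alerts issued in one night."""
    pairs = image_pairs_per_night()
    low, high = ALERTS_PER_PAIR_RANGE
    return (pairs * low, pairs * high)


def survey_volume_pb():
    """Image data accumulated over the whole survey, in petabytes (1 PB = 1000 TB)."""
    return NIGHTLY_VOLUME_TB * NIGHTS_PER_YEAR * SURVEY_YEARS / 1000


def dimensions_per_object():
    """Features accumulated per object across all of its observations."""
    return OBSERVATIONS_PER_OBJECT * FEATURES_PER_OBSERVATION


if __name__ == "__main__":
    print("image pairs per night:", image_pairs_per_night())
    print("alerts per night (low, high):", nightly_alert_range())
    print("survey image volume (PB):", survey_volume_pb())
    print("dimensions per object:", dimensions_per_object())
```

Under these assumptions the sketch gives 900 image pairs per night, between 900,000 and 9,000,000 alerts per night, about 109.5 petabytes of accumulated image data, and 200,000 dimensions per object. Those results are in line with the "one to ten million alerts", the low end of the "100-200 petabytes" archive, and the "200,000 dimensions" quoted above. Changing the assumed night length or alert rate shifts the estimates proportionally.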
A slightly dated article that describes these many challenges of astronomical-scale big data mining is available here: “Scientific Data Mining in Astronomy.” A shorter but more dated article is here: “A Machine Learning Classification Broker for Petascale Mining of Large-Scale Astronomy Sky Survey Databases.”
Getting back to our main point: much ado has been made about the 3 V’s of big data in the past few years, with many articles, opinions, criticisms, and hyped messages having been delivered on that subject. Objectively, the main point of the V-based characterization of big data is to highlight its most serious challenges: the capture, cleaning, curation, integration, storage, processing, indexing, search, sharing, transfer, mining, analysis, and visualization of large volumes of fast-moving, highly complex data. The good news is that there are many big data solutions to help out, including the MapR M7 Enterprise Database Edition for Hadoop, which received the highest ranking among big data deployments, according to a recent Forrester report. So, there is help for coping with the first 3 V’s.
It did not take long (after the big data 3 V’s began getting a lot of attention in 2012) for many of us (including yours truly) to add some more V’s to the characterization of big data. This was both fortunate and unfortunate. It was unfortunate in the sense that many perceived this as a joke or simply more fuel for the big data hype machine, and consequently many serious discussions were dismissed or ignored. On the other hand, the addition of more V’s was fortunate, in the sense that big data’s “first responders” were encountering these additional challenges with this huge data avalanche, and by adding new V’s to the list of big data challenges, they were providing valuable lessons learned and best practices for the rest of us.
So, what are the V’s representing big data’s biggest challenges? I list below ten (including Doug Laney’s initial 3 V’s) that I have encountered and/or contributed. These V-based characterizations represent ten different challenges associated with the main tasks involving big data (as mentioned earlier: capture, cleaning, curation, integration, storage, processing, indexing, search, sharing, transfer, mining, analysis, and visualization).
- Volume: = lots of data (which I have labeled “Tonnabytes”, to suggest that the actual numerical scale at which the data volume becomes challenging in a particular setting is domain-specific, but we all agree that we are now dealing with a “ton of bytes”).
- Variety: = complexity, thousands or more features per data item, the curse of dimensionality, combinatorial explosion, many data types, and many data formats.
- Velocity: = high rate of data and information flowing into and out of our systems, real-time, incoming!
- Veracity: = necessary and sufficient data to test many different hypotheses, vast training samples for rich micro-scale model-building and model validation, micro-grained “truth” about every object in your data collection, thereby empowering “whole-population analytics”.
- Validity: = data quality, governance, master data management (MDM) on massive, diverse, distributed, heterogeneous, “unclean” data collections.
- Value: = the all-important V, characterizing the business value, ROI, and potential of big data to transform your organization from top to bottom (including the bottom line).
- Variability: = dynamic, evolving, spatiotemporal data, time series, seasonal, and any other type of non-static behavior in your data sources, customers, objects of study, etc.
- Venue: = distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud.
- Vocabulary: = schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance.
- Vagueness: = confusion over the meaning of big data (Is it Hadoop? Is it something that we’ve always had? What’s new about it? What are the tools? Which tools should I use? etc.). Note: I give credit here to Venkat Krishnamurthy (Director of Product Management at YarcData) for introducing this new “V” at the Big Data Innovation Summit in Santa Clara on June 9, 2014.
This top 10 list is no joke, unlike the top 10 lists of a certain late-night talk show on US television. This list of 10 V’s of big data may seem contrived or not very serious, but it is presented here with serious intent—to identify some of the biggest challenges of big data (using the V merely as a mnemonic device to label and recall these all-important issues). Let us just be grateful that, unlike college-bound students who are preparing for their standardized tests, we do not have 100 V’s to cope with.