Someone once said “if you can’t measure something, you can’t understand it.” Another version of this belief says: “If you can’t measure it, it doesn’t exist.” This is a fallacy; in fact, it is sometimes called the McNamara fallacy. This mindset can have dire consequences in national affairs as well as in personal medical treatment (such as the application of “progression-free survival” metrics in cancer patients, where the reduction in tumors is lauded as a victory while the corresponding reduction in quality of life is ignored).
Similarly, in the world of data science and analytics, we are often drawn into this same way of thinking. Quantitative data are the ready-made inputs to our mathematical models. The siren call of quantifiable predictive and prescriptive models is difficult to resist. If the outputs from our models are quantitative (e.g., accuracy, precision, recall, or some other validation metric), then why not also the inputs to our models? Isn’t that the essence of being data-driven?
What we overlook when we say “data-driven” is that we really mean to say “evidence-based”. Evidence is not only quantitative. Similarly, data are not only quantitative. Consequently, what we miss in the rush to be more quantitative is the enormous value of qualitative data sets. The value of qualitative data comes in several ways, including:
- it provides additional features that can improve the accuracy, usability, and explanatory power of our analytic models;
- it puts the quantitative data into proper context (which prevents the misapplication of our model in the wrong context);
- it helps build the human story, the narrative, and the acceptance (and ultimately advocacy) of our model results; and
- as the name implies, it helps us to assess (even validate) the quality of our analytic results.
We will explore these ideas by responding to four basic questions related to qualitative data:
1. What are some ways that we encounter qualitative data?
Qualitative data can come from surveys, customer response forms, documents, and even social media. These are invaluable sources of information that organizations already collect and exploit for important insights. Historically, the analysis of qualitative data tended to be very human-intensive, since we could not simply submit a database query against a document and get back numbers that we could feed into a visualization. Consequently, historical qualitative data analyses were typically limited in scope. That situation is now rapidly changing, however. Increasingly clever methods are transforming qualitative data into quantitative data, thereby unleashing the full power of quantitative analytics on qualitative data as well. These transformation methods include scoring (assigning a numerical rank or score to specific qualitative responses or comments), sentiment analysis (assigning a positive or negative value to the sentiment being expressed in the qualitative data, and then a numerical value to the strength of that sentiment), text analytics (summarizing the content of textual information in quantitative ways, such as topic models and heat maps), and natural language and semantic processing (extracting meaning from the language, whether written or spoken). Consequently, qualitative data are already first-class citizens in the world of big data, and they should be given an equal opportunity to deliver business insights and value.
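To make the scoring and sentiment ideas concrete, here is a minimal Python sketch of lexicon-based sentiment scoring. The tiny word-weight lexicon is purely illustrative (a hypothetical handful of entries, not a real sentiment lexicon); real systems use much larger lexicons or learned models.

```python
# Minimal sketch of lexicon-based sentiment scoring. Each word in a
# small (hypothetical) lexicon carries a signed weight; a comment's
# sentiment score is the sum of the weights of the words it contains.
SENTIMENT_LEXICON = {
    "great": 2, "love": 2, "good": 1, "helpful": 1,
    "slow": -1, "poor": -1, "broken": -2, "terrible": -2,
}

def sentiment_score(comment: str) -> int:
    """Return a signed sentiment score for a free-text comment."""
    words = comment.lower().split()
    return sum(SENTIMENT_LEXICON.get(w.strip(".,!?"), 0) for w in words)

# A qualitative survey comment becomes a number we can chart or model.
print(sentiment_score("Love the product, but delivery was slow."))
```

Once comments are reduced to numbers like this, they can flow into the same dashboards and models as any other quantitative feature.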
2. What are some of the similarities and differences between qualitative data and quantitative data when it comes to deriving insights?
Qualitative data are, broadly speaking, data that are not quantitative, which means these data are unstructured and usually textual. They might come from customer surveys, response forms, online forums, feedback comment blocks on web forms, written comments, phone calls to call centers, anecdotal evidence (e.g., gathered by our sales force or marketing team), news reports, and so on. Consequently, the extraction of structure and objective insights from such data requires a model: How do we model the words, the comments, or the survey responses that we are collecting? How much weight do we assign to different content? How do we combine and integrate multiple sources? The answers to these questions are not really that different from the answers we give to the exact same questions when we handle quantitative data. The big difference is that quantitative data are already in a form that can be manipulated in spreadsheets, displayed in dashboards, or plotted on graphs. Some decisions need to be made (and they can be subjective) when deciding how to transform qualitative data into quantitative form. That is a challenge, but it is also a rich opportunity: there are far more subtleties and intricacies in language that we can use to extract deeper understanding and finer shades of meaning from our qualitative data sources about our customers, employees, and partners.
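As a small illustration of that subjectivity, here is one way (in Python) to quantify Likert-style survey responses. The numeric scale assigned to each response is a modeling choice, not a property of the data; the responses themselves are hypothetical.

```python
# Sketch: mapping ordinal survey responses onto numbers. Assigning
# 1..5 assumes the categories are equally spaced, which is a
# subjective modeling decision, not a fact about the responses.
LIKERT = {
    "strongly disagree": 1, "disagree": 2, "neutral": 3,
    "agree": 4, "strongly agree": 5,
}

responses = ["agree", "neutral", "strongly agree", "disagree"]
scores = [LIKERT[r] for r in responses]

# The average is only meaningful if we accept the equal-spacing assumption.
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

A different (equally defensible) scale, say 0/0/0/1/2, would yield a different average from the very same responses, which is exactly the kind of subjective decision the passage above describes.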
3. How are the analytic and statistical processes of data science different for qualitative data sets?
First, a richer set of transformations is needed than for quantitative data (where it might be sufficient to normalize the data on a zero-to-one scale, to combine the variables in some mathematical way, to assign numerical weights to different measurements before combining them, or to define a simple mathematical similarity or distance metric between different attributes). Validation of models also tends to be more straightforward in such quantitative analyses. Conversely, more sophisticated transformation and validation metrics must be used in qualitative data analysis, where it is harder to define a clear value for “right” and “wrong” (e.g., True Positive vs. False Positive), though logistic regression techniques are sufficient when the outputs are binary (e.g., is this social media user more likely to vote for political candidate A or B?). However, standard statistical tests that perform binary testing (hypothesis A vs. hypothesis B, or the null hypothesis) won’t work when there are many shades of meaning and many degrees of understanding embedded within qualitative data (i.e., many possible hypotheses that need testing). Link analysis is one possible approach to mining qualitative data: this technique can be used to discover and explore associations between multiple nodes in a complex knowledge network. Link analysis does not need quantitative data; in fact, it depends on the data being discrete rather than continuous numeric values, so here qualitative data have an advantage.
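To make the link-analysis idea concrete, here is a minimal Python sketch that builds a term co-occurrence graph from a handful of hypothetical customer comments. Nodes are discrete terms, and edge weights count how often two terms appear in the same comment; no continuous numeric measurements are required.

```python
from collections import Counter
from itertools import combinations

# Sketch of simple link analysis on qualitative data: each comment is
# reduced to a set of discrete terms (hypothetical examples below), and
# an edge between two terms is weighted by co-occurrence frequency.
comments = [
    {"billing", "support", "delay"},
    {"support", "delay"},
    {"billing", "refund"},
]

edges = Counter()
for terms in comments:
    for a, b in combinations(sorted(terms), 2):
        edges[(a, b)] += 1

# The most heavily weighted link suggests an association worth exploring.
print(edges.most_common(1))
```

In a real setting the term sets would come from the text-processing steps described earlier (e.g., topic or entity extraction), and the resulting knowledge network could be explored with graph algorithms such as centrality or community detection.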
4. Are qualitative data sets therefore going away, especially if we are simply transforming them and quantifying them (perhaps automatically) into quantitative data?
Qualitative data are not going away. In fact, that data type is probably growing faster than any other type of data that we are collecting in this big data era. But we will definitely see more quantification of qualitative data (as we already are) so that we can take advantage of the rich set of analytics algorithms and technologies now being spun out at a prodigious rate for quantitative data. Nevertheless, it is incorrect to say that qualitative data are no longer part of the picture after we quantify them. They are still one of the most important parts of our “data story” and data assets. We can’t run from them, nor should we try. Rather, we should try to make the best use of them, to build the best models to extract meaning and insights from them, and to continue the search for cleverer algorithms that allow us to quantify the massive volumes of qualitative data that we are collecting. In short, we need to collect, process, and mine big data “at scale”, and that includes both quantitative and qualitative data.
In summary, we can avoid fallacious thinking and give deeper contextual meaning to our data science activities when we can seamlessly aggregate, analyze, and mine both our quantitative and our qualitative data collections. This is most easily achievable within a converged “multi-lingual” data environment, on a single platform, with a shared set of analytic tools. Such convergence is now emerging in the big data ecosystem, specifically in the new converged data platform at MapR. The ability to store heterogeneous data across a distributed data architecture with Hadoop, the ability to query those data (databases, documents, text, JSON data objects, and more) across the data lake with Apache Drill, and the ability to mine these data in-memory and in real time with Apache Spark all bring us one step closer to the promise of Cognitive Analytics: asking the right question, at the right time, in the right context, across all of your data collections, both quantitative and qualitative.
Whatever industry or context you are working in, if you can extract data from it, you can understand it. That is both a qualitatively and a quantitatively correct way of thinking.