Identifying Bias in Data: The Importance of Creating a “Data Block Program”

When we read “data journalism” articles, it often appears that journalists are walking a perilous line. In many cases, they are working with data provided by its creators, previous work with that data is scant to non-existent, and the fact checkers are simply making sure that the article correctly describes the data as provided.

In these cases, we must question the accuracy and validity of the data. Consider the Veterans Administration (VA), for example. Data reporting was corrupt at the highest levels of the organization. How have things changed for veterans since the scandal occurred? If you want to find out whether the VA's quality of service is getting better, what do you compare it to?

The federal government releases more and more information every day in an effort to appear more transparent, and state organizations are doing the same. Even inside our own organization, we are increasingly reliant on shared data. How useful is all this information? Should we be using it at all? Do we know the risks?

Typically, information passed along or acquired from third parties comes with more assurance. A provider is paid, and access to the data is granted. The money paid to the provider helps cover the cost of “curating” that data, which includes some level of process management, validation, and so on. What is often overlooked is that, in addition to making sure the data is valid, the provider is delivering it with an understood level of bias and completeness. If they did not, the data would have little or no value, and no one would buy it.

Open government data does not come with these same guarantees. There is no well-defined method for collecting the data, the process the data goes through is obscured, and the context of the data in a broader system is missing. This is becoming more and more common, not just with open government data, but with data we may be leveraging to make business, healthcare, and educational decisions.

What is needed, for any data set, is a methodology for retroactively detecting bias or error, and for critiquing data on its value and its ability to convey actionable knowledge. Is this possible? During our talk at last week's Strata+Hadoop World in New York, titled “Fixing Chicago's Crime Data,” we discussed a new approach to human data interaction. This approach has become necessary because of how easy it now is to collect, assemble, and consume data. Rules that applied to data ten years ago no longer apply today.

One phrase we all need to be suspicious of is “lies, damned lies, and statistics,” which is often used to cast doubt on statistics that support an opponent's point. This line of thinking can lead to complacency or exhaustion; that is human nature. If you do not know where to focus your efforts in order to reduce risk, your focus will be everywhere, and you will soon become exhausted. On the other hand, a feeling of “helplessness” and inability to accomplish a goal often leads to doing nothing.

What we are proposing is a process and set of tools to identify bias in data, as well as to score data on its understood bias, utilization, and understanding. This methodology can be used on both large public data sets and internal private data sets. By employing this method, we seek to drive down the risk of using data. It also gives us an idea of where to focus our attention in order to prevent exhaustion and complacency. Lastly, it provides a collaborative, social approach.
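To make the idea concrete, here is a minimal sketch, in Python, of what a data scorecard along these lines might look like. The dimensions, the equal weighting, and names such as DatasetScorecard and composite_score are illustrative assumptions for this post, not the actual tools we presented.

```python
from dataclasses import dataclass

@dataclass
class DatasetScorecard:
    """Illustrative scorecard: each dimension is rated from 0.0 (poor) to 1.0 (strong)."""
    name: str
    bias_understood: float   # how well the data's known biases are documented
    utilization: float       # how widely and actively the data set is used
    understanding: float     # how well consumers understand its collection process and context

    def composite_score(self) -> float:
        # Equal weighting is an assumption; in practice, weights would be set collaboratively.
        return round((self.bias_understood + self.utilization + self.understanding) / 3, 2)

# Example: comparing an open government data set with a curated commercial feed.
crime_data = DatasetScorecard("city_crime_reports",
                              bias_understood=0.4, utilization=0.8, understanding=0.3)
vendor_feed = DatasetScorecard("commercial_market_feed",
                               bias_understood=0.7, utilization=0.6, understanding=0.8)

for ds in (crime_data, vendor_feed):
    print(f"{ds.name}: composite score = {ds.composite_score()}")
```

A low composite score does not mean the data is unusable; it flags where to focus attention, for example on documenting known biases, before the data is used to drive decisions.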

This approach is very similar to a small community: the parents know everyone in the community, and everyone in the community knows the children. Children are often allowed to play more freely in these communities. Parents, by virtue of knowing everyone, have an understanding of risk for their children. If that community grows dramatically, parents are not likely to give their kids as much freedom. The risk may not be any greater, but they cannot know this for sure. If someone moves in and creates a good neighborhood block program, things can change again. The community can easily meet each other, rely on each other as resources, and get to know each other better. As a result, children could be allowed to play freely again.

In summary, it is vital that we develop a way to improve the integrity of the data we are using. We need a process and set of tools to identify bias in data, as well as to score data based on its bias, utilization, and understanding.


