When we were in school and were given a problem to solve, we usually stopped working as soon as we found an answer and recorded it on our paper. That may be a fair approach for elementary school assignments, but it does not serve us well in higher education or in life. Unfortunately, many people carry this learned behavior into adulthood, at the university and on the job. Consequently, they miss opportunities for learning, discovery, recognition, and advancement.
In data science, we are trained to keep searching (at least, I hope that this is true) even after we find a model that appears to answer our business question accurately. Data scientists should continue searching for a better solution for at least four reasons, described below.
Please note that I am not advocating “paralysis of analysis”, where never-ending searches for new and better solutions become an excuse (or a behavior pattern) that prevents one from making a final decision. Good leaders know when an answer is “good enough”. We discussed this in a previous article: “Machine Unlearning – The Value of Imperfect Models.”
Here are four reasons why the result of your analytics modeling might be correct (according to some accuracy metric), but it might not be the right answer:
Your model may be underfit, due to stopping too soon in modeling the data set. Unless you are awesomely lucky, it is rare that your very first analytics model on a data set will be the best possible and most accurate model.
Your model may be overfit, due to overzealous emphasis on modeling the training data as accurately as possible. Even a broken clock is right twice a day! The analytics model must be validated against an independent test data set, and this validation must be repeated with each new improvement and iteration of your model. You can only demonstrate that you have found a minimum in the model’s MSE (mean squared error, or some other accuracy metric) if you produce “improved” models (with lower error on the training data) that have higher MSE on the independent test data. In other words, until you have discovered this turn-around (a local minimum) in the test data error curve for your sequence of models, you should keep searching your model space for improvements.
Your model may be biased. It may be “good enough” in some cases to find a local minimum in the MSE curve for the model that you are building, but did you find the global minimum? Can you prove it? It is one thing to build the model right, but quite another thing to build the right model. You may discover that you mistook a local minimum for the global minimum when you use different data variables in the model’s feature set, or a different algorithm, and find further improvements in the predictive accuracy of your analytics model. Human, algorithmic, or technical bias built into your modeling process may cause you to overlook these alternative model choices. One of the most common of these biases is confirmation bias. Using an evolutionary (genetic algorithm) approach is one way to reduce the risk of getting stuck in a local minimum and to improve your chances of finding the global minimum in your MSE curve.
Your model may be suffering from the false positive paradox. When the model’s false positive rate (the fraction of instances without the condition that the model nevertheless flags as positive) is greater than the rate of occurrence of positive instances in the data set, false positives will typically outnumber true positives in the predictive model outputs. In this case, the paradox can be stated in this way: “the majority of instances that have been identified as having the condition will in fact not have the condition.” This can be quite serious if the condition is a cancer diagnosis for a cancer-free patient or a terrorist-related arrest of an innocent airline passenger. Because much has already been written about the false positive paradox and its appearance in unbalanced data sets (data sets that have far more instances of the control class than of the tested class), we won’t delve into those issues here. However, we mention the false positive paradox for another reason. If your analytics model only tests for the rare class, you may too quickly find that the model has very few false negatives (i.e., a high percentage of the instances that have the condition are correctly labeled by the model as “having the condition”, and only a small percentage are incorrectly labeled as “not having the condition”). If you then base your model accuracy estimate (and your premature termination of analytics modeling) on the relatively low false negative rate, which might be appropriate in some circumstances, you may miss the existence of many false positives and the presence of the false positive paradox in your analytics experiment. Where appropriate, you should use a more complete accuracy estimator, such as a ROC curve or the F-score.
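To make the false positive paradox concrete, here is a minimal Python sketch. The function names (`expected_counts`, `summarize`) and the inputs in the usage example (100 positives among 100,000 instances, 99% sensitivity, 1% false positive rate) are illustrative assumptions, not values from any real data set. The sketch shows how a model can have very few false negatives and still produce mostly false alarms, and why precision and the F-score can expose what the false negative rate hides.

```python
def expected_counts(n_positive, n_negative, sensitivity, false_positive_rate):
    """Expected confusion-matrix counts for a binary classifier.

    All four arguments are hypothetical inputs chosen by the caller.
    """
    tp = sensitivity * n_positive            # positives correctly flagged
    fn = n_positive - tp                     # positives missed
    fp = false_positive_rate * n_negative    # negatives wrongly flagged
    tn = n_negative - fp                     # negatives correctly cleared
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Return several accuracy estimators computed from the four counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Illustrative, made-up scenario: a rare condition and a seemingly good model.
    counts = expected_counts(
        n_positive=100, n_negative=99_900,
        sensitivity=0.99, false_positive_rate=0.01,
    )
    for name, value in summarize(*counts).items():
        print(f"{name}: {value:.3f}")
```

With these assumed inputs, the false positive rate (1%) is ten times the rate of occurrence of the condition (0.1%), so roughly 999 false positives swamp 99 true positives. Accuracy and recall are both 99%, yet precision is only about 9% and the F1 score is about 0.17.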
To avoid premature analytics model termination, we need to test many models. To accomplish this effectively, it is helpful to apply an automated multi-model approach that goes beyond identifying just the one supposedly "best" model. Dr. Mo, the software product of Soft10 Inc., can carry out fast search and evaluation of a large number of predictive analytics models, even on large databases. Greater power and efficiency can then be achieved by combining such multi-modeling approaches with the fast big data analytics platforms available through MapR, including Storm or Spark Streaming and in-memory analytics applications on Hadoop. The MapR Sandbox provides the fastest on-ramp to Hadoop for all your analytics modeling needs, especially when you need to look beyond an accurate answer and thus need to dig deeper in order to find the right answers to your analytics questions.
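Once many candidate models have been scored, the search for the turn-around in the test data error curve (described under the overfitting reason above) is simple to automate. Below is a minimal sketch under stated assumptions: `find_turnaround` is a hypothetical helper, not part of any product mentioned here, and the MSE values are made-up illustrative numbers, not results from a real experiment. It assumes the candidate models are ordered from simplest to most complex and that each one has been scored on the same independent test data set.

```python
from typing import Optional, Sequence


def find_turnaround(test_errors: Sequence[float]) -> Optional[int]:
    """Return the index of the model just before the test error first rises.

    Returns None if the test error never rises across the sequence, which
    means no turn-around has been observed yet and the search should go on.
    Note: this only detects a local minimum, not necessarily the global one.
    """
    for i in range(len(test_errors) - 1):
        if test_errors[i + 1] > test_errors[i]:
            return i
    return None


if __name__ == "__main__":
    # Illustrative, made-up errors for six models of increasing complexity.
    train_mse = [4.0, 2.5, 1.6, 1.1, 0.7, 0.4]   # keeps improving
    test_mse = [4.2, 2.9, 2.1, 1.9, 2.3, 3.0]    # improves, then turns around

    best = find_turnaround(test_mse)
    if best is None:
        print("No turn-around yet: keep searching the model space.")
    else:
        print(f"Test error turns around after model {best} "
              f"(train MSE {train_mse[best]}, test MSE {test_mse[best]}).")
```

If the search had stopped after the first four models in this made-up sequence, `find_turnaround` would return `None`, because the test error would still be falling. That is the signal to keep going rather than to declare the latest model the answer.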