
Tips on Data Imputation From Machine Learning Experts

Missing values are a critical issue in statistics and machine learning (which is “advanced statistics”). Data imputation deals with ways to fill those missing values.

Andriy Burkov made this statement a few days ago [1]:

“The best way to fill a missing value of an attribute is to build a classifier (if the attribute is binary) or a regressor (if the attribute is real-valued) using other attributes as “X” and the attribute you want to fix as “y”.”
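For concreteness, here is a minimal sketch of that idea in Python with scikit-learn. The DataFrame `df`, the column name `target_col`, and the choice of a random forest are illustrative assumptions; for a binary attribute you would swap in a classifier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_regressor(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """Fill missing values of `target_col` by regressing it on the other columns."""
    known = df[df[target_col].notna()]      # rows where the attribute is observed
    unknown = df[df[target_col].isna()]     # rows where it must be imputed
    if unknown.empty:
        return df
    features = df.columns.drop(target_col)  # assumes the other columns are complete and numeric
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(known[features], known[target_col])
    out = df.copy()
    out.loc[unknown.index, target_col] = model.predict(unknown[features])
    return out
```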

However, the issue is not that simple. As noted by one participant:

From Franco Costa, Developer: Java, Cloud, Machine Learning:

What if it is totally independent from the other features? Then there is nothing to learn.

The discussion then quickly expanded and many machine learning experts offered their own experiences and tips for solving this problem. At the time of writing (March 8, 2018), there are 69 answers.

Here are, in my opinion, the most useful ones.

1) REMOVE MISSING VALUES

From Blaine Bateman, Founder and Chief Data Engineer at EAF LLC:

Or just drop it from the predictors

From Swapnil Gaikwad, Software Engineer Cognitive Computing (Chatbots) at Light Information Systems Pvt. Ltd.:

Also, I got advice from one of my mentors: whenever more than 50% of the values in a column are missing, we can simply omit that column (if we can), provided we have enough other features to build a model.
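As a quick illustration of that rule of thumb, pandas can drop columns that are mostly empty; `df` and the 50% threshold below are assumptions, not a fixed recommendation.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isna().mean()            # per-column share of NaNs
    return df.loc[:, missing_fraction <= threshold]
```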

2) ASK WHY

From Kevin Gray, Reality Science:

It’s of fundamental importance to do our best to understand why missing data are missing. Two excellent sources for an in-depth look at this topic are Applied Missing Data Analysis (Enders) and Handbook of Missing Data Methodology (Molenberghs et al.). Outlier Analysis (Aggarwal) is also relevant. FIML and MI are very commonly used by statisticians, among other approaches.

From Julio Bonis Sanz, Medical Doctor + MBA + Epidemiologist + Software Developer = Health Data Scientist:

In some analyses I have done in the past, including “missing” as a value for prediction itself has produced some interesting results. The fact that the value is missing for a given observation is sometimes associated with the outcome you want to predict.

From Tero Keski-Valkama, A Hacker and a Machine Learning Generalist:

Also, you can try to check if the value being missing encodes some real phenomenon (like the responder chooses to skip the question about gender, or a machine dropping temperature values above a certain threshold) by trying to train a classifier to predict whether a value would be missing or not. It’s not always the case that values being missing are just independent random noise.
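A rough sketch of that check in Python: train a classifier to predict the missing-indicator of a column from the other features, and if it scores well above chance the values are probably not missing completely at random. The names below are placeholders, numeric features are assumed, and the crude constant fill is only there to keep the sketch short.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def missingness_signal(df: pd.DataFrame, col_with_gaps: str) -> float:
    """Mean ROC AUC for predicting the missing-indicator of one column."""
    y = df[col_with_gaps].isna().astype(int)            # 1 = value is missing
    X = df.drop(columns=[col_with_gaps]).fillna(-999)   # crude fill just for this check
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```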

From Vishnu Sai, Decision Scientist at Mu Sigma Inc.:

In my experience, I’ve found that the technique for filling up missing values depends on the business scenario.

From David T. Kearns, Co-founder, Sustainable Data and Principal Consultant, Sustainable Services:

I think it’s important to understand the underlying cause of the missing values. If your data was gathered by survey, for example, some people will realise their views are socially unpopular and will keep them to themselves. You can’t just average out that bias – you need to take steps to reduce it during measurement. For example, design your survey process to eliminate social pressure on the respondent.

For non-human measurements, sometimes instruments can be biased or faulty. We need to understand whether those biases/faults are themselves a function of the underlying measurements: do we lose data just as our values become high or low, for example? This is where domain knowledge is useful: making intelligent decisions about what to do, not blind assumptions.

If you’ve done all that and still have some missing values, then you’ll be in a far stronger position to answer your question intelligently.

3) USE MISSING VALUES AS A FEATURE

From Julio Bonis Sanz, Medical Doctor + MBA + Epidemiologist + Software Developer = Health Data Scientist:

One of my cases was a predictive model of use of antibiotics by patients with chronic bronchitis. One of the variables was smoking, with about 20% missing values. It turned out that having no information in the clinical record about smoking status was itself a strong predictor of antibiotic use, because patients missing this data were receiving worse healthcare in general. By using imputation methods you somehow lose that information.

From Kirstin Juhl, Full Stack Software Developer/Manager at UnitedHealth Group:

Julio Bonis Sanz: Interesting, and something I wouldn’t have thought of: missing values as a feature in itself.

From Peter Fennell, Postdoctoral Researcher in Statistics and A.I. @ USC:

That’s fine if you have one attribute with missing values. Or two. But what if many of your features have missing values? Do you do recursive filling, even though that can lead to error propagation? I like to think that there is value in a missing value, and so giving them their own distinct label (which, e.g., a tree-based classifier can isolate) can be an effective option.
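One hedged way to put that “distinct label” idea into practice in Python is to add an explicit missing-indicator column per attribute and then fill the gap with something simple, so a tree-based model can still split on the indicator. `df` is an assumed numeric DataFrame.

```python
import pandas as pd

def add_missing_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 0/1 indicator per column with gaps, then fill the gaps simply."""
    out = df.copy()
    for col in df.columns[df.isna().any()]:
        out[col + "_missing"] = df[col].isna().astype(int)  # keeps the missingness signal
        out[col] = df[col].fillna(df[col].median())          # simple numeric fill
    return out
```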

4) USE TESTED PACKAGES SUCH AS MICE OR RANDOM FOREST

From Jehan Gonsal, Senior Insights Analyst at AIMIA:

MCMC methods seem like the best way to go. I’ve used the MICE package before and found it to be very easy to audit and theoretically defensible.

From Swapnil Gaikwad, Software Engineer Cognitive Computing (Chatbots) at Light Information Systems Pvt. Ltd.:

This is great advice! In one of my projects, I used the R package called MICE, which uses regression to fill in the missing values. It works much better than the mean method.

From Nihit Save, Data Analyst at CMS Computers Limited (INDIA):

Multivariate Imputation by Chained Equations (MICE) is an excellent algorithm which tries to achieve the same. https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
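For readers working in Python rather than R, scikit-learn’s IterativeImputer implements a chained-equations scheme in the spirit of MICE; the sketch below is an analogue, not the MICE package itself, and the toy matrix is made up.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [np.nan, 8.0, 12.0]])

# sample_posterior=True draws each fill from a predictive distribution,
# which is closer to MICE's multiple-imputation spirit than a point estimate.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
print(imputer.fit_transform(X))
```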

From ROHIT MAHAJAN, Research Scholar – Data Science and Machine Learning at Aegis School of Data Science:

In R there are many packages like MICE, Amelia and, most importantly, “missForest” which will do this for you. But it takes too much time if the data is more than 500 MB. I always follow this regressor/classifier approach for the most important attributes.

From Knut Jägersberg, Data Analyst:

Another way to deal with missing values in a model-based manner is by using random forests, which work for both categorical and continuous variables: https://github.com/stekhoven/missForest . This algorithm can easily be reimplemented with a faster random forest implementation such as ranger (https://github.com/imbs-hl/ranger), and it then scales well to larger datasets.
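A missForest-like variant can be sketched the same way by plugging a random forest into the chained imputation loop. This is only an approximation of missForest, which adds its own stopping rule and out-of-bag error estimates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Chained imputation with a random forest as the per-column model.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0),
    max_iter=10,
    random_state=0,
)
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [np.nan, 8.0, 12.0]])
print(rf_imputer.fit_transform(X))
```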

5) USE INTERPOLATION

From Sekhar Maddula, Actively looking for Data-science roles:

I partly agree, Andriy Burkov. But at the same time there are a few methods specific to the technique/algorithm. For example, for time-series data you may consider the interpolation methods available in the R package “imputeTS”. There are also many interpolation methods in mathematics; we may need to try an appropriate one.
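As a minimal Python stand-in for that kind of interpolation, pandas can fill time-series gaps linearly or by time distance; imputeTS offers much richer options (Kalman smoothing, seasonal decomposition), and the series below is made up.

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    [10.0, np.nan, 12.5, np.nan, np.nan, 16.0],
    index=pd.date_range("2018-03-01", periods=6, freq="D"),
)
linear = ts.interpolate(method="linear")  # straight-line fill between known points
by_time = ts.interpolate(method="time")   # weights the fill by time distance
print(pd.DataFrame({"raw": ts, "linear": linear, "time": by_time}))
```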

6) ANALYZE DISTRIBUTIONS

From Gurtej Khanooja, Software Engineering/Data Science Co-op at Weather Source | Striving for Excellence:

One more way of dealing with missing values is to identify the distribution from the remaining values and fill the missing ones by drawing randomly from that distribution. This works fine in a lot of cases.
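A short sketch of that approach, assuming the simplest case where the “distribution” is the empirical one (sampling observed values with replacement); a fitted parametric distribution could be dropped in instead.

```python
import numpy as np
import pandas as pd

def fill_from_observed(s: pd.Series, random_state: int = 0) -> pd.Series:
    """Fill NaNs by sampling (with replacement) from the observed values."""
    rng = np.random.default_rng(random_state)
    observed = s.dropna().to_numpy()
    out = s.copy()
    out[s.isna()] = rng.choice(observed, size=int(s.isna().sum()))
    return out
```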

From Tero Keski-Valkama, A Hacker and a Machine Learning Generalist:

If you are going to use a classifier or a regressor to fill the missing values, you should sample from the predicted distribution rather than just picking the value with the largest probability.
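For a categorical attribute this could look like the sketch below: take the classifier’s predicted probabilities and sample the imputed label from them rather than taking the argmax. `clf` is assumed to be any fitted scikit-learn classifier with `predict_proba`, and `X_missing` the rows whose attribute needs filling.

```python
import numpy as np

def sample_imputations(clf, X_missing, random_state: int = 0) -> np.ndarray:
    """Sample imputed labels from the classifier's predicted distribution."""
    rng = np.random.default_rng(random_state)
    proba = clf.predict_proba(X_missing)   # shape: (n_rows, n_classes)
    return np.array([rng.choice(clf.classes_, p=row) for row in proba])
```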

SUMMARY

This was the best summary comment I found:

“I have used the MICE package in R to deal with imputation and luckily it produced better results. But in general we should take care of the following:

  1. Why is the data missing? Is it for some meaningful reason?
  2. How much data is missing?
  3. Fit a model with the non-missing values.
  4. Now apply the imputation technique, fit the model, and compare it with the earlier one.”

-Dr. Bharathula Bhargavarama Sarma, PhD in Statistics, Enthusiastic data science practitioner
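A hedged sketch of steps 3 and 4 of this checklist: score a model trained on complete cases only, then the same model trained on imputed data, and compare. The names `df` and `target`, the mean-fill imputer, and the choice of model and metric are illustrative assumptions (numeric features and a complete target column are assumed); a MICE-style imputer would slot in the same way.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

def compare_imputation(df: pd.DataFrame, target: str) -> dict:
    model = RandomForestClassifier(n_estimators=200, random_state=0)

    # Step 3: fit and score on complete cases only.
    complete = df.dropna()
    baseline = cross_val_score(
        model, complete.drop(columns=[target]), complete[target], cv=5
    ).mean()

    # Step 4: impute the features (mean fill here), refit, and compare.
    X = df.drop(columns=[target])
    X_imputed = pd.DataFrame(
        SimpleImputer(strategy="mean").fit_transform(X),
        columns=X.columns, index=X.index,
    )
    imputed = cross_val_score(model, X_imputed, df[target], cv=5).mean()

    return {"complete_cases": baseline, "after_imputation": imputed}
```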

FOOTNOTES

[1] https://www.linkedin.com/feed/update/urn:li:activity:6375893686550614016/
