MissForest for Survey Data

by PythonBeginner   Last Updated June 12, 2019 07:19 AM

Hello fellow data scientists,

I am currently reading the paper by Stekhoven & Bühlmann about MissForest. I was wondering how to deal with variables that are restricted by domain knowledge. For example, a woman cannot have had prostate cancer in the past, so missing values are expected for this item. Should I just exclude such variables (where missing values are wanted / intended) from the MissForest imputation?

If so, how can I combine these variables with the imputed datasets afterwards?

I hope this is specific enough. Thanks in advance.



Answers 1


Usually it is better to first apply logical rules to fill in the blanks whose values are already determined by domain knowledge, and only then, if anything is still missing, run algorithmic imputation.

Take, for example, a data set about house characteristics. One column is "swimming pool", with either a 1 (yes) or a missing value (no). Since 1 is the only value ever observed in that column, algorithmic imputation would set all missing values to "1", destroying all information about whether a house has a pool.
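To make the order of operations concrete, here is a minimal sketch in Python with pandas. Everything in it is an assumption made for illustration: the toy columns `sex`, `age` and `prostate_cancer`, the helper names `apply_logical_rules` and `impute_placeholder`, and the choice to code "not applicable" as 0. The median fill is only a stand-in so the example runs without extra packages; it is not MissForest, and a real MissForest implementation would take its place.

```python
import numpy as np
import pandas as pd

# Toy survey data; all column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "sex": ["f", "m", "f", "m", "m"],
        "age": [34.0, np.nan, 51.0, 62.0, 45.0],
        "prostate_cancer": [np.nan, 0.0, np.nan, 1.0, np.nan],
    }
)


def apply_logical_rules(frame):
    """Fill blanks whose value follows from domain knowledge."""
    out = frame.copy()
    # Assumption: for women the item is not applicable, so it is coded 0 ("no").
    not_applicable = out["sex"].eq("f") & out["prostate_cancer"].isna()
    out.loc[not_applicable, "prostate_cancer"] = 0.0
    return out


def impute_placeholder(frame):
    """Stand-in for the real imputer: fills each column with its median."""
    return frame.fillna(frame.median())


# Step 1: logical rules first.
ruled = apply_logical_rules(df)

# Step 2: impute only the columns that should be imputed.
to_impute = ["age", "prostate_cancer"]
held_out = ["sex"]  # columns deliberately kept out of the imputation
imputed = impute_placeholder(ruled[to_impute])

# Step 3: put the held-out columns back. The row index is unchanged,
# so an index-based join lines the rows up again.
result = ruled[held_out].join(imputed)
print(result)
```

The same index-based join also works if a variable with intended missing values is excluded from the imputation entirely: leave its blanks as they are, impute the remaining columns, and join the excluded column back on the row index afterwards.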

Michael M
June 12, 2019 07:09 AM
