I am building a random forest for a classification problem with the randomForest package in R. The curves produced by the plot.randomForest function do not match the confusion matrix I get when I predict the training data itself (instead of a test set). My intuition was that predicting the training set would give misclassification rates in the confusion matrix that resemble the plotted curves, yet the curves and the confusion matrix tell me different things. I am not sure why this happens, but my gut feeling is that the curves from plot.randomForest are based on the out-of-bag (OOB) error, and that is why they indicate lower accuracy than the confusion matrix (this is just a conjecture and may well be incorrect). I would appreciate it if someone could let me know what I am missing, if anything.
Here is a reproducible example using the iris data.
library(datasets)
library(gmodels)
library(randomForest)
data(iris)
set.seed(123)

rf.train <- randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                         data = iris, ntree = 50, importance = TRUE)
plot(rf.train, main = "Error Rate vs Number of Trees In the Forest")

predictions <- predict(rf.train, newdata = iris)
mydata_with_predictions <- cbind(iris, predictions)

# Confusion matrix
CrossTable(mydata_with_predictions$Species, mydata_with_predictions$predictions,
           prop.chisq = FALSE, prop.t = FALSE)
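For reference, here is a short sketch of what I looked at while comparing the two numbers. My understanding (which may be wrong, hence the question) is that plot.randomForest draws the columns of rf.train$err.rate, and that calling predict without newdata returns OOB predictions, while passing newdata = iris scores every row with every tree:

```r
library(randomForest)

data(iris)
set.seed(123)
rf.train <- randomForest(Species ~ ., data = iris, ntree = 50)

# The columns plotted by plot(rf.train): overall OOB error plus per-class OOB errors
head(rf.train$err.rate)

# The OOB confusion matrix stored on the fitted object
rf.train$confusion

# Resubstitution: with newdata = iris, every tree votes on every row,
# including rows it saw during training, so the error is optimistically low
table(iris$Species, predict(rf.train, newdata = iris))

# Without newdata, predict() returns the OOB predictions instead
table(iris$Species, predict(rf.train))
```

The second table shows more misclassifications than the first, which seems consistent with my conjecture, but I would like confirmation that this is the actual mechanism.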