In addition to regression trees, we can also fit classification trees when we have binary or categorical outcomes. Use fl2003.RData, which is a subset of the data in Fearon and Laitin (2003), to fit an ensemble model that explains onset as a function of all other variables. Determine the most important variables in the ensemble, and then produce a partial dependence plot showing the relationship between two variables that are not the most important and the predicted probability of civil war in a given observation. Discuss this relationship.
# set seed for replication
set.seed(0032185)
library(randomForest) # random forest ensembles
library(pdp) # partial dependence plots
library(doParallel) # parallel processing
# register parallel backend
registerDoParallel(makeCluster(parallel::detectCores()))
# load data
load('fl2003.RData')
# split into training and test sets
train <- sample(seq_len(nrow(fl)), floor((2 / 3) * nrow(fl)))
fl_train <- fl[train, ]
fl_test <- fl[-train, ]
# fit random forest model
fl_rf <- randomForest(formula = as.factor(onset) ~ ., data = fl_train,
                      ntree = 1500, mtry = 3, nodesize = 1)
# variable importance plot
varImpPlot(fl_rf)
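# partial dependence of the predicted probability of onset on two predictors
# which.class indexes the columns of the predicted class probabilities; assuming
# onset is coded 0/1, the factor levels are "0" and "1", so onset is class 2
fl_part <- partial(fl_rf, pred.var = c('instab', 'ethfrac'), rug = TRUE,
                   train = fl_train, which.class = 2, prob = TRUE, parallel = TRUE,
                   paropts = list(.packages = "randomForest"))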
# 2D plot
plotPartial(fl_part, rug = TRUE, train = fl_train)
# 3D plot
plotPartial(fl_part, train = fl_train, levelplot = FALSE, drape = TRUE, colorkey = TRUE,
            screen = list(z = 240, x = -60))
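To help with discussing the plots, it can be useful to see what a partial dependence value represents. The sketch below computes one by hand for a single predictor: each grid value is imposed on every observation, the forest's predicted probabilities are obtained, and they are averaged. It runs on simulated stand-in data, so the data frame sim and its columns x1, x2, x3, and y are hypothetical and are not the Fearon and Laitin variables. As far as we understand the pdp package, this averaging is what partial() reports when prob = TRUE, but treat that as an assumption and check the package documentation.

```r
library(randomForest)

set.seed(1)

# simulated stand-in data (hypothetical; NOT the fl2003 variables)
n <- 300
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = rnorm(n))
sim$y <- factor(rbinom(n, 1, plogis(-1 + 3 * sim$x1 - 2 * sim$x2)))

# small classification forest on the simulated data
rf <- randomForest(y ~ ., data = sim, ntree = 200)

# grid of values at which to evaluate the partial dependence on x1
grid <- seq(0, 1, length.out = 11)

# for each grid value: set x1 to that value in every row, predict the
# probability of class "1", and average over all rows
pd <- sapply(grid, function(v) {
  tmp <- sim
  tmp$x1 <- v
  mean(predict(rf, newdata = tmp, type = "prob")[, "1"])
})

pd_manual <- data.frame(x1 = grid, yhat = pd)
print(pd_manual)
```

Each value of yhat is therefore an average prediction over the observed distribution of the other predictors, not the prediction for any single observation. The two-variable surfaces above should be read the same way.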