Blog > Machine Learning

Random Forest in R: An Example

17/01/2019

Random Forest is a powerful ensemble learning method that can be applied to various prediction tasks, in particular classification and regression. The method uses an ensemble of decision trees as a basis and therefore has all advantages of decision trees, such as high accuracy, easy usage, and no necessity of scaling data. Moreover, it also has a very important additional benefit, namely perseverance to overfitting (unlike simple decision trees) as the trees are combined.

In this tutorial, we will try to predict the value of diamonds from the Diamonds dataset (part of ggplot2) applying a Random Forest Regressor in R. We further visualize and analyze the obtained predictive model and look into the tuning of hyperparameters and the importance of available features.

Loading and preparing data

# Import the dataset
diamond <-diamonds
head(diamond)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The dataset contains information on 54,000 diamonds. It contains the price as well as 9 other attributes. Some features are in the text format, and we need to encode them to the numerical format. Let’s also drop the unnamed index column.

# Convert the variables to numerical
diamond$cut <- as.integer(diamond$cut)
diamond$color <-as.integer(diamond$color)
diamond$clarity <- as.integer(diamond$clarity)

head(diamond)

## # A tibble: 6 x 10
##   carat   cut color clarity depth table price     x     y     z
##   <dbl> <int> <int>   <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23      5     2       2  61.5    55   326  3.95  3.98  2.43
## 2 0.21      4     2       3  59.8    61   326  3.89  3.84  2.31
## 3 0.23      2     2       5  56.9    65   327  4.05  4.07  2.31
## 4 0.290     4     6       4  62.4    58   334  4.2   4.23  2.63
## 5 0.31      2     7       2  63.3    58   335  4.34  4.35  2.75
## 6 0.24      3     7       6  62.8    57   336  3.94  3.96  2.48

As we already mentioned, one of the benefits of the Random Forest algorithm is that it doesn’t require data scaling. So, to use this algorithm, we only need to define features and the target that we are trying to predict. We could potentially create numerous features by combining the available attributes. For simplicity, we will not do that now. If you are trying to build the most accurate model, feature creation is definitely a key part and substantial time should be invested in creating features (e.g. through interaction).

# Create features and target
X <- diamond %>% 
  select(carat, depth, table, x, y, z, clarity, cut, color)
y <- diamond$price

Training the model and making predictions

At this point, we have to split our data into training and test sets. As a training set, we will take 75% of all rows and use 25% as test data.

# Split data into training and test sets
index <- createDataPartition(y, p=0.75, list=FALSE)
X_train <- X[ index, ]
X_test <- X[-index, ]
y_train <- y[index]
y_test<-y[-index]

# Train the model 
regr <- randomForest(x = X_train, y = y_train , maxnodes = 10, ntree = 10)

Now, we have a pre-trained model and can predict values for the test data. We then compare the predicted value with the actual values in the test data and analyze the accuracy of the model. To make this comparison more illustrative, we will show it both in the forms of table and plot the price and the carat value

# Make prediction
predictions <- predict(regr, X_test)

result <- X_test
result['price'] <- y_test
result['prediction']<-  predictions

head(result)

## # A tibble: 6 x 11
##   carat depth table     x     y     z clarity   cut color price prediction
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <int> <int> <int> <int>      <dbl>
## 1  0.24  62.8    57  3.94  3.96  2.48       6     3     7   336       881.
## 2  0.23  59.4    61  4     4.05  2.39       5     3     5   338       863.
## 3  0.2   60.2    62  3.79  3.75  2.27       2     4     2   345       863.
## 4  0.32  60.9    58  4.38  4.42  2.68       1     4     2   345       863.
## 5  0.3   62      54  4.31  4.34  2.68       2     5     6   348       762.
## 6  0.3   62.7    59  4.21  4.27  2.66       3     3     7   351       863.

# Import library for visualization
library(ggplot2)

# Build scatterplot
ggplot(  ) + 
  geom_point( aes(x = X_test$carat, y = y_test, color = 'red', alpha = 0.5) ) + 
  geom_point( aes(x = X_test$carat , y = predictions, color = 'blue',  alpha = 0.5)) + 
  labs(x = "Carat", y = "Price", color = "", alpha = 'Transperency') +
  scale_color_manual(labels = c( "Predicted", "Real"), values = c("blue", "red"))

The figure displays that predicted prices (blue scatters) coincide well with the real ones (red scatters), especially in the region of small carat values. But to estimate our model more precisely, we will look at Mean absolute error (MAE), Mean squared error (MSE), and R-squared scores.

# Import library for Metrics
library(Metrics)

## 
## Attaching package: 'Metrics'

## The following objects are masked from 'package:caret':
## 
##     precision, recall

print(paste0('MAE: ' , mae(y_test,predictions) ))

## [1] "MAE: 742.401258870433"

print(paste0('MSE: ' ,caret::postResample(predictions , y_test)['RMSE']^2 ))

## [1] "MSE: 1717272.6547428"

print(paste0('R2: ' ,caret::postResample(predictions , y_test)['Rsquared'] ))

## [1] "R2: 0.894548902990278"

We obtain high error values (MAE and MSE). To improve the predictive power of the model, we should tune the hyperparameters of the algorithm. We can do this manually, but it will take a lot of time.

In order to tune the parameters ntrees (number of trees in the forest) and maxnodes (maximum number of terminal nodes trees in the forest can have), we will need to build a custom Random Forest model to obtain the best set of parameters for our model and compare the output for various combinations of the parameters.

Tuning the parameters

# If training the model takes too long try setting up lower value of N
N=500 #length(X_train)
X_train_ = X_train[1:N , ]
y_train_ = y_train[1:N]

seed <-7
metric<-'RMSE'

customRF <- list(type = "Regression", library = "randomForest", loop = NULL)

customRF$parameters <- data.frame(parameter = c("maxnodes", "ntree"), class = rep("numeric", 2), label = c("maxnodes", "ntree"))

customRF$grid <- function(x, y, len = NULL, search = "grid") {}

customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, maxnodes = param$maxnodes, ntree=param$ntree, ...)
}

customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
   predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
   predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes

# Set grid search parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3, search='grid')

# Outline the grid of parameters
tunegrid <- expand.grid(.maxnodes=c(70,80,90,100), .ntree=c(900, 1000, 1100))
set.seed(seed)

# Train the model
rf_gridsearch <- train(x=X_train_, y=y_train_, method=customRF, metric=metric, tuneGrid=tunegrid, trControl=control)

Let’s visualize the impact of tuned parameters on RMSE. The plot shows how the model’s performance develops with different variations of the parameters. For values maxnodes: 80 and ntree: 900, the model seems to perform best. We would now use these parameters in the final model.

plot(rf_gridsearch)

Best parameters:

rf_gridsearch$bestTune

##   maxnodes ntree
## 5       80  1000

Defining and visualizing variables importance

For this algorithm, we used all available diamond features, but some of them contain more predictive power than others.

Let’s build the plot with features list on the y axis. On the X axis we’ll have incremental decrease in node impurities from splitting on the variable, averaged over all trees, it is measured by the residual sum of squares and therefore gives us a rough idea about the predictive power of the feature. Generally, it is important to keep in mind, that random forest does not allow for any causal interpretation.

varImpPlot(rf_gridsearch$finalModel, main ='Feature importance')

From the figure above you can see that the size of diamond (x,y,z refer to length, width, depth) and the weight (carat) contain the major part of the predictive power.