EME 210
Data Analytics for Energy Systems

Introduction to Random Forests



Read It: Random Forests

Random forests, although they fall under the "decision tree" umbrella, are considered an improvement over basic decision trees because the random forest algorithm addresses the critical overfitting issues of decision trees. It does this through a meta-algorithm known as bootstrap aggregation, or "bagging". Recall the discussion of bootstrapping from Lesson 4: bootstrapping involves resampling from an initial sample with replacement, effectively creating new samples. The bagging meta-algorithm performs this process but goes one step further and conducts an analysis (e.g., fitting a decision tree model) on each individual bootstrapped sample. Then, once a set number of bootstrapped samples have been drawn and a model fit to each, the results are aggregated to form one final model. The resulting model is far less prone to overfitting, since it is the average of the individually developed models. A depiction of the workflow for bagging is shown below.

[Figure: workflow of the bagging meta-algorithm. Ensemble methods include bagging, boosting, and stacking.]
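The bagging workflow described above can be sketched in Python with scikit-learn's BaggingClassifier, whose default base estimator is a decision tree. The dataset and parameter values below are illustrative assumptions, not data from this course:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: fit many decision trees (the default base estimator), each on
# a bootstrap resample of the rows (sampling with replacement), then
# aggregate their predictions by majority vote.
bag = BaggingClassifier(
    n_estimators=100,  # number of bootstrapped samples / fitted trees
    bootstrap=True,    # resample rows with replacement
    random_state=0,
)
bag.fit(X_train, y_train)
score = bag.score(X_test, y_test)
print(score)
```

Averaging over 100 trees, each trained on a slightly different resample, is what smooths out the overfitting that a single deep tree would exhibit.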

Random forest leverages this bagging meta-algorithm to create a "forest" of trees. However, random forest goes one step further than bagging and randomizes the explanatory variables as well. In other words, if you have 20 explanatory variables in your dataset, the random forest model will only use a fraction of those explanatory variables when growing a single tree. This subset of variables is randomly sampled for each tree, creating a second form of randomization, in addition to the random bootstrap sampling of the rows. An example of these two forms of randomization is shown in the figure below. Here, a single tree would only include a few rows (i.e., the bootstrapped sample) and a few columns (i.e., the random selection of explanatory variables).

[Figure: the two forms of randomization in a random forest: a single tree uses only a bootstrapped sample of the rows and a randomly selected subset of the explanatory-variable columns.]

In the next several sections, we will discuss and demonstrate how to apply a random forest model in Python for both classification and regression, as well as how to interpret the results.


 Assess It: Check Your Knowledge

Knowledge Check