EME 210
Data Analytics for Energy Systems

Random Forest Regression


Read It: Random Forest Regression

Similar to the classification model, random forest regression models build a forest of trees and aggregate them. However, the aggregation for regression is different. Since regression deals with quantitative response variables, there is no need for a majority-rule voting scheme; instead, the final result is simply the average of the individual trees' predictions. Additionally, there are some minor changes in the code.
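
In other words, if the forest contains $B$ trees and tree $b$ predicts $\hat{y}_b(x)$ for a given input $x$, the forest's prediction is the average of the individual tree predictions:

$$\hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{y}_b(x)$$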

Below, we demonstrate how to apply a random forest regression model in Python. We will include four key steps (step 1 is sketched in code just after the list):

  1. Split the data into training and test/validation sets (80/20 is a common split)
  2. Specify the model parameters (e.g., number of trees, maximum depth of any given tree, etc.)
  3. Fit the model to the training data set
  4. Evaluate predictive error (e.g., RMSE, MAE, etc.) using the test/validation data set
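
As a minimal sketch of step 1, assuming the RECS data live in a CSV file (the file name recs.csv and the random_state value are placeholders):

# load the dataset into a pandas DataFrame
import pandas as pd
from sklearn.model_selection import train_test_split

recs = pd.read_csv("recs.csv")

# hold out 20% of the rows for testing; the remaining 80% is for training
train, test = train_test_split(recs, test_size = 0.2, random_state = 42)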

Watch It: Video - Random Forest Regression (10:18 minutes)

Click here for a transcript.

All right, in this video we're going to introduce regression with random forest. So, similar to regression with neural networks, this is where the variable we're trying to predict is quantitative, so we're predicting a numerical value as opposed to trying to classify into groups. And as we go through these videos you'll see that a lot of the analysis functions the same way. We still are going to split our data, define the model, fit the model, and evaluate the accuracy. But there are going to be a few key differences that I'll point out throughout the videos. So with that, let's go ahead and jump right in. So we're here in Google Colab. I've already installed the libraries and read in the RECS data set.

So our goal is going to be to predict the cubic feet of natural gas that are being used by households, given some key household characteristics. So, once again we split the data into training and test sets. To give you an idea of what this looks like, we've got our response variable and all of our predictor variables. Similar to the classification, we need to convert this into TensorFlow objects. So we say tfdf.keras.pd_dataframe_to_tf_dataset. So same process as before. We give it our training data set, we tell it our label, which is our response variable. So in this case it's cubic feet of natural gas. And then we have one more step. We need to tell the program what task we're doing. So with random forest the baseline assumption is that you're doing classification. But if you're doing regression, you need to change this by setting the task to regression. This doesn't change how the data is split or stored in memory, but it tells the computer what to do with it. So once we've split the data into training and test, our next step is to actually define the model. So I'm going to call this model reg_model for regression, but the command is the same as classification.
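
As a minimal sketch of this conversion (the label column name CUFEETNG is an assumption; substitute the natural gas column name in your copy of the RECS data):

# convert the pandas DataFrames into TensorFlow datasets for TF-DF
import tensorflow_decision_forests as tfdf

# task = REGRESSION tells TF-DF we are predicting a number, not a class
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train, label = "CUFEETNG", task = tfdf.keras.Task.REGRESSION)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test, label = "CUFEETNG", task = tfdf.keras.Task.REGRESSION)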

So, regardless of which type of analysis you're doing, if you're working with random forest you're always using the same command. And we've got some of the same arguments. So we can still say compute_oob_variable_importances. Set that to true.

We can still set our number of trees. In this case, I'm going to set it to 200, a little higher than before, and we can still set our max depth, which I'm going to set to 10. But once again, we need to define our task. So this is the second place that you have to do this: you have to do it when you're splitting your data and turning it into TensorFlow objects, and then once again when you're defining your model. So you can run that, and it's told us where it saved it. And then we have another option to specify our metrics. And this becomes a little bit more important for regression models because there are a lot more options for metrics. In particular, we want to do MSE and MAE. So MSE is the mean squared error, which is indicated here. It's the difference between your actual data and your predicted data, squared and averaged.
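
A minimal sketch of the model definition and metric choice described here (the argument values match the video; the calls follow the standard TF-DF Keras API):

# define the random forest regression model
reg_model = tfdf.keras.RandomForestModel(
    task = tfdf.keras.Task.REGRESSION,         # second place the task is set
    num_trees = 200,                           # grow 200 trees
    max_depth = 10,                            # cap the depth of each tree
    compute_oob_variable_importances = True)   # track out-of-bag importances

# ask Keras to report mean squared error and mean absolute error
reg_model.compile(metrics = ["mse", "mae"])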

And then mean absolute error is the absolute difference between your actual data and your predicted data, not squared, but still averaged. So both are purely numerical metrics that tell us the error, looking at how different our actual data is from our predicted data.
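
Written out, for $n$ test observations with actual values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$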
So once we've compiled the model, the next step is to fit the model. And this works exactly like it does in classification. We say reg_model dot fit and give it our training data set. And so, it's telling us that it's reading in the training data set, and it goes through all of our different steps. Here we can see that it's taking a little bit longer. So our last one ran in less than five seconds total. This one took about 10, and that's because we've increased the number of trees to 200. But once the model is fit, we can go through and do our evaluation. And this works the same as it did before. So we just say reg_model dot evaluate with the test data set. And I'm going to return it as a dictionary.
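
A minimal sketch of these two steps, reusing the objects defined above:

# fit the model to the training data
reg_model.fit(train_ds)

# evaluate on the held-out test data; return_dict = True gives named metrics
evaluation = reg_model.evaluate(test_ds, return_dict = True)
print(evaluation)  # contains 'mse' and 'mae' entries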

So then we can see that it's given us our MSE and our MAE. And you can see that the MSE is apparently huge, and that's because it has been squared. And so, this is really in feet to the sixth power, instead of cubic feet. So in order to get this back into the units that we're working in, cubic feet of natural gas, we need to calculate the root mean squared error. And so, here we just say numpy dot square root of the MSE from our evaluation (let me make sure I spelled that right; I did not; there), and then we can print it. And so, here we can see it's at 374. And so that is more in line with our MAE, at least in the same order of magnitude.
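
A minimal sketch of the RMSE calculation, using the evaluation dictionary from above:

import numpy as np

# take the square root so the error is back in cubic feet of natural gas
rmse = np.sqrt(evaluation["mse"])
print(rmse, "cubic feet NG")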

Generally though, we want to provide units alongside our root mean squared error because it is unit dependent. And so, we can say cubic feet NG. And so that just provides a little bit more information, because depending on what we're actually working with, 309 or 374 could be a huge error, or it could be a very small error. And so, similar to when we were working with outliers and means and medians, it's important to compare this to our actual data set. So if we look at the cubic feet of NG and say dot describe, we can see that the average is 334. So our error is actually greater than our average, and that's not really ideal. So you can imagine that if you are the utility that's trying to predict a household's cubic feet of natural gas, perhaps trying to estimate for next year's supply, and your error is greater than the average usage within your service area, that's probably not great. That means that for the average household, your estimate could be off by more than double what they actually use. So it's not great. And that's usually how you can make a decision about how good, or how acceptable, this error is: by looking at what the average values are and how your error compares to that.
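
As a quick sketch of that comparison (again assuming the column is named CUFEETNG):

# compare the error to the scale of the response variable
print(recs["CUFEETNG"].describe())  # the mean here is the benchmark for the RMSE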

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Try It: GOOGLE COLAB

  1. Click the Google Colab file used in the video here.
  2. Go to the Colab file and click "File," then "Save a copy in Drive." This will create a new Colab file that you can edit in your own Google Drive account.
  3. Once you have it saved in your Drive, try to edit the following code to run a random forest regression model:

Note: You must be logged into your PSU Google Workspace in order to access the file.

# load the dataset
recs = pd.read_csv(...)

# split the data into 20/80 test/training sets 
train, test = train_test_split(...)

# reformat the data to a tensorflow dataset
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train, label = ..., ...)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test, label = ..., ...)

# create the random forest model
reg_model = tfdf.keras.RandomForestModel(...)
reg_model.compile(metrics = ...)

# fit model with training data
reg_model.fit(...)

# evaluate model accuracy using test data
evaluation = reg_model.evaluate(...)

# calculate RMSE
rmse = ...

rmse

Once you have implemented this code on your own, come back to this page to test your knowledge.


Assess It: Check Your Knowledge

Knowledge Check