EME 210
Data Analytics for Energy Systems

Variable Importance



Read It: Evaluating Variable Importance in Classification

The final way to interpret the results from random forest classification is to evaluate the variable importance. That is, to determine which of the explanatory variables was most important for achieving an accurate prediction. There are a number of ways to calculate variable importance, but the most common is to look at how much the accuracy decreases when a variable is either removed or randomized, so that the relationship between the explanatory variable and the response is broken. Below, we demonstrate how to find the variable importance for a random forest model in Python.
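The course notebook uses its own model-summary routine to report importance, but the same idea can be illustrated in a self-contained way with scikit-learn's `permutation_importance`, which shuffles one predictor at a time and records the mean drop in accuracy. This is a hedged sketch: the data are synthetic, and the column names (`kwh`, `cufeet_ng`, `gallon_fo`) are stand-ins for the kinds of variables discussed in the video, not the actual course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic example data: three numeric predictors and a binary response.
# The response depends mainly on kwh, so kwh should rank as most important.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "kwh": rng.normal(10000, 2000, 500),      # electricity use (illustrative)
    "cufeet_ng": rng.normal(500, 150, 500),   # natural gas use (illustrative)
    "gallon_fo": rng.normal(50, 20, 500),     # fuel oil use (illustrative)
})
y = (X["kwh"] + 5 * X["cufeet_ng"] + rng.normal(0, 1000, 500) > 12500).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: shuffle each column of the test set several times
# and average the resulting decrease in accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                scoring="accuracy", random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: mean decrease in accuracy = {imp:.3f}")
```

A larger mean decrease means the model leans more heavily on that variable, which is exactly how the "mean decrease in accuracy" column in the video's summary should be read.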

Watch It: Video - Classification Variable Importance (5:43 minutes)

Click here for a transcript.

Welcome back to the final video on random forests with classification. In this video we're going to talk about variable importance. So a lot of times we understand how our model is doing, how accurate it is, but we want to know why it's so accurate, which variables are contributing the most to that accuracy. And that's where variable importance comes in. So in order to get the variable importance, we just have whatever we called our model dot summary. And when we run this, it's going to come up with... oh, there's a typo. It's going to come up with this long text cell, but it's got a lot of good information in there that we really can make use of. So first, it's got some information about our model itself. So we've got that we're doing a random forest model, we've got that it was classification, we've got all of our explanatory variables, and then we get into variable importance. And there's, you know, probably at least 10 ways that we can measure variable importance, but the one that we are going to be most interested in is this one, the mean decrease in accuracy.

So what this means, how they calculate it, is imagine that you are interested in seeing how important kilowatt hours is, for example. What the computer does is it takes kilowatt hours, randomly shuffles it, does a reallocation similar to what we did in Lesson 5 for the hypothesis tests, it shuffles it with replacement, and then re-runs the model and sees how drastically the accuracy decreased after that shuffling. So in theory, an important variable will create larger decreases in accuracy upon shuffling, because suddenly those relationships between predictor and response have been broken. And so that's what we're seeing here: when kilowatt hours is shuffled, it leads to, on average, an eight percent loss in accuracy. So we go from 0.7 down to close to 0.6. Same with cubic feet of natural gas, we end up with a seven and a half percent decrease in accuracy. And these are very high, which is why we actually see these hashtags, which tell us how significant a certain variable is in terms of its importance measure.

So this mean decrease in accuracy, eight percent, might not seem like a lot to you and me, but the computer is telling us that it is very highly significant, that it is a massive reduction in accuracy for our data set. And even here, where we've got gallons of fuel oil, we get a 0.9 percent decrease in accuracy; it still gets a slight significance mark for being a fairly decent decrease, whereas these last three are not very significant. And so this is the primary variable importance measure that we'll use. But you can also look at all of these other ones. We can see that some of the rankings actually change: here natural gas is on top, and kilowatt hours is down below.

And so they just have all of these differences, and there are pros and cons to using each of these. But, like I said, the mean decrease in accuracy is going to be the primary one we are interested in, because it really gives that direct connection to our ability to predict and classify new data sets. And so, as we keep scrolling down through here, eventually we get to some information about the model itself. Again, we've got the accuracy here, we've got the number of trees, and the total number of nodes that were built. And then we've got some information about the individual trees here that can also be useful: how often variables show up in nodes, how often they show up at certain depths, and then finally some information on the training out-of-bag error. So while all of this can be important, like I said, what we are going to primarily want to focus on in terms of variable importance is this one right here.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
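The shuffle-and-rescore procedure described in the video can also be sketched by hand, without any built-in importance routine. This is a hedged illustration, not the course's own code: the data are synthetic, the column names are made up, and, following the video's description, the column is resampled with replacement before re-scoring.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic data (not the course RECS file): the class depends only on kwh,
# so shuffling kwh should hurt accuracy far more than shuffling gallon_fo.
rng = np.random.default_rng(1)
X = pd.DataFrame({"kwh": rng.normal(size=400), "gallon_fo": rng.normal(size=400)})
y = (X["kwh"] > 0).astype(int)

model = RandomForestClassifier(random_state=1).fit(X, y)
baseline = accuracy_score(y, model.predict(X))

def mean_decrease_in_accuracy(col, repeats=20):
    """Resample one column with replacement (breaking its link to the
    response), re-predict, and average the resulting drop in accuracy."""
    drops = []
    for _ in range(repeats):
        X_shuf = X.copy()
        X_shuf[col] = rng.choice(X[col].to_numpy(), size=len(X), replace=True)
        drops.append(baseline - accuracy_score(y, model.predict(X_shuf)))
    return np.mean(drops)

for col in X.columns:
    print(f"{col}: {mean_decrease_in_accuracy(col):.3f}")
```

Because the response here is driven entirely by `kwh`, resampling that column destroys the model's predictive power, while resampling the irrelevant column leaves accuracy essentially unchanged, which is the pattern the video interprets in the summary output.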

Try It: GOOGLE COLAB

  1. Click the Google Colab file used in the video here.
  2. Go to the Colab file and click "File", then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
  3. Once you have it saved in your Drive, try to edit the following code to print the variable importance. Remember to load the data and rerun the model code:

Note: You must be logged into your PSU Google Workspace in order to access the file.

# import pandas for loading the data
import pandas as pd

# load the dataset
recs = pd.read_csv(...)

# rerun the model code here

# summarize model to view variable importance
...

Once you have implemented this code on your own, come back to this page to test your knowledge.


 Assess It: Check Your Knowledge

Knowledge Check