Plotting Decision Trees

Read It: Plotting Decision Trees for Classification

One way you can begin to interpret the results of the random forest model is to plot one of the trees, as demonstrated in the video below. Note that for a random forest, you are plotting only one of many trees and are not being shown the aggregated version of the model. In this sense, each tree will be slightly different, since it was created using slightly different data.

Watch It: Video - Classification Plotting Tree (5:08 minutes)

Click here for a transcript.

Welcome back to the video series on random forests, specifically classification. So, in this video, we are going to focus on interpreting the results from our random forest model. So, as a review, in the previous video, we separated our data into training and test. We defined the model using our random forest model. We fit the model to the training data set, and then we computed the accuracy on the test data set. And so, you could in theory stop here, but oftentimes we want to know what's going on and possibly why we got the accuracy that we got, and so we can come in here, and we can actually look at these trees.

And so, essentially, to plot a tree we use our TFDF library, and we do model_plotter, and then we have a special command for plotting the model in Colab. So that's plot_model_in_colab (this should be an underscore right here), and then within the parentheses we give it our model name, which we called model. We give it the tree index, I'll just say three, and we give it the max depth, and I'm going to say 3 here. And if we run this, we can see that it gives us this example of a decision tree.
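Written out as code, the call described above might look like the following sketch. It assumes that `model` is the trained random forest from the previous video and that the `tensorflow_decision_forests` package is installed in the Colab runtime. The wrapper name `plot_one_tree` is hypothetical and is only there for illustration.

```python
def plot_one_tree(model, tree_idx=3, max_depth=3):
    """Draw a single tree from a trained TF-DF model in a Colab cell.

    `plot_one_tree` is a hypothetical helper name; `model` is assumed to be
    the already-fitted random forest model.
    """
    # Imported inside the function so this cell can be defined before
    # tensorflow_decision_forests is installed in the runtime.
    import tensorflow_decision_forests as tfdf

    # tree_idx picks which tree of the forest to draw;
    # max_depth limits how many levels of that tree are shown.
    return tfdf.model_plotter.plot_model_in_colab(
        model, tree_idx=tree_idx, max_depth=max_depth
    )
```

In the notebook, putting `plot_one_tree(model)` on the last line of a cell should display the tree with index 3, cut off at depth 3, which matches the values used in the video.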

So each node here is showing the rule that has been used to make a split. So, in this case, if cubic feet of natural gas was greater than 378, the data set goes here. Otherwise, it goes here. One of the nice things is that there is this color bar that's actually indicating the percentage of each class. So we can see that initially, in our whole data set, we've got 67 percent Class 2, those are single-family detached homes, but if we come up here after our first split, that jumps to 85 percent. So, more single-family detached homes use more than 378 cubic feet than some of the other house types, such as apartment buildings, which are down to 21 in this. And this keeps going. So, we have pellet amounts, we've got kilowatt-hours. And so, you can always come in here and see sort of where the splits are. But this isn't the end of the tree.
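To make the color bar concrete, here is a small sketch of how the class percentages at a node change after a split. The rows below are made-up toy values, not the course data, and the helper names `class_percentages` and `split_rows` are hypothetical. Only the threshold of 378 comes from the video.

```python
from collections import Counter

def class_percentages(labels):
    """Return {class: percent of rows} for the rows that reach a node."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: 100.0 * n / total for cls, n in counts.items()}

def split_rows(rows, feature, threshold):
    """Send rows with feature > threshold one way and the rest the other."""
    above = [r for r in rows if r[feature] > threshold]
    below = [r for r in rows if r[feature] <= threshold]
    return above, below

# Made-up toy rows (NOT the course data): natural gas use and a housing class.
rows = [
    {"gas": 500, "cls": 2}, {"gas": 420, "cls": 2}, {"gas": 610, "cls": 2},
    {"gas": 390, "cls": 3}, {"gas": 100, "cls": 5}, {"gas": 250, "cls": 2},
    {"gas": 50, "cls": 5}, {"gas": 300, "cls": 3},
]

above, below = split_rows(rows, "gas", 378)

root_pct = class_percentages([r["cls"] for r in rows])    # before the split
above_pct = class_percentages([r["cls"] for r in above])  # gas > 378
below_pct = class_percentages([r["cls"] for r in below])  # gas <= 378

print("root :", root_pct)
print("above:", above_pct)
print("below:", below_pct)
```

In this toy example, Class 2 makes up half of the rows at the root and three quarters of the rows on the high-gas side of the split. That is the same kind of shift the color bar shows in the plotted tree.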

So you can see there's some continuation here to let us know that this is not the leaf node, the terminal node. If we increase our max depth to what we actually specified as our max depth up here, so 12, we can see that this plot becomes much larger. But if we scroll all the way to the end, we can see that there are these Max nodes.

So this particular node that we end up in is 100 percent single-family homes, and it tells us that there are six data points in this terminal node. This one, also 100 percent Class 2, has 17.

Here we've got a mixed class. So, it's majority Class 3, which is single-family attached homes, but there are a few detached homes. So not every leaf node, every terminal node, is going to be 100 percent there.

And so, you could come through here, but because it's so big, we often just want to cut it off at three, maybe four, so that you can see the most important splits, which are at the very beginning. And then you can play around, so say you wanted to look at the 30th tree. We can see that now, under this tree, kilowatt-hours has the first split, whereas before, we had cubic feet of natural gas. So each of these trees is going to be different from one another, and the power of random forest comes when you average all of those trees together to really get a strong predictive model.
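The idea of averaging the trees can be sketched in a few lines. This is only an illustration of the averaging described above, under the assumption that each tree outputs class probabilities for one home. The numbers are made up, the name `average_tree_votes` is hypothetical, and the library's own aggregation may differ (for example, each tree may cast a single vote instead).

```python
def average_tree_votes(per_tree_probs):
    """Average the class probabilities predicted by each tree."""
    n_trees = len(per_tree_probs)
    classes = per_tree_probs[0].keys()
    return {c: sum(p[c] for p in per_tree_probs) / n_trees for c in classes}

# Made-up predictions from four trees for one home (NOT real model output).
per_tree_probs = [
    {"detached": 1.0, "attached": 0.0},
    {"detached": 0.5, "attached": 0.5},
    {"detached": 0.0, "attached": 1.0},
    {"detached": 1.0, "attached": 0.0},
]

forest_probs = average_tree_votes(per_tree_probs)

# The forest predicts the class with the highest averaged probability.
predicted = max(forest_probs, key=forest_probs.get)

print(forest_probs)
print(predicted)
```

Even though one tree is fully confident in "attached", the averaged result favors "detached", which is why a single plotted tree should not be read as the forest's final answer.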

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Try It: GOOGLE COLAB

Click the Google Colab file used in the video here.
Go to the Colab file and click "File" then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, try to edit the code from the previous section to plot a tree in your random forest model. Make sure to upload the data and rerun the model.

Note: You must be logged into your PSU Google Workspace in order to access the file.

# import the libraries used below
import pandas as pd
import tensorflow_decision_forests as tfdf

# load the dataset
recs = pd.read_csv(...)

# rerun model here

# plot decision tree
tfdf.model_plotter.plot_model_in_colab(...)

Once you have implemented this code on your own, come back to this page to test your knowledge.
