Implementing Multiple Linear Regression in Python

Read It: Implementing Multiple Linear Regression in Python

In this course we will be implementing multiple linear regression in Python. In particular, we will be using the .ols command from the statsmodels.formula.api library that you learned about in Lesson 8. More information on this command can be found in the documentation. Below we will demonstrate how to implement multiple linear regression through two videos. The first will set up the data, while the second will focus on the actual implementation.
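As a rough preview of what the .ols command looks like with more than one explanatory variable, here is a minimal sketch. The data are synthetic and the column names y, x1, and x2 are assumptions made only for illustration; they are not the course data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): y depends on two explanatory variables.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] - 3.0 * df["x2"] + rng.normal(scale=0.1, size=n)

# Simple linear regression: one explanatory variable to the right of the tilde.
simple = smf.ols("y ~ x1", data=df).fit()

# Multiple linear regression: join additional explanatory variables with "+".
multiple = smf.ols("y ~ x1 + x2", data=df).fit()

print(multiple.params)                        # intercept and one coefficient per variable
print(simple.rsquared, multiple.rsquared)     # compare the fit of the two models
```

The only change from the simple linear regression formula is the extra term on the right-hand side of the tilde; the fitting and summary calls stay the same.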

Watch It: Video - Data Set Up Visualization (6:50 minutes)

Click here for a transcript.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Watch It: Video - Multiple Linear Regression (9:21 minutes)

Click here for a transcript.

Hello, and welcome to the videos on multiple linear regression. In this lesson we're going to extend what we learned about linear regression to include multiple explanatory variables within our model. So with that, let's go ahead and dive in.

Now in this video, we're mainly going to talk about the data that we're going to use in multiple linear regression, as well as do a preliminary visualization. So I've got several libraries here, and I want to point out that we are using the statsmodels.formula.api library. You learned about this in Lesson 8. This is the way we did linear regression, where we developed a formula using that tilde. We're going to use that here in our multiple linear regression code. The data that we're going to work with throughout this lesson is the Pennsylvania Regional Greenhouse Gas Initiative (RGGI) data that we first introduced during the ANOVA lesson. There are several Excel files that we will use, and I've already got the code set up to read them in. We're going to be using the 2030 data, and we're essentially going to be trying to predict the RGGI NOx pollution levels based off of the base case data.

And so, I'm going to go ahead and run this. As this is going, just a reminder that what this code is doing is it's first reading in the Excel files, specifying that we're working with sheet 2030. Then we're setting the indices for the columns that we don't want to add suffixes to. Then we add the suffixes to differentiate between the different scenarios, because in this code we're going to combine them all. So we want to have these suffixes so that we know which column is for the base case, which is for RGGI, and which is for noaeps. And then finally, I reset the index. If we look at, say, the first 10 rows, we can see that we've got this ID, lat, and lon. We've got so many columns that it's just shortening the display for us. And we can come down here, and we've got all of these different stack columns as well. We are going to be working with this data set throughout our multiple linear regression lesson.

But before we get into multiple linear regression, before we develop the models, the first step is always to visualize the data and try to figure out what exactly is going on. That gives us an idea about what we may want to include in our multiple linear regression model. Before we can actually run this code, I need to convert the state data into a category, so I'm going to say df_merged['state'].astype('category'). This is because the states are actually labeled as numbers, so when we read the Excel file in, Python automatically assumes that they're numeric. But for the sake of plotting, we're going to want them to be categorical. Then we can use ggplot to do our main plot with our df_merged. I'm going to set the aes statement inside this main command because our x data and y data are going to be the same across all of our plots. So we have x is NOx_base and y is NOx_RGGI. And then we can add a point plot. We do still need one extra aes here: we need to set the color equal to the state.
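The data preparation described above (set the shared columns as the index, add a suffix per scenario, combine, reset the index, then convert state to a category) could be sketched roughly as follows. This is not the code from the video: the small in-memory tables stand in for the Excel files, and the identifier columns and all values are hypothetical. In practice each table would come from a call such as pd.read_excel with the 2030 sheet selected.

```python
import pandas as pd

# Hypothetical identifier columns shared by every scenario table.
key_cols = ["ID", "lat", "lon", "state"]
ids = pd.DataFrame({
    "ID": [1, 2, 3],
    "lat": [40.1, 40.5, 41.0],
    "lon": [-77.0, -76.5, -75.9],
    "state": [37, 37, 24],   # numeric state codes; values are made up
})

# Stand-ins for the three Excel sheets (made-up NOx values).
scenarios = {
    "base": ids.assign(NOx=[10.0, 12.0, 8.0]),
    "RGGI": ids.assign(NOx=[7.0, 9.0, 8.5]),
    "noaeps": ids.assign(NOx=[11.0, 12.5, 8.2]),
}

# Indexing on the identifier columns means add_suffix only renames
# the pollutant columns, not the identifiers.
suffixed = [
    table.set_index(key_cols).add_suffix(f"_{name}")
    for name, table in scenarios.items()
]

# Combine side by side on the shared index, then restore the identifiers as columns.
df_merged = pd.concat(suffixed, axis=1).reset_index()

# State codes are read as numbers; make them categorical for plotting.
df_merged["state"] = df_merged["state"].astype("category")

print(df_merged.head(10))
```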

We can add a stat_smooth. There's no need to do an aes here; we can just say the method is 'lm'. Then I'm going to show you a little bit of an extra thing with scale_color_brewer. This is a way that you can very quickly change your color scheme to match one of the preset ColorBrewer color schemes. We need to tell it that this is qualitative data that we're plotting, so categorical data. And then we need to tell it which palette, and these all have set names that you can find online if you just Google ColorBrewer. What we're going to be using is Set1. So we can run this, and we can see that when we include just NOx_base and NOx_RGGI, a lot of the data is going here, but there's this data down here, State 37. This happens to be Pennsylvania. So our best fit line for just this linear regression isn't too great. It's trying to sort of split the middle between all of these different states. It looks like the state itself may be an important explanatory variable, so when we do get into our multiple linear regression, this suggests that we should look into including state as well as NOx_base as our explanatory variables.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Try It: DataCamp - Apply Your Coding Skills

Using the partial code below, implement a multiple linear regression model. Your response variable should be 'y' and the remaining variables should be used as explanatory variables.

# libraries
import pandas as pd
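
One possible way to approach an exercise like this is sketched below. The exercise data set is not shown here, so the DataFrame, its column names (x1, x2, x3), and its values are all hypothetical; the only detail taken from the exercise is that the response variable is named 'y' and every other column is explanatory.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the exercise data.
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
df["y"] = 0.5 + 1.5 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.2, size=n)

# Use every column except the response as an explanatory variable.
explanatory = [col for col in df.columns if col != "y"]
formula = "y ~ " + " + ".join(explanatory)

results = smf.ols(formula, data=df).fit()
print(formula)
print(results.summary())
```

Building the formula string from the column names avoids typing each variable by hand, which helps when there are many explanatory variables. Writing the formula out explicitly works just as well for a small data set.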