
Correlation-Based Variable Selection
Read It: Correlation-Based Variable Selection
One way to select the best variables for multiple linear regression is through correlation analysis. By determining which explanatory variables are most highly correlated with the response, you can identify the variables most likely to matter and build models with higher adjusted R² values. Below, we demonstrate how to implement this process in Python.
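As a rough illustration of the idea, the sketch below ranks explanatory variables by the absolute value of their correlation with the response. The data are synthetic and hypothetical: the names `y`, `x1`, `x2`, `x3` and the coefficients are assumptions chosen so that `x1` is the strongest predictor. It is not taken from the video.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: y depends strongly on x1, weakly on x2, not on x3
rng = np.random.default_rng(seed=0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 3.0 * x1 + 0.5 * x2 + rng.normal(size=n)

data = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

# Pearson correlation of each explanatory variable with the response
corr_with_y = data.corr()["y"].drop("y")

# Rank by absolute value so strong negative correlations are not overlooked
ranked = corr_with_y.abs().sort_values(ascending=False)
print(ranked)
```

Variables near the top of this ranking are reasonable first candidates for the model. A high correlation with the response does not by itself guarantee a higher adjusted R², so the ranking is best treated as a starting point.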
Watch It: Video - Correlation Variable Selection (10:13 minutes)
Credit: © Penn State is licensed under CC BY-NC-SA 4.0
Try It: DataCamp - Apply Your Coding Skills
Using the pre-coded variables below, calculate and print the correlation matrix. Are any of the variables highly correlated? Which would you include in your multiple linear regression model?
# libraries
import pandas as pd
import numpy as np
# create some data
df = pd.DataFrame({'y': np.random.uniform(10.0, 25.0, 50),
                   'x1': np.random.uniform(0.001, 0.999, 50),
                   'x2': np.random.uniform(0.001, 0.999, 50),
                   'x3': np.random.uniform(175, 250, 50),
                   'x4': np.random.uniform(5.7, 6.9, 50),
                   'x5': np.random.uniform(10.0, 25.0, 50)})
# find correlation matrix
corrMatrix = ...
# print correlation matrix
...
# libraries
import pandas as pd
import numpy as np
# create some data
df = pd.DataFrame({'y': np.random.uniform(10.0, 25.0, 50),
                   'x1': np.random.uniform(0.001, 0.999, 50),
                   'x2': np.random.uniform(0.001, 0.999, 50),
                   'x3': np.random.uniform(175, 250, 50),
                   'x4': np.random.uniform(5.7, 6.9, 50),
                   'x5': np.random.uniform(10.0, 25.0, 50)})
# find correlation matrix
corrMatrix = df.corr()
# print correlation matrix
print(corrMatrix)
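To help answer whether any variables are highly correlated, one option is to scan the correlation matrix for pairs whose absolute correlation exceeds a cutoff. The sketch below is a minimal example under stated assumptions: the helper `high_corr_pairs`, the 0.7 threshold, and the `demo` data (where `x3` is deliberately built from `x1`) are all hypothetical and not part of the exercise. Because the exercise data are drawn independently at random, their correlations will likely be much weaker than in this demo.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr, threshold=0.7):
    """Return (var_a, var_b, r) for each distinct pair with |r| >= threshold.

    The 0.7 default is an illustrative assumption, not a fixed rule.
    """
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        # Start at i + 1 to skip the diagonal and avoid listing a pair twice
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Hypothetical demo data: x3 is nearly a linear function of x1
rng = np.random.default_rng(seed=1)
demo = pd.DataFrame({"x1": rng.uniform(0, 1, 50),
                     "x2": rng.uniform(0, 1, 50)})
demo["x3"] = 2.0 * demo["x1"] + rng.normal(0, 0.05, 50)

pairs = high_corr_pairs(demo.corr())
print(pairs)
```

When two explanatory variables are flagged this way, they carry largely the same information, so including only one of them in the regression is often sufficient.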