EME 210
Data Analytics for Energy Systems

Correlation-Based Variable Selection

Read It: Correlation-Based Variable Selection

One way to select the explanatory variables for a multiple linear regression is through a correlation analysis. By determining which explanatory variables are most highly correlated with the response, you can identify the variables most likely to matter and build models with higher adjusted R² values. Below, we demonstrate how to implement this process in Python.
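As a rough illustration of the idea, the sketch below ranks explanatory variables by the absolute value of their correlation with the response and compares adjusted R² for the resulting subset. Everything in it is an assumption made for demonstration: the synthetic data (where `y` depends on `x1` and `x2` but not `x3`), the cutoff of 0.3, and the helper `adjusted_r2`, which is a hypothetical function written here and not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumed for illustration): y depends on x1 and x2, not x3
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
data['y'] = 3.0 * data['x1'] + 2.0 * data['x2'] + rng.normal(scale=0.5, size=n)

# Correlation of each explanatory variable with the response
corr_with_y = data.corr()['y'].drop('y')

# Rank by absolute correlation, strongest first
ranked = corr_with_y.abs().sort_values(ascending=False)

# Keep variables above an arbitrary, illustrative cutoff
threshold = 0.3
selected = ranked[ranked > threshold].index.tolist()


def adjusted_r2(frame, predictors, response='y'):
    """Hypothetical helper: fit ordinary least squares and return adjusted R^2."""
    X = np.column_stack([np.ones(len(frame)), frame[predictors].to_numpy()])
    y = frame[response].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    n_obs, p = len(y), len(predictors)
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)


print(ranked)
print('Selected:', selected)
print('Adjusted R^2, selected variables:', adjusted_r2(data, selected))
print('Adjusted R^2, all variables:', adjusted_r2(data, ['x1', 'x2', 'x3']))
```

The cutoff is a judgment call, not a rule. In practice you would compare adjusted R² across a few candidate subsets before settling on one.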

 Watch It: Video -  Correlation Variable Selection (10:13 minutes)

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
 

 Try It: DataCamp - Apply Your Coding Skills

Using the pre-coded variables below, calculate and print the correlation matrix. Are any of the variables highly correlated? Which would you include in your multiple linear regression model?

Starter code:

    # libraries
    import pandas as pd
    import numpy as np

    # create some data
    df = pd.DataFrame({'y': np.random.uniform(10.0, 25.0, 50),
                       'x1': np.random.uniform(0.001, 0.999, 50),
                       'x2': np.random.uniform(0.001, 0.999, 50),
                       'x3': np.random.uniform(175, 250, 50),
                       'x4': np.random.uniform(5.7, 6.9, 50),
                       'x5': np.random.uniform(10.0, 25.0, 50)})

    # find correlation matrix
    corrMatrix = ...

    # print correlation matrix
    ...

Solution:

    # libraries
    import pandas as pd
    import numpy as np

    # create some data
    df = pd.DataFrame({'y': np.random.uniform(10.0, 25.0, 50),
                       'x1': np.random.uniform(0.001, 0.999, 50),
                       'x2': np.random.uniform(0.001, 0.999, 50),
                       'x3': np.random.uniform(175, 250, 50),
                       'x4': np.random.uniform(5.7, 6.9, 50),
                       'x5': np.random.uniform(10.0, 25.0, 50)})

    # find correlation matrix
    corrMatrix = df.corr()

    # print correlation matrix
    corrMatrix
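To answer "Are any of the variables highly correlated?" without scanning the matrix by eye, one option is to flatten it into a list of variable pairs. The sketch below is one possible approach, not part of the exercise itself. The seeded generator and the 0.7 cutoff are assumptions chosen for illustration. Because every column here is drawn independently at random, few or no pairs should be expected to pass the cutoff.

```python
import numpy as np
import pandas as pd

# Same columns as the exercise, but with a seeded generator (an assumption,
# added so the result is reproducible)
rng = np.random.default_rng(0)
df = pd.DataFrame({'y': rng.uniform(10.0, 25.0, 50),
                   'x1': rng.uniform(0.001, 0.999, 50),
                   'x2': rng.uniform(0.001, 0.999, 50),
                   'x3': rng.uniform(175, 250, 50),
                   'x4': rng.uniform(5.7, 6.9, 50),
                   'x5': rng.uniform(10.0, 25.0, 50)})

corrMatrix = df.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = np.triu(np.ones(corrMatrix.shape, dtype=bool), k=1)
pairs = corrMatrix.where(upper).stack().dropna()

# Illustrative cutoff for "highly correlated"
threshold = 0.7
strong = pairs[pairs.abs() > threshold]

# Show the strongest pairs first, then any that exceed the cutoff
print(pairs.abs().sort_values(ascending=False).head())
print(strong)
```

Pairs that include `y` indicate candidate explanatory variables, while pairs among `x1` through `x5` show explanatory variables that carry overlapping information.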


 Assess It: Check Your Knowledge

Knowledge Check