EME 210
Data Analytics for Energy Systems


R-squared: Goodness-of-Fit


F-Statistic

Let's start by comparing the hypotheses for a one-way ANOVA test (from Lesson 7) to the hypotheses for a simple linear regression model:

One-Way ANOVA vs. Simple Linear Model

| Hypotheses for a One-Way ANOVA Test | Hypotheses for a Simple Linear Model |
| --- | --- |
| H_0: All population means \mu_i are equal for all k categories | H_0: The model is ineffective (at explaining variability) |
| H_a: At least one \mu_i is not equal to the others | H_a: The model is effective (at explaining variability) |

The concept that links these two sets of hypotheses is that the grouping of data into k categories in the ANOVA is a model. ANOVA looks at the change in mean (of the response variable) over a categorical variable, whereas a simple linear model looks at the change in mean over a quantitative variable. As such, the simple linear model can also use the F-statistic for its hypothesis test:

F = \frac{MSM}{MSE}

where the “Mean Square for the Model” (MSM) replaces “Mean Square for Groups” (MSG) in the numerator and has the equation:

MSM = \frac{SSM}{1}

This is essentially the same as MSG, with the main difference being that the denominator has degrees-of-freedom equal to 1 because there is only one explanatory variable in the simple linear model (X). Similarly, the equation for MSE is modified with different degrees-of-freedom in the denominator to indicate that there are two “unknowns” in the simple linear model: the slope and intercept (the \beta's):

MSE = \frac{SSE}{n-2}

Thus, the F-statistic is conceptually the same as what we saw for ANOVA in Lesson 7:

F = \frac{MSM}{MSE}

The p-value is determined from the F-distribution in the same way as introduced in Lesson 7 as well.
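To make these formulas concrete, here is a minimal sketch of computing the F-statistic and its p-value by hand with NumPy and SciPy. The small dataset and variable names (x, y) are made up for illustration; they are not the course dataset.

```python
import numpy as np
from scipy import stats

# Illustrative data (hypothetical, not the course dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(y)

# Fit the simple linear model y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Sums of squares: explained by the model vs. left in the residuals
SSM = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

# Mean squares: 1 df for the one explanatory variable,
# n - 2 df because the slope and intercept are both estimated
MSM = SSM / 1
MSE = SSE / (n - 2)
F = MSM / MSE

# p-value: area under the F(1, n-2) distribution above the F-statistic
p_value = stats.f.sf(F, 1, n - 2)
print(F, p_value)
```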

Coefficient of Determination: R^2

While the F-statistic above represents the ratio of the variability in the response data that is explained by the model over the unexplained variability (variability outside the model), the coefficient of determination (R^2) is the ratio of the variability in the response data that is explained by the model over the TOTAL variability:

R^2 = \frac{SSM}{SST}

In other words, R^2 gives the proportion of the total variability in the response data that is explained by the model. Thus, it will always be a value between 0 and 1, with larger values indicating a better “goodness-of-fit” of the linear model.

For a simple linear model, R^2 can also be calculated by squaring the correlation coefficient found between the explanatory and response variables:

R^2 = r^2

Since r is between -1 and 1, R^2 is, again, between 0 and 1, with larger values of R^2 corresponding to a stronger correlation. Note that this stronger correlation could be either positive or negative, and R^2 cannot distinguish between the two.
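Both routes to R^2 give the same number, which is easy to verify. A minimal sketch, reusing the illustrative data from the example above:

```python
import numpy as np

# Illustrative data (hypothetical, same as the sketch above)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Route 1: R^2 = SSM / SST from the fitted line
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
SSM = np.sum((y_hat - y.mean()) ** 2)
SST = np.sum((y - y.mean()) ** 2)
print(SSM / SST)

# Route 2: R^2 = r^2, the squared correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(r ** 2)  # matches Route 1
```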

It is important to emphasize that R^2 is only meaningful for LINEAR models, and won't capture the variability that could be explained by models that are non-linear (e.g., curved, sinusoidal, clustered, etc.).

Tabular Presentation of Simple Linear Model Test Results

Similar to how ANOVA test results can be presented in a table, simple linear regression results can also be presented in a way that compares the sources of variation:

| Source of Variation | Degrees of Freedom | Sum of Squares | Mean Squares | F | p-value |
| --- | --- | --- | --- | --- | --- |
| BETWEEN | 1 | SSM | MSM | \frac{MSM}{MSE} | Area under the F-distribution above the F-statistic |
| WITHIN | n-2 | SSE | MSE | | |
| TOTAL | n-1 | SST | | | |
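In Python, a table like this can be generated from a fitted model; one option is statsmodels' anova_lm function, sketched below with a hypothetical DataFrame (the column names x and y are assumptions). Its output lists the model term (BETWEEN) and residual (WITHIN) rows; the TOTAL row is just their sum.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data; column names are illustrative
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]})

# Fit the simple linear model and print its ANOVA table:
# one row for the model term (df = 1), one for the residuals (df = n - 2)
model = smf.ols("y ~ x", data=df).fit()
print(anova_lm(model))
```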

  Watch It: Video - Calculating R-squared (4:40 minutes)

Calculating R-squared
Click here for a transcript.

Hi, in this video, I'm going to show you three different ways to calculate R², the coefficient of determination, using Python. So let's get down to it. The dataset that we're going to use is what we've been using in previous videos: the nuclear power versus drowning dataset. The first option uses the correlation coefficient, or 'corr' here. So we'll calculate it as we've done previously. We'll put in one variable, and the order doesn't matter here. So, we go with drowning, and we'll say 'corr', the function that we have here, and the other variable, in this case, is Nuclear. We'll print this out. We'll say, "Okay, here's our Coefficient, our correlation coefficient, which is going to be equal to this reported value." We'll round it off just so it's not a whole bunch of decimal places, and we see that it's 0.91, so a pretty high correlation. We can find r2 as just the square of that. So we'll say r2 is equal to r squared here, and then we will print that value, so it will be our coefficient of determination, or I could just put r2. We'll go ahead and run that, and we see that it's the square of 0.9081, so, you know, a moderately good r squared value.

Okay, the second option uses the linear regression function, which we've seen previously as well. In this case, we have to import our scipy stats package. We import this as 'stats', and we'll say 'output equals stats.linregress', as we've done before, and we'll have our X variable and our Y, or our explanatory variable and our response variable, here. We'll see what our output looks like. You've seen this before. We've got all these values. The one we want to focus on is our R value here. We'll note that this is equal to our correlation coefficient we have above, and so we can pull that out and square it again. We'll say, print R-squared is equal to output.rvalue, and we will square this in place. Let's also round this just so we don't get this huge string of numbers. So I'll round that also to three decimal places and go ahead and run that. There we go. We get the same thing, 0.812.

Alright, and then the third and final way is using a new method, which you haven't seen before. This is from the stats models package. So we'll import that here, and we have to say 'import statsmodels.api'. We'll import this as 'SM' for shorthand, so we don't have to write 'statsmodels.api' every time. The way this works is a bit more complicated than what we have in the linregress. First, we'll define our X variable. We could call this anything we want, but it's handy to just call it X, and that is drowning, as we've seen before, and our Y variable, and in this case, it's nuclear. So we have our X and our Y, and then we will define a model using this function. 'OLS' stands for ordinary least squares. This is our simple linear model, and so we're calling this function. What we're going to do here is define the simple linear model. We'll have to give it the Y first and then the X, so different than what we had before. We should get in the habit of saying hasconst=True. There might be some reason for saying this is False, but throughout this course, we should always put in hasconst=True. Okay, and then once we define our model, we'll fit it. We'll say mod, as we've defined above, our model, dot fit, we'll do our fit function, and we'll say print res. I'm saying res for shorthand, being the results. So, results of the fit, dot summary. So we're going to print a summary of this model. So we go ahead and run that, and there we go.

You'll note, wait a minute, we've got a negative R-squared. That's not possible. What's going on here? You scroll down a little bit more. We'll see we just have one line here. This line is the slope for drowning, and what we should see, what we'd want to see, is a second line giving us the intercept. So we need to go back, and we need to define an intercept. The way we do that is by overwriting X. So we say, X = sm.add_constant. 'Constant' is another name for the intercept here, so add a constant to X. Okay, so all we're doing with this function here is adding an intercept term from the linear algebra solutions, essentially adding a column of ones in our matrix. And there we go. We see that our R-squared is 0.812, exactly as what we had before. Furthermore, if we scroll down for our constant, we get the intercept value that we saw before with linregress, 613 and some change. We can go up and check that intercept, 6113, some change. And, similarly, the slope for our explanatory variable, in this case, drowning. Additionally, we get the P-value associated with that slope right here, and the confidence interval. I should say the 95% confidence interval, so the limits being 0.025 and 0.975. So we get a lot of stuff in this output here. We also get the F-statistic associated with, or related to, the R-squared, as defined in the lesson here, and our degrees of freedom. So, a lot of nice diagnostic output to look at here. And then we also get the P-value associated with the F-statistic as well. So basically, the area under that F distribution above this F-statistic value.

Okay, so we'll look at similar input, or I should say, similar output in subsequent videos. But we'll end here for now with our three methods for getting our R-squared. Okay.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
The Colab file used in the video is available here. Make sure to sign in with your PSU credentials.
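For reference, the three approaches described in the video might look like the following sketch. The dataset values and column names (Drowning, Nuclear) are placeholders inferred from the narration; the actual Colab file may differ.

```python
import pandas as pd
from scipy import stats
import statsmodels.api as sm

# Placeholder stand-in for the nuclear power vs. drowning dataset
df = pd.DataFrame({"Drowning": [10.0, 12.0, 14.0, 15.0, 17.0, 20.0],
                   "Nuclear":  [300.0, 340.0, 390.0, 400.0, 450.0, 500.0]})

# Option 1: square the correlation coefficient
r = df["Drowning"].corr(df["Nuclear"])
print(round(r ** 2, 3))

# Option 2: square the r-value reported by linregress
output = stats.linregress(df["Drowning"], df["Nuclear"])
print(round(output.rvalue ** 2, 3))

# Option 3: statsmodels OLS; note that y comes first, and the
# intercept (constant) must be added to X explicitly
X = sm.add_constant(df["Drowning"])   # adds the column of ones
y = df["Nuclear"]
res = sm.OLS(y, X, hasconst=True).fit()
print(res.summary())  # R-squared, F-statistic, p-values, CIs
```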


  Assess It: Check Your Knowledge

Knowledge Check