
F-Statistic
Let's start by comparing the hypotheses for a one-way ANOVA test (from Lesson 7) to the hypotheses for a simple linear regression model:
| Hypotheses for a One-Way ANOVA Test | Hypotheses for a Simple Linear Model |
|---|---|
| Ho: All population means $\mu_i$ are equal for all $k$ categories | Ho: The model is ineffective (at explaining variability) |
| Ha: At least one $\mu_i$ is not equal to the others | Ha: The model is effective (at explaining variability) |
The concept that links these two sets of hypotheses is that the grouping of data into $k$ categories in the ANOVA is itself a model. ANOVA looks at the change in mean (of the response variable) over a categorical variable, whereas a simple linear model looks at the change in mean over a quantitative variable. As such, the simple linear model can also use the F-statistic for its hypothesis test:

$$F = \frac{\text{MSM}}{\text{MSE}}$$
where the “Mean Square for the Model” (MSM) replaces the “Mean Square for Groups” (MSG) in the numerator and has the equation:

$$\text{MSM} = \frac{\text{SSM}}{1}$$

This is essentially the same as MSG, with the main difference being that the denominator has degrees-of-freedom equal to 1 because there is only one explanatory variable in the simple linear model ($x$). Similarly, the equation for MSE is modified with different degrees-of-freedom in the denominator to indicate that there are two “unknowns” in the simple linear model: the slope and intercept (the $\beta$'s):

$$\text{MSE} = \frac{\text{SSE}}{n-2}$$
Thus, the F-statistic is conceptually the same as what we saw for ANOVA in Lesson 7:

$$F = \frac{\text{MSM}}{\text{MSE}} = \frac{\text{SSM}/1}{\text{SSE}/(n-2)}$$
The p-value is determined from the F-distribution in the same way as introduced in Lesson 7 as well.
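The calculation above can be sketched in code. This is a minimal illustration using a small made-up dataset (the `x` and `y` values are invented for demonstration), fitting the line by least squares and then computing the sums of squares, mean squares, F-statistic, and p-value from the F-distribution:

```python
# Sketch: F-test for a simple linear model on a hypothetical dataset.
import numpy as np
from scipy import stats

# Made-up example data (not from the lesson).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])
n = len(x)

# Fit the simple linear model y = b0 + b1*x by least squares.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Sums of squares: model (explained), error (unexplained), total.
ss_total = np.sum((y - y.mean()) ** 2)
ss_model = np.sum((y_hat - y.mean()) ** 2)
ss_error = np.sum((y - y_hat) ** 2)

# Mean squares: MSM has 1 df (one explanatory variable);
# MSE has n - 2 df (two estimated coefficients: slope and intercept).
msm = ss_model / 1
mse = ss_error / (n - 2)

f_stat = msm / mse
# p-value: upper-tail area of the F(1, n-2) distribution beyond f_stat.
p_value = stats.f.sf(f_stat, 1, n - 2)

print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")
```

Note that SSM + SSE = SST, so the pieces of variability always add up to the total.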
Coefficient of Determination
While the F-statistic above represents the ratio of the variability in the response data that is explained by the model to the unexplained variability (variability outside the model), the coefficient of determination ($R^2$) is the ratio of the variability in the response data that is explained by the model to the TOTAL variability:

$$R^2 = \frac{\text{SSM}}{\text{SST}}$$

In other words, $R^2$ gives the proportion of the total variability in the response data that is explained by the model. Thus, it will always be a value between 0 and 1, with larger values indicating a better “goodness-of-fit” of the linear model.

For a simple linear model, $R^2$ can also be calculated by squaring the correlation coefficient $r$ found between the explanatory and response variables:

$$R^2 = r^2$$

Since $r$ is between -1 and 1, $R^2$ is, again, between 0 and 1, with larger values of $R^2$ corresponding to a stronger correlation. Note that this stronger correlation could be either positive or negative, and $R^2$ cannot distinguish between the two.

It is important to emphasize that $R^2$ is only meaningful for LINEAR models, and won’t capture the variability that could be explained by models that are non-linear (e.g., curved, sinusoidal, clustered, etc.).
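The two ways of computing $R^2$ can be checked against each other in code. This is a minimal sketch on another made-up dataset, this time with a negative association, to illustrate that $R^2$ loses the sign of the correlation:

```python
# Sketch: R^2 two ways for a simple linear model on hypothetical data.
import numpy as np
from scipy import stats

# Made-up example data with a negative association.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 10.1, 8.2, 5.8, 4.1, 1.9])

# Way 1: square the correlation coefficient r.
r = np.corrcoef(x, y)[0, 1]
r_squared_from_r = r ** 2

# Way 2: explained variability over total variability (SSM / SST).
result = stats.linregress(x, y)
y_hat = result.intercept + result.slope * x
r_squared_from_ss = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# r is negative here, but R^2 is the same either way: the sign is lost.
print(f"r = {r:.3f}, R^2 = {r_squared_from_r:.3f}")
```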
Tabular Presentation of Simple Linear Model Test Results
Similar to how ANOVA test results can be presented in a table, simple linear regression results can also be presented in a way that compares the sources of variation:
| Source of Variation | Degrees of Freedom | Sum of Squares | Mean Squares | F | p-value |
|---|---|---|---|---|---|
| BETWEEN (Model) | $1$ | SSM | MSM = SSM/1 | MSM/MSE | Area under F-distribution and greater than F-statistic |
| WITHIN (Error) | $n-2$ | SSE | MSE = SSE/(n-2) | | |
| TOTAL | $n-1$ | SST | | | |