EME 210
Data Analytics for Energy Systems

The F-statistic


ANOVA breaks up the total variability of the sample values into two kinds of variability: 1.) the variability within the groups and 2.) the variability between the groups. If the variability between the groups is large compared to the variability within the groups, we have evidence that at least one group mean differs from the others. In equation form, this idea is expressed as:

$$SS_{Total} = SSG + SSE$$

The terms are defined as:

  • $SS_{Total}$: the total sum of squared deviations. This takes the distance of each data point from the mean of all data, squares that distance (so they’re all positive), and adds them all up. Thus, it measures the TOTAL VARIABILITY of the data.

  • $SSG$: the sum of squared deviations for groups. This takes the distance of each group’s mean from the mean of all data (ungrouped), squares that distance, multiplies it by the number of data values in that group, and adds the values up. Thus, it measures the VARIABILITY BETWEEN GROUPS.

  • $SSE$: the sum of squares for error. For each group, this takes the distance of each data point in the group from the mean of that group, squares that distance, and adds them all up. Thus, it measures the VARIABILITY WITHIN THE GROUPS.
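The decomposition above can be checked numerically. Below is a minimal sketch using NumPy with three hypothetical groups of data (the values are made up for illustration):

```python
import numpy as np

# Hypothetical data: three groups of measurements
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([1.0, 2.0, 3.0])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# SS_Total: squared deviations of every data point from the grand mean
ss_total = ((all_data - grand_mean) ** 2).sum()

# SSG: squared deviation of each group's mean from the grand mean,
# counted once for each data value in that group
ssg = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# SSE: squared deviations of each data point from its own group's mean
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

print(ss_total, ssg + sse)  # the two quantities agree
```

Running this confirms that the between-group and within-group sums of squares add up exactly to the total sum of squares.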

Our goal with the ANOVA test is to compare the variability BETWEEN groups to the variability WITHIN groups. However, we can’t properly compare $SSG$ to $SSE$ directly, since these use different amounts of data. So we need to look at the “mean square” for each:

$$\begin{align} \text{Mean square for groups: } MSG &= \frac{SSG}{k-1} \\ \text{Mean square for error: } MSE &= \frac{SSE}{n-k} \end{align}$$

Here, $n$ is the total sample size (number of data values in all groups together) and $k$ is the number of groups. The denominators, $k-1$ and $n-k$, represent the “degrees of freedom” (“df” or “dof” for short) for each term. We can then find the F-statistic as:

$$F = \frac{MSG}{MSE}$$

which effectively compares the variability BETWEEN groups (numerator) to the variability WITHIN groups (denominator). Thus, if the null hypothesis is true and the group means are actually equal, we expect the F-statistic to be about 1. Larger values of the F-statistic indicate a larger $MSG$ relative to $MSE$, and thus evidence that at least one group mean differs from the others.
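Putting the pieces together, the F-statistic can be computed by hand. Below is a self-contained sketch using the same hypothetical three-group data as before:

```python
import numpy as np

# Hypothetical data: k = 3 groups, n = 9 data values total
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([1.0, 2.0, 3.0])]

n = sum(len(g) for g in groups)   # total sample size
k = len(groups)                   # number of groups
grand_mean = np.concatenate(groups).mean()

# sums of squares, as defined above
ssg = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

msg = ssg / (k - 1)   # mean square for groups
mse = sse / (n - k)   # mean square for error
f_stat = msg / mse    # the F-statistic
print(f_stat)
```

Here the within-group spread is small relative to the spread of the group means, so the F-statistic comes out much larger than 1.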

The F-distribution and p-value

The p-value for our ANOVA test depends on the F-statistic above and the degrees of freedom. The degrees of freedom determine the shape of the probability density function for F (an example is pictured below). The p-value is then the area under the probability density function GREATER THAN the F-statistic calculated from the sample data (the shaded area in the figure below).

[Figure: F-distribution and p-value]
In this figure, the curve is the probability density function for F with degrees of freedom k − 1 = 4 and n − k = 95. The p-value is the shaded area under the curve and greater than the F-statistic value of 2.5, in this example.
Credit: © Penn State is licensed under CC BY-NC-SA 4.0
With the degrees of freedom staying constant, a larger F-statistic corresponds to a lower p-value.
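In Python, this tail area can be computed with the F-distribution's survival function. A minimal sketch, assuming SciPy is available, using the degrees of freedom and F-statistic from the figure above:

```python
from scipy import stats

f_stat = 2.5        # F-statistic computed from the sample
dfn, dfd = 4, 95    # degrees of freedom: k - 1 and n - k

# p-value: area under the F density to the right of the F-statistic
p_value = stats.f.sf(f_stat, dfn, dfd)
print(p_value)

# with the degrees of freedom held constant, a larger
# F-statistic gives a smaller p-value
assert stats.f.sf(5.0, dfn, dfd) < p_value
```

The survival function `stats.f.sf` is equivalent to `1 - stats.f.cdf` but more numerically accurate for small tail areas.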

One-way ANOVA Table

To facilitate an understanding of where the variability in the data is coming from, sometimes the results from an ANOVA test are presented in a table, typically taking the following format:

The typical format of a one-way ANOVA table.

| Source of Variation | Degrees of Freedom | Sum of Squares | Mean Squares | F | p-value |
| --- | --- | --- | --- | --- | --- |
| BETWEEN | $k-1$ | $SSG$ | $MSG$ | $\frac{MSG}{MSE}$ | Area under F-distribution and greater than F-statistic |
| WITHIN | $n-k$ | $SSE$ | $MSE$ | | |
| TOTAL | $n-1$ | $SS_{Total}$ | | | |
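In practice, the F-statistic and p-value in this table can be obtained directly with `scipy.stats.f_oneway`, which performs a one-way ANOVA. A minimal sketch with the same hypothetical groups used earlier:

```python
from scipy import stats

# Hypothetical data: three groups
group_a = [4.0, 5.0, 6.0]
group_b = [7.0, 8.0, 9.0]
group_c = [1.0, 2.0, 3.0]

# One-way ANOVA: returns the F-statistic and its p-value
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```

This reproduces the by-hand F-statistic from the earlier sketch; the degrees of freedom ($k-1 = 2$ and $n-k = 6$) are inferred from the number of groups and data values passed in.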

 FAQ