EME 210
Data Analytics for Energy Systems

Introduction to Chi-Square Tests


Read It: Chi-Square Tests

Imagine you have some data that is separated into categories. You are interested in determining whether the frequencies (counts) in each category of your sample match those of the population. To answer this question, you can conduct a chi-square test! Here, "chi" is the Greek letter χ, which is pronounced "kai". For example, in the figure below, we show two histograms: the gray histogram is the theoretical distribution of rolls of two dice, while the blue is a sample of rolls. We could make some qualitative statements that compare the two plots. For instance, our sample includes more rolls of eight than the population, but the same number of rolls of two. Often, however, we want a more robust comparison than visual estimation. For this, we can conduct a chi-square test to assess whether the counts of each possible outcome in the sample dataset differ from those expected under the true population by more than chance alone would explain.

Two Histograms of a Chi-Square Test (count by outcome)
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0

To conduct a chi-square test, we first need to determine the hypotheses we will be testing. These hypotheses will look slightly different from those you learned in Lesson 5. As discussed in Lesson 5, the null hypothesis always uses equals signs. However, unlike the hypotheses in Lesson 5, the null hypothesis for a chi-square test sets several category proportions equal to their population values at once. Further, the alternative hypothesis is often written completely in words (i.e., no mathematical symbols). Generally, you can write something along the lines of "at least one proportion above is not as specified". An example of applying these hypotheses to the dice roll example is shown below.

Example of Hypotheses for Chi-Square Tests

$$H_0: p_2 = \frac{1}{36},\ p_3 = \frac{2}{36},\ p_4 = \frac{3}{36},\ \ldots,\ p_7 = \frac{6}{36},\ \ldots,\ p_{12} = \frac{1}{36}$$

$$H_A: \text{At least one of the } p\text{'s above is not as specified}$$
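The proportions in the null hypothesis come from counting how many of the 36 equally likely outcomes of two dice give each sum. The following is a minimal sketch of that counting, assuming two fair six-sided dice; the names `sum_counts` and `null_props` are illustrative and not part of the lesson.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice
# (fairness is an assumption) and count how many produce each sum.
sum_counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Null-hypothesis proportion for each sum, stored as an exact fraction.
null_props = {total: Fraction(count, 36) for total, count in sum_counts.items()}

# Print each proportion as a count out of 36, matching the form used in H0.
for total in sorted(sum_counts):
    print(f"p_{total} = {sum_counts[total]}/36")
```

Because the proportions describe every possible outcome, they add up to exactly 1, which is a quick sanity check on any set of null proportions.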

To conduct a chi-square test, you need to implement the following steps: 

  1. Define the hypotheses
  2. Compute the expected count for each category: $n p_i$, where $n$ is the number of observations and $p_i$ is the population proportion
  3. Compute the sample chi-square statistic: $\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
  4. Find the p-value based on the idealized chi-square distribution, shown in the figure below
  5. Make a conclusion based on the p-value
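The steps above can be sketched in a few lines of Python for the dice example. This is only an illustration under stated assumptions: the observed counts are made up (they are not the data behind the figure), the 0.05 significance level is assumed, and the helper `chi2_right_tail_even_df` is a hypothetical name. The helper uses a closed-form expression for the right-tail area that holds only when the degrees of freedom are even, which happens to be the case here (11 categories, so 10 degrees of freedom).

```python
import math

# Step 1 (hypotheses): number of ways, out of 36, to roll each sum with two fair dice.
ways = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}

# HYPOTHETICAL observed counts for 360 rolls, invented purely for illustration.
observed = {2: 10, 3: 18, 4: 33, 5: 38, 6: 47, 7: 62, 8: 58, 9: 41, 10: 27, 11: 17, 12: 9}

n = sum(observed.values())

# Step 2: expected count for each category, n * p_i.
expected = {s: n * ways[s] / 36 for s in ways}

# Step 3: chi-square statistic, the sum of (observed - expected)^2 / expected.
chi2_stat = sum((observed[s] - expected[s]) ** 2 / expected[s] for s in ways)

# Step 4: p-value as the area to the right of the statistic, with df = k - 1.
df = len(ways) - 1


def chi2_right_tail_even_df(x, df):
    """Right-tail area P(X > x) of a chi-square distribution; valid for EVEN df only."""
    if df % 2 != 0:
        raise ValueError("this closed form only applies to even degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


p_value = chi2_right_tail_even_df(chi2_stat, df)

# Step 5: conclusion, using an ASSUMED significance level of 0.05.
alpha = 0.05
print(f"chi-square = {chi2_stat:.3f}, df = {df}, p-value = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: at least one proportion is not as specified.")
else:
    print("Fail to reject H0: the sample is consistent with the specified proportions.")
```

In practice, a statistics library would usually handle steps 3 and 4 for any number of degrees of freedom; for example, SciPy provides `scipy.stats.chisquare(f_obs, f_exp)`, which returns both the statistic and the p-value.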

Generally, the chi-square distribution is used to find the p-value and make a subsequent conclusion. The chi-square distribution is interesting because it takes only non-negative values and is controlled by a single parameter, the "degrees of freedom", which we discussed in Lesson 6. Below, we show the distribution with 4 degrees of freedom. To calculate the p-value, we locate the sample statistic on the horizontal axis and calculate the area under the curve to its right (this area is filled with red in the plot below). Generally, we will calculate the degrees of freedom as $k - 1$, where $k$ is the number of categories in our dataset.

Chi-Square Distribution Plot, df = 4
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0
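To make the "area to the right" idea concrete, the sketch below approximates that area numerically for a chi-square distribution with 4 degrees of freedom. The test statistic of 7.0 is a hypothetical value chosen for illustration, and the function names, the integration cutoff, and the number of steps are all assumptions of this sketch rather than part of the lesson.

```python
import math


def chi2_pdf(x, df):
    """Density of the chi-square distribution with df degrees of freedom, for x > 0."""
    return x ** (df / 2 - 1) * math.exp(-x / 2) / (2 ** (df / 2) * math.gamma(df / 2))


def right_tail_area(stat, df, upper=80.0, steps=100_000):
    """Approximate the area under the density from stat to upper (midpoint rule).

    The cutoff `upper` stands in for infinity; the density beyond it is negligible
    for the small df used here.
    """
    width = (upper - stat) / steps
    return sum(chi2_pdf(stat + (i + 0.5) * width, df) for i in range(steps)) * width


df = 4
stat = 7.0  # HYPOTHETICAL sample chi-square statistic
p_value = right_tail_area(stat, df)
print(f"p-value for chi-square = {stat} with df = {df}: {p_value:.4f}")
```

A larger statistic sits further to the right, leaving less area beyond it, so larger chi-square values give smaller p-values.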

