Chi-Square Test: Visualization

Read It: Visualizing Chi-Square Distributions

When creating visualizations for chi-square tests, you need to provide three key pieces of information: (1) the randomization distribution of chi-square statistics, (2) the sample chi-square statistic, and (3) the idealized chi-square distribution. In the video below, we walk you through adding each of these parts to a single ggplot graph.

Watch It: Video - Visualization (7:06 minutes)

Click here for a transcript.

Welcome back to our video series on chi-square tests. In this video, we're going to talk about how we can visualize our chi-square randomization distribution, as well as an idealized chi-square distribution. And so, we're essentially going to make a histogram, but we're going to use a different variable on the y-axis than we have in the past. So let's go ahead and get started. So, with our ggplot, our data is chi-square df. And this is part of the reason why up here we convert it into a data frame. And then we can say geom histogram, and then we have our aes statement. Our x is just chi-square, which is what we called our data. And then normally, we would leave this as it is and get a count. But we can say y equals after_stat, with density in quotes. So this essentially tells Python to plot the density of the data, rather than the count that is traditionally shown in a histogram. And then we can also define an outline color as black, and we can specify our bin width as 0.75. And then as we've done in the past, we can add a vertical line for our sample statistic, which is chi-square samp, where the color is blue. And we can say line type dashed. And so here we have our chi-square randomization distribution as a histogram, along with where the sample statistic falls. And because of the way the p-value works, all of the data that is above our sample statistic goes into our p-value, which is why we have such a high one. You can see that a lot of that data is above the sample statistic.
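To make the switch from counts to density more concrete, here is a minimal NumPy sketch of what the histogram heights and the p-value represent. It is an illustration under assumptions, not the code from the video: the 52 draws, the observed counts, and the names `chisq`, `chisq_samp`, and `p_value` are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 52 draws spread over four equally likely suits
n_draws, n_suits, n_sims = 52, 4, 1000
expected = n_draws / n_suits

# Randomization distribution: one chi-square statistic per simulated table
sim_counts = rng.multinomial(n_draws, [1 / n_suits] * n_suits, size=n_sims)
chisq = ((sim_counts - expected) ** 2 / expected).sum(axis=1)

# Hypothetical sample statistic computed from made-up observed counts
observed = np.array([15, 12, 14, 11])
chisq_samp = ((observed - expected) ** 2 / expected).sum()

# Bar heights as counts and as densities, using the video's bin width of 0.75
bin_width = 0.75
bins = np.arange(0, chisq.max() + bin_width, bin_width)
counts, edges = np.histogram(chisq, bins=bins)
density, _ = np.histogram(chisq, bins=bins, density=True)

# p-value: share of randomization statistics at or above the sample statistic
p_value = np.mean(chisq >= chisq_samp)

print("total area under the density bars:", (density * bin_width).sum())
print("p-value:", p_value)
```

Each density bar is the corresponding count divided by (number of simulations × bin width), so the bars have a total area of 1. That shared scale is what allows an idealized probability density curve to be drawn on top of the histogram.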

But often it can be very helpful to show the idealized chi-square distribution. And so a chi-square distribution, similar to the normal distribution that we learned in a previous lecture, is an idealized distribution that depends only on the degrees of freedom. And it'll generally follow this right-skewed shape. So before we can actually get into the plotting, we do need to do a bit of prep work. First we need to define our degrees of freedom. And we're going to define this as 3. And officially, this is the number of categories minus one. So we've got four suits, so four categories, and four minus one is three. And then we also need to create new columns in our chi-square df. So we need one for x pdf, and this is just a series of numbers that match the x-axis we see here. So we do np dot linspace, from 0 to 25, with 1000. Effectively, this creates 1,000 evenly spaced numbers from 0 to 25. And then we need the y pdf's. And this is where we actually figure out what the chi-square distribution that fits this data is. And so we say stats dot chi2 dot pdf. We give it our idealized x data, and we give it the degrees of freedom.
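The prep work described above might look like the following sketch. The column names `chisq`, `x_pdf`, and `y_pdf` are assumptions based on how they are spoken in the video, and the `chisq` column here is filled with stand-in values rather than real randomization results. Note that adding a 1,000-value column this way assumes the data frame already has 1,000 rows.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Degrees of freedom: number of categories (four suits) minus one
deg_f = 4 - 1

# Stand-in for the randomization results; the real chisq_df would come
# from the randomization procedure and is assumed to have 1,000 rows
chisq_df = pd.DataFrame(
    {'chisq': stats.chi2.rvs(deg_f, size=1000, random_state=0)}
)

# 1,000 evenly spaced x values from 0 to 25, matching the plot's x-axis
chisq_df['x_pdf'] = np.linspace(0, 25, 1000)

# Height of the idealized chi-square density at each x value
chisq_df['y_pdf'] = stats.chi2.pdf(chisq_df['x_pdf'], deg_f)

print(chisq_df.head())
```

With 3 degrees of freedom, the curve starts at 0, peaks near x = 1, and then tails off to the right, which is the right-skewed shape mentioned in the video.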

And so, now if we look at just the first five rows, we can see we've got our original chi-square sampling distribution, then we've got idealized x and y data. And so, if we want to plot this we can just continue on with our plot that we did last up here. And essentially, we can just add another line plot on top of it, where our x value is now x pdf, our y value is now y pdf. The color is blue, and I'm going to give it a size to make it a little bit more obvious. And so, here we can see the final plot where we've got our chi-squared distribution. We've got our idealized curve, here. And we can see that it does a pretty good job of following the data. And that is in effect, a way that we can see what the idealized chi-square curve is going to look like.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Try It: GOOGLE COLAB

Open the Google Colab file used in the video here.
Go to the Colab file and click "File", then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, try to edit the following code to create a visualization of the data you collected in the previous DataCamp exercise.

Note: You must be logged into your PSU Google Workspace in order to access the file.

import pandas as pd

# Copy your data from the previous DataCamp exercise
table = pd.DataFrame({'Suit' : ['Diamonds', 'Hearts', 'Clubs', 'Spades'],
                      'Count' : [0,0,0,0]})
 
# Rerun randomization procedure code
 
# Create Visualization
deg_f = ... 
 
chisq_df['x_pdf'] = ...
chisq_df['y_pdf'] = ...
 
(ggplot(chisq_df) +
    geom_histogram(...) +
    geom_vline(...) +
    geom_line(...)
)

Once you have implemented this code on your own, come back to this page to test your knowledge.
