Chi-Square Test: Randomization Procedure

Read It: Randomization Procedure

Often, we will not have a large enough sample to conduct the chi-square test following the steps laid out in the previous discussion. In these cases, it is possible to conduct a randomization procedure to simulate additional data from a single sample and conduct a chi-square test. Below, we will demonstrate this process with a deck of cards. Our hypotheses are:

$H_0: p_{diamonds} = p_{hearts} = p_{clubs} = p_{spades} = 0.25$

$H_a$: At least one of the proportions above is not as specified

We will start with some data collection in the first video. Then, in the second video, we will demonstrate the randomization procedure. Finally, the third video will show the p-value calculation.

Watch It: Video - Data Collection (4:59 minutes)

Transcript:

Hello, and welcome to the lesson on chi-squared tests. So in this lesson the first part we're going to talk about chi-square tests, and in this video we're going to talk about how we can develop a randomization procedure for those chi-squared tests. Now the first step of any chi-square test, and for the majority of tests that we do in this course, is defining our hypotheses.

So here in Google Colab I have already defined our hypotheses. We have our null hypothesis that each suit in this deck of cards that I have is set to 25 percent each. In other words, this is a fair deck. There's the same percentage of hearts, as diamonds, as clubs, as spades. And then our alternative hypothesis is that one of those proportions is not as specified. It doesn't tell us which one, just that one of them is not. So, before we get started, I'm going to go ahead and draw some data from this deck of cards. So I'll go ahead and give it a shuffle. And essentially what I'm going to do is fill in this table here where I've got all zeros and I'm counting how many diamonds, hearts, clubs, and spades we get. So I'm just going to draw the top card from the deck until we've got about 20 cards drawn. And so, the first card I drew was a spade. So I come in here, increase that to one, got another spade, a heart, a club, another heart, heart, a club, spade, another spade. So right now we haven't got a lot of diamonds. Here's a heart, another heart, finally, a diamond.

So we've got about 10 so far. So here's another diamond, diamond, club, diamond, diamond, heart, heart. Oh, I did that wrong. This should be a seven. Spade, diamond. And I will go ahead and stop there. We've got just over 20 data points there, so now we're going to continue with the coding portion of the lecture. And so, we now have our table, and we're writing this table in a dictionary, that's what these curly brackets mean. And then we're converting it to a data frame with the command pd dot DataFrame and then we're going to print the table. So I've already got my libraries added. So I'm just going to go ahead and print this table, which is essentially giving us the count. So we've got six diamonds, seven hearts, three clubs, five spades. And our goal with this will be to determine whether or not these observed proportions are essentially close enough to a 25 percent proportion, or whether they're not. And, you know, one sample isn't great in terms of our statistical analysis, and so what we're going to do is create a randomization procedure, similar to what we've done before, so that we can generate more samples.
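The table described above can be sketched as follows, using the counts read out in the video. The variable name `prop_obs` is an assumption added here for illustration; it simply shows the observed proportions that will be compared against 0.25.

```python
import pandas as pd

# Observed counts from the 21 cards drawn in the video
table = pd.DataFrame({'Suit': ['Diamonds', 'Hearts', 'Clubs', 'Spades'],
                      'Count': [6, 7, 3, 5]})
print(table)

# Observed proportion of each suit (hypothetical helper variable),
# for an informal comparison against the hypothesized 0.25
prop_obs = table['Count'] / table['Count'].sum()
print(prop_obs.round(3).tolist())
```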

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Watch It: Video - Randomization (8:40 minutes)

Transcript:

And so, I've laid out a few steps here. So our first step is to calculate the sample statistic. And in particular, our chi-squared statistic is the sum of this fraction. We've got our observed counts minus our expected counts squared, divided by the expected counts, and then all summed together.
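Written out, the statistic described above can be stated as follows. The symbols $O_i$, $E_i$, $k$, $n$, and $p_i$ are notation assumed here for illustration and do not come from the video.

$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n \, p_i$

Here $O_i$ is the observed count in category $i$, $E_i$ is the expected count under the null hypothesis, $k$ is the number of categories (the four suits), $n$ is the sample size, and $p_i$ is the hypothesized proportion for category $i$.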

And so, in order to calculate this chi-square sample statistic, first we need to get our observed counts and our expected counts. Our observed counts are what we see here, so we need to calculate the expected counts and then we need to do this summation. And so the first thing that we'll do is determine the sample size. So I'm going to take this table variable up here, I'm going to do the count column, so table dot count. And this is referring to the column count. And then I'm going to sum that. So essentially, by summing all of our counts we can actually get our sample size. And then I can print that, so we can see what it looks like - twenty-one cards drawn.

Next we need to get the expected counts. And so, I'm going to create a new array using this numpy dot array command, which is essentially just all of my expected proportions. Now this is what we wrote up here in our null hypothesis, 0.25, and it's the same for each category. And so, that is my expected probability, my proportion. And so then, my count exp is just my sample size, times my proportions. And we can print that at the end.

And so we can see that it's all the same because they all have the same proportion. But the chi-square test compares counts to each other. So we need to somehow take that proportion into account, and that's what we're doing here. So in theory, we would have 5.25 cards of each suit. Now that isn't actually possible, but that is sort of the ideal count based off of our sample size, 21. And so, then the next step is to actually calculate the chi-square statistic. So I'm going to call this chi-square underscore samp, because it's my sample statistic. I'm going to do two sets of parentheses and in the first parentheses I'm going to take my observed count, so this is following this formula up here. So my observed count minus my expected count, squared. So in Python when we want to do an exponent, we do two asterisks. And then I divide that by my count exp. And then outside of all the parentheses, I sum it. So we can run that. And now I have my sample chi-square value as 1.66.
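As a minimal sketch of this first step, the snippet below recomputes the sample statistic from the counts in the video. The variable `contrib` is a hypothetical name added here to show each suit's contribution to the sum.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'Suit': ['Diamonds', 'Hearts', 'Clubs', 'Spades'],
                      'Count': [6, 7, 3, 5]})

n = table['Count'].sum()                 # sample size: 21
p = np.array([0.25, 0.25, 0.25, 0.25])   # proportions under the null hypothesis
count_exp = n * p                        # expected counts: 5.25 for each suit

# Per-suit contribution (observed - expected)^2 / expected (hypothetical name)
contrib = (table['Count'] - count_exp) ** 2 / count_exp
chisq_samp = contrib.sum()               # 8.75 / 5.25, about 1.67

print(contrib.round(3).tolist())
print(round(chisq_samp, 2))
```

The exact value is 8.75 / 5.25 = 1.666..., which rounds to 1.67 and truncates to the 1.66 read out in the video. Most of the total comes from clubs, the suit furthest from its expected count.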

And so, this is the first step to developing our randomization procedure. The next step is actually implementing that randomization. So once again we're going to be using np dot random dot choice, which we're quite familiar with at this time. And in particular, we're going to set replace equals true. And so, the first step within step two is to initialize the variables. So we need to tell it how many iterations we want to do. And I'm going to set up an empty array.

And then we can start the procedure. So we can say for i in range of capital N, similar to what we've done before, and then I'm going to create a new variable that I call sim suits. And this is what's going to randomly draw a suit from our options. So we can say np dot random dot choice, and I'm just going to draw from table suit. And if we scroll back up here, we can see that the suit column in our table data frame is diamonds, hearts, clubs, or spades. So it's just going to draw one of those categories, similar to drawing a card. Then, slightly different from what we've done before, we need to define the probability at which to draw those suits, twenty-five percent each. We define how many we want to draw, our sample size, lowercase n. And we set replace equals true, so that we don't run out of suits. So this will be our simulated draws, but we need to make sure those get into counts, which is what the chi-squared test compares. And so, we can say sim underscore counts. And so, in square brackets I'm going to say sum of sim suits equals equals diamonds. And then comma, our next point in the array is the sum of where sim suits equals equals hearts. And I missed a parenthesis right there. So each of these summations needs its own parentheses so that we can read it. I'm going to enter down in between; because it's still wrapped in the square brackets, Python knows it's still part of the same command. So then I do sim suits equals equals clubs. And then finally the sum of sim suits equals equals spades. And so this will be my simulated count variable.
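A single pass through the loop body described above might look like the sketch below. Two things here are assumptions made for illustration: the suits are stored in a plain NumPy array named `suits` rather than read from the table, and a fixed seed is set so the draw can be repeated.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so the example is repeatable

suits = np.array(['Diamonds', 'Hearts', 'Clubs', 'Spades'])
p = np.array([0.25, 0.25, 0.25, 0.25])   # null-hypothesis probabilities
n = 21                                   # sample size from the video

# Draw n suits with replacement, each with probability 0.25
sim_suits = np.random.choice(suits, p=p, size=n, replace=True)

# Convert the simulated draws into one count per suit
sim_counts = np.array([np.sum(sim_suits == s) for s in suits])

print(sim_suits[:5])
print(sim_counts, sim_counts.sum())
```

Whatever the individual counts turn out to be, they always add up to the sample size, because every simulated card belongs to exactly one suit.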

And then last but not least, we need to use that simulated count variable to calculate our sampling distribution, our randomization chi-squared. So this is where our empty chi-square array comes in. Set the ith term equal to sim counts minus count exp, which is what we defined up here, squared, divided by count exp, and then summed together.

So, we can run this, and then we can see what the chi-square looks like, just an array, a very long array. So I won't keep that there but good to know that it worked. And so, then the next step is actually calculating our p-value.
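Putting the pieces of the randomization step together, a minimal sketch of the full loop could look like this. The seed and the standalone `suits` array are assumptions added for repeatability and self-containment; the loop itself follows the procedure narrated above.

```python
import numpy as np

np.random.seed(1)  # assumption: fixed seed so the example is repeatable

suits = np.array(['Diamonds', 'Hearts', 'Clubs', 'Spades'])
p = np.array([0.25, 0.25, 0.25, 0.25])
n = 21                   # sample size (lowercase n)
count_exp = n * p        # expected counts under the null hypothesis

N = 1000                 # number of iterations (capital N)
chisq = np.empty(N)      # one simulated chi-square statistic per iteration

for i in range(N):
    sim_suits = np.random.choice(suits, p=p, size=n, replace=True)
    sim_counts = np.array([np.sum(sim_suits == s) for s in suits])
    chisq[i] = ((sim_counts - count_exp) ** 2 / count_exp).sum()

# Peek at the first few values instead of printing the whole array
print(chisq[:5])
print(chisq.min(), chisq.max())
```

Every simulated statistic is strictly positive here: the expected count of 5.25 can never be matched by a whole number of cards, so each simulated sample deviates at least a little from the expected counts.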

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Watch It: Video - P-Value (2:39 minutes)

Transcript:

So, before we can go in and calculate the p-value the way that we have in the past, we need to convert this array into a data frame, and this will help with visualization later. And so, we can say chi-square df is just pd dot DataFrame of chi-square. And our columns, we'll just call them chi-square. So we can run that, and then when we go to actually calculate our p-value, it is the length of chi-square df, in which chi-square df on the chi-square column is greater than or equal to chi-square samp, and then divided by capital N, the number of iterations. So we've got an erroneous space there. And so, we can print this, and we can see that our p-value is 0.176.
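The p-value calculation described above can be sketched as follows. To keep the block short, the simulated counts are generated with `np.random.multinomial`, which is a substitution made here and not the loop used in the video; it draws the four suit counts for each simulated sample directly. The seed and the name `p_value_alt` are likewise assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(2)  # assumption: fixed seed so the example is repeatable

p = np.array([0.25, 0.25, 0.25, 0.25])
n, N = 21, 1000
count_exp = n * p
chisq_samp = ((np.array([6, 7, 3, 5]) - count_exp) ** 2 / count_exp).sum()

# Shortcut (not the video's loop): each row holds the four suit counts
# of one simulated sample of n cards
sim_counts = np.random.multinomial(n, p, size=N)
chisq = ((sim_counts - count_exp) ** 2 / count_exp).sum(axis=1)

# Share of simulated statistics at least as large as the sample statistic
chisq_df = pd.DataFrame(chisq, columns=['Chi-Square'])
p_value = len(chisq_df[chisq_df['Chi-Square'] >= chisq_samp]) / N

# Equivalent calculation directly on the array (hypothetical name)
p_value_alt = (chisq >= chisq_samp).mean()

print('p-value: ', p_value)
```

Because the samples are random, the printed value depends on the seed and changes from run to run, so it need not match the figure read out in the video.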

And so, we can write a conclusion. We can say that the p-value is greater than 0.05, which is our pretty standard significance level. So we fail to reject the null hypothesis. In other words, there is not sufficient evidence to say that any of the proportions differ from 0.25, so the data are consistent with the proportions being as specified. And this is how we can then conclude that chi-squared test.
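The decision rule used in the conclusion above can be sketched as a small function. The name `decide` and its default `alpha` argument are hypothetical and introduced only for this example.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value (hypothetical helper)."""
    if p_value < alpha:
        return 'reject the null hypothesis'
    return 'fail to reject the null hypothesis'

# p-value reported in the video
print(decide(0.176))  # prints "fail to reject the null hypothesis"
```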

Try It: DataCamp - Apply Your Coding Skills

Use the card drawing simulator linked here to draw cards for your own sample. Then, apply the randomization procedure for the chi-square test.

# libraries
import numpy as np
import pandas as pd

# draw cards
table = pd.DataFrame({'Suit' : ['Diamonds', 'Hearts', 'Clubs', 'Spades'],
                      'Count' : [0,0,0,0]})

# STEP 1: calculate sample statistic
n = ...
p = np.array([...])
count_exp = ...
chisq_samp = ...

# STEP 2: implement randomization procedure

# initializing the variables
N = 1000
chisq = np.empty(N)

# start the procedure
for i in range(N):
  sim_suits = ...
  sim_counts = [sum(sim_suits == 'Diamonds'), sum(sim_suits == 'Hearts'), 
                sum(sim_suits == 'Clubs'), sum(sim_suits == 'Spades')]
  chisq[i] = ...

# STEP 3: calculate p-value
chisq_df = pd.DataFrame(chisq, columns = ['Chi-Square'])
print('p-value: ', ...)

# libraries
import numpy as np
import pandas as pd

# draw cards
table = pd.DataFrame({'Suit' : ['Diamonds', 'Hearts', 'Clubs', 'Spades'],
                      'Count' : [6,7,3,5]})

# STEP 1: calculate sample statistic
n = table.Count.sum()
p = np.array([0.25, 0.25, 0.25, 0.25])
count_exp = n*p
chisq_samp = ((table.Count - count_exp)**2/count_exp).sum()

# STEP 2: implement randomization procedure

# initializing the variables
N = 1000
chisq = np.empty(N)

# start the procedure
for i in range(N):
  sim_suits = np.random.choice(table['Suit'], p = p, size = n, replace = True)
  sim_counts = [sum(sim_suits == 'Diamonds'), sum(sim_suits == 'Hearts'),
                sum(sim_suits == 'Clubs'), sum(sim_suits == 'Spades')]
  chisq[i] = ((sim_counts - count_exp)**2/count_exp).sum()

# STEP 3: calculate p-value
chisq_df = pd.DataFrame(chisq, columns = ['Chi-Square'])
print('p-value: ', len(chisq_df[chisq_df['Chi-Square'] >= chisq_samp])/N)
