Welcome back to the videos on chi-square testing. In this video, we're going to go over another way that you might encounter chi-square testing. In the previous set of videos, we had a deck of cards that had equal proportions across each of the suits. But oftentimes our populations won't have equal proportions across all groups. So in this video, I'm going to show you how you can do a chi-square test where the proportions differ across groups.
So here in this code, which I'm calling chi-square test part two, we've got some data on undergraduate enrollment in the College of Earth and Mineral Sciences in the Fall of 2022. Our goal is to compare the enrollment in the whole College of Earth and Mineral Sciences to enrollment in one of the courses we taught during the academic year. Our null hypothesis is that each proportion in the course matches that major's share of total EMS enrollment. We can see that each major, EBF, Energy Engineering, Environmental Systems, and so on, has a different proportion of students. And our alternative hypothesis is that at least one of these proportions is not as specified.
So we can go ahead and run the code that I've already got set up, which reads in the first-week survey data. We can see that we've extracted just the majors from that survey. Then we can start by figuring out the counts. Before, we already knew the counts right off the bat, but in this case we need to actually extract the counts, so we use the dot value counts command to do that. We can see that we've got some Other data, and we've got this Energy Engineering, Other entry, which isn't going to work with our existing data set because we don't have any proportion specified for Other or for dual majors. So we can go ahead and update this. First, we're going to account for this one dual major here. We still want them to count towards the Energy Engineering count, but we don't need the Other part of the dual major, so we can say dot replace. Essentially, we're just going to replace comma Other with nothing, an empty pair of quotation marks. Then we also need to remove these Other majors. We don't know what major they are, so we can't count them towards any of the existing majors. So we can just say that we want only the majors that are not equal to Other.
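The cleanup just described might look roughly like the sketch below. The survey responses here are made up for illustration, since the real first-week survey file isn't shown, and the use of `.str.replace` (rather than whatever exact replace call appears on screen) is an assumption.

```python
import pandas as pd

# Hypothetical survey responses; the real first-week survey data is not shown here.
majors = pd.Series([
    "Energy Engineering", "EBF", "Other", "Energy Engineering, Other",
    "Environmental Systems", "EBF", "Other",
])

# First look at the raw counts, including "Other" and the dual major.
print(majors.value_counts())

# Fold the dual major into Energy Engineering by removing the ", Other" suffix.
majors = majors.str.replace(", Other", "", regex=False)

# Keep only the responses whose major is not simply "Other".
majors = majors[majors != "Other"]

# Recount: these are the observed counts.
counts = majors.value_counts()
print(counts)
```

Note that `value_counts` sorts by frequency, so the order of the categories in `counts` depends on the data.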
So that ran, and then we can recount the data, this time storing it in a counts variable. So we can say majors, which needs to be plural there, dot value counts. Go ahead and print the counts. Here we can see what our counts look like now that we've removed Other and added that dual major into Energy Engineering. So these are our observed counts. The next step in our chi-square procedure is to develop the expected counts. So counts_exp is just counts dot sum, which is the sample size of our observed data, times the array of proportions. Before, our array of proportions was just 0.25 repeated four times, but now that our proportions are different for each category, we need to manually input each of those. And a very critical point here: the order has to be the same as the order up here, because we are ultimately going to be comparing these two sets of values. So we need to make sure that the order is the same across both the observed and expected counts. I've listed the order here, based off of this order. So we can do 0.493, for PNGE 0.172, 0.163, and 0.04. And these values also match what we have up here in our null hypothesis.
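One way to guard against an ordering mistake is to label the proportions by category and let pandas line them up with the observed counts. This is a sketch with hypothetical category names and numbers, not the EMS values, and the reindexing step is an alternative to typing the proportions in by hand in the right order.

```python
import pandas as pd

# Hypothetical observed counts, in whatever order value_counts happened to return.
counts = pd.Series({"B": 50, "A": 30, "C": 20})

# Hypothetical null-hypothesis proportions, labeled by category (they sum to 1).
props = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})

# Put the proportions in the same order as the observed counts.
props = props.reindex(counts.index)

# Expected counts: sample size times each category's proportion.
counts_exp = counts.sum() * props
print(counts_exp)
```

Because both series now share the same index order, each expected count sits next to the observed count for the same category.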
So we can run that. We get our expected counts, and then we can run the chi-square test. For this purpose, I'm just going to do the one-liner. We already know how to do the randomization procedure, but in this case we'll just do the one-liner. So we say stats dot chisquare. Our observations we call counts. Our expected values we call counts_exp. And then we can print the p-value, which is just results dot pvalue.
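As a rough sketch of that one-liner, here is `scipy.stats.chisquare` applied to hypothetical counts and proportions (not the EMS data). One thing to keep in mind: recent SciPy versions raise an error if the observed and expected counts don't add up to the same total, so the proportions passed in need to sum to 1.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts and null proportions, listed in the same order.
counts = np.array([30, 50, 20])
proportions = np.array([0.5, 0.3, 0.2])

# Expected counts under the null hypothesis.
counts_exp = counts.sum() * proportions

# Chi-square goodness-of-fit test with unequal expected proportions.
results = stats.chisquare(f_obs=counts, f_exp=counts_exp)
print(results.statistic, results.pvalue)
```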
And so then we can see the p-value is 0.017. So we can write our conclusion. The p-value is less than 0.05, our significance level, so we reject the null hypothesis in favor of the alternative, that at least one of the proportions is not as specified. However, this chi-square test doesn't tell us which one is different. It just tells us that at least one is different. To actually get an idea about which one is different, we can subtract our observed counts from our expected counts. Here we can see that the most extreme difference is PNGE, which had 11 more students in attendance than expected. Since the value is negative, that means the observed count is greater than expected. Meanwhile, the Environmental Systems group had nearly six fewer than expected. And so we can then start to have a discussion about which of these majors are primarily responsible for the rejection of our null hypothesis.
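A follow-up comparison along these lines could be sketched as below, again with hypothetical numbers. The per-category contribution column is an addition that isn't in the video: it is each category's term in the chi-square statistic, (observed - expected)^2 / expected, which gives a scaled view of which groups drive the result.

```python
import pandas as pd

# Hypothetical observed and expected counts, indexed by the same categories.
counts = pd.Series({"A": 30, "B": 50, "C": 20})
counts_exp = pd.Series({"A": 50.0, "B": 30.0, "C": 20.0})

# Expected minus observed: a negative value means more observed than expected.
diff = counts_exp - counts
print(diff)

# Each category's share of the chi-square statistic.
contribution = (counts - counts_exp) ** 2 / counts_exp
print(contribution.sort_values(ascending=False))
```

In this made-up example, category B has a negative difference, so it showed up more often than the null hypothesis predicted.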