EME 210
Data Analytics for Energy Systems

Hypothesis Test for Difference of Proportions


If you want to compare the proportion of one population to the proportion of another population, and you have samples from each population, then you want to perform a difference of proportions test.

 Read It: Hypothesis Test for Difference of Proportions

In a difference of proportions test, your statistical hypotheses will take the following form, and your statistic, computed on both the original sample and the randomized samples, is $\hat{p}_A - \hat{p}_B$, where A and B just refer to two separate samples, belonging to two separate populations:

| Test | Hypotheses | Statistic | Randomization Procedure |
|---|---|---|---|
| Difference of Proportions | $H_0: p_A - p_B = 0$; $H_a: p_A - p_B < 0$, or $p_A - p_B > 0$, or $p_A - p_B \neq 0$ | Difference of sample proportions, $\hat{p}_A - \hat{p}_B$ | Reallocate observations between samples A and B |

Typically, the null value here is 0, since the null hypothesis is usually that the two population proportions are equal, and thus their difference is zero.

Randomization Procedure

The randomization procedure for a difference of proportions test follows the same reallocation strategy as with the difference of means. Specifically, the pseudo-code for this procedure is:

  1. Obtain two samples of size n and m, respectively. 

  2. Organize the values of both samples into a DataFrame, with the sample ID in one column and their respective values in another.

  3. For i in 1, 2, ..., N (where N is the number of randomized samples you want):

    1. Randomly sample the value column without replacement (i.e., shuffle the values while the sample IDs stay in place).

    2. Calculate the sample proportions for each sample (e.g., by grouping by the sample ID). 

    3. Find the difference of the sample proportions. Make sure to do this in the same order as stated in the hypotheses!

  4. Combine all N randomized statistics into the randomization distribution.

Again, the two sample sizes, n and m, do not need to be equal, but you may run into problems if one sample is very small and one is large. 
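The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code from the course notebook: the function name `randomization_distribution`, the column names, and the made-up 0/1 data are all assumptions chosen for the example, and `rng.permutation` is used here as one way to sample the value column without replacement.

```python
import numpy as np
import pandas as pd

def randomization_distribution(df, group_col, value_col, first, second, N=1000, seed=0):
    """Return the observed difference of proportions (first minus second)
    and an array of N randomized differences.

    Hypothetical helper for illustration; value_col must hold 0/1 values.
    """
    rng = np.random.default_rng(seed)

    def diff_of_proportions(values):
        # The mean of a 0/1 column within each group is that group's sample proportion.
        props = values.groupby(df[group_col]).mean()
        # Subtract in the same order as stated in the hypotheses.
        return props[first] - props[second]

    observed = diff_of_proportions(df[value_col])

    rand_stats = np.empty(N)
    for i in range(N):
        # Reallocate: shuffle the values (sampling without replacement)
        # while the sample IDs stay where they are.
        shuffled = pd.Series(rng.permutation(df[value_col].to_numpy()), index=df.index)
        rand_stats[i] = diff_of_proportions(shuffled)

    return observed, rand_stats

# Made-up data: sample A has n = 40 observations, sample B has m = 60.
df = pd.DataFrame({
    "sample": ["A"] * 40 + ["B"] * 60,
    "value": [1] * 10 + [0] * 30 + [1] * 30 + [0] * 30,
})

observed, rand_stats = randomization_distribution(df, "sample", "value", "A", "B", N=2000)

# Left-tailed p-value for H_a: p_A - p_B < 0, i.e., the share of randomized
# statistics at least as extreme (as small) as the observed one.
p_value = np.mean(rand_stats <= observed)
print(observed, p_value)
```

For a right-tailed alternative you would count `rand_stats >= observed` instead. Note that the shuffled values are rebuilt as a new Series on the original index; shuffling with pandas alone and then grouping by the ID column can silently re-align the values by index and undo the shuffle.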

The following video demonstrates how to implement this procedure in Python, using the 2021 Texas Power Crisis as an example.

 Watch It: Video - Hypothesis Test for Difference of Proportions (11:41 minutes) 

Click here for a transcript.

In this video we're going to go over the one-line test that can be used for a two-sample mean hypothesis, a comparison of the means. And so, when we wrote these hypotheses, our null hypothesis was that the mean of gas minus the mean of wind was equal to zero. Thus, there was no difference between the two. And our alternative is that gas minus wind is less than zero, suggesting that there is a difference between the two. And here we calculated the mean for the percent deficit, defined as generation minus capacity, divided by capacity.

And so, we did all of this prep work and then we got into the actual randomization procedure, where we calculated our sample difference, did our reallocation procedure where we sampled from percent deficit with replace equals False, and recalculated these differences, and ultimately got a p-value of zero. Which suggested that we reject the null hypothesis.

In terms of this one-liner test, I'm going to call the results, results. And again, using the scipy.stats library, we can say stats.ttest_ind, where "ind" is for independent samples. And then we give it our first sample, which is sample gas. And a key point here is to make sure that you're matching the order that you write your alternative hypothesis in. And so, we did gas minus wind. So our first sample is gas. And I want to also point out that we are using the actual data here. So our sample is taken from the samp.loc selection, which we calculated in step one. We're not using any of the reallocation samples which we calculate in step three. So our first sample is samplegas, our second sample, samplewind.

Here we need to say equal_var equals False. And this just tells it that the variance is not equal between the two variables. So, it'll conduct a slightly different version of the t-test than if they were equal variance. And then we specify alternative equals "less", since we are using a left-tailed test.

And then after that, we can say results.pvalue to print the p-value. And so, here we can see that we are very, very close to zero, following that same pattern that we have seen in the earlier two tests. It still has the same conclusion as our randomization procedure, but with more specificity. So again we reject the null hypothesis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

The Google Colab Notebook used in the above video is available here, and the data are here. For the Colab file, remember to click "File" then "Save a copy in Drive". For the data, it is recommended to save to your Google Drive.
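As a rough illustration of the one-line test described in the transcript above, the call might look like the sketch below. The two arrays are randomly generated stand-ins, not the Texas data from the notebook, and the names `sample_gas` and `sample_wind` are assumptions for this example. The `alternative` argument requires a reasonably recent version of SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the percent-deficit samples; NOT the course data.
sample_gas = rng.normal(loc=-0.40, scale=0.10, size=50)
sample_wind = rng.normal(loc=-0.20, scale=0.20, size=30)

# The first argument is gas and the second is wind, matching "gas minus wind"
# in the hypotheses. equal_var=False requests the unequal-variance (Welch)
# version of the t-test; alternative="less" makes the test left-tailed.
results = stats.ttest_ind(sample_gas, sample_wind, equal_var=False, alternative="less")
print(results.pvalue)
```

Swapping the order of the two samples without also changing `alternative` would test the opposite tail, which is why the order needs to match the hypotheses.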


Try It: In PA, is the rate of acute lymphoblastic leukemia (ALL) among those exposed to fracking significantly different than the rate of ALL among those not exposed to fracking?

The following exercise builds off of the Try It exercise from "Hypothesis Test for Single Proportion". Recall from that page that:

In a recent study (Clark et al., 2022), researchers at Yale University find that proximity to hydraulic fracturing (a.k.a. "fracking") sites almost doubles the odds of a child getting acute lymphoblastic leukemia (ALL). Their study examines 405 Pennsylvania children with this form of leukemia between 2009 and 2017, when oil and gas development activity (including fracking) was at a high in PA. For comparison, they also examine a "control group" of 2,080 randomly selected children of similar age to those with leukemia (randomly selected from a pool of 2,019,054 children, by Dr. Morgan's count).

The following two-way table is taken from Table 2 in the paper:

| Fracking Exposure | Children w/ ALL | Children w/o ALL |
|---|---|---|
| Exposed | 6 | 16 |
| Unexposed | 399 | 2064 |

This table shows the number of children who had ALL that had oil and gas development activity within 2km uphill from their residence ("Exposed"; 6 children) and those who were "Unexposed" to oil and gas development activity (didn't have oil and gas operations within 2km uphill from them). The third column shows the same counts for children without ALL.

Again, let's define "rate" as:

$$p = \frac{\text{\# of cases with ALL}}{\text{Total \# of cases}}$$

and we can further call p for the exposed group $p_{exposed}$, and p for the unexposed group $p_{unexposed}$.
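As a starting point, here is one possible way to turn the two-way table into a DataFrame with one row per child and to compute the two sample rates. This is only a sketch: the column names `exposure` and `has_ALL` are assumptions for this example, and the hypotheses and the test itself are still left for you to set up.

```python
import numpy as np
import pandas as pd

# Counts copied from the two-way table above:
# Exposed: 6 with ALL, 16 without; Unexposed: 399 with ALL, 2064 without.
cell_counts = [6, 16, 399, 2064]

# One row per child: the exposure group and a 0/1 indicator for ALL.
long_df = pd.DataFrame({
    "exposure": np.repeat(["Exposed", "Exposed", "Unexposed", "Unexposed"], cell_counts),
    "has_ALL": np.repeat([1, 0, 1, 0], cell_counts),
})

# The rate p for each group is the mean of the 0/1 indicator within that group.
rates = long_df.groupby("exposure")["has_ALL"].mean()
print(rates)
```

With the data in this long format, the exposure column plays the role of the sample ID and `has_ALL` plays the role of the value column in the randomization procedure.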

In the DataCamp code window below, design and conduct a test to compare these proportions:

  1. Write your null and alternative hypotheses.
  2. Perform the hypothesis test using a randomization distribution.
  3. Report the p-value.
  4. What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.