Hypothesis Test for Difference of Proportions
If you want to compare the proportion of one population to the proportion of another population, and you have samples from each population, then you want to perform a difference of proportions test.
Read It: Hypothesis Test for Difference of Proportions
In a difference of proportions test, your statistical hypotheses take the following form, and your statistic, computed on both the original sample and the randomized samples, is the difference of sample proportions, $\hat{p}_A - \hat{p}_B$, where A and B refer to two separate samples drawn from two separate populations:
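Test | Hypotheses | Statistic | Randomization Procedure |
---|---|---|---|
Difference of Proportions | $H_0: p_A - p_B = \text{null value}$; $H_a: p_A - p_B < \text{null value}$ or $p_A - p_B > \text{null value}$ or $p_A - p_B \neq \text{null value}$ | Difference of sample proportions, $\hat{p}_A - \hat{p}_B$ | Reallocate observations between samples A and B |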
Typically, the null value here is 0, since the null hypothesis is usually that the two population proportions are equal, and thus their difference is zero.
Randomization Procedure
The randomization procedure for a difference of proportions test follows the same reallocation strategy as with the difference of means. Specifically, the pseudo-code for this procedure is:
- Obtain two samples of size n and m, respectively.
- Organize the values of both samples into a DataFrame, with the sample ID in one column and their respective values in another.
- For i in 1, 2, ... N
  - Randomly sample the value column without replacement.
  - Calculate the sample proportions for each sample (e.g., by grouping by the sample ID).
  - Find the difference of the sample proportions. Make sure to do this in the same order as stated in the hypotheses!
- Combine all N randomized statistics into the randomization distribution.
Again, the two sample sizes, n and m, do not need to be equal, but you may run into problems if one sample is very small and one is large.
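The reallocation steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the notebook from the course: the toy data, the column names `sample` and `value`, the helper `diff_of_props`, and the choice of N = 1000 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the course): 1 = "success", 0 = "failure".
# Sample A has n = 40 observations, sample B has m = 60.
df = pd.DataFrame({
    "sample": ["A"] * 40 + ["B"] * 60,
    "value": [1] * 18 + [0] * 22 + [1] * 15 + [0] * 45,
})

def diff_of_props(data):
    """Sample proportion of A minus sample proportion of B.

    The order (A minus B) must match the order used in the hypotheses.
    """
    props = data.groupby("sample")["value"].mean()
    return props["A"] - props["B"]

# Statistic on the original samples.
observed = diff_of_props(df)

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible
N = 1000                         # number of randomized samples (assumed)
rand_stats = np.empty(N)

for i in range(N):
    shuffled = df.copy()
    # Shuffling the value column (sampling without replacement) reallocates
    # the observed values between samples A and B.
    shuffled["value"] = rng.permutation(df["value"].to_numpy())
    rand_stats[i] = diff_of_props(shuffled)

# rand_stats now holds the randomization distribution.
print(f"Observed difference: {observed:.3f}")
print(f"Mean of randomization distribution: {rand_stats.mean():.3f}")
```

Because the shuffle breaks any link between sample ID and value, the randomization distribution should be centered near the null value of 0. A p-value would then be the fraction of randomized statistics at least as extreme as the observed one, in the direction given by the alternative hypothesis.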
The following video demonstrates how to implement this procedure in Python, using the 2021 Texas Power Crisis as an example.
Watch It: Video - Hypothesis Test for Difference of Proportions (11:41 minutes)
The Google Colab Notebook used in the above video is available here, and the data are here. For the Colab file, remember to click "File" then "Save a copy in Drive". For the data, it is recommended to save to your Google Drive.
Try It: In PA, is the rate of acute lymphoblastic leukemia (ALL) among those exposed to fracking significantly different than the rate of ALL among those not exposed to fracking?
The following exercise builds off of the Try It exercise from "Hypothesis Test for Single Proportion". Recall from that page that:
In a recent study (Clark et al., 2022), researchers at Yale University find that proximity to hydraulic fracturing (a.k.a. "fracking") sites almost doubles the odds of a child getting acute lymphoblastic leukemia (ALL). Their study examines 405 Pennsylvania children with this form of leukemia between 2009 and 2017, when oil and gas development activity (including fracking) was at a high in PA. For comparison, they also examine a "control group" of 2,080 randomly selected children of similar age to those with leukemia (randomly selected from a pool of 2,019,054 children, by Dr. Morgan's count).
The following two-way table is taken from Table 2 in the paper:
Fracking Exposure | Children w/ ALL | Children w/o ALL |
---|---|---|
Exposed | 6 | 16 |
Unexposed | 399 | 2064 |
This table shows the number of children who had ALL that had oil and gas development activity within 2km uphill from their residence ("Exposed"; 6 children) and those who were "Unexposed" to oil and gas development activity (didn't have oil and gas operations within 2km uphill from them). The third column shows the same counts for children without ALL.
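Again, let's define "rate" as:
$$\text{rate} = \frac{\text{number of children with ALL}}{\text{total number of children}}$$
and we can further call this rate $p_{\text{exposed}}$ for the exposed group, and $p_{\text{unexposed}}$ for the unexposed group.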
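As a starting point, the two-way table can be turned into sample rates and into a one-row-per-child DataFrame suitable for the reallocation procedure. The sketch below assumes that "rate" means the number of children with ALL divided by the total number of children in that exposure group. The names `counts`, `rates`, `children`, `exposure`, and `has_ALL` are made up for illustration.

```python
import pandas as pd

# Counts copied from the two-way table above (Table 2 in Clark et al., 2022).
counts = pd.DataFrame(
    {"ALL": [6, 399], "No ALL": [16, 2064]},
    index=["Exposed", "Unexposed"],
)

# Assumed definition: rate = children with ALL / all children in the group.
group_totals = counts.sum(axis=1)
rates = counts["ALL"] / group_totals
observed_diff = rates["Exposed"] - rates["Unexposed"]

# Expand the counts into one row per child (1 = has ALL, 0 = does not),
# with the sample ID in one column and the value in another.
children = pd.DataFrame({
    "exposure": ["Exposed"] * (6 + 16) + ["Unexposed"] * (399 + 2064),
    "has_ALL": [1] * 6 + [0] * 16 + [1] * 399 + [0] * 2064,
})

print(rates.round(4))
print(f"Observed difference (Exposed - Unexposed): {observed_diff:.4f}")
```

This only sets up the data and the observed statistic. Writing the hypotheses, building the randomization distribution, and interpreting the p-value are left for the exercise.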
In the DataCamp code window below, design and conduct a test to compare these proportions:
- Write your null and alternative hypotheses.
- Perform the hypothesis test using a randomization distribution.
- Report the p-value.
- What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.