If you want to compare the proportion of one population to the proportion of another population, and you have samples from each population, then you want to perform a difference of proportions test.
In a difference of proportions test, your statistical hypotheses take the following form, and your statistic, computed on both the original sample and the randomized samples, is $\hat{p}_A - \hat{p}_B$, where A and B refer to two separate samples drawn from two separate populations:
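Test | Hypotheses | Statistic | Randomization Procedure |
---|---|---|---|
Difference of Proportions | $H_0: p_A - p_B = 0$; $H_a: p_A - p_B < 0$ or $p_A - p_B > 0$ or $p_A - p_B \neq 0$ | Difference of sample proportions, $\hat{p}_A - \hat{p}_B$ | Reallocate observations between samples A and B |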
Typically, the null value here is 0, since the null hypothesis is usually that the two population proportions are equal, and thus their difference is zero.
The randomization procedure for a difference of proportions test follows the same reallocation strategy as with the difference of means. Specifically, the pseudo-code for this procedure is:
1. Obtain two samples of size n and m, respectively.
2. Organize the values of both samples into a DataFrame, with the sample ID in one column and their respective values in another.
3. For i in 1, 2, ... N
    - Randomly sample the value column without replacement.
    - Calculate the sample proportions for each sample (e.g., by grouping by the sample ID).
    - Find the difference of the sample proportions. Make sure to do this in the same order as stated in the hypotheses!
4. Combine all N randomized statistics into the randomization distribution.
Again, the two sample sizes, n and m, do not need to be equal, but you may run into problems if one sample is very small and one is large.
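The procedure described above can be sketched in Python roughly as follows. This is a minimal illustration, not the notebook from the course video: the function name `randomize_diff_props`, the column names `sample` and `value`, and the toy counts (12 successes out of 40 in sample A, 30 out of 60 in sample B) are all made up for the example. Outcomes are assumed to be coded as 0/1, so that a group mean is a sample proportion.

```python
import numpy as np
import pandas as pd


def randomize_diff_props(df, n_iter=1000, seed=0):
    """Build a randomization distribution for p_hat_A - p_hat_B.

    Assumes (hypothetical layout) a 'sample' column holding the IDs
    'A' and 'B' and a 'value' column of 0/1 outcomes.
    """
    rng = np.random.default_rng(seed)
    stats = np.empty(n_iter)
    for i in range(n_iter):
        # Sample the value column without replacement (a shuffle), which
        # reallocates the observations between samples A and B.
        shuffled = df.assign(value=rng.permutation(df["value"].to_numpy()))
        # Sample proportion for each sample ID.
        props = shuffled.groupby("sample")["value"].mean()
        # Same order as in the hypotheses: A minus B.
        stats[i] = props["A"] - props["B"]
    return stats


# Toy data (made up): n = 40 in sample A, m = 60 in sample B.
df = pd.DataFrame({
    "sample": ["A"] * 40 + ["B"] * 60,
    "value": [1] * 12 + [0] * 28 + [1] * 30 + [0] * 30,
})

# Statistic on the original samples.
obs_props = df.groupby("sample")["value"].mean()
observed = obs_props["A"] - obs_props["B"]

# Randomization distribution and a two-sided p-value.
# The small tolerance guards against floating-point ties.
rand_stats = randomize_diff_props(df, n_iter=1000)
p_value = np.mean(np.abs(rand_stats) >= abs(observed) - 1e-12)

print(f"observed difference: {observed:.3f}")
print(f"two-sided p-value:   {p_value:.3f}")
```

The p-value here is two-sided, matching a "not equal" alternative; for a one-sided alternative, compare `rand_stats` to `observed` in the matching direction instead of using absolute values.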
The following video demonstrates how to implement this procedure in Python, using the 2021 Texas Power Crisis as an example.
The Google Colab Notebook used in the above video is available here [2], and the data are here [3]. For the Colab file, remember to click "File" then "Save a copy in Drive". For the data, it is recommended to save to your Google Drive.
The following exercise builds off of the Try It exercise from "Hypothesis Test for Single Proportion [4]". Recall from that page that:
In a recent study (Clark et al., 2022 [5]), researchers at Yale University find that proximity to hydraulic fracturing (a.k.a. "fracking") sites almost doubles the odds of a child getting acute lymphoblastic leukemia (ALL). Their study examines 405 Pennsylvania children with this form of leukemia between 2009-2017, when oil and gas development activity (including fracking) was at a high in PA. For a comparison, they also examine a "control group" of 2,080 randomly-selected children of similar age as those with leukemia (randomly-selected from a pool of 2,019,054 children, by Dr. Morgan's count).
The following two-way table is taken from Table 2 in the paper:
Fracking Exposure | Children w/ ALL | Children w/o ALL |
---|---|---|
Exposed | 6 | 16 |
Unexposed | 399 | 2064 |
The second column shows the number of children with ALL who had oil and gas development activity within 2 km uphill from their residence ("Exposed"; 6 children) and the number who did not ("Unexposed"; 399 children). The third column shows the same counts for children without ALL.
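Again, let's define "rate" as:

$$\text{rate} = \frac{\text{number of children with ALL}}{\text{total number of children}}$$

and we can further call the rate for the exposed group $p_{\text{exposed}}$, and for the unexposed group $p_{\text{unexposed}}$.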
In the DataCamp code window below, design and conduct a test to compare these proportions:
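As a starting point, here is one possible sketch of such a test, built only from the counts in the two-way table. It makes several assumptions that are not stated on this page: "rate" is read as the number of children with ALL divided by all children in that exposure group (6/22, about 0.273, for the exposed group and 399/2463, about 0.162, for the unexposed group); the alternative hypothesis is taken to be one-sided (exposed rate greater than unexposed rate); and the column names `exposure` and `has_ALL`, the number of iterations, and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Counts copied from the two-way table (Table 2 in the paper).
counts = {
    "Exposed": {"ALL": 6, "no_ALL": 16},
    "Unexposed": {"ALL": 399, "no_ALL": 2064},
}

# One row per child: exposure group and a 0/1 indicator for ALL.
# Column names are hypothetical.
rows = []
for group, c in counts.items():
    rows += [(group, 1)] * c["ALL"] + [(group, 0)] * c["no_ALL"]
df = pd.DataFrame(rows, columns=["exposure", "has_ALL"])

# Observed statistic: exposed rate minus unexposed rate.
props = df.groupby("exposure")["has_ALL"].mean()
observed = props["Exposed"] - props["Unexposed"]

# Randomization distribution: shuffle the ALL indicator, which
# reallocates children between the two exposure groups.
rng = np.random.default_rng(0)
n_iter = 1000
values = df["has_ALL"].to_numpy()
is_exposed = (df["exposure"] == "Exposed").to_numpy()
rand_stats = np.empty(n_iter)
for i in range(n_iter):
    shuffled = rng.permutation(values)
    rand_stats[i] = shuffled[is_exposed].mean() - shuffled[~is_exposed].mean()

# One-sided p-value (assumed alternative: exposed rate > unexposed rate).
# The small tolerance guards against floating-point ties.
p_value = np.mean(rand_stats >= observed - 1e-12)

print(f"observed difference: {observed:.4f}")
print(f"one-sided p-value:   {p_value:.3f}")
```

The boolean mask does the same job as grouping by the sample ID, just faster for a table of this size. Note that the exposed group has only 22 children, so the caution about one very small and one large sample applies here.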
Links
[1] https://creativecommons.org/licenses/by-nc-sa/4.0/
[2] https://drive.google.com/file/d/1hH7L3MLUEBmfINsf4s0uU4hArgdwS9Rd/view?usp=sharing
[3] https://drive.google.com/file/d/1-0Vr1bJvUoGuzui6NLWi_AaW9ySqA1RM/view?usp=drive_link
[4] https://www.e-education.psu.edu/eme210/node/624
[5] https://ehp.niehs.nih.gov/doi/full/10.1289/EHP11092