Hypothesis Test for Single Proportion
A test for a single proportion applies when you have one sample and want to compare a proportion computed from it against a fixed reference value.
Read It: Hypothesis Test for Single Proportion
In a single proportion test, your statistical hypotheses take the form shown in the table below, where $p_0$ denotes the null value. Your statistic, computed on both the original sample and the randomized samples, is the sample proportion, $\hat{p}$:
Test | Hypotheses | Statistic | Randomization Procedure |
---|---|---|---|
Single Proportion | $H_0: p = p_0$; $H_a: p < p_0$ or $p > p_0$ or $p \neq p_0$ | Sample proportion, $\hat{p}$ | Draw random samples from binomial distribution with $p = p_0$ |
As with the test for a single mean, the null value will typically be some threshold or standard that you are comparing your sample proportion against. For proportions, a common null value is 0.5, since this denotes fair odds between two options (e.g., a coin flip). Often, one wants to see whether or not something is diverging from this pure random chance behavior.
Randomization Procedure
The randomization procedure that we will use for a test of a single proportion involves simulating random samples from a binomial distribution, such that the values in each random sample are drawn according to the null value, $p_0$. The binomial distribution only applies to a binary system, where the categorical variable has exactly two categories. Take, for example, a series of coin flips where we are interested in the proportion of heads.
We can simulate a series of coin flips (or similar process) using the binomial distribution. In Python, we can use:
x = np.random.binomial(n, p, N)
where n is the sample size (e.g., number of flips), p is the null value (e.g., 0.5 because the default is that the coin is fair), and N is the number of samples (e.g., 1000 or some similarly large positive integer). The output, x, is a vector of length N that contains the numerator of the proportion of interest for each sample (e.g., the number of heads in each series of coin flips). Here, np refers to the NumPy library, imported with import numpy as np.
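As a minimal sketch of this call, the snippet below simulates 1,000 series of 100 fair coin flips. The specific values of n, p, and N, the fixed seed, and the variable name proportions are illustrative assumptions, not values taken from the lesson.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable (arbitrary choice)

n = 100    # sample size: flips per series (illustrative value)
p = 0.5    # null value: a fair coin
N = 1000   # number of simulated samples

# Each entry of x is the number of heads in one series of n flips
x = np.random.binomial(n, p, N)

# Dividing by n converts each count of heads into a proportion of heads
proportions = x / n

print(x.shape)             # one count per simulated sample
print(proportions.mean())  # should land close to the null value p
```

Because each entry of x is a count, it must be divided by n before it can be compared with a sample proportion.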
The pseudo-code for this procedure is:
1. Obtain a sample of size n.
2. Simulate N numerators of the proportion of interest from the binomial distribution, each with sample size n and probability p = null value.
3. Calculate the sample proportions for these new randomized sample numerators by dividing each by the sample size, n.
4. Combine all N randomized statistics into the randomization distribution.
Note that this procedure does not explicitly require a "for" loop, because Step 2 performs the simulation from the binomial distribution for all N samples at once.
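The steps above can be sketched as a single function, shown here under stated assumptions. The function name randomization_test_proportion, its parameters, and the example data (60 heads in 100 flips) are hypothetical and not from the lesson. The p-value is taken as the fraction of randomized proportions at least as extreme as the observed one, which is one common convention.

```python
import numpy as np

def randomization_test_proportion(successes, n, p0, N=10000,
                                  alternative="two-sided", seed=None):
    """Randomization test for a single proportion (illustrative sketch)."""
    rng = np.random.default_rng(seed)

    # Step 1: the observed sample gives the sample proportion
    p_hat = successes / n

    # Step 2: simulate N numerators under the null value p0 (no loop needed)
    numerators = rng.binomial(n, p0, N)

    # Steps 3-4: convert counts to proportions; together they form
    # the randomization distribution
    rand_props = numerators / n

    tol = 1e-12  # guards against floating-point ties in the comparisons
    if alternative == "greater":
        p_value = np.mean(rand_props >= p_hat - tol)
    elif alternative == "less":
        p_value = np.mean(rand_props <= p_hat + tol)
    else:
        # two-sided: at least as far from p0 as the observed proportion
        p_value = np.mean(np.abs(rand_props - p0) >= abs(p_hat - p0) - tol)

    return p_hat, rand_props, p_value

# Hypothetical example: 60 heads in 100 flips, tested against a fair coin
p_hat, rand_props, p_value = randomization_test_proportion(
    successes=60, n=100, p0=0.5, N=10000, seed=0)
print(f"sample proportion = {p_hat}, p-value = {p_value:.4f}")
```

The alternative argument should match the direction of the alternative hypothesis, and the p-value varies slightly from run to run unless a seed is fixed.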
The following video demonstrates how to implement this procedure in Python, using the 2021 Texas Power Crisis as an example.
Watch It: Video - Hypothesis Test for Single Proportion (10:46 minutes)
The Google Colab Notebook used in the above video is available here, and the data are here. For the Colab file, remember to click "File" then "Save a copy in Drive". For the data, it is recommended to save to your Google Drive.
Try It: Is the rate of acute lymphoblastic leukemia (ALL) in PA significantly different than the national rate?
In a recent study (Clark et al., 2022), researchers at Yale University found that proximity to hydraulic fracturing (a.k.a. "fracking") sites almost doubles the odds of a child getting acute lymphoblastic leukemia (ALL). Their study examines 405 Pennsylvania children diagnosed with this form of leukemia between 2009 and 2017, when oil and gas development activity (including fracking) was at a high in PA. For comparison, they also examine a "control group" of 2,080 randomly selected children of similar age to those with leukemia (randomly selected from a pool of 2,019,054 children, by Dr. Morgan's count).
The following two-way table is taken from Table 2 in the paper:
Fracking Exposure | Children w/ ALL | Children w/o ALL |
---|---|---|
Exposed | 6 | 16 |
Unexposed | 399 | 2064 |
This table shows the number of children who had ALL that had oil and gas development activity within 2km uphill from their residence ("Exposed"; 6 children) and those who were "Unexposed" to oil and gas development activity (didn't have oil and gas operations within 2km uphill from them). The third column shows the same counts for children without ALL.
To answer the question in the section title and look at the rate of ALL, you need to know that the total number of children in PA, corresponding to the years of the study above, is 2,019,054 (PA Department of Health, Birth Statistics). You also need to know that the national rate during the time period of the study is about 32 cases per 1 million, or 0.000032 (Ward et al., 2014). Let's define "rate" as:
$$\text{rate} = \frac{\text{number of children with ALL}}{\text{total number of children}}$$
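Reading "rate" as the number of ALL cases divided by the total number of children (an assumption for this sketch), the observed Pennsylvania rate follows from the counts quoted above. The variable names are illustrative.

```python
# Counts taken from the two-way table and the text above
cases_with_all = 6 + 399        # exposed + unexposed children with ALL
total_children = 2_019_054      # children in PA over the study period
national_rate = 32 / 1_000_000  # about 32 cases per 1 million

# Observed rate of ALL in PA under the assumed definition of "rate"
observed_rate = cases_with_all / total_children

print(f"Observed PA rate: {observed_rate:.6f}")
print(f"National rate:    {national_rate:.6f}")
```

In the test, the observed rate would serve as the sample proportion and the national rate as the null value.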
In the DataCamp code window below, design and conduct a test for a single proportion:
- Write your null and alternative hypotheses.
- Perform the hypothesis test using a randomization distribution.
- Report the p-value.
- What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.