EME 210
Data Analytics for Energy Systems

Hypothesis Test for Single Proportion



A test for a single proportion is for when you have one sample and are interested in a proportion from it.

Read It: Hypothesis Test for Single Proportion

In a single proportion test, your statistical hypotheses take the following form, and your statistic, computed on both the original sample and the randomized samples, is $\hat{p}$:

| Test | Hypotheses | Statistic | Randomization Procedure |
|---|---|---|---|
| Single Proportion | $H_0: p = \text{null value}$; $H_a: p < \text{null value}$ or $p > \text{null value}$ or $p \neq \text{null value}$ | Sample proportion, $\hat{p}$ | Draw random samples from binomial distribution with $p = \text{null value}$ |

As with the test for a single mean, the null value will typically be some threshold or standard that you are comparing your sample proportion against. For proportions, a common null value is 0.5, since this denotes fair odds between two options (e.g., a coin flip). Often, one wants to see whether or not something is diverging from this pure random chance behavior.

Randomization Procedure

The randomization procedure that we will use for a test of a single proportion involves simulating random samples from a binomial distribution, such that the values in each random sample are drawn according to the null value, p. The binomial distribution applies only to a binary system, that is, a categorical variable with exactly two categories. Take, for example, a series of coin flips where we are interested in the proportion of heads:

$$p = \frac{\#\text{ Heads}}{\text{Sample size, } n}$$

We can simulate a series of coin flips (or similar process) using the binomial distribution. In Python, we can use:

x = np.random.binomial(n, p, N)

where n is the sample size (e.g., number of flips), p is the null value (e.g., 0.5 because the default is that the coin is fair), and N is the number of samples (e.g., 1000 or some similarly large positive integer). The output, x, is a vector of length N that contains the numerator of the proportion of interest for each sample (e.g., the number of heads in each series of coin flips).
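To make the shape of the output concrete, here is a minimal sketch of that call. The values of n and N and the random seed are illustrative assumptions, not taken from the course notebook.

```python
import numpy as np

np.random.seed(42)   # arbitrary seed, used only so the sketch is reproducible

n = 50      # sample size: flips per simulated sample (illustrative value)
p = 0.5     # null value: a fair coin
N = 1000    # number of simulated samples (illustrative value)

# Each entry of x is the number of heads in one series of n flips
x = np.random.binomial(n, p, N)

print(x.shape)            # one count per simulated sample
print(x[:5])              # first few head counts
print(x.min(), x.max())   # counts always lie between 0 and n
```

Because x holds counts rather than proportions, it still needs to be divided by n before it can be compared with a sample proportion.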

The pseudo-code for this procedure is:

  1. Obtain a sample of size n

  2. Simulate N numerators of the proportion of interest from the binomial distribution, each with sample size n and probability p = null value.

  3. Calculate the sample proportions for these new randomized sample numerators by dividing each by the sample size, n. 

  4. Combine all N randomized statistics into the randomization distribution

Note that this procedure does not explicitly require a "for" loop, as Step 2 performs the simulation from the binomial distribution for all N intended samples at once.
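The four steps above can be sketched in a few lines of Python. The helper name `randomization_proportions`, the sample of 32 heads in 50 flips, and the seed are all hypothetical and serve only to illustrate the procedure.

```python
import numpy as np

def randomization_proportions(n, null_value, N=1000):
    """Return N simulated sample proportions under the null hypothesis.

    Hypothetical helper for illustration; not part of the course notebook.
    """
    # Step 2: simulate N numerators (counts of "successes"), no loop needed
    counts = np.random.binomial(n, null_value, N)
    # Steps 3-4: convert counts to proportions; the resulting array is
    # the randomization distribution
    return counts / n

np.random.seed(0)   # arbitrary seed for reproducibility

# Step 1: a made-up sample of 50 coin flips containing 32 heads
n = 50
heads = 32
p_hat = heads / n

p_hats = randomization_proportions(n, null_value=0.5, N=1000)

print(p_hat)           # sample statistic
print(p_hats.mean())   # centre of the randomization distribution, near the null value
```

The randomization distribution is centred near the null value because every simulated sample is drawn with p equal to that value.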

The following video demonstrates how to implement this procedure in Python, using the 2021 Texas Power Crisis as an example.

 Watch It: Video - Hypothesis Test for Single Proportion (10:46 minutes)

Click here for a transcript.

Hi. In this video, we're going to go over how to do a hypothesis test for a single proportion using randomization distributions in Python. And we're going to continue on with this Texas Cold Snap example from the Power Crisis in 2021. And we're still going to be focusing on just wind because this is a single proportion, so we have to look at a single sample here. Now, the proportion that we're going to look at is the proportion of time that wind generation fell below the forecasted peak capacity. So, recall that this forecasted peak capacity really reflects the demand on the system. And so, we're going to look at the proportion of time that essentially supply was not meeting demand during this crisis. And again, just for wind here. Okay, so let's get to it.

So, our hypotheses for this single proportion. The null is that the forecasted peak capacity was met half the time, or at least half the time. And the alternative is that this proportion, the proportion of time below the peak capacity, is greater than 0.5. That is, more than half the time wind was not meeting that capacity level. Okay, so that's our failure criteria here in the alternative. So, we're testing to see if either the null was met, or that null was not met. Again, we never accept this alternative hypothesis. So, let's start off. The first thing to do always, in step one, is to calculate our sample statistic. So, what is our actual proportion from the data? So, let's calculate wind p-hat. So, our p-hat from the wind sample is the length of the elements in our wind vector [that fall below the forecasted peak capacity]. Recall that with the hypothesis test for the single mean, we extracted the generation from wind into a single vector called wind. So, this wind object is a series of the generation in gigawatts. And we're going to divide by little n, little n being the sample size of wind. We calculated that back up in the hypothesis test for the single mean. So, n is the length, effectively, of wind here. And wind, again, is this generation of just the wind source.

So, let's see what our sample statistic is here. So, we'll print this out: 0.64. So, 64% of the time the actual generation fell below the forecasted peak capacity of 6.1 gigawatts. So, then the question is, well, is that a lot? Is that statistically significantly different than half the time? So, that's what the hypothesis test here is going to answer. So, for a randomization distribution, this is going to be a bit simpler than what we did with the mean. We're not going to use random choice. Instead, our key function here is going to be random.binomial. We're going to use this binomial distribution to simulate conditions under the null hypothesis. So, if the null hypothesis were true, then half the time the generation would be above or equal to 6.1, and half the time it would be below. That's essentially what's going on in the null. And maybe those proportions are a little bit different, but let's see here. So, with the binomial distribution, how we can simulate from that is, well, the binomial distribution is essentially doing a series of coin flips. So, we can flip a coin at every time during our two-day period here of the Texas cold snap. And you know, if it's heads, then we'll say that wind falls below the capacity. If it's tails, it falls above. So, that's essentially what's going on with this binomial distribution.

So, let's put it into action in Python here. We'll establish our little n is the length of wind, and we already have that established above. But we'll reproduce that here, and again we can always specify a capital N, and we'll keep it the same as what we had for the hypothesis test for the single mean. And what random.binomial is going to do is, it's going to return the counts of successes here. So, it's going to return, essentially, our numerator in this proportion. So, the number of times that wind falls below 6.1. And what we need to give it is our sample size. So, we want to do n different flips, or trials, and our proportion here is going to be 0.1, I'm sorry, 0.5. That's going to be our null value. We take this value straight from the null hypothesis up here. And our size is going to be capital N. So, we're going to do little n trials, capital N number of times.

So, let's go ahead and run that and we'll see what's contained in counts here. So, we see that we have a collection of a thousand counts, a thousand numbers of successes. And so, if we want to get our p-hats, our simulated p-hat values will be just the counts divided by, in this case, little n. It's not letting me. Every time I hit return, it tries to capitalize it. Yeah, there we go. So there it is in proportion format. So these are p-hats, what we're looking at right now, the series of numbers. This is our randomization distribution. Let's visualize this.

So, we'll do p-hat data frame is equal to pd.DataFrame. We're just going to transform p-hat into a data frame here, and give it some column names. We'll say p-hat, and then use our good friend ggplot to visualize this. I'm going to use a different visualization than histograms. We did a histogram last time for the single mean. Let's do a dot plot now. We'll see what this looks like. I'll explain the reason why I'm using a dot plot here when we take a look at it. And as we had before, let's also put our sample statistic here, as a vertical dashed line. So, our sample statistic, wind p-hat, is going to be a vertical dashed line. And this is the x-intercept value; the color again will be red, and the line type can be dashed.

So, there we go. There's our randomization distribution. This is the dot plot. So, instead of having bars, we have a collection of dots stacked on top of one another. And the reason why I like doing this for the randomization distributions is that each dot represents one randomization. So, a single randomized sample. So, we can see basically the number of randomized trials, in this case, that fall within each bin, within each category of p-hat here, which is nice. It gives us a sense of the scale, and how many trials have been run. Again, this vertical red dashed line, this is our true sample statistic. This is what the data say. And similar to what we saw with the hypothesis test for the single mean, this falls outside of our randomization distribution. In this case, it falls above it. But let's go back and reference our alternative hypothesis here. Our alternative hypothesis is that p is greater than 0.5, and here we see no randomized trial that is at least as great as this red dashed line. So, this red dashed line is pretty extreme here. So just by inspection, we know beforehand that we can reject the null hypothesis. We can formalize this into a p-value. Let me just copy and paste our code for printing the p-value here.

Our p-value here is not going to be based on random mean. Instead, we have p-hat, which is our series that contains the randomization distribution. And it's not wind mean this time, but rather wind p-hat. And we've got to flip this around. It's greater than this time. But still our p-value is zero. So, there's zero probability of getting a randomized trial at least as great as our sample statistic if the null hypothesis is true. Again, this randomization distribution is reflecting the condition under which the null hypothesis is true, and showing what sort of variability we would have in that. These would be our p-hats if the null hypothesis is true. So, what should we write for a concluding statement? Again we can say, since the p-value of zero is less than 0.05, we can reject the null hypothesis that wind met or exceeded the forecasted peak capacity. I'm just going to write the capacity half the time. But the key term here is rejecting the null hypothesis. So, we reject the null hypothesis, and we provide the reason, the reason being that our p-value is less than our significance level, our default significance level of 0.05. Okay, so that is our test for a single proportion.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

The Google Colab Notebook used in the above video is available here, and the data are here. For the Colab file, remember to click "File" then "Save a copy in Drive". For the data, it is recommended to save to your Google Drive.
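For readers without the notebook at hand, the workflow narrated in the video can be sketched roughly as follows. This is not the notebook's code: the `wind` array here is synthetic stand-in data (48 made-up values), so the printed numbers will not match the 0.64 and the p-value of zero reported in the video. Only the 6.1 GW threshold, the null value of 0.5, N = 1000, and the direction of the test come from the video. Plotting is omitted.

```python
import numpy as np

np.random.seed(210)   # arbitrary seed for reproducibility

# Hypothetical stand-in for the wind generation series (GW); NOT the real data
wind = np.random.normal(loc=5.5, scale=1.5, size=48)
threshold = 6.1       # forecasted peak capacity in GW, as stated in the video

# Step 1: sample statistic, the proportion of time generation fell below the threshold
n = len(wind)
wind_phat = (wind < threshold).sum() / n

# Step 2: simulate N samples of n "coin flips" under the null value p = 0.5
N = 1000
counts = np.random.binomial(n, 0.5, N)   # numerators: simulated counts below threshold
phat = counts / n                        # randomization distribution of p-hat

# Step 3: right-tailed p-value, matching the alternative p > 0.5
p_value = (phat >= wind_phat).sum() / N

print("Sample proportion:", wind_phat)
print("p-value:", p_value)
```

The comparison uses `>=` because the alternative hypothesis is that the proportion is greater than 0.5; a left-tailed alternative would flip the inequality.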


Try It: Is the rate of acute lymphoblastic leukemia (ALL) in PA significantly different than the national rate?

In a recent study (Clark et al., 2022), researchers at Yale University find that proximity to hydraulic fracturing (a.k.a. "fracking") sites almost doubles the odds of a child getting acute lymphoblastic leukemia (ALL). Their study examines 405 Pennsylvania children with this form of leukemia between 2009 and 2017, when oil and gas development activity (including fracking) was at a high in PA. For a comparison, they also examine a "control group" of 2,080 randomly-selected children of similar age as those with leukemia (randomly-selected from a pool of 2,019,054 children, by Dr. Morgan's count).

The following two-way table is taken from Table 2 in the paper:

| Fracking Exposure | Children w/ ALL | Children w/o ALL |
|---|---|---|
| Exposed | 6 | 16 |
| Unexposed | 399 | 2064 |

This table shows the number of children who had ALL that had oil and gas development activity within 2km uphill from their residence ("Exposed"; 6 children) and those who were "Unexposed" to oil and gas development activity (didn't have oil and gas operations within 2km uphill from them). The third column shows the same counts for children without ALL.

In order to answer the question in the section title and look at the rate of ALL, you need to know that the total number of children in PA, corresponding to the years of the study above, is 2,019,054 (PA Department of Health, Birth Statistics). You also need to know that the national rate during the time period of the study is about 32 cases per 1 million, or 0.000032 (Ward et al., 2014). Let's define "rate" as:

$$p = \frac{\#\text{ of cases with ALL}}{\text{Total } \#\text{ of cases}}$$
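As a starting point only, the sample statistic implied by this definition can be computed from the figures quoted above. The variable names are illustrative, and the sketch assumes that the 405 children with ALL in the study are the cases and that 2,019,054 is the total.

```python
# Figures quoted in the text above; variable names are illustrative
cases_with_all = 405            # PA children with ALL in the study
total_children = 2_019_054      # total number of children in PA over the study years
national_rate = 32 / 1_000_000  # national rate: about 32 cases per 1 million

# Sample rate of ALL in PA, following the definition of "rate" above
p_hat = cases_with_all / total_children

print(f"PA sample rate: {p_hat:.6f}")
print(f"National rate:  {national_rate:.6f}")
```

The hypotheses, the randomization distribution, and the p-value are left for the exercise below.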

In the DataCamp code window below, design and conduct a test for single proportion:

  1. Write your null and alternative hypotheses.
  2. Perform the hypothesis test using a randomization distribution.
  3. Report the p-value.
  4. What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.

 Assess It: Check Your Knowledge
