EME 210
Data Analytics for Energy Systems

Effect of Sample Size



Read It: The Effect of Sample Size in Hypothesis Testing

  • A larger sample size will lead to a narrower randomization distribution. Usually, this will lower the p-value. In other words, if you have a large sample, you can more definitively reject the null hypothesis even when the sample statistic is close to the null value.

  • A smaller sample size will lead to a wider randomization distribution. Usually, this will increase the p-value. In other words, if you have a small sample, you are less likely to be able to reject the null hypothesis when the sample statistic is close to the null value. (The code sketch below illustrates both effects.)

MORE DATA IS ALWAYS BETTER!!!
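To make these two effects concrete, here is a minimal sketch of a randomization test for a single mean run at two different sample sizes. This example is not from the course notebook: the null value of 50, the simulated data, and all variable names are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(42)

def randomization_test(sample, null_mean, n_iter=5000):
    """Right-tailed randomization test for a single mean.
    Shift the sample so its mean equals the null value, then resample
    with replacement to build the randomization distribution."""
    shifted = sample - sample.mean() + null_mean
    stats = np.array([rng.choice(shifted, size=len(sample), replace=True).mean()
                      for _ in range(n_iter)])
    spread = stats.std()                       # width of the distribution
    p_value = np.mean(stats >= sample.mean())  # right-tailed p-value
    return spread, p_value

null_mean = 50.0                               # hypothetical null value
for n in (20, 200):                            # small vs. large sample
    sample = rng.normal(loc=52, scale=10, size=n)  # true mean just above null
    spread, p = randomization_test(sample, null_mean)
    print(f"n = {n:3d}: spread = {spread:.2f}, p-value = {p:.3f}")
```

Going from n = 20 to n = 200, the randomization distribution narrows by roughly a factor of √10, and the same-sized effect yields a much smaller p-value.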

The following video presents a numerical experiment where we can see the effect of sample size on the randomization distribution and the p-value. 

 Watch It: Effect of Sample Size (12:06 minutes)

Click here for a transcript.

Hi. In this video, I want to show you what effect the sample size has on our hypothesis tests and their results.

So, for starters, just to summarize: we saw in all five hypothesis tests that our p-value was zero. And this was using all available data, which is what one should do normally. You should always use all the data that you have available to you. But we had pretty large sample sizes there. So we're using all the data, a good amount of data, collected every 15 minutes during that two-day duration. So there's a wealth of data, and because of that we got really low p-values. We had lots of evidence to definitively say, first with the single-sample tests, that both wind and gas didn't meet capacity. But then when we did the two-sample tests and compared them, natural gas performed even worse than wind. So both failed, but natural gas failed more severely, which is what's being said in the results here. But we didn't see much variation in our p-values. Every time the p-value was zero, and we had these very definitive results; we always saw that the sample statistic fell well outside of the randomization distribution.

And so, I want to show you what effect sample size has here: one, to give some more variation in the results and show what statements you should make if your p-values are not zero, but also to really give you motivation and show you the importance of collecting data. Okay, so in order to do this we're going to imagine some situations here. So imagine the situation that we have far less data than we actually have. Say, for example, suppose during this two-day event across February 14th and 15th, 2021, maybe there was some sensor failure in the grid and we weren't able to record data every 15 minutes. Maybe we only had a handful of data in actuality. So, in order to simulate that hypothetical or imagined situation, let's insert some code here and alter reality. We're going to down-sample the data set. Now, this is something that you shouldn't normally do in practice. Again, in practice, you want to use all the data that's available to you. But I'm doing this here just to show you how these results would change if we weren't fortunate enough to have this much data.

So, in order to do this down-sampling we're going to go back to genpiv and we're going to pivot this. Actually, we've done this before, so I'm just going to copy and paste our pivot. So we're going to pivot so that we have this situation. Now that we have values paired here, we can easily down-sample. Essentially we can just select some of these times at which, say, the sensor would actually be working. And we can do this quite easily using a built-in function in pandas called sample. So this sample function is from pandas, and what we do by specifying, say, n equals 20 is randomly sample 20 values, or 20 rows, from this data set. Okay, so instead of having 192 samples, we're going to go down to 20. So, far less data.

Then we're going to reconstitute this in its original form. So we're going to unpivot, AKA melt. And we did that here, so we can copy and paste this. So datetime is now the index; we'll bring it back to a column, and then we will melt this back into the long format. So, let's actually go to runtime and we will run everything before, up to this code. Now we can run this again, and here we go: here's our new down-sampled data set. We've got datetime, fuel, generation. Note that we've got 40 rows here, indexed 0 to 39, because our sample size is 20 and we've got two groups. Okay, so that's our new down-sampled genpiv. Now let's see what effect this change has on all our results. So we have a slightly different wind mean, although it's basically the same, 5.15. We'll redo the randomization distribution.
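In code, the down-sampling sequence narrated above looks roughly like the sketch below, assuming gen is the original long-form DataFrame; the column names (datetime, fuel, generation) follow the narration, but the actual notebook code may differ.

```python
import pandas as pd

# Pivot to wide form: one row per timestamp, one column per fuel type,
# so values recorded at the same time stay paired when we down-sample.
genpiv = gen.pivot(index='datetime', columns='fuel', values='generation')

# Randomly keep n=20 of the ~192 rows, simulating a sensor that only
# recorded occasionally. (Don't do this in practice; use all your data!)
genpiv = genpiv.sample(n=20, random_state=1)

# Un-pivot (melt) back to the original long form: datetime, fuel, generation.
genpiv = genpiv.reset_index().melt(id_vars='datetime',
                                   var_name='fuel',
                                   value_name='generation')
print(len(genpiv))  # 40 rows: 20 timestamps x 2 fuel types
```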

Oh, this randomization distribution has gotten quite a bit wider now. It's still centered on the null value of 6.1 (this is for the single mean here), and it overlaps with our sample statistic, so our p-value now is not zero: 0.042. It is still less than 0.05, though, so we're still below our default significance level. But we should change this statement, since our p-value is not zero anymore; it's 0.042. I'll just edit this right now. Therefore, we can still reject the null hypothesis, but 0.042 is only a little bit less than 0.05. It's not dramatically less. Okay, let's continue on. What about our test for a single proportion?

Our sample statistic has gone up a bit, to 0.7. We can just re-do all this code here. Again, we've got overlap here, and what's our p-value? 0.061. So this does not meet the criterion for rejecting the null hypothesis, that is, using the default significance level of 0.05, so we would change this. Since the p-value of 0.061 is more than 0.05, we can no longer reject the null. We would say we fail to reject the null hypothesis that we met or exceeded the capacity half the time. Now, really, really important here: we do not accept the alternative. We are not saying that the alternative hypothesis right here, that the proportion, the failure rate, is greater than half the time, is true. And we're not accepting the null hypothesis either; we're just failing to reject it. Okay. Continuing on, when we get into the two-sample tests, we'll just recalculate our percent deficit values here. Note that this is a much less detailed curve than we had before; you can see where the individual data points are, where the inflection of the curve changes. And we will go ahead and run our test here. Our difference of means is no longer 12 and a half percent; it's 13.8. So that changes a bit because we've altered our data set.

Again, we've got overlap here, and again our p-value is greater than 0.05, so we would again alter this statement to reflect that this is no longer 0.0. Since 0.077 is more than the significance level of 0.05, we fail to reject the null hypothesis that there's no difference. Okay. So we fail to reject. We do not accept the null; we fail to reject it. Redoing now, going down to the mean of differences, running this again. Note that this Python notebook makes it very handy to just rerun the code. We just go through and press play. We still have everything called genpiv; we just altered genpiv, so it makes it very easy to examine this hypothetical situation. This is interesting. So when we do the mean of differences, our p-value is 0.038. So, it's still low, still below 0.05. We still reject the null hypothesis.

But this is different from the difference of means that we did above. So take a look at this randomization distribution, from about negative 20 to positive 20. Note that we still have the same sample statistic here, minus 13.898, blah, blah, blah. When we go up to the difference of means, we still have that same sample statistic, but the randomization distribution is fatter. It goes up to maybe minus 22.5 and positive 22.5. The p-value was 0.077. It's about twice as large as what we got with the mean of differences. So as I said before, when doing the paired comparison with the mean of differences, by constraining the randomization to stay within each time interval, we're adding information to the test. We're saying that the time makes a difference, that the time is valuable, and in doing that we get a lower p-value. So, really interesting here: doing the difference of means, we'd fail to reject the null. However, if we do the mean of differences, we do reject the null. We get a very different outcome. Just to round off the discussion on the effect of sample size, we'll redo our comparison of proportions. We run our code here: a proportion of 0.7 agrees with what we had with the single proportion, the difference is 0.3, we get a much wider randomization distribution, and the p-value is quite low, 0.007.
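The contrast between those two tests can be sketched generically as follows. The data here are fabricated rather than taken from the course data set; the point is that the paired test (mean of differences) randomizes only the sign of each within-time difference, while the unpaired test (difference of means) reshuffles the group labels freely and therefore produces a wider randomization distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
n_iter = 5000

# Fabricated percent-deficit values for two fuels at the same n timestamps.
wind = rng.normal(40, 15, size=n)
gas = wind + rng.normal(14, 10, size=n)   # gas tends to be worse at each time

obs = gas.mean() - wind.mean()            # observed statistic (same for both tests)

# Unpaired -- difference of means: pool all values and reshuffle the labels,
# discarding the information about which timestamp each value came from.
pooled = np.concatenate([wind, gas])
unpaired = np.empty(n_iter)
for i in range(n_iter):
    perm = rng.permutation(pooled)
    unpaired[i] = perm[n:].mean() - perm[:n].mean()

# Paired -- mean of differences: keep each pair together and randomize only
# the sign of the within-timestamp difference.
diffs = gas - wind
paired = np.empty(n_iter)
for i in range(n_iter):
    signs = rng.choice([-1, 1], size=n)
    paired[i] = (signs * diffs).mean()

print("difference of means: p =", np.mean(unpaired >= obs))
print("mean of differences: p =", np.mean(paired >= obs))
# The paired distribution is narrower, so the paired p-value is lower.
```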

So, we would still reject the null. In this case we still have the same outcome. So, the effect of sample size: if we have a much smaller sample size, we do see these non-zero p-values. And in particular, some of those tests fail to reject the null; the p-value is larger than our default significance level of 0.05. And also, in the two-sample comparison of means, our difference of means fails to reject, whereas our mean of differences still rejects. So, really interesting there. Okay, so sample size is really important. A larger sample size is always better; more data is always better. Really important to bear that in mind. Okay. Thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

The Google Colab Notebook used in the above video is available here, and the data are here. For the Colab file, remember to click "File" then "Save a copy in Drive". For the data, it is recommended to save them to your Google Drive.


 Try It: Coin Flips

Imagine you have a coin that you suspect is "unfair", meaning that it is weighted more to one side. You decide to test the coin with a series of flips, where the number of flips is your sample size, n. In this series of flips, you record the proportion of "heads", and it comes out to 0.6 (regardless of the magnitude of n).  With your hypotheses being:

  • Ho: p = 0.5
  • Ha: p > 0.5

write code in the DataCamp cell below to report the p-value and the standard error of the randomization distribution under two conditions: 1) when n = 10 flips, and 2) when n = 100 flips. What happens to the standard error as your sample size increases? What happens to the p-value?
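If you get stuck, here is a minimal sketch of one possible approach, using a binomial simulation to build the randomization distribution; the function and variable names are our own, and your DataCamp cell may expect a different workflow.

```python
import numpy as np

rng = np.random.default_rng(7)

def coin_test(n, p_hat=0.6, null_p=0.5, n_iter=10000):
    """Randomization test for a single proportion (right-tailed).
    Simulate n flips of a fair coin n_iter times, then compare the
    simulated proportions of heads to the observed proportion p_hat."""
    sims = rng.binomial(n, null_p, size=n_iter) / n
    se = sims.std()                    # standard error of the randomization dist.
    p_value = np.mean(sims >= p_hat)   # Ha: p > 0.5, so right-tailed
    return se, p_value

for n in (10, 100):
    se, p = coin_test(n)
    print(f"n = {n:3d}: SE = {se:.3f}, p-value = {p:.3f}")
```

As n increases, the standard error shrinks (by about a factor of √10 here) and the p-value drops, mirroring the behavior seen in the video.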