Hi. In this video we're going to do a hypothesis test for two samples comparing means, but this time we're going to look at the mean of differences, that is, a paired comparison. This is in contrast to the difference of means that we did in the last video. So, let's get to it. The reason we're able to do a paired comparison here is that at each individual time we have both a measurement of wind percent deficit and natural gas percent deficit, so we can compare what's going on at each individual time as we go along. We don't have to compare the entirety of one curve against the entirety of another curve; there may be valuable information in constraining that comparison to the individual time steps that we have here. So, how do we do this?
First, we have to reorganize our data a little bit. Let's create a new data object called df2 from our original genpiv by pivoting that data frame, and it'll start to make sense after we've done it. Our index is going to be the datetime column, our columns will be created from the fuel column that we have in genpiv, and our values will come from percent deficit. Let's see what this looks like, and it'll be clear what's being done here. What was the datetime column in genpiv has now become the index. The fuel column, which contained our two groups, natural gas and wind, has become two separate columns, and the values underneath are the percent deficits. The reason we organize it this way is that we now have the pairs at each time step spread across two columns, which lets us very easily calculate the difference for each of those pairs.
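A minimal sketch of that pivot step, assuming genpiv has columns named datetime, fuel, and percent_deficit (your actual column names and data will differ):

```python
import pandas as pd

# Hypothetical long-format data: one row per (datetime, fuel) pair
genpiv = pd.DataFrame({
    "datetime": ["2022-01-01 00:00", "2022-01-01 00:00",
                 "2022-01-01 01:00", "2022-01-01 01:00"],
    "fuel": ["Natural Gas", "Wind", "Natural Gas", "Wind"],
    "percent_deficit": [-10.0, 2.0, -15.0, 4.0],
})

# Pivot: datetime becomes the index, the two fuel categories become
# columns, and percent_deficit fills in the cell values
df2 = genpiv.pivot(index="datetime", columns="fuel",
                   values="percent_deficit")
print(df2)
```

After the pivot, each row of df2 holds one paired observation: the gas and wind percent deficits measured at the same time.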
So, I'll make a new variable, difference, that's simply our natural gas minus our wind. Why natural gas minus wind and not wind minus natural gas? Because of our hypothesis: our hypothesis is about the mean of the difference between gas and wind, following the same order we used in our first comparison of two means, where we looked at the difference of means. Our order there was gas minus wind, so we're preserving that same order. Again, this is based on the supposition from the data visualization that gas is underperforming relative to wind, but the goal of this is to test that supposition, test that hypothesis. So there's our difference; we can see that it's been calculated, and now we have a difference column that's simply the difference between those two columns. Finally, to get our sample statistic, which we'll call meandiff, we take the mean of this difference column.
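In code, the paired difference and the sample statistic might look like this, with column names assumed to match the pivot above:

```python
import pandas as pd

# Hypothetical wide-format frame as produced by the pivot step
df2 = pd.DataFrame(
    {"Natural Gas": [-10.0, -15.0, -12.0], "Wind": [2.0, 4.0, 0.0]},
    index=["t0", "t1", "t2"],
)

# Gas minus wind, preserving the order used in the difference-of-means test
df2["difference"] = df2["Natural Gas"] - df2["Wind"]

# The sample statistic: the mean of the paired differences
meandiff = df2["difference"].mean()
print(meandiff)
```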
So, we can see what that value is. Lo and behold, minus twelve and a half percent. Note that this is identical, to the greatest degree of precision we have here, to what we had above with the difference of means, minus 12.5628 and so on. They're mathematically the same thing, so we have exactly the same sample statistic, just calculated slightly differently. However, the randomization test is going to be done considerably differently. As a reminder, we do need numpy here, so include that to be complete. We'll also copy df2 to something called sim so that we don't mess up our original data frame; that's just good practice. Capital N will again be a thousand, and little n will again be the number of rows in our data frame of interest. We need an empty vector to store our statistics in after we do the randomization, and we'll fill it up as we march through our for loop.
So, what goes into our for loop? First we need to do our randomizing. The way we'll do it is by calculating a multiplier, which is just going to be either one or minus one (and note that the pair of values should be in square brackets). That is, we either flip the sign of the difference or we don't, because flipping the sign of the difference is equivalent to swapping the values across the two groups, aka across the two columns here. So this is quite simple in terms of code.
We just use random choice again to randomly select from the vector [1, -1]. We'll randomly choose one or minus one n times, with replacement of course, so that every time we go through the loop we end up with a vector the same length as the difference column, filled with some random assortment of ones and minus ones. Then our new simulated, randomized difference column is simply this multiplier times our original difference: either we flip the sign or we don't. Finally, we store our sample statistic, the mean of that randomized difference column, in the empty xbardiff vector. Okay.
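Putting the whole randomization together, here is one sketch under the same assumed names (a frame with a difference column; the values below are stand-ins):

```python
import numpy as np
import pandas as pd

# Stand-in paired differences (gas minus wind at each time step)
df2 = pd.DataFrame({"difference": [-12.0, -19.0, -12.0, -8.0, -15.0]})

sim = df2.copy()         # work on a copy so the original frame stays intact

N = 1000                 # number of randomizations
n = len(sim)             # number of paired observations
xbardiff = np.zeros(N)   # empty vector to hold the simulated statistics

np.random.seed(42)       # fixed seed, only so the sketch is reproducible
for i in range(N):
    # Randomly flip the sign of each paired difference: a multiplier of -1
    # is equivalent to swapping that row's values between the two columns
    multiplier = np.random.choice([1, -1], size=n, replace=True)
    sim["difference"] = multiplier * df2["difference"]
    xbardiff[i] = sim["difference"].mean()
```

The key design point is that each row's difference only ever changes sign; values never move between rows, which is exactly the within-pair constraint the video describes.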
So, there we go. Then we can of course visualize the randomization distribution. We're recycling the variable name xbardiff, but we're no longer looking at the difference in means, we're looking at the mean of differences, so we'll change that label. And our sample statistic is now meandiff, so we'll rename that as well. Let's see what this looks like.
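A minimal way to plot that randomization distribution with the observed statistic marked, assuming matplotlib is available (the arrays below are stand-ins for the real xbardiff and meandiff):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")    # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Stand-in randomization results and observed statistic
xbardiff = np.random.default_rng(0).normal(0.0, 3.0, 1000)
meandiff = -12.5

fig, ax = plt.subplots()
ax.hist(xbardiff, bins=30)
ax.axvline(meandiff, color="red", linestyle="--",
           label="observed mean of differences")
ax.set_xlabel("mean of differences (%)")
ax.set_ylabel("count")
ax.legend()
fig.savefig("randomization_dist.png")
```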
There we go. So immediately you might say, okay, the meandiff, which is equal to this minus 12.5 percent, is still outside the realm of our randomization distribution, right? Again, this randomization distribution is the suite of possible events we could see if the null hypothesis were true, and so you can immediately tell from this that our p-value is still going to be zero. But there is one difference I want to tease out. Compare this to what we did with the difference of means: take a look at this randomization distribution. Most of it is concentrated between about minus seven and a half and positive seven and a half, whereas we had a wider sweep of possible events when we did the difference of means.
So, in other words, this randomization distribution from the mean of differences is a little more concentrated around the null hypothesis value, in this case a zero percent deficit. The reason is that we're constraining the randomization to take place within each row. Before, with the difference of means, values could essentially swap rows and end up anywhere; here they can only stay within their rows. Because we're maintaining that pairing, because we're saying there's added information in keeping these values at the same times, that there's something valuable about that time information, we get a narrower range of possibilities in this randomization distribution, and thus a more definitive answer in our hypothesis test, even though the p-value is still zero. I encourage you to go take a look at a subsequent video on effective sample size, in which we'll compare these; with a more limited sample size you can see some interesting differences there. Okay, to round out the discussion, let's calculate our p-value and arrive at a conclusion. We'll just borrow our p-value calculation code from above. It's still xbardiff, but our observed statistic is now meandiff, and again it's still a less-than comparison, and again we get a p-value of zero. Our concluding statement would be: the p-value is low, it's zero, it's less than 0.05, so we reject the null hypothesis. Same conclusion that we had under the difference of means. Okay. Thank you.
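The borrowed p-value calculation might look like this, with stand-in values for xbardiff and meandiff and a one-sided, less-than comparison as in the video:

```python
import numpy as np

# Stand-in simulated statistics and observed statistic
xbardiff = np.random.default_rng(1).normal(0.0, 3.0, 1000)
meandiff = -12.5

# One-sided p-value: the fraction of simulated statistics at least as
# extreme (as small) as the observed mean of differences
pvalue = np.mean(xbardiff <= meandiff)
print(pvalue)
```

With the observed statistic far outside the randomization distribution, this fraction comes out at or near zero, matching the conclusion in the video.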