EME 210
Data Analytics for Energy Systems

Randomization Procedure for Correlation

Read It: Randomization Procedure for Correlation

To conduct the randomization procedure for correlation, we will return to np.random.choice, which we used in Lesson 5 to conduct hypothesis tests. The first step in the procedure is to define the hypotheses. For correlation tests, the hypotheses follow the same basic formula: the null hypothesis assumes no correlation (i.e., ρ = 0), and the alternative hypothesis is an inequality (e.g., ρ > 0, ρ < 0, or ρ ≠ 0), as indicated below.

H₀: ρ = 0

Hₐ: ρ > 0, or ρ < 0, or ρ ≠ 0

Once the hypotheses are defined, we can begin the randomization procedure. The main difference between this procedure and previous iterations is that we only shuffle one of the variables we are interested in. For example, say we have the following data: 

ID x y
A 1 4
B 2 5
C 3 6

In this case, we want to randomly sample either x or y without replacement (i.e., reallocate the data for only one variable). If we choose to randomly sample y, for example, the data may then look like this: 

ID x y
A 1 6
B 2 4
C 3 5
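As a minimal sketch of a single shuffle, assuming the three-row table above is stored in a pandas data frame (the name `df`, the seeded generator, and the seed value are illustrative choices, not part of the lesson), the snippet below reorders only `y` and recalculates the correlation:

```python
import numpy as np
import pandas as pd

# Toy data from the table above
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["A", "B", "C"])

# Sample correlation with the original pairs
sampcorr = df["x"].corr(df["y"])

# Seeded generator so the shuffle is repeatable (the seed is arbitrary)
rng = np.random.default_rng(0)

# Work on a copy so the original data frame is not overwritten
sim = df.copy(deep=True)

# Sample y without replacement: same values, possibly in a new order
sim["y"] = rng.choice(df["y"].to_numpy(), size=len(df), replace=False)

# Correlation after breaking the pairs
shuffled_corr = sim["x"].corr(sim["y"])

print("original correlation:", sampcorr)
print("shuffled y:", sim["y"].tolist())
print("correlation after shuffle:", shuffled_corr)
```

With only three rows there are just six possible orderings, so a single shuffle can occasionally return the original order. That is expected and is one reason the procedure is repeated many times.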

Note that all of the original y data is still there; it is simply in a new order, since we sampled without replacement. This procedure breaks the pairs, which lets us test whether the sample correlation is significantly different from 0. If the true correlation is 0, breaking the pairs should not matter, and the shuffled correlations will look much like the sample correlation. If the sample correlation is far from 0, breaking the pairs will typically pull the recalculated correlation toward 0. Once the random sampling is completed, we recalculate the correlation coefficient with the new pairs, store that value, and repeat for a sufficiently large number of iterations (e.g., N = 1000). The stored values form the randomization distribution against which the sample correlation is compared. Below is a video demonstrating this procedure.
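The repeated shuffle-and-recalculate steps can be sketched as a small function. This is an illustration under stated assumptions rather than the course's reference code: the function name `simulate_null_correlations`, the seeds, and the synthetic data are all made up for the example.

```python
import numpy as np
import pandas as pd


def simulate_null_correlations(df, x_col, y_col, N=1000, seed=None):
    """Shuffle y_col N times and return the N recalculated correlations.

    Hypothetical helper for illustration; not part of the lesson's code.
    """
    rng = np.random.default_rng(seed)
    sim = df.copy(deep=True)            # keep the original pairs intact
    n = sim.shape[0]                    # sample size
    simcorr = np.empty(N)               # storage for simulated correlations
    original_y = df[y_col].to_numpy()
    for i in range(N):
        # Reallocate only one variable, sampling without replacement
        sim[y_col] = rng.choice(original_y, size=n, replace=False)
        simcorr[i] = sim[x_col].corr(sim[y_col])
    return simcorr


# Synthetic, positively related data (invented for this example)
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=200)
y = 0.8 * x + data_rng.normal(scale=0.5, size=200)
df = pd.DataFrame({"x": x, "y": y})

sampcorr = df["x"].corr(df["y"])
simcorr = simulate_null_correlations(df, "x", "y", N=1000, seed=2)

print("sample correlation:", round(sampcorr, 3))
print("mean of simulated correlations:", round(simcorr.mean(), 3))
```

Because the shuffles break the pairing, the simulated correlations should cluster around 0 even though the sample correlation in this synthetic data set is strongly positive.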

 Watch It: Video - Hypothesis Testing Correlation (10:25 minutes)

Transcript:

Welcome back to the videos on correlation. Today we're going to continue to talk about correlation, but we're going to get into that statistical inference side. And in particular, I'm going to introduce to you how to conduct a hypothesis test for testing the significance of correlation. So, let's jump in.

So, here we are in Google Colab. I've already loaded in the data. We're still working with that Marcellus well and water data that we used up here to make this conclusion. But we're going to do a bit more of a quantitative analysis and focus on hypothesis testing today. And so, the first step in any hypothesis test is to state the hypotheses. And so, our null hypothesis is rho, the Greek letter, equals zero. And it's always going to be this, because the significance test for correlation is whether or not the correlation is significantly different from zero. And so we have rho equals zero. And then our alternative here is that rho is greater than zero. Now this could be less than, this could be not equal to, but for this case I'm going to go greater than zero.

And so once those are stated then we need to conduct the randomization procedure. And I've outlined these steps here, so that you can have them later when thinking about the randomization procedure. But we start with data. And now in the past we've shuffled data around. We've done reallocation randomization. In this case we're going to do a reallocation of sorts, but we're only going to do it with one variable. So say we start with this data here, X and Y. What we're then going to have to do is choose one variable to shuffle. Here I'm doing Y, and we sample without replacement, so reallocating the data for that one variable, and then we're going to recalculate the correlation coefficient and repeat that capital N number of times. And really the goal with this reallocation is to break the pairs. So we don't want to keep them paired together, we want to break them and then recalculate the correlation. Because if the correlation is equal to zero, then it shouldn't matter how much we shuffle the data, it should always be equal to zero. And so by shuffling we can sort of get at that significance level. And so we can go ahead and get started with the code.

So the first thing we need to do is define our sample correlation coefficient. And I'll call it sampcorr. And this is just how we calculated the correlation before, total base water volume dot corr with total gas.

So this is our sample correlation. And then before we get into our randomization loop, we need to initialize the variables. So I'm going to make a copy because we don't want Python to accidentally overwrite our original data frame. So I'm going to call it sim. We need capital N for 1000 iterations. We need lowercase n, which is our sample size, which I'm just going to set to be the number of rows in our simulated, our copied data set. And then we need to create an empty array called simcorr, which we'll use to store our simulated correlations. So I'll go ahead and run that. If we come down here and check out sampcorr we can see that it's the same value that we got up here, just it hasn't been rounded. So then we need to start the randomization loop. And so, we start it like we have all of our randomization loops, for i in range N. And then we say sim total gas. So again we're selecting one variable. So this could be water, but I'm going to do total gas. It doesn't matter which one you select as long as it's the same throughout all of your code. So I'm going to take sim total gas and I'm going to shuffle it. So I'm going to say np dot random dot choice. And then I'm going to randomly draw from again total gas, so making sure it's the same variable.

Our size is going to be lowercase n and then replace equals False. So there are two things that are a little bit different here from what we've done in the past. We're replace... we're sampling without replacement. So you need to make sure that you set replace equal to False. We've done that before, but the main change is that we're just going to randomly sample from one variable, shuffle it, and now we're going to recalculate our correlation.

So this is total base water volume dot corr and total gas, making sure that you're using the sim data frame here so that you are taking the correlation with our new re-shuffled data set. So we can run this and if we come down here we can print simcorr and see that we've got a lot of different correlation values, occasionally even some negative values. So one of our reallocations led to a negative relationship between total gas and total base water volume.

So that is the randomization. And like we have done before, a lot of times it's easier to then visualize as the next step. So before we can do that, we need to create a data frame for simcorr because right now it is a NumPy array and ggplot doesn't like to work with those. So we can say simcorr, and the column I'm just going to name simcorr, so keep the same name. And then we can say ggplot simcorr DF, and we can do a histogram.

And so our x value is the only one we need, which is simcorr, and then I'm going to add a vertical line where the x intercept equals sampcorr, our sample statistic. The color is blue, and we will do line type equals dashed. And I'm going to enter that down so it's a little easier, and wrap this in parentheses so Python knows it's all one line. And then we can go ahead and run that. And so, here we can see this is our simulated correlation values. So they're centered around zero, and this is our sample one. And so depending on how we framed our alternative, the p-value is either going to be close to zero, if it's however much is greater than that, or close to one, if it's however much is less than that. But we have done a greater than test and so we can say print the p-value, and it is just the length of simcorr where simcorr is greater than or equal to sampcorr divided by capital N.

So we can run this, and see we get zero, which we expected. There is no data above our sample line, which leads us to reject the null hypothesis. And in particular we can say that the p-value is less than the significance level of 0.05, and give exactly what the p-value is. So we reject the null hypothesis in favor of the alternative that the correlation between these variables is significantly greater than zero. And so, that is how you can do a randomization procedure to test the significance of correlation.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
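The p-value step described in the video can be sketched as a small helper that covers all three alternatives. The function name `randomization_p_value` and the tiny example array are hypothetical. The two-sided rule used here (the proportion of simulated correlations at least as far from 0 as the sample correlation) is one common convention and is an assumption, since the video only demonstrates the greater-than case.

```python
import numpy as np


def randomization_p_value(simcorr, sampcorr, alternative="greater"):
    """Proportion of simulated correlations at least as extreme as sampcorr.

    Hypothetical helper for illustration; not part of the lesson's code.
    """
    simcorr = np.asarray(simcorr)
    if alternative == "greater":        # Ha: rho > 0
        extreme = simcorr >= sampcorr
    elif alternative == "less":         # Ha: rho < 0
        extreme = simcorr <= sampcorr
    elif alternative == "two-sided":    # Ha: rho != 0 (assumed convention)
        extreme = np.abs(simcorr) >= abs(sampcorr)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    # Same as len(simcorr[extreme]) / N
    return extreme.mean()


# Tiny made-up randomization distribution, just to show the arithmetic
example_simcorr = np.array([-0.3, -0.1, 0.0, 0.1, 0.2])
example_sampcorr = 0.15

p_greater = randomization_p_value(example_simcorr, example_sampcorr, "greater")
p_less = randomization_p_value(example_simcorr, example_sampcorr, "less")
p_two_sided = randomization_p_value(example_simcorr, example_sampcorr, "two-sided")

print("greater:", p_greater)
print("less:", p_less)
print("two-sided:", p_two_sided)
```

In practice `simcorr` would hold the N = 1000 values from the randomization loop rather than five hand-picked numbers.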

 Try It: OPTION 2 DataCamp - Apply Your Coding Skills

Using the same data as the previous DataCamp exercise, edit the following code to implement a randomization procedure. 

Starter code:

    # libraries
    import pandas as pd
    import numpy as np

    # create some data
    df = pd.DataFrame({'x': np.random.randint(0,100,1000), 'y': np.random.randint(0,200,1000)})

    # calculate sample statistic
    sampcorr = ...

    # Initialize Variables
    sim = df.copy(deep = True)
    N = 1000
    n = sim.shape[0]
    simcorr = np.empty(N)

    # Randomization Loop
    for i in range(N):
        sim['x'] = ...
        simcorr[i] = ...

    # calculate p-value
    print('p-value: ', ...)

Solution:

    # libraries
    import pandas as pd
    import numpy as np

    # create some data
    df = pd.DataFrame({'x': np.random.randint(0,100,1000), 'y': np.random.randint(0,200,1000)})

    # calculate sample statistic
    sampcorr = df['x'].corr(df['y'])

    # Initialize Variables
    sim = df.copy(deep = True)
    N = 1000
    n = sim.shape[0]
    simcorr = np.empty(N)

    # Randomization Loop
    for i in range(N):
        sim['x'] = np.random.choice(sim['x'], size = n, replace = False)
        simcorr[i] = sim['x'].corr(sim['y'])

    # calculate p-value
    print('p-value: ', len(simcorr[simcorr >= sampcorr])/N)


 Assess It: Check Your Knowledge

Knowledge Check