Click here for a transcript.
Hi, in this video we're going to continue summarizing categorical theta, but looking at two-way tables. So, basically looking at value counts across two different categorical variables simultaneously. So let's get to it. We're going to continue on again with our state energy rankings data. And recall that last time we made a new categorical variable asking whether a state produced natural gas or not using this np.wear function. We're going to borrow the same code here, and we're going to make a second categorical variable looking at whether a state has total energy production above one quadrillion BTU or not. So, we'll develop another variable. We'll call this above one quadrillion BTU, question mark, and we'll replace our isnull question with a different statement. So we'll ask ourselves whether the total energy is greater than a thousand because total energies in units of trillions of BTU. So a thousand trillions as a quadrillion. And this time, if it is above, then we're going to return a yes, yes it is above one quadrillion. And if it's not, so if this statement returns a false, again we're going to say no, it is not above, it is not above one quadrillion BTU. And again, we'll do the same processing. We'll confirm, convert it to categorical and confirm that. Let's go ahead and run this. There we go. We see that our question above one quadrillion BTU is categorical. So now we have two interesting categorical variables here, produces NG, and above one quadrillion BTU. Let's take a look at how many values we have in each combination of categories between these two categorical variables. So, in order to do this, we're going to use the cross tab function available from pandas, and we'll make a new object called twt same for two-way table. Reference pandas with td, function is cross tab, and we'll take our data frame which we're calling df3 right now. And we're going to give it the first categorical variable that we want, or we developed before that produces NG. And then, we give it the second categorical variable that we want in the 2A table. And that's going to be the one that we just made above one quadrillion BTU. And let's take a look at what we get.
There we go. So we've got our first variable, Produces NG, no, yes. So that's giving up the rows. Our second variable above one quadrillion BTU, giving the columns, no, yes. And the count of all the states that make these up. So there are 18 states that do not produce NG and do not have one quadrillion, produce more than one quadrillion BTU, and total energy, and so on, and so forth. Another handy feature of crosstab is if we add the argument margins equals true, what this will do is it'll give total columns and rows to this as we see here. So it added on all, which gives just the summation across all the rows. So, 18, plus 0, is 18. 18 plus 15, is 33. And so on. And then it gives us all row, which gives the summation down all the columns. So, 18 plus 18, is 36, etc. And then in this last intersection between the all column and all row gives all the observations in the data set. So the total number of rows. So, 51 in this case, 51 states. So, let's use this to find some proportions. Now first, suppose we want to find what proportion of states that produce natural gas also produce more than one quadrillion BTU in total. Let's borrow the same structure that we had before for finding proportions. I'm just going to copy and paste down here.
Now we have to change this up a little bit because twt now is a data frame, it's not a series anymore. So, we have to if we want to extract values from this, we need to reference the row and the column. Okay. And we need to do so using that dot loc function that we introduced before. So our numerator for this proportion is going to be those states that produce natural gas and also produce more than one quadrillion BTU. Now, that's located in the yes and yes column and row, respectively. I want this value 15, right. So we're going to take our twt option object data frame, do dot loc, so use our dot loc function, and we want to reference the yes row. And we want to reference the yes column to extract that 15. Here. Let's just make sure that worked real quick.
And there we go. Our denominator, on the other hand, is going to be the number of states that produce natural gas. So it's the first criteria here. So this. Or then what number of states that produce natural gas also produce one quadrillion. So if we want to get the total number of states that produce natural gas, where do we find that on the table? Well, it's in this yes row, right, produces natural gas, yes. Before we had found that it was 33. And we see that referenced under the all column here. So here we can say twt dot loc. We want the yes row, and the all column, twt not txt 33. There we go. And then we just find the proportion. So, 0.45 or 45 and a half percent. Okay. Well, let's do a slightly different proportion. And this is only just slightly different, but it highlights the power of language in stating these proportions. So, what proportion of states that produce more than one quadrillion BTU in total also produce natural gas? So we're reversing it. So, let's borrow our same structure here.
So now we're still, we still have the same numerator and this is because we're still looking at the number of states that do both things, that need both criteria: produce more than one quadrillion BTU, and also produce natural gas. So, we'll stick with this, 15. You still want that, but our denominator now changes. Our denominator is just what the first criteria is, and that is the states that produce more than one quadrillion BTU. So, not the states that produce natural gas now, but more than one quadrillion BTU, and that is under our yes column, and at our all row. So all we have to do is flip the order of what we have in this line for the denominator, the all row, yes column,
15 States in total that produce more than one quadrillion BTU. We see that all of them produce natural gas. And so, our proportion is one or a hundred percent. Lastly, one more proportion here, what proportion of states produce more than one quadrillion BTU and also produce natural gas? Wait a minute, this is the same question I just asked! No, we've by removing this, that produce we're now looking at out of all the states all 51, which ones do both things? Whereas this one was just looking at only the states that produce one quadrillion BTU. So we've got this conditional statement that here, that changes the game dramatically. So, we'll take this again. We're still using the same numerator, so we're still looking at the states that do both things. But now our denominator changes again. We want to look at all the states. So what proportion of all states do both things? And we can get this in the all row, in the all column, to get this 51. So, all in all, and go ahead and run this. Let me see that, 29.4 percent of the states out of all states produce more than one quadrillion BTU, and also produce natural gas. Okay, so that is two-way tables, and a bit more about proportions and how they can get more complex when we're looking at two different categorical variables. Thank you.