EME 210
Data Analytics for Energy Systems

More Roshambo: Two-Way Tables

PrintPrint

  Read It: Two-Way Tables

In the previous section, we saw how frequency tables displayed the counts of observations in each category of a single categorical variable. However, sometimes it is useful to see how these counts change across the categories of a second categorical variable. A "two-way table" gives the counts of observations in each and every combination of categories from two different categorical variables:

Category A Category B Category C
Category 1 # of observations in A and 1 # of observations in B and 1 # of observations in C and 1
Category 2 # of observations in A and 2 # of observations in B and 2 # of observations in C and 2
Category 3 # of observations in A and 3 # of observations in B and 3 # of observations in C and 3

Let's look at a more interesting example that Dr. Morgan found on Instagram:

Instagram image that state over 50% of people in the US who live near hazardous waste sites are people of color
Exmple of a two-way table from Instagram
Instragram 

We are given a proportion here: 50% (well, actually "over 50%", but let's just round down to 50%). This could easily be exchanged with the actual count of people, so let's treat the proportion synonymously with the count for this discussion. What is the categorical variable here? Race, with the categories "people of color" and "not people of color" (or, "white"). What does the frequency table look like?

Race Frequency (near Haz. Waste Site)
People of Color 50%
White 50%

However, this frequency table alone doesn't give the full picture. In other words, the 50% quoted in the picture above doesn't signify much without more context. What is that context? We need to know how many people of each race category in the U.S. as a whole. In other words, we need to compare the frequency of each race near the hazardous waste sites to the baseline frequency. If the rest of the U.S. is 50% people of color to begin with, then the statement in the picture above isn't significant at all. However, according to the 2020 U.S. Census, this isn't the case:

Location
Race Near Haz. Waste Site In all of U.S.
People of Color 50% 38.4%
White 50% 61.6%

Here, the two-way table allows us to see that the proportion of those living near hazardous waste sites is disproportionately comprised of people of color compared to the rest of the U.S. In summary, two-way tables allow us to make some pretty powerful comparisons of frequencies across two categorical variables.

Making two-way tables in Python is very easy, using the crosstab function from Pandas. The following video demonstrate the use of this function, as well as finding proportions from the resulting two-way table.


  Watch It: Two-Way Tables (9:43 minutes)

Click here for a transcript.

Hi, in this video we're going to continue summarizing categorical theta, but looking at two-way tables. So, basically looking at value counts across two different categorical variables simultaneously. So let's get to it. We're going to continue on again with our state energy rankings data. And recall that last time we made a new categorical variable asking whether a state produced natural gas or not using this np.wear function. We're going to borrow the same code here, and we're going to make a second categorical variable looking at whether a state has total energy production above one quadrillion BTU or not. So, we'll develop another variable. We'll call this above one quadrillion BTU, question mark, and we'll replace our isnull question with a different statement. So we'll ask ourselves whether the total energy is greater than a thousand because total energies in units of trillions of BTU. So a thousand trillions as a quadrillion. And this time, if it is above, then we're going to return a yes, yes it is above one quadrillion. And if it's not, so if this statement returns a false, again we're going to say no, it is not above, it is not above one quadrillion BTU. And again, we'll do the same processing. We'll confirm, convert it to categorical and confirm that. Let's go ahead and run this. There we go. We see that our question above one quadrillion BTU is categorical. So now we have two interesting categorical variables here, produces NG, and above one quadrillion BTU. Let's take a look at how many values we have in each combination of categories between these two categorical variables. So, in order to do this, we're going to use the cross tab function available from pandas, and we'll make a new object called twt same for two-way table. Reference pandas with td, function is cross tab, and we'll take our data frame which we're calling df3 right now. And we're going to give it the first categorical variable that we want, or we developed before that produces NG. And then, we give it the second categorical variable that we want in the 2A table. And that's going to be the one that we just made above one quadrillion BTU. And let's take a look at what we get.

There we go. So we've got our first variable, Produces NG, no, yes. So that's giving up the rows. Our second variable above one quadrillion BTU, giving the columns, no, yes. And the count of all the states that make these up. So there are 18 states that do not produce NG and do not have one quadrillion, produce more than one quadrillion BTU, and total energy, and so on, and so forth. Another handy feature of crosstab is if we add the argument margins equals true, what this will do is it'll give total columns and rows to this as we see here. So it added on all, which gives just the summation across all the rows. So, 18, plus 0, is 18. 18 plus 15, is 33. And so on. And then it gives us all row, which gives the summation down all the columns. So, 18 plus 18, is 36, etc. And then in this last intersection between the all column and all row gives all the observations in the data set. So the total number of rows. So, 51 in this case, 51 states. So, let's use this to find some proportions. Now first, suppose we want to find what proportion of states that produce natural gas also produce more than one quadrillion BTU in total. Let's borrow the same structure that we had before for finding proportions. I'm just going to copy and paste down here.

Now we have to change this up a little bit because twt now is a data frame, it's not a series anymore. So, we have to if we want to extract values from this, we need to reference the row and the column. Okay. And we need to do so using that dot loc function that we introduced before. So our numerator for this proportion is going to be those states that produce natural gas and also produce more than one quadrillion BTU. Now, that's located in the yes and yes column and row, respectively. I want this value 15, right. So we're going to take our twt option object data frame, do dot loc, so use our dot loc function, and we want to reference the yes row. And we want to reference the yes column to extract that 15. Here. Let's just make sure that worked real quick.

And there we go. Our denominator, on the other hand, is going to be the number of states that produce natural gas. So it's the first criteria here. So this. Or then what number of states that produce natural gas also produce one quadrillion. So if we want to get the total number of states that produce natural gas, where do we find that on the table? Well, it's in this yes row, right, produces natural gas, yes. Before we had found that it was 33. And we see that referenced under the all column here. So here we can say twt dot loc. We want the yes row, and the all column, twt not txt 33. There we go. And then we just find the proportion. So, 0.45 or 45 and a half percent. Okay. Well, let's do a slightly different proportion. And this is only just slightly different, but it highlights the power of language in stating these proportions. So, what proportion of states that produce more than one quadrillion BTU in total also produce natural gas? So we're reversing it. So, let's borrow our same structure here.

So now we're still, we still have the same numerator and this is because we're still looking at the number of states that do both things, that need both criteria: produce more than one quadrillion BTU, and also produce natural gas. So, we'll stick with this, 15. You still want that, but our denominator now changes. Our denominator is just what the first criteria is, and that is the states that produce more than one quadrillion BTU. So, not the states that produce natural gas now, but more than one quadrillion BTU, and that is under our yes column, and at our all row. So all we have to do is flip the order of what we have in this line for the denominator, the all row, yes column,

15 States in total that produce more than one quadrillion BTU. We see that all of them produce natural gas. And so, our proportion is one or a hundred percent. Lastly, one more proportion here, what proportion of states produce more than one quadrillion BTU and also produce natural gas? Wait a minute, this is the same question I just asked! No, we've by removing this, that produce we're now looking at out of all the states all 51, which ones do both things? Whereas this one was just looking at only the states that produce one quadrillion BTU. So we've got this conditional statement that here, that changes the game dramatically. So, we'll take this again. We're still using the same numerator, so we're still looking at the states that do both things. But now our denominator changes again. We want to look at all the states. So what proportion of all states do both things? And we can get this in the all row, in the all column, to get this 51. So, all in all, and go ahead and run this. Let me see that, 29.4 percent of the states out of all states produce more than one quadrillion BTU, and also produce natural gas. Okay, so that is two-way tables, and a bit more about proportions and how they can get more complex when we're looking at two different categorical variables. Thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

  Try It: DataCamp - More Roshambo

Continuing with the Roshambo exercise from the previous page, can you find the proportion of those who played Rock, ended up winning? First find the appropriate two-way table and reference the values in that table to find the appropriate proportion.

By inspection of the table, do the data suggest that one option is more favorable to play?


  Assess It: Check Your Knowledge

Knowledge Check

Use the table below to answer the following questions.

Make Model Type City MPG
Audi A4 Sport 18
BMW X1 SUV 17
Chevy Tahoe SUV 10
Chevy Camaro Sport 13
Honda Odyssey Minivan 14

FAQ