EME 210
Data Analytics for Energy Systems

Roshambo: Frequency Tables and Proportions

PrintPrint

  Read It: Roshambo

Roshambo (a.k.a., Rock, Paper, Scissors) is a tried-and-true method for settling disputes. In this game, two opponents synchronously state "Rock... Paper... Scissors... Shoot!", and on "Shoot!", each has to display one of three choices with their hand: Rock, Paper, or Scissors. The outcome is determined by:

Player 1
Rock Paper Scissors
Player 2 Rock Tie, play again Player 1 wins Player 2 wins
Paper Player 2 wins Tie, play again Player 1 wins
Scissors Player 1 wins Player 2 wins

Tie, play again

 Try It: Roshambo

Your response given in the form above will be randomly paired with another response, and the outcome will be stored in a dataset, which is to be used in the coding exercise further below. But first, let's learn about frequency tables and proportions.


 Learn It: Frequency Tables

Frequency tables list how many times each category appears within a categorical variable. These tables plainly show which categories are the most common, and which are rarely observed. The Pandas function that we will use to make a frequency table is "value_counts". Run the DataCamp code below to see how it works:

It is important to remember that "value_counts" works on a Series, and not a DataFrame. We pass "value_counts" a column from a DataFrame in the example above.

 Learn It: Proportions

A proportion, p , is the fraction of cases in a population that have a certain value or meet some criteria. From a data sample, the statistic is:

p ^ = #  of cases that meet criteria   Total # of cases 

It is easy to calculate a sample proportion from a categorical variable using the output from "value_counts".


  Watch It: Video - Importing Files (9:16 minutes)

Click here for a transcript.

All right, in this video we're going to cover summarizing categorical data. And in doing this, we're going to cover frequency tables, that is, counting the values within each category in a categorical variable and proportions. So distilling that information into a single value or proportion.

So, let's get to it. We're going to continue on with our state energy ranking data set from EIA. In order to exercise this, we're gonna have to make a new categorical variable. So the categorical variables that we have in that, so far we just have state, which isn't very interesting because we have a single state in each category. So let's make a more interesting categorical variable. And our categorical variable here is going to ask the question, produces natural gas or not. So, we'll make a new variable. We'll say df3 in square brackets and quote, we'll simply ask reduce produces NG for natural gas, question mark. So we'll expect yes or no response here. Yes, it produces. No, it doesn't. To make this happen we'll borrow a function from python called where. So the where function, and the way the where works is, we first give it an argument that returns a true or false, so some kind of Boolean statement as we used before. So, for example, we can say from pandas is null and we'll ask ourselves is this df3 natural gas column null or not? So it's going to go through value by value in this variable natural gas, and if the value is null it'll return whatever we have for the second argument here. So we'll put a no, it doesn't produce natural gas. However, if the value is null returns false, so it is not null, then that means that that state produces a natural gas. So that's the way this where function works. You always give it something that returns a true false, so a Boolean statement. If it's true then it's going to return whatever values in the second argument. If it's false it'll return whatever values in the third argument here.

So let's see this in action. We can just take a look at this variable real quick, see what we've gotten. And we see we have a bunch of yes' and a bunch of no's. So remember, these last 18 are no because they had null values for natural gas, but the upper ones all produce the natural gas. Let's take a look at the the data types for this producing natural gas object. So this is a text variable, not a categorical one yet. So let's course this into the categorical variable and then we can move on. And so, we just need to use our as type function here. And as type category and we'll take a look at our D types again and make sure that's worked. And it did. We've got a categorical variable there, and so we can move on.

So now let's make use of our frequency table to see how many states produce nitric acid and how many do not. So we'll make, we'll store this table in an object called freq, get our data frame, our new variable produces natural gas. We'll start with that encased in a square bracket. And the function we want to use here is value counts so it's a dot value_counts. And as the name of this function implies it's just going to count the values in this variable category by category. So let's take a look at what we get. So we'll print our frequency table and lo and behold, we see that we have those 18 states are a no, they do not produce natural gas, and the remaining 33 are a yes, they do. Recall that there's 51 states because we're including DC. Okay. Well that's great. Let's find a proportion from this.

So if we want to ask ourselves, well what proportion of states produce natural gas? Well, you could probably do this in your head or very easily with a calculator, or even just in the cell here. It'd be very easy to say, oh well, it's 18 divided by 33, plus 18, right. And you'll get the right answer. But this isn't really programming. So, if we have to ask ourselves if we were to update this data set, if we got this next year, and we wanted to just reuse this code, could we do so without having to go in and really edit a whole lot? And if we write it this way then the answer is no, we'd have to go in and manually adjust these values by inspections table, and that involves work.

So let's write this in code that will automatically calculate the proportion for us, even as values change. So, let's first find our numerator of the proportion. And so, we're going to reference our frequency table. So this frequencies table is just stored as a series. So it's just the values 33 and 18, indexed by yes and no. So if we want to access those states, the count of the states that do produce natural gas, we just need to say yes. So frequency square brackets yes. And that will return our value of 33, as we see here. The denominator, well we want to, our denominator needs to be all the states. So we need to get 33 plus 18. And there's many ways we could do this. We could say frequency of yes, plus frequency of no.

Something that's a little bit more general would be to just simply sum. We'll make use of the sum function in this frequency table. and this is more general and easier to use if we had, say, suppose we had five categories. We wouldn't want to necessarily take out all five categories. Or, if there were say 50 categories, you wouldn't want to type them all out. But regardless of the number of categories, if we do sum, then we'll get, we'll always get the total of the frequency table. And we'll get the total number of observations that we have. And we can see that here 33, 51 observations of states. Let's put this together. Well, our proportion here will be just our numerator divided by our denominator. And let's print this. So then, we'll print our proportion is equal to product. And there's a proportion 0.647, a whole bunch of digits. It's good practice to round this off. Let's round this to say three significant figures and redo this. There we go. So, 0.647 states produce natural gas. We could simplify this. So, here we did this in a whole bunch of different lines. but we could really just get this done in in one line. We print the proportion, we can just replace prop here with the input that we had for our numerator object divided by the sum of frequency. And there we go. We've got it done in just one line. And we'll get the same answer.

Or, if we want to turn this into a percentage, basically instead of a proportion, just convert this to have units of percent. We'll still have the same value, but we will now multiply this by 100, and also important to include percent afterwards. So, we'll do that. I don't like this little space after there. We can access the argument set or whatever we want to separate our values in the in the print function. And I'm gonna, the default is to have a space there. I'm just going to get rid of that. There we go. So that looks a little bit better, 64.7 percent of states produce natural gas. So there we have it. We've developed a frequency table, or the count in each category. This is a very simple example with just two categories but you know if we have lots of categories it'll work just the same. And we found a proportion using code here, and so as the values of the underlying data set change, this code will update accordingly. And we've got proportion reported in a bunch of different ways here. Okay. Thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

  Try It: DataCamp - Find a Frequency Table and Proportion from Roshambo Data

Using the examples above, see if you can find: 1) a frequency table of Rock, Paper, and Scissors played, and 2) the proportion of times that Rock was played:


  Assess It: Check Your Knowledge

Knowledge Check

Use the table below to answer the following questions.

Make Model Type City MPG
Audi A4 Sport 18
BMW X1 SUV 17
Chevy Tahoe SUV 10
Chevy Camaro Sport 13
Honda Odyssey Minivan 14

 FAQ