Lesson 2: Types of data and summarizing data

Overview

The following section will define different types of data and variables, and will introduce you to some fundamental usage of Pandas along the way. This will lead into summarizing variables into "statistics": proportions and the mean, primarily.

Learning Outcomes

By the end of this lesson, you should be able to:

Differentiate categorical vs. quantitative variables
Examine how Python interprets different columns of a dataset
Execute Python code to merge datasets
Calculate frequency tables and two-way tables
Determine a proportion from those tables
Estimate the center of a set of quantitative values in two ways: the mean and the median
Find the spread, or standard deviation, of some quantitative values

Lesson Roadmap

Type	Assignment	Location
To Read	Lock et al.: 1, 2.1	textbook
To Do	Complete Homework H02: Data Structure & Categorical Variables	Canvas
To Do	Take Lesson 2 Quiz	Canvas

Questions?

If you prefer to use email:

If you have any questions, please send a message through Canvas. We will check daily to respond. If your question is one that is relevant to the entire class, we may respond to the entire class rather than individually.

If you prefer to use the discussion forums:

If you have questions, please feel free to post them to the General Questions and Discussion forum in Canvas. While you are there, feel free to post your own responses if you, too, are able to help a classmate.

Variables, Data Types, and Other Important Terminology

Read It: Important Terminology

First, recall our table from Lesson 1:

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

* Data Source: Residential Energy Consumption Survey (RECS) [1], U.S. Energy Information Administration (accessed Nov. 15th, 2021)

You may already be familiar with talking about the parts of this table as rows and columns. In Data Analytics, one goal of preparing data for subsequent analysis is to get the data into a table (or DataFrame, if using Pandas), such that each row represents a case or unit, and each column represents a variable:

Rows: cases or units
Columns: variables

Let's further define these new terms:

Cases or units: the subjects or objects for which we have data

Variable: any characteristic that is recorded for the cases

So, in the example table above, the cases are individual homes (identified by HOME ID), and the variables are DIVISION (geographic region in the U.S.) and KWH (electricity usage).

Types of Variables

For this course, it is important to distinguish two different types of variables:

Categorical variable: divides the cases into groups, placing each case into exactly one of two or more categories

Quantitative variable: measures or records a numerical quantity for each case

This distinction is important because the type of a variable determines what sorts of analyses and methods can be applied to it. Following the example table above, DIVISION is a categorical variable and KWH is a quantitative variable (since it reports the quantity of energy used by each house).

Yet another way to distinguish variables is based on their purpose or role in the intended analysis, if the analysis is looking at relationships between two or more variables:

Response variable: the variable for which an understanding, inference, or prediction is desired

Explanatory variable: used to do the understanding, inference, or prediction of the response variable

So, for example, if we hypothesized that geographic location somehow affects electricity usage, then KWH would be the response variable and DIVISION would be the explanatory variable. In other words, we would want to see whether or not DIVISION explains how KWH responds. Furthermore, these roles can be swapped: if the hypothesis were that electricity usage can identify the region in which a home resides, then KWH would be the explanatory variable and DIVISION the response. In other words, KWH could, perhaps, explain the DIVISION.

Data Types in Python

There are many data types in Python and Pandas. Let's focus on some of the common ones that are also relevant for this course.

Any categorical variable should be represented by the data type:

category: which contains a finite list of text values

This is not to be confused with:

str or object: string (text; characters), which do not necessarily define categories

Quantitative variables will typically show up as:

int or int64: integers (whole numbers)

float or float64: floating-point number (decimal; continuous valued)

Note that categories could just as well be labeled with numbers as they could with text, so just because a variable has numbers in it does not necessarily mean that it is quantitative.

Lastly, one other Pandas data type that will come up in the course because it is especially important when analyzing energy generation and consumption is:

datetime64: a calendar date and/or time
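
As a rough sketch of how these data types show up in practice, the snippet below builds a small DataFrame from the first three rows of the Lesson 1 table and inspects it with dtypes. The ZONE_CODE and READ_DATE columns and their values are made up for illustration only (they are not part of the RECS table); they are there to show a category labeled with numbers and a date column.

```python
import pandas as pd

# First three rows of the Lesson 1 table, plus two hypothetical columns
# (ZONE_CODE and READ_DATE are invented for this illustration).
df = pd.DataFrame({
    "HOME ID": [10460, 10787, 11055],
    "DIVISION": ["Pacific", "East North Central", "Mountain North"],
    "KWH": [3491.900, 6195.942, 6976.000],
    "ZONE_CODE": [3, 1, 2],                                   # numbers used as labels
    "READ_DATE": ["2021-01-31", "2021-02-28", "2021-03-31"],  # dates stored as text
})

# Whole numbers come in as integers, decimals as floats, and text as strings.
print(df.dtypes)

# Numbers used as labels are still categorical, and date text can be parsed.
df["ZONE_CODE"] = df["ZONE_CODE"].astype("category")
df["READ_DATE"] = pd.to_datetime(df["READ_DATE"])

print(df.dtypes)
```

Note that ZONE_CODE starts out as an integer column even though its values are only labels, which is why the variable type has to be decided by the analyst rather than read off the dtype.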

Watch It: Video - Investigating and Naming Columns in Datasets (13:43 minutes)

Click here for a transcript.

Hi, in the next few videos we're going to cover some data processing and data manipulation code in Python starting in this video looking at data types and doing some other useful procedures in Python. So, let's get to it. To exemplify these procedures, we're going to work with data from the Energy Information Administration. In particular, we're going to look at state rankings. So, we'll go to their website that's linked here. So, the U.S Energy Information Administration publishes a wealth of data online. What we're going to take a look at is under this production section here. We're, in particular, we're going to look at two data sets, two CSV files, one total energy, one natural gas. And these are quantities of energy and natural gas produced by states and they also rank them. I encourage you to go on and explore other data sets published by the Energy Information Administration, but for now we're just going to focus on these two for simplicity's sake. It's also useful to preview the data in spreadsheet format before we start playing around with it in Python.

So, let's start with the natural gas data set. So, here's natural gas production ranked by state, one down through 33, 34, starting with Texas on down. Here, the quantities of natural gas. We also have this blank column here, note: rankings are blah blah blah blah blah. Interesting. Turning to total energy production, also ranked by state, going down to 51. So, here we have all the states, including DC, and we also have this curious blank note column here. Okay, so let's dive into the Python code. I've already mounted the Google Drive and imported the data sets following procedures that we've seen in the previous lesson. I've also already run these. We've got our natural gas data set stored as NGdf, df for data frame, and we've got total energy stored as TEdf. Now if we want to just see what columns we have, a useful function here is the dot columns function. When we run this, we just get the column names. This is as opposed to, for example, going back up here and saying NGdf and getting the whole entire data frame, just to scroll up and see what columns we have. So, this is a more concise way to have it. Furthermore, we can do this for the total energy data frame. Let's see what columns we have there. So, both have rank, both have state, the natural gas has natural gas marketed production, in units of million cubic feet. Total energy is total energy production in trillions of BTU. And then both also have this note column, which is not an actual variable. Furthermore, if we wanted to store these we could say, for example, make a new object called NGcolumns, which would be equal to NGdf dot columns, and then I'm going to add something else here. I'm going to add dot tolist. This will just convert all these names into an object called a list. So, this could be useful if I wanted to store these names and reference them later on.

Okay, another useful tool is to remove columns. So, for example, we have this weird note column. That isn't actually a variable. It doesn't contain any data. Let's get rid of that. So, starting with the natural gas data frame, NGdf, we'll say NGdf dot drop. So, drop is the function that's going to get rid of whatever column or columns we specify here. So, I'm just going to copy and paste the column name from here, make sure we drop this, and furthermore at the end of this I'm going to run NGdf dot columns again just to verify that we've gotten rid of that column. And indeed, we have. We no longer have this note column. Let's just repeat this real quick for the total energy data frame. This has a slightly differently named note column, so we'll copy and paste there and run that as well, and we see that that has been removed. Yet another useful function is to rename columns. This is particularly useful if we've got some overly verbose column names. For example, total energy production, trillion BTU. I don't want to have to type that over and over and over again. Let's call that, as well as this natural gas marketed production, something else. So, the syntax for using this rename function is stated generically above here, but let's put this into use. We'll again alter our NGdf data frame. So, we're going to perform some function on the right-hand side and store it as what we've previously been calling it. So, overwrite it. We'll say NGdf dot rename, rename being the function used here, and again we're going to specify some columns. But now we need to put in a dictionary, so curly brackets. And we're going to give the existing name, so we want to change this name right here, this natural gas marketed production, we'll put that in there and we'll put in a colon. And after the colon we're going to put what we want to change it to. Let's call it Natural Gas. And again, we can use our columns function to verify that that change has happened. Indeed, it has.
Let's go ahead and repeat this also for our total energy. Let's rename these object names, and instead of natural gas we have total energy that we want to change. Stick that in there, and we'll call this Total Energy and run this. And again, we verify that that change has happened.

Okay, so that covers some handy tools to manipulate the data and get it into something that's more manageable. Let's turn our attention to data types now. Oftentimes after we import data, we need to convert some of the data types, because in the import process Python has not automatically recognized some variables as what they ought to be. So, let's start by finding what data types we currently have. We'll start with our NG data frame, and the key function here is dtypes, d standing for data, types standing for types. Go ahead and run this. Here we go. So, this rank variable is int64, int standing for integer, the state variable is an object. Object is another name for text. Natural gas, also text. Let's do the same for our total energy. Again, rank is an integer. That's what we expect. State, an object, AKA text, as expected. Total energy, also an integer. Hmm, that's interesting. As compared to the natural gas data set, where natural gas is an object. And really, we don't want that to be an object. That should be a number. Right? It shouldn't be text. We would prefer that this be stored as a quantitative variable. So how do we make that happen? How do we convert this to a quantitative variable? Well, the first thing we need to do is recognize that this natural gas data set, if we go up and inspect it, has a number of dashes in the natural gas column here. This poses a problem: if we try to convert that column to integers or floats, we'll get an error. So, we need to replace those dashes first. And what we're going to do is replace them with something called NaN, which stands for not a number, or no value. And we see this above: if we scroll back up again, we see that NaN values have been inserted into this strange note column. This is very important. So, these values are basically placeholders saying that that cell was blank.
And so, instead of just leaving it blank, it actually has something there to make it more obvious that it's blank. Really important is that zeros were not automatically stored there. You need to bear in mind that zero is an actual value. If zeros were here, that would imply that there was a value of zero, that we knew that there was a zero quantity there. But in fact, these are blank. And so, we need to note that with NaN.

So, let's go back down and substitute these dashes with NaN. Now a handy function for doing this is this replace function. So, we're going to replace whatever the first argument is here. So, dash dash, that's the thing we want to replace. And we want to replace this with NaN. Now we can't just write n a n like that. NaN is special. And to make this special quantity appear appropriately we need to use the numpy package. So, we'll import numpy as np and we need to call np.nan. So, this is clearly telling Python that we've got these special NaN values here. Let's take a look and verify that this worked. So, reprinting this data frame, we scroll down, and indeed where we had those dashes, we now have NaN. So, now we can move on to converting data types. So, we will take our NG data frame here; we want to just convert the natural gas column. So, we're going to just reference this column. We do that by taking our data frame, putting square brackets, and then within quotes we've got the variable name there. And we can say .astype "float64". That'll convert it to a quantity. Right, we can check and make sure that that worked. We can say NGdf dot dtypes. And indeed, natural gas is now a float. That's great.

Another thing we can do is convert, say for example, state to a categorical variable. So, NGdf state. And we'll use the same syntax here, so astype, except instead of float we'll say category. And we will repeat this for the total energy data frame as well. We can verify that that worked, NGdf dot dtypes. We see that state's now a category, and indeed for total energy we will see the same thing, category. Okay, so that's a bit of data processing and investigating and changing data types. Thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]
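
The steps walked through in the video can be sketched roughly as follows. The small DataFrame here is a hypothetical stand-in for the natural gas file: the long column name, the note column name, and all values are assumptions made for illustration, not the actual EIA data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the natural gas file; column names and values
# are assumptions for illustration, not the actual EIA data.
NGdf = pd.DataFrame({
    "Rank": [1, 2, 3],
    "State": ["TX", "PA", "Hawaii"],
    "Natural Gas Marketed Production, million cu ft": ["100", "50", "--"],
    "Note": [np.nan, np.nan, np.nan],
})

NGcolumns = NGdf.columns.tolist()      # store the column names as a list

NGdf = NGdf.drop(columns=["Note"])     # remove the empty note column
NGdf = NGdf.rename(
    columns={"Natural Gas Marketed Production, million cu ft": "Natural Gas"}
)

# Dashes mark missing values; swap them for NaN before converting to numbers.
NGdf["Natural Gas"] = NGdf["Natural Gas"].replace("--", np.nan)
NGdf["Natural Gas"] = NGdf["Natural Gas"].astype("float64")
NGdf["State"] = NGdf["State"].astype("category")

print(NGdf.dtypes)
```

Each step overwrites NGdf with its own result, which mirrors the "store it as what we've previously been calling it" pattern described in the transcript.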

Try It: DataCamp: Identify Data Types

The Pandas attribute for identifying the data types in a DataFrame is dtypes (or dtype for a single column). See it in action here:

We can see that house and state are both object (text), whereas monthly energy usage is int (numerical). We've previously identified state as a categorical variable, but it is not showing up as such in the code output above. In order to tell Python to recognize state as a categorical variable, we need to add another line of code:

The astype function can be used to convert to other data types as well (for example, from int to float).
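
The interactive exercise is not reproduced here, so the following is only a minimal sketch of the same idea. The column names (house, state, monthly_energy_kwh) and the three rows of values are hypothetical.

```python
import pandas as pd

# Hypothetical three-house table; names and values are illustrative assumptions.
homes = pd.DataFrame({
    "house": ["A", "B", "C"],
    "state": ["PA", "TX", "PA"],
    "monthly_energy_kwh": [850, 1200, 640],
})

print(homes.dtypes)  # two text columns and one integer column

# Mark state as categorical and store energy use as a decimal number.
homes["state"] = homes["state"].astype("category")
homes["monthly_energy_kwh"] = homes["monthly_energy_kwh"].astype("float64")

print(homes.dtypes)
print(homes["state"].cat.categories)  # the finite list of category labels
```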

A Note on Data Collection

Now, although the table above is good, in its simplicity, for illustrating these important terms and types of variables, it is a bad dataset for investigating the sort of relationship hypothesized above. This is because the sample size is too small; we've only sampled three houses out of a tremendously larger population. Therefore, we do not expect to have adequate representation of the whole population of houses. We will talk more about sample size in the coming lessons, as this is a very important aspect of conducting experiments and surveys ("Data Collection"). Usually, larger sample sizes incur larger costs. However, you need a sufficiently large sample size to ensure that you have adequate representation. Besides having a large enough sample size, it is also important to have a sample of data that is unbiased. In other words, we do not want our data to disproportionately represent one group in the population, which is called "sampling bias". There are other forms of bias, but let's limit our discussion to just sampling bias for now. The main way to combat this form of bias is to collect a random sample.

Random sample: each case in the population has an equal chance of being selected in a sample of size n cases. The goal is to avoid sampling bias.
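
As a minimal sketch of drawing a random sample in Pandas, the snippet below treats the ten HOME ID values from the Lesson 1 table as if they were the whole population (an assumption made purely for illustration) and draws n = 3 of them with the sample function.

```python
import pandas as pd

# HOME IDs from the Lesson 1 table, treated here as the full population
# purely for illustration.
population = pd.DataFrame({
    "HOME ID": [10460, 10787, 11055, 14870, 12200,
                12228, 10934, 10731, 13623, 12524],
})

# Draw n = 3 cases without replacement; every home is equally likely to be chosen.
# random_state fixes the seed so the same draw can be repeated.
sample = population.sample(n=3, random_state=42)
print(sample)
```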

Therefore, in examining a dataset like that which is in the table above, it is important to know how the data were collected.

Assess It: Check Your Knowledge

Knowledge Check

Use the table below to answer the following questions.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14

Data Manipulation: Merging DataFrames

Read It: Merging DataFrames

Often, one can perform more interesting and meaningful analyses by using larger datasets, with more variables and/or cases. One way of creating new, sizeable DataFrames is by merging multiple smaller DataFrames. The merge function in Pandas makes the task of merging two DataFrames (let's call them LeftDataFrame and RightDataFrame) relatively easy:

`LeftDataFrame.merge(RightDataFrame, how='inner', on=key)`

Here, we focus on just the main inputs to the merge function; the documentation [3] describes the other optional inputs. To start, "key" is a variable (or set of variables) common to both DataFrames that we want to use as the basis for merging them. Usually, "key" identifies the unique cases in each DataFrame. The argument "how" indicates which rows to keep from the DataFrames. To explain this further, let's examine each of the four main methods for "how":

1.) how = 'inner'

To begin, consider both DataFrames as a Venn diagram, where each circle represents the values of the "key" variable contained within that DataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "Inner." Results described in text below.

Merging Datasets Figure 1

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]

The 'inner' merge method will only keep the key values that are common to both DataFrames, or, in other words, the intersection of the two circles. The following cartoon depicts how this method would work on two example DataFrames:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, and 3 and Var_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows only keys 1 and 2 and Vars x1, x2 and y1, y2. Results described in text below.

Merging Datasets Figure 2

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]

Note that key values of 3 and 4 are dropped because 3 is not contained in RightDataFrame and 4 is not contained in LeftDataFrame. Only 1 and 2 are retained since they are common to both DataFrames. Further, note that the resultant DataFrame contains all variables (columns) from each of the two input DataFrames (and "key" is not repeated).
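
The inner merge described above can be sketched with the two example DataFrames from the cartoon. The column name "key" is taken from the generic syntax shown earlier; the values x1, y1, and so on are placeholders.

```python
import pandas as pd

# The two example DataFrames from the cartoon.
LeftDataFrame = pd.DataFrame({"key": [1, 2, 3], "Var_X": ["x1", "x2", "x3"]})
RightDataFrame = pd.DataFrame({"key": [1, 2, 4], "Var_Y": ["y1", "y2", "y3"]})

# Keep only the keys found in both DataFrames.
inner = LeftDataFrame.merge(RightDataFrame, how="inner", on="key")
print(inner)
```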

2.) how = 'outer'

In contrast to the 'inner' method, 'outer' keeps all the key values from both DataFrames, or, in other words, the union of the two circles in the Venn diagram:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "Outer." Results described in text above.

Merging Datasets Figure 3

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]

With 'outer' applied, our cartoon with the two example DataFrames now looks like this:

Merging Datasets Figure 4

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]

Now key values of 3 and 4 are retained, and NA values are inserted where either Var_X or Var_Y are not observed.

As a special use case, if one wishes to simply "stack" two DataFrames (say, with identical variable names but different rows/keys), one can use 'outer' and leave the on argument blank (it then defaults to the columns the two DataFrames have in common, which here is all of them).
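
A short sketch of both uses of 'outer', again with the placeholder DataFrames from the cartoon. The DataFrames named top and bottom are hypothetical and exist only to show the stacking case.

```python
import pandas as pd

LeftDataFrame = pd.DataFrame({"key": [1, 2, 3], "Var_X": ["x1", "x2", "x3"]})
RightDataFrame = pd.DataFrame({"key": [1, 2, 4], "Var_Y": ["y1", "y2", "y3"]})

# Union of the keys; missing Var_X / Var_Y entries are filled with NaN.
outer = LeftDataFrame.merge(RightDataFrame, how="outer", on="key")
print(outer)

# "Stacking": same column names, different rows, and no `on` argument.
top = pd.DataFrame({"key": [1, 2], "Var_X": ["x1", "x2"]})
bottom = pd.DataFrame({"key": [3, 4], "Var_X": ["x3", "x4"]})
stacked = top.merge(bottom, how="outer")
print(stacked)
```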

3.) how = 'left'

The 'left' method keeps all the rows with keys from the LeftDataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "left." Results described in text above.

Merging Datasets Figure 5

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]

And inserts NA values where variables from the RightDataFrame are not observed:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, 3 and VAR_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows keys 1, 2, 3 and Var_X x1, x2, x3 and Var_Y y1, y2, NA. Results described in text above.

Merging Datasets Figure 6

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]

4.) how = 'right'

In contrast, the 'right' method keeps all the rows with keys from the RightDataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "right." Results described in text above.

Merging Datasets Figure 7

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]

And inserts NA values where variables from the LeftDataFrame are not observed:

Merging Datasets Figure 8

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [2]
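
The 'left' and 'right' methods can be compared side by side with the same placeholder DataFrames used in the cartoons:

```python
import pandas as pd

LeftDataFrame = pd.DataFrame({"key": [1, 2, 3], "Var_X": ["x1", "x2", "x3"]})
RightDataFrame = pd.DataFrame({"key": [1, 2, 4], "Var_Y": ["y1", "y2", "y3"]})

# Keep every key from LeftDataFrame; Var_Y is NaN where key 3 has no match.
left = LeftDataFrame.merge(RightDataFrame, how="left", on="key")
print(left)

# Keep every key from RightDataFrame; Var_X is NaN where key 4 has no match.
right = LeftDataFrame.merge(RightDataFrame, how="right", on="key")
print(right)
```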

Watch It: Video - Datatypes in Python (13:43 minutes)

Click here for a transcript.

Hi. In this video we're going to cover merging data frames. So, we're going to continue with our state energy ranking data from the Energy Information Administration. In this video, we're going to merge the natural gas data frame, ngdf, with the total energy data frame, tedf. Let's exemplify this first with the inner merge. This is the default method. And we'll do this by creating a new data frame, which we'll call df, generically. We say pd, to reference pandas, dot merge, and these two objects: ngdf, tedf. We want to merge the natural gas with the total energy. We also need to specify on which variable to merge. So, let's do this by state; that is, it's going to match Texas in natural gas with Texas in total energy, Pennsylvania in the first with Pennsylvania in the second, and so on, and so forth. So it's going to pair them off on the basis of state. Let's see what this looks like. Here we go. So we've got first the rankings. We've got rank underscore x. That's interesting. What does that mean? Well, these are the rankings from the natural gas data frame. That's our first data frame. So it gives it the subscript x, as opposed to rank y. So this is the rank from the second one. Note that these two variables were called the same thing. So it automatically appended the x and the y to distinguish the two. Then we have state. This was our key variable, the one we're performing the merge on. So we have all the states here, and then we've got our natural gas and our total energy. And now this miraculous function has automatically paired the values according to what state they belong to.

However, there's a problem with this. If we scroll down, we don't have all the states. This stops at 33. Note that the indexes for this start at zero. So we've got 33 states here. There really ought to be 51: 50 states, plus DC. So what happened? Why'd it get rid of those? Well, it only kept the states that were equal in natural gas and total energy. Recall that in the natural gas data frame we had some states that had their full names in there. Let's try this a different way. Let's try it with the outer join. So we'll say df again, pd for pandas, dot merge. And again, we'll have our ngdf, our tedf, on equals state. And now we'll say how is outer. So we're going to do an outer merge. Let's see what result we get here. So we still have our rank x, our rank y, that all looks the same. But now we've got down to 69 states. Well, that's strange. That's too many. I'll click on this thing to just get a better inspection of the output here. What's going on? If we scroll down, we need to display some more values here. We see that we've got these states by name here that don't have any data, no natural gas production, no total energy production. And then we've got their corresponding abbreviations. So how did this happen? Well, the natural gas production data set had these states listed out by their full names because they didn't produce any natural gas, so it listed them differently for whatever reason. Whereas the total energy data set does have data for these states and uses their two-letter abbreviations. And the merge function can't associate these two things because it's just not that intelligent. It doesn't recognize that Hawaii is abbreviated HI. We would need to tell it that. So, what do we do? Well, this will be for the next video, logic and subsetting. We'll solve that problem there. For now, that's it for merging.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]
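
The behavior seen in the video (the _x and _y endings on the shared Rank column, and the mismatch between full state names and abbreviations) can be reproduced in miniature. The two small DataFrames below are hypothetical stand-ins with made-up values; the indicator=True argument is an addition not used in the video, included here because it labels where each row came from.

```python
import pandas as pd

# Hypothetical miniature versions of the two ranking tables; values are made up.
ngdf = pd.DataFrame({"Rank": [1, 2, 3],
                     "State": ["TX", "PA", "Hawaii"],
                     "Natural Gas": [100.0, 50.0, None]})
tedf = pd.DataFrame({"Rank": [1, 2, 3],
                     "State": ["TX", "PA", "HI"],
                     "Total Energy": [900, 400, 10]})

# how='inner' is the default, so only states spelled identically are kept.
df_inner = pd.merge(ngdf, tedf, on="State")
print(df_inner.columns.tolist())  # Rank_x and Rank_y mark the source of each Rank

# The outer merge keeps everything; 'Hawaii' and 'HI' land on separate rows.
df_outer = pd.merge(ngdf, tedf, on="State", how="outer", indicator=True)
print(df_outer[["State", "_merge"]])
```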

Try It: DataCamp - Merge Two DataFrames

In the exercise below, you are given two DataFrames:

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

HOME ID	CLIMATE	TOTAL SQUARE FEET
13623	Cold/Very Cold	1881
11055	Cold/Very Cold	2455
10460	Marine	800
12200	Hot-Dry/Mixed-Dry	3800
12524	Marine	1110
10332	Hot-Dry/Mixed-Dry	1630
12106	Cold/Very Cold	5184
11718	Mixed-Humid	1812
14500	Marine	2195
13193	Marine	847

* Data Source: Residential Energy Consumption Survey (RECS) [1], U.S. Energy Information Administration (accessed Nov. 15th, 2021)

Note that these two DataFrames share some of the same cases, identified by "HOME ID" (a numeric identifier unique to each home). In the exercise below, start by seeing what happens when you perform an 'inner' merge:

How does the resulting DataFrame change when you perform an 'outer' merge? A 'left' merge? A 'right' merge? Pay attention to which rows are kept and where NA values are inserted.
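
Since the interactive exercise is not reproduced here, the following sketch rebuilds the two tables above and runs all four merge methods. The variable names energy, homes, and results are assumptions; the data are copied from the tables.

```python
import pandas as pd

energy = pd.DataFrame({
    "HOME ID": [10460, 10787, 11055, 14870, 12200,
                12228, 10934, 10731, 13623, 12524],
    "DIVISION": ["Pacific", "East North Central", "Mountain North", "Pacific",
                 "Mountain South", "South Atlantic", "East South Central",
                 "Middle Atlantic", "East North Central", "Pacific"],
    "KWH": [3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
            23645.160, 19123.754, 3982.231, 9457.710, 15199.859],
})
homes = pd.DataFrame({
    "HOME ID": [13623, 11055, 10460, 12200, 12524,
                10332, 12106, 11718, 14500, 13193],
    "CLIMATE": ["Cold/Very Cold", "Cold/Very Cold", "Marine",
                "Hot-Dry/Mixed-Dry", "Marine", "Hot-Dry/Mixed-Dry",
                "Cold/Very Cold", "Mixed-Humid", "Marine", "Marine"],
    "TOTAL SQUARE FEET": [1881, 2455, 800, 3800, 1110,
                          1630, 5184, 1812, 2195, 847],
})

# Run each merge method and compare how many rows survive.
results = {how: energy.merge(homes, how=how, on="HOME ID")
           for how in ["inner", "outer", "left", "right"]}
for how, merged in results.items():
    print(how, merged.shape)
```

Five homes appear in both tables, so the row counts differ by method: the inner merge keeps only those five, while the outer merge keeps all fifteen distinct homes.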

Data Manipulation: Logic and Subsetting

The last section discussed how to build a new larger DataFrame by merging two DataFrames. When a large DataFrame has too much unnecessary data for the intended analysis, it is useful to subset it to a smaller DataFrame that only contains the relevant data. In this section, we'll discuss two common forms of subsetting: 1.) subsetting observations (rows), and 2.) subsetting variables (columns)

Read It: Subsetting

0.) Logic

Before we actually introduce how to subset a DataFrame, we need to first cover logic in Python. Logical statements in Python are also called "Boolean operators", and that is because the result of a logical statement is a Boolean variable, or a binary variable containing "True" (the logical condition is met) or "False" (the logical condition is not met). Forms of logic in Python are listed in the table below:

Code	Meaning
`<`	Less than
`>`	Greater than
`==`	Equals
`!=`	Does not equal
`<=`	Less than or equal to
`>=`	Greater than or equal to
`pd.isnull(obj)`	is NaN
`pd.notnull(obj)`	is not NaN
`&`	and
`|`	or
`~`	not
`df.any()`	any
`df.all()`	all

Try it: Logic

Note that the logical statement returns a vector of the same length as x, with a True in the position where the condition is met and a False in positions where the condition isn't met. Try out some of the other logical statements in the table above. How does the output change?

Since & and | combine two logical statements, the logical statements that are being combined need to be enclosed in parentheses. For example, try print( (x > 3) & (x < 5) ) in the code above.
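
A minimal sketch of these logical statements follows. The vector x stands in for the one used in the exercise, and its values are assumed. Note that in Pandas the "or" operator is written with the vertical bar, |.

```python
import numpy as np
import pandas as pd

# x stands in for the vector used in the exercise; its values are assumed.
x = pd.Series([1, 2, 3, 4, 5, np.nan])

print(x > 3)                 # True where the condition is met
print((x > 3) & (x < 5))     # and: both conditions must hold
print((x < 2) | (x > 4))     # or: at least one condition must hold
print(~(x > 3))              # not: flips True and False
print(pd.isnull(x))          # True only for the NaN entry
print((x > 3).any(), (x > 3).all())
```

Comparisons against NaN evaluate to False, which is why the last entry of x > 3 is False rather than missing.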

1.) Subsetting observations (rows)

These logical statements can then be used to subset DataFrames by selecting and returning only those rows that meet the conditions (yield True). This sort of filtering works by replacing x in the code above with a reference to a variable in the DataFrame, and enclosing the logical statement in [] immediately after the name of the DataFrame. The example below uses the table of household energy use from previous sections:

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

Try it: Subsetting

Suppose we only want to retain those houses (rows) that use more than 10,000 KWH. We can perform the following:

Another way to accomplish this filtering would be to do df.query('KWH > 10000'). Try out some other logical statements in the code above and see how the resulting DataFrame is subset. Note that Division is a categorical variable whose values are strings (text). Make sure that, when referring to these values, you put them in quotes: df[ df.Division == 'East North Central' ].
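
The row filtering described above can be sketched as follows, using the column names exactly as they appear in the table (so DIVISION is written in capitals here). The variable names high_use, same_rows, and enc are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "HOME ID": [10460, 10787, 11055, 14870, 12200,
                12228, 10934, 10731, 13623, 12524],
    "DIVISION": ["Pacific", "East North Central", "Mountain North", "Pacific",
                 "Mountain South", "South Atlantic", "East South Central",
                 "Middle Atlantic", "East North Central", "Pacific"],
    "KWH": [3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
            23645.160, 19123.754, 3982.231, 9457.710, 15199.859],
})

high_use = df[df["KWH"] > 10000]        # logical statement inside []
same_rows = df.query("KWH > 10000")     # equivalent filter written as a query
enc = df[df["DIVISION"] == "East North Central"]  # text values go in quotes

print(high_use)
print(enc)
```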

Some other useful functions for subsetting by rows include:

Code	Meaning
`df.head(n)`	Select first n rows (useful for understanding contents of a DataFrame).
`df.tail(n)`	Select last n rows.
`df.nlargest(n, 'variable')`	Select rows containing the largest n values in variable.
`df.nsmallest(n, 'variable')`	Select rows containing the smallest n values in variable.
`df.drop_duplicates()`	Remove rows with identical values.
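
A brief sketch of the functions in the table above, applied to the HOME ID and KWH columns of the household table. The doubled DataFrame is constructed here only so that drop_duplicates has something to remove.

```python
import pandas as pd

df = pd.DataFrame({
    "HOME ID": [10460, 10787, 11055, 14870, 12200,
                12228, 10934, 10731, 13623, 12524],
    "KWH": [3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
            23645.160, 19123.754, 3982.231, 9457.710, 15199.859],
})

print(df.head(3))                  # first three rows
print(df.tail(3))                  # last three rows

top2 = df.nlargest(2, "KWH")       # the two highest-usage homes
bottom2 = df.nsmallest(2, "KWH")   # the two lowest-usage homes
print(top2)
print(bottom2)

# Deliberately repeat two rows, then remove the repeats.
doubled = pd.concat([df, df.head(2)])
deduped = doubled.drop_duplicates()
print(len(doubled), len(deduped))
```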

2.) Subsetting variables (columns)

Unlike rows which can be identified by the values they contain, variables (columns) in a DataFrame are referred to by name, and these names should be unique. We've actually already seen subsetting by variable in action above, in the form of:

df.KWH

which can alternatively be written as

df['KWH']

Multiple columns can be selected by including a list in the square brackets: df[ ['HOME ID', 'KWH'] ].

As a side note, df.columns may be useful for finding the column names.
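
The column selections above can be sketched with the first three rows of the household table. One caveat worth seeing in code: the dot form only works for names without spaces, so a column such as HOME ID has to be selected with square brackets.

```python
import pandas as pd

df = pd.DataFrame({
    "HOME ID": [10460, 10787, 11055],
    "DIVISION": ["Pacific", "East North Central", "Mountain North"],
    "KWH": [3491.900, 6195.942, 6976.000],
})

kwh_attr = df.KWH                   # dot form: names without spaces only
kwh_item = df["KWH"]                # bracket form: works for any name
two_cols = df[["HOME ID", "KWH"]]   # a list of names returns a DataFrame

print(df.columns.tolist())
print(two_cols)
```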

Watch It: Video - Roshambo (7:45 minutes)

Click here for a transcript.

All right. In this video we're going to continue on where we left off after merging our two state energy ranking data sets, the natural gas and total energy. And we're going to solve this issue of these repeated rows for different states through logical statements and subsetting. So, let's get to it. So the issue that we're having here is that we've got 18 states who have their full names here that don't have any data because they didn't produce natural gas and they weren't in the total energy data set. And we've got their two-letter abbreviations coming from the total energy data set because they have the total energy produced data. So, what we want to do is get rid of basically all these rows that have the full names. They don't have any data attached to them. How do we do this? Well, we can do this using logical statements and filtering or subsetting our data frame. So, let's make a new data frame, df2. And we'll start with df, this is the output of the outer merge done above in the previous video, and we're going to make use of our pd notnull function here. Because what we want to do is basically keep all the rows that either do not have NaN in the natural gas column, or do not have NaN in the total energy column, or do not have NaN in either column. In other words, we want to get rid of those rows that have NaN in both columns. So we will reference our natural gas column here, calling the data frame after the merge. So df square brackets natural gas. And let's just see how this notnull function works on the natural gas column alone. So let's go ahead and run this. And we'll print df2. We see we have our columns there, but we only have 33 states represented here.

So these are our 33 states that produce natural gas from the natural gas data set. So we want to also include those states that had some total energy production even though they didn't have natural gas. So how do we do this? We say we want to keep those rows that have values in the natural gas column, or those rows that have values, or notnull, in the total energy column. So, let's go ahead and run this. So we run this again. We see the same columns. We scroll down and we now have all 50 states plus DC. I know this goes to 68 but that's just because the index, the indices here, have been broken. We've gotten rid of some of those rows in the middle that had some indices. If we want to reset that, we can just say df2 dot reset_index, and we'll get our sequential numbering back. So we see now that these indices go down to 50. They start at zero, go down to 50, so there's 51. Again 50 states plus DC.

Furthermore, you'll note that we have our old indices as the second column here. We might want to get rid of that. We also might want to get rid of some other things here. For instance, these rankings, we don't really need them; we're just interested in these quantities. So if we want to filter by columns, we have some handy tools at our disposal here. So we can access values using, in particular, this loc function for getting multiple columns or rows at the same time. So, for example, if we want to select one, or more than one column, let's make a new data frame, df3, now from our df dot loc. And we'll use colon to say all rows. So we're going to get all rows and then, separated by a comma here, we'll put in whatever columns we want. So if we only want to keep, say for example, state and total energy and natural gas, we can enter them as a list. So we can enclose them in square brackets, and separate these quoted values, text values, with commas here. df3. Let's see what we get. So, here's our data frame: state, total energy, natural gas. Note that these columns are now in the order that I specified. So before we had natural gas coming before energy and I've switched that around just by the order that I've stated their names in here. And we've effectively gotten rid of those rankings and that old index here. So, now we have a much cleaner data set.

Furthermore, we can, for example, with this loc function select one or more rows, conversely. We're not going to make use of this later on, so we'll just show, for example, df3.loc where we want, say, the first five rows, so we give it something like that, which specifies the rows. We need to say what columns we want. If we want all columns, we can do that with the colon. And there we go, we have the first five rows. Note you can mix this up, you can do a combination of selecting different rows and different columns too if we wanted to, but I think you get the idea here. And then, last but not least, I'd like to exemplify how to get, or easily display, I should say, the top energy producers here. So say, for example, we want the top five total energy producers. We've got this handy function nlargest, where we say nlargest, five for the first five, and then whatever variable we want the largest values from. And so here we see that the largest total energy producers are, number one, Texas, number two, Pennsylvania, number three, Wyoming, then Oklahoma, and West Virginia, on down. So those are some handy tools to get started with processing data, changing data types, and in this video, subsetting the data into a clean data product. Okay, thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]
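
The subsetting steps from the video can be sketched on a tiny stand-in table. The values and the column names "State", "Natural Gas", and "Total Energy" are assumptions made up for illustration, not the actual EIA data. Unlike the video, this sketch passes drop=True to reset_index so that the old index is discarded rather than kept as a column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged table (values are made up)
df = pd.DataFrame({
    "State": ["Texas", "TX", "Vermont", "VT", "PA"],
    "Natural Gas": [np.nan, 9000.0, np.nan, np.nan, 7000.0],
    "Total Energy": [np.nan, 20000.0, np.nan, 30.0, 9000.0],
})

# Keep rows that have a value in at least one of the two data columns
keep = pd.notnull(df["Natural Gas"]) | pd.notnull(df["Total Energy"])
df2 = df[keep].reset_index(drop=True)  # drop=True discards the old index

# All rows (:), chosen columns, in the order listed
df3 = df2.loc[:, ["State", "Total Energy", "Natural Gas"]]

first_two = df3.loc[0:1, :]             # .loc slices include both end labels
top2 = df3.nlargest(2, "Total Energy")  # rows with the 2 largest values
print(top2)
```

The | operator is an element-wise "or", so a row is dropped only when both columns are NaN.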

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Roshambo: Frequency Tables and Proportions

Read It: Roshambo

Roshambo (a.k.a., Rock, Paper, Scissors) is a tried-and-true method for settling disputes. In this game, two opponents synchronously state "Rock... Paper... Scissors... Shoot!", and on "Shoot!", each has to display one of three choices with their hand: Rock, Paper, or Scissors. The outcome is determined by:

		Player 1
		Rock	Paper	Scissors
Player 2	Rock	Tie, play again	Player 1 wins	Player 2 wins
	Paper	Player 2 wins	Tie, play again	Player 1 wins
	Scissors	Player 1 wins	Player 2 wins	Tie, play again

Try It: Roshambo

Your response given in the form above will be randomly paired with another response, and the outcome will be stored in a dataset, which is to be used in the coding exercise further below. But first, let's learn about frequency tables and proportions.

Learn It: Frequency Tables

Frequency tables list how many times each category appears within a categorical variable. These tables plainly show which categories are the most common, and which are rarely observed. The Pandas function that we will use to make a frequency table is "value_counts". Run the DataCamp code below to see how it works:

It is important to remember that "value_counts" works on a Series, and not a DataFrame. We pass "value_counts" a column from a DataFrame in the example above.
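
A minimal sketch of this call, using made-up plays rather than the class dataset (the DataFrame name games and the column name "play" are assumptions):

```python
import pandas as pd

# Made-up Roshambo plays, for illustration only
games = pd.DataFrame({
    "play": ["Rock", "Paper", "Rock", "Scissors", "Rock", "Paper"],
})

# value_counts is called on a Series, i.e., on one column of the DataFrame
freq = games["play"].value_counts()
print(freq)
```

The result is itself a Series, indexed by category and sorted from the most common category to the least common.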

Learn It: Proportions

A proportion, $p$, is the fraction of cases in a population that have a certain value or meet some criteria. From a data sample, the statistic is:

$\hat{p} = \frac{\text{number of cases that meet criteria}}{\text{total number of cases}}$

It is easy to calculate a sample proportion from a categorical variable using the output from "value_counts".
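
As a small sketch, assuming a made-up categorical variable named fuel, the numerator comes from indexing the frequency table by category and the denominator from summing it:

```python
import pandas as pd

# Made-up categorical variable, for illustration only
fuel = pd.Series(["Gas", "Electric", "Gas", "Gas", "Electric"])

freq = fuel.value_counts()
p_hat = freq["Gas"] / freq.sum()  # cases that meet the criterion / all cases
print(round(p_hat, 3))

# value_counts can also return the proportions directly
print(fuel.value_counts(normalize=True))
```

Using freq.sum() for the denominator, rather than typing the counts in by hand, keeps the code valid if the data change.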

Watch It: Video - Importing Files (9:16 minutes)

Click here for a transcript.

All right, in this video we're going to cover summarizing categorical data. And in doing this, we're going to cover frequency tables, that is, counting the values within each category in a categorical variable and proportions. So distilling that information into a single value or proportion.

So, let's get to it. We're going to continue on with our state energy ranking data set from EIA. In order to exercise this, we're gonna have to make a new categorical variable. So the categorical variables that we have in that, so far we just have state, which isn't very interesting because we have a single state in each category. So let's make a more interesting categorical variable. And our categorical variable here is going to ask the question, produces natural gas or not. So, we'll make a new variable. We'll say df3 in square brackets and quotes, we'll simply ask, produces NG, for natural gas, question mark. So we'll expect a yes or no response here. Yes, it produces. No, it doesn't. To make this happen we'll borrow a function from NumPy called where. So the where function, and the way the where works is, we first give it an argument that returns a true or false, so some kind of Boolean statement as we used before. So, for example, we can say, from pandas, isnull, and we'll ask ourselves, is this df3 natural gas column null or not? So it's going to go through value by value in this variable natural gas, and if the value is null it'll return whatever we have for the second argument here. So we'll put a no, it doesn't produce natural gas. However, if isnull returns false, so the value is not null, then that means that that state produces natural gas. So that's the way this where function works. You always give it something that returns a true or false, so a Boolean statement. If it's true then it's going to return whatever value is in the second argument. If it's false it'll return whatever value is in the third argument here.

So let's see this in action. We can just take a look at this variable real quick, see what we've gotten. And we see we have a bunch of yeses and a bunch of nos. So remember, these last 18 are no because they had null values for natural gas, but the upper ones all produce natural gas. Let's take a look at the data types for this produces natural gas variable. It's an object. So this is a text variable, not a categorical one yet. So let's coerce this into a categorical variable and then we can move on. And so, we just need to use our astype function here. And astype category, and we'll take a look at our dtypes again and make sure that's worked. And it did. We've got a categorical variable there, and so we can move on.

So now let's make use of our frequency table to see how many states produce natural gas and how many do not. So we'll store this table in an object called freq, get our data frame, our new variable produces natural gas. We'll start with that encased in square brackets. And the function we want to use here is value counts, so it's a dot value_counts. And as the name of this function implies, it's just going to count the values in this variable category by category. So let's take a look at what we get. So we'll print our frequency table and lo and behold, we see that those 18 states are a no, they do not produce natural gas, and the remaining 33 are a yes, they do. Recall that there's 51 states because we're including DC. Okay. Well that's great. Let's find a proportion from this.

So if we want to ask ourselves, well what proportion of states produce natural gas? Well, you could probably do this in your head or very easily with a calculator, or even just in the cell here. It'd be very easy to say, oh well, it's 33 divided by 33 plus 18, right. And you'll get the right answer. But this isn't really programming. So, we have to ask ourselves, if we were to update this data set, if we got this next year, and we wanted to just reuse this code, could we do so without having to go in and really edit a whole lot? And if we write it this way then the answer is no, we'd have to go in and manually adjust these values by inspecting the table, and that involves work.

So let's write this in code that will automatically calculate the proportion for us, even as values change. So, let's first find our numerator of the proportion. And so, we're going to reference our frequency table. So this frequencies table is just stored as a series. So it's just the values 33 and 18, indexed by yes and no. So if we want to access those states, the count of the states that do produce natural gas, we just need to say yes. So frequency square brackets yes. And that will return our value of 33, as we see here. The denominator, well we want to, our denominator needs to be all the states. So we need to get 33 plus 18. And there's many ways we could do this. We could say frequency of yes, plus frequency of no.

Something that's a little bit more general would be to just simply sum. We'll make use of the sum function on this frequency table, and this is more general and easier to use if we had, say, five categories. We wouldn't want to necessarily type out all five categories. Or, if there were say 50 categories, you wouldn't want to type them all out. But regardless of the number of categories, if we do sum, then we'll always get the total of the frequency table. And we'll get the total number of observations that we have. And we can see that here: 33, and 51 observations of states. Let's put this together. Well, our proportion here will be just our numerator divided by our denominator. And let's print this. So then, we'll print our proportion is equal to prop. And there's a proportion, 0.647, with a whole bunch of digits. It's good practice to round this off. Let's round this to say three significant figures and redo this. There we go. So, a proportion of 0.647 of states produce natural gas. We could simplify this. So, here we did this in a whole bunch of different lines, but we could really just get this done in one line. We print the proportion, and we can just replace prop here with the input that we had for our numerator object divided by the sum of frequency. And there we go. We've got it done in just one line. And we'll get the same answer.

Or, if we want to turn this into a percentage, basically instead of a proportion, just convert this to have units of percent. We'll still have the same value, but we will now multiply this by 100, and it's also important to include percent afterwards. So, we'll do that. I don't like this little space after there. We can access the argument sep, or whatever we want to separate our values, in the print function. The default is to have a space there. I'm just going to get rid of that. There we go. So that looks a little bit better, 64.7 percent of states produce natural gas. So there we have it. We've developed a frequency table, or the count in each category. This is a very simple example with just two categories but, you know, if we have lots of categories it'll work just the same. And we found a proportion using code here, and so as the values of the underlying data set change, this code will update accordingly. And we've got the proportion reported in a bunch of different ways here. Okay. Thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]
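
The workflow in this video can be sketched on a small made-up table. The values, and the column names "Natural Gas" and "Produces NG?", are assumptions for illustration; the real data set has 51 rows and yields 33 "Yes" and 18 "No".

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df3; NaN marks a state with no natural gas production
df3 = pd.DataFrame({
    "State": ["TX", "PA", "VT", "WY", "HI"],
    "Natural Gas": [9000.0, 7000.0, np.nan, 1500.0, np.nan],
})

# np.where(condition, value_if_true, value_if_false), applied element by element
df3["Produces NG?"] = np.where(pd.isnull(df3["Natural Gas"]), "No", "Yes")
df3["Produces NG?"] = df3["Produces NG?"].astype("category")
print(df3.dtypes)

freq = df3["Produces NG?"].value_counts()
prop = freq["Yes"] / freq.sum()
print("Proportion:", round(prop, 3))
print(round(100 * prop, 1), "%", sep="")  # sep="" removes the space before %
```

Because the numerator and denominator are both read from freq, the same lines would give the updated proportion if the underlying table changed.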

Try It: DataCamp - Find a Frequency Table and Proportion from Roshambo Data

Using the examples above, see if you can find: 1) a frequency table of Rock, Paper, and Scissors played, and 2) the proportion of times that Rock was played:

Assess It: Check Your Knowledge

Knowledge Check

Use the table below to answer the following questions.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14

FAQ

More Roshambo: Two-Way Tables

Read It: Two-Way Tables

In the previous section, we saw how frequency tables displayed the counts of observations in each category of a single categorical variable. However, sometimes it is useful to see how these counts change across the categories of a second categorical variable. A "two-way table" gives the counts of observations in each and every combination of categories from two different categorical variables:

	Category A	Category B	Category C
Category 1	# of observations in A and 1	# of observations in B and 1	# of observations in C and 1
Category 2	# of observations in A and 2	# of observations in B and 2	# of observations in C and 2
Category 3	# of observations in A and 3	# of observations in B and 3	# of observations in C and 3

Let's look at a more interesting example that Dr. Morgan found on Instagram:

Instagram image stating that over 50% of people in the US who live near hazardous waste sites are people of color

Example of a two-way table from Instagram

Instagram

We are given a proportion here: 50% (well, actually "over 50%", but let's just round down to 50%). This could easily be exchanged with the actual count of people, so let's treat the proportion synonymously with the count for this discussion. What is the categorical variable here? Race, with the categories "people of color" and "not people of color" (or, "white"). What does the frequency table look like?

Race	Frequency (near Haz. Waste Site)
People of Color	50%
White	50%

However, this frequency table alone doesn't give the full picture. In other words, the 50% quoted in the picture above doesn't signify much without more context. What is that context? We need to know how many people are in each race category in the U.S. as a whole. In other words, we need to compare the frequency of each race near the hazardous waste sites to the baseline frequency. If the rest of the U.S. is 50% people of color to begin with, then the statement in the picture above isn't significant at all. However, according to the 2020 U.S. Census [4], this isn't the case:

	Location
Race	Near Haz. Waste Site	In all of U.S.
People of Color	50%	38.4%
White	50%	61.6%

Here, the two-way table allows us to see that the population living near hazardous waste sites is disproportionately composed of people of color compared to the U.S. as a whole (50% versus 38.4%). In summary, two-way tables allow us to make some pretty powerful comparisons of frequencies across two categorical variables.

Making two-way tables in Python is very easy, using the crosstab function from Pandas. The following video demonstrates the use of this function, as well as finding proportions from the resulting two-way table.

Watch It: Two-Way Tables (9:43 minutes)

Click here for a transcript.

Hi, in this video we're going to continue summarizing categorical data, but looking at two-way tables. So, basically looking at value counts across two different categorical variables simultaneously. So let's get to it. We're going to continue on again with our state energy rankings data. And recall that last time we made a new categorical variable asking whether a state produced natural gas or not using this np.where function. We're going to borrow the same code here, and we're going to make a second categorical variable looking at whether a state has total energy production above one quadrillion BTU or not. So, we'll develop another variable. We'll call this above one quadrillion BTU, question mark, and we'll replace our isnull question with a different statement. So we'll ask ourselves whether the total energy is greater than a thousand, because total energy is in units of trillions of BTU. So a thousand trillion is a quadrillion. And this time, if it is above, then we're going to return a yes, yes it is above one quadrillion. And if it's not, so if this statement returns a false, again we're going to say no, it is not above one quadrillion BTU. And again, we'll do the same processing. We'll convert it to categorical and confirm that. Let's go ahead and run this. There we go. We see that our question above one quadrillion BTU is categorical. So now we have two interesting categorical variables here, produces NG, and above one quadrillion BTU. Let's take a look at how many values we have in each combination of categories between these two categorical variables. So, in order to do this, we're going to use the crosstab function available from pandas, and we'll make a new object called twt, short for two-way table. Reference pandas with pd, the function is crosstab, and we'll take our data frame, which we're calling df3 right now. And we're going to give it the first categorical variable that we want, the one we developed before, that produces NG. And then, we give it the second categorical variable that we want in the two-way table. And that's going to be the one that we just made, above one quadrillion BTU. And let's take a look at what we get.

There we go. So we've got our first variable, Produces NG, no, yes. So that's giving us the rows. Our second variable, above one quadrillion BTU, giving the columns, no, yes. And the count of all the states that make these up. So there are 18 states that do not produce NG and do not produce more than one quadrillion BTU in total energy, and so on, and so forth. Another handy feature of crosstab is if we add the argument margins equals true, what this will do is it'll give total columns and rows to this as we see here. So it added on an all column, which gives just the summation across all the rows. So, 18 plus 0 is 18. 18 plus 15 is 33. And so on. And then it gives us an all row, which gives the summation down all the columns. So, 18 plus 18 is 36, etc. And then this last intersection between the all column and all row gives all the observations in the data set. So the total number of rows. So, 51 in this case, 51 states. So, let's use this to find some proportions. Now first, suppose we want to find what proportion of states that produce natural gas also produce more than one quadrillion BTU in total. Let's borrow the same structure that we had before for finding proportions. I'm just going to copy and paste down here.

Now we have to change this up a little bit because twt now is a data frame, it's not a series anymore. So, if we want to extract values from this, we need to reference the row and the column. Okay. And we need to do so using that dot loc function that we introduced before. So our numerator for this proportion is going to be those states that produce natural gas and also produce more than one quadrillion BTU. Now, that's located in the yes row and yes column. I want this value 15, right. So we're going to take our twt object, the data frame, do dot loc, so use our dot loc function, and we want to reference the yes row. And we want to reference the yes column to extract that 15 here. Let's just make sure that worked real quick.

And there we go. Our denominator, on the other hand, is going to be the number of states that produce natural gas. So it's the first criterion here: of the states that produce natural gas, what number also produce one quadrillion. So if we want to get the total number of states that produce natural gas, where do we find that on the table? Well, it's in this yes row, right, produces natural gas, yes. Before we had found that it was 33. And we see that referenced under the all column here. So here we can say twt dot loc. We want the yes row, and the all column (twt, not txt), and we get 33. There we go. And then we just find the proportion. So, 0.45 or 45 and a half percent. Okay. Well, let's do a slightly different proportion. And this is only just slightly different, but it highlights the power of language in stating these proportions. So, what proportion of states that produce more than one quadrillion BTU in total also produce natural gas? So we're reversing it. So, let's borrow our same structure here.

So now we still have the same numerator, and this is because we're still looking at the number of states that do both things, that meet both criteria: produce more than one quadrillion BTU, and also produce natural gas. So, we'll stick with this, 15. We still want that, but our denominator now changes. Our denominator is just what the first criterion is, and that is the states that produce more than one quadrillion BTU. So, not the states that produce natural gas now, but more than one quadrillion BTU, and that is under our yes column, and at our all row. So all we have to do is flip the order of what we have in this line for the denominator: the all row, yes column, 15 states in total that produce more than one quadrillion BTU. We see that all of them produce natural gas. And so, our proportion is one, or a hundred percent. Lastly, one more proportion here: what proportion of states produce more than one quadrillion BTU and also produce natural gas? Wait a minute, this is the same question I just asked! No, by removing this "that produce", we're now looking at, out of all the states, all 51, which ones do both things. Whereas this one was just looking at only the states that produce one quadrillion BTU. So we've got this conditional statement here that changes the game dramatically. So, we'll take this again. We're still using the same numerator, so we're still looking at the states that do both things. But now our denominator changes again. We want to look at all the states. So what proportion of all states do both things? And we can get this in the all row, in the all column, to get this 51. So, all and all, and go ahead and run this. We see that 29.4 percent of the states, out of all states, produce more than one quadrillion BTU and also produce natural gas. Okay, so that is two-way tables, and a bit more about proportions and how they can get more complex when we're looking at two different categorical variables. Thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]
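
The three proportions from the video can be sketched as follows. The rows below are synthetic, built only to reproduce the counts quoted in the transcript (18, 0, 18, and 15); they are not the actual EIA data, and the column names are assumptions.

```python
import pandas as pd

# Synthetic rows matching the transcript's counts:
# 18 No/No, 18 Yes/No, 15 Yes/Yes (Produces NG? / Above 1 Quad BTU?)
df3 = pd.DataFrame({
    "Produces NG?": ["No"] * 18 + ["Yes"] * 33,
    "Above 1 Quad BTU?": ["No"] * 18 + ["No"] * 18 + ["Yes"] * 15,
})

# First variable gives the rows, second gives the columns;
# margins=True adds the "All" row and column of totals
twt = pd.crosstab(df3["Produces NG?"], df3["Above 1 Quad BTU?"], margins=True)
print(twt)

both = twt.loc["Yes", "Yes"]                 # states meeting both criteria
p_given_ng = both / twt.loc["Yes", "All"]    # of NG producers, share above 1 quad
p_given_quad = both / twt.loc["All", "Yes"]  # of states above 1 quad, share producing NG
p_overall = both / twt.loc["All", "All"]     # of all states, share doing both
print(round(p_given_ng, 3), round(p_given_quad, 3), round(p_overall, 3))
```

The numerator is the same in all three cases; only the denominator changes with the wording of the question.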

Try It: DataCamp - More Roshambo

Continuing with the Roshambo exercise from the previous page, can you find the proportion of those who played Rock who ended up winning? First find the appropriate two-way table, then reference the values in that table to find the appropriate proportion.

By inspection of the table, do the data suggest that one option is more favorable to play?

Assess It: Check Your Knowledge

Knowledge Check

Use the table below to answer the following questions.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14

FAQ

Summarizing Quantitative Data: Histograms and Measures of Center

Read It: Histograms

Let's start our discussion with histograms, because this particular form of data visualization is useful for illustrating the measures of center that we discuss a little later in this section. Histograms simply show the number (or "Count") of observations whose values fall within pre-defined intervals (or "bins"). There are many ways to make a histogram in Python, and a convenient method is in the Pandas library, where one applies the function hist to any quantitative variable:

df['x'].hist()

where you would replace df with your dataframe name and x with the variable name. The height of the columns indicates the relative number of observations in each interval. For example, in the histogram below, the tallest column belongs to the interval 0 to 0.1.

Bell shaped curve; x axis begins at approximately -4 and ends at 4; y axis count ranges from 0 to 400; summarized in text above and below

Bell-shaped & Symmetric

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]
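
To see the counts behind a histogram as numbers, one option is NumPy's histogram function, sketched here on synthetic bell-shaped data (the data and the column name "x" are assumptions for illustration). With the same number of bins, df['x'].hist() draws these counts as bars.

```python
import numpy as np
import pandas as pd

# Synthetic bell-shaped data, for illustration only
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"x": rng.normal(loc=0, scale=1, size=5000)})

# Count the observations in 10 equal-width bins spanning the min to the max
counts, edges = np.histogram(df["x"], bins=10)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")

# df["x"].hist(bins=10) would plot the same counts (requires matplotlib)
```

Each printed line is one bin: its lower edge, its upper edge, and the number of observations that fall inside it.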

Histograms effectively visualize the distribution of a quantitative variable, showing us what values that variable spans (i.e., the minimum and maximum), where most values are concentrated, and how spread-out typical values are. Some typical shapes of histograms are represented in this section. In the figure above, we have a bell-shaped and symmetric distribution, in which the counts fall off in a similar pattern above and below the center (this example happens to be centered at 0). The next figure shows an example of a right-skewed distribution, where observations are concentrated at lower values, with fewer observations at much larger values. This distribution appears to have a "tail" going off to the right. Note that the values need not be concentrated at 0, or even be positive; these are just features of this example.

Right-skewed

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]

The next figure shows a left-skewed distribution, with a "tail" going off to the left. Most observations are concentrated at larger values. Note that the values need not be concentrated at 0, or even be negative; these are just features of this example.

Left-skewed

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [2]

An example of a distribution that is symmetric but not bell-shaped is depicted below. Even though it is centered at 0, that is not where most values are concentrated. This particular example is "bimodal", in that it has two locations where values are concentrated (about -2 and 2).

Symmetric but not bell-shaped

Read It: Measures of Center

Most often, the primary characteristic of interest for a distribution of data is its center, because one wants to know the most representative value for the variable of interest. In the example below, even though we witness values as large as 4 and as small as -4, these are very rare; the most common values are close to the center at approximately 0.

bell curve rising from around -3, centered at a mean of approximately 0, then tailing off again at +3; as described in text above

Centered: Mean = 0.004 Median = 0.005

Perhaps the most common way to measure the center of a quantitative variable is with the mean, in which you add up all the values and divide by the number of values:

$\text{Mean} = \frac{\text{Sum of all data values}}{\text{Number of data values}} = \frac{x_{1} + x_{2} + x_{3} + \dots + x_{n}}{n} = \frac{\sum x}{n}$

Here, n is the length of x, or the number of values in that vector. When someone talks about the "average", they usually are referring to the mean. When one calculates the mean from a sample of data, we call that the "sample mean" and denote it with $\bar{x}$ or "x-bar", whereas the mean of the entire population is denoted with $\mu$ or "mu" (a Greek letter). The image below shows this distinction in notation graphically. Throughout this course, whenever we talk about a value that characterizes the population, we refer to that as a "parameter"; and when we talk about a value calculated from a sample, we call that a "statistic".
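
As a brief sketch, the formula can be applied by hand and checked against the built-in Pandas mean function, here using the first five KWH values from the Lesson 1 table as the sample:

```python
import pandas as pd

# First five KWH values from the Lesson 1 table, used as a small sample
x = pd.Series([3491.900, 6195.942, 6976.000, 10979.658, 19472.628])

n = len(x)
mean_by_hand = x.sum() / n  # sum of all data values / number of data values
x_bar = x.mean()            # Pandas built-in

print(round(mean_by_hand, 3), round(x_bar, 3))
```

Both lines compute the sample mean, $\bar{x}$, since these five homes are a sample and not the whole population.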

Another measure of the center of a distribution is the median, which is the middle value after you sort the values from lowest to highest (or the mean of the middle two if there are an even number of values):

Median:

If $n$ is odd, median is middle value of ordered data values

If $n$ is even, median is mean of the middle two values of ordered data values

The median value is noted with m.
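
The two cases can be sketched with a small helper function (median_by_hand is a name made up for this illustration) and compared to the built-in Pandas median function, using made-up values:

```python
import pandas as pd

def median_by_hand(values):
    """Middle value of the sorted data (mean of the middle two if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [3, 1, 4, 1, 5]      # n = 5, made-up values
even = [3, 1, 4, 1, 5, 9]  # n = 6, made-up values

print(median_by_hand(odd), pd.Series(odd).median())
print(median_by_hand(even), pd.Series(even).median())
```

Note that the values must be sorted first; the median is the middle of the ordered values, not the middle of the values as they happen to be listed.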

A third measure of the center of a distribution is the mode, which is simply the value that appears most often in a variable:

Mode: The value that appears most frequently

We won't work with the mode much in this course, as it is a less useful measure of the center. Often with quantitative variables, the mode cannot be calculated because each value is unique. It is better applied to ranges of values (e.g., comparing the heights of bins in the histograms above), or for categorical data, where values do tend to repeat and one usually sees a distinct category with more observations.

cyclical diagram showing how a data is collected from a population to quantify as a statistic to measure and make an inference; as described in the text above

Notation for Mean


The mean is more affected by skewness (and outliers) than the median!

For symmetric distributions, the mean and median will agree upon the measure of the center. Take the example below, where the mean estimates the center at 0.004 and the median at 0.005. Although not exactly equal, these values are very close (indeed, both are represented by vertical dark blue lines, which are indistinguishable because they plot right on top of one another).

bell curve rising from around -3, centered at a mean of approximately 0, then tailing off again at +3; as described in text above; mean and median both approximately at 0; as described in text above

Influence of Skewness: Mean = 0.004. Median = 0.005

However, for skewed distributions, the mean and median will generally not agree, and the more skewed the distribution, the more these two estimates will disagree. The reason for this disagreement is that the mean will be more influenced by the extreme values in the tail of the distribution, since it sums the values in the numerator. The median doesn't care about the magnitude of the values, just their rank in terms of smallest to largest. Take for example the right-skewed distribution below, where the mean is 1.123 and significantly greater than the median of 0.993 (here the two dark blue vertical lines representing these values are distinguishable). The large positive values in the right tail have led to a larger mean than the median.

right skewed distribution portraying how means and medians are different as described in text above

Influence of Skewness: Mean = 1.123 Median = 0.993

Thus, the median tends to give a more stable measure of the center that is robust to extreme values and outliers. For this reason, the median is often the go-to measure of the center for skewed distributions, such as salaries and housing prices, for example.
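
A small sketch with made-up salaries (in thousands of dollars) shows this robustness: adding a single extreme value pulls the mean far to the right, while the median barely moves.

```python
import pandas as pd

# Made-up salaries in thousands of dollars, for illustration only
salaries = pd.Series([40, 45, 50, 55, 60])
with_outlier = pd.concat([salaries, pd.Series([1000])], ignore_index=True)

# Symmetric sample: the mean and median agree
print(salaries.mean(), salaries.median())

# One extreme value: the mean jumps, the median shifts only slightly
print(round(with_outlier.mean(), 1), with_outlier.median())
```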

Read It: EPA Flight Tool

The videos below use data from EPA's Greenhouse Gas Reporting Program (GHGRP [5]) to demonstrate how to make histograms and find the mean and median in Python. You may want to first explore the EPA's web app: Facility Level Information on GreenHouse gases Tool (FLIGHT [6]). This will help you understand the content of the dataset, which gives volumes of emissions from power plants, factories, refineries, etc. across the country.

screen capture of EPA's web app: Facility Level Information on GreenHouse gases Tool

Caption: A screenshot of FLIGHT [6].

Image Credit: EPA

Watch It: Measures of Center (10:39 minutes)

Click here for a transcript.

Hi, in this video we're going to take a look at the underlying data for the EPA's FLIGHT tool. These are data on greenhouse gas emissions from facilities all across the United States. By facilities, I mean power plants, waste processing centers, factories, and so on, and so forth. So, I highly recommend you click on this FLIGHT tool to explore the database in another format. And if you click on that, you'll go to their web app. So, this is the FLIGHT tool here, and here it's a map-based overview of the data. You can zoom in on different localities and see where these facilities are located and information about them. You can also filter by different facilities and variables, and so on, so forth. So go ahead and check that out. However, what we're going to do here is we're going to look again at the underlying data. We're going to do our own thing with it outside of the web app. And we are going to ultimately find which state has the largest amount of emissions from facilities. So this is not emissions from the automotive sector or anything else, just from facilities. And along the way we're going to look at histograms. We're going to see how to calculate the mean and the median and then ultimately how to summarize all the emissions within each state.

So, let's get to it. So, we're all familiar with importing the data. We've got those lines of code here, except there's one difference now that you may not be familiar with, and that is this skiprows argument. So, why do I have skiprows here? Let me explain. If we take a look at the underlying CSV, our data frame or our table doesn't really start until row four, here with the column headers, and then all the data below. We've got three superfluous lines here that really aren't part of the data frame at all, and that we don't want to include. So, we just need to skip those three rows. So, pandas makes this incredibly easy. We say skiprows equals three. So, we skip over those unnecessary lines there. And this highlights the importance of taking a look at the CSV file before doing the importing, just to understand exactly what is in this data file. And you can open up these CSV files in Google Sheets or Microsoft Excel. So, once we have this imported, let's go ahead and take a look at the size of this data set. And one handy function to do this is shape. And all that shape gives us is the number of rows in this data set and the number of columns. So, 1, 5, 3, 8, 6 some odd rows. The rows here are different facilities and we've got 66 columns. Furthermore, we should take a look at what the columns are exactly. So, let's print out the column names here.

So, we've got facility ID, facility names, city, state, etc., etc. And one really important variable here is total reported direct emissions. So, we're going to focus on that, although there's a lot of other interesting things here too, in particular, the industry type. So, what form of industry is producing these emissions. Lots of other useful things here, so let's get down to it. Let's start by finding the histogram. So, with a histogram we want to make sure that the data that we plot in our visualization in the histogram spans from the minimum of the data to the maximum of the data. At least it needs to cover the span of the data. So, if we want to find the min and the max, let's just say ghg min. We want to find our minimum total reported direct emissions. We just use the min function, very easy. We'll do the same thing for the max. And let's print these results. We'll say min is ghg min and max is ghg max. So, our direct emission values for all the facilities span from zero to some really large number here. So, what is this, about 20 million? More than 20 million, so very large number here. So, let's see what happens when we make a histogram of this. So, we'll call our data frame, and one easy way to make a histogram is through pandas. We just say dot hist and we give it the column that we want to make the histogram of. And this is, in this case, total reported direct emissions. And we'll run this. And so, here we go. We get our histogram, and what it's done is it's used some predefined, you know, rules of thumb to make adequate bin sizes for our histogram here. So, the histogram is counting the values that fall within specified ranges and then plotting those frequencies as bars. So, this is essentially a frequency table represented graphically where the categories are just different ranges of the quantitative variable. However, when we use the default method here, we don't know what those ranges are.
And that's a little bit problematic. We're not sure basically what the break point is between this first bin and the second bin. We don't know where this stops. So, it's good practice to manually specify the bins that you have. So, we could do this fairly crudely, and we could say 1e7, with 1e7 being 10 million, up to 20 million. But then we have to go up to 30 million because our largest value is slightly more than 20 million. So, we could do this, we'll have three bins. We've got four values specifying the break points between those three bins. You visualize this and we see a pretty crude histogram. I say this is crude because we pretty much have all our values in this first bin here and then a few values in the second bin and probably one value in this third bin. So, these bins are too fat. We don't have a good enough resolution. And so, we could break this up some more. We could subdivide it. A handy tool here is the range function. The built-in range will just give us a sequence of values from some integer to some other integer. So, let's say 21 million. And by some step, let's say we want to go by 1 million. And we get the list of values going up in that increment. So, this is a handy way to build these bins if we want to have many of them and we don't want to manually specify the break point between each and every one. So, there we go. We get a much finer histogram, and we know where the break points are. We know this is going from zero to one million, one million to two million, two million to three million, and so on and so forth.

Okay, so that's the histogram, and it's telling us the spread of the emissions from the facilities. In particular, we see we've got a lot of values that are close to zero, very few values that are really, really large. So this is what we call right-skewed. This is an example of a right-skewed histogram. Let's go ahead and calculate some measures of center. So, what is the average level of emissions in these facilities? I'll just do this in a couple of print statements. So, if we want to find the mean, also denoted x bar, this is going to be equal to our quantitative variable, which in this case is total reported direct emissions. And we just say dot mean. Calculate the mean. It's also a good practice to include units. Our units here are millions of tons of CO2 equivalent, so, mmt CO2, millions of metric tons CO2 equivalent. So, that'll give us the mean. And let's compare this to the median. So instead of the mean function we just have median appended on there. And this doesn't have the notation x bar. The median is just the median. Let's run this. We see that our mean, it's got 388,000, our median 65,000, or so. A lot less. What this indicates to us, with the mean being much larger than the median, is that the distribution is indeed right-skewed. In other words, the value of the mean gets pulled up by these much larger values up here. These are most likely outliers. So, the mean is more susceptible to these outliers. The median is more robust to them. Note that the median falls, you know, well within this first spike here. The mean is somewhere up here.
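The steps narrated in the video above can be sketched as follows. This is only an illustration under stated assumptions: the CSV content, the column name "Total reported direct emissions", and every value are hypothetical stand-ins for the real GHGRP file, which you would read from disk instead. Bin counts are computed with `numpy.histogram` so the sketch runs without a plotting library.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the downloaded CSV: three preamble lines,
# then the header row, then one row per facility.
raw_csv = io.StringIO(
    "Hypothetical report title\n"
    "Hypothetical note line\n"
    "Hypothetical export date\n"
    "Facility Id,State,Total reported direct emissions\n"
    "1001,TX,12000\n"
    "1002,TX,2500000\n"
    "1003,CA,65000\n"
    "1004,CA,0\n"
    "1005,PA,310000\n"
    "1006,OH,20500000\n"
    "1007,OH,48000\n"
    "1008,NY,90000\n"
)

# skiprows=3 skips the three preamble lines so the fourth line becomes the header.
df = pd.read_csv(raw_csv, skiprows=3)
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names

col = "Total reported direct emissions"  # assumed column name
ghg_min = df[col].min()
ghg_max = df[col].max()
print("min is", ghg_min, "and max is", ghg_max)

# Explicit bin edges every 1 million. range() excludes its stop value,
# so a stop of 22 million gives a last edge of 21 million, which covers the max.
bin_edges = list(range(0, 22_000_000, 1_000_000))
counts, edges = np.histogram(df[col].to_numpy(), bins=bin_edges)
print(counts)

# Measures of center.
ghg_mean = df[col].mean()
ghg_median = df[col].median()
print("mean =", ghg_mean, "median =", ghg_median)
```

With matplotlib installed, the same `bin_edges` list can be passed to pandas for plotting, for example `df.hist(column=col, bins=bin_edges)`.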

Watch It: Measures of Center Part 2 (6:16 minutes)

Click here for a transcript.

Lastly, let's find our total emissions for each state. So what we need to do here is within each state, sum up all of the emissions from all of the facilities within that state. So far we've just been looking at all the facilities collectively, but let's actually tally up the quantities within each state. So in order to do that, let's create a new object called stem, for state emissions. And we'll reference our initial data set. And we're going to use this group by function. So we're going to say we want to group by the variable, state, here. So they're going to be our groups. And within each group we want to perform whatever comes next. In this case we want to sum up the values.

When we do this, we retain all of our columns there. But ultimately, our total reported direct emissions are the sum within each state. Also one byproduct of this is that state becomes an index here. So let's reset that. Let's include in here this function reset index. And that'll just turn the state back into a normal column. There we go. So now we see state is just a regular column. Okay, so we've done that. We've summed all the total reported direct emissions within each state. Now let's find what the max is. So our max emissions will be the max of stem total reported direct emissions. And let's see what that max value is. Some really large value. What is this? This is 372 million and some change.

Okay. But we don't know what state that belongs to, so now we need to find which row contains this max value. So let's just make some dummy variable, we'll call it f. And we'll say stem total reported direct emissions equals equals max m. Note that I'm making a logical statement here. I'm using this equals to say which of these values are equal to max m. And what we'll get then is a Boolean variable. In other words, a variable containing only true and false. So it's false everywhere where it doesn't contain max m. And then we've got this one true in row 45, or index 45 here. But still that doesn't tell us what state, it just tells us the index. But we're getting there. So now we just need to query that row. We say stem of f. And you get the row that has the max total for direct emissions, the state being Texas. And there we have our answer. A quicker way to do this would be to say max row stem dot loc, using our loc function here of stem total reported direct emissions. And instead of max we're going to use idx max. So return the index where max is located, not the max value itself. Okay. And I need to type square brackets here so we can then see what max row is. And it reports all the variables, all the variable values for whatever the max is. We see it's Texas here. And we could put this into a nice print statement. I could say print the state with the highest facility emissions is, and then we say max row state.

There we go. So we found our answer. All right, so we've covered histograms, mean, median, and then ultimately summarizing data within groups. Note that Texas here, you know, it is the largest of the continental United States. And so you might say, well, of course it has the highest total direct emissions. It's a really, really large state: lots of people, lots of industrial activity, makes a lot of sense. If we wanted to normalize this by state size, we could look at what the average is across all the facilities, to see if there's any state whose facilities are producing above-average emissions, in which case we'd replace the sum here with mean. So you can replace this sum with any other function if you want to summarize your data differently within groups. Okay, we'll stop there for now.
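The grouping steps described in the video above can be sketched as follows. The data frame, its column names, and its values are hypothetical; the names `stem`, `f`, and `max_row` follow the ones spoken in the transcript. One deliberate difference: this sketch selects the emissions column before summing, so that text columns are not summed along with it.

```python
import pandas as pd

# Hypothetical facility-level data; not the real GHGRP values.
df = pd.DataFrame({
    "State": ["TX", "TX", "TX", "TX", "CA", "CA", "PA", "OH"],
    "Total reported direct emissions": [
        12_000, 2_500_000, 30_000, 20_000, 65_000, 40_000, 310_000, 900_000
    ],
})
col = "Total reported direct emissions"

# Sum emissions within each state; reset_index() turns State back
# into a regular column instead of the index.
stem = df.groupby("State")[col].sum().reset_index()

# Largest state total.
max_emissions = stem[col].max()

# Boolean mask: True only in the row(s) holding the max value.
f = stem[col] == max_emissions
print(stem[f])

# Quicker route: idxmax() returns the index label of the max,
# and .loc retrieves that row.
max_row = stem.loc[stem[col].idxmax()]
print("The state with the highest facility emissions is", max_row["State"])

# Replacing sum with mean summarizes the average facility in each state.
state_mean = df.groupby("State")[col].mean().reset_index()
top_mean_state = state_mean.loc[state_mean[col].idxmax(), "State"]
print("The state with the highest average facility emissions is", top_mean_state)
```

In this made-up example the state with the largest total is not the state with the largest per-facility average, which is the distinction the video draws between summing and averaging within groups.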

Try It: DataCamp - Find the mean and median

Let's return to our sample of residential energy use, which gives the total electricity consumed in a year (KWH):

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

* Data Source: Residential Energy Consumption Survey (RECS) [1], U.S. Energy Information Administration (accessed Nov. 15th, 2021)

Can you use Python to determine, from this sample, the typical annual home electricity use? Find both the mean and median.

Assess It: Check Your Knowledge

Use the table below to answer the following questions.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14

FAQ

Summarizing Quantitative Data: Measures of Spread

Read It: Measures of Spread: Standard Deviation

In addition to knowing the center of a distribution of data values, it is often useful to know how dispersed or spread out the distribution is. A common measure of this spread is the standard deviation:

Standard Deviation: the typical distance of a data value from the mean. $s = \sqrt{\frac{\sum (x - \bar{x})^{2}}{n - 1}}$

A larger standard deviation indicates a more dispersed collection of values, whereas a smaller standard deviation indicates a narrower distribution. To make the interpretation of the standard deviation more quantitative: for a roughly bell-shaped distribution, a new, randomly drawn value from the population has about a 95% chance of being within two standard deviations of the mean (i.e., in the range $[\bar{x} - 2s, \bar{x} + 2s]$). This is depicted in the example image below.

A histogram of a quantitative variable following a bell curve. 95% of the data is within two standard deviations of the mean. Mean = 0.002, Median = 0.007, Standard Deviation = 0.997. At +2 standard deviations, the “z-score” is z = 2: z*s = 2*s

Measures of Spread: Standard Deviation


The image above also introduces the z-score, which we will come back to later in the course. The z-score, $z$ , is simply the number (or fraction) of standard deviations that a value is from the mean. The z-score can be reported for a data point, or for a hypothetical value, such as the far right red line in the example above (where $z = 2$ ).
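As a worked illustration of the formula and the z-score, here is a minimal sketch using a small set of hypothetical values (not course data). It computes $s$ term by term from the formula, checks it against the Pandas `std` function, and then converts each value to a z-score.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen so the arithmetic is easy to follow.
x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

n = len(x)
x_bar = x.mean()  # sample mean

# Standard deviation from the formula: sqrt( sum((x - x_bar)^2) / (n - 1) )
squared_deviations = (x - x_bar) ** 2
s_manual = np.sqrt(squared_deviations.sum() / (n - 1))

# Pandas uses the same n - 1 denominator by default.
s_pandas = x.std()

# z-score: how many standard deviations each value lies from the mean.
z = (x - x_bar) / s_manual

# Share of values within two standard deviations of the mean.
# For a tiny hand-made sample this need not be close to 95%.
share_within_2s = x.between(x_bar - 2 * s_manual, x_bar + 2 * s_manual).mean()

print("mean =", x_bar)
print("s (formula) =", s_manual, " s (pandas) =", s_pandas)
print(z.round(2).tolist())
print("share within 2 s of the mean =", share_within_2s)
```

A value with a positive z-score lies above the mean and a negative one lies below it, so the z-scores of a sample always average to zero.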

A Note on Notation

The formula above gives the standard deviation for a sample, denoted $s$. The standard deviation for the entire population is denoted $\sigma$ ("sigma", a Greek letter).

cyclical diagram showing how a data is collected from a population to quantify as a statistic to measure and make an inference; the standard deviation is noted and described in the text above

Notation for Standard Deviation


Watch It: Video - Measures of Spread (4:36 minutes)

Click here for a transcript.

Hi, in this video we're going to keep on using the EPA FLIGHT data set. These are facility-level data on greenhouse gas emissions, but now we'll use them to exemplify how to calculate measures of spread. So we're going to use those data to find standard deviation and then the interquartile range in Python. So, let's get to it. So first off, to find the standard deviation, we can find this in a very similar way as to how we found the mean and median above. So I'll do this in a print function, as we did before. So standard deviation is equal to, and then we call whatever variable it is we want to do this on, so total reported direct emissions in this case. And our function now is not mean or median, but std for standard deviation. And this actually shares the same units as the underlying variable. So in this case, millions of metric tons of CO2 equivalent. Okay, so our measure of spread, basically our measurement of how much these direct emissions vary across all the facilities in this data set, comes out to be a little bit more than a million. Pretty large number! Okay. Well, let's compare this to the interquartile range now. So we're going to calculate the interquartile range using the describe function. And there are different ways to do this. I want to do it this way because the describe function is actually quite handy for a number of things. So let's first find that. So we'll call our variable total reported direct emissions, and our function here is just describe, and here we have to put in parentheses at the very end. And then we'll view the output.

Here we go. So this is a series actually, and it's got a lot of nice values. So we see our standard deviation here. We see our mean value. It also tells us how many values are being used here; so 6,481, in this case. Note the use of scientific notation here. And then the minimum zero, max of 20 million or so, and then 25, 50, and 75 percent. This 50 percent is the median. So this agrees with exactly what we had in the median above because this 50 percent refers to the 50th percentile. And that is another name for the median. So those are synonymous. And then we also have the 25th percentile, and the 75th percentile. So 25 percent of the values are less than 31,000 and some change, here. And 75 percent of the data values are less than 186,000 and some change here.

So our interquartile range is the difference between the 75th percentile and the 25th percentile. So we can simply find this by referencing these values in the describe series above. So it's our 75th percentile minus our 25th percentile. And you can view that value. Turns out to be 154,000 or so. So quite a bit less than the standard deviation. So these are two different measures of the spread. The standard deviation is, you know, the typical distance away from the mean. And if you recall, in the comparison of the mean and median, we have a right-skewed distribution up here. And so, our mean was a lot larger than our median. And so, there's a lot more spread about that mean, if we're looking at it that way. Whereas the median and the interquartile range are much more concentrated down in this lower end where we have more observations.

So in summary, that's how you can find the standard deviation and the interquartile range. Along the way we found this handy describe function which is useful for summarizing the data in one fell swoop. And again, we got to do all this with the fascinating flight data set on greenhouse gas emissions across the U.S. Okay, thank you.
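The describe-based approach from the video can be sketched as follows, again with hypothetical values standing in for the real emissions column. The index labels "25%", "50%", and "75%" are the ones Pandas uses in the output of `describe`.

```python
import pandas as pd

# Hypothetical right-skewed values; not the real GHGRP data.
emissions = pd.Series(
    [0, 12_000, 48_000, 65_000, 90_000, 310_000, 2_500_000, 20_500_000],
    name="Total reported direct emissions",
)

# Standard deviation (shares the units of the underlying variable).
sd = emissions.std()

# describe() returns a Series with count, mean, std, min, quartiles, and max.
summary = emissions.describe()
print(summary)

# Interquartile range: 75th percentile minus 25th percentile.
iqr = summary["75%"] - summary["25%"]

# The same quantity computed directly from quantile().
iqr_direct = emissions.quantile(0.75) - emissions.quantile(0.25)

print("standard deviation =", sd)
print("interquartile range =", iqr)
```

As in the video, the skewed values make the standard deviation much larger than the interquartile range, because the standard deviation is pulled up by the few very large values.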

The above video demonstrates how to calculate the standard deviation using the std function from Pandas. NumPy also has a std function; however, it divides by n by default (ddof=0), so to calculate the standard deviation as given in the above formula you need to specify the argument ddof=1, which represents the 1 being subtracted from n in the denominator:

`numpy.std(x, ddof=1)`

where x is an array of values.
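The effect of the ddof argument can be checked with a short sketch. The array values are hypothetical and chosen only so the results are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical values: the squared deviations from the mean (5.0) sum to 32.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# NumPy default (ddof=0): divides by n, i.e. sqrt(32 / 8).
sd_divide_by_n = np.std(x)

# ddof=1: divides by n - 1, matching the sample formula, i.e. sqrt(32 / 7).
sd_sample = np.std(x, ddof=1)

# Pandas std uses ddof=1 by default, so it agrees with the sample formula.
sd_pandas = pd.Series(x).std()

print("numpy, ddof=0:", sd_divide_by_n)
print("numpy, ddof=1:", sd_sample)
print("pandas default:", sd_pandas)
```

The difference between the two denominators shrinks as n grows, but for small samples like this one it is noticeable.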

Try It: DataCamp - Find the standard deviation

You already found the mean and median of the total electricity consumed in a year (KWH) on the previous page.

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

* Data Source: Residential Energy Consumption Survey (RECS) [1], U.S. Energy Information Administration (accessed Nov. 15th, 2021)

Now can you find the standard deviation using Python?

Assess It: Check Your Knowledge

Use the table below to answer the following questions.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14

FAQ

Summary and Final Tasks

Summary

In Lesson 2, you've learned about different types of variables: categorical versus quantitative. You've also learned about corresponding data types in Python, both how to recognize those types and how to change a variable from one type to another. Lesson 2 has presented a lot of basic tools, mostly coming from the Pandas library in Python, for data manipulation and processing. These tools are essential for going from the raw imported data to a useful DataFrame, from which we can easily perform subsequent analyses. Lesson 2 ended with some very basic analyses, in the context of summarizing data: finding proportions, means, medians, standard deviations, and other summary statistics.

As you continue through this course, you may find the Pandas cheat sheet [7] to be a handy reference to the functions we've covered in this lesson.

Assess It: Check Your Knowledge Quiz

Use the table below to answer the questions below.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14

Reminder - Complete all of the Lesson 2 tasks!

You have reached the end of Lesson 2! Double-check the to-do list on the Lesson 2 Overview page to make sure you have completed all of the activities listed there before you begin Lesson 3.
