Variables, Data Types, and Other Important Terminology

Read It: Important Terminology

First, recall our table from Lesson 1:

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

* Data Source: Residential Energy Consumption Survey (RECS)(link is external), U.S. Energy Information Administration (accessed Nov. 15th, 2021)

You may already be familiar with talking about the parts of this table as rows and columns. In Data Analytics, one goal of preparing data for subsequent analysis is to get the data into a table (or DataFrame, if using Pandas), such that each row represents a case or unit, and each column represents a variable:

Rows: cases or units
Columns: variables

Let's further define these new terms:

Cases or units: the subjects or objects for which we have data

Variable: any characteristic that is recorded for the cases

So, in the example table above, the cases are individual homes (identified by HOME ID), and the variables are DIVISION (geographic region in the U.S.) and KWH (electricity usage).

Types of Variables

For this course, it is important to distinguish two different variables:

Categorical variable: divides the cases into groups, placing each case into exactly one of two or more categories

Quantitative variable: measures or records a numerical quantity for each case

This distinction is important because the type of variables determines what sorts of analyses and methods can be applied to it. Following the example table above, DIVISION is a categorical variable and KWH is a quantitative variable (since it reports the quantity of energy used by each house).

Yet another way to distinguish variables is based on their purpose or role in the intended analysis, if the analysis is looking at relationships between two or more variables:

Response variable: the variable for which an understanding, inference, or prediction is desired

Explanatory variable: used to do the understanding, inference, or prediction of the response variable

So, for example, if we hypothesized that geographic location somehow affects electricity usage, then KWH would be the response variable and DIVISION would be the explanatory variable. In other words, we would want to see whether or not DIVISION explains how KWH responds. Furthermore, this relationship can be swapped: if the hypothesis was that electricity can identify in which region a home resides, then KWH is now the explanatory variable and DIVISION is the response. In other words, KWH could, perhaps, explain the DIVISION.

Data Types in Python

There are many data types in Python and Pandas. Let's focus on some of the common ones that are also relevant for this course.

Any categorical variables should be represented by data type:

category: which contains a finite list of text values

This is not to be confused with:

str or object: string (text; characters), which do not necessarily define categories

Quantitative variables will typically show up as:

int or int64: integers (whole numbers)

float or float64: floating-point number (decimal; continuous valued)

Note that categories could just as well be labeled with numbers as they could with text, so just because a variable has numbers in it does not necessarily mean that it is quantitative.

Lastly, one other Pandas data type that will come up in the course because it is especially important when analyzing energy generation and consumption is:

datetime64: a calendar date and/or time

Watch It: Video - Investigating and Naming Columns in Datasets (13:43 minutes)

Click here for a transcript.

Hi, in the next few videos we're going to cover some data processing and data manipulation code in Python starting in this video looking at data types and doing some other useful procedures in Python. So, let's get to it. To exemplify these procedures, we're going to work with data from the Energy Information Administration. In particular, we're going to look at state rankings. So, we'll go to their website that's linked here. So, the U.S Energy Information Administration publishes a wealth of data online. What we're going to take a look at is under this production section here. We're, in particular, we're going to look at two data sets, two CSV files, one total energy, one natural gas. And these are quantities of energy and natural gas produced by states and they also rank them. I encourage you to go on and explore other data sets published by the Energy Information Administration, but for now we're just going to focus on these two for simplicity's sake. It's also useful to preview the data in spreadsheet format before we start playing around with it in Python.

So, let's start with the natural gas data set. So, here's natural gas production ranked by state, one down through 33, 34, starting with Texas on down. Here, the quantities of natural gas, we also have this blank column here, note rankings are blah blah blah blah blah. Interesting. Turning to total energy production also ranked by state going down to 51. So, here we have all the states, including DC, and we also have this curious blank note column here. Okay so let's dive into the python code. I've already mounted the Google Drive and imported the data sets following procedures that we've seen in the previous lesson. I've also already run these. We've got our natural gas data set stored as ngdf for data frame and we've got total energy stored as TEdf. Now if we want to just see what columns, we have a useful function. Here is the dot columns function. When we run this, we just get the column names. This is as opposed to, for example going back up here and saying NGdf and getting the whole entire data frame, just to scroll up and see what columns we have. So, this is a more concise way to have it. Furthermore, we can do this for the total energy data frame. Let's see what columns we have there. So, both have rank, both have state, the natural gas has natural gas market to production, and units of million cubic feet. Total energy is total energy production in trillions of BTU. And then both also have this note column which is not an actual variable. Furthermore, if we wanted to store these we could say, for example, make a new object called NG columns which would be equal to NGdf dot columns, and then I'm going to add something else here. I'm going to add dot to list. This will just convert all these names into an object called a list. So, this could be useful if I wanted to store these names and reference them later on.

Okay, another useful tool is to remove columns. So, for example, we have this weird note column. That isn't actually a variable. It doesn't contain any data. Let's get rid of that. So, starting with the natural gas data frame, and NGdf, we'll say NGdf dot drop. So, drop is the function that's going to get rid of whatever column or columns we specify here. So, I'm going to go up to my, I'm just going to copy and paste the column from here, make sure we drop this, and furthermore at the end of this I'm going to run NGdf dot columns again just to verify that we've gotten rid of that column. And indeed, we have. We no longer have this note column. Let's just repeat this real quick for the total energy data frame. This has a slightly different named note column so we'll copy and paste there and run that as well, and we see that that has been removed. Yet another useful function is to rename columns. This is particularly useful if we've got some overly verbose column names. For example, total energy production showing BTU. I don't want to have to type that over and over and over again. Let's call that, as well as this natural gas marketed production, something else. So, the syntax for using this rename function is stated generically above here, but let's put this into use. We'll again alter our NGdf or data frame. So, we're going to perform some function on the right-hand side and store it as what we've previously been calling it. So, overwrite it. We'll say NGdf dot renames, the function used here, and again we're going to specify some columns. But now we need to put in a dictionary, so curly brackets. And we're going to give the existing name, so we want to change this name right here, this natural gas marketed production we'll put that in there and we'll put in a colon. And after the colon we're going to put what we want to change it to. Let's call it Natural Gas. And again, we can use our columns function to verify that that change has happened. Indeed, it has. Let's go ahead and repeat this also for our total energy. Let's rename these object names and instead of natural gas we have total energy that we want to change. Stick that in there, and we'll call this total energy and run this. And again, we verify that that change has happened.

Okay, so that covers some handy tools to manipulate the data and get it into something that's more manageable. Let's turn our attention to data types now. And oftentimes after we import data, we need to convert some of the data types because in the import process python has automatically recognized some variables as what they ought to be. So, let's start by finding what data types we currently have. We'll start with our NG data frame and the key function here is dtypes, d standing for data, type standing for types. Go ahead and run this. Here we go. So, this rank variable is in 64 in standing for integer, the state variable is an object. Object is another name for text. Natural gas also text. Let's do the same for our total energy. Again, rank is an integer. That's what we expect. State an object AKA text, so expect total energy. Also, an integer. Hmm, that's interesting. As compared to the Natural Gas data set, natural gas is an object. And really, we don't want that to be an object. That should be a number. Right? It shouldn't be text. We need this. We would prefer that this first word as a quantitative variable. So how do we make that happen? How do we convert this to a quantitative variable? Well, the first thing we need to do is we need to recognize that this natural gas data set, if we go up and inspect it, has a number of dashes in the natural gas column here. This poses a problem if we if we try to convert that column to integers, or floats look at an error. So, we need to replace those dashes first. And what we're going to do is we're going to replace them with something called NaN, stands for not a number, or no value. And we see this above, if we scroll back up again, we see that NaN values have been inserted into this strange note column. This isn't, this is very important. So, these values are basically placeholders saying that that cell was blank. And so, instead of just leaving it blank, it actually has something there to make it more obvious that it's blank. Really important is that zeros were not automatically stored there. You need to bear in mind that the zero is an actual value. If zeros were here, that would imply that there was a value of zero, that we knew that there was a zero quantity there. But in fact, these are blank. And so, we need to note that with NaN.

So, let's go back down and substitute these dashes with NaN. Now a handy function for doing this is this replace function. So, we're going to replace whatever the first object from first argument is here. So, dash dash, that's the thing we want to replace. And we want to replace this with NaN. Now we can't just write n a n like that. NaN is special. And to make this special quantity appear appropriately we need to use the numpy package. So, we'll import that as numpy and we need to call np.nan. So, this is clearly telling python that we've got these special NaN values here. Let's take a look and verify that this worked. So, reprinting this data frame we scroll down, and indeed where we have those dashes, we now have nan. So, now we can move on to converting data types. So, we will take our NG data frame here, we want to just convert the natural gas column. So, we're going to just reference this column. We do that by taking our data frame, putting square brackets, and then within quotes we've got the variable name there. And we can say .astype "float64". That'll convert it to a quantity. Right, we can check and make sure that that worked. We can say NGdf dot dtypes. And indeed, natural gas is now a float. That's great.

Another thing we can do, is we can convert, say for example, state to a categorical variable. So, NGdf state. And we'll use the same syntax here, so as that type, except instead of float we'll say category. And we will repeat this for the total energy data frame as well. We can verify that that worked, NGdf dot dtypes. We see that states now a category and indeed for total energy we will see the same thing, category. Okay, so that's a bit of data processing and investigating and changing data types. Thank you.

Try It: DataCamp: Identify Data Types

The Pandas function for identifying data types is dtype. See it in action here:

We can see that house and state are both object (text), whereas monthly energy usage is int (numerical). We've previously identified state as a categorical variable, but it is not showing up as such in the code output above. In order to tell Python to recognize state as a categorical variable, we need to add another line of code:

The astype function can be used to convert to other data types as well (for example, from int to float).

A Note on Data Collection

Now, although the table above is good, in its simplicity, for illustrating these important terms and types of variables, it is a bad dataset for investigating the sort of relationship hypothesized above. This is because the sample size is too small; we've only sampled three houses out of a tremendously larger population. Therefore, we do not expect to have adequate representation of the whole population of houses. We will talk more about sample size in the coming lessons, as this is a very important aspect of conducting experiments and surveys ("Data Collection"). Usually, larger sample sizes incur larger costs. However, you need a sufficiently large sample size to ensure that you have adequate representation. Besides having a large enough sample size, it is also important to have a sample of data that is unbiased. In other words, we do not want our data to disproportionately represent one group in the population, which is called "sampling bias". There are other forms of bias, but let's limit our discussion to just sampling bias for now. The main way to combat this form of bias is to collect a random sample.

Random sample: each case in the population has an equal chance of being selected in a sample of size n cases. The goal is to avoid sampling bias.

Therefore, in examining a dataset like that which is in the table above, it is important to know how the data were collected.

Assess It: Check Your Knowledge

Knowledge Check

Use the table below to answer the following questions.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14