
Read It: Important Terminology
First, recall our table from Lesson 1:
HOME ID | DIVISION | KWH |
---|---|---|
10460 | Pacific | 3491.900 |
10787 | East North Central | 6195.942 |
11055 | Mountain North | 6976.000 |
14870 | Pacific | 10979.658 |
12200 | Mountain South | 19472.628 |
12228 | South Atlantic | 23645.160 |
10934 | East South Central | 19123.754 |
10731 | Middle Atlantic | 3982.231 |
13623 | East North Central | 9457.710 |
12524 | Pacific | 15199.859 |
* Data Source: Residential Energy Consumption Survey (RECS), U.S. Energy Information Administration (accessed Nov. 15th, 2021)
You may already be familiar with talking about the parts of this table as rows and columns. In Data Analytics, one goal of preparing data for subsequent analysis is to get the data into a table (or DataFrame, if using Pandas), such that each row represents a case or unit, and each column represents a variable:
- Rows: cases or units
- Columns: variables
Let's further define these new terms:
Cases or units: the subjects or objects for which we have data
Variable: any characteristic that is recorded for the cases
So, in the example table above, the cases are individual homes (identified by HOME ID), and the variables are DIVISION (geographic region in the U.S.) and KWH (electricity usage).
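As a minimal sketch of this layout in Pandas, the table above can be typed in as a DataFrame. The variable name homes is an assumption made for illustration, as is the choice to use HOME ID as the row index, since it identifies the cases rather than describing them.

```python
import pandas as pd

# Values copied from the table above. HOME ID identifies each case,
# so it is used as the row index rather than as a variable.
homes = pd.DataFrame(
    {
        "HOME ID": [10460, 10787, 11055, 14870, 12200,
                    12228, 10934, 10731, 13623, 12524],
        "DIVISION": ["Pacific", "East North Central", "Mountain North",
                     "Pacific", "Mountain South", "South Atlantic",
                     "East South Central", "Middle Atlantic",
                     "East North Central", "Pacific"],
        "KWH": [3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
                23645.160, 19123.754, 3982.231, 9457.710, 15199.859],
    }
).set_index("HOME ID")

# shape is (number of rows, number of columns), i.e. (cases, variables)
n_cases, n_variables = homes.shape
print(f"{n_cases} cases (rows), {n_variables} variables (columns)")
print(list(homes.columns))
```

With HOME ID as the index, the DataFrame has ten cases and two variables, DIVISION and KWH.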
Types of Variables
For this course, it is important to distinguish between two types of variables:
Categorical variable: divides the cases into groups, placing each case into exactly one of two or more categories
Quantitative variable: measures or records a numerical quantity for each case
This distinction is important because the type of a variable determines which analyses and methods can be applied to it. Following the example table above, DIVISION is a categorical variable and KWH is a quantitative variable (since it reports the quantity of energy used by each house).
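To illustrate how the variable type shapes the analysis, here is a short sketch using the values from the table above. The names homes, division_counts, and mean_kwh are assumptions for illustration. Counting cases per category suits a categorical variable, while an arithmetic summary such as the mean suits a quantitative one.

```python
import pandas as pd

# Values copied from the table above
homes = pd.DataFrame(
    {
        "DIVISION": ["Pacific", "East North Central", "Mountain North",
                     "Pacific", "Mountain South", "South Atlantic",
                     "East South Central", "Middle Atlantic",
                     "East North Central", "Pacific"],
        "KWH": [3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
                23645.160, 19123.754, 3982.231, 9457.710, 15199.859],
    }
)

# Categorical: count how many cases fall into each category
division_counts = homes["DIVISION"].value_counts()
print(division_counts)

# Quantitative: arithmetic summaries such as the mean are meaningful
mean_kwh = homes["KWH"].mean()
print(round(mean_kwh, 3))
```

Taking the mean of DIVISION would not make sense, which is one reason the distinction matters.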
Yet another way to distinguish variables is based on their purpose or role in the intended analysis, if the analysis is looking at relationships between two or more variables:
Response variable: the variable for which an understanding, inference, or prediction is desired
Explanatory variable: used to do the understanding, inference, or prediction of the response variable
So, for example, if we hypothesized that geographic location somehow affects electricity usage, then KWH would be the response variable and DIVISION would be the explanatory variable. In other words, we would want to see whether or not DIVISION explains how KWH responds. Furthermore, these roles can be swapped: if the hypothesis were that electricity usage can identify the region in which a home is located, then KWH would be the explanatory variable and DIVISION the response. In other words, KWH could, perhaps, explain the DIVISION.
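A first look at the hypothesis that DIVISION explains KWH could be sketched as below, by grouping on the explanatory variable and summarizing the response within each group. The names homes and mean_by_division are assumptions for illustration, and with only ten homes the result is a demonstration of the mechanics, not evidence for or against the hypothesis.

```python
import pandas as pd

# Values copied from the table above
homes = pd.DataFrame(
    {
        "DIVISION": ["Pacific", "East North Central", "Mountain North",
                     "Pacific", "Mountain South", "South Atlantic",
                     "East South Central", "Middle Atlantic",
                     "East North Central", "Pacific"],
        "KWH": [3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
                23645.160, 19123.754, 3982.231, 9457.710, 15199.859],
    }
)

# Group by the explanatory variable (DIVISION) and
# summarize the response variable (KWH) within each group
mean_by_division = homes.groupby("DIVISION")["KWH"].mean()
print(mean_by_division)
```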
Data Types in Python
There are many data types in Python and Pandas. Let's focus on some of the common ones that are also relevant for this course.
Any categorical variable should be represented by the data type:
category: contains a finite list of text values
This is not to be confused with:
str or object: string (text; characters), which does not necessarily define categories
Quantitative variables will typically show up as:
int or int64: integers (whole numbers)
float or float64: floating-point numbers (decimal; continuous valued)
Note that categories could just as well be labeled with numbers as they could with text, so just because a variable has numbers in it does not necessarily mean that it is quantitative.
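As a small sketch of this point, suppose the divisions had been recorded as numeric codes. The coding scheme and the column name DIVISION_CODE below are hypothetical; only the KWH values come from the table above. Pandas reads the codes as integers, so the column has to be marked as categorical explicitly.

```python
import pandas as pd

# Hypothetical coding: 1 = Pacific, 2 = East North Central, 3 = Mountain North
coded = pd.DataFrame(
    {
        "DIVISION_CODE": [1, 2, 3, 1],
        "KWH": [3491.900, 6195.942, 6976.000, 10979.658],
    }
)

# Pandas infers an integer type, even though the codes are only labels
print(coded["DIVISION_CODE"].dtype)

# Mark the column as categorical so it is not treated as a quantity
coded["DIVISION_CODE"] = coded["DIVISION_CODE"].astype("category")
print(coded["DIVISION_CODE"].dtype)
print(coded["DIVISION_CODE"].cat.categories.tolist())
```

An average of the original integer codes could be computed, but it would have no meaning, since the numbers are only labels.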
Lastly, one other Pandas data type that will come up in the course, because it is especially important when analyzing energy generation and consumption, is:
datetime64: a calendar date and/or time
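For instance, timestamps stored as text can be converted to this type with pd.to_datetime. The sketch below uses made-up hourly readings; the names readings, timestamp, and kwh and all of the values are hypothetical.

```python
import pandas as pd

# Hypothetical hourly meter readings, with timestamps stored as text
readings = pd.DataFrame(
    {
        "timestamp": ["2021-11-15 00:00", "2021-11-15 01:00", "2021-11-15 02:00"],
        "kwh": [0.8, 0.6, 0.7],
    }
)

# Convert the text column to a datetime64 column
readings["timestamp"] = pd.to_datetime(readings["timestamp"])
print(readings["timestamp"].dtype)

# Once converted, date and time components are available through .dt
hours = readings["timestamp"].dt.hour.tolist()
print(hours)
```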
Watch It: Video - Investigating and Naming Columns in Datasets (13:43 minutes)
Try It: DataCamp: Identify Data Types
The Pandas attribute for identifying data types is dtype (or dtypes for every column of a DataFrame at once). See it in action here:
We can see that house and state are both object (text), whereas monthly energy usage is int (numerical). We've previously identified state as a categorical variable, but it is not showing up as such in the code output above. In order to tell Python to recognize state as a categorical variable, we need to add another line of code:
The astype method can be used to convert to other data types as well (for example, from int to float).
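The interactive exercise is not reproduced here, so the following is only a minimal sketch of the same steps. The DataFrame usage, its column names, and its values are hypothetical stand-ins for the house, state, and monthly energy usage columns described above. Depending on the Pandas version, text columns may be reported as object or as a dedicated string type.

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration
usage = pd.DataFrame(
    {
        "house": ["A", "B", "C"],
        "state": ["PA", "NY", "PA"],
        "monthly_kwh": [650, 720, 810],
    }
)

# dtypes (plural) lists the data type of every column
print(usage.dtypes)

# dtype (singular) reports the type of a single column
print(usage["monthly_kwh"].dtype)

# astype returns a converted copy, so assign it back to the column
usage["state"] = usage["state"].astype("category")
usage["monthly_kwh"] = usage["monthly_kwh"].astype(float)
print(usage.dtypes)
```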
A Note on Data Collection
Although the table above is simple enough to illustrate these terms and types of variables, it is a poor dataset for investigating the sort of relationship hypothesized above. The sample size is too small: we've only sampled ten houses out of a tremendously larger population, so we do not expect to have adequate representation of the whole population of houses. We will talk more about sample size in the coming lessons, as this is a very important aspect of conducting experiments and surveys ("Data Collection"). Usually, larger sample sizes incur larger costs, but you need a sufficiently large sample to ensure adequate representation.
Besides being large enough, a sample should also be unbiased. In other words, we do not want our data to disproportionately represent one group in the population, which is called "sampling bias". There are other forms of bias, but let's limit our discussion to sampling bias for now. The main way to combat this form of bias is to collect a random sample.
Random sample: each case in the population has an equal chance of being selected in a sample of size n cases. The goal is to avoid sampling bias.
Therefore, when examining a dataset like the one in the table above, it is important to know how the data were collected.
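As a sketch of the idea, the Pandas sample method draws rows at random so that each case is equally likely to be selected. Here the ten homes from the table above stand in for a "population" purely for illustration; the names population and sample and the choices n=4 and random_state=42 are assumptions.

```python
import pandas as pd

# Treat the ten homes from the table above as a stand-in population
population = pd.DataFrame(
    {
        "HOME ID": [10460, 10787, 11055, 14870, 12200,
                    12228, 10934, 10731, 13623, 12524],
        "KWH": [3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
                23645.160, 19123.754, 3982.231, 9457.710, 15199.859],
    }
)

# Draw a random sample of n = 4 cases without replacement.
# random_state fixes the random seed so the draw is reproducible.
sample = population.sample(n=4, random_state=42)
print(sample)
```

In practice, the randomization has to happen when the data are collected; drawing randomly from a dataset that was itself gathered in a biased way does not remove the bias.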
Assess It: Check Your Knowledge
Knowledge Check
Use the table below to answer the following questions.
Make | Model | Type | City MPG |
---|---|---|---|
Audi | A4 | Sport | 18 |
BMW | X1 | SUV | 17 |
Chevy | Tahoe | SUV | 10 |
Chevy | Camaro | Sport | 13 |
Honda | Odyssey | Minivan | 14 |