
The last section discussed how to build a new larger DataFrame by merging two DataFrames. When a large DataFrame has too much unnecessary data for the intended analysis, it is useful to subset it to a smaller DataFrame that only contains the relevant data. In this section, we'll discuss two common forms of subsetting: 1.) subsetting observations (rows), and 2.) subsetting variables (columns)
Read It: Subsetting
0.) Logic
Before we actually introduce how to subset a DataFrame, we need to first cover logic in Python. Logical statements in Python are also called "Boolean operators", and that is because the result of a logical statement is a Boolean variable, or a binary variable containing "True" (the logical condition is met) or "False" (the logical condition is not met). Forms of logic in Python are listed in the table below:
Code | Meaning |
---|---|
< |
Less than |
> |
Greater than |
== |
Equals |
!= |
Does not equal |
<= |
Less than or equal to |
>= |
Greater than or equal to |
pd.isnull(obj) |
is Nan |
pd.notnull(obj) |
is not Nan |
& |
and |
! |
or |
~ |
not |
df.any() |
any |
df.all() |
all |
Try it: Logic
Note that the logical statement returns a vector of the same length as x
, with a True
in the position where the condition is met and a False
in positions where the condition isn't met. Try out some of the other logical statements in the table above. How does the output change?
Since &
and !
combine two logical statements, the logical statements that are being combined need to be enclosed in parentheses. For example, try print( (x > 3) & (x < 5) )
in the code above.
1.) Subsetting observations (rows)
These logical statements can then be used to subset DataFrames by selecting and returning only those rows that meet the conditions (yield True
). This sort of filtering works by replacing x in the code above with a reference to a variable in the DataFrame, and enclosing the logical statement in [] immediately after the name of the DataFrame. The example below uses the table of household energy use from previous sections:
HOME ID | DIVISION | KWH |
---|---|---|
10460 | Pacific | 3491.900 |
10787 | East North Central | 6195.942 |
11055 | Mountain North | 6976.000 |
14870 | Pacific | 10979.658 |
12200 | Mountain South | 19472.628 |
12228 | South Atlantic | 23645.160 |
10934 | East South Central | 19123.754 |
10731 | Middle Atlantic | 3982.231 |
13623 | East North Central | 9457.710 |
12524 | Pacific | 15199.859 |
Try it: Subsetting
Suppose we only want to retain those houses (rows) that use more than 10,000 KWH. We can perform the following:
Another way to accomplish this filtering would be to do df.query('KWH > 10000')
. Try out some other logical statements in the code above and see how the resulting DataFrame is subset. Note that Division is a categorical variable whose values are strings (text). Make sure that, when referring to these values, you put them in quotes: df[ df.Division == 'East North Central' ]
.
Some other useful functions for subsetting by rows include:
Code | Meaning |
---|---|
df.head(n) |
Select first n rows (useful for understanding contents of a DataFrame). |
df.tail(n) |
Select last n rows. |
df.nlargest(n, 'variable') |
Select rows containing the largest n values in variable. |
df.nsmallest(n, 'variable') |
Select rows containing the smallest n values in variable. |
df.drop_duplicates() |
Remove rows with identical values. |
2.) Subsetting variables (columns)
Unlike rows which can be identified by the values they contain, variables (columns) in a DataFrame are referred to by name, and these names should be unique. We've actually already seen subsetting by variable in action above, in the form of:
df.KWH
which can alternatively be written as
df['KWH']
Multiple columns can be selected by including a list in the square brackets: df[ ['HOME ID', 'KWH'] ].
df.columns
may be useful for finding the column names.