Data Manipulation: Logic and Subsetting

The last section discussed how to build a new larger DataFrame by merging two DataFrames. When a large DataFrame has too much unnecessary data for the intended analysis, it is useful to subset it to a smaller DataFrame that only contains the relevant data. In this section, we'll discuss two common forms of subsetting: 1.) subsetting observations (rows), and 2.) subsetting variables (columns)

Read It: Subsetting

0.) Logic

Before we actually introduce how to subset a DataFrame, we need to first cover logic in Python. Logical statements in Python are also called "Boolean operators", and that is because the result of a logical statement is a Boolean variable, or a binary variable containing "True" (the logical condition is met) or "False" (the logical condition is not met). Forms of logic in Python are listed in the table below:

Code	Meaning
`<`	Less than
`>`	Greater than
`==`	Equals
`!=`	Does not equal
`<=`	Less than or equal to
`>=`	Greater than or equal to
`pd.isnull(obj)`	is Nan
`pd.notnull(obj)`	is not Nan
`&`	and
`!`	or
`~`	not
`df.any()`	any
`df.all()`	all

Try it: Logic

Note that the logical statement returns a vector of the same length as x, with a True in the position where the condition is met and a False in positions where the condition isn't met. Try out some of the other logical statements in the table above. How does the output change?

Since & and ! combine two logical statements, the logical statements that are being combined need to be enclosed in parentheses. For example, try print( (x > 3) & (x < 5) ) in the code above.

1.) Subsetting observations (rows)

These logical statements can then be used to subset DataFrames by selecting and returning only those rows that meet the conditions (yield True). This sort of filtering works by replacing x in the code above with a reference to a variable in the DataFrame, and enclosing the logical statement in [] immediately after the name of the DataFrame. The example below uses the table of household energy use from previous sections:

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

Try it: Subsetting

Suppose we only want to retain those houses (rows) that use more than 10,000 KWH. We can perform the following:

Another way to accomplish this filtering would be to do df.query('KWH > 10000'). Try out some other logical statements in the code above and see how the resulting DataFrame is subset. Note that Division is a categorical variable whose values are strings (text). Make sure that, when referring to these values, you put them in quotes: df[ df.Division == 'East North Central' ].

Some other useful functions for subsetting by rows include:

Code	Meaning
`df.head(n)`	Select first n rows (useful for understanding contents of a DataFrame).
`df.tail(n)`	Select last n rows.
`df.nlargest(n, 'variable')`	Select rows containing the largest n values in variable.
`df.nsmallest(n, 'variable')`	Select rows containing the smallest n values in variable.
`df.drop_duplicates()`	Remove rows with identical values.

2.) Subsetting variables (columns)

Unlike rows which can be identified by the values they contain, variables (columns) in a DataFrame are referred to by name, and these names should be unique. We've actually already seen subsetting by variable in action above, in the form of:

df.KWH

which can alternatively be written as

df['KWH']

Multiple columns can be selected by including a list in the square brackets: df[ ['HOME ID', 'KWH'] ].

As a side note, df.columns may be useful for finding the column names.

Watch It: Video - Roshambo (7:45 minutes)

Click here for a transcript.

All right. In this video we're going to continue on where we left off after merging our two state energy ranking data sets, the natural gas and total energy. And we're going to solve this issue of these repeated rows for different states through logical statements and sub setting. So, let's get to it. So the issue that we're having here is that we've got 18 states who have their full names here that don't have any data because they didn't produce natural gas and they weren't in the total energy data set. And we've got their two letter abbreviations coming from the total energy data set because they have the total energy produced data. So, what we want to do is get rid of basically all these rows that have the full names. You don't have any data attached to them. How do we do this? Well, we can do this using logical statements and filtering or subsetting our data frame. So, let's make a new data frame df2. And we'll start with df, this is the output of the outer merge done above in the previous video, and we're going to make use of our pd notnull function here. Because what we want to do is basically we want to keep all the rows that either have natural gas data, so do not have NaN in the natural gas column, or do not have NaN in the total energy column, or do not have any in in either column. In other words, we want to get rid of those rows that have any, and in both columns. So we will reference our natural gas column here, calling the data frame after the merge. So df square brackets natural gas. And let's just see how this notnull function works on the natural gas column alone. So let's go ahead and run this. And we'll print df2. Let me see how we have our columns there, but we only have 33 states represented here.

So these are our 33 states that produce natural gas from the natural gas data set. So we want to also include those states that had some total energy production even though they didn't have natural gas. So how do we do this? We say we want to keep those rows who have values in natural gas column, or those rows that have values or notnull in the total energy column. So, let's go ahead and run this. So we run this again. We saw the same columns. We scroll down and we now have all 50 states plus DC. I know this goes to 68 but that's just because the index, the indices here, have been broken. We've gotten rid of some of those rows in the middle that had some indices. If we want to reset that, we can just say df2 dot reset index, and we'll get our sequential numbering back. So we see now that these indices go down to 50. They start at zero, go down to 50, so there's 51. Again 50 states plus DC.

Furthermore, you'll note that we have our old indices as the second column here. We might want to get rid of that. We also might want to get rid of some other things here. For instance, you know these rankings don't really need them, we're just interested in these quantities. So if we want to filter by columns we can, we have some handy tools that are disposal here. So we can access values using in particular this look function for getting multiple columns or rows at the same time. So, for example, if we want to select one, or more than one column, let's make a new data frame, df3, now from our df dot loc. And we'll use colon to say all rows. So we're going to get all rows and then separated by a comma here we'll put in whatever columns we want so if we only want to keep say for example, state and total energy and natural gas we can enter them as a list. So we can close them in square brackets, separate these quoted values, text values, and commas here, df3. Let's see what we get. So, here's our data frame state, total energy, natural gas. Note that these columns are now in the order that I specified. So before we had natural gas coming before energy and I've switched that around just by the order that I've stated their names in here. And we've effectively gotten rid of those rankings and that old index here. So, now we have a much cleaner data set.

Furthermore, we can for example, with this look function select one or more rows. Conversely, we're not going to make use of this later on so we'll just say for example cases in the df3.loc we want to say, for example, the first five rows give you something like that specifies the rows. We need to say what columns we want. If we want to say all columns you could do that, the colon cases. And there we go, we have the first five rows. Note you can mix this up, you can do a combination of selecting different rows and different columns too if we wanted to, but I think you get the idea here. And then just last but not least, like to exemplify how to get, or easily display, I should say, the top energy producers here. So say for example, one of the top five total energy producers, we've got this handy function and largest where we say and largest five for the first five and then whatever variable we want the largest values from. And so here we see that the largest total energy producers are in, number one Texas, number two, Pennsylvania, number three, Wyoming, Oklahoma, and West Virginia, on down. So those are some handy tools to to get started with processing data changing data types, and in this video, subsetting the data into a clean data product. Okay, thank you.

Read It: Subsetting

0.) Logic

Try it: Logic

1.) Subsetting observations (rows)

2.) Subsetting variables (columns)

Watch It: Video - Roshambo (7:45 minutes)

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Navigation

EMS

Programs

Related Links