
Read It: Merging DataFrames
Often, one can perform more interesting and meaningful analyses by using larger datasets, with more variables and/or cases. One way of creating new, sizeable DataFrames is by merging multiple smaller DataFrames. The merge function in Pandas makes the task or merging two DataFrames (let's call them LeftDataFrame and RightDataFrame) relatively easy:
LeftDataFrame.merge(RightDataFrame, how='inner', on=key)
Here, we are just focusing on just the main inputs to the merge function, and one can view the documentation to see the other optional inputs. To start, "key" is a variable (or set of variables) common to both DataFrames, and that we want to use as the basis for merging the two DataFrames. Usually, "key" identifies the unique cases in each DataFrame. The argument "how" indicates which rows to keep from the DataFrames. To explain this further, let's examine each of the four possible methods for "how"
1.) how = 'inner'
To begin, consider both DataFrames as a Venn diagram, where each circle represents the values of the "key" variable contained within that DataFrame:

The 'inner' merge method will only keep the key values that are common to both DataFrames, or, in other words, the intersection of the two circles. The following cartoon depicts how this method would work on two example DataFrames:

Note that key values of 3 and 4 are dropped because 3 is not contained in RightDataFrame and 4 is not contained in LeftDataFrame. Only 1 and 2 are retained since they are common to both DataFrames. Further, note that the resultant DataFrame contains all variables (columns) from each of the two input DataFrames (and "key" is not repeated).
2.) how = 'outer'
In contrast to the 'inner' method, 'outer' keeps all the key values from both DataFrames, or, in other words, the union of the two circles in the Venn diagram:

With 'outer' applied, our cartoon with the two example DataFrames now looks like this:

Now key values of 3 and 4 are retained, and NA values are inserted where either Var_X or Var_Y are not observed.
As a special use case, if one wishes to simply "stack" two DataFrames (say, with identical variable names but different rows/keys), they can use 'outer' and leave the on argument blank (defaults to all columns).
3.) how = 'left'
The 'left' method keeps all the rows with keys from the LeftDataFrame:

And inserts NA values where variables from the RightDataFrame are not observed:

4.) how = 'right'
In contrast, the 'right' method keeps all the rows with keys from the RightDataFrame:

And inserts NA values where variables from the LeftDataFrame are not observed:

Watch It: Video - Datatypes in Python (13:43 minutes)
Hi. In this video we're going to cover merging data frames. So, we're going to continue with our state energy ranking data from the Energy Information Administration. In this video, we're going to merge the natural gas data frame ngdf, with the total energy data frame tedf. Let's exemplify this first with the inner merge. This is the default method. And we'll do this by creating a new data frame, which is called DF, generically PD. To reference pandas, merge these two objects: ngdf, tedf. We want to merge the natural gas with the total energy. We also need to specify on which variable to merge. So, let's do this by state, that is, it's going to match Texas and natural gas with Texas, and total energy Pennsylvania in the first, Pennsylvania the second, and so on, and so forth. So it's going to pair them off on a basis of state. Let's see what this looks like. Here we go. So we've got first the rankings. We've got rank underscore X. That's interesting. What does that mean? Well, these are the rankings from the natural gas data frame. That's our first data frame. So it gives it the subscript X as opposed to rank Y. So this is the rank from the second one. Note that these two variables were called the same thing. So it automatically appended these X and the Y to distinguish the two. Then we have state. This was our variable, our key variable, that we're performing the merge on. So we have all the states here, and then we've got our natural gas and our total energy. And now this miraculous function has automatically paired the values according to what state they belong to.
However, there's a problem with this. We scroll down, we don't have all the states. This stops at 33. Note that the indexes for this start at zero. So we've got 33 states here. There really ought to be 51, 50 states, plus DC. So what happened? Why'd you get rid of those? Well it only kept the states that were equal in natural gas and total energy. Recall that in the natural gas data frame we had some states that had their full names in there. Let's try this a different way. Let's try it with the outer join. So say it's DF again PDF for pandas dot merge. And again, we'll have our ngdf, or tedf, on equals state. And now we'll say, how is outer. So we're going to do an outer merge. Let's see what result we get here. So we still have our rank X, our rank y, that all looks the same. But now we've got down to 69 states. Well, that's strange. That's too many. I'll click on this thing to just get a better inspection of the output here. What's going on? If we scroll down, we need to display some more values here. We see that we've got these states by name here, that don't have any data, no natural gas production, no total energy production. And then we've got their corresponding acronyms. So how did this happen? Well, the natural gas production data set had these states listed out in their full name because they didn't produce any natural gas, so it listed them differently for whatever reason. Whereas the total energy data set does have data for these states and uses their two-letter abbreviation. And the merge function can't associate these two things because it's just not that intelligent. It doesn't recognize that Hawaii is abbreviated HI. We would need to tell it that. So, what do we do? Well, this will be for the next video, logic and sub setting. We'll solve that problem there. For now, that's it for merging.
Try It: DataCamp - Merge Two DataFrames
In the exercise below, you are given two DataFrames:
HOME ID | DIVISION | KWH |
---|---|---|
10460 | Pacific | 3491.900 |
10787 | East North Central | 6195.942 |
11055 | Mountain North | 6976.000 |
14870 | Pacific | 10979.658 |
12200 | Mountain South | 19472.628 |
12228 | South Atlantic | 23645.160 |
10934 | East South Central | 19123.754 |
10731 | Middle Atlantic | 3982.231 |
13623 | East North Central | 9457.710 |
12524 | Pacific | 15199.859 |
HOME ID | CLIMATE | TOTAL SQUARE FEET |
---|---|---|
13623 | Cold/Very Cold | 1881 |
11055 | Cold/Very Cold | 2455 |
10460 | Marine | 800 |
12200 | Hot-Dry/Mixed-Dry | 3800 |
12524 | Marine | 1110 |
10332 | Hot-Dry/Mixed-Dry | 1630 |
12106 | Cold/Very Cold | 5184 |
11718 | Mixed-Humid | 1812 |
14500 | Marine | 2195 |
13193 | Marine | 847 |
* Data Source: Residential Energy Consumption Survey (RECS)(link is external), U.S. Energy Information Administration (accessed Nov. 15th, 2021)
Note that these two DataFrames share some of the same cases, identified by "HOME ID" (a numeric idenifier unique to each home). In the exercise below, start by seeing what happens when you perform an 'inner' merge:
How does the resulting DataFrame change when you peform an 'outer' merge? A 'left' merge? A 'right' merge? Pay attention to which rows are kept and where NA values are inserted.