EME 210
Data Analytics for Energy Systems

Data Manipulation: Merging DataFrames

PrintPrint

  Read It: Merging DataFrames

Often, one can perform more interesting and meaningful analyses by using larger datasets, with more variables and/or cases. One way of creating new, sizeable DataFrames is by merging multiple smaller DataFrames. The merge function in Pandas makes the task or merging two DataFrames (let's call them LeftDataFrame and RightDataFrame) relatively easy:

LeftDataFrame.merge(RightDataFrame, how='inner', on=key)

Here, we are just focusing on just the main inputs to the merge function, and one can view the documentation to see the other optional inputs. To start, "key" is a variable (or set of variables) common to both DataFrames, and that we want to use as the basis for merging the two DataFrames. Usually, "key" identifies the unique cases in each DataFrame. The argument "how" indicates which rows to keep from the DataFrames. To explain this further, let's examine each of the four possible methods for "how"

1.) how = 'inner'

To begin, consider both DataFrames as a Venn diagram, where each circle represents the values of the "key" variable contained within that DataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "Inner." Results described in text below.
Merging Datasets Figure 1
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

The 'inner' merge method will only keep the key values that are common to both DataFrames, or, in other words, the intersection of the two circles. The following cartoon depicts how this method would work on two example DataFrames:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, and 3 and VAR_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows only keys 1 and 2 and Vars x1, x2 and y1, y1. Results described in text below.
Merging Datasets Figure 2
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

Note that key values of 3 and 4 are dropped because 3 is not contained in RightDataFrame and 4 is not contained in LeftDataFrame. Only 1 and 2 are retained since they are common to both DataFrames. Further, note that the resultant DataFrame contains all variables (columns) from each of the two input DataFrames (and "key" is not repeated).

2.) how = 'outer'

In contrast to the 'inner' method, 'outer' keeps all the key values from both DataFrames, or, in other words, the union of the two circles in the Venn diagram:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "Outer." Results described in text above.
Merging Datasets Figure 3
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

With 'outer' applied, our cartoon with the two example DataFrames now looks like this:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, and 3 and VAR_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows keys 1, 2, 3, 4 and Var_X  x1, x2, x3, NA and Var_Y y1, y2, NA, y3. Results described in text below.
Merging Datasets Figure 4
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

Now key values of 3 and 4 are retained, and NA values are inserted where either Var_X or Var_Y are not observed.

As a special use case, if one wishes to simply "stack" two DataFrames (say, with identical variable names but different rows/keys), they can use 'outer' and leave the on argument blank (defaults to all columns). 

3.) how = 'left'

The 'left' method keeps all the rows with keys from the LeftDataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "left." Results described in text above.
Merging Datasets Figure 5
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

And inserts NA values where variables from the RightDataFrame are not observed:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, 3 and VAR_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows keys 1, 2, 3 and Var_X  x1, x2, x3 and Var_Y y1, y2, NA. Results described in text above.
Merging Datasets Figure 6
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

4.) how = 'right'

In contrast, the 'right' method keeps all the rows with keys from the RightDataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "right." Results described in text above.
Merging Datasets Figure 7
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

And inserts NA values where variables from the LeftDataFrame are not observed:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, 3 and VAR_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows keys 1, 2, 4 and Var_X  x1, x2, NA and Var_Y y1, y2, y3. Results described in text above.
Merging Datasets Figure 8
Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 

  Watch It: Video - Datatypes in Python (13:43 minutes)

Click here for a transcript.

Hi. In this video we're going to cover merging data frames. So, we're going to continue with our state energy ranking data from the Energy Information Administration. In this video, we're going to merge the natural gas data frame ngdf, with the total energy data frame tedf. Let's exemplify this first with the inner merge. This is the default method. And we'll do this by creating a new data frame, which is called DF, generically PD. To reference pandas, merge these two objects: ngdf, tedf. We want to merge the natural gas with the total energy. We also need to specify on which variable to merge. So, let's do this by state, that is, it's going to match Texas and natural gas with Texas, and total energy Pennsylvania in the first, Pennsylvania the second, and so on, and so forth. So it's going to pair them off on a basis of state. Let's see what this looks like. Here we go. So we've got first the rankings. We've got rank underscore X. That's interesting. What does that mean? Well, these are the rankings from the natural gas data frame. That's our first data frame. So it gives it the subscript X as opposed to rank Y. So this is the rank from the second one. Note that these two variables were called the same thing. So it automatically appended these X and the Y to distinguish the two. Then we have state. This was our variable, our key variable, that we're performing the merge on. So we have all the states here, and then we've got our natural gas and our total energy. And now this miraculous function has automatically paired the values according to what state they belong to.

However, there's a problem with this. We scroll down, we don't have all the states. This stops at 33. Note that the indexes for this start at zero. So we've got 33 states here. There really ought to be 51, 50 states, plus DC. So what happened? Why'd you get rid of those? Well it only kept the states that were equal in natural gas and total energy. Recall that in the natural gas data frame we had some states that had their full names in there. Let's try this a different way. Let's try it with the outer join. So say it's DF again PDF for pandas dot merge. And again, we'll have our ngdf, or tedf, on equals state. And now we'll say, how is outer. So we're going to do an outer merge. Let's see what result we get here. So we still have our rank X, our rank y, that all looks the same. But now we've got down to 69 states. Well, that's strange. That's too many. I'll click on this thing to just get a better inspection of the output here. What's going on? If we scroll down, we need to display some more values here. We see that we've got these states by name here, that don't have any data, no natural gas production, no total energy production. And then we've got their corresponding acronyms. So how did this happen? Well, the natural gas production data set had these states listed out in their full name because they didn't produce any natural gas, so it listed them differently for whatever reason. Whereas the total energy data set does have data for these states and uses their two-letter abbreviation. And the merge function can't associate these two things because it's just not that intelligent. It doesn't recognize that Hawaii is abbreviated HI. We would need to tell it that. So, what do we do? Well, this will be for the next video, logic and sub setting. We'll solve that problem there. For now, that's it for merging.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

 Try It: DataCamp - Merge Two DataFrames

In the exercise below, you are given two DataFrames:

HOME ID DIVISION KWH
10460 Pacific 3491.900
10787 East North Central 6195.942
11055 Mountain North 6976.000
14870 Pacific 10979.658
12200 Mountain South 19472.628
12228 South Atlantic 23645.160
10934 East South Central 19123.754
10731 Middle Atlantic 3982.231
13623 East North Central 9457.710
12524 Pacific 15199.859
HOME ID CLIMATE TOTAL SQUARE FEET
13623 Cold/Very Cold 1881
11055 Cold/Very Cold 2455
10460 Marine 800
12200 Hot-Dry/Mixed-Dry 3800
12524 Marine 1110
10332 Hot-Dry/Mixed-Dry 1630
12106 Cold/Very Cold 5184
11718 Mixed-Humid 1812
14500 Marine 2195
13193 Marine 847

* Data Source: Residential Energy Consumption Survey (RECS)(link is external), U.S. Energy Information Administration (accessed Nov. 15th, 2021)

Note that these two DataFrames share some of the same cases, identified by "HOME ID" (a numeric idenifier unique to each home). In the exercise below, start by seeing what happens when you perform an 'inner' merge:

How does the resulting DataFrame change when you peform an 'outer' merge? A 'left' merge? A 'right' merge? Pay attention to which rows are kept and where NA values are inserted.


 Assess It: Check Your Knowledge

Knowledge Check

FAQ