Data Manipulation: Merging DataFrames

Read It: Merging DataFrames

Often, one can perform more interesting and meaningful analyses by using larger datasets, with more variables and/or cases. One way of creating new, sizeable DataFrames is by merging multiple smaller DataFrames. The merge function in Pandas makes the task or merging two DataFrames (let's call them LeftDataFrame and RightDataFrame) relatively easy:

1	`LeftDataFrame.merge(RightDataFrame, how='inner', on=key)`

Here, we are just focusing on just the main inputs to the merge function, and one can view the documentation(link is external) to see the other optional inputs. To start, "key" is a variable (or set of variables) common to both DataFrames, and that we want to use as the basis for merging the two DataFrames. Usually, "key" identifies the unique cases in each DataFrame. The argument "how" indicates which rows to keep from the DataFrames. To explain this further, let's examine each of the four possible methods for "how"

1.) how = 'inner'

To begin, consider both DataFrames as a Venn diagram, where each circle represents the values of the "key" variable contained within that DataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "Inner." Results described in text below.

Merging Datasets Figure 1

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0(link is external)

The 'inner' merge method will only keep the key values that are common to both DataFrames, or, in other words, the intersection of the two circles. The following cartoon depicts how this method would work on two example DataFrames:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, and 3 and VAR_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows only keys 1 and 2 and Vars x1, x2 and y1, y1. Results described in text below.

Merging Datasets Figure 2

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0(link is external)

Note that key values of 3 and 4 are dropped because 3 is not contained in RightDataFrame and 4 is not contained in LeftDataFrame. Only 1 and 2 are retained since they are common to both DataFrames. Further, note that the resultant DataFrame contains all variables (columns) from each of the two input DataFrames (and "key" is not repeated).

2.) how = 'outer'

In contrast to the 'inner' method, 'outer' keeps all the key values from both DataFrames, or, in other words, the union of the two circles in the Venn diagram:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "Outer." Results described in text above.

Merging Datasets Figure 3

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0(link is external)

With 'outer' applied, our cartoon with the two example DataFrames now looks like this:

Merging Datasets Figure 4

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0(link is external)

Now key values of 3 and 4 are retained, and NA values are inserted where either Var_X or Var_Y are not observed.

As a special use case, if one wishes to simply "stack" two DataFrames (say, with identical variable names but different rows/keys), they can use 'outer' and leave the on argument blank (defaults to all columns).

3.) how = 'left'

The 'left' method keeps all the rows with keys from the LeftDataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "left." Results described in text above.

Merging Datasets Figure 5

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0(link is external)

And inserts NA values where variables from the RightDataFrame are not observed:

Two dataframes being merged: LeftDataFrame, depicting keys 1, 2, 3 and VAR_X x1, x2, and x3. RightDataFrame depicting keys 1, 2, 4 and Var_Y y1, y2, y3. Merged shows keys 1, 2, 3 and Var_X x1, x2, x3 and Var_Y y1, y2, NA. Results described in text above.

Merging Datasets Figure 6

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0(link is external)

4.) how = 'right'

In contrast, the 'right' method keeps all the rows with keys from the RightDataFrame:

Venn Diagram: two circles; first labeled "LeftDataFrame," second labeled "RightDataFrame." The circles overlap in a central portion labeled "right." Results described in text above.

Merging Datasets Figure 7

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0(link is external)

And inserts NA values where variables from the LeftDataFrame are not observed:

Merging Datasets Figure 8

Watch It: Video - Datatypes in Python (13:43 minutes)

Click here for a transcript.

Hi. In this video we're going to cover merging data frames. So, we're going to continue with our state energy ranking data from the Energy Information Administration. In this video, we're going to merge the natural gas data frame ngdf, with the total energy data frame tedf. Let's exemplify this first with the inner merge. This is the default method. And we'll do this by creating a new data frame, which is called DF, generically PD. To reference pandas, merge these two objects: ngdf, tedf. We want to merge the natural gas with the total energy. We also need to specify on which variable to merge. So, let's do this by state, that is, it's going to match Texas and natural gas with Texas, and total energy Pennsylvania in the first, Pennsylvania the second, and so on, and so forth. So it's going to pair them off on a basis of state. Let's see what this looks like. Here we go. So we've got first the rankings. We've got rank underscore X. That's interesting. What does that mean? Well, these are the rankings from the natural gas data frame. That's our first data frame. So it gives it the subscript X as opposed to rank Y. So this is the rank from the second one. Note that these two variables were called the same thing. So it automatically appended these X and the Y to distinguish the two. Then we have state. This was our variable, our key variable, that we're performing the merge on. So we have all the states here, and then we've got our natural gas and our total energy. And now this miraculous function has automatically paired the values according to what state they belong to.

However, there's a problem with this. We scroll down, we don't have all the states. This stops at 33. Note that the indexes for this start at zero. So we've got 33 states here. There really ought to be 51, 50 states, plus DC. So what happened? Why'd you get rid of those? Well it only kept the states that were equal in natural gas and total energy. Recall that in the natural gas data frame we had some states that had their full names in there. Let's try this a different way. Let's try it with the outer join. So say it's DF again PDF for pandas dot merge. And again, we'll have our ngdf, or tedf, on equals state. And now we'll say, how is outer. So we're going to do an outer merge. Let's see what result we get here. So we still have our rank X, our rank y, that all looks the same. But now we've got down to 69 states. Well, that's strange. That's too many. I'll click on this thing to just get a better inspection of the output here. What's going on? If we scroll down, we need to display some more values here. We see that we've got these states by name here, that don't have any data, no natural gas production, no total energy production. And then we've got their corresponding acronyms. So how did this happen? Well, the natural gas production data set had these states listed out in their full name because they didn't produce any natural gas, so it listed them differently for whatever reason. Whereas the total energy data set does have data for these states and uses their two-letter abbreviation. And the merge function can't associate these two things because it's just not that intelligent. It doesn't recognize that Hawaii is abbreviated HI. We would need to tell it that. So, what do we do? Well, this will be for the next video, logic and sub setting. We'll solve that problem there. For now, that's it for merging.

Try It: DataCamp - Merge Two DataFrames

In the exercise below, you are given two DataFrames:

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

HOME ID	CLIMATE	TOTAL SQUARE FEET
13623	Cold/Very Cold	1881
11055	Cold/Very Cold	2455
10460	Marine	800
12200	Hot-Dry/Mixed-Dry	3800
12524	Marine	1110
10332	Hot-Dry/Mixed-Dry	1630
12106	Cold/Very Cold	5184
11718	Mixed-Humid	1812
14500	Marine	2195
13193	Marine	847

* Data Source: Residential Energy Consumption Survey (RECS)(link is external), U.S. Energy Information Administration (accessed Nov. 15th, 2021)

Note that these two DataFrames share some of the same cases, identified by "HOME ID" (a numeric idenifier unique to each home). In the exercise below, start by seeing what happens when you perform an 'inner' merge:

How does the resulting DataFrame change when you peform an 'outer' merge? A 'left' merge? A 'right' merge? Pay attention to which rows are kept and where NA values are inserted.

Read It: Merging DataFrames

Watch It: Video - Datatypes in Python (13:43 minutes)

Try It: DataCamp - Merge Two DataFrames

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Navigation

EMS

Programs

Related Links