The following short example demonstrates how pandas can be used to merge two data frames based on a common key, so to perform a join operation in database terms. Let’s say we have two tables, one listing capitals of states in the U.S. and one listing populations for each state. For simplicity we just define data frames for these with just entries for two states, Washington and Oregon:
df1 = pd.DataFrame( {'state': ['Washington', 'Oregon'], 'capital': ['Olympia', 'Salem']} ) print(df1) df2 = pd.DataFrame( {'name': ['Washington', 'Oregon'], 'population':[7288000, 4093000]} ) print(df2)
The two data frames produced by this code look like this:
capital | state | |
---|---|---|
0 | Olympia | Washington |
1 | Salem | Oregon |
name | population | |
---|---|---|
0 | Washington | 7288000 |
1 | Oregon | 4093000 |
We here use a new way of creating a data frame, namely from a dictionary that has entries for each of the columns where the keys are the column names (‘state’ and ‘capital’ in the case of df1, and ‘name’ and ‘population’ in case of df2) and the values stored are lists of the values for that column. We now invoke the merge(...) method on df1 and provide df2 as the first parameter meaning that a new data frame will be produced by merging df1 and df2. We further have to specify which columns should be used as keys for the join operation. Since the two columns containing the state names are called differently, we have to provide the name for df1 through the ‘left_on’ argument and the name for df2 through the ‘right_on’ argument.
merged = df1.merge(df2, left_on='state', right_on='name') merged
The joined data frame produced by the code will look like this:
capital | state | name | population | |
---|---|---|---|---|
0 | Olympia | Washington | Washington | 7288000 |
1 | Salem | Oregon | Oregon | 4093000 |
As you see, the data frames have been merged correctly. However, we do not want two columns with the state names, so, as a last step, we remove the column called ‘name’ with the previously mentioned drop(...) method. As explained, we have to use the keyword argument axis=1 to indicate that we want to drop a column, not a row.
newMerged = merged.drop('name', axis=1) newMerged
Result:
capital | state | population | |
---|---|---|---|
0 | Olympia | Washington | 7288000 |
1 | Salem | Oregon | 4093000 |
If you print out variable merged, you will see that it still contains the 'name' column. That is because drop(...) doesn't change the original data frame but rather produces a copy with the column/row removed.