SoFunction
Updated on 2025-03-10

Implementation of Pandas merge merge operation

The merge() function provided by Pandas can perform efficient merge operations, which is very similar to the join usage of SQL relational databases. It is not difficult to understand literally. Merge translates as "merge", which refers to joining two DataFrame data tables according to specified rules and finally splicing them into a new DataFrame data table.

The method format of the merge() function is as follows:

(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes=('_x', '_y'), copy=True)

Parameter descriptions are shown in the following table:

Parameter name illustrate
left/right Two different DataFrame objects.
on Specify the key used for joining (i.e. the name of the column label). This key must exist in the left and right DataFrames at the same time. If it is not specified and other parameters are not specified, the intersection of the column names of the two DataFrames will be used as the joining key.
left_on Specifies the column name in the DataFrame on the left that serves as the connection key. This parameter is very useful when the label names of the left and right columns are different, but the meanings of the expression are the same.
right_on Specifies the column name in the DataFrame on the left that serves as the connection key.
left_index Boolean parameter, default to False. If True, use the row index of the DataFrame on the left as the join key. If the DataFrame has a multi-layer index (MultiIndex), the number of layers must be equal to the number of join keys.
right_index Boolean parameter, default to False. If True, use the row index of the DataFrame on the left as the join key.
how The merge type to be performed takes the value from {'left', 'right', 'outer', 'inner'}, and defaults to the "inner" connection.
sort Boolean parameter, default to True, it sorts the merged data; if set to False, sorts according to the parameter value given by how.
suffixes Tuples of strings. When the same column name exists in the left and right DataFrame, this parameter can append the suffix name to the same column name, which defaults to ('_x','_y').
copy The default is True, which means copying the data.

Note: The merge() of the Pandas library supports various internal and external connections, and similarly to it is the join() function (the default is left connection).

Prepare two different DataFrames, read from the file here:

document:

EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902.0,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30
7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30
7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20
7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30
7698,BLAKE,MANAGER,7839.0,1981-05-01,2850,,30
7782,CLARK,MANAGER,7839.0,1981-06-09,2450,,10
7788,SCOTT,ANALYST,7566.0,1987-04-19,3000,,20
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7844,TURNER,SALESMAN,7698.0,1981-09-08,1500,0.0,30
7876,ADAMS,CLERK,7788.0,1987-05-23,1100,,20
7900,JAMES,CLERK,7698.0,1981-12-03,950,,30
7902,FORD,ANALYST,7566.0,1981-12-03,3000,,20
7934,MILLER,CLERK,7782.0,1982-01-23,1300,,10

document:

DEPTNO,DNAME,LOC
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

Here are two different DataFrames and merge them:

import pandas as pd
import numpy as np

df_emp  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')

print("DF data in employee table:\n",df_emp)
print("DF data in department table:\n",df_dept)

The output is as follows:

DF data in employee table:
     EMPNO   ENAME        JOB     MGR    HIREDATE   SAL    COMM  DEPTNO
0    7369   SMITH      CLERK  7902.0  1980-12-17   800     NaN      20
1    7499   ALLEN   SALESMAN  7698.0  1981-02-20  1600   300.0      30
2    7521    WARD   SALESMAN  7698.0  1981-02-22  1250   500.0      30
3    7566   JONES    MANAGER  7839.0  1981-04-02  2975     NaN      20
4    7654  MARTIN   SALESMAN  7698.0  1981-09-28  1250  1400.0      30
5    7698   BLAKE    MANAGER  7839.0  1981-05-01  2850     NaN      30
6    7782   CLARK    MANAGER  7839.0  1981-06-09  2450     NaN      10
7    7788   SCOTT    ANALYST  7566.0  1987-04-19  3000     NaN      20
8    7839    KING  PRESIDENT     NaN  1981-11-17  5000     NaN      10
9    7844  TURNER   SALESMAN  7698.0  1981-09-08  1500     0.0      30
10   7876   ADAMS      CLERK  7788.0  1987-05-23  1100     NaN      20
11   7900   JAMES      CLERK  7698.0  1981-12-03   950     NaN      30
12   7902    FORD    ANALYST  7566.0  1981-12-03  3000     NaN      20
13   7934  MILLER      CLERK  7782.0  1982-01-23  1300     NaN      10
DF data in department table:
    DEPTNO       DNAME       LOC
0      10  ACCOUNTING  NEW YORK
1      20    RESEARCH    DALLAS
2      30       SALES   CHICAGO
3      40  OPERATIONS    BOSTON

1) Merge operations on a single key

Specify a connection key through the on parameter, and then merge the above DataFrame:

import pandas as pd
import numpy as np

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Specify the merged primary key via onprint((df_emp_left,df_dept_right,on="DEPTNO"))

Output result:

    EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0    7782   CLARK    MANAGER  7839.0  ...     NaN      10  ACCOUNTING  NEW YORK
1    7839    KING  PRESIDENT     NaN  ...     NaN      10  ACCOUNTING  NEW YORK
2    7934  MILLER      CLERK  7782.0  ...     NaN      10  ACCOUNTING  NEW YORK
3    7369   SMITH      CLERK  7902.0  ...     NaN      20    RESEARCH    DALLAS
4    7566   JONES    MANAGER  7839.0  ...     NaN      20    RESEARCH    DALLAS
5    7788   SCOTT    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
6    7876   ADAMS      CLERK  7788.0  ...     NaN      20    RESEARCH    DALLAS
7    7902    FORD    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
8    7499   ALLEN   SALESMAN  7698.0  ...   300.0      30       SALES   CHICAGO
9    7521    WARD   SALESMAN  7698.0  ...   500.0      30       SALES   CHICAGO
10   7654  MARTIN   SALESMAN  7698.0  ...  1400.0      30       SALES   CHICAGO
11   7698   BLAKE    MANAGER  7839.0  ...     NaN      30       SALES   CHICAGO
12   7844  TURNER   SALESMAN  7698.0  ...     0.0      30       SALES   CHICAGO
13   7900   JAMES      CLERK  7698.0  ...     NaN      30       SALES   CHICAGO

[14 rows x 10 columns]

2) Merge operations on multiple keys

The following example is to specify multiple keys to merge the above two DataFrame objects: (custom data is used here)

import pandas as pd
left = ({
   'id':[1,2,3,4],
   'Name': ['Smith', 'Maiki', 'Hunter', 'Hilen'],
   'subject_id':['sub1','sub2','sub4','sub6']})
right = ({
    'id':[1,2,3,4],
   'Name': ['Bill', 'Lucy', 'Jack', 'Mike'],
   'subject_id':['sub2','sub4','sub3','sub6']})
print((left,right,on=['id','subject_id']))

Output result:

   id Name_x subject_id Name_y
0   4  Hilen       sub6   Mike

Merge using how parameters

passhowParameters can determine which keys to include in the DataFrame. If the keys that are not stored in the left and right tables, the corresponding value of the key after the merge is NaN. To facilitate everyone's learning, we summarized the how parameter and equivalent SQL statements:

Merge method SQL equivalent describe
left LEFT OUTER JOIN Use the key of the object on the left
right RIGHT OUTER JOIN Use the key of the object on the right
outer FULL OUTER JOIN Use union of all keys on the left and right sides
inner INNER JOIN Use the intersection of left and right keys

1) left join

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# For demonstration, the department of employee 7369 was deleted here and run as follows:print((df_emp_left,df_dept_right,on="DEPTNO",how="left"))

Output result: It can be found that DEPTNO, DNAME, and LOC of 7369 are all NaN

    EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0    7369   SMITH      CLERK  7902.0  ...     NaN     NaN         NaN       NaN
1    7499   ALLEN   SALESMAN  7698.0  ...   300.0    30.0       SALES   CHICAGO
2    7521    WARD   SALESMAN  7698.0  ...   500.0    30.0       SALES   CHICAGO
3    7566   JONES    MANAGER  7839.0  ...     NaN    20.0    RESEARCH    DALLAS
4    7654  MARTIN   SALESMAN  7698.0  ...  1400.0    30.0       SALES   CHICAGO
5    7698   BLAKE    MANAGER  7839.0  ...     NaN    30.0       SALES   CHICAGO
6    7782   CLARK    MANAGER  7839.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
7    7788   SCOTT    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
8    7839    KING  PRESIDENT     NaN  ...     NaN    10.0  ACCOUNTING  NEW YORK
9    7844  TURNER   SALESMAN  7698.0  ...     0.0    30.0       SALES   CHICAGO
10   7876   ADAMS      CLERK  7788.0  ...     NaN    20.0    RESEARCH    DALLAS
11   7900   JAMES      CLERK  7698.0  ...     NaN    30.0       SALES   CHICAGO
12   7902    FORD    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
13   7934  MILLER      CLERK  7782.0  ...     NaN    10.0  ACCOUNTING  NEW YORK

[14 rows x 10 columns]

2) right join

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Department 40 Because there are no employees, all the information corresponding to employees is NaNprint((df_emp_left,df_dept_right,on="DEPTNO",how="right"))

Output result: (Since department 40 has no employees, all the information corresponding to employees is NaN)

     EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0   7782.0   CLARK    MANAGER  7839.0  ...     NaN      10  ACCOUNTING  NEW YORK
1   7839.0    KING  PRESIDENT     NaN  ...     NaN      10  ACCOUNTING  NEW YORK
2   7934.0  MILLER      CLERK  7782.0  ...     NaN      10  ACCOUNTING  NEW YORK
3   7369.0   SMITH      CLERK  7902.0  ...     NaN      20    RESEARCH    DALLAS
4   7566.0   JONES    MANAGER  7839.0  ...     NaN      20    RESEARCH    DALLAS
5   7788.0   SCOTT    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
6   7876.0   ADAMS      CLERK  7788.0  ...     NaN      20    RESEARCH    DALLAS
7   7902.0    FORD    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
8   7499.0   ALLEN   SALESMAN  7698.0  ...   300.0      30       SALES   CHICAGO
9   7521.0    WARD   SALESMAN  7698.0  ...   500.0      30       SALES   CHICAGO
10  7654.0  MARTIN   SALESMAN  7698.0  ...  1400.0      30       SALES   CHICAGO
11  7698.0   BLAKE    MANAGER  7839.0  ...     NaN      30       SALES   CHICAGO
12  7844.0  TURNER   SALESMAN  7698.0  ...     0.0      30       SALES   CHICAGO
13  7900.0   JAMES      CLERK  7698.0  ...     NaN      30       SALES   CHICAGO
14     NaN     NaN        NaN     NaN  ...     NaN      40  OPERATIONS    BOSTON

[15 rows x 10 columns]

3) outer join (union)

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Department 40 Because there are no employees, all the information of the corresponding employees is NaN. And because 7369 does not have a corresponding department, all the information of the department is Nanprint((df_emp_left,df_dept_right,on="DEPTNO",how="outer"))

Output result:

     EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0   7369.0   SMITH      CLERK  7902.0  ...     NaN     NaN         NaN       NaN
1   7499.0   ALLEN   SALESMAN  7698.0  ...   300.0    30.0       SALES   CHICAGO
2   7521.0    WARD   SALESMAN  7698.0  ...   500.0    30.0       SALES   CHICAGO
3   7654.0  MARTIN   SALESMAN  7698.0  ...  1400.0    30.0       SALES   CHICAGO
4   7698.0   BLAKE    MANAGER  7839.0  ...     NaN    30.0       SALES   CHICAGO
5   7844.0  TURNER   SALESMAN  7698.0  ...     0.0    30.0       SALES   CHICAGO
6   7900.0   JAMES      CLERK  7698.0  ...     NaN    30.0       SALES   CHICAGO
7   7566.0   JONES    MANAGER  7839.0  ...     NaN    20.0    RESEARCH    DALLAS
8   7788.0   SCOTT    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
9   7876.0   ADAMS      CLERK  7788.0  ...     NaN    20.0    RESEARCH    DALLAS
10  7902.0    FORD    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
11  7782.0   CLARK    MANAGER  7839.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
12  7839.0    KING  PRESIDENT     NaN  ...     NaN    10.0  ACCOUNTING  NEW YORK
13  7934.0  MILLER      CLERK  7782.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
14     NaN     NaN        NaN     NaN  ...     NaN    40.0  OPERATIONS    BOSTON

[15 rows x 10 columns]

4) inner join (intersection)

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Department 40 has no employees, and because 7369 has no corresponding department, so department 40 and employee 7369 are not displayedprint((df_emp_left,df_dept_right,on="DEPTNO",how="inner"))

Output result:

    EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0    7499   ALLEN   SALESMAN  7698.0  ...   300.0    30.0       SALES   CHICAGO
1    7521    WARD   SALESMAN  7698.0  ...   500.0    30.0       SALES   CHICAGO
2    7654  MARTIN   SALESMAN  7698.0  ...  1400.0    30.0       SALES   CHICAGO
3    7698   BLAKE    MANAGER  7839.0  ...     NaN    30.0       SALES   CHICAGO
4    7844  TURNER   SALESMAN  7698.0  ...     0.0    30.0       SALES   CHICAGO
5    7900   JAMES      CLERK  7698.0  ...     NaN    30.0       SALES   CHICAGO
6    7566   JONES    MANAGER  7839.0  ...     NaN    20.0    RESEARCH    DALLAS
7    7788   SCOTT    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
8    7876   ADAMS      CLERK  7788.0  ...     NaN    20.0    RESEARCH    DALLAS
9    7902    FORD    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
10   7782   CLARK    MANAGER  7839.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
11   7839    KING  PRESIDENT     NaN  ...     NaN    10.0  ACCOUNTING  NEW YORK
12   7934  MILLER      CLERK  7782.0  ...     NaN    10.0  ACCOUNTING  NEW YORK

Note: When a and b perform inline operations (b) does not equal (a).

5) Use the join function

If the two tables left and rightThe primary key name is the same, you can use the join function.

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
print(df_emp_left.join(df_dept_right,how="outer",rsuffix='1'))

Running results:

    EMPNO   ENAME        JOB     MGR  ... DEPTNO  DEPTNO1       DNAME       LOC
0    7369   SMITH      CLERK  7902.0  ...    NaN     10.0  ACCOUNTING  NEW YORK
1    7499   ALLEN   SALESMAN  7698.0  ...   30.0     20.0    RESEARCH    DALLAS
2    7521    WARD   SALESMAN  7698.0  ...   30.0     30.0       SALES   CHICAGO
3    7566   JONES    MANAGER  7839.0  ...   20.0     40.0  OPERATIONS    BOSTON
4    7654  MARTIN   SALESMAN  7698.0  ...   30.0      NaN         NaN       NaN
5    7698   BLAKE    MANAGER  7839.0  ...   30.0      NaN         NaN       NaN
6    7782   CLARK    MANAGER  7839.0  ...   10.0      NaN         NaN       NaN
7    7788   SCOTT    ANALYST  7566.0  ...   20.0      NaN         NaN       NaN
8    7839    KING  PRESIDENT     NaN  ...   10.0      NaN         NaN       NaN
9    7844  TURNER   SALESMAN  7698.0  ...   30.0      NaN         NaN       NaN
10   7876   ADAMS      CLERK  7788.0  ...   20.0      NaN         NaN       NaN
11   7900   JAMES      CLERK  7698.0  ...   30.0      NaN         NaN       NaN
12   7902    FORD    ANALYST  7566.0  ...   20.0      NaN         NaN       NaN
13   7934  MILLER      CLERK  7782.0  ...   10.0      NaN         NaN       NaN

This is the end of this article about the implementation of Pandas merge merge operation. For more related Pandas merge merge content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!