Implementation of Pandas merge merge operation

The merge() function provided by Pandas can perform efficient merge operations, which is very similar to the join usage of SQL relational databases. It is not difficult to understand literally. Merge translates as "merge", which refers to joining two DataFrame data tables according to specified rules and finally splicing them into a new DataFrame data table.

The method format of the merge() function is as follows:

(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes=('_x', '_y'), copy=True)

Parameter descriptions are shown in the following table:

Parameter name	illustrate
left/right	Two different DataFrame objects.
on	Specify the key used for joining (i.e. the name of the column label). This key must exist in the left and right DataFrames at the same time. If it is not specified and other parameters are not specified, the intersection of the column names of the two DataFrames will be used as the joining key.
left_on	Specifies the column name in the DataFrame on the left that serves as the connection key. This parameter is very useful when the label names of the left and right columns are different, but the meanings of the expression are the same.
right_on	Specifies the column name in the DataFrame on the left that serves as the connection key.
left_index	Boolean parameter, default to False. If True, use the row index of the DataFrame on the left as the join key. If the DataFrame has a multi-layer index (MultiIndex), the number of layers must be equal to the number of join keys.
right_index	Boolean parameter, default to False. If True, use the row index of the DataFrame on the left as the join key.
how	The merge type to be performed takes the value from {'left', 'right', 'outer', 'inner'}, and defaults to the "inner" connection.
sort	Boolean parameter, default to True, it sorts the merged data; if set to False, sorts according to the parameter value given by how.
suffixes	Tuples of strings. When the same column name exists in the left and right DataFrame, this parameter can append the suffix name to the same column name, which defaults to ('_x','_y').
copy	The default is True, which means copying the data.

Note: The merge() of the Pandas library supports various internal and external connections, and similarly to it is the join() function (the default is left connection).

Prepare two different DataFrames, read from the file here:

document:

EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902.0,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30
7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30
7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20
7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30
7698,BLAKE,MANAGER,7839.0,1981-05-01,2850,,30
7782,CLARK,MANAGER,7839.0,1981-06-09,2450,,10
7788,SCOTT,ANALYST,7566.0,1987-04-19,3000,,20
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7844,TURNER,SALESMAN,7698.0,1981-09-08,1500,0.0,30
7876,ADAMS,CLERK,7788.0,1987-05-23,1100,,20
7900,JAMES,CLERK,7698.0,1981-12-03,950,,30
7902,FORD,ANALYST,7566.0,1981-12-03,3000,,20
7934,MILLER,CLERK,7782.0,1982-01-23,1300,,10

document:

DEPTNO,DNAME,LOC
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

Here are two different DataFrames and merge them:

import pandas as pd
import numpy as np

df_emp  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')

print("DF data in employee table:\n",df_emp)
print("DF data in department table:\n",df_dept)

The output is as follows:

DF data in employee table:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
DF data in department table:
DEPTNO DNAME LOC
0 10 ACCOUNTING NEW YORK
1 20 RESEARCH DALLAS
2 30 SALES CHICAGO
3 40 OPERATIONS BOSTON

1) Merge operations on a single key

Specify a connection key through the on parameter, and then merge the above DataFrame:

import pandas as pd
import numpy as np

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Specify the merged primary key via onprint((df_emp_left,df_dept_right,on="DEPTNO"))

Output result:

EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7782 CLARK MANAGER 7839.0 ... NaN 10 ACCOUNTING NEW YORK
1 7839 KING PRESIDENT NaN ... NaN 10 ACCOUNTING NEW YORK
2 7934 MILLER CLERK 7782.0 ... NaN 10 ACCOUNTING NEW YORK
3 7369 SMITH CLERK 7902.0 ... NaN 20 RESEARCH DALLAS
4 7566 JONES MANAGER 7839.0 ... NaN 20 RESEARCH DALLAS
5 7788 SCOTT ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
6 7876 ADAMS CLERK 7788.0 ... NaN 20 RESEARCH DALLAS
7 7902 FORD ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
8 7499 ALLEN SALESMAN 7698.0 ... 300.0 30 SALES CHICAGO
9 7521 WARD SALESMAN 7698.0 ... 500.0 30 SALES CHICAGO
10 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30 SALES CHICAGO
11 7698 BLAKE MANAGER 7839.0 ... NaN 30 SALES CHICAGO
12 7844 TURNER SALESMAN 7698.0 ... 0.0 30 SALES CHICAGO
13 7900 JAMES CLERK 7698.0 ... NaN 30 SALES CHICAGO

[14 rows x 10 columns]

2) Merge operations on multiple keys

The following example is to specify multiple keys to merge the above two DataFrame objects: (custom data is used here)

import pandas as pd
left = ({
   'id':[1,2,3,4],
   'Name': ['Smith', 'Maiki', 'Hunter', 'Hilen'],
   'subject_id':['sub1','sub2','sub4','sub6']})
right = ({
    'id':[1,2,3,4],
   'Name': ['Bill', 'Lucy', 'Jack', 'Mike'],
   'subject_id':['sub2','sub4','sub3','sub6']})
print((left,right,on=['id','subject_id']))

Output result:

id Name_x subject_id Name_y
0 4 Hilen sub6 Mike

Merge using how parameters

passhowParameters can determine which keys to include in the DataFrame. If the keys that are not stored in the left and right tables, the corresponding value of the key after the merge is NaN. To facilitate everyone's learning, we summarized the how parameter and equivalent SQL statements:

Merge method	SQL equivalent	describe
left	LEFT OUTER JOIN	Use the key of the object on the left
right	RIGHT OUTER JOIN	Use the key of the object on the right
outer	FULL OUTER JOIN	Use union of all keys on the left and right sides
inner	INNER JOIN	Use the intersection of left and right keys

1) left join

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# For demonstration, the department of employee 7369 was deleted here and run as follows:print((df_emp_left,df_dept_right,on="DEPTNO",how="left"))

Output result: It can be found that DEPTNO, DNAME, and LOC of 7369 are all NaN

EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7369 SMITH CLERK 7902.0 ... NaN NaN NaN NaN
1 7499 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
2 7521 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
3 7566 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
4 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
5 7698 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
6 7782 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
7 7788 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
8 7839 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
9 7844 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
10 7876 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
11 7900 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
12 7902 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
13 7934 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK

[14 rows x 10 columns]

2) right join

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Department 40 Because there are no employees, all the information corresponding to employees is NaNprint((df_emp_left,df_dept_right,on="DEPTNO",how="right"))

Output result: (Since department 40 has no employees, all the information corresponding to employees is NaN)

EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7782.0 CLARK MANAGER 7839.0 ... NaN 10 ACCOUNTING NEW YORK
1 7839.0 KING PRESIDENT NaN ... NaN 10 ACCOUNTING NEW YORK
2 7934.0 MILLER CLERK 7782.0 ... NaN 10 ACCOUNTING NEW YORK
3 7369.0 SMITH CLERK 7902.0 ... NaN 20 RESEARCH DALLAS
4 7566.0 JONES MANAGER 7839.0 ... NaN 20 RESEARCH DALLAS
5 7788.0 SCOTT ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
6 7876.0 ADAMS CLERK 7788.0 ... NaN 20 RESEARCH DALLAS
7 7902.0 FORD ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
8 7499.0 ALLEN SALESMAN 7698.0 ... 300.0 30 SALES CHICAGO
9 7521.0 WARD SALESMAN 7698.0 ... 500.0 30 SALES CHICAGO
10 7654.0 MARTIN SALESMAN 7698.0 ... 1400.0 30 SALES CHICAGO
11 7698.0 BLAKE MANAGER 7839.0 ... NaN 30 SALES CHICAGO
12 7844.0 TURNER SALESMAN 7698.0 ... 0.0 30 SALES CHICAGO
13 7900.0 JAMES CLERK 7698.0 ... NaN 30 SALES CHICAGO
14 NaN NaN NaN NaN ... NaN 40 OPERATIONS BOSTON

[15 rows x 10 columns]

3) outer join (union)

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Department 40 Because there are no employees, all the information of the corresponding employees is NaN. And because 7369 does not have a corresponding department, all the information of the department is Nanprint((df_emp_left,df_dept_right,on="DEPTNO",how="outer"))

Output result:

EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7369.0 SMITH CLERK 7902.0 ... NaN NaN NaN NaN
1 7499.0 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
2 7521.0 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
3 7654.0 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
4 7698.0 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
5 7844.0 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
6 7900.0 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
7 7566.0 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
8 7788.0 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
9 7876.0 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
10 7902.0 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
11 7782.0 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
12 7839.0 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
13 7934.0 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK
14 NaN NaN NaN NaN ... NaN 40.0 OPERATIONS BOSTON

[15 rows x 10 columns]

4) inner join (intersection)

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
# Department 40 has no employees, and because 7369 has no corresponding department, so department 40 and employee 7369 are not displayedprint((df_emp_left,df_dept_right,on="DEPTNO",how="inner"))

Output result:

EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7499 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
1 7521 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
2 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
3 7698 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
4 7844 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
5 7900 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
6 7566 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
7 7788 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
8 7876 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
9 7902 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
10 7782 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
11 7839 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
12 7934 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK

Note: When a and b perform inline operations (b) does not equal (a).

5) Use the join function

If the two tables left and rightThe primary key name is the same, you can use the join function.

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\')
print(df_emp_left.join(df_dept_right,how="outer",rsuffix='1'))

Running results:

EMPNO ENAME JOB MGR ... DEPTNO DEPTNO1 DNAME LOC
0 7369 SMITH CLERK 7902.0 ... NaN 10.0 ACCOUNTING NEW YORK
1 7499 ALLEN SALESMAN 7698.0 ... 30.0 20.0 RESEARCH DALLAS
2 7521 WARD SALESMAN 7698.0 ... 30.0 30.0 SALES CHICAGO
3 7566 JONES MANAGER 7839.0 ... 20.0 40.0 OPERATIONS BOSTON
4 7654 MARTIN SALESMAN 7698.0 ... 30.0 NaN NaN NaN
5 7698 BLAKE MANAGER 7839.0 ... 30.0 NaN NaN NaN
6 7782 CLARK MANAGER 7839.0 ... 10.0 NaN NaN NaN
7 7788 SCOTT ANALYST 7566.0 ... 20.0 NaN NaN NaN
8 7839 KING PRESIDENT NaN ... 10.0 NaN NaN NaN
9 7844 TURNER SALESMAN 7698.0 ... 30.0 NaN NaN NaN
10 7876 ADAMS CLERK 7788.0 ... 20.0 NaN NaN NaN
11 7900 JAMES CLERK 7698.0 ... 30.0 NaN NaN NaN
12 7902 FORD ANALYST 7566.0 ... 20.0 NaN NaN NaN
13 7934 MILLER CLERK 7782.0 ... 10.0 NaN NaN NaN

This is the end of this article about the implementation of Pandas merge merge operation. For more related Pandas merge merge content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!