The merge, join, and concat methods of the pandas package combine and concatenate data. The merge method joins two DataFrames mainly on their common columns, the join method joins them mainly on their indexes, and the concat method concatenates Series or DataFrame objects along rows or columns.
1. Merge method
The pandas merge method joins two DataFrames on common columns. Its main parameters are:
- left/right: the left and right DataFrames to merge.
- how: the join type. left: keep every key from the left DataFrame; right: keep every key from the right DataFrame; outer: keep the union of the keys from both DataFrames; inner: keep the intersection of the keys. The default is 'inner'.
- on: the name of the column (or columns) to join on; the column must exist under the same name in both DataFrames.
- left_on/right_on: the key columns of the left/right DataFrame, for use when the names differ; each can be a column name, an index level name, an array, or a list of these.
- left_index/right_index: whether to use the index of the left/right DataFrame as the join key; True means yes.
- sort: whether to sort the result by the join keys; the default is False in current pandas releases.
- suffixes: when both DataFrames contain a column with the same name that is not used as a join key, suffixes sets the suffixes appended to tell the two columns apart; usually a tuple or list of two strings.
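As a minimal sketch of how left_on/right_on and suffixes work together, the example below joins two small DataFrames whose key columns have different names. The data and the column names emp_id, id, and score are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: the key column is named differently on each side
left = pd.DataFrame({'emp_id': [1, 2, 3], 'score': [90, 80, 70]})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [88, 77, 66]})

# left_on/right_on name the key column on each side;
# suffixes renames the overlapping non-key column 'score'
merged = pd.merge(left, right, how='inner',
                  left_on='emp_id', right_on='id',
                  suffixes=('_left', '_right'))
print(merged)
```

Only the keys 2 and 3 exist on both sides, so the inner join keeps those two rows, and the two score columns are kept apart by their suffixes.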
merge selects how two DataFrames are joined through the how parameter: inner join, outer join, left join, or right join. The following examples illustrate what each join type means.
1.1 Inner join
With how='inner', the DataFrames are combined with an inner join: only rows whose key values appear in both DataFrames are kept, that is, the intersection of the common column. The on parameter names the common column to join on.

    # Single-column inner join
    import pandas as pd
    import numpy as np

    # Define df1
    df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                        'feature1': [1, 1, 2, 3, 3, 1],
                        'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
    # Define df2
    df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                        'pazham': ['apple', 'orange', 'pine', 'pear'],
                        'kilo': ['high', 'low', 'high', 'medium'],
                        'price': np.array([5, 6, 5, 7])})
    # print(df1)
    # print(df2)
    # Inner join on the common column alpha
    df3 = pd.merge(df1, df2, how='inner', on='alpha')
    print(df3)

The result keeps only the alpha values present in both DataFrames ('A' and 'B').
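The intersection rule can be checked directly. The sketch below uses a trimmed-down version of df1 and df2 (fewer columns, same alpha values) and compares the keys in the merged result with the set intersection of the two alpha columns; the variable name common is introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'price': [5, 6, 5, 7]})

# Keys that occur in both DataFrames
common = set(df1['alpha']) & set(df2['alpha'])

df3 = pd.merge(df1, df2, how='inner', on='alpha')
print(sorted(common))
print(df3)
```

Each matching pair of rows produces one output row, so 'A' (one row in df1, two in df2) and 'B' (two rows in df1, one in df2) each contribute two rows.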
1.2 Outer join
With how='outer', the DataFrames are combined with an outer join: every key value from either DataFrame is kept, that is, the union of the common column. The on parameter names the common column to join on.

    # Single-column outer join
    # Define df1
    df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                        'feature1': [1, 1, 2, 3, 3, 1],
                        'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
    # Define df2
    df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                        'pazham': ['apple', 'orange', 'pine', 'pear'],
                        'kilo': ['high', 'low', 'high', 'medium'],
                        'price': np.array([5, 6, 5, 7])})
    # Outer join on the common column alpha
    df4 = pd.merge(df1, df2, how='outer', on='alpha')
    print(df4)

Where a key value appears in only one DataFrame, the columns that come from the other DataFrame are set to NaN for that row.
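To see which rows were filled with NaN and why, one option is merge's indicator parameter, which is not used in the article's own examples. The sketch below assumes a trimmed-down df1 and df2 and adds a _merge column recording whether each row came from the left side, the right side, or both.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'price': [5, 6, 5, 7]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
df4 = pd.merge(df1, df2, how='outer', on='alpha', indicator=True)
print(df4)

# Rows that exist only in df2 have NaN in the df1 columns, and vice versa
only_right = df4[df4['_merge'] == 'right_only']
only_left = df4[df4['_merge'] == 'left_only']
```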
1.3 Left join
With how='left', the DataFrames are combined with a left join: every key value of the left DataFrame is kept. The on parameter names the common column to join on.

    # Single-column left join
    # Define df1
    df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                        'feature1': [1, 1, 2, 3, 3, 1],
                        'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
    # Define df2
    df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                        'pazham': ['apple', 'orange', 'pine', 'pear'],
                        'kilo': ['high', 'low', 'high', 'medium'],
                        'price': np.array([5, 6, 5, 7])})
    # Left join on the common column alpha
    df5 = pd.merge(df1, df2, how='left', on='alpha')
    print(df5)

Because the join column alpha of df2 has two 'A' values, the left-joined df5 contains two 'A' rows. Key values that exist only in df1 ('C', 'D', 'E') get NaN in the columns that come from df2.
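The row count of a left join can be predicted from the key counts: each left row appears once per matching right row, and once if it has no match. The following sketch, on a trimmed-down df1 and df2, computes that expected count (the names counts and expected_rows are introduced here) and compares it with the actual result.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'price': [5, 6, 5, 7]})

# How often each key occurs in the right DataFrame
counts = df2['alpha'].value_counts()

# Each left row is repeated once per match; unmatched rows are kept once
expected_rows = int(df1['alpha'].map(counts).fillna(1).sum())

df5 = pd.merge(df1, df2, how='left', on='alpha')
print(expected_rows, len(df5))
```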
1.4 Right join
With how='right', the DataFrames are combined with a right join: every key value of the right DataFrame is kept. The on parameter names the common column to join on.

    # Single-column right join
    # Define df1
    df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                        'feature1': [1, 1, 2, 3, 3, 1],
                        'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
    # Define df2
    df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                        'pazham': ['apple', 'orange', 'pine', 'pear'],
                        'kilo': ['high', 'low', 'high', 'medium'],
                        'price': np.array([5, 6, 5, 7])})
    # Right join on the common column alpha
    df6 = pd.merge(df1, df2, how='right', on='alpha')
    print(df6)

Because the join column alpha of df1 has two 'B' values, the right-joined df6 contains two 'B' rows. The key value that exists only in df2 ('F') gets NaN in the columns that come from df1.
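A right join can also be read as a left join with the two DataFrames swapped. The sketch below, on a trimmed-down df1 and df2, checks this by comparing both results after putting columns and rows into the same order; the helper name normalise is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'price': [5, 6, 5, 7]})

df6 = pd.merge(df1, df2, how='right', on='alpha')
swapped = pd.merge(df2, df1, how='left', on='alpha')

def normalise(df):
    # Same column order and row order, so only the content is compared
    cols = ['alpha', 'feature1', 'pazham', 'price']
    ordered = df[cols].sort_values(['alpha', 'pazham', 'feature1'])
    return ordered.reset_index(drop=True)

same = normalise(df6).equals(normalise(swapped))
print(same)
```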
1.5 Multi-column joins
Multi-column joins follow the same logic as single-column joins. This section shows only the multi-column inner join and right join; readers can write the outer and left joins themselves and work through them in the same way as the examples in this article.
Inner join on multiple columns:

    # Inner join on multiple columns
    # Define df1
    df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                        'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                        'feature1': [1, 1, 2, 3, 3, 1],
                        'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
    # Define df2
    df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                        'beta': ['d', 'd', 'b', 'f'],
                        'pazham': ['apple', 'orange', 'pine', 'pear'],
                        'kilo': ['high', 'low', 'high', 'medium'],
                        'price': np.array([5, 6, 5, 7])})
    # Inner join on the common columns alpha and beta
    df7 = pd.merge(df1, df2, on=['alpha', 'beta'], how='inner')
    print(df7)

Right join on multiple columns:

    # Right join on multiple columns
    # Define df1
    df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                        'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                        'feature1': [1, 1, 2, 3, 3, 1],
                        'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
    # Define df2
    df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                        'beta': ['d', 'd', 'b', 'f'],
                        'pazham': ['apple', 'orange', 'pine', 'pear'],
                        'kilo': ['high', 'low', 'high', 'medium'],
                        'price': np.array([5, 6, 5, 7])})
    print(df1)
    print(df2)
    # Right join on the common columns alpha and beta
    df8 = pd.merge(df1, df2, on=['alpha', 'beta'], how='right')
    print(df8)
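For the outer and left joins on multiple columns, which this section leaves to the reader, a minimal sketch could look like the following. It uses a trimmed-down df1 and df2 with the same alpha and beta values; the result names outer_multi and left_multi are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'beta': ['d', 'd', 'b', 'f'],
                    'price': [5, 6, 5, 7]})

# A row matches only if BOTH alpha and beta are equal
outer_multi = pd.merge(df1, df2, on=['alpha', 'beta'], how='outer')
left_multi = pd.merge(df1, df2, on=['alpha', 'beta'], how='left')
print(outer_multi)
print(left_multi)
```

Only the pair ('B', 'b') occurs in both DataFrames, so it is the only row with values from both sides; every other row has NaN in the columns of the DataFrame that lacks that key pair.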
1.6 Index-based joins
The previous sections covered column-based joins. The merge method can also join DataFrames on their indexes.

    # Inner join between a column and an index
    # Define df1
    df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                        'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                        'feature1': [1, 1, 2, 3, 3, 1],
                        'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
    # Define df2
    df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                        'pazham': ['apple', 'orange', 'pine', 'pear'],
                        'kilo': ['high', 'low', 'high', 'medium'],
                        'price': np.array([5, 6, 5, 7])},
                       index=['d', 'd', 'b', 'f'])
    print(df1)
    print(df2)
    # Join df1's beta column with df2's index
    df9 = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True)
    print(df9)
This is an inner join between a column and an index. Both DataFrames have an alpha column that is not a join key, so the result distinguishes the two with suffixes ('_x' and '_y' by default). Set the suffixes parameter to change them:

    # Inner join of df1's beta column with df2's index, with custom suffixes
    df9 = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True,
                   suffixes=('_df1', '_df2'))
    print(df9)
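The effect of suffixes is easiest to see by comparing the column names with and without it. The sketch below uses a trimmed-down df1 and df2 (only the alpha, beta, and price columns) and is meant as an illustration under that assumption; the result names default and custom are introduced here.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'beta': ['a', 'a', 'b', 'c', 'c', 'e']})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'price': [5, 6, 5, 7]},
                   index=['d', 'd', 'b', 'f'])

# Default suffixes for the overlapping non-key column 'alpha'
default = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True)

# Custom suffixes for the same join
custom = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True,
                  suffixes=('_df1', '_df2'))

print(list(default.columns))
print(list(custom.columns))
```

Only the value 'b' occurs both in df1's beta column and in df2's index, so each result has a single row.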
2. The join method
The join method joins DataFrames on their indexes, whereas the merge method joins on columns by default. join supports inner, outer, left, and right joins, consistent with merge.
Index-to-index join:

    caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                           'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
    other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                          'B': ['B0', 'B1', 'B2']})
    print(caller)
    print(other)
    # lsuffix and rsuffix set the suffixes for the overlapping column 'key'
    print(caller.join(other, lsuffix='_caller', rsuffix='_other', how='inner'))

join can also be based on columns, by first moving the key column into the index:

    caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                           'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
    other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                          'B': ['B0', 'B1', 'B2']})
    print(caller)
    print(other)
    # Join on the key column
    print(caller.set_index('key').join(other.set_index('key'), how='inner'))
join and merge therefore work in similar ways, so the join method is not covered further here; merge is the recommended choice.
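The similarity between the two methods can be sketched by running the same inner join both ways. The example below uses join's on parameter, which matches a column of the caller against the index of the other DataFrame, and compares the result with the equivalent merge call; the result names joined and merged are introduced here.

```python
import pandas as pd

caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                       'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

# join: caller's 'key' column against other's index
joined = caller.join(other.set_index('key'), on='key', how='inner')

# merge: the same join expressed on the common column
merged = pd.merge(caller, other, on='key', how='inner')

print(joined)
print(merged)
```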
3. The concat method
The concat method concatenates pandas objects (Series or DataFrame) along rows or columns. Row concatenation is the default, and the default join mode is outer (union).
3.1 Concatenating Series
Row concatenation:

    df1 = pd.Series([1.1, 2.2, 3.3], index=['i1', 'i2', 'i3'])
    df2 = pd.Series([4.4, 5.5, 6.6], index=['i2', 'i3', 'i4'])
    print(df1)
    print(df2)
    # Row concatenation
    print(pd.concat([df1, df2]))

Row concatenation keeps index labels that occur in both Series, so the result contains duplicate labels. To tell them apart, use keys to add an outer index level that groups the rows by source.

    # Row concatenation with a group label per source
    print(pd.concat([df1, df2], keys=['fea1', 'fea2']))
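The difference that keys makes shows up when selecting values. In the sketch below, the names s1, s2, plain, and grouped are introduced for illustration: without keys the label 'i2' is ambiguous, while with keys each value can be addressed by its source and label.

```python
import pandas as pd

s1 = pd.Series([1.1, 2.2, 3.3], index=['i1', 'i2', 'i3'])
s2 = pd.Series([4.4, 5.5, 6.6], index=['i2', 'i3', 'i4'])

# Without keys: 'i2' and 'i3' occur twice in the index
plain = pd.concat([s1, s2])
print(plain.index.is_unique)
print(plain.loc['i2'])

# With keys: a two-level index (source, label) that is unique again
grouped = pd.concat([s1, s2], keys=['fea1', 'fea2'])
print(grouped.loc['fea2'])
print(grouped.loc[('fea2', 'i2')])
```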
Column concatenation:
Column concatenation takes the union of the indexes by default:

    # Column concatenation, outer join (union) by default
    print(pd.concat([df1, df2], axis=1))

Concatenate on the intersection of the indexes:

    # Column concatenation with an inner join (intersection)
    print(pd.concat([df1, df2], axis=1, join='inner'))

Set the column names of the concatenated result with keys:

    # Column concatenation with an inner join, naming the columns with keys
    print(pd.concat([df1, df2], axis=1, join='inner', keys=['fea1', 'fea2']))
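The union and intersection variants of column concatenation can be compared side by side. This is a minimal sketch using the same two Series under the illustrative names s1 and s2; the result names outer and inner are also introduced here.

```python
import pandas as pd

s1 = pd.Series([1.1, 2.2, 3.3], index=['i1', 'i2', 'i3'])
s2 = pd.Series([4.4, 5.5, 6.6], index=['i2', 'i3', 'i4'])

# Union of the indexes: labels missing from one Series become NaN
outer = pd.concat([s1, s2], axis=1, keys=['fea1', 'fea2'])

# Intersection of the indexes: only labels present in both Series
inner = pd.concat([s1, s2], axis=1, join='inner', keys=['fea1', 'fea2'])

print(outer)
print(inner)
```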
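Concatenate on a specified index. The join_axes parameter used for this in older pandas versions was removed in pandas 1.0; reindex the result instead:

    # Column concatenation restricted to the index labels [i1, i2, i3]
    print(pd.concat([df1, df2], axis=1).reindex(['i1', 'i2', 'i3']))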
3.2 Concatenating DataFrames
Row concatenation:

    df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                        'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
    df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                        'B': ['B0', 'B1', 'B2']})
    print(df1)
    print(df2)
    # Row concatenation
    print(pd.concat([df1, df2]))

Column concatenation:

    # Column concatenation
    print(pd.concat([df1, df2], axis=1))
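The shape of each result shows what concat does with labels that do not line up. The sketch below also uses ignore_index=True, a concat parameter that the article's examples do not use, to replace the repeated row labels with a fresh 0..n-1 index; the result names rows and cols are introduced here.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']})

# Row concatenation: columns are unioned, missing cells become NaN,
# and ignore_index=True renumbers the rows
rows = pd.concat([df1, df2], ignore_index=True)

# Column concatenation: rows are aligned on the index, and the
# duplicate column name 'key' is kept twice
cols = pd.concat([df1, df2], axis=1)

print(rows)
print(cols)
```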
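With verify_integrity=True, concat raises an error if the concatenation axis would contain duplicate labels (duplicate column names for column concatenation, duplicate row labels for row concatenation):

    # Check for duplicate column names and raise an error if any are found
    pd.concat([df1, df2], axis=1, verify_integrity=True)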
ValueError: Indexes have overlapping values: ['key']
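In a script, the error can be caught so that the duplicate labels are reported rather than stopping the program. A minimal sketch, in which the variable name error is introduced for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']})

error = None
try:
    # Both DataFrames have a 'key' column, so the column axis would
    # contain a duplicate label
    pd.concat([df1, df2], axis=1, verify_integrity=True)
except ValueError as exc:
    error = str(exc)

print(error)
```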
4. Summary
The merge and join methods provide essentially the same functionality; merge is the recommended choice.