SoFunction
Updated on 2024-10-29

Implementation of pandas data merging and splicing

The merge, join, and concat methods of the pandas package handle data merging and splicing. The merge method joins two DataFrames mainly on their common columns, the join method joins two DataFrames mainly on their indexes, and the concat method splices Series or DataFrames along rows or columns.

1. The merge method

The merge method of pandas joins two DataFrames on common columns. The main parameters of the merge method are:

  • left/right: the left and right DataFrames to merge.
  • how: the type of join. left: keep all keys from the left DataFrame; right: keep all keys from the right DataFrame; outer: keep the union of the keys from both; inner: keep the intersection of the keys; the default is 'inner'.
  • on: the name of the column(s) to join on; the name must exist in both DataFrames.
  • left_on/right_on: the column names to join on in the left/right DataFrame, used when the key columns have different names; index level names, arrays, and lists are also accepted.
  • left_index/right_index: whether to use the left/right DataFrame's index as the join key; True means yes.
  • sort: whether to sort the result by the join keys; the default is False in current pandas versions.
  • suffixes: when both DataFrames contain a column with the same name that is not a join key, suffixes sets the suffix appended to each side's column name; usually a tuple or list, ('_x', '_y') by default.
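As a minimal sketch of the left_on/right_on parameters described above, the following joins two DataFrames whose key columns have different names. The DataFrames orders and customers and their column names are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical example data: the key column is named differently on each side.
orders = pd.DataFrame({'cust_id': [1, 2, 2, 4], 'amount': [10, 20, 30, 40]})
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cy']})

# left_on/right_on name the key column on each side; how='inner' keeps
# only the keys present in both DataFrames (1 and 2 here).
merged = pd.merge(orders, customers, how='inner', left_on='cust_id', right_on='id')
print(merged)
```

Both key columns (cust_id and id) are kept in the result, because they have different names.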

The how parameter of merge selects how the two DataFrames are joined: inner join, outer join, left join, or right join. The following examples show what each join means.

1.1 Inner join

With how='inner', the DataFrames are joined with an inner join, that is, on the intersection of the values in the common column. The on parameter sets the name of the common column to join on.

# Single-column inner join
import pandas as pd
import numpy as np

# Define df1
df1 = pd.DataFrame({'alpha':['A','B','B','C','D','E'],'feature1':[1,1,2,3,3,1],
                    'feature2':['low','medium','medium','high','low','high']})
# Define df2
df2 = pd.DataFrame({'alpha':['A','A','B','F'],'pazham':['apple','orange','pine','pear'],
                    'kilo':['high','low','high','medium'],'price':np.array([5,6,5,7])})
# print(df1)
# print(df2)
# Inner join on the common column alpha
df3 = pd.merge(df1,df2,how='inner',on='alpha')
df3

The result keeps only the alpha values that appear in both DataFrames (here 'A' and 'B').

1.2 Outer join

With how='outer', the DataFrames are joined with an outer join, that is, on the union of the values in the common column. The on parameter sets the name of the common column to join on.

# Single-column outer join
# Define df1
df1 = pd.DataFrame({'alpha':['A','B','B','C','D','E'],'feature1':[1,1,2,3,3,1],
                    'feature2':['low','medium','medium','high','low','high']})
# Define df2
df2 = pd.DataFrame({'alpha':['A','A','B','F'],'pazham':['apple','orange','pine','pear'],
                    'kilo':['high','low','high','medium'],'price':np.array([5,6,5,7])})
# Outer join on the common column alpha
df4 = pd.merge(df1,df2,how='outer',on='alpha')
df4

In an outer join, a key that appears in only one of the two DataFrames is still kept, and the columns that come from the other DataFrame are set to NaN for that row.
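To see which side each row of an outer join came from, the indicator parameter of merge can help. The sketch below uses trimmed-down versions of df1 and df2 (fewer columns than in the article, an assumption made for brevity) and counts the rows by origin.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'price': [5, 6, 5, 7]})

# indicator=True adds a _merge column with the values
# 'both', 'left_only', or 'right_only' for every row.
outer = pd.merge(df1, df2, how='outer', on='alpha', indicator=True)
print(outer)

n_both = int((outer['_merge'] == 'both').sum())         # keys A and B
n_left = int((outer['_merge'] == 'left_only').sum())    # keys C, D, E
n_right = int((outer['_merge'] == 'right_only').sum())  # key F
print(n_both, n_left, n_right)
```

The rows marked left_only have NaN in price, and the row marked right_only has NaN in feature1.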

1.3 Left join

With how='left', the DataFrames are joined with a left join, that is, on the key values of the left DataFrame. The on parameter sets the name of the common column to join on.

# Single-column left join
# Define df1
df1 = pd.DataFrame({'alpha':['A','B','B','C','D','E'],'feature1':[1,1,2,3,3,1],
                    'feature2':['low','medium','medium','high','low','high']})
# Define df2
df2 = pd.DataFrame({'alpha':['A','A','B','F'],'pazham':['apple','orange','pine','pear'],
                    'kilo':['high','low','high','medium'],'price':np.array([5,6,5,7])})
# Left join on the common column alpha
df5 = pd.merge(df1,df2,how='left',on='alpha')
df5

Since the join column alpha of df2 contains 'A' twice, the single 'A' row of df1 appears twice in the left-joined df5. Keys that exist only in df1 ('C', 'D', 'E') are kept, and the columns that come from df2 are set to NaN for those rows.
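The row duplication described above can be checked, and guarded against, with the validate parameter of merge. This is a minimal sketch using trimmed-down versions of df1 and df2 (an assumption made for brevity); the variable duplicate_keys_detected is only an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'price': [5, 6, 5, 7]})

# The left join has more rows than df1 because 'A' occurs twice in df2.
left = pd.merge(df1, df2, how='left', on='alpha')
print(len(df1), len(left))

# validate='many_to_one' asks pandas to check that the keys are unique
# in the right DataFrame; duplicated keys raise a MergeError.
try:
    pd.merge(df1, df2, how='left', on='alpha', validate='many_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

Using validate is optional; it is one way to catch an unexpected increase in row count early.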

1.4 Right join

With how='right', the DataFrames are joined with a right join, that is, on the key values of the right DataFrame. The on parameter sets the name of the common column to join on.

# Single-column right join
# Define df1
df1 = pd.DataFrame({'alpha':['A','B','B','C','D','E'],'feature1':[1,1,2,3,3,1],
                    'feature2':['low','medium','medium','high','low','high']})
# Define df2
df2 = pd.DataFrame({'alpha':['A','A','B','F'],'pazham':['apple','orange','pine','pear'],
                    'kilo':['high','low','high','medium'],'price':np.array([5,6,5,7])})
# Right join on the common column alpha
df6 = pd.merge(df1,df2,how='right',on='alpha')
df6

Since the join column alpha of df1 contains 'B' twice, the single 'B' row of df2 appears twice in the right-joined df6. The key that exists only in df2 ('F') is kept, and the columns that come from df1 are set to NaN for that row.
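One way to build intuition for the right join is to compare it with a left join whose arguments are swapped. The sketch below, on trimmed-down versions of df1 and df2 (an assumption made for brevity), checks that both contain the same rows once columns and rows are put in the same order.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'price': [5, 6, 5, 7]})

right = pd.merge(df1, df2, how='right', on='alpha')
swapped = pd.merge(df2, df1, how='left', on='alpha')

# Align the column order, then sort the rows so that only content is compared.
cols = list(right.columns)
a = right.sort_values(cols).reset_index(drop=True)
b = swapped[cols].sort_values(cols).reset_index(drop=True)
same = a.equals(b)
print(same)
```

The two results differ only in column order, which is why the columns are aligned before comparing.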

1.5 Joining on multiple columns

Joins on multiple columns work the same way as joins on a single column. This section only covers inner and right joins on multiple columns; readers can write the code for outer and left joins themselves and work through them with the approach used in this article.

Inner joins with multiple columns:

# Inner join on multiple columns
# Define df1
df1 = pd.DataFrame({'alpha':['A','B','B','C','D','E'],'beta':['a','a','b','c','c','e'],
                    'feature1':[1,1,2,3,3,1],'feature2':['low','medium','medium','high','low','high']})
# Define df2
df2 = pd.DataFrame({'alpha':['A','A','B','F'],'beta':['d','d','b','f'],'pazham':['apple','orange','pine','pear'],
                    'kilo':['high','low','high','medium'],'price':np.array([5,6,5,7])})
# Inner join on the common columns alpha and beta
df7 = pd.merge(df1,df2,on=['alpha','beta'],how='inner')
df7

Right joining of multiple columns:

# Right join on multiple columns
# Define df1
df1 = pd.DataFrame({'alpha':['A','B','B','C','D','E'],'beta':['a','a','b','c','c','e'],
                    'feature1':[1,1,2,3,3,1],'feature2':['low','medium','medium','high','low','high']})
# Define df2
df2 = pd.DataFrame({'alpha':['A','A','B','F'],'beta':['d','d','b','f'],'pazham':['apple','orange','pine','pear'],
                    'kilo':['high','low','high','medium'],'price':np.array([5,6,5,7])})
print(df1)
print(df2)

# Right join on the common columns alpha and beta
df8 = pd.merge(df1,df2,on=['alpha','beta'],how='right')
df8
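For readers who want to check the remaining join types on multiple columns, the following sketch counts the rows produced by each value of how. It uses trimmed-down versions of df1 and df2 (an assumption made for brevity); row_counts is only an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                    'feature1': [1, 1, 2, 3, 3, 1]})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'beta': ['d', 'd', 'b', 'f'],
                    'price': [5, 6, 5, 7]})

# A row matches only when BOTH alpha and beta are equal; here that is
# the single pair ('B', 'b').
keys = ['alpha', 'beta']
row_counts = {how: len(pd.merge(df1, df2, on=keys, how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)  # {'inner': 1, 'left': 6, 'right': 4, 'outer': 9}
```

The outer join has 9 rows: the 6 rows of df1 plus the 3 rows of df2 whose (alpha, beta) pair has no match in df1.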

1.6 Index-based joins

The previous sections covered column-based joins; the merge method can also join DataFrames on their indexes.

# Inner join on a column and an index
# Define df1
df1 = pd.DataFrame({'alpha':['A','B','B','C','D','E'],'beta':['a','a','b','c','c','e'],
                    'feature1':[1,1,2,3,3,1],'feature2':['low','medium','medium','high','low','high']})
# Define df2
df2 = pd.DataFrame({'alpha':['A','A','B','F'],'pazham':['apple','orange','pine','pear'],
                    'kilo':['high','low','high','medium'],'price':np.array([5,6,5,7])},index=['d','d','b','f'])
print(df1)
print(df2)

# Inner join on df1's beta column and df2's index
df9 = pd.merge(df1,df2,how='inner',left_on='beta',right_index=True)
df9

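The inner join keeps only the rows where a value in df1's beta column matches a label in df2's index.

Both DataFrames also contain an alpha column that is not a join key, so pandas appends the default suffixes _x and _y to tell them apart. Set the suffixes parameter to choose the suffixes for such columns.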

# Inner join on df1's beta column and df2's index, with custom suffixes
df9 = pd.merge(df1,df2,how='inner',left_on='beta',right_index=True,suffixes=('_df1','_df2'))
df9
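The effect of suffixes is easiest to see by comparing the column names with and without it. This is a minimal sketch on smaller, hypothetical versions of df1 and df2 that share a non-key column alpha.

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': ['A', 'B', 'B'], 'beta': ['a', 'a', 'b']})
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B'], 'price': [5, 6, 5]},
                   index=['d', 'd', 'b'])

# Join df1's beta column against df2's index; alpha exists on both sides
# and is not a join key, so it receives a suffix on each side.
default = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True)
custom = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True,
                  suffixes=('_df1', '_df2'))
print(list(default.columns))
print(list(custom.columns))
```

Only the overlapping column alpha is renamed; beta and price keep their names.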

2. The join method

The join method joins DataFrames on their indexes, whereas the merge method joins on columns by default. Like merge, join supports inner, outer, left, and right joins through the how parameter; note that join defaults to how='left', while merge defaults to how='inner'.

Joining index to index:

caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'B': ['B0', 'B1', 'B2']})
print(caller)
print(other)
# lsuffix and rsuffix set the suffixes for the overlapping column names
caller.join(other,lsuffix='_caller', rsuffix='_other',how='inner')

join can also work on columns, by first moving the key column into the index:

caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'B': ['B0', 'B1', 'B2']})
print(caller)
print(other)

# Join on the key column by setting it as the index on both sides
caller.set_index('key').join(other.set_index('key'),how='inner')

join and merge therefore work in much the same way, so the join method is not covered further here; using the merge method is recommended.
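The similarity between join and merge can be checked directly. The sketch below builds the same inner join on key both ways and compares the results; the names via_join and via_merge are only illustrative.

```python
import pandas as pd

caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                       'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

# join works on indexes, so key is moved into the index first.
via_join = caller.set_index('key').join(other.set_index('key'), how='inner')

# merge works on columns directly; set_index afterwards only makes the
# two results comparable.
via_merge = pd.merge(caller, other, on='key', how='inner').set_index('key')

print(via_join.equals(via_merge))
```

With merge, no set_index call is needed to perform the join itself, which is one reason it tends to be the more convenient choice.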

3. The concat method

The concat method splices pandas objects (Series or DataFrames) along rows or columns. It splices along rows by default (axis=0), and the default join is an outer join (union).

3.1 Splice methods for series types

Row splicing:

df1 = pd.Series([1.1,2.2,3.3],index=['i1','i2','i3'])
df2 = pd.Series([4.4,5.5,6.6],index=['i2','i3','i4'])
print(df1)
print(df2)

# Row splicing
pd.concat([df1,df2])

Row splicing keeps every row, even when index labels repeat (here i2 and i3). To tell the sources apart, use the keys parameter to add an outermost index level that groups the rows by source.

# Row splicing with a grouping key per source
pd.concat([df1,df2],keys=['fea1','fea2'])
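A short sketch of what the keys parameter changes: without it the index contains duplicates, with it the result has a two-level index that can be selected by group. The names s1, s2, plain, and grouped are illustrative only.

```python
import pandas as pd

s1 = pd.Series([1.1, 2.2, 3.3], index=['i1', 'i2', 'i3'])
s2 = pd.Series([4.4, 5.5, 6.6], index=['i2', 'i3', 'i4'])

plain = pd.concat([s1, s2])
grouped = pd.concat([s1, s2], keys=['fea1', 'fea2'])

print(plain.index.is_unique)         # False: i2 and i3 occur twice
print(grouped.index.nlevels)         # 2: (key, original label)
print(grouped.loc['fea2'])           # the rows that came from s2
print(grouped.loc[('fea1', 'i2')])   # a single value from s1
```

Selecting with grouped.loc['fea2'] recovers the rows of the second Series with their original labels.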

Column splicing:

Column splicing uses an outer join (union of the indexes) by default:

# Column splicing, defaults to an outer join (union)
pd.concat([df1,df2],axis=1)

Splice by intersection:

# Column splicing with an inner join (intersection)
pd.concat([df1,df2],axis=1,join='inner')

Set the column names of the column splice:

# Column splicing with an inner join (intersection); keys sets the column names
pd.concat([df1,df2],axis=1,join='inner',keys=['fea1','fea2'])
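The difference between the default outer join and join='inner' for column splicing can be sketched as follows, reusing the two Series from this section under the illustrative names s1 and s2.

```python
import pandas as pd

s1 = pd.Series([1.1, 2.2, 3.3], index=['i1', 'i2', 'i3'])
s2 = pd.Series([4.4, 5.5, 6.6], index=['i2', 'i3', 'i4'])

# Outer (default): union of the index labels, missing values become NaN.
outer = pd.concat([s1, s2], axis=1, keys=['fea1', 'fea2'])
# Inner: only the labels present in both Series are kept.
inner = pd.concat([s1, s2], axis=1, join='inner', keys=['fea1', 'fea2'])

print(outer)
print(inner)
```

In the outer result, fea2 is NaN at i1 and fea1 is NaN at i4; the inner result contains only i2 and i3.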

Splice on a specified index:

# Column splicing restricted to the index labels [i1,i2,i3].
# The join_axes parameter was removed in pandas 1.0; reindex the result instead.
pd.concat([df1,df2],axis=1).reindex(['i1','i2','i3'])

3.2 Splice methods for dataframe types

Row splicing:

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'B': ['B0', 'B1', 'B2']})
print(df1)
print(df2)

# Row splicing
pd.concat([df1,df2])

Column splicing:

# Column splicing
pd.concat([df1,df2],axis=1)

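With verify_integrity=True, concat raises an error if the splice produces duplicate labels on the concatenation axis (duplicate column names for a column splice, duplicate index labels for a row splice):

# Check for duplicate column names and raise an error if any are found
pd.concat([df1,df2],axis=1,verify_integrity=True)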

ValueError: Indexes have overlapping values: ['key']
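The same check applies to row splicing. The sketch below, on smaller hypothetical DataFrames, shows verify_integrity=True rejecting overlapping row labels, and ignore_index=True as one way to avoid the overlap by renumbering the rows.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'key': ['K0', 'K1'], 'B': ['B0', 'B1']})

# Both DataFrames use the default labels 0, 1, ..., so a row splice
# produces duplicate index labels and verify_integrity=True raises.
try:
    pd.concat([df1, df2], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)

# ignore_index=True discards the original labels and renumbers the rows.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)
```

Whether renumbering is appropriate depends on whether the original index labels carry meaning.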

4. Summary

The merge and join methods provide largely the same functionality; using merge is recommended.
