Detailed explanation of PANDAS data merging and reshaping (join/merge chapter)

existPrevious article.In the article, I compiled a description of the use of the concat method, which is commonly used in pandas for data merging and reshaping. Here, I'll move on to the join and merge methods, which are also often used in pandas.

merge

The merge method of pandas provides an SQL-like memory linking operation, and the official website documentation mentions that its performance will be more efficient than data manipulation in other open source languages (e.g. R).

A comparison with SQL statements can be seenhere are

Parameters of merge

on: column name, join used to align the name of the column, use this parameter must ensure that the left table and the right table used to align the column have the same column name.

left_on: left table aligned columns, either column names or arrays of the same length as the dataframe.

right_on: right table aligned columns, either column names or arrays of the same length as the dataframe.

left_index/ right_index: if True haunted with index as alignment key

HOW: Methods of data fusion.

sort: according to the dataframe merge keys in dictionary order, default is yes, if set false can improve the performance.

merge's default merge method:
merge is used for internal table merging based on index-on-index and index-on-column(s), but merging based on index is the default.

1.1 Methods for merging composite keys

When using merge, you can select multiple keys as a composite that can be aligned and merged.

1.1.1 Specifying data merge aligned columns by on

In [41]: left = ({'key1': ['K0', 'K0', 'K1', 'K2'],
  ....:           'key2': ['K0', 'K1', 'K0', 'K1'],
  ....:           'A': ['A0', 'A1', 'A2', 'A3'],
  ....:           'B': ['B0', 'B1', 'B2', 'B3']})
  ....: 

In [42]: right = ({'key1': ['K0', 'K1', 'K1', 'K2'],
  ....:            'key2': ['K0', 'K0', 'K0', 'K0'],
  ....:            'C': ['C0', 'C1', 'C2', 'C3'],
  ....:            'D': ['D0', 'D1', 'D2', 'D3']})
  ....: 

In [43]: result = (left, right, on=['key1', 'key2'])

The inner method is used by default if you don't specify how.

HOW methods are available:

left

Keep all data in the left table only

In [44]: result = (left, right, how='left', on=['key1', 'key2'])

right

Keep all data in the right table only

In [45]: result = (left, right, how='right', on=['key1', 'key2'])

outer

Retain all information from both tables

In [46]: result = (left, right, how='outer', on=['key1', 'key2'])

inner

Keep only the information in the common parts of the two tables

In [47]: result = (left, right, how='inner', on=['key1', 'key2'])

1.2 indicator

The v0.17.0 version of pandas also supports an indicator parameter, which, if set to True, adds a column called '_merge' to the output. The _merge column can take three values

left_only Only in the left table
right_only In the right table only
Both. In both tables.

1.3 The join method

The dataframe's built-in join method is a fast way to merge. It defaults to index as the aligned column.

1.3.1 how parameters

The how parameter in join, like the how parameter in merge, is used to specify the rules for retaining data for table merges.

See the previous how section for details.

1.3.2 on parameter

In practice, if the index value of the right table is the value of a column in the left table, then you can merge the index of the right table with the column of the left table in a flexible way by aligning the index of the right table with the column of the left table.

ex 1

In [59]: left = ({'A': ['A0', 'A1', 'A2', 'A3'],
  ....:           'B': ['B0', 'B1', 'B2', 'B3'],
  ....:           'key': ['K0', 'K1', 'K0', 'K1']})
  ....: 

In [60]: right = ({'C': ['C0', 'C1'],
  ....:            'D': ['D0', 'D1']},
  ....:            index=['K0', 'K1'])
  ....: 

In [61]: result = (right, on='key')

1.3.3 suffix suffix parameters

If the process of merging with a table encounters a column with the same name in both tables, but the value is different, and both want to keep it when merging, you can use suffixes to add suffixes to the duplicate column names in each table.

In [79]: result = (left, right, on='k', suffixes=['_l', '_r'])

* There are also lsuffix and rsuffix which specify the suffix of the left table and the suffix of the right table, respectively.

1.4 Combining multiple dataframes

When combining multiple dataframes at a time, you can pass in a list or tuple of dataframes, joining multiple at a time, solving multiple problems at once~.

In [83]: right2 = ({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])

In [84]: result = ([right, right2])

1.5 Updating the table's nan value

1.5.1 combine_first

If a table has a nan value that can be found in the same location (same index and same column) in another table, the data can be updated with combine_first

1.5.2 update

If you want to use the data in one table to update the data in another table, you can use update to do so.

1.5.3 Difference between combine_first and update

Using combine_first will update only the nan values in the left table. Update, on the other hand, will update all values in the left table that can be found in the right table (the two tables are positioned relative to each other).

Example code reference source -official website

This is the whole content of this article.