A detailed guide to using pandas groupby() in Python, with examples

👋 Welcome to the Python Advanced Learning Journey! Today we will dive into groupby(), one of the most powerful methods in the pandas library. groupby() plays a key role in data analysis and data cleaning: it lets us split data into groups and then aggregate or transform each group.

1. Why do you need groupby()?

When processing large amounts of data, we often need to group the data by one or more features to better understand its structure and relationships. For example, we may want to group data by year, region, or product category and then aggregate each group by computing its sum, mean, maximum, and so on. This is where groupby() comes in.

2. Basic usage of groupby()

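First, we import the pandas library and create a sample dataset. Then we use groupby() to group the data by a specified column. The numeric columns are filled with random values, so your numbers will differ from the outputs shown in this article.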

import numpy as np
import pandas as pd

# Create a simple DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
}
df = pd.DataFrame(data)
# Use groupby to group by column 'A'
grouped = df.groupby('A')
# Print the resulting GroupBy object
print(grouped)

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B2C070B8E0>

The code above groups the DataFrame by the values in column 'A' and returns a GroupBy object. The object itself only describes the grouping; to get results, we apply an aggregation operation to it.
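Printing a GroupBy object shows only its memory address, so it can help to look inside it. The following is a minimal sketch that uses a small fixed dataset (hypothetical values, not the random data above) so the results are reproducible:

```python
import pandas as pd

# Small fixed dataset (hypothetical values) so the results are reproducible
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1, 2, 3, 4, 5],
})

grouped = df.groupby('A')

# Number of groups, and the number of rows in each group
print(grouped.ngroups)
print(grouped.size())

# Iterate over (key, sub-DataFrame) pairs; keys are sorted by default
for key, group in grouped:
    print(key)
    print(group)

# Pull out a single group as a DataFrame
foo_rows = grouped.get_group('foo')
print(foo_rows)
```

Here get_group('foo') returns the rows with index 0, 2, and 4, which are exactly the rows where column 'A' equals 'foo'.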

3. Aggregation operations

GroupBy objects provide a variety of aggregate functions, such as sum(), mean(), and max(). We can use them to aggregate each group.

import numpy as np
import pandas as pd

# Create a simple DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
}
df = pd.DataFrame(data)
# Use groupby to group by column 'A'
grouped = df.groupby('A')
# Calculate the mean of each group
mean_grouped = grouped.mean()
print(mean_grouped)
# Calculate the sum of each group
sum_grouped = grouped.sum()
print(sum_grouped)

Output:

            C         D
A
bar  0.658173 -0.225388
foo  0.778100 -0.164148
           C         D
A
bar  1.97452 -0.676164
foo  3.89050 -0.820740

In addition to the built-in aggregate methods, we can use agg(), which accepts either the name of an aggregation or a custom function. For example, we can calculate the standard deviation of each group:

import numpy as np
import pandas as pd

# Create a simple DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
}
df = pd.DataFrame(data)
# Use groupby to group by column 'A'
grouped = df.groupby('A')
# Print the GroupBy object
print(grouped)
# Calculate the standard deviation of each group with agg()
std_grouped = grouped.agg('std')
print(std_grouped)

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B2F480B880>
            C         D
A
bar  0.101229  0.274698
foo  0.996597  0.812362
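agg() can also compute several statistics at once and mix built-in aggregations with your own functions. The following is a minimal sketch using named aggregation on a small fixed dataset; the helper value_range and the output column names are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Small fixed dataset (hypothetical values) so the results are reproducible
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0],
})

def value_range(s):
    """Custom aggregation: spread between the largest and smallest value."""
    return s.max() - s.min()

# Named aggregation: output_column=(input_column, function_or_name)
summary = df.groupby('A').agg(
    c_mean=('C', 'mean'),
    c_std=('C', 'std'),
    c_range=('C', value_range),
)
print(summary)
```

For the 'foo' group (values 1, 3, 5) this gives a mean of 3, a sample standard deviation of 2, and a range of 4.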

4. Advanced usage and techniques

Beyond basic grouping and aggregation, groupby() offers more advanced features, such as applying custom functions, transforming data, and filtering groups.

🔧 Apply custom functions

We can use the apply() method to apply a custom function to each group. For example, we can define a function that calculates the difference between the maximum and minimum values of each group:

import numpy as np
import pandas as pd

# Create a simple DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
}
df = pd.DataFrame(data)
# Use groupby to group by column 'A'
grouped = df.groupby('A')
# Print the GroupBy object
print(grouped)

# Custom function: difference between the maximum and minimum values of a group
def range_diff(group):
    return group.max() - group.min()

# Apply the custom function to each group. The numeric columns are selected
# explicitly so that the string column 'A' is never passed to range_diff.
diff_grouped = grouped[['C', 'D']].apply(range_diff)
print(diff_grouped)

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002ACBD83AA60>
            C         D
A
bar  2.497695  1.086924
foo  2.826518  2.063781

🔄 Data transformation

groupby() also provides the transform() method, which broadcasts the result of an aggregation back to every row of the original data. This is very useful for data transformation.

import numpy as np
import pandas as pd

# Create a simple DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
}
df = pd.DataFrame(data)
# Use groupby to group by column 'A'
grouped = df.groupby('A')
# Print the GroupBy object
print(grouped)
# Use transform() to broadcast each group's mean to every row of the original data
mean_transformed = grouped['C'].transform('mean')
print(mean_transformed)
# Add the transformed mean to the original DataFrame
df['C_mean'] = mean_transformed
print(df)

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000188A56DA8E0>
0    0.344876
1   -1.358760
2    0.344876
3   -1.358760
4    0.344876
5   -1.358760
6    0.344876
7    0.344876
Name: C, dtype: float64
     A         C         D    C_mean
0  foo  0.783914 -1.027288  0.344876
1  bar -2.072893 -0.972087 -1.358760
2  foo  0.035637 -0.315908  0.344876
3  bar -1.953068  0.409697 -1.358760
4  foo  0.576048 -0.258289  0.344876
5  bar -0.050318 -1.115734 -1.358760
6  foo  0.093456  0.106227  0.344876
7  foo  0.235322  1.365150  0.344876
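Because transform() returns a result aligned with the original rows, it combines directly with the original column. A common use is centering each value on its group mean. The following is a minimal sketch with fixed, hypothetical values; the column name C_centered is chosen only for illustration:

```python
import pandas as pd

# Small fixed dataset (hypothetical values) so the results are reproducible
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 10.0, 3.0, 20.0, 5.0],
})

# One mean per row, aligned with df's index (foo -> 3.0, bar -> 15.0)
group_mean = df.groupby('A')['C'].transform('mean')

# Subtract each row's group mean from its value
df['C_centered'] = df['C'] - group_mean
print(df)
```

Unlike mean(), which would return one row per group, transform('mean') returns five values here, one for each row of df.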

🔍 Filter data

In addition to aggregation and transformation, we can use the filter() method to keep only the groups that satisfy a condition.

import numpy as np
import pandas as pd

# Create a simple DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
}
df = pd.DataFrame(data)
# Use groupby to group by column 'A'
grouped = df.groupby('A')
# Print the GroupBy object
print(grouped)
# Use filter() to keep only the groups that meet a condition
# (here, groups with more than 3 rows)
filtered_groups = grouped.filter(lambda x: len(x) > 3)
print(filtered_groups)

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000015ADE2FA940>
     A         C         D
0  foo  1.967217  0.005976
2  foo  0.950149  0.098143
4  foo  0.568101  1.461587
6  foo -1.905337 -1.106591
7  foo -0.168686  0.692850
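The condition passed to filter() does not have to be about group size; it can be any function that returns True or False for a group. The following is a minimal sketch, using fixed hypothetical values, that keeps the groups whose mean of 'C' exceeds a threshold:

```python
import pandas as pd

# Small fixed dataset (hypothetical values) so the results are reproducible
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 10.0, 3.0, 20.0, 5.0],
})

# Keep every row of each group whose mean of 'C' is greater than 5.
# foo has mean 3.0 (dropped); bar has mean 15.0 (kept).
high_mean = df.groupby('A').filter(lambda g: g['C'].mean() > 5)
print(high_mean)
```

Note that filter() returns the original rows of the surviving groups, with their original index, rather than one summary row per group.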

5. A practical example

Finally, let's walk through a practical example of using groupby() for data analysis and cleaning.

Suppose we have a DataFrame of sales data with columns such as date, region, product name, and sales. We want to group the data by region and product name and calculate the total sales for each group.

import numpy as np
import pandas as pd

# Create a DataFrame containing sales data
sales_data = {
    'date': pd.date_range(start='2023-01-01', periods=100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=100),
    'product': np.random.choice(['Product A', 'Product B', 'Product C'], size=100),
    'sales': np.random.rand(100) * 1000
}
df_sales = pd.DataFrame(sales_data)
# Group data by region and product name and calculate total sales
grouped_sales = df_sales.groupby(['region', 'product'])['sales'].sum().reset_index()
# Print the grouped sales
print(grouped_sales)

Output:

    region    product        sales
0     East  Product A  2728.679432
1     East  Product B  1847.966730
2     East  Product C  4518.356763
3    North  Product A  5882.374531
4    North  Product B  5519.364196
5    North  Product C  4229.953852
6    South  Product A  5303.784425
7    South  Product B  2321.080682
8    South  Product C  4239.002167
9     West  Product A  1689.650513
10    West  Product B  4002.790867
11    West  Product C  4894.553548

In this example, we first created a DataFrame containing sales data. Then we used groupby() to group the data by region and product name and sum() to calculate the total sales of each group. Finally, we used reset_index() to turn the group keys back into ordinary columns, which gives a new flat DataFrame, and printed it.
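To see what reset_index() changes, it helps to compare the result before and after the call. The following is a minimal sketch on a small fixed dataset (hypothetical values, not the random sales data above):

```python
import pandas as pd

# Small fixed dataset (hypothetical values) so the results are reproducible
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'product': ['Product A', 'Product B', 'Product A', 'Product A', 'Product A'],
    'sales': [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Without reset_index(): a Series indexed by (region, product) pairs
totals = sales.groupby(['region', 'product'])['sales'].sum()
print(type(totals.index).__name__)
print(totals.loc[('North', 'Product A')])

# With reset_index(): a flat DataFrame where the keys are ordinary columns
flat = totals.reset_index()
print(flat)

# Passing as_index=False to groupby() produces the same flat layout directly
flat_direct = sales.groupby(['region', 'product'], as_index=False)['sales'].sum()
print(flat_direct)
```

The flat layout is usually more convenient when the result will be merged with other tables or exported, while the indexed Series is handy for direct lookups by key.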

6. Summary

groupby() is a very powerful tool in the pandas library. It lets us group data by one or more features and then aggregate, transform, and filter each group. Mastering groupby() allows us to process and analyze large amounts of data more efficiently and to gain insight into the internal structure and relationships of the data. I hope this post helps you better understand and apply groupby()!
