In some cases, you may need to perform stratified random sampling after grouping by multiple columns. pandas provides flexible data manipulation methods for this: you can combine groupby and apply with the sample method. Specifically, you group by multiple columns first and then randomly sample within each group.
Sample data
First, create a DataFrame containing four columns (name, age, city, and department):
import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah',
             'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston',
             'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego',
             'New York', 'Los Angeles', 'Chicago', 'Houston',
             'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'],
    'department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
                   'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
}
df = pd.DataFrame(data)
print(df)

# Output:
#        name  age          city  department
# 0     Alice   25      New York          HR
# 1       Bob   30   Los Angeles     Finance
# 2   Charlie   35       Chicago          IT
# 3     David   40       Houston   Marketing
# 4       Eve   45       Phoenix       Sales
# 5     Frank   50  Philadelphia         R&D
# 6     Grace   55   San Antonio       Admin
# 7    Hannah   60     San Diego          HR
# 8     Alice   25      New York     Finance
# 9       Bob   30   Los Angeles          IT
# 10  Charlie   35       Chicago   Marketing
# 11    David   40       Houston       Sales
# 12      Eve   45       Phoenix         R&D
# 13    Frank   50  Philadelphia       Admin
# 14    Grace   55   San Antonio          HR
# 15   Hannah   60     San Diego     Finance
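Before sampling, it can help to check how many rows each group holds, since that limits how many rows can be drawn per group without replacement. The following is a minimal sketch that rebuilds the same sample data in compact form and counts the rows per city/department pair. The helper lists and the group_sizes name are illustrative and not part of the original article.

```python
import pandas as pd

# Rebuild the article's sample data compactly: 8 people repeated twice,
# with 7 departments cycled across the 16 rows.
names = ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah']
ages = [25, 30, 35, 40, 45, 50, 55, 60]
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston',
          'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego']
departments = ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin']

df = pd.DataFrame({
    'name': [names[i % 8] for i in range(16)],
    'age': [ages[i % 8] for i in range(16)],
    'city': [cities[i % 8] for i in range(16)],
    'department': [departments[i % 7] for i in range(16)],
})

# Count the rows in each (city, department) group.
group_sizes = df.groupby(['city', 'department']).size()
print(group_sizes)
```

For this data every group holds exactly one row, so drawing one row per group returns the whole DataFrame. Real data sets usually have larger groups.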
Group by two columns and perform stratified random sampling
Assume that you want to group by the city and department columns and take a random sample from each group. You can do this:
import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah',
             'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston',
             'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego',
             'New York', 'Los Angeles', 'Chicago', 'Houston',
             'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'],
    'department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
                   'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
}
df = pd.DataFrame(data)
print(df)

# Group by the 'city' and 'department' columns and randomly draw 1 row from each group
sampled_df = (
    df.groupby(['city', 'department'])
      .apply(lambda x: x.sample(n=1, random_state=42))
      .reset_index(drop=True)
)
print(sampled_df)
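Since pandas 1.1, a grouped DataFrame also has its own sample method, which avoids apply altogether. This matters because pandas 2.2 started emitting a deprecation warning when apply operates on the grouping columns. The sketch below is an alternative to the code above, not the article's original method, and the small DataFrame is made up for illustration.

```python
import pandas as pd

# Illustrative data: the three groups hold 2, 1 and 3 rows.
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cat', 'Dan', 'Eli', 'Fay'],
    'city': ['Chicago', 'Chicago', 'Chicago', 'Houston', 'Houston', 'Houston'],
    'department': ['HR', 'HR', 'IT', 'IT', 'IT', 'IT'],
})

# Draw one row from every (city, department) group without using apply.
sampled_df = (
    df.groupby(['city', 'department'])
      .sample(n=1, random_state=42)
      .reset_index(drop=True)
)
print(sampled_df)
```

The result keeps all original columns, including the grouping columns, and contains one row per group.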
Specific steps
- Group by multiple columns: use groupby(['city', 'department']) to group the rows by the city and department columns.
- Random sampling for each group: use apply with a lambda function to call sample(n=1) on each group and randomly draw one row. The random_state parameter sets the random seed so that the results are reproducible.
- Reset index: use reset_index(drop=True) to reset the index so that the grouping keys are not retained as index levels.
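One caveat with the steps above: sample raises a ValueError when n is larger than the group and replace is left at its default of False. A minimal sketch of one way to cap the sample size per group follows. The name sample_up_to is a hypothetical helper, and it loops over the groups instead of using apply so that the grouping columns are always kept in the result.

```python
import pandas as pd


def sample_up_to(df, keys, n, seed=None):
    """Draw at most n rows from each group defined by keys (hypothetical helper)."""
    parts = [
        # Cap n at the group size so small groups do not raise ValueError.
        group.sample(n=min(len(group), n), random_state=seed)
        for _, group in df.groupby(keys)
    ]
    return pd.concat(parts).reset_index(drop=True)


# Illustrative data: the three groups hold 3, 1 and 2 rows.
df = pd.DataFrame({
    'city': ['Chicago'] * 4 + ['Houston'] * 2,
    'department': ['HR', 'HR', 'HR', 'IT', 'IT', 'IT'],
    'value': list(range(6)),
})

sampled_df = sample_up_to(df, ['city', 'department'], n=2, seed=42)
print(sampled_df)
```

With n=2, the group that holds a single row contributes that one row, and the other two groups contribute two rows each.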
Output example
In the sample data every city/department pair occurs exactly once, so each group holds a single row and drawing one row per group returns all 16 rows, ordered by city and then department:
        name  age          city  department
0   Charlie   35       Chicago          IT
1   Charlie   35       Chicago   Marketing
2     David   40       Houston   Marketing
3     David   40       Houston       Sales
4       Bob   30   Los Angeles     Finance
5       Bob   30   Los Angeles          IT
6     Alice   25      New York     Finance
7     Alice   25      New York          HR
8     Frank   50  Philadelphia       Admin
9     Frank   50  Philadelphia         R&D
10      Eve   45       Phoenix         R&D
11      Eve   45       Phoenix       Sales
12    Grace   55   San Antonio       Admin
13    Grace   55   San Antonio          HR
14   Hannah   60     San Diego     Finance
15   Hannah   60     San Diego          HR
With data in which groups hold more than one row, the row drawn from each group depends on random_state.
This way, you can group a DataFrame by multiple columns and draw a stratified random sample from each group. This technique is useful in data analysis and machine learning, where it helps you obtain a representative small sample from a large data set.
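To keep each group's share of the data roughly intact, a fixed fraction can be drawn per group instead of a fixed row count. The following sketch assumes pandas 1.1 or later, where grouped objects have a sample method with a frac parameter. The data is synthetic and only for illustration.

```python
import pandas as pd

# Synthetic data: three strata with 4, 2 and 6 rows.
df = pd.DataFrame({
    'city': ['Chicago'] * 6 + ['Houston'] * 6,
    'department': ['HR'] * 4 + ['IT'] * 2 + ['IT'] * 6,
    'value': list(range(12)),
})

# Draw 50% of the rows from every (city, department) group.
sampled_df = df.groupby(['city', 'department']).sample(frac=0.5, random_state=0)

# Compare the group sizes before and after sampling.
print(df.groupby(['city', 'department']).size())
print(sampled_df.groupby(['city', 'department']).size())
```

Each group shrinks by the same factor, so the sample preserves the relative sizes of the strata.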