Stratified random sampling by two columns in pandas

In some cases you need stratified random sampling where the strata are defined by more than one column. pandas supports this by combining the groupby, apply, and sample methods: group by the columns first, then randomly sample each group.

Sample data

First, create a DataFrame that includes the two columns to group by (city and department):

import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah', 
             'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego',
             'New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'],
    'department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
                   'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
}
df = pd.DataFrame(data)

print(df)
# Output:
#        name  age          city department
# 0     Alice   25      New York         HR
# 1       Bob   30   Los Angeles    Finance
# 2   Charlie   35       Chicago         IT
# 3     David   40       Houston  Marketing
# 4       Eve   45       Phoenix      Sales
# 5     Frank   50  Philadelphia        R&D
# 6     Grace   55   San Antonio      Admin
# 7    Hannah   60     San Diego         HR
# 8     Alice   25      New York    Finance
# 9       Bob   30   Los Angeles         IT
# 10  Charlie   35       Chicago  Marketing
# 11    David   40       Houston      Sales
# 12      Eve   45       Phoenix        R&D
# 13    Frank   50  Philadelphia      Admin
# 14    Grace   55   San Antonio         HR
# 15   Hannah   60     San Diego    Finance

Group by two columns and perform stratified random sampling

Assume you want to group by the city and department columns and draw a random sample from each group. You can do this:

import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah', 
             'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego',
             'New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'],
    'department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
                   'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
}
df = pd.DataFrame(data)

print(df)

# Group by the 'city' and 'department' columns and randomly draw 1 row from each group
sampled_df = df.groupby(['city', 'department']).apply(lambda x: x.sample(n=1, random_state=42)).reset_index(drop=True)

print(sampled_df)

Specific steps

Group by multiple columns: groupby(['city', 'department']) groups the rows by the city and department columns.
Random sampling within each group: apply with a lambda calls sample(n=1) on each group to draw one row at random. The random_state parameter sets the random seed so that the result is reproducible.
Reset the index: reset_index(drop=True) replaces the index with a fresh 0-based one, so the grouping keys are not kept as index levels.
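
The same steps can also be written without apply: since pandas 1.1, grouped objects have a sample method of their own. pandas 2.2 also emits a deprecation warning when apply operates on the grouping columns, so this form may be the more convenient one on recent versions. The following is a minimal sketch, assuming a small hypothetical data set (not the one used in this article) in which every group holds two rows, so that the draw is not trivial:

```python
import pandas as pd

# Hypothetical data: each (city, department) group holds two rows.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'city': ['New York', 'New York', 'New York', 'New York',
             'Chicago', 'Chicago', 'Chicago', 'Chicago'],
    'department': ['HR', 'HR', 'IT', 'IT', 'HR', 'HR', 'IT', 'IT'],
})

# Draw one row per (city, department) group directly on the grouped object.
sampled_df = (
    df.groupby(['city', 'department'])
      .sample(n=1, random_state=42)
      .reset_index(drop=True)
)

print(sampled_df)
```

The result has one row for each of the four groups and keeps the city and department columns alongside name.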

Output example

The output depends on the sample data. In the data above, every (city, department) pair occurs exactly once, so each group holds a single row and drawing n=1 per group returns all 16 rows, ordered by group key:

       name  age          city department
0   Charlie   35       Chicago         IT
1   Charlie   35       Chicago  Marketing
2     David   40       Houston  Marketing
3     David   40       Houston      Sales
4       Bob   30   Los Angeles    Finance
5       Bob   30   Los Angeles         IT
6     Alice   25      New York    Finance
7     Alice   25      New York         HR
8     Frank   50  Philadelphia      Admin
9     Frank   50  Philadelphia        R&D
10      Eve   45       Phoenix        R&D
11      Eve   45       Phoenix      Sales
12    Grace   55   San Antonio      Admin
13    Grace   55   San Antonio         HR
14   Hannah   60     San Diego    Finance
15   Hannah   60     San Diego         HR
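
Whether sampling actually discards rows depends on how many rows each group holds, which groupby(...).size() reports. The following is a minimal sketch that uses only the city and department columns of the sample data; the names cities, departments, and group_sizes are chosen here for illustration:

```python
import pandas as pd

# The city and department columns from the sample data.
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston',
          'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'] * 2
departments = ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
               'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
df = pd.DataFrame({'city': cities, 'department': departments})

# Number of rows in each (city, department) group.
group_sizes = df.groupby(['city', 'department']).size()
print(group_sizes)

# True when every group holds exactly one row.
print((group_sizes == 1).all())
```

Every count is 1 for this data, so drawing n=1 per group keeps all 16 rows. With larger groups, the sample would be smaller than the input.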

In this way, you can group a DataFrame by multiple columns and draw a stratified random sample from each group. The technique is useful in data analysis and machine learning, where it helps you obtain a small, representative sample from a large data set.
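
For a sample that keeps the proportions of a large data set, a fraction can be drawn from each group instead of a fixed number of rows. The following is a minimal sketch, assuming a hypothetical 100-row data set and the frac parameter of the grouped sample method (available since pandas 1.1); the value column and the 10% fraction are illustrative choices:

```python
import pandas as pd

# Hypothetical data set: 100 rows spread unevenly across four groups.
df = pd.DataFrame({
    'city': ['New York'] * 60 + ['Chicago'] * 40,
    'department': ['HR'] * 40 + ['IT'] * 20 + ['HR'] * 30 + ['IT'] * 10,
    'value': range(100),
})

# Take 10% of each (city, department) group so the sample keeps the same proportions.
sampled_df = df.groupby(['city', 'department']).sample(frac=0.1, random_state=42)

# Rows drawn from each group.
print(sampled_df.groupby(['city', 'department']).size())
```

Here the groups of 40, 20, 30, and 10 rows contribute 4, 2, 3, and 1 rows, so the 10-row sample mirrors the group proportions of the full data.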
