SoFunction
Updated on 2025-03-04

Stratified random sampling by two columns in pandas

In some cases, you may need to perform stratified random sampling on groups defined by more than one column. pandas provides flexible data manipulation methods for this: you can combine groupby() and apply() with sample(). Specifically, you group by multiple columns first and then randomly sample each group.

Sample data

First, create a sample DataFrame with four columns:

import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah', 
             'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego',
             'New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'],
    'department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
                   'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
}
df = pd.DataFrame(data)

print(df)
# Output:
#        name  age         city department
# 0     Alice   25     New York         HR
# 1       Bob   30  Los Angeles    Finance
# 2   Charlie   35      Chicago         IT
# 3     David   40      Houston  Marketing
# 4       Eve   45      Phoenix      Sales
# 5     Frank   50  Philadelphia        R&D
# 6     Grace   55   San Antonio      Admin
# 7    Hannah   60     San Diego         HR
# 8     Alice   25     New York    Finance
# 9       Bob   30  Los Angeles         IT
# 10  Charlie   35      Chicago  Marketing
# 11    David   40      Houston      Sales
# 12      Eve   45      Phoenix        R&D
# 13    Frank   50  Philadelphia      Admin
# 14    Grace   55   San Antonio         HR
# 15   Hannah   60     San Diego    Finance
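Before sampling, it can help to check how many rows each stratum contains. The following is a minimal sketch that rebuilds only the city and department columns from the sample data above; the names df_check and group_sizes are introduced here for illustration and are not part of the original example.

```python
import pandas as pd

# Rebuild the two grouping columns from the sample data above
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston',
          'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'] * 2
departments = ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
               'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
df_check = pd.DataFrame({'city': cities, 'department': departments})

# Count the rows in each (city, department) group
group_sizes = df_check.groupby(['city', 'department']).size()

print(len(group_sizes))   # number of distinct groups
print(group_sizes.max())  # size of the largest group
```

In this sample data every (city, department) pair is unique, so each group holds exactly one row. On real data the group sizes usually differ, and they limit how many rows can be drawn per group without replacement.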

Group by two columns and perform stratified random sampling

Suppose you want to group by the city and department columns and draw a random sample from each group. You can do this:

import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah', 
             'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego',
             'New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'],
    'department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR',
                   'Finance', 'IT', 'Marketing', 'Sales', 'R&D', 'Admin', 'HR', 'Finance']
}
df = pd.DataFrame(data)

print(df)
# Group by the 'city' and 'department' columns and randomly draw 1 row from each group
sampled_df = df.groupby(['city', 'department']).apply(lambda x: x.sample(n=1, random_state=42)).reset_index(drop=True)

print(sampled_df)
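As an alternative sketch, pandas 1.1 and later also offer a sample() method directly on the groupby object, which avoids apply() and the lambda. Recent pandas releases emit a deprecation warning when apply() operates on the grouping columns, so this form may be preferable there. The frame df_demo below is a small hypothetical example, not the article's data, chosen so that some groups contain more than one row.

```python
import pandas as pd

# Hypothetical data in which some (city, department) groups have several rows
df_demo = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'city': ['New York', 'New York', 'New York', 'Chicago', 'Chicago', 'Chicago'],
    'department': ['HR', 'HR', 'Finance', 'IT', 'IT', 'IT'],
})

# Draw one row per (city, department) group without using apply
sampled_demo = (
    df_demo.groupby(['city', 'department'])
    .sample(n=1, random_state=42)
    .reset_index(drop=True)
)

print(sampled_demo)
```

The result keeps all original columns and contains one row for each of the three groups. Which row is picked inside a multi-row group depends on random_state.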

Specific steps

  • Group by multiple columns: use groupby(['city', 'department']) to group the rows by the city and department columns.
  • Sample each group at random: use apply() with a lambda function to call sample(n=1) on each group, drawing one row per group. The random_state parameter sets the random seed so that the results are reproducible.
  • Reset the index: use reset_index(drop=True) to reset the index so that the grouping keys are not kept as index levels.
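One practical detail the steps above do not cover: sample(n=...) raises a ValueError when a group has fewer than n rows and replace is left at its default of False. The sketch below shows one possible way to cap n at the group size. The helper sample_per_group and the frame df_small are hypothetical names introduced for illustration, and the helper iterates over the groups instead of using apply().

```python
import pandas as pd

def sample_per_group(df, keys, n, seed=None):
    """Draw up to n rows from each group defined by keys (hypothetical helper)."""
    parts = [
        # Cap n at the group size so small groups do not raise ValueError
        group.sample(n=min(len(group), n), random_state=seed)
        for _, group in df.groupby(keys)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Hypothetical data: groups of size 3, 1 and 2
df_small = pd.DataFrame({
    'city': ['New York'] * 3 + ['Chicago'] + ['Houston'] * 2,
    'department': ['HR'] * 3 + ['IT'] + ['Sales'] * 2,
    'employee_id': range(6),
})

result = sample_per_group(df_small, ['city', 'department'], n=2, seed=42)
print(result)
```

With n=2, the three-row group contributes two rows, the single-row group contributes its only row, and the two-row group contributes both rows.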

Output example

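The exact output depends on the data and on the pandas version. Every (city, department) pair in the sample data occurs exactly once, so each group contains a single row, and drawing one row per group returns all 16 rows sorted by the grouping keys. With a pandas version in which apply() passes the grouping columns to the function, the output looks like this:

       name  age          city department
0   Charlie   35       Chicago         IT
1   Charlie   35       Chicago  Marketing
2     David   40       Houston  Marketing
3     David   40       Houston      Sales
4       Bob   30   Los Angeles    Finance
5       Bob   30   Los Angeles         IT
6     Alice   25      New York    Finance
7     Alice   25      New York         HR
8     Frank   50  Philadelphia      Admin
9     Frank   50  Philadelphia        R&D
10      Eve   45       Phoenix        R&D
11      Eve   45       Phoenix      Sales
12    Grace   55   San Antonio      Admin
13    Grace   55   San Antonio         HR
14   Hannah   60     San Diego    Finance
15   Hannah   60     San Diego         HR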

This way, you can group a DataFrame by multiple columns and draw a random sample from each group. This technique is useful in data analysis and machine learning, where it helps you obtain small samples from large data sets that still cover every group.
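When the goal is a sample that preserves the relative size of each stratum, a fraction can be drawn per group instead of a fixed count. The following is a minimal sketch using the frac parameter of the groupby sample() method (pandas 1.1 or later). The frame df_big, its employee_id column, and the group sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three (city, department) groups of size 4, 8 and 2
rows = (
    [('New York', 'HR')] * 4
    + [('New York', 'IT')] * 8
    + [('Chicago', 'HR')] * 2
)
df_big = pd.DataFrame(rows, columns=['city', 'department'])
df_big['employee_id'] = range(len(df_big))

# Draw half of the rows from every group
half = df_big.groupby(['city', 'department']).sample(frac=0.5, random_state=0)

# Compare the per-group counts before and after sampling
counts = half.groupby(['city', 'department']).size()
print(counts)
```

Each group shrinks by the same factor here (4 to 2, 8 to 4, 2 to 1), so the proportions between the groups in the sample match those in the full data.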
