How to perform random sampling in pandas

During data analysis and processing, it is often necessary to sample data randomly, either to obtain a small representative sample or to split the data. pandas makes random sampling convenient.

Basic usage: DataFrame's sample method

A pandas DataFrame provides the sample method for random sampling. Its basic usage and common parameters are explained below:

Sample data

First, create an example DataFrame:

import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego']
}
df = pd.DataFrame(data)
print(df)

Example 1: Randomly draw a specified number of rows

Use the n parameter to specify the number of rows to draw:

# Randomly draw 3 rows
sampled_df = df.sample(n=3)

print(sampled_df)

Example 2: Randomly sample a fraction of the rows

Use the frac parameter to specify the fraction of rows to draw. For example, frac=0.5 draws a random 50% of the rows:

# Randomly draw 50% of the rows
sampled_df = df.sample(frac=0.5)

print(sampled_df)
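
As a further illustration of frac, the sketch below shows two common uses: shuffling all rows with frac=1, and a simple train/test split. The 75/25 ratio, the seed value, and the variable names train and test are assumptions chosen for illustration, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
})

# frac=1 draws every row exactly once, which amounts to shuffling the rows
shuffled = df.sample(frac=1, random_state=0)

# Hypothetical 75/25 split: sample the training rows first,
# then drop those index labels to obtain the remaining rows
train = df.sample(frac=0.75, random_state=0)
test = df.drop(train.index)

print(len(shuffled), len(train), len(test))
```

With the 8-row frame this gives 6 training rows and 2 test rows. Dropping by index only works cleanly when the index labels are unique, as they are here.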

Example 3: Specify a random seed

To get the same result every time you sample, use the random_state parameter to set the random seed:

# Randomly draw 3 rows with a fixed random seed
sampled_df = df.sample(n=3, random_state=1)

print(sampled_df)
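
A quick way to check the effect of random_state is to sample twice with the same seed and compare the results. This is a minimal sketch; the names first and second are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50, 55, 60]})

# Two independent calls with the same seed on the same data
first = df.sample(n=3, random_state=1)
second = df.sample(n=3, random_state=1)

# Same seed and same input: the two samples contain identical rows
print(first.equals(second))
```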

Example 4: Random sampling by row or by column

By default, sample draws rows (axis=0). Set axis=1 to draw columns instead:

# Randomly select 2 columns
sampled_df = df.sample(n=2, axis=1)

print(sampled_df)
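
To make the effect of axis=1 concrete, the sketch below checks the shape of the result: all rows are kept and only the set of columns changes. The seed is an arbitrary assumption added so the run is repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston',
             'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'],
})

# Sample 2 of the 3 columns; every row is retained
cols = df.sample(n=2, axis=1, random_state=1)

print(cols.shape)          # (number of rows, number of sampled columns)
print(list(cols.columns))  # which columns were drawn
```

The axis argument also accepts the string labels 'index' and 'columns' in place of 0 and 1.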

Example 5: Sampling with or without replacement

By default, sample draws without replacement, so each row can be selected at most once. Set replace=True to sample with replacement:

# Sample with replacement, randomly drawing 10 rows
sampled_df = df.sample(n=10, replace=True, random_state=1)

print(sampled_df)
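
The difference matters most when n exceeds the number of available rows. The sketch below, with the illustrative variable names error_message and with_repl, shows that asking for 10 rows from an 8-row frame raises a ValueError without replacement, while replace=True succeeds and necessarily repeats some rows.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50, 55, 60]})

# Without replacement, the sample cannot be larger than the population
error_message = None
try:
    df.sample(n=10)  # replace defaults to False
except ValueError as exc:
    error_message = str(exc)
print('Without replacement:', error_message)

# With replacement, rows may be drawn more than once
with_repl = df.sample(n=10, replace=True, random_state=1)
print(len(with_repl))
print(with_repl.index.is_unique)  # False: 10 draws from 8 rows must repeat
```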

Example 6: Stratified random sampling by a column

Sometimes you need stratified random sampling based on the values of a column. You can combine groupby and apply with sample to achieve this:

# Group by the 'city' column and randomly select 1 row for each city
sampled_df = df.groupby('city').apply(lambda x: x.sample(n=1, random_state=1)).reset_index(drop=True)

print(sampled_df)
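
As an alternative sketch, pandas 1.1 and later also expose a sample method directly on the groupby object, which avoids the apply and lambda. The data below is hypothetical and differs from the sample DataFrame above: each city appears twice, so drawing 1 row per group actually discards rows.

```python
import pandas as pd

# Hypothetical data in which each city appears more than once
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'city': ['Chicago', 'Chicago', 'Houston', 'Houston', 'Phoenix', 'Phoenix'],
})

# DataFrameGroupBy.sample draws n rows from each group
# and keeps the original index labels and all columns
per_city = df.groupby('city').sample(n=1, random_state=1)

print(per_city)
```

The result has one row per city. Passing frac instead of n draws the same proportion from every group.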

Summary

The sample method's parameters cover most random sampling needs: drawing a fixed number of rows, drawing a fraction of the rows, fixing the random seed, sampling with or without replacement, sampling columns, and, combined with groupby, stratified sampling. These features are useful in data analysis and processing for quickly obtaining small representative samples.
