During data analysis and processing, you often need to draw a random sample of the data, either to obtain a small representative subset or to split the data. pandas makes random sampling straightforward.
Basic usage: DataFrame's sample method
The pandas DataFrame provides the sample method for random sampling. The following sections explain its basic usage and common parameters:
Sample data
First, create an example DataFrame:
import pandas as pd

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego']
}
df = pd.DataFrame(data)
print(df)
Example 1: Randomly draw a specified number of rows
Use the n parameter to specify the number of rows to draw:
# Randomly draw 3 rows
sampled_df = df.sample(n=3)
print(sampled_df)
Example 2: Random sampling in proportion
Use the frac parameter to specify the fraction of rows to draw. For example, frac=0.5 draws a random 50% of the rows:
# Randomly draw 50% of the rows
sampled_df = df.sample(frac=0.5)
print(sampled_df)
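One common use of frac is splitting data into two parts. The sketch below is one possible approach, not something this article prescribes: it draws 80% of the rows as a training set and treats the remaining rows as a test set. The value column, the 80/20 ratio, and the seed 42 are illustrative assumptions, and the drop step assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical data for illustration; any DataFrame with a unique index works.
df = pd.DataFrame({"value": range(10)})

# Draw 80% of the rows as the training set (the seed is an arbitrary choice).
train = df.sample(frac=0.8, random_state=42)

# Every row that was not drawn becomes the test set.
# This relies on the index labels being unique.
test = df.drop(train.index)

print(len(train), len(test))
```

Because the test set is built from the rows left over after sampling, the two parts never overlap.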
Example 3: Specify a random seed when sampling
To make sampling return the same result on every run, use the random_state parameter to set the random seed:
# Randomly draw 3 rows with a fixed random seed
sampled_df = df.sample(n=3, random_state=1)
print(sampled_df)
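To see the effect of the seed, the following minimal sketch draws twice with the same random_state and compares the results. The 100-row value column is made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"value": range(100)})

# Two draws with the same seed select the same rows in the same order.
first = df.sample(n=5, random_state=1)
second = df.sample(n=5, random_state=1)
print(first.equals(second))

# Without a seed, each call is free to select different rows.
unseeded = df.sample(n=5)
print(unseeded)
```

The first print shows True, while the unseeded draw will usually change from run to run.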
Example 4: Random sampling by row or column
By default, sample draws rows (axis=0). Set axis=1 to sample columns instead:
# Randomly select 2 columns
sampled_df = df.sample(n=2, axis=1)
print(sampled_df)
Example 5: Sampling with or without replacement
By default, sample draws without replacement, so each row can be drawn at most once. Set replace=True to sample with replacement, which allows the same row to be drawn more than once:
# Sample with replacement and draw 10 rows (more than the 8 rows in df)
sampled_df = df.sample(n=10, replace=True, random_state=1)
print(sampled_df)
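The distinction matters once n exceeds the number of rows. In the sketch below, which uses a made-up three-row DataFrame, asking for 5 rows without replacement raises a ValueError, while the same request with replace=True succeeds and repeats some rows.

```python
import pandas as pd

# Hypothetical three-row DataFrame for illustration.
df = pd.DataFrame({"value": [1, 2, 3]})

# Without replacement, n cannot exceed the number of rows.
try:
    df.sample(n=5)
    raised = False
except ValueError as exc:
    raised = True
    print("sampling failed:", exc)

# With replacement, rows may repeat, so n=5 is allowed.
bootstrap = df.sample(n=5, replace=True, random_state=1)
print(bootstrap)

# Drawing 5 times from 3 rows must repeat at least one index label.
print(bootstrap.index.is_unique)
```

Repeated index labels in the result are expected here; call reset_index(drop=True) if a fresh index is needed afterwards.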
Example 6: Stratified random sampling by a column
Sometimes you need stratified random sampling based on the values of a column. You can combine the groupby and apply methods with sample to achieve this:
# Group by the 'city' column and randomly draw 1 row from each city.
# Each city appears only once in the sample data, so every row is returned here.
sampled_df = df.groupby('city').apply(lambda x: x.sample(n=1, random_state=1)).reset_index(drop=True)
print(sampled_df)
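As a possible alternative, pandas 1.1 and later also provide a sample method directly on groupby objects, which avoids apply. The sketch below assumes such a version is installed. The region column and its values are hypothetical, chosen so that each group holds several rows.

```python
import pandas as pd

# Hypothetical data: 'region' is an illustrative column with repeated values.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "region": ["East", "East", "East", "West", "West", "West"],
})

# Assumes pandas >= 1.1, where groupby objects have their own sample method.
# Draw 2 rows from each region; the seed is an arbitrary choice.
sampled = df.groupby("region").sample(n=2, random_state=1)
print(sampled)
```

Each group contributes exactly 2 rows, and the original index labels are kept, so the result can be traced back to df.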
Summary
The parameters of the sample method cover most random sampling needs: drawing a fixed number of rows, drawing a fraction of the rows, setting a random seed, sampling with or without replacement, and sampling columns instead of rows. Combined with groupby, it also supports stratified sampling. These features make it quick to obtain a small representative sample for analysis.