
Implementation of data preprocessing with Python pandas

Case: Preprocessing of e-commerce customer data

This case preprocesses an e-commerce customer dataset using the pandas library. Data preprocessing is an important step in effective data analysis; its goal is to get the data into a state suitable for analysis. The case is divided into several steps that demonstrate how to import data, view dataset information, handle missing values, convert data formats, remove duplicates, standardize data, and export the result.

Test data

Here is the test data we will process, using e-commerce customer orders as an example:

CustomerID  OrderDate   Product      Quantity  Price
C001        2024-01-10  Smartphone   2         500.00
C002        2024-01-12  Laptop       1         1200.00
C003        NaN         Earphone     3         50.00
C001        2024-01-15  Tablet PC    1         300.00
C004        2024-01-18  Laptop       1         1200.00
C005        2024-01-20  Smartphone   2         NaN
C001        2024-01-15  Tablet PC    1         300.00
C006        2024-01-22  Earphone     2         100.00
C007        2024-01-25  Smart Watch  1         200.00
C008        2024-01-28  Tablet PC    3         900.00
C009        2024-01-30  Laptop       1         1500.00
C010        2024-02-01  Smartphone   4         2000.00
C011        2024-02-05  Earphone     5         250.00
C012        2024-02-08  Tablet PC    1         300.00
C013        2024-02-10  Smart Watch  2         400.00
C014        2024-02-12  Laptop       2         2400.00
C015        2024-02-15  Smartphone   1         1000.00
C016        2024-02-18  Tablet PC    2         600.00
C017        2024-02-20  Smart Watch  3         600.00
C018        2024-02-22  Earphone     1         50.00
C019        2024-02-25  Smartphone   2         1200.00
C020        2024-02-28  Laptop       1         1300.00

Step 1: Data Import

First, use pandas to import the CSV file and load the data.

import pandas as pd

# Import the CSV file
df = pd.read_csv('ecommerce_customers.csv')

Analysis: First import the pandas library. The pd.read_csv() method loads the CSV file and stores it in a DataFrame named df.
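If you do not have ecommerce_customers.csv on hand, the file can be created from the test data above before running Step 1. Below is a minimal sketch that writes only the first five sample rows (an illustration, not part of the original case):

# Build a small sample from the test data above (first five rows only)
import pandas as pd

sample = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C004'],
    'OrderDate': ['2024-01-10', '2024-01-12', None, '2024-01-15', '2024-01-18'],
    'Product': ['Smartphone', 'Laptop', 'Earphone', 'Tablet PC', 'Laptop'],
    'Quantity': [2, 1, 3, 1, 1],
    'Price': [500.00, 1200.00, 50.00, 300.00, 1200.00],
})

# Write it out so pd.read_csv('ecommerce_customers.csv') in Step 1 can load it
sample.to_csv('ecommerce_customers.csv', index=False)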

Step 2: View the basic information of the dataset

Use pandas' various methods to view the structure and content of the data to have a preliminary understanding of the data.

# View the first few rows of the data
print(df.head())

# View column information and data types
df.info()

# Summary statistics for the numerical columns
print(df.describe())

Analysis

df.head(): View the first few rows of the data to help quickly understand its basic structure.

Running results

  CustomerID   OrderDate     Product  Quantity   Price
0       C001  2024-01-10  Smartphone         2   500.0
1       C002  2024-01-12      Laptop         1  1200.0
2       C003         NaN    Earphone         3    50.0
3       C001  2024-01-15   Tablet PC         1   300.0
4       C004  2024-01-18      Laptop         1  1200.0

With the head() method, we can quickly view the first 5 rows of data and get a sense of its general content.

df.info(): Displays the columns, non-null counts, and data types of the dataset, helping to understand its overall structure.

Running results

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CustomerID  20 non-null     object 
 1   OrderDate   19 non-null     object 
 2   Product     20 non-null     object 
 3   Quantity    20 non-null     int64  
 4   Price       19 non-null     float64
dtypes: float64(1), int64(1), object(3)
memory usage: 928.0+ bytes

With the info() method, we can see each column's name, data type, and number of non-null values, which is very helpful for judging later whether there are missing values.

df.describe(): Performs statistical analysis of the numerical columns, providing the mean, standard deviation, and other summary statistics.

Running results

        Quantity        Price
count  20.000000    19.000000  
mean    2.000000   810.526316  
std     1.055597   714.053229
min     1.000000    50.000000
25%     1.000000   300.000000
50%     2.000000  1000.000000
75%     3.000000  1200.000000
max     5.000000  2400.000000

With the describe() method, we can see summary statistics for the numeric columns (such as Quantity and Price), including the mean, median, and standard deviation, which are very important for analyzing the distribution and characteristics of the data.

`count` represents the number of non-null values in the column. For the `Quantity` column, all 20 values are non-null; for the `Price` column, there are only 19 non-null values, indicating one missing value.

`mean` represents the average value of the column. The average purchase volume for the `Quantity` column is 2; the average price for the `Price` column is 810.53.

`min` represents the minimum value in the column. The minimum purchase volume for the `Quantity` column is 1; the minimum price for the `Price` column is 50.

`25%` represents the first quartile, that is, 25% of the data points are less than or equal to this value. For the `Quantity` column, the first quartile is 1; for the `Price` column, this value is 300.

`50%` represents the median, i.e. 50% of the data points are less than or equal to this value. The median of the `Quantity` column is 2; the median of the `Price` column is 1000.

`75%` represents the third quartile, i.e. 75% of the data points are less than or equal to this value. The third quartile of the `Quantity` column is 3; the third quartile of the `Price` column is 1200.

`max` represents the maximum value in the column. The maximum purchase volume for the `Quantity` column is 5; the maximum price for the `Price` column is 2400.
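As a cross-check, each of these figures can also be computed directly with individual pandas methods; a minimal sketch, assuming the DataFrame from Step 1 is loaded as df:

# Individual statistics matching the describe() output for the Price column
print(df['Price'].count())          # number of non-null prices
print(df['Price'].mean())           # mean
print(df['Price'].quantile(0.25))   # first quartile (25%)
print(df['Price'].median())         # median (50%)
print(df['Price'].quantile(0.75))   # third quartile (75%)
print(df['Price'].min(), df['Price'].max())  # minimum and maximum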

Step 3: Process the missing values

Handle the missing values in the data. In this case, the OrderDate and Price columns contain missing values.

# View missing values
def missing_values(df):
    return df.isnull().sum()

print(missing_values(df))

# Fill missing values in the Price column with the column mean
df['Price'].fillna(df['Price'].mean(), inplace=True)

# Fill missing order dates with a default date
df['OrderDate'].fillna('2024-01-01', inplace=True)

Analysis

  • df.isnull().sum(): Counts the number of missing values in each column, helping to identify the gaps in the data.
  • df['Price'].fillna(df['Price'].mean(), inplace=True): Fills the missing values of the Price column with the column mean to preserve data integrity. The inplace=True parameter specifies whether the operation modifies the original DataFrame; with inplace=True, the operation changes the original DataFrame directly instead of returning a new one. This saves memory, but it should be used with care, since the modification cannot be undone.
  • df['OrderDate'].fillna('2024-01-01', inplace=True): Fills the missing values of the OrderDate column with the default date 2024-01-01. A quick check that the filling worked is shown after this list.
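To confirm that the filling worked, the missing-value count can be re-run; a minimal sketch, assuming the two fillna() calls above have already been applied to df:

# Re-check missing values after filling; every column should now show 0
print(df.isnull().sum())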

Step 4: Data Format Conversion

WillOrderDateThe column is converted from a string to a date type for subsequent time series analysis.

# Convert OrderDate to a datetime type
df['OrderDate'] = pd.to_datetime(df['OrderDate'])

Analysis: pd.to_datetime(df['OrderDate']) converts the OrderDate column to a datetime format, which facilitates subsequent date-related analysis, such as sorting or time series analysis.
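Once the column is a true datetime type, date-based operations become straightforward. Below is a minimal sketch of two such operations (an illustration, assuming df from the previous steps):

# Sort orders chronologically
df = df.sort_values('OrderDate')

# Count orders per month using the .dt accessor
print(df['OrderDate'].dt.to_period('M').value_counts().sort_index())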

Step 5: Process duplicate values

Find and delete duplicate rows in the dataset to ensure the accuracy of the data.

# Find duplicate rows
duplicates = df[df.duplicated()]
print(duplicates)

# Delete duplicate rows
df.drop_duplicates(inplace=True)

Analysis

  • df.duplicated(): Flags the duplicate rows in the dataset.
  • df.drop_duplicates(inplace=True): Deletes all duplicate rows so the dataset contains no repeated records, improving the accuracy of the analysis results. A version that also reports how many rows were removed is sketched after this list.
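If you also want to know how many rows the deduplication removed, one option is to record the row count first; a minimal sketch of an alternative to the inplace call above:

# Count rows before and after removing duplicates
rows_before = len(df)
df = df.drop_duplicates()
print(f"Removed {rows_before - len(df)} duplicate row(s); {len(df)} rows remain")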

Step 6: Data Standardization

Standardize the Price column for subsequent analysis and modeling.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['Price'] = scaler.fit_transform(df[['Price']])

Analysis

  • Import MinMaxScaler from scikit-learn, which is used to scale data.
  • scaler.fit_transform(df[['Price']]): Scales the values of the Price column to the range 0 to 1; normalized data is better suited to machine learning models. An equivalent pandas-only version is sketched after this list.
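If you prefer not to add a scikit-learn dependency, the same min-max scaling can be written directly in pandas; a minimal sketch of the equivalent computation (assuming missing prices were already filled in Step 3):

# Equivalent min-max normalization without scikit-learn:
# (x - min) / (max - min) maps the column onto the range [0, 1]
price_min = df['Price'].min()
price_max = df['Price'].max()
df['Price'] = (df['Price'] - price_min) / (price_max - price_min)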

Step 7: Save the preprocessed data

Finally, the preprocessed data is saved as a new CSV file for subsequent analysis.

# Export the cleaned data as a new CSV file
df.to_csv('ecommerce_customers_cleaned.csv', index=False)

Analysis: The df.to_csv() method saves the processed data as a new CSV file. index=False indicates that the index column is not written to the file.
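Note that CSV does not preserve the datetime type, so when the cleaned file is read back later, OrderDate should be parsed again; a minimal sketch of a round-trip check using the file name above:

# Reload the cleaned file; parse_dates restores OrderDate as a datetime column
cleaned = pd.read_csv('ecommerce_customers_cleaned.csv', parse_dates=['OrderDate'])
print(cleaned.dtypes)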

Summary

Through the steps above, we preprocessed the e-commerce customer dataset, covering data import, inspection, missing value handling, data format conversion, duplicate removal, data standardization, and data export. Through this case, students can understand the key steps and techniques of data preprocessing and prepare for subsequent data analysis and modeling.
