
Implementation of data preprocessing with Python pandas

Case: Preprocessing of e-commerce customer data

This case preprocesses an e-commerce customer dataset using the pandas library. Data preprocessing is an important step in effective data analysis; its goal is to get the data into a state suitable for analysis. The case is divided into several steps that demonstrate how to import data, view dataset information, handle missing values, convert data formats, remove duplicates, standardize data, and export the result.

Test data

Here is the test data we will process, using e-commerce customer orders as an example:

CustomerID  OrderDate   Product      Quantity  Price
C001        2024-01-10  Smartphone   2         500.00
C002        2024-01-12  Laptop       1         1200.00
C003        NaN         Earphone     3         50.00
C001        2024-01-15  Tablet PC    1         300.00
C004        2024-01-18  Laptop       1         1200.00
C005        2024-01-20  Smartphone   2         NaN
C001        2024-01-15  Tablet PC    1         300.00
C006        2024-01-22  Earphone     2         100.00
C007        2024-01-25  Smart Watch  1         200.00
C008        2024-01-28  Tablet PC    3         900.00
C009        2024-01-30  Laptop       1         1500.00
C010        2024-02-01  Smartphone   4         2000.00
C011        2024-02-05  Earphone     5         250.00
C012        2024-02-08  Tablet PC    1         300.00
C013        2024-02-10  Smart Watch  2         400.00
C014        2024-02-12  Laptop       2         2400.00
C015        2024-02-15  Smartphone   1         1000.00
C016        2024-02-18  Tablet PC    2         600.00
C017        2024-02-20  Smart Watch  3         600.00
C018        2024-02-22  Earphone     1         50.00
C019        2024-02-25  Smartphone   2         1200.00
C020        2024-02-28  Laptop       1         1300.00

Step 1: Data Import

First, use pandas to import the CSV file and load the data.

import pandas as pd

# Import the CSV file
df = pd.read_csv('ecommerce_customers.csv')

Analysis: First import the pandas library. The pd.read_csv() method loads the CSV file and stores it in a DataFrame named df.
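If you do not have ecommerce_customers.csv on hand, the file can be created from the test data above before running Step 1. Below is a minimal sketch that writes only the first five sample rows (an illustration, not part of the original case):

# Build a small sample from the test data above (first five rows only)
import pandas as pd

sample = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C004'],
    'OrderDate': ['2024-01-10', '2024-01-12', None, '2024-01-15', '2024-01-18'],
    'Product': ['Smartphone', 'Laptop', 'Earphone', 'Tablet PC', 'Laptop'],
    'Quantity': [2, 1, 3, 1, 1],
    'Price': [500.00, 1200.00, 50.00, 300.00, 1200.00],
})

# Write it out so pd.read_csv('ecommerce_customers.csv') in Step 1 can load it
sample.to_csv('ecommerce_customers.csv', index=False)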

Step 2: View the basic information of the dataset

Use pandas' various methods to view the structure and content of the data to have a preliminary understanding of the data.

# View the first few rows of the data
print(df.head())

# View column information and data types
df.info()

# Summary statistics for the numerical columns
print(df.describe())

Analysis

df.head(): View the first few rows of the data to help quickly understand its basic structure.

Running results

  CustomerID   OrderDate     Product  Quantity   Price
0       C001  2024-01-10  Smartphone         2   500.0
1       C002  2024-01-12      Laptop         1  1200.0
2       C003         NaN    Earphone         3    50.0
3       C001  2024-01-15   Tablet PC         1   300.0
4       C004  2024-01-18      Laptop         1  1200.0

With the head() method, we can quickly view the first 5 rows of data and get a sense of its general content.

df.info(): Displays the columns, non-null counts, and data types of the dataset, helping to understand its overall structure.

Running results

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CustomerID  20 non-null     object 
 1   OrderDate   19 non-null     object 
 2   Product     20 non-null     object 
 3   Quantity    20 non-null     int64  
 4   Price       19 non-null     float64
dtypes: float64(1), int64(1), object(3)
memory usage: 928.0+ bytes

With the info() method, we can see each column's name, data type, and number of non-null values, which is very helpful for judging later whether there are missing values.

df.describe(): Performs statistical analysis of the numerical columns, providing the mean, standard deviation, and other summary statistics.

Running results

        Quantity        Price
count  20.000000    19.000000  
mean    2.000000   810.526316  
std     1.055597   714.053229
min     1.000000    50.000000
25%     1.000000   300.000000
50%     2.000000  1000.000000
75%     3.000000  1200.000000
max     5.000000  2400.000000

With the describe() method, we can see summary statistics for the numeric columns (such as Quantity and Price), including the mean, median, and standard deviation, which are very important for analyzing the distribution and characteristics of the data.

`count` represents the number of non-null values in the column. For the `Quantity` column, all 20 values are non-null; for the `Price` column, there are only 19 non-null values, indicating one missing value.

`mean` represents the average value of the column. The average purchase volume for the `Quantity` column is 2; the average price for the `Price` column is 810.53.

`min` represents the minimum value in the column. The minimum purchase volume for the `Quantity` column is 1; the minimum price for the `Price` column is 50.

`25%` represents the first quartile, that is, 25% of the data points are less than or equal to this value. For the `Quantity` column, the first quartile is 1; for the `Price` column, this value is 300.

`50%` represents the median, i.e. 50% of the data points are less than or equal to this value. The median of the `Quantity` column is 2; the median of the `Price` column is 1000.

`75%` represents the third quartile, i.e. 75% of the data points are less than or equal to this value. The third quartile of the `Quantity` column is 3; the third quartile of the `Price` column is 1200.

`max` represents the maximum value in the column. The maximum purchase volume for the `Quantity` column is 5; the maximum price for the `Price` column is 2400.
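As a cross-check, each of these figures can also be computed directly with individual pandas methods; a minimal sketch, assuming the DataFrame from Step 1 is loaded as df:

# Individual statistics matching the describe() output for the Price column
print(df['Price'].count())          # number of non-null prices
print(df['Price'].mean())           # mean
print(df['Price'].quantile(0.25))   # first quartile (25%)
print(df['Price'].median())         # median (50%)
print(df['Price'].quantile(0.75))   # third quartile (75%)
print(df['Price'].min(), df['Price'].max())  # minimum and maximum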

Step 3: Process the missing values

Handle the missing values in the data. In this case, the OrderDate and Price columns contain missing values.

# View missing values
def missing_values(df):
    return df.isnull().sum()

print(missing_values(df))

# Fill missing values in the Price column with the column mean
df['Price'].fillna(df['Price'].mean(), inplace=True)

# Fill missing order dates with a default date
df['OrderDate'].fillna('2024-01-01', inplace=True)

Analysis

  • df.isnull().sum(): Counts the number of missing values in each column, helping to identify the gaps in the data.
  • df['Price'].fillna(df['Price'].mean(), inplace=True): Fills the missing values of the Price column with the column mean to preserve data integrity. The inplace=True parameter specifies whether the operation modifies the original DataFrame; with inplace=True, the operation changes the original DataFrame directly instead of returning a new one. This saves memory, but it should be used with care, since the modification cannot be undone.
  • df['OrderDate'].fillna('2024-01-01', inplace=True): Fills the missing values of the OrderDate column with the default date 2024-01-01. A quick check that the filling worked is shown after this list.
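To confirm that the filling worked, the missing-value count can be re-run; a minimal sketch, assuming the two fillna() calls above have already been applied to df:

# Re-check missing values after filling; every column should now show 0
print(df.isnull().sum())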

Step 4: Data Format Conversion

WillOrderDateThe column is converted from a string to a date type for subsequent time series analysis.

# Convert OrderDate to a datetime type
df['OrderDate'] = pd.to_datetime(df['OrderDate'])

Analysis: pd.to_datetime(df['OrderDate']) converts the OrderDate column to a datetime format, which facilitates subsequent date-related analysis, such as sorting or time series analysis.
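Once the column is a true datetime type, date-based operations become straightforward. Below is a minimal sketch of two such operations (an illustration, assuming df from the previous steps):

# Sort orders chronologically
df = df.sort_values('OrderDate')

# Count orders per month using the .dt accessor
print(df['OrderDate'].dt.to_period('M').value_counts().sort_index())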

Step 5: Process duplicate values

Find and delete duplicate rows in the dataset to ensure the accuracy of the data.

# Find duplicate rows
duplicates = df[df.duplicated()]
print(duplicates)

# Delete duplicate rows
df.drop_duplicates(inplace=True)

Analysis

  • df.duplicated(): Flags the duplicate rows in the dataset.
  • df.drop_duplicates(inplace=True): Deletes all duplicate rows so the dataset contains no repeated records, improving the accuracy of the analysis results. A version that also reports how many rows were removed is sketched after this list.
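If you also want to know how many rows the deduplication removed, one option is to record the row count first; a minimal sketch of an alternative to the inplace call above:

# Count rows before and after removing duplicates
rows_before = len(df)
df = df.drop_duplicates()
print(f"Removed {rows_before - len(df)} duplicate row(s); {len(df)} rows remain")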

Step 6: Data Standardization

Standardize the Price column for subsequent analysis and modeling.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['Price'] = scaler.fit_transform(df[['Price']])

Analysis

  • Import MinMaxScaler from scikit-learn, which is used to scale data.
  • scaler.fit_transform(df[['Price']]): Scales the values of the Price column to the range 0 to 1; normalized data is better suited to machine learning models. An equivalent pandas-only version is sketched after this list.
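If you prefer not to add a scikit-learn dependency, the same min-max scaling can be written directly in pandas; a minimal sketch of the equivalent computation (assuming missing prices were already filled in Step 3):

# Equivalent min-max normalization without scikit-learn:
# (x - min) / (max - min) maps the column onto the range [0, 1]
price_min = df['Price'].min()
price_max = df['Price'].max()
df['Price'] = (df['Price'] - price_min) / (price_max - price_min)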

Step 7: Save the preprocessed data

Finally, the preprocessed data is saved as a new CSV file for subsequent analysis.

# Export the cleaned data as a new CSV file
df.to_csv('ecommerce_customers_cleaned.csv', index=False)

Analysis: The df.to_csv() method saves the processed data as a new CSV file. index=False indicates that the index column is not written to the file.
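Note that CSV does not preserve the datetime type, so when the cleaned file is read back later, OrderDate should be parsed again; a minimal sketch of a round-trip check using the file name above:

# Reload the cleaned file; parse_dates restores OrderDate as a datetime column
cleaned = pd.read_csv('ecommerce_customers_cleaned.csv', parse_dates=['OrderDate'])
print(cleaned.dtypes)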

Summary

Through the steps above, we preprocessed the e-commerce customer dataset, covering data import, inspection, missing value handling, data format conversion, duplicate removal, data standardization, and data export. Through this case, students can understand the key steps and techniques of data preprocessing and prepare for subsequent data analysis and modeling.
