SoFunction
Updated on 2025-03-10

Basic methods of using Python for data cleaning and storage

1. The significance and role of data cleaning

Data cleaning means converting crawled data into a structured, accurate, and consistent format: removing invalid records, handling missing values, and normalizing formats. Whether the goal is further analysis, model training, or database storage, cleaned data is more efficient and reliable to work with.

Common problems with data obtained by crawlers:

  • Missing data: some fields may be empty or inconsistently populated.
  • Noisy data: pages may contain useless information such as advertisements and user comments.
  • Non-standard formats: dates, text encodings, and letter case may be inconsistent.

In this article, we will use some sample code to show how to clean and store data.

2. Commonly used data cleaning libraries

Python provides a variety of libraries for data cleaning, the most commonly used ones are:

  • Pandas: flexible data structures and operations for cleaning, transforming, and processing data.
  • re (the regular expression library): string pattern matching and text cleanup.
  • BeautifulSoup: parsing HTML and extracting clean text from web pages.
  • NumPy: numerical operations, such as representing and filling missing values.

Install Pandas and BeautifulSoup:

pip install pandas beautifulsoup4
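As a small illustration of the kind of cleanup the re library handles, here is a sketch of a price parser (the function name and inputs are illustrative, not from the dataset below):

```python
import re

def parse_price(raw):
    """Strip currency symbols and thousands separators from a scraped
    price string, returning a float, or None if no number is present."""
    match = re.search(r'\d+(?:\.\d+)?', raw.replace(',', ''))
    return float(match.group()) if match else None

print(parse_price(' $1,299.00 '))    # 1299.0
print(parse_price('Price missing'))  # None
```

Later in the article the same idea is done column-wise with Pandas; a standalone function like this is handy when cleaning fields one record at a time inside the crawler itself.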

3. Sample dataset

Suppose we crawl the product information of an e-commerce website, the data is as follows:

Product Name         price          Number of comments   Release date
Apple iPhone 12      $999           200                  January 1, 2023
Samsung Galaxy S21   Price missing  150                  none
Xiaomi Mi 11         $699           Missing comments     March 5, 2023
Sony Xperia 5        $799           100                  2023/05/10
OPPO Find X3         Price error    50                   2022-12-01
As you can see, the crawled data contains the following problems:

  • Some prices and comment counts are missing.
  • Date formats are inconsistent.
  • Some price fields contain invalid values.
  • Placeholder text such as "Price missing" and "Missing comments" needs to be converted into proper missing values.

We will demonstrate the cleaning steps based on this dataset.

4. Data cleaning steps

4.1 Loading data into Pandas DataFrame

Convert the data to a Pandas DataFrame for easy manipulation.

import pandas as pd

# Create sample data
data = {
    'Product Name': ['Apple iPhone 12', 'Samsung Galaxy S21', 'Xiaomi Mi 11', 'Sony Xperia 5', 'OPPO Find X3'],
    'price': ['$999', 'Price missing', '$699', '$799', 'Price error'],
    'Comments': [200, 150, 'Missing comments', 100, 50],
    'Release date': ['January 1, 2023', 'none', 'March 5, 2023', '2023/05/10', '2022-12-01']
}
df = pd.DataFrame(data)
print(df)

The output is:

         Product Name          price          Comments     Release date
0     Apple iPhone 12           $999               200  January 1, 2023
1  Samsung Galaxy S21  Price missing               150             none
2        Xiaomi Mi 11           $699  Missing comments    March 5, 2023
3       Sony Xperia 5           $799               100       2023/05/10
4        OPPO Find X3    Price error                50       2022-12-01

4.2 Handling missing values

Missing values can be handled by dropping rows or filling them in. First, convert the "Price missing", "Price error", and "Missing comments" placeholders into missing values (NaN) that Pandas recognizes.

import numpy as np

# Convert the price and comment placeholders to NaN
df['price'] = df['price'].replace(['Price missing', 'Price error'], np.nan)
df['Comments'] = df['Comments'].replace('Missing comments', np.nan)

print(df)

4.3 Clean the price field

The price field contains a $ symbol and is stored as a string. We can strip the symbol and convert the column to a numeric type for later analysis.

# Remove the $ sign and convert to a floating-point number
df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)
print(df)

4.4 Processing date format

The release date comes in multiple formats. The pd.to_datetime function can normalize them.

# Unify the date format
df['Release date'] = pd.to_datetime(df['Release date'], errors='coerce')
print(df)

Here the errors='coerce' option handles entries that cannot be parsed as dates, such as "none", by converting them to missing values (NaT).
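To see the coercion in isolation, here is a small sketch. Note that format='mixed' is a pandas 2.0+ option that parses each element with its own format; on older pandas versions the default element-by-element inference behaves similarly:

```python
import pandas as pd

dates = pd.Series(['January 1, 2023', 'none', '2023/05/10', '2022-12-01'])

# format='mixed' (pandas >= 2.0) parses each element with its own format;
# entries that cannot be parsed, such as 'none', become NaT because of
# errors='coerce' instead of raising an exception.
parsed = pd.to_datetime(dates, format='mixed', errors='coerce')
print(parsed)
```

With heterogeneous date strings like these, being explicit about mixed formats avoids pandas inferring one format from the first element and misparsing the rest.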

4.5 Fill in missing values

We can fill in missing values according to our needs, for example replacing a missing price with the column mean, or filling in a specific value.

# Fill missing prices with the average price
df['price'] = df['price'].fillna(df['price'].mean())

# Fill missing comment counts with 0
df['Comments'] = df['Comments'].fillna(0)

print(df)

5. Data storage

After cleaning, the data can be stored in different file formats for subsequent analysis and use. Common data storage formats include CSV, Excel, SQL database, and JSON.

5.1 Save as a CSV file

CSV files are a common format for data storage and are suitable for small-scale data.

df.to_csv('cleaned_data.csv', index=False)
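A quick round trip confirms that nothing is lost when writing and reading the CSV (the frame here is a tiny stand-in for the cleaned data):

```python
import pandas as pd

# A tiny stand-in for the cleaned frame; the file name is just an example.
df = pd.DataFrame({'Product Name': ['Xiaomi Mi 11'], 'price': [699.0]})
df.to_csv('cleaned_data.csv', index=False)

# Reading the file back is a quick sanity check that nothing was lost.
df_back = pd.read_csv('cleaned_data.csv')
print(df_back)
```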

5.2 Save as an Excel file

Excel format is a good choice if the data needs further manual editing or review.

df.to_excel('cleaned_data.xlsx', index=False)

5.3 Save to SQL database

SQL databases are more suitable for large-scale data, and you can use SQLite, MySQL, etc. Here is an example of using SQLite to store data:

import sqlite3

# Connect to the SQLite database (the file is created automatically if it does not exist)
conn = sqlite3.connect('products.db')
df.to_sql('products', conn, if_exists='replace', index=False)
conn.close()
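To verify the write, the table can be read back with pandas; the frame and the 'products.db' path below are illustrative stand-ins:

```python
import sqlite3
import pandas as pd

# A tiny stand-in for the cleaned frame; 'products.db' is an example path.
df = pd.DataFrame({'Product Name': ['Apple iPhone 12'], 'price': [999.0]})

conn = sqlite3.connect('products.db')
df.to_sql('products', conn, if_exists='replace', index=False)

# Read the table back to confirm the write succeeded.
df_back = pd.read_sql('SELECT * FROM products', conn)
print(df_back)
conn.close()
```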

5.4 Save as a JSON file

JSON files are often used for data transmission and storage, and are suitable for nested data structures.

df.to_json('cleaned_data.json', orient='records', force_ascii=False)

6. Best practices for data cleaning and storage

  • Unify formats: standardize field formats as much as possible to simplify later processing and storage.
  • Handle missing values: choose a filling strategy appropriate to the use case.
  • Keep the original data: retain a copy of the raw data during cleaning so results can be traced back.
  • Match storage to data volume: small datasets can be stored as CSV or Excel; large datasets belong in a database.
  • Validate the data: after cleaning and storing, check correctness, for example by computing means and verifying that formats meet expectations.
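The validation point above can be sketched as a few checks on the cleaned frame (the values here are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative cleaned data with one remaining missing price.
df = pd.DataFrame({
    'price': [999.0, 699.0, np.nan],
    'Comments': [200, 150, 0],
})

# Check that the price column really ended up numeric.
assert pd.api.types.is_numeric_dtype(df['price'])

# Count remaining missing values per column.
missing = df.isna().sum()
print(missing)

# Compare a summary statistic against expectations.
print(df['price'].mean())  # 849.0
```

Running a handful of such assertions after every cleaning run catches regressions early, before bad data reaches storage.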

7. Summary

This article introduced how to clean and store Python crawler data: handling missing values, unifying formats, cleaning price and date fields, and saving the cleaned data in multiple formats. Mastering these cleaning and storage techniques greatly improves the quality and usability of crawled data.

This concludes the article on the basic methods of using Python for data cleaning and storage. For more on Python data cleaning and storage, please search my previous articles or continue browsing the related articles below. Thank you for your support!