
Use BeautifulSoup and Pandas to crawl and clean web data

In data analysis and machine learning projects, acquiring, cleaning, and processing data are critical steps. In this article, we will work through a practical case that demonstrates how to use Python's Beautiful Soup library to crawl web page data and the Pandas library to clean and process it. The case is suitable for beginners, and it also helps readers with some experience get comfortable with these two powerful tools quickly.

1. Preparation

Before you start, make sure that your Python environment has requests, beautifulsoup4 and pandas libraries installed. You can install them with the following command:

pip install requests beautifulsoup4 pandas

In addition, we need a web page to crawl as an example. For simplicity, we will use a page from a public news website.

2. Crawl web page data

First, we need to use the requests library to get the HTML content of the web page. Then, use Beautiful Soup to parse the HTML and extract the data we are interested in.

import requests
from bs4 import BeautifulSoup
 
# Target web page URL
url = '/news'  # Replace with the actual URL

# Send an HTTP request to get the web page content
response = requests.get(url)
response.raise_for_status()  # Check whether the request was successful

# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
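
In practice, some sites reject requests that do not look like they come from a browser, and it is good practice to set a timeout so a slow server does not hang the script. The following is a small optional variation of the request above; the User-Agent string is just an illustrative placeholder.

# Optional: send a browser-like User-Agent header and set a timeout (values are illustrative)
headers = {'User-Agent': 'Mozilla/5.0 (compatible; data-collection-demo)'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()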

Suppose we want to extract the news title, release time, and text content. By inspecting the HTML structure of the page, we find that this information is contained in specific HTML tags.

# Extract the news title, release time and text content
articles = []
for article in soup.select('.news-article'):  # Assume each news article has class="news-article"
    title = article.select_one('.title').get_text(strip=True)  # Assumed selector for the title element
    publish_time = article.select_one('.publish-time').get_text(strip=True)
    content = article.select_one('.content').get_text(strip=True)
    articles.append({
        'title': title,
        'publish_time': publish_time,
        'content': content
    })

3. Data cleaning

The scraped data usually contains unwanted artifacts such as extra spaces, leftover HTML tags, and special characters, so we need to clean it.

import pandas as pd
 
# Convert the data to a DataFrame
df = pd.DataFrame(articles)

# Print the first few rows to inspect the data
print(df.head())

# Data cleaning steps
# 1. Remove leading/trailing whitespace (already handled at extraction with strip=True)
# 2. Replace special characters (for example, turn line breaks into spaces)
df['content'] = df['content'].str.replace('\n', ' ')

# 3. Drop rows with missing values (assume rows with an empty title or content are invalid)
df = df.dropna(subset=['title', 'content'])

# 4. Unify the time format (assume the release time is in "YYYY-MM-DD HH:MM:SS" format)
# Here we assume the release time is already a string in a consistent format.
# If you need to convert it, you can use pd.to_datetime():
# df['publish_time'] = pd.to_datetime(df['publish_time'], format='%Y-%m-%d %H:%M:%S')

# Print and view the cleaned data
print(df.head())
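
If the content column still contains leftover HTML tags or runs of whitespace, a couple of extra pandas string operations can help. This is a minimal sketch; the regular expressions are generic and may need adjusting to your actual data.

# Optional extra cleaning: strip residual HTML tags and collapse repeated whitespace
df['content'] = (
    df['content']
    .str.replace(r'<[^>]+>', '', regex=True)   # remove leftover HTML tags
    .str.replace(r'\s+', ' ', regex=True)      # collapse runs of whitespace into one space
    .str.strip()                                # trim leading/trailing spaces
)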

4. Data processing

After data cleaning, we may need to perform some additional processing, such as data conversion, data merging, data grouping, etc.

# Data processing steps
# 1. Extract the date part of the release time (if required)
# df['publish_date'] = pd.to_datetime(df['publish_time']).dt.date

# 2. Count the number of news items per release date (if required)
# daily_counts = df['publish_date'].value_counts().reset_index()
# daily_counts.columns = ['publish_date', 'count']
# print(daily_counts)

# 3. Filter news by keyword (for example, keep only news containing the keyword "epidemic")
keyword = 'epidemic'
filtered_df = df[df['content'].str.contains(keyword, na=False, case=False)]

# Print the filtered data to view
print(filtered_df.head())
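
To illustrate the grouping step mentioned above, the sketch below counts how many articles were published per date. It assumes publish_time can be parsed by pd.to_datetime; unparseable values become NaT and are left out of the counts.

# Example grouping: count articles per release date (assumes parseable timestamps)
df['publish_date'] = pd.to_datetime(df['publish_time'], errors='coerce').dt.date
daily_counts = df.groupby('publish_date').size().reset_index(name='count')
print(daily_counts)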

5. Save data

After processing the data, we may need to save it to a file for subsequent use. Pandas provides a variety of methods to save data, such as saving as CSV files, Excel files, etc.

# Save the data as a CSV file
csv_file_path = 'cleaned_news_data.csv'
df.to_csv(csv_file_path, index=False, encoding='utf-8-sig')

# Save the data as an Excel file
excel_file_path = 'cleaned_news_data.xlsx'
df.to_excel(excel_file_path, index=False, engine='openpyxl')
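
Note that to_excel with engine='openpyxl' requires the openpyxl package to be installed (pip install openpyxl). To confirm the CSV was written correctly, you can read it back with pandas, for example:

# Read the saved CSV back to verify the round trip
check_df = pd.read_csv(csv_file_path, encoding='utf-8-sig')
print(check_df.shape)
print(check_df.head())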

6. Complete code example

For ease of understanding and practice, here is the complete code example. Make sure to replace the url variable with the actual web page URL, and adjust the Beautiful Soup selectors to match the actual HTML structure.

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# Target web page URL (please replace it with the actual URL)
url = '/news'

# Send an HTTP request to get the web page content
response = requests.get(url)
response.raise_for_status()

# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the news title, release time and text content
articles = []
for article in soup.select('.news-article'):  # Assume each news article has class="news-article"
    title = article.select_one('.title').get_text(strip=True)  # Assumed selector for the title element
    publish_time = article.select_one('.publish-time').get_text(strip=True)
    content = article.select_one('.content').get_text(strip=True)
    articles.append({
        'title': title,
        'publish_time': publish_time,
        'content': content
    })

# Convert the data to a DataFrame
df = pd.DataFrame(articles)

# Data cleaning steps
df['content'] = df['content'].str.replace('\n', ' ')
df = df.dropna(subset=['title', 'content'])

# Data processing steps (example: filter news by keyword)
keyword = 'epidemic'
filtered_df = df[df['content'].str.contains(keyword, na=False, case=False)]

# Save the data as CSV and Excel files
csv_file_path = 'cleaned_news_data.csv'
excel_file_path = 'cleaned_news_data.xlsx'
df.to_csv(csv_file_path, index=False, encoding='utf-8-sig')
df.to_excel(excel_file_path, index=False, engine='openpyxl')

# Print the filtered data to view
print(filtered_df.head())

7. Summary

Through this article, we learned how to use Beautiful Soup to crawl web page data and how to use Pandas to clean and process it. Combining these two libraries can greatly improve the efficiency of working with web page data. In real projects, you may need to adjust the code to the specific page structure and data requirements. Hopefully this practical case helps you master these two tools and put them to work in your data analysis and machine learning projects.
