
Automated data crawling and storage using Python

1. Preparation phase: Determine the target and install the tools

1. Determine the target website

The first step in data crawling is to identify the website from which you want to obtain data. Suppose you are interested in product prices on a certain e-commerce platform; that platform is then your target website. After selecting a target, you need to analyze the site's structure and how its data is laid out, and decide which types of data need to be crawled, such as product name, price, and sales.

2. Install Python and necessary libraries

Before you start writing a crawler, make sure that your computer has a Python environment installed. Next, you need to install some third-party libraries for data crawling. Commonly used libraries include:

  • requests: used to send HTTP requests and get web page content.
  • BeautifulSoup: used to parse web page content and extract the required data.
  • pandas: used for data processing and storage, especially when saving data as an Excel file.

You can install these libraries through the pip command:

pip install requests beautifulsoup4 pandas

2. Write a crawler program: send requests and parse web pages

1. Send HTTP request

Using the requests library, you can easily send HTTP requests to the target website to get the HTML content of the web page. Here is a simple example:

import requests

url = ''  # The URL of the target website
response = requests.get(url)

# Check whether the request was successful
if response.status_code == 200:
    print('The request was successful!')
    html_content = response.text  # Get the HTML content of the web page
else:
    print(f'Request failed, status code: {response.status_code}')

2. Parse the content of the web page

After getting the HTML content, you need to use the BeautifulSoup library to parse it and extract the required data. Here is an example of extracting the web page title:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
title = soup.title.string  # Extract the web page title
print(f'Web page title: {title}')

Of course, in practical applications you may need to extract more complex data, such as product lists and price information. In that case, you need to locate and extract the data based on the HTML structure of the page, using the methods provided by BeautifulSoup (such as find, find_all, etc.), as sketched below.
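For example, the following minimal sketch pulls a list of product names and prices with find_all and find. The tag names and CSS classes used here ('item', 'name', 'price') are assumptions; replace them with the ones actually used on your target page.

# Locate every product block and extract the name and price from each one
items = soup.find_all('div', class_='item')
for item in items:
    name = item.find('h2', class_='name').get_text(strip=True)      # Product name
    price = item.find('span', class_='price').get_text(strip=True)  # Product price
    print(name, price)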

3. Dealing with anti-crawler mechanism: responding to challenges and strategies

In order to protect their own data, many websites set up anti-crawler mechanisms, such as CAPTCHA verification and IP bans. So, when writing a crawler program, you need to take steps to deal with these challenges.

1. Set the request header

By setting the appropriate request header, you can simulate the behavior of the browser, thereby bypassing some simple anti-crawler mechanisms. Here is an example of setting the request header:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}
response = requests.get(url, headers=headers)

2. Use proxy IP

If your crawler program frequently visits the same website, your IP may be blocked. To solve this problem, you can use a proxy IP to hide your real IP address. Here is an example of using a proxy IP:

proxies = {
    'http': 'http://your-proxy-server:port',
    'https': 'https://your-proxy-server:port',
}
response = requests.get(url, proxies=proxies)

Note that using proxy IPs may involve additional costs, and the quality of the proxy also affects the efficiency and stability of the crawler, so choose your proxies carefully.

4. Data storage and processing: Save and analyze data

After extracting the required data, you need to store it for subsequent analysis and use. Python provides a variety of data storage methods, including text files, databases, Excel files, etc.

1. Store as a text file

You can save the extracted data as text files, such as CSV, JSON, etc. Here is an example saved as a CSV file:

import csv

data = [
    ['Product Name', 'Price', 'Sales'],
    ['Product A', '100 yuan', '100 pieces'],
    ['Product B', '200 yuan', '50 pieces'],
]

with open('Product Data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerows(data)
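The JSON format mentioned above works just as well for this kind of record data. A minimal sketch using the standard-library json module (the file name product_data.json is just an example) could look like this:

import json

# The same records as a list of dictionaries, which map naturally to JSON
data = [
    {'Product Name': 'Product A', 'Price': '100 yuan', 'Sales': 100},
    {'Product Name': 'Product B', 'Price': '200 yuan', 'Sales': 50},
]

with open('product_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)  # ensure_ascii=False keeps non-ASCII text readable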

2. Store as a database

If you need to store a large amount of data and want to perform efficient data queries and analysis, a database is a good choice. Python supports a variety of database management systems, such as MySQL, PostgreSQL, etc. Here is an example of storing data to a MySQL database:

import mysql.connector  # MySQL driver; pymysql also works with a very similar API

# Connect to the MySQL database
conn = mysql.connector.connect(
    host='your-database-host',
    user='your-database-user',
    password='your-database-password',
    database='your-database-name'
)

cursor = conn.cursor()

# Create the table (if it does not exist yet)
cursor.execute('''
CREATE TABLE IF NOT EXISTS product_data (
    id INT AUTO_INCREMENT PRIMARY KEY,
    product_name VARCHAR(255),
    price VARCHAR(255),
    sales INT
)
''')

# Insert data
data = [
    ('Product A', '100 yuan', 100),
    ('Product B', '200 yuan', 50),
]

cursor.executemany('''
INSERT INTO product_data (product_name, price, sales) VALUES (%s, %s, %s)
''', data)

# Commit the transaction and close the connection
conn.commit()
cursor.close()
conn.close()

3. Store as an Excel file

If you want to save your data as an Excel file for more intuitive data analysis and visualization, you can use the pandas library. Here is an example of storing data as an Excel file:

import pandas as pd

data = {
    'Product Name': ['Product A', 'Product B'],
    'Price': ['100 yuan', '200 yuan'],
    'Sales': [100, 50],
}

df = pd.DataFrame(data)
df.to_excel('Product data.xlsx', index=False)
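Note that writing .xlsx files with to_excel relies on an Excel engine such as openpyxl, so you may need to install it first (pip install openpyxl).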

5. Practical case: Crawling the price of e-commerce platform products

In order to give you a better understanding of how to use Python to automatically crawl and store data, the following is a practical case: crawl product price information on an e-commerce platform and save it as an Excel file.

1. Analyze the target website

Suppose your target website is an e-commerce platform, and you need to crawl the price information of a certain product category on the platform. First, you need to analyze the HTML structure of the website and determine the HTML tags and attributes of product name, price and other information.

2. Write a crawler program

Based on the analysis results, you can write a crawler program to grab data. Here is a simple example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# The URL of the target website (assuming it is a list page for a certain product category)
url = '/category'

# Set the request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}

# Send the HTTP request and get the web page content
response = requests.get(url, headers=headers)
if response.status_code == 200:
    html_content = response.text
else:
    print('Request failed')
    exit()

# Parse the web page content and extract the data
soup = BeautifulSoup(html_content, 'html.parser')
products = soup.find_all('div', class_='product-item')  # Assume product information is in div tags with class 'product-item'

data = []
for product in products:
    name = product.find('h2', class_='product-name').get_text(strip=True)      # Extract the product name
    price = product.find('span', class_='product-price').get_text(strip=True)  # Extract the product price
    data.append([name, price])

# Save the data as an Excel file
df = pd.DataFrame(data, columns=['Product Name', 'Price'])
df.to_excel('Product price data.xlsx', index=False)

print('The data was crawled and saved successfully!')

3. Run the crawler program

Save the above code as a Python file (for example, crawler.py) and run it from the command line:

python crawler.py

After the crawler finishes running, you should see an Excel file named "Product price data.xlsx" in the current directory, containing the product names and prices crawled from the target website.

6. Optimization and maintenance: Improve crawler efficiency and stability

Add exception handling

During network requests and data parsing, various exceptions may occur, such as network timeouts, blocked requests, or changes in the HTML structure. To make the crawler more robust, add exception-handling logic to your code so that problems are handled gracefully instead of crashing the entire program.

try:
    # Send the HTTP request and get the web page content
    response = requests.get(url, headers=headers, timeout=10)  # Set the timeout to 10 seconds
    response.raise_for_status()  # Raises an HTTPError if the response status code is not 200
    html_content = response.text
except requests.exceptions.RequestException as e:
    print(f"An error occurred in the request: {e}")
    exit()

Use multi-threading or asynchronous IO

Single-threaded crawlers can be very slow when crawling large amounts of data, because each request needs to wait for the server to respond. To improve efficiency, you can consider using multithreading or asynchronous IO to send requests concurrently. Python's threading library and asyncio library provide support for multi-threading and asynchronous programming, respectively.
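As an illustration, here is a minimal sketch of fetching several pages concurrently with the standard-library concurrent.futures.ThreadPoolExecutor (which builds on threading). The URL list and the header value are placeholder assumptions; substitute your real target pages.

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs and header; replace with your real target pages
urls = ['https://example.com/page1', 'https://example.com/page2']
headers = {'User-Agent': 'Mozilla/5.0'}

def fetch(url):
    # Return the page HTML, or None if the request fails
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException:
        return None

# Send up to 5 requests at the same time instead of one after another
with ThreadPoolExecutor(max_workers=5) as executor:
    pages = list(executor.map(fetch, urls))

print(f'{sum(p is not None for p in pages)} of {len(urls)} pages fetched')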

Regular updates and maintenance

The HTML structure and anti-crawler mechanism of the target website may change over time. Therefore, you need to regularly check and update your crawler program to make sure it continues to work properly.

Comply with laws and regulations and website terms

Before performing data crawling, be sure to understand and comply with relevant laws and regulations and the terms of use of the website. Some websites may explicitly prohibit automated data crawling, or have specific restrictions on the use and sharing of data.
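One concrete way to respect a site's preferences is to check its robots.txt before crawling. Below is a minimal sketch using the standard-library urllib.robotparser; the URLs are placeholders for whatever site you intend to crawl.

from urllib.robotparser import RobotFileParser

# Placeholder URLs; substitute the site you intend to crawl
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Check whether a generic crawler ('*') may fetch the category page
if rp.can_fetch('*', 'https://example.com/category'):
    print('robots.txt allows crawling this page')
else:
    print('robots.txt disallows crawling this page')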

7. Summary

Through this article, you should have mastered the basic skills of using Python to automate data crawling and storage. From goal determination and tool installation in the preparation stage, to writing crawler programs, handling anti-crawler mechanisms, data storage and processing, to practical cases and optimization and maintenance, every step is crucial. Hopefully these knowledge and skills can help you go further on the road to data crawling and provide strong support for data analysis and decision-making.

Remember, data crawling is only the first step in data analysis and mining. Subsequent data cleaning, analysis, visualization and other work are equally important. Only by using these skills in a comprehensive way can you extract valuable information from massive Internet data and bring real value to your business or research.

The above is the detailed content of using Python for automated data crawling and storage. For more information about Python data crawling and storage, please follow my other related articles!