Introduction
In today's Internet age, data has become a valuable resource. Whether for market analysis, public opinion monitoring, or academic research, obtaining data from web pages is an essential step. Python, a powerful and easy-to-learn programming language, offers a variety of crawler frameworks that help us fetch web page data efficiently. This article explains in detail how to use Python crawler frameworks to obtain data from a specified area of an HTML page, with code examples that walk through the implementation.
1. Introduction to crawler frameworks
Python has several popular crawling tools, such as Scrapy, BeautifulSoup, and Requests. Each has its own characteristics and suits different scenarios; strictly speaking, Scrapy is a full framework, while BeautifulSoup and Requests are libraries that are typically used together.
1.1 Scrapy
Scrapy is a powerful crawler framework suited to large-scale crawling tasks. It provides a complete crawling solution, including request scheduling, data extraction, and data storage. Scrapy's advantages are efficiency and scalability, but its learning curve is relatively steep.
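To give a feel for Scrapy, here is a minimal spider sketch; the spider name, the example.com URL, and the CSS selectors are illustrative assumptions rather than a real site:

import scrapy

class NewsSpider(scrapy.Spider):
    # Hypothetical spider: the name, start URL, and selectors are placeholders
    name = 'news'
    start_urls = ['https://example.com/news/article']

    def parse(self, response):
        # Scrapy downloads each start URL and passes the response here
        yield {
            'title': response.css('h1.title::text').get(),
            'content': '\n'.join(response.css('div.content p::text').getall()),
        }

A spider like this can be run with scrapy runspider spider.py -o items.json, with request scheduling and result storage handled by the framework.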
1.2 BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML documents. It automatically converts input documents to Unicode and provides an easy-to-use API for traversing and searching the document tree. BeautifulSoup's advantage is that it is easy to get started with, making it well suited to small-scale scraping tasks.
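A quick illustration of that API (the HTML string here is invented for the example):

from bs4 import BeautifulSoup

html = '<ul><li class="item">First</li><li class="item">Second</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all searches the document tree and returns every matching tag
for li in soup.find_all('li', class_='item'):
    print(li.text)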
1.3 Requests
Requests is a Python library for sending HTTP requests. It simplifies the HTTP workflow, making it very simple to send GET, POST, and other requests. Requests is usually used together with BeautifulSoup: one fetches the page, the other parses it.
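For example, a GET and a POST each take a single call (httpbin.org is used here only as a public test endpoint):

import requests

# GET request; query parameters go in the params dict
r = requests.get('https://httpbin.org/get', params={'q': 'python'}, timeout=10)
print(r.status_code)

# POST request; form fields go in the data dict
r = requests.post('https://httpbin.org/post', data={'key': 'value'}, timeout=10)
print(r.json()['form'])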
2. Get data from a specified area of an HTML page
In practical applications, we usually only need data from a specific area of a page rather than the entire page. Below, we walk through a concrete example of extracting data from a specified area of an HTML page.
2.1 Analyze the target web page
Suppose we need to obtain an article's title and body from a news website. First, we analyze the HTML structure of the target page and find the tags that contain the title and the body text.
For example, the target page's HTML structure might look like this:
<html>
  <head>
    <title>News Title</title>
  </head>
  <body>
    <div class="article">
      <h1 class="title">News Title</h1>
      <div class="content">
        <p>This is the first paragraph of the news.</p>
        <p>This is the second paragraph of the news.</p>
      </div>
    </div>
  </body>
</html>
From the HTML above, we can see that the title is inside the <h1 class="title"> tag and the body text is inside the <div class="content"> tag.
2.2 Use Requests to get web page content
First, we use the Requests library to send an HTTP request and fetch the HTML content of the page.
import requests

# Target article URL (placeholder; substitute a real address)
url = 'https://example.com/news/article'

response = requests.get(url)
if response.status_code == 200:
    html_content = response.text
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
2.3 Parsing HTML with BeautifulSoup
Next, we use the BeautifulSoup library to parse the HTML content and extract the title and body text.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

# Extract the title
title = soup.find('h1', class_='title').text

# Extract the body text
content_div = soup.find('div', class_='content')
paragraphs = content_div.find_all('p')
content = '\n'.join([p.text for p in paragraphs])

print(f"Title: {title}")
print(f"Text: {content}")
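As an aside, the same extraction can be written with BeautifulSoup's CSS-selector methods select_one and select, which mirror the class names found during the page analysis; this variant is equivalent to the find/find_all version above:

# Equivalent extraction using CSS selectors
title = soup.select_one('h1.title').text
content = '\n'.join(p.text for p in soup.select('div.content p'))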
2.4 Complete code example
Putting the steps above together, the complete code is as follows:
import requests
from bs4 import BeautifulSoup

# Target article URL (placeholder; substitute a real address)
url = 'https://example.com/news/article'

# Send an HTTP request to get the web page content
response = requests.get(url)
if response.status_code == 200:
    html_content = response.text
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
    exit()

# Use BeautifulSoup to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the title
title = soup.find('h1', class_='title').text

# Extract the body text
content_div = soup.find('div', class_='content')
paragraphs = content_div.find_all('p')
content = '\n'.join([p.text for p in paragraphs])

print(f"Title: {title}")
print(f"Text: {content}")
2.5 Output
After running the code above, the program outputs the title and body text of the article on the target page:
Title: News Title
Text: This is the first paragraph of the news.
This is the second paragraph of the news.
3. Handle dynamically loaded content
Some web page content is loaded dynamically by JavaScript, so the HTML obtained with the Requests library may not contain that data. In this case, we can use the Selenium library to simulate browser behavior and retrieve the fully rendered page content.
3.1 Install Selenium
First, we need to install the Selenium library and a matching browser driver (such as ChromeDriver).
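Selenium itself is typically installed with pip. Note that recent Selenium releases (4.6 and later) bundle Selenium Manager, which can locate or download a matching driver automatically, so the explicit driver path in the code below may be unnecessary on newer setups:

pip install selenium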
3.2 Use Selenium to get the web page content

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Path to the browser driver (placeholder; adjust to your environment)
driver_path = '/path/to/chromedriver'

# Create a browser instance (Selenium 4 passes the driver path via Service)
driver = webdriver.Chrome(service=Service(driver_path))

# Open the target page (placeholder URL)
url = 'https://example.com/news/article'
driver.get(url)

# Get the rendered web page content
html_content = driver.page_source

# Close the browser
driver.quit()

# Use BeautifulSoup to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the title
title = soup.find('h1', class_='title').text

# Extract the body text
content_div = soup.find('div', class_='content')
paragraphs = content_div.find_all('p')
content = '\n'.join([p.text for p in paragraphs])

print(f"Title: {title}")
print(f"Text: {content}")
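One caveat: page_source is captured as soon as the initial load finishes, so content injected later by JavaScript can still be missed. An explicit wait helps; this sketch assumes the div.content element from the example page structure above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the article body to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.content'))
)
html_content = driver.page_source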
3.3 Output
After fetching the dynamically loaded content with Selenium, the program outputs the complete title and body text.
4. Data storage
After obtaining the data, we usually need to store it in a file or a database for later analysis or use. Below we show how to store the scraped data in a CSV file.
4.1 Store to a CSV file
import csv

# Data to store
data = {
    'title': title,
    'content': content
}

# Write to a CSV file
with open('news_article.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'content']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow(data)
4.2 Output
After running the code above, the program generates a file called news_article.csv containing the title and body text of the article.
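Since databases were mentioned above as an alternative, here is a minimal sketch of storing the same fields in SQLite with Python's built-in sqlite3 module; the file and table names are arbitrary choices for the example:

import sqlite3

# Open (or create) a local SQLite database file
conn = sqlite3.connect('news.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles (title TEXT, content TEXT)')

# A parameterized INSERT avoids quoting and injection issues
conn.execute('INSERT INTO articles (title, content) VALUES (?, ?)', (title, content))
conn.commit()
conn.close()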
5. Summary
This article has shown in detail how to use Python crawler libraries to obtain data from a specified area of an HTML page. We first analyzed the HTML structure of the target page, then fetched the page content with the Requests library and parsed it with BeautifulSoup to extract the title and body text. For dynamically loaded content, we used Selenium to simulate browser behavior and obtain the fully rendered page. Finally, we stored the retrieved data in a CSV file.
With the material in this article, readers should be able to master the basic techniques for fetching web page data with Python crawler tools, and extend and optimize them according to their own needs.
That concludes this detailed walkthrough of using Python crawler frameworks to obtain data from specified areas of HTML pages. For more on extracting data from specific areas of HTML with Python, please see my other related articles!