1. Preparation
Before you start, make sure the necessary Python libraries are installed. The requests library is used to send HTTP requests and fetch web page content; the BeautifulSoup library is used to parse HTML documents and extract the required information.
These libraries can be installed using the following command:
pip install requests beautifulsoup4 lxml
2. Basic process
- Send an HTTP request: use the requests library to fetch the HTML content of the target web page.
- Parse the HTML: use the BeautifulSoup library to parse the HTML document and build a DOM tree.
- Locate <span> tags: use a selector to find the <span> tags in the HTML.
- Extract text: pull the text content out of the located <span> tags.
3. Code examples
Here is a simple example that demonstrates how to locate and extract text from <span> tags.
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = ''  # Replace with the actual URL

# Send an HTTP request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Get the HTML content of the page
    html_content = response.text

    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'lxml')  # 'html.parser' also works

    # Find all <span> tags
    spans = soup.find_all('span')

    # Iterate over and print the content of each <span> tag
    for span in spans:
        print(span.get_text(strip=True))  # strip=True removes surrounding whitespace
else:
    print("Request failed, status code:", response.status_code)
4. Case analysis
Suppose we want to crawl the <span> content from a web page containing the following HTML structure:
<div class="container"> <span class="title">Hello, World!</span> <p class="description">This is a sample description.</p> </div>
Our goal is to extract the text content in <span class="title">, i.e. "Hello, World!".
Send HTTP request:
import requests

# Define the target URL
url = ''  # Replace with the actual URL

# Send a request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    html_content = response.text
else:
    print("Request failed, status code:", response.status_code)
    html_content = None
Parse the HTML and locate the <span> tag:
from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'lxml')

# Locate the specific <span> element (by class name)
span_element = soup.find('span', class_='title')

# Check whether the <span> element was found
if span_element:
    span_text = span_element.get_text()
    print("Extracted <span> content:", span_text)
else:
    print("No matching <span> element found")
Complete code:
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = ''  # Replace with the actual URL

# Send a request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'lxml')

    # Locate the specific <span> element (by class name)
    span_element = soup.find('span', class_='title')

    # Check whether the <span> element was found
    if span_element:
        span_text = span_element.get_text()
        print("Extracted <span> content:", span_text)
    else:
        print("No matching <span> element found")
else:
    print("Request failed, status code:", response.status_code)
5. Advanced techniques
Handle multiple <span> tags:
If there are multiple <span> tags in the web page, you can use the find_all method to get all matching tags and iterate over them.
spans = soup.find_all('span')
for span in spans:
    print(span.get_text(strip=True))
Locate by other attributes:
In addition to class names, you can also locate <span> tags by their other attributes (such as id, name, etc.).
span_element = soup.find('span', id='my-span-id')
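BeautifulSoup also supports CSS selectors via select and select_one, which can be a concise alternative to find when matching by id, class, or nesting; a minimal sketch (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div><span id="my-span-id">First</span><span class="title">Second</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first element matching a CSS selector
print(soup.select_one('span#my-span-id').get_text())     # → First
print(soup.select_one('span.title').get_text())          # → Second

# select returns every match, here all spans inside the div
print([s.get_text() for s in soup.select('div span')])   # → ['First', 'Second']
```

Any CSS selector string accepted by browsers' querySelector generally works here, which makes selectors copied from browser developer tools directly reusable.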
Combined with XPath:
For more complex HTML structures, you can use the XPath support provided by the lxml library to locate elements, although this usually requires more knowledge of HTML and XPath.
from lxml import etree

# Parse the HTML content into an lxml Element object
tree = etree.HTML(html_content)

# Locate <span> elements using an XPath expression
span_elements = tree.xpath('//span[@class="title"]')

# Extract the text content
for span in span_elements:
    print(span.text)
Using Selenium:
For scenarios that require simulating user actions (such as clicks and text input), you can use the Selenium library. Selenium supports multiple browsers and can locate elements via XPath, CSS selectors, and more.
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a Chrome browser instance
driver = webdriver.Chrome()

# Open the web page
driver.get('')  # Replace with the actual URL

# Locate the <span> element via XPath
element = driver.find_element(By.XPATH, '//span[@class="title"]')

# Print the element's text content
print(element.text)

# Close the browser
driver.quit()
6. Things to note
- Legality and ethics: when crawling web data, abide by the website's terms of service and robots.txt as well as relevant laws and regulations, and do not place excessive load on the target site.
- Exception handling: crawler code should handle failures gracefully, such as network request errors and HTML parsing errors.
- Data cleaning: extracted data may contain unwanted whitespace, leftover HTML tags, etc., and should be cleaned and formatted.
- Dynamic content: for content loaded dynamically via JavaScript, a tool that can execute JavaScript, such as Selenium, may be required.
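The exception-handling and data-cleaning notes above can be combined into a single helper. This is only a sketch, and fetch_spans is a hypothetical name; a timeout plus raise_for_status() makes failures explicit instead of silent:

```python
import requests
from bs4 import BeautifulSoup

def fetch_spans(url, timeout=10):
    """Return the cleaned text of every <span> on the page, or [] on any failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:  # base class for all requests errors
        print("Request failed:", exc)
        return []
    soup = BeautifulSoup(response.text, 'html.parser')
    # strip=True trims surrounding whitespace (basic data cleaning)
    return [span.get_text(strip=True) for span in soup.find_all('span')]
```

Because requests.RequestException covers connection errors, timeouts, and the HTTP errors raised by raise_for_status(), a single except clause handles all of them.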
7. Summary
With the techniques introduced in this article, readers should now be able to use Python to locate and extract text from <span> tags, whether that means simple HTML parsing with requests and BeautifulSoup or automating complex page interactions with Selenium. I hope this article helps readers apply these techniques in real projects.