Introduction
In this era of information explosion, the amount of data on the Internet keeps growing. As data scientists or developers, we often need to extract valuable information from web pages. However, to improve the user experience or to protect data, many pages hide some content by default and only display it under certain conditions. This hidden content is usually contained in <div> tags and loaded dynamically via JavaScript. This article explains in detail how to use Python to crawl such hidden div content, so that you can handle data collection with more confidence.
Why do you need to crawl hidden div content?
In real applications, hidden div content may hold key information such as comments, user ratings, and product details. This information is crucial for data analysis, market research, competitive analysis, and similar scenarios. For example, if you are a CDA Data Analyst conducting market research, you may need to crawl user comments, which are often loaded dynamically through JavaScript after the page has loaded.
Environmental preparation
Before we start, we need to prepare some basic tools and libraries. The following are the recommended environment configurations:
- Python: It is recommended to use Python 3.6 and above.
- Requests: Used to send HTTP requests.
- BeautifulSoup: Used to parse HTML documents.
- Selenium: Used to simulate browser behavior and handle content loaded dynamically by JavaScript.
- ChromeDriver: The driver Selenium uses to control the Chrome browser.
You can install the required libraries using the following command:
pip install requests beautifulsoup4 selenium
Also, make sure you have downloaded a ChromeDriver that matches your Chrome browser version and added its path to your system's environment variables.
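If you would rather not rely on the PATH variable, you can also point Selenium at the driver binary explicitly. Here is a minimal sketch using Selenium 4's Service object; the driver path is a placeholder to adjust for your system:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path; point this at your actual chromedriver binary
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)
driver.quit()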
Basic method: Static HTML parsing
Using Requests and BeautifulSoup
First, we try to parse static HTML content using Requests and BeautifulSoup. This approach is suitable for content that does not require JavaScript loading.
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # replace with the target page URL
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all div elements
divs = soup.find_all('div')
for div in divs:
    print(div.text)
However, for hidden div content this approach often fails, because that content simply does not exist in the initial HTML returned by the server.
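A quick way to confirm that a div is loaded dynamically is to check whether it appears in the raw response at all. This is a small sketch; the URL and the comment class name are placeholders matching the examples later in this article:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # placeholder URL
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# If the div is missing from the static HTML, it is rendered by JavaScript
if soup.find('div', class_='comment') is None:
    print('Not found in static HTML - the content is loaded dynamically')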
Advanced method: Dynamic content crawling
Using Selenium
Selenium is a powerful tool that simulates browser behavior and handles content loaded dynamically by JavaScript. Below we use a specific example to illustrate how to use Selenium to grab hidden div content.
Install Selenium
Make sure you have Selenium and ChromeDriver installed:
pip install selenium
Sample code
Suppose we want to crawl the comment content dynamically loaded through JavaScript in a web page. We can do this using Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Initialize WebDriver
driver = webdriver.Chrome()

# Open the landing page
url = 'https://example.com'  # replace with the target page URL
driver.get(url)

# Wait for the page to load
try:
    # Wait for a specific element to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'comments'))
    )
finally:
    # Get the page source code
    page_source = driver.page_source
    driver.quit()

# Parse the page source code
soup = BeautifulSoup(page_source, 'html.parser')

# Find all comment divs
comment_divs = soup.find_all('div', class_='comment')
for comment in comment_divs:
    print(comment.text)
Key points explanation
- Initialize WebDriver: We use webdriver.Chrome() to initialize a Chrome browser instance.
- Open the landing page: Use driver.get(url) to open the target page.
- Wait for the page to load: Use WebDriverWait and expected_conditions to wait for a specific element to appear. This step is very important because it ensures that the page is fully loaded.
- Get the page source code: Use driver.page_source to get the HTML source of the current page.
- Parse the page source code: Use BeautifulSoup to parse the HTML source, then find and extract the required div content.
Handle complex situations
In practical applications, the structure of a web page can be more complex; for example, some content only appears after user interaction (such as clicking a button). In that case, we can use Selenium to simulate user actions and trigger these events.
Simulate user operations
Suppose we need to click a button to display the hidden comment content, we can use the following code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Initialize WebDriver
driver = webdriver.Chrome()

# Open the landing page
url = 'https://example.com'  # replace with the target page URL
driver.get(url)

# Wait for the button to appear
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, 'show-comments-button'))
)

# Click the button
button.click()

# Wait for the comment content to appear
try:
    # Wait for a specific element to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'comments'))
    )
finally:
    # Get the page source code
    page_source = driver.page_source
    driver.quit()

# Parse the page source code
soup = BeautifulSoup(page_source, 'html.parser')

# Find all comment divs
comment_divs = soup.find_all('div', class_='comment')
for comment in comment_divs:
    print(comment.text)
Key points explanation
- Wait for the button to appear: Use WebDriverWait and element_to_be_clickable to wait until the button appears and becomes clickable.
- Click the button: Use button.click() to simulate the user clicking the button.
- Wait for the comment content to appear: Use WebDriverWait and presence_of_element_located again to wait for the comments to appear.
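Note that WebDriverWait raises a TimeoutException when the element never appears within the timeout. A minimal sketch of handling that case, reusing the driver and imports from the example above:

from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'comments'))
    )
except TimeoutException:
    # The comments never appeared; log it and decide whether to retry or skip the page
    print('Timed out waiting for the comments section')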
Performance optimization
Performance optimization is very important when dealing with large-scale data crawling tasks. Here are some commonly used optimization techniques:
Use Headless Mode
Selenium supports headless mode, which means running the browser in the background without displaying the graphical interface. This can significantly increase crawling speed and reduce resource consumption.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Initialize WebDriver
driver = webdriver.Chrome(options=chrome_options)

# Open the landing page
url = 'https://example.com'  # replace with the target page URL
driver.get(url)

# ... other code ...
Concurrent crawling
Using multiple threads or processes can significantly improve crawling efficiency. Python's concurrent.futures module provides a convenient interface for concurrent programming.
import concurrent.futures
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

def fetch_comments(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    page_source = driver.page_source
    driver.quit()
    soup = BeautifulSoup(page_source, 'html.parser')
    comment_divs = soup.find_all('div', class_='comment')
    return [comment.text for comment in comment_divs]

# Placeholder URLs; replace with the pages you actually want to crawl
urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3']

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(fetch_comments, urls))

for result in results:
    for comment in result:
        print(comment)
Key points explanation
- Set Chrome Options: Enable headless mode and disable GPU acceleration.
- Define the crawling function: fetch_comments opens a web page, obtains the page source, and parses and returns the comment content.
- Use ThreadPoolExecutor: Run multiple crawling tasks in parallel with concurrent.futures.ThreadPoolExecutor.
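One design consideration: each call to fetch_comments starts a full Chrome instance, so it is usually worth capping the pool size rather than relying on the default. A minimal variation on the snippet above:

# Limit the number of simultaneous browser instances
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch_comments, urls))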
Data cleaning and storage
The captured data often needs to be further cleaned and stored. Python provides a variety of tools and libraries to help you complete these tasks.
Data cleaning
Using the Pandas library for data cleaning is very convenient. For example, suppose we have crawled a set of comment data; we can clean it with the following code:
import pandas as pd

# Suppose we have crawled the comment data
comments = [
    {'text': 'Great product!', 'date': '2023-01-01'},
    {'text': 'Not so good.', 'date': '2023-01-02'},
    {'text': 'Excellent service!', 'date': '2023-01-03'}
]

# Convert the data to a DataFrame
df = pd.DataFrame(comments)

# Clean the data
df['date'] = pd.to_datetime(df['date'])
df['text'] = df['text'].str.strip()  # strip surrounding whitespace

print(df)
Data storage
Store the cleaned data into a file or database. For example, you can save data as a CSV file:
df.to_csv('comments.csv', index=False)  # output file name is arbitrary
Or store the data in the SQLite database:
import sqlite3

conn = sqlite3.connect('comments.db')  # database file name is arbitrary
df.to_sql('comments', conn, if_exists='replace', index=False)
conn.close()
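To verify what was stored, you can read the table back with pandas (assuming the same comments.db file used above):

import sqlite3
import pandas as pd

conn = sqlite3.connect('comments.db')
stored = pd.read_sql('SELECT * FROM comments', conn)
conn.close()
print(stored.head())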
Conclusion
With the walkthrough in this article, you should now have a solid grasp of how to crawl hidden div content from web pages using Python. Whether you need static HTML parsing or dynamic content crawling, there are tools and techniques to help you complete the task efficiently.
That concludes this detailed guide to crawling hidden div content from web pages with Python. For more on this topic, please see my other related articles!