Preface
In modern web development, AJAX (Asynchronous JavaScript and XML) is widely used to load data dynamically, letting a page update its content without a full refresh. However, this poses a challenge for traditional crawlers: with requests + BeautifulSoup you only obtain the initial HTML, and the dynamic data returned by subsequent AJAX calls is never captured.
Solutions:
- Selenium + ChromeDriver: simulate browser behavior and wait for the AJAX data to load before scraping.
- Directly analyze AJAX requests: capture the API endpoint with Chrome DevTools, then request the data directly with requests (more efficient).
This article walks through both approaches to crawling AJAX dynamic data with Python + Chrome and provides complete implementation code for each.
1. Understand AJAX dynamic loading
1.1 How AJAX works
- The user visits the page → the browser loads the initial HTML.
- JavaScript initiates an AJAX request (usually via fetch or XMLHttpRequest).
- The server returns JSON/XML data → the front end renders it into the page dynamically.
1.2 Problems with traditional crawlers
```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the original was elided
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
# Only the initial HTML is retrieved; the AJAX-loaded data never appears here!
```
2. Method 1: Use Selenium + Chrome to emulate the browser
2.1 Environment preparation
Install the necessary libraries.
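A minimal install for the example below (selenium for browser automation, webdriver-manager to fetch a matching ChromeDriver automatically):

```bash
pip install selenium webdriver-manager
```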
2.2 Example: Crawling a dynamically loaded news list
Suppose the target website (such as Sina News) loads additional news items via AJAX.
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Proxy settings (the host was elided in the original; fill in your own)
proxyHost = ""
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Configure the Chrome proxy
chrome_options = Options()
chrome_options.add_argument(
    f"--proxy-server=http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"
)

# Start Chrome (webdriver-manager downloads a matching ChromeDriver)
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=chrome_options,
)
driver.get("https://example.com/news")  # placeholder; the original URL was elided

# Wait for the AJAX content to load (assuming the news list is rendered by AJAX)
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".news-item"))
    )
except Exception:
    print("Timed out: news list not found")

# Extract news titles and links
news_items = driver.find_elements(By.CSS_SELECTOR, ".news-item")
for item in news_items:
    title = item.find_element(By.CSS_SELECTOR, "a").text
    link = item.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
    print(f"Title: {title}\nLink: {link}\n")

# Close the browser
driver.quit()
```
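As an aside, if no visible browser window is needed, headless mode cuts resource usage. A minimal sketch, assuming a recent Chrome version that supports the `--headless=new` flag:

```python
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Run Chrome without opening a visible window (supported by recent Chrome versions)
chrome_options.add_argument("--headless=new")
```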
2.3 Key points
- WebDriverWait: explicitly waits for the AJAX-rendered content to appear.
- EC.presence_of_element_located: checks whether the target element has been loaded into the DOM.
- find_elements + CSS/XPath selectors: locate the dynamically generated content.
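The expected_conditions module offers other ready-made waits beyond a simple presence check. A hedged sketch of two common alternatives, reusing the `.news-item` selector and the `driver` from the example above:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the element is actually visible, not merely present in the DOM
WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".news-item"))
)

# Wait until the matching elements exist, and capture them in one step
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".news-item"))
)
```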
3. Method 2: Directly crawl AJAX API data (more efficient)
3.1 Analyze AJAX Requests
- Open Chrome → F12 (Developer Tools) → Network tab.
- Refresh the page and filter for XHR/Fetch requests.
- Find the API endpoint that returns the target data (usually JSON).
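Once a candidate endpoint is found, pretty-printing its JSON response makes it easy to confirm it contains the target data. A minimal sketch; the URL here is a hypothetical placeholder for whatever endpoint DevTools revealed:

```python
import json
import requests

api_url = "https://example.com/api/data"  # hypothetical placeholder endpoint

resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
# Dump the JSON with indentation so its structure is easy to inspect
print(json.dumps(resp.json(), indent=2, ensure_ascii=False))
```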
3.2 Example: Crawling Douban Movie AJAX Data
The Douban Movie home page loads its list of popular movies via AJAX.
Step 1: Analyze the API
- Open the Douban Movie home page → F12 → Network → filter XHR.
- Discover the API endpoint: /j/search_subjects?...
Step 2: Use Python to directly request the API
```python
import requests

# Douban Movie AJAX API (host restored; the original text elided it).
# Note: the live endpoint expects the Chinese tag value 热门, rendered here as "Popular".
url = "https://movie.douban.com/j/search_subjects?type=movie&tag=Popular&sort=recommend&page_limit=20&page_start=0"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

response = requests.get(url, headers=headers)
data = response.json()  # Parse the JSON response directly

# Extract movie information
for movie in data["subjects"]:
    print(f"Title: {movie['title']}")
    print(f"Rating: {movie['rate']}")
    print(f"Link: {movie['url']}\n")
```
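Because the endpoint takes page_limit and page_start parameters, pagination follows by stepping page_start. A sketch under the assumption that the parameters behave as their names suggest:

```python
import time
import requests

headers = {"User-Agent": "Mozilla/5.0"}

# Fetch three pages of 20 movies each by stepping page_start: 0, 20, 40
for start in range(0, 60, 20):
    url = (
        "https://movie.douban.com/j/search_subjects"
        f"?type=movie&tag=Popular&sort=recommend&page_limit=20&page_start={start}"
    )
    data = requests.get(url, headers=headers).json()
    for movie in data.get("subjects", []):
        print(movie["title"], movie["rate"])
    time.sleep(1)  # pause between pages to be polite
```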
3.3 Advantages and limitations
- Advantages: fast, since only the data, not the whole page, is downloaded.
- Limitations: the API must be analyzed by hand, and some endpoints use encryption or require authentication.
4. Summary
| Method | Applicable scenarios | Advantages | Shortcomings |
| --- | --- | --- | --- |
| Selenium | Complex dynamically rendered pages | Simulates full browser behavior | Slow; high resource usage |
| Direct API request | Structured data (such as JSON) | Efficient and fast | Interfaces must be analyzed manually and may be restricted |
Best practice recommendations
- Analyze the AJAX API first: if the target site exposes a clear endpoint, direct requests are more efficient.
- Fall back to Selenium: for pages where the API cannot be reached directly or interaction is required.
- Stay compliant: avoid high-frequency requests to prevent being banned (a minimal throttling sketch follows).
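On the compliance point, a minimal throttling sketch; the URLs and delay range are arbitrary illustrations:

```python
import random
import time

import requests

urls = [f"https://example.com/api?page={i}" for i in range(5)]  # hypothetical URLs

for u in urls:
    resp = requests.get(u, headers={"User-Agent": "Mozilla/5.0"})
    print(u, resp.status_code)
    # Random 1-3 second pause so requests don't hammer the server
    time.sleep(random.uniform(1, 3))
```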
This concludes the article on two methods of crawling AJAX dynamic data with Python + Chrome. For more on crawling AJAX data with Python and Chrome, please search my previous articles or continue browsing the related articles below. I hope you will keep supporting me!