Preface
In modern web development, AJAX (Asynchronous JavaScript and XML) is widely used to load data dynamically, letting a page update its content without a full refresh. However, this poses a challenge for traditional crawlers: with requests + BeautifulSoup you only obtain the initial HTML, and the dynamic data returned by subsequent AJAX calls is never captured.
Solutions:
- Selenium + ChromeDriver: simulate browser behavior and wait for the AJAX data to load before scraping.
- Directly analyze AJAX requests: capture the API endpoint with Chrome DevTools, then request the data directly with requests (more efficient).
This article walks through both approaches to crawling AJAX dynamic data with Python + Chrome and provides complete implementation code for each.
1. Understand AJAX dynamic loading
1.1 How AJAX works
- The user visits the page → the browser loads the initial HTML.
- JavaScript initiates an AJAX request (usually via fetch or XMLHttpRequest).
- The server returns JSON/XML data → the front end renders it into the page dynamically.
1.2 Problems with traditional crawlers
```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the original was elided
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
# Only the initial HTML is retrieved; the AJAX-loaded data never appears here!
```
2. Method 1: Use Selenium + Chrome to emulate the browser
2.1 Environment preparation
Install the necessary libraries.
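A minimal install for the example below (selenium for browser automation, webdriver-manager to fetch a matching ChromeDriver automatically):

```bash
pip install selenium webdriver-manager
```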
2.2 Example: Crawling a dynamically loaded news list
Suppose the target website (such as Sina News) loads additional news items via AJAX.
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Proxy settings (the host was elided in the original; fill in your own)
proxyHost = ""
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Configure the Chrome proxy
chrome_options = Options()
chrome_options.add_argument(
    f"--proxy-server=http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"
)

# Start Chrome (webdriver-manager downloads a matching ChromeDriver)
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=chrome_options,
)
driver.get("https://example.com/news")  # placeholder; the original URL was elided

# Wait for the AJAX content to load (assuming the news list is rendered by AJAX)
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".news-item"))
    )
except Exception:
    print("Timed out: news list not found")

# Extract news titles and links
news_items = driver.find_elements(By.CSS_SELECTOR, ".news-item")
for item in news_items:
    title = item.find_element(By.CSS_SELECTOR, "a").text
    link = item.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
    print(f"Title: {title}\nLink: {link}\n")

# Close the browser
driver.quit()
```
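As an aside, if no visible browser window is needed, headless mode cuts resource usage. A minimal sketch, assuming a recent Chrome version that supports the `--headless=new` flag:

```python
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Run Chrome without opening a visible window (supported by recent Chrome versions)
chrome_options.add_argument("--headless=new")
```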
2.3 Key points
- WebDriverWait: explicitly waits for the AJAX-rendered content to appear.
- EC.presence_of_element_located: checks whether the target element has been loaded into the DOM.
- find_elements + CSS/XPath selectors: locate the dynamically generated content.
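The expected_conditions module offers other ready-made waits beyond a simple presence check. A hedged sketch of two common alternatives, reusing the `.news-item` selector and the `driver` from the example above:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the element is actually visible, not merely present in the DOM
WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".news-item"))
)

# Wait until the matching elements exist, and capture them in one step
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".news-item"))
)
```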
3. Method 2: Directly crawl AJAX API data (more efficient)
3.1 Analyze AJAX Requests
- Open Chrome → F12 (Developer Tools) → Network tab.
- Refresh the page and filter for XHR/Fetch requests.
- Find the API endpoint that returns the target data (usually JSON).
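Once a candidate endpoint is found, pretty-printing its JSON response makes it easy to confirm it contains the target data. A minimal sketch; the URL here is a hypothetical placeholder for whatever endpoint DevTools revealed:

```python
import json
import requests

api_url = "https://example.com/api/data"  # hypothetical placeholder endpoint

resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
# Dump the JSON with indentation so its structure is easy to inspect
print(json.dumps(resp.json(), indent=2, ensure_ascii=False))
```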
3.2 Example: Crawling Douban Movie AJAX Data
The Douban Movie home page loads its list of popular movies via AJAX.
Step 1: Analyze the API
- Open the Douban Movie home page → F12 → Network → filter XHR.
- Discover the API endpoint: /j/search_subjects?...
Step 2: Use Python to directly request the API
```python
import requests

# Douban Movie AJAX API (host restored; the original text elided it).
# Note: the live endpoint expects the Chinese tag value 热门, rendered here as "Popular".
url = "https://movie.douban.com/j/search_subjects?type=movie&tag=Popular&sort=recommend&page_limit=20&page_start=0"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

response = requests.get(url, headers=headers)
data = response.json()  # Parse the JSON response directly

# Extract movie information
for movie in data["subjects"]:
    print(f"Title: {movie['title']}")
    print(f"Rating: {movie['rate']}")
    print(f"Link: {movie['url']}\n")
```
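Because the endpoint takes page_limit and page_start parameters, pagination follows by stepping page_start. A sketch under the assumption that the parameters behave as their names suggest:

```python
import time
import requests

headers = {"User-Agent": "Mozilla/5.0"}

# Fetch three pages of 20 movies each by stepping page_start: 0, 20, 40
for start in range(0, 60, 20):
    url = (
        "https://movie.douban.com/j/search_subjects"
        f"?type=movie&tag=Popular&sort=recommend&page_limit=20&page_start={start}"
    )
    data = requests.get(url, headers=headers).json()
    for movie in data.get("subjects", []):
        print(movie["title"], movie["rate"])
    time.sleep(1)  # pause between pages to be polite
```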
3.3 Advantages and limitations
- Advantages: fast, since only the data, not the whole page, is downloaded.
- Limitations: the API must be analyzed by hand, and some endpoints use encryption or require authentication.
4. Summary
| Method | Applicable scenarios | Advantages | Shortcomings |
| --- | --- | --- | --- |
| Selenium | Complex dynamically rendered pages | Simulates full browser behavior | Slow; high resource usage |
| Direct API request | Structured data (such as JSON) | Efficient and fast | Interfaces must be analyzed manually and may be restricted |
Best practice recommendations
- Analyze the AJAX API first: if the target site exposes a clear endpoint, direct requests are more efficient.
- Fall back to Selenium: for pages where the API cannot be reached directly or interaction is required.
- Stay compliant: avoid high-frequency requests to prevent being banned (a minimal throttling sketch follows).
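On the compliance point, a minimal throttling sketch; the URLs and delay range are arbitrary illustrations:

```python
import random
import time

import requests

urls = [f"https://example.com/api?page={i}" for i in range(5)]  # hypothetical URLs

for u in urls:
    resp = requests.get(u, headers={"User-Agent": "Mozilla/5.0"})
    print(u, resp.status_code)
    # Random 1-3 second pause so requests don't hammer the server
    time.sleep(random.uniform(1, 3))
```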
This concludes the article on two methods of crawling AJAX dynamic data with Python + Chrome. For more on crawling AJAX data with Python and Chrome, please search my previous articles or continue browsing the related articles below. I hope you will keep supporting me!