Updated on 2025-03-05

Five ways to get web page data in Python

1. Use requests + BeautifulSoup

requests is a very popular HTTP request library, and BeautifulSoup is a library for parsing HTML and XML documents. By combining the two, you can easily fetch and parse web content.

Example: Get and parse web content

import requests
from bs4 import BeautifulSoup
 
# Send HTTP request
url = "https://example.com"  # placeholder URL
response = requests.get(url)

# Ensure the request is successful
if response.status_code == 200:
    # Use BeautifulSoup to parse the web page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title from the web page
    title = soup.title.string
    print(f"Web page title: {title}")

    # Extract all links in the webpage
    for link in soup.find_all('a'):
        print(f"Link: {link.get('href')}")
else:
    print("Web page request failed")

2. Use requests + lxml

lxml is another powerful HTML/XML parsing library that supports XPath and CSS selector syntax. It parses quickly, which makes it well suited to large-scale web content.

Example: Use requests and lxml to get data

import requests
from lxml import html
 
# Send HTTP request
url = "https://example.com"  # placeholder URL
response = requests.get(url)

# Ensure the request is successful
if response.status_code == 200:
    # Use lxml to parse the web page
    tree = html.fromstring(response.content)

    # Extract the title from the web page
    title = tree.xpath('//title/text()')
    print(f"Web page title: {title[0] if title else 'Untitled'}")

    # Extract all links
    links = tree.xpath('//a/@href')
    for link in links:
        print(f"Link: {link}")
else:
    print("Web page request failed")

3. Use Selenium + BeautifulSoup

When web page content is loaded dynamically through JavaScript, static approaches such as requests + BeautifulSoup may not return the complete data. In that case, Selenium can simulate browser behavior: it controls a real browser, loads the page, executes the JavaScript, and returns the final rendered page content.

Example: Use Selenium and BeautifulSoup to get dynamic web content

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

# Start WebDriver (Selenium 4 style; the chromedriver path is a placeholder)
driver = webdriver.Chrome(service=Service("path/to/chromedriver"))

# Visit the web page
url = "https://example.com"  # placeholder URL
driver.get(url)

# Wait for the page to load
time.sleep(3)

# Get the page source code
html = driver.page_source

# Use BeautifulSoup to parse the web page
soup = BeautifulSoup(html, 'html.parser')

# Extract the title from the web page
title = soup.title.string
print(f"Web page title: {title}")

# Extract all links in the webpage
for link in soup.find_all('a'):
    print(f"Link: {link.get('href')}")

# Close the browser
driver.quit()
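
A fixed time.sleep(3) is fine for a quick demo, but an explicit wait is usually more reliable for JavaScript-heavy pages: it polls until a condition is met instead of guessing how long loading takes. A minimal sketch using Selenium's WebDriverWait (the 10-second timeout and the choice of waiting for an &lt;a&gt; element are only illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Selenium 4+ can locate chromedriver automatically
driver.get("https://example.com")  # placeholder URL

# Wait up to 10 seconds until at least one <a> element is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "a"))
)

print(f"Web page title: {driver.title}")
driver.quit()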

4. Use Scrapy

Scrapy is a powerful Python crawler framework designed for crawling large amounts of web page data. It supports asynchronous requests, handles many requests efficiently, and has many crawler features built in, such as request scheduling and downloader middleware. Scrapy is the preferred tool for large-scale crawling tasks.

Example: Scrapy project structure

  • Create a Scrapy project:
scrapy startproject myproject
  • Create a crawler:
cd myproject
scrapy genspider example_spider example.com
  • Write crawler code:
import scrapy
 
class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        # Extract the web page title
        title = response.css('title::text').get()
        print(f"Web page title: {title}")

        # Extract all links
        links = response.css('a::attr(href)').getall()
        for link in links:
            print(f"Link: {link}")
  • Running the crawler:
scrapy crawl example_spider
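
Printing inside parse() is fine for a quick check, but the more idiomatic Scrapy pattern is to yield items so the framework's feed exporters can save them for you. A minimal variant of the parse method above (the output filename below is just an example):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        # Yield one item per link; Scrapy collects these instead of printing them
        title = response.css('title::text').get()
        for link in response.css('a::attr(href)').getall():
            yield {'title': title, 'link': link}

Run it with a feed export, for example scrapy crawl example_spider -o links.json, and Scrapy writes the yielded items to that file.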

5. Use PyQuery

PyQuery is a jQuery-like library: it provides jQuery-style syntax, so you can fetch web content very conveniently using CSS selectors. PyQuery is built on the lxml library, so it parses very quickly.

Example: Use PyQuery to get data

from pyquery import PyQuery as pq
import requests
 
# Send HTTP request
url = "https://example.com"  # placeholder URL
response = requests.get(url)

# Use PyQuery to parse the web page
doc = pq(response.text)

# Extract the web page title
title = doc('title').text()
print(f"Web page title: {title}")

# Extract all links in the webpage
for link in doc('a').items():
    print(f"Link: {link.attr('href')}")

Summarize

Python provides a variety of ways to obtain web page data, each suitable for different scenarios:

  1. requests + BeautifulSoup: Suitable for simple static web crawling, easy to use.
  2. requests + lxml: Suitable when large-scale web content must be parsed efficiently; supports XPath and CSS selectors.
  3. Selenium + BeautifulSoup: Suitable for crawling dynamic web pages (JavaScript rendering), simulates browser behavior to obtain dynamic data.
  4. Scrapy: Powerful crawler framework, suitable for large-scale web crawling tasks, supports asynchronous requests and advanced features.
  5. PyQuery: Based on jQuery syntax, suitable for rapid development, providing concise CSS selector syntax.

This is the end of this article about five ways to obtain web page data in Python. For more on obtaining web page data with Python, please search my previous articles or continue browsing the related articles below. I hope everyone will keep supporting me!