1. Use requests + BeautifulSoup
`requests` is a very popular HTTP library, and `BeautifulSoup` is a library for parsing HTML and XML documents. By combining the two you can easily fetch and parse web content.
Example: Get and parse web content
```python
import requests
from bs4 import BeautifulSoup

# Send HTTP request
url = "https://example.com"  # placeholder URL
response = requests.get(url)

# Ensure the request is successful
if response.status_code == 200:
    # Use BeautifulSoup to parse the web page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title from the web page
    title = soup.title.string
    print(f"Web page title: {title}")

    # Extract all links in the web page
    for link in soup.find_all('a'):
        print(f"Link: {link.get('href')}")
else:
    print("Web page request failed")
```
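Besides `find_all`, BeautifulSoup also supports CSS selectors via `select()`. Below is a minimal sketch of the same idea; the URL and the selector are placeholders rather than part of the original example.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder URL
response = requests.get(url)
response.raise_for_status()  # raise an exception on HTTP errors instead of checking status_code

soup = BeautifulSoup(response.text, "html.parser")

# select() takes a CSS selector; here: every <a> element that has an href attribute
for link in soup.select("a[href]"):
    print(f"Link: {link['href']}")
```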
2. Use requests + lxml
`lxml` is another powerful HTML/XML parsing library. It supports XPath and CSS selector syntax, parses quickly, and is well suited to large-scale web content.
Example: Use requests and lxml to get data
```python
import requests
from lxml import html

# Send HTTP request
url = "https://example.com"  # placeholder URL
response = requests.get(url)

# Ensure the request is successful
if response.status_code == 200:
    # Use lxml to parse the web page
    tree = html.fromstring(response.content)

    # Extract the title from the web page
    title = tree.xpath('//title/text()')
    print(f"Web page title: {title[0] if title else 'Untitled'}")

    # Extract all links
    links = tree.xpath('//a/@href')
    for link in links:
        print(f"Link: {link}")
else:
    print("Web page request failed")
```
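The XPath example covers one half of the claim above; lxml can also evaluate CSS selectors through the optional `cssselect` package. A minimal sketch, assuming `cssselect` is installed and using a placeholder URL:

```python
import requests
from lxml import html  # cssselect() below requires the optional cssselect package

url = "https://example.com"  # placeholder URL
tree = html.fromstring(requests.get(url).content)

# cssselect() translates the CSS selector to XPath internally
for link in tree.cssselect("a[href]"):
    print(f"Link: {link.get('href')}")
```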
3. Use Selenium + BeautifulSoup
When web page content is loaded dynamically through JavaScript, static approaches such as `requests` + `BeautifulSoup` may not be able to obtain the complete data. In that case, Selenium can be used to simulate browser behavior, load the page, and obtain the dynamically generated content. Selenium controls a real browser, executes JavaScript, and exposes the final rendered page source.
Example: Use Selenium and BeautifulSoup to get dynamic web content
```python
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Start WebDriver (in Selenium 4+ pass a Service object instead of executable_path)
driver = webdriver.Chrome(executable_path="path/to/chromedriver")

# Visit the web page
url = "https://example.com"  # placeholder URL
driver.get(url)

# Wait for the page to load
time.sleep(3)

# Get the page source code
html = driver.page_source

# Use BeautifulSoup to parse the web page
soup = BeautifulSoup(html, 'html.parser')

# Extract the title from the web page
title = soup.title.string
print(f"Web page title: {title}")

# Extract all links in the web page
for link in soup.find_all('a'):
    print(f"Link: {link.get('href')}")

# Close the browser
driver.quit()
```
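A fixed `time.sleep(3)` is fragile: it can be too short on slow pages and wastes time on fast ones. A more robust variant uses Selenium's explicit waits; this is only a sketch, and the element being waited for (`<a>` tags) is an assumption about the target page.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://example.com")  # placeholder URL

# Wait up to 10 seconds until at least one <a> element is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "a"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(f"Web page title: {soup.title.string if soup.title else 'Untitled'}")

driver.quit()
```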
4. Use Scrapy
Scrapy is a powerful Python crawler framework designed for crawling large amounts of web page data. It supports asynchronous requests, handles many concurrent requests efficiently, and ships with built-in crawler features such as request scheduling and downloader middleware. Scrapy is the preferred tool for large-scale crawling tasks; a sketch of the settings that control this behavior is shown after the steps below.
Example: Scrapy project structure
- Create a Scrapy project:
```bash
scrapy startproject myproject
```
- Create a crawler:
```bash
cd myproject
scrapy genspider example_spider example.com   # example.com is a placeholder for the target domain
```
- Write crawler code:
```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        # Extract the web page title
        title = response.css('title::text').get()
        print(f"Web page title: {title}")

        # Extract all links
        links = response.css('a::attr(href)').getall()
        for link in links:
            print(f"Link: {link}")
```
- Run the crawler:
```bash
scrapy crawl example_spider
```
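The asynchronous, scheduled behavior mentioned above is configured mostly through the project's `settings.py`. The excerpt below is only an illustrative sketch of commonly tuned settings, not values taken from the original article:

```python
# myproject/settings.py (excerpt) -- illustrative values only
BOT_NAME = "myproject"

# Maximum number of requests Scrapy keeps in flight concurrently
CONCURRENT_REQUESTS = 16

# Delay (in seconds) between requests to the same site, to stay polite
DOWNLOAD_DELAY = 0.5

# Whether to respect robots.txt
ROBOTSTXT_OBEY = True

# Downloader middlewares are enabled and ordered here (the name below is the
# default generated by "scrapy startproject"; adjust it to your own project)
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.MyprojectDownloaderMiddleware": 543,
# }
```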
5. Use PyQuery
`PyQuery` is a jQuery-like library: it provides jQuery-style syntax, so you can extract web content very conveniently with CSS selectors. `PyQuery` is built on top of the `lxml` library, so parsing is very fast.
Example: Use PyQuery to get data
```python
from pyquery import PyQuery as pq
import requests

# Send HTTP request
url = "https://example.com"  # placeholder URL
response = requests.get(url)

# Use PyQuery to parse the web page
doc = pq(response.text)

# Extract the web page title
title = doc('title').text()
print(f"Web page title: {title}")

# Extract all links in the web page
for link in doc('a').items():
    print(f"Link: {link.attr('href')}")
```
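PyQuery can also fetch the document itself, so the explicit `requests` call above is optional. A minimal sketch with a placeholder URL:

```python
from pyquery import PyQuery as pq

# PyQuery downloads and parses the page in one step (placeholder URL)
doc = pq(url="https://example.com")

print(f"Web page title: {doc('title').text()}")
for link in doc("a").items():
    print(f"Link: {link.attr('href')}")
```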
Summary
Python provides a variety of ways to obtain web page data, each suitable for different scenarios:
- `requests` + `BeautifulSoup`: suitable for simple static-page crawling; easy to use.
- `requests` + `lxml`: suitable when large-scale web content must be parsed efficiently; supports XPath and CSS selectors.
- `Selenium` + `BeautifulSoup`: suitable for crawling dynamic (JavaScript-rendered) pages; simulates browser behavior to obtain dynamically generated data.
- `Scrapy`: a powerful crawler framework for large-scale crawling tasks; supports asynchronous requests and advanced features.
- `PyQuery`: jQuery-like syntax; suitable for rapid development with concise CSS selectors.
This is the end of this article about five ways to obtain web page data in Python. For more on obtaining web page data with Python, please search my previous articles or browse the related articles below, and I hope you will continue to support me.