Introduction
Web scraping is an important skill in data science and web crawling. Python is a popular language for it because its rich ecosystem of third-party libraries simplifies page requests, parsing, and data extraction. This article introduces several common ways to crawl web page data, with code examples to help you get started quickly.
1. Use requests and BeautifulSoup for web crawling
1.1 Install dependencies
First, you need to install the requests and beautifulsoup4 libraries. They are used for requesting web pages and parsing HTML, respectively:
pip install requests beautifulsoup4
1.2 Basic usage
The requests library is used to send HTTP requests and fetch the HTML content of a page, and BeautifulSoup is used to parse that HTML and extract the required data.
import requests
from bs4 import BeautifulSoup

# Send an HTTP GET request
url = "https://example.com"  # replace with the page you want to crawl
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title
title = soup.title.string
print("Web title:", title)

# Extract all links
links = soup.find_all('a')  # Find all <a> tags
for link in links:
    href = link.get('href')
    print("Link:", href)
Code parsing:
- requests.get(url): sends a GET request and returns a response object.
- BeautifulSoup(response.text, 'html.parser'): parses the HTML content of the page.
- soup.title.string: gets the title of the web page.
- soup.find_all('a'): finds all <a> tags, usually used to collect links.
1.3 Use BeautifulSoup to extract specific data
Suppose the page we crawl contains multiple entries, and we want to extract the relevant fields from each one — here, the text and author of each quote.
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/"  # A simple page of famous quotes and their authors
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract each quote and its author
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(f"Quote: {text}, Author: {author}")
Code parsing:
- soup.find_all('div', class_='quote'): finds all <div> elements that contain a quote.
- quote.find('span', class_='text'): finds the text of each quote.
- quote.find('small', class_='author'): finds the author's name.
2. Use requests and lxml for web crawling
2.1 Install dependencies
lxml is another powerful library for parsing HTML and XML. It is more efficient than BeautifulSoup and well suited to handling large documents.
pip install requests lxml
2.2 Basic usage
import requests
from lxml import html

# Send a request
url = "http://quotes.toscrape.com/"
response = requests.get(url)

# Parse the HTML
tree = html.fromstring(response.content)

# Extract quotes and authors
quotes = tree.xpath('//div[@class="quote"]')
for quote in quotes:
    text = quote.xpath('.//span[@class="text"]/text()')[0]
    author = quote.xpath('.//small[@class="author"]/text()')[0]
    print(f"Quote: {text}, Author: {author}")
Code parsing:
- html.fromstring(response.content): parses the HTML content and returns an lxml Element object.
- tree.xpath('//div[@class="quote"]'): uses XPath to find all <div> elements that contain a quote.
- quote.xpath('.//span[@class="text"]/text()'): extracts the text of the quote.
2.3 Advantages
- lxml provides XPath support, giving you more flexibility when selecting and filtering page elements, which is especially useful for complex page structures (see the sketch below).
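In particular, XPath predicates make it easy to filter elements by the content of their children. Here is a minimal sketch reusing the quotes page from above; the author name in the filter is just an illustrative example:

import requests
from lxml import html

response = requests.get("http://quotes.toscrape.com/")
tree = html.fromstring(response.content)

# Select only quotes whose author is "Albert Einstein" (example filter),
# pulling the quote text out in a single XPath expression
texts = tree.xpath(
    '//div[@class="quote"][.//small[@class="author"]="Albert Einstein"]'
    '//span[@class="text"]/text()'
)
for text in texts:
    print(text)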
3. Use Selenium to crawl dynamic web pages
Some web pages load their content dynamically through JavaScript, so requests and BeautifulSoup may not be able to retrieve that data. In such cases, Selenium can simulate browser behavior and help you crawl the dynamically loaded content.
3.1 Install dependencies
You need to install selenium and a browser driver (such as ChromeDriver).
pip install selenium
You also need to download the browser driver that matches your browser, such as ChromeDriver, and add its path to your environment variables. You can download it from the following URL:
- ChromeDriver: https://sites.google.com/chromium.org/driver/
3.2 Basic usage
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Set the ChromeDriver path (Selenium 4+ uses the Service class instead of executable_path)
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')

# Open the web page (a dynamically loaded page)
driver.get("http://quotes.toscrape.com/js/")

# Wait for the page to load
time.sleep(2)

# Get the page content
quotes = driver.find_elements(By.CLASS_NAME, 'quote')
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    print(f"Quote: {text}, Author: {author}")

# Close the browser
driver.quit()
Code parsing:
- webdriver.Chrome(executable_path='/path/to/chromedriver'): starts the Chrome browser with the specified driver path.
- driver.get(url): opens the web page.
- driver.find_elements(By.CLASS_NAME, 'quote'): finds all elements with the class name quote.
- time.sleep(2): waits for the page's dynamic content to load.
3.3 Advantages
- Selenium can crawl dynamically loaded content, making it suitable for pages rendered with JavaScript.
- It can simulate real browser operations such as clicking, scrolling, and filling in forms (a short sketch follows below).
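As a rough sketch of such interactions, the snippet below scrolls the quotes page used above and clicks its "Next" pagination link; the selectors are assumptions tied to that page and should be adapted to your own target:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/js/")
time.sleep(2)  # wait for the JavaScript-rendered content

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Click the "Next" pagination link (assumes an <li class="next"> element exists)
driver.find_element(By.CSS_SELECTOR, "li.next a").click()
time.sleep(2)
print(driver.current_url)  # should now point at the second page

driver.quit()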
4. Use the Scrapy framework for web crawling
Scrapy is a high-level framework for crawling websites and extracting data, well suited to large-scale crawling tasks. It provides features such as concurrent requests, automatic cookie handling, and retrying of failed requests.
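Many of these behaviors are controlled through a project's settings.py. The fragment below is a minimal sketch with illustrative values, not tuned recommendations:

# settings.py (fragment) -- illustrative values only
CONCURRENT_REQUESTS = 16   # how many requests Scrapy sends in parallel
COOKIES_ENABLED = True     # automatic cookie handling
RETRY_ENABLED = True       # retry failed requests
RETRY_TIMES = 2            # number of retries per failed request
DOWNLOAD_DELAY = 0.5       # polite delay (seconds) between requests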
4.1 Install Scrapy
pip install scrapy
4.2 Create a Scrapy project
scrapy startproject myspider
4.3 Write a Scrapy Spider
Suppose we want to crawl a simple web page and extract famous quotes and authors.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        # Next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
4.4 Running Scrapy Spider
Run the following command in the project root directory:
scrapy crawl quotes
4.5 Advantages
- Scrapy is a very powerful framework for large-scale crawling tasks; it sends requests concurrently and can efficiently process a large number of pages.
- It has many built-in features such as pagination handling, data export (to JSON, CSV, or a database), and error handling; see the export example below.
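For example, the built-in feed exports mean the spider above can write its results to a file without any extra code, using the standard -o output flag:

scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv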
5. Other crawling methods
5.1 Using the pyquery library
pyquery is a jQuery-like library that provides a jQuery-style API for parsing and manipulating HTML documents.
pip install pyquery
from pyquery import PyQuery as pq

# Fetch the web content (pyquery downloads the page itself when given url=)
url = "http://quotes.toscrape.com/"
doc = pq(url=url)

# Extract quotes and authors
for quote in doc('.quote').items():
    text = quote('.text').text()
    author = quote('.author').text()
    print(f"Quote: {text}, Author: {author}")
5.2 Using the requests-html library
requests-html is a library that combines requests with PyQuery-style parsing. It is designed for web crawling and can handle JavaScript rendering.
pip install requests-html
from requests_html import HTMLSession

session = HTMLSession()
url = "http://quotes.toscrape.com/js/"
response = session.get(url)

# Render the JavaScript
response.html.render()

# Extract quotes and authors
quotes = response.html.find('.quote')
for quote in quotes:
    text = quote.find('.text', first=True).text
    author = quote.find('.author', first=True).text
    print(f"Quote: {text}, Author: {author}")
Summary
Python provides a variety of powerful web crawling methods suited to different kinds of pages. requests plus BeautifulSoup is the simplest combination and works well for static pages; lxml adds efficient XPath-based parsing; Selenium is a powerful tool for crawling dynamically loaded pages; and Scrapy is a complete framework for large-scale crawling tasks. Choosing the right tool lets you crawl web data efficiently and apply it to fields such as data analysis and content aggregation.