Introduction
A web crawler is an automated program that collects data from the Internet. Whether used for data analysis, market research, academic research, or search-engine indexing, crawler technology plays an important role in modern Internet applications.
This article uses the requests library to explain how to perform basic web data crawling. requests is a simple, easy-to-use Python library that encapsulates HTTP requests, greatly simplifying network communication, and it is well suited to building web crawlers.
1. Install the requests library
First, if you have not installed the requests library, you can install it through pip:
pip install requests
2. Send a simple HTTP request
The core function of the requests library is to send HTTP requests and receive responses. Here is a simple example showing how to send a GET request to a web page and inspect the response content.
import requests

# Send a GET request (example.com is used here as a placeholder URL)
response = requests.get('https://www.example.com')

# Output the response status code
print("Status Code:", response.status_code)

# Output the web page content (HTML)
print("Response Text:", response.text)
Explanation:
- requests.get(url): sends a GET request to the specified URL.
- response.status_code: returns the status code of the HTTP response (for example, 200 means success).
- response.text: returns the HTML content of the web page.
3. Request a URL with parameters
Web pages often require query parameters for dynamic requests. With requests, parameters can be passed as a dictionary, which makes it easy to construct the request URL.
import requests

# httpbin.org is assumed here as a test endpoint that echoes the request back
url = 'https://httpbin.org/get'
params = {
    'name': 'John',
    'age': 30
}

# Send a GET request with query parameters
response = requests.get(url, params=params)

# Output the response URL to see the final requested URL
print("Requested URL:", response.url)

# Output the response content
print("Response Text:", response.text)
In this example, the key-value pairs in the params dictionary are encoded as URL query parameters, ultimately forming the URL /get?name=John&age=30.
4. Send a POST request
Some websites require form data to be submitted via POST requests. The requests library also supports sending POST requests and passing data along with them.
import requests

# httpbin.org is assumed here as a test endpoint that echoes the request back
url = 'https://httpbin.org/post'
data = {
    'username': 'admin',
    'password': '123456'
}

# Send a POST request with form data
response = requests.post(url, data=data)

# Output the response content
print("Response Text:", response.text)
Explanation:
- requests.post(url, data=data): sends a POST request to the URL, passing the form data through the data parameter.
- You can also use json=data to send the payload in JSON format.
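As a minimal sketch of the json= variant (again assuming httpbin.org as a test endpoint):

import requests

# Assumed test endpoint; replace with the API you are actually targeting
url = 'https://httpbin.org/post'
payload = {'username': 'admin', 'password': '123456'}

# json=payload serializes the dictionary to JSON and sets the
# Content-Type: application/json header automatically
response = requests.post(url, json=payload)
print("Response Text:", response.text)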
5. Processing request headers
Sometimes specific request headers need to be set when sending HTTP requests, such as a user agent or authentication information. requests makes this easy through the headers parameter.
import requests

# example.com is used here as a placeholder URL
url = 'https://www.example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Send a GET request with custom header information
response = requests.get(url, headers=headers)

# Output the response content
print("Response Text:", response.text)
In this example, the User-Agent header simulates a browser request, so the target website treats the request as coming from a browser rather than a crawler.
6. Process the response content
The requests library supports a variety of response formats, such as HTML, JSON, and images. By checking the content type of the response, you can easily process different kinds of data.
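As a rough illustration (using example.com as a placeholder URL), you can branch on the Content-Type response header:

import requests

# Placeholder URL; substitute the page or API you actually want
response = requests.get('https://www.example.com')

# Inspect the Content-Type header to decide how to handle the body
content_type = response.headers.get('Content-Type', '')

if 'application/json' in content_type:
    data = response.json()    # parsed JSON (dict or list)
elif 'text/html' in content_type:
    data = response.text      # decoded HTML text
else:
    data = response.content   # raw bytes (e.g., an image)

print("Content-Type:", content_type)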
6.1 Parsing JSON responses
Some websites return data in JSON format. requests provides the .json() method to parse JSON data.
import requests

# jsonplaceholder.typicode.com is assumed here as a public test API that returns JSON
url = 'https://jsonplaceholder.typicode.com/posts'
response = requests.get(url)

# If the response body is JSON, the .json() method parses it into Python objects
json_data = response.json()
print("JSON Data:", json_data)
6.2 Download a file (such as an image)
If the crawled content is a file, such as an image, you can use the response's content attribute to work with the binary data.
import requests

# Placeholder image URL and filename; replace with the file you actually need
url = 'https://www.example.com/image.jpg'
response = requests.get(url)

# Save the image locally in binary mode
with open('image.jpg', 'wb') as file:
    file.write(response.content)
7. Exception handling
When sending requests with requests, you may encounter network problems, timeouts, 404 errors, and so on. To improve the robustness of the crawler, it is recommended to use exception handling to catch these errors.
import requests

try:
    # Placeholder URL; timeout=5 aborts the request if no response arrives within 5 seconds
    response = requests.get('https://www.example.com', timeout=5)
    # Raise an exception if the response status code is 4xx or 5xx
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
8. Good practices for crawlers
Set reasonable request intervals: to avoid putting excessive pressure on the target server, pause between requests rather than sending them back to back.
import time
time.sleep(1)  # pause for 1 second between requests
Comply with robots.txt: before crawling data, check the target website's robots.txt file and make sure your crawler follows the site's crawling rules.
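A minimal sketch of such a check with the standard-library urllib.robotparser (example.com and the crawler name are placeholders):

from urllib.robotparser import RobotFileParser

# Placeholder site; point this at the site you intend to crawl
rp = RobotFileParser('https://www.example.com/robots.txt')
rp.read()

# can_fetch() reports whether the given user agent may crawl the URL
allowed = rp.can_fetch('MyCrawler', 'https://www.example.com/some/page')
print("Allowed to crawl:", allowed)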
Use a proxy: if frequent crawling requests get your IP banned, consider using a proxy pool to rotate the requesting IP address.
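A minimal sketch of routing a request through a proxy via the proxies parameter (the proxy address below is a made-up placeholder):

import requests

# Hypothetical proxy address; substitute a working proxy from your pool
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}

response = requests.get('https://www.example.com', proxies=proxies, timeout=5)
print("Status Code:", response.status_code)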
Disguise request headers: simulate real browser requests (as shown in section 5) to avoid being identified as a crawler.
9. Summary
The requests library is a very simple and easy-to-use HTTP library in Python, suitable for most web data crawling needs. When using it, you need to understand how to send GET/POST requests, pass parameters, process response data, and handle exceptions.
This is the end of the tutorial on using Python's requests library for web data crawling. For more on crawling web data with requests, please check out my other related articles!