Detailed explanation of using HTTP proxies in Python crawlers

1. How proxies work: an "invisibility cloak" for the crawler

An HTTP proxy works like a parcel transit station: your crawler's request is sent to the proxy server first, and the proxy then forwards it to the target website. The target website only sees the proxy server's IP address, never your real one. The benefits of this "middleman" mechanism include:

  • Hide the real IP

Elite (high-anonymity) proxies can completely hide your network identity; the target website cannot even tell that a proxy is being used (a quick demonstration follows this list).

  • Break through IP restrictions

When a single IP is rate-limited for requesting too frequently, switching to another proxy restores access immediately.

  • Distributed acquisition

With proxies deployed in multiple regions, requests can be spread across IPs nationwide, simulating the access behavior of real users.
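To see the effect for yourself, compare what an IP-echo service reports with and without a proxy. A minimal sketch, assuming httpbin.org/ip as the test endpoint and a placeholder proxy address:

import requests
 
proxy = "http://123.123.123.123:8080"  # placeholder proxy address
proxies = {"http": proxy, "https": proxy}
 
# Without a proxy, the service echoes your real IP
print(requests.get("https://httpbin.org/ip", timeout=10).json())
 
# Through the proxy, the service only sees the proxy server's IP
print(requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10).json())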

2. Guide to selecting a proxy type

Proxy type                   | Anonymity | How easily the target detects it | Applicable scenarios
Transparent proxy            | Low       | Easily detected                  | Simple network acceleration only
Anonymous proxy              | Medium    | Hard to detect                   | Light data collection
Elite (high-anonymity) proxy | High      | Almost undetectable              | High-frequency collection, anti-scraping battles
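You can also gauge a proxy's anonymity level yourself by inspecting which headers reach the target. The sketch below assumes httpbin.org/headers as the echo endpoint and uses a rough heuristic: a transparent proxy typically leaks your real IP in X-Forwarded-For, an anonymous proxy reveals itself through headers such as Via, and an elite proxy adds neither.

import requests
 
def anonymity_level(proxy):
    """Roughly classify a proxy by the headers it forwards (heuristic only)."""
    proxies = {"http": proxy, "https": proxy}
    echoed = requests.get("https://httpbin.org/headers",
                          proxies=proxies, timeout=10).json()["headers"]
    if "X-Forwarded-For" in echoed:
        return "transparent proxy"   # your real IP is leaked to the target
    if "Via" in echoed or "Proxy-Connection" in echoed:
        return "anonymous proxy"     # proxy use is detectable, but your IP is hidden
    return "elite proxy"             # no obvious proxy fingerprints
 
print(anonymity_level("http://123.123.123.123:8080"))  # placeholder proxy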

3. Code practice: setting up a proxy in a few lines of code

  • Basic version (requests library)
import requests
 
proxies = {
    "http": "http://123.123.123.123:8080",
    "https": "http://123.123.123.123:8080"
}
 
response = ("", proxies=proxies)
print()
  • Advanced version (Scrapy framework)
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'myproject.middlewares.ProxyMiddleware': 100,   # adjust to your project's module path
}
 
# middlewares.py: custom proxy middleware
class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://123.123.123.123:8080'

Key parameter description:

  • proxy: the proxy server address, which must be in the format http://ip:port
  • timeout: a 10-20 second timeout is recommended so requests don't hang indefinitely
  • allow_redirects: keeps the proxy in effect while redirects are followed
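Put together, those parameters might be used like this (a sketch; the URL is a stand-in for your real target):

import requests
 
proxies = {
    "http": "http://123.123.123.123:8080",
    "https": "http://123.123.123.123:8080"
}
 
response = requests.get(
    "https://example.com/data",   # stand-in for your target URL
    proxies=proxies,
    timeout=15,                   # within the recommended 10-20 s window
    allow_redirects=True          # redirects are also fetched through the proxy
)
print(response.status_code)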

4. Proxy pool management: building a smart IP warehouse

Proxy verification mechanism

import requests
 
def check_proxy(proxy):
    try:
        response = requests.get("https://httpbin.org/ip",   # example test URL
                                proxies={"http": proxy, "https": proxy},
                                timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

Dynamic switching strategy

import random
 
proxy_pool = [
    "http://ip1:port",
    "http://ip2:port",
    "http://ip3:port"
]
 
current_proxy = random.choice(proxy_pool)

Automatic retry decorator

import functools
 
def retry(max_retries=3):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for _ in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    continue
            return None
        return wrapper
    return decorator
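Combining the pool, the random switch, and the retry decorator, a fetch helper could look like the sketch below; proxy_pool and retry are the objects defined above, and the target URL is a placeholder:

import random
import requests
 
@retry(max_retries=3)
def fetch(url):
    # A new proxy is drawn from the pool on every attempt
    proxy = random.choice(proxy_pool)
    response = requests.get(url,
                            proxies={"http": proxy, "https": proxy},
                            timeout=10)
    response.raise_for_status()   # treat HTTP errors as failures so they get retried
    return response
 
result = fetch("https://httpbin.org/ip")   # placeholder target URL
if result is not None:
    print(result.text)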

5. Anti-scraping countermeasures

Request header disguise

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "/"
}

Access frequency control

import time
import random
 
time.sleep(random.uniform(1, 3))  # wait a random 1-3 seconds between requests

Cookie persistence

import requests
 
session = requests.Session()
response = session.get(url, proxies=proxies)
# Later requests through this session automatically carry the cookies it has collected
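These three techniques are usually combined: one session keeps the cookies, every request carries the disguised headers and goes through the proxy, and a random pause separates requests. A minimal sketch reusing the headers and proxies variables from above (the URLs are placeholders):

import random
import time
import requests
 
session = requests.Session()
session.headers.update(headers)   # reuse the disguised headers from above
 
for url in ["https://example.com/page/1", "https://example.com/page/2"]:  # placeholder URLs
    response = session.get(url, proxies=proxies, timeout=10)
    print(url, response.status_code)
    time.sleep(random.uniform(1, 3))   # random pause between requests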

6. Frequently asked questions

Q1: The proxy returns a 502/503 error

  • Check whether the proxy supports the HTTPS protocol
  • Confirm that the proxy server is still alive
  • Try proxy nodes in a different region

Q2: Access speed slows down

  • Test the proxy server's latency (ping < 100 ms preferred)
  • Increase the size of the proxy pool (at least 10 nodes recommended)
  • Enable asynchronous requests with the aiohttp library (see the sketch after this FAQ)

Q3: Still blocked despite frequent proxy switching

  • Use elite (high-anonymity) proxies together with User-Agent randomization
  • Add randomized request header parameters
  • Handle CAPTCHAs through a captcha-solving platform
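For the speed issue in Q2, asynchronous requests let you check or fetch through many proxies concurrently. A minimal sketch with aiohttp, assuming httpbin.org/ip as the test URL and the same placeholder proxy addresses; aiohttp takes the proxy as a per-request proxy= argument:

import asyncio
import aiohttp
 
proxy_pool = [
    "http://ip1:port",   # placeholders, same as above
    "http://ip2:port",
    "http://ip3:port",
]
 
async def fetch_via(session, proxy):
    try:
        async with session.get("https://httpbin.org/ip", proxy=proxy,
                               timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return proxy, resp.status
    except Exception as exc:
        return proxy, repr(exc)
 
async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_via(session, p) for p in proxy_pool))
 
print(asyncio.run(main()))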

7. Performance optimization

Multi-threaded verification

from concurrent.futures import ThreadPoolExecutor
 
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(check_proxy, proxy_list))
valid_proxies = [p for p, ok in zip(proxy_list, results) if ok]

Cache valid proxies

import redis
 
r = redis.Redis(host='localhost', port=6379, db=0)
r.set("valid_proxy", current_proxy, ex=300)  # cache for 5 minutes

Smart routing

def get_best_proxy(target_url):
    # Choose a proxy in the same region as the target website
    # Prefer proxies that passed verification most recently
    pass
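One possible way to flesh out this stub is to keep a region tag and the last successful check time for each proxy, then pick the best match. Everything below is illustrative: the region tags, the timestamps, and passing a region directly instead of deriving it from target_url are all assumptions.

import time
 
# Illustrative metadata store: proxy -> region tag and last successful check time
proxy_meta = {
    "http://ip1:port": {"region": "guangdong", "last_ok": time.time() - 60},
    "http://ip2:port": {"region": "beijing",   "last_ok": time.time() - 600},
}
 
def get_best_proxy(target_region=None):
    candidates = list(proxy_meta.items())
    if target_region:
        # Prefer proxies located in the same region as the target website
        same_region = [(p, m) for p, m in candidates if m["region"] == target_region]
        candidates = same_region or candidates
    # Among the remaining candidates, prefer the most recently verified proxy
    return max(candidates, key=lambda item: item[1]["last_ok"])[0]
 
print(get_best_proxy("guangdong"))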

8. Guidelines for compliance use

  • Comply with the target website's robots.txt rules
  • Control the collection frequency to avoid putting excessive load on the target server
  • Avoid collecting data that involves user privacy
  • Retain proxy usage logs for auditing

Conclusion: An HTTP proxy is an essential tool for crawler engineers, but it is not a master key. Real-world development combines it with techniques such as request header disguise, access frequency control, and CAPTCHA handling. It is advisable to start with free proxies, gradually master proxy pool management, and then choose paid services according to your actual needs. Remember: the technology itself is neither good nor evil; only compliant use keeps a project on steady ground.

The above is a detailed walkthrough of using HTTP proxies in Python crawlers. For more on Python HTTP proxy usage, please follow my other related articles!