Preface
Using aiohttp to fetch multiple pages concurrently is obviously much faster than making the requests serially. However, there is a problem: when the website detects too many requests in a short period of time, the requests start to fail and the server returns HTTP 429 (Too Many Requests).
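For illustration, this is roughly what the failing pattern looks like when every request is fired at once with no throttling (the URL and page range below are made-up placeholders, not the site used in the example later on): the status printed for the later requests flips from 200 to 429.

```python
import asyncio
import aiohttp

async def fetch(session, page):
    async with session.get('https://example.com/search', params={'page': page}) as resp:
        print(page, resp.status)  # 429 shows up here once the server starts throttling

async def main():
    async with aiohttp.ClientSession() as session:
        # Fire all requests at once, with no limit on concurrency.
        await asyncio.gather(*[fetch(session, i) for i in range(1, 21)])

asyncio.run(main())
```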
To solve this problem, I came up with two methods:
1. Control the request rate: add a sleep delay to each visit so that fewer requests are made per unit of time. This definitely works, but it is too inefficient (a minimal sketch of this approach follows the semaphore example below).
2. Limit the number of concurrent requests. Using a semaphore is generally the simplest and recommended way to do this, as follows:
```python
from aiohttp import ClientSession
from lxml import etree
from time import sleep
import time
import asyncio
import aiohttp

# url, headers and pic_list are not shown in the original snippet;
# the values below are placeholders -- substitute your own.
url = 'https://wallhaven.cc/search'        # assumed target URL
headers = {'User-Agent': 'Mozilla/5.0'}    # assumed request headers
pic_list = []

async def read_page_list(page_num, sem):
    params = {
        'page': page_num,
    }
    # Concurrency could also be controlled through the connection pool limit
    # (aiohttp's default is 100; 0 means unlimited).
    async with sem:
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url=url, params=params, headers=headers) as response:
                    text = await response.text()
        except Exception as e:
            print('exception:', e)
            return
        tree = etree.HTML(text)
        # The attribute in this XPath was lost in the original post; [@id="thumbs"] is a guess.
        page_list = tree.xpath('//*[@id="thumbs"]/section[1]/ul/li')
        for li in page_list:
            pic_small_url = li.xpath('.//img/@data-src')[0]
            if 'small' in pic_small_url:
                # Turn the thumbnail URL into the full-size wallpaper URL.
                temp_url = pic_small_url.replace('small', 'full')
                a = temp_url.rfind('/')
                temp_url1 = temp_url[:a]
                pic_full_url = temp_url1 + '/wallhaven-' + temp_url.split('/')[-1]
                pic_full_url = pic_full_url.replace('th', 'w')
                pic_list.append(pic_full_url)
            else:
                print(page_num, 'find small error', pic_small_url)
        print(page_num, len(page_list))
        # A hard delay could be added here to slow the program down further
        # and limit the number of requests per unit of time:
        # await asyncio.sleep(1)
        # sleep(0.5)

# Define the semaphore
sem = asyncio.Semaphore(2)
start = time.time()
# Create the task list
loop = asyncio.get_event_loop()  # not shown in the original snippet
tasks = [loop.create_task(read_page_list(i, sem)) for i in range(1, 20)]
loop.run_until_complete(asyncio.wait(tasks))
print('get page list use time:', time.time() - start)
```
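For comparison, method 1 from the list above boils down to something like the following sketch (the URL and the one-second delay are assumptions): requests go out one at a time with a hard delay in between, so the 429 goes away, but most of the runtime is spent sleeping.

```python
import asyncio
import aiohttp

async def read_pages_slowly():
    async with aiohttp.ClientSession() as session:
        for page in range(1, 20):
            # A hard delay before every request keeps the request rate low,
            # but the pages are effectively fetched one after another.
            await asyncio.sleep(1)
            async with session.get('https://example.com/search',
                                   params={'page': page}) as resp:
                text = await resp.text()
                print(page, resp.status, len(text))

asyncio.run(read_pages_slowly())
```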
Experimental results
The test results are as follows:
- When requesting 20 pages, the server stops returning 429 only with sem=1.
- When the number of requested pages is reduced to 10, sem=5 no longer triggers a 429.
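The comment inside the function above also mentions aiohttp's connection-pool limit as another way to cap concurrency. That variant was not part of the test here, but as a rough sketch it would look like this, with all tasks sharing one session whose connector allows only a couple of simultaneous connections (the limit value and URL are placeholders):

```python
import asyncio
import aiohttp

async def fetch(session, page):
    async with session.get('https://example.com/search', params={'page': page}) as resp:
        return await resp.text()

async def main():
    # limit caps simultaneous connections for this session
    # (aiohttp's default is 100; limit=0 means unlimited).
    connector = aiohttp.TCPConnector(limit=2)
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*[fetch(session, i) for i in range(1, 20)])

asyncio.run(main())
```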
Summary
The above is based on my personal experience. I hope it can serve as a useful reference, and I hope you will continue to support me.