Preface
Using aiohttp to fetch multiple pages concurrently is obviously much faster than making the requests serially. However, there is a problem: when the website detects too many requests in a short period of time, the requests start to fail and the server returns HTTP 429 (Too Many Requests).
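For illustration, this is roughly what the failing pattern looks like when every request is fired at once with no throttling (the URL and page range below are made-up placeholders, not the site used in the example later on): the status printed for the later requests flips from 200 to 429.

```python
import asyncio
import aiohttp

async def fetch(session, page):
    async with session.get('https://example.com/search', params={'page': page}) as resp:
        print(page, resp.status)  # 429 shows up here once the server starts throttling

async def main():
    async with aiohttp.ClientSession() as session:
        # Fire all requests at once, with no limit on concurrency.
        await asyncio.gather(*[fetch(session, i) for i in range(1, 21)])

asyncio.run(main())
```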
To solve this problem, I came up with two methods:
1. Control the request rate: add a sleep delay to each visit so that fewer requests are made per unit of time. This definitely works, but it is too inefficient (a minimal sketch of this approach follows the semaphore example below).
2. Limit the number of concurrent requests. Using a semaphore is generally the simplest and recommended way to do this, as follows:
```python
from aiohttp import ClientSession
from lxml import etree
from time import sleep
import time
import asyncio
import aiohttp

# url, headers and pic_list are not shown in the original snippet;
# the values below are placeholders -- substitute your own.
url = 'https://wallhaven.cc/search'        # assumed target URL
headers = {'User-Agent': 'Mozilla/5.0'}    # assumed request headers
pic_list = []

async def read_page_list(page_num, sem):
    params = {
        'page': page_num,
    }
    # Concurrency could also be controlled through the connection pool limit
    # (aiohttp's default is 100; 0 means unlimited).
    async with sem:
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url=url, params=params, headers=headers) as response:
                    text = await response.text()
        except Exception as e:
            print('exception:', e)
            return
        tree = etree.HTML(text)
        # The attribute in this XPath was lost in the original post; [@id="thumbs"] is a guess.
        page_list = tree.xpath('//*[@id="thumbs"]/section[1]/ul/li')
        for li in page_list:
            pic_small_url = li.xpath('.//img/@data-src')[0]
            if 'small' in pic_small_url:
                # Turn the thumbnail URL into the full-size wallpaper URL.
                temp_url = pic_small_url.replace('small', 'full')
                a = temp_url.rfind('/')
                temp_url1 = temp_url[:a]
                pic_full_url = temp_url1 + '/wallhaven-' + temp_url.split('/')[-1]
                pic_full_url = pic_full_url.replace('th', 'w')
                pic_list.append(pic_full_url)
            else:
                print(page_num, 'find small error', pic_small_url)
        print(page_num, len(page_list))
        # A hard delay could be added here to slow the program down further
        # and limit the number of requests per unit of time:
        # await asyncio.sleep(1)
        # sleep(0.5)

# Define the semaphore
sem = asyncio.Semaphore(2)
start = time.time()
# Create the task list
loop = asyncio.get_event_loop()  # not shown in the original snippet
tasks = [loop.create_task(read_page_list(i, sem)) for i in range(1, 20)]
loop.run_until_complete(asyncio.wait(tasks))
print('get page list use time:', time.time() - start)
```
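For comparison, method 1 from the list above boils down to something like the following sketch (the URL and the one-second delay are assumptions): requests go out one at a time with a hard delay in between, so the 429 goes away, but most of the runtime is spent sleeping.

```python
import asyncio
import aiohttp

async def read_pages_slowly():
    async with aiohttp.ClientSession() as session:
        for page in range(1, 20):
            # A hard delay before every request keeps the request rate low,
            # but the pages are effectively fetched one after another.
            await asyncio.sleep(1)
            async with session.get('https://example.com/search',
                                   params={'page': page}) as resp:
                text = await resp.text()
                print(page, resp.status, len(text))

asyncio.run(read_pages_slowly())
```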
Experimental results
The test results are as follows:
- When requesting 20 pages, the server stops returning 429 only with sem=1.
- When the number of requested pages is reduced to 10, sem=5 no longer triggers a 429.
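The comment inside the function above also mentions aiohttp's connection-pool limit as another way to cap concurrency. That variant was not part of the test here, but as a rough sketch it would look like this, with all tasks sharing one session whose connector allows only a couple of simultaneous connections (the limit value and URL are placeholders):

```python
import asyncio
import aiohttp

async def fetch(session, page):
    async with session.get('https://example.com/search', params={'page': page}) as resp:
        return await resp.text()

async def main():
    # limit caps simultaneous connections for this session
    # (aiohttp's default is 100; limit=0 means unlimited).
    connector = aiohttp.TCPConnector(limit=2)
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*[fetch(session, i) for i in range(1, 20)])

asyncio.run(main())
```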
Summary
The above is based on my personal experience. I hope it can serve as a useful reference, and I hope you will continue to support me.