
Asynchronous crawler in Python explained in detail

Python asynchronous crawler

Basic concept

Purpose: use asynchrony in a crawler to achieve high-performance data crawling.

Asynchronous crawler approaches

  • Multi-threading, multi-processing (not recommended); a minimal sketch follows after this list:
    • Benefit: a separate thread or process can be opened for each blocking operation, so blocking operations are executed asynchronously.
    • Drawback: threads and processes cannot be opened without limit.
  • Thread pools, process pools (use where appropriate):
    • Benefit: they reduce how often threads or processes are created and destroyed, which lowers system overhead.
    • Drawback: the number of threads or processes in the pool is capped.
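
To make the first bullet concrete, here is a minimal sketch of the one-thread-per-task approach, using only the standard library threading module; get_page and the name list are the same placeholders used in the thread-pool example below.

import time
from threading import Thread

def get_page(name):
    print('Downloading:', name)
    time.sleep(2)  # simulate a blocking (IO-bound) operation
    print('Download complete:', name)

if __name__ == '__main__':
    start_time = time.time()
    name_list = ['haha', 'lala', 'duoduo', 'anan']
    # one thread per blocking operation: fine for 4 tasks, but it does not scale without limit
    threads = [Thread(target=get_page, args=(name,)) for name in name_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(time.time() - start_time)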

Basic Use of Thread Pools

# import time
# # Single-threaded serial execution
# start_time = time.time()
# def get_page(name):
#     print('Downloading:', name)
#     time.sleep(2)
#     print('Download complete:', name)
#
# name_list = ['haha', 'lala', 'duoduo', 'anan']
#
# for i in range(len(name_list)):
#     get_page(name_list[i])
#
# end_time = time.time()
# print(end_time - start_time)
import time
from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool

# Thread pool execution
start_time = time.time()
def get_page(name):
    print('Downloading:', name)
    time.sleep(2)  # simulate a blocking (IO-bound) operation
    print('Download complete:', name)
name_list = ['haha', 'lala', 'duoduo', 'anan']
pool = Pool(4)  # a pool of 4 worker threads
pool.map(get_page, name_list)  # run get_page on every name concurrently
end_time = time.time()
print(end_time - start_time)
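
With four names and a 2-second simulated download each, the serial version takes a little over 8 seconds, while the pool of 4 worker threads finishes in roughly 2 seconds; that difference is what the screenshots below illustrate.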

Result

Single-threaded serial approach

(screenshot of the elapsed time)

Thread pool

(screenshot of the elapsed time)

Crawl URL: https://www.pearvideo.com/category_6

Code

import requests, random
from lxml import etree
from multiprocessing.dummy import Pool  # thread pool
urls = []  # list of dicts, each holding a video's name and its real address
# Get the video's fake address from the detail page's ajax interface
def get_videoadd(detail_url, video_id):
    # ajax endpoint of the detail page (assumed: pearvideo's videoStatus.jsp)
    ajks_url = 'https://www.pearvideo.com/videoStatus.jsp'
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Referer': detail_url  # the interface rejects requests without the detail page as Referer
    }
    params = {
        'contId': video_id,
        'mrd': str(random.random())
    }
    video_json = requests.get(url=ajks_url, headers=header, params=params).json()
    return video_json['videoInfo']['videos']['srcUrl']
# Fetch the video data and persist it to disk
def get_videoData(dic):
    right_url = dic['url']
    print(dic['name'], 'start!')
    video_data = requests.get(url=right_url, headers=headers).content
    with open(dic['name'], 'wb') as fp:
        fp.write(video_data)
    print(dic['name'], 'over!')
if __name__ == '__main__':
    url = 'https://www.pearvideo.com/category_6'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # video list container on the category page (id assumed to be 'listvideoListUl')
    li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
    for li in li_list:
        detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
        name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
        # Parse the video id, e.g. video_1724179 -> 1724179
        video_id = detail_url.split('/')[-1].split('_')[-1]
        false_url = get_videoadd(detail_url, video_id)
        # first '-'-separated chunk of the file name, i.e. the fake timestamp
        temp = false_url.split('/')[-1].split('-')[0]
        # splice together the correct url
        right_url = false_url.replace(temp, 'cont-' + str(video_id))
        dic = {
            'name': name,
            'url': right_url
        }
        urls.append(dic)
    # Use the thread pool to download the videos concurrently
    pool = Pool(4)
    pool.map(get_videoData, urls)
    # stop accepting new tasks, then wait for the worker threads to finish
    pool.close()
    pool.join()
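
Note that Pool here comes from multiprocessing.dummy, which exposes the multiprocessing.Pool API but is backed by threads; that suits this IO-bound download job. pool.map blocks until every video has been written, and close() followed by join() then shuts the pool down cleanly.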

Result

(screenshot of the download output)

Reasoning

1. An ajax request is found on the detail page (screenshot)

2. However, the srcUrl it returns is a false address. Example of a false address:

/mp4/adshort/20210323/1616511268090-15637590_adpkg-ad_hd.mp4

3. The true address:

/mp4/adshort/20210323/cont-1724179-15637590_adpkg-ad_hd.mp4

After comparing the two (screenshot), replacing the circled number with cont-<video_id> gives the true address.
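
As a worked example of that replacement, here is a minimal sketch using the two domain-stripped addresses shown above:

false_url = '/mp4/adshort/20210323/1616511268090-15637590_adpkg-ad_hd.mp4'
video_id = '1724179'
temp = false_url.split('/')[-1].split('-')[0]            # '1616511268090', the fake timestamp
right_url = false_url.replace(temp, 'cont-' + video_id)  # swap it for cont-<video_id>
print(right_url)  # /mp4/adshort/20210323/cont-1724179-15637590_adpkg-ad_hd.mp4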

This concludes the detailed introduction to asynchronous crawlers in Python. For more on Python asynchronous crawlers, please search my earlier articles or continue browsing the related articles below. I hope you will continue to support me!