Python asynchronous crawler
Basic concept
Purpose: to achieve high-performance data crawling by introducing asynchrony into the crawler.
Asynchronous crawler approaches:
- Multi-threading, multi-processing (not recommended):
- Benefit: a separate thread or process can be opened for each blocking operation, so blocking operations execute asynchronously (see the sketch after this list).
- Disadvantage: threads and processes cannot be spawned without limit; every new one adds scheduling and memory overhead.
- Thread pools, process pools (use where appropriate):
- Benefit: they reduce the frequency of thread/process creation and destruction, and thus the system overhead.
- Disadvantage: the number of threads or processes in the pool is capped.
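As a point of comparison, here is a minimal sketch of the one-thread-per-task approach described above, using the standard threading module and the same simulated download; this sketch is my illustration and not part of the original article:

import threading
import time

def get_page(name):
    print('Downloading:', name)
    time.sleep(2)  # simulate a blocking download
    print('Download complete:', name)

names = ['haha', 'lala', 'duoduo', 'anan']
threads = [threading.Thread(target=get_page, args=(n,)) for n in names]
for t in threads:
    t.start()  # one thread per task: fine for 4 tasks, ruinous for 10,000
for t in threads:
    t.join()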
Basic Use of Thread Pools
# Single-threaded serial execution, kept for timing comparison:
# import time
# start_time = time.time()
# def get_page(name):
#     print('Downloading:', name)
#     time.sleep(2)
#     print('Download complete:', name)
# name_list = ['haha', 'lala', 'duoduo', 'anan']
# for i in range(len(name_list)):
#     get_page(name_list[i])
# end_time = time.time()
# print(end_time - start_time)

# Thread-pool version
import time
from multiprocessing.dummy import Pool  # multiprocessing.dummy provides a thread pool

start_time = time.time()

def get_page(name):
    print('Downloading:', name)
    time.sleep(2)  # simulate a blocking download
    print('Download complete:', name)

name_list = ['haha', 'lala', 'duoduo', 'anan']

pool = Pool(4)                 # pool of 4 worker threads
pool.map(get_page, name_list)  # run get_page over the list concurrently
end_time = time.time()
print(end_time - start_time)
Result screenshots (omitted): the single-threaded serial run takes about 8 seconds (4 downloads × 2 s each), while the 4-thread pool finishes in about 2 seconds.
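For reference, the same pattern can also be written with the standard library's concurrent.futures module; this variant is my addition, not part of the original article:

from concurrent.futures import ThreadPoolExecutor
import time

def get_page(name):
    print('Downloading:', name)
    time.sleep(2)
    print('Download complete:', name)

name_list = ['haha', 'lala', 'duoduo', 'anan']
start_time = time.time()
# the with-block waits for all submitted tasks before exiting
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(get_page, name_list)
print(time.time() - start_time)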
Crawl URL: https://www.pearvideo.com/category_6
Coding
import random
import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread pool

urls = []  # list of dicts holding each video's address and name

# Get the fake video address from the status interface
def get_videoadd(detail_url, video_id):
    # The interface URL was stripped in extraction; pearvideo's video-status
    # interface is videoStatus.jsp, which matches the contId/mrd params below
    ajks_url = 'https://www.pearvideo.com/videoStatus.jsp'
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Referer': detail_url  # the interface rejects requests that do not carry the detail page as Referer
    }
    params = {
        'contId': video_id,
        'mrd': str(random.random())
    }
    video_json = requests.get(url=ajks_url, headers=header, params=params).json()
    return video_json['videoInfo']['videos']['srcUrl']

# Fetch the video data and persist it to disk
def get_videoData(dic):
    right_url = dic['url']
    print(dic['name'], 'start!')
    video_data = requests.get(url=right_url, headers=headers).content
    with open(dic['name'], 'wb') as fp:
        fp.write(video_data)
    print(dic['name'], 'over!')

if __name__ == '__main__':
    url = 'https://www.pearvideo.com/category_6'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # The id value in this XPath was stripped in extraction; "listvideoListUl"
    # is the video-list element commonly targeted on pearvideo category pages
    li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
    for li in li_list:
        detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
        name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
        # Parse the video id out of the detail URL (video_1724179 -> 1724179)
        video_id = detail_url.split('/')[-1].split('_')[-1]
        false_url = get_videoadd(detail_url, video_id)
        temp = false_url.split('/')[-1].split('-')[0]
        # Splice together the correct URL (see the reasoning section below)
        right_url = false_url.replace(temp, 'cont-' + str(video_id))
        dic = {
            'name': name,
            'url': right_url
        }
        urls.append(dic)
    # Use the thread pool to download the videos concurrently
    pool = Pool(4)
    pool.map(get_videoData, urls)
    # Stop accepting new tasks, then wait for the worker threads to finish
    pool.close()
    pool.join()
Result screenshot (omitted): the worker threads print start!/over! messages as the .mp4 files are downloaded.
Reasoning
1. In the browser's network panel, the video detail page is found to issue an ajax request that returns the video info.
2. However, the srcUrl it returns is a fake address. Example of a fake address:
/mp4/adshort/20210323/1616511268090-15637590_adpkg-ad_hd.mp4
3. The true address of the same video:
/mp4/adshort/20210323/cont-1724179-15637590_adpkg-ad_hd.mp4
Comparing the two, the file names differ only in their leading segment: replacing the leading number (1616511268090, a timestamp) with cont-<video_id> yields the true address.
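The replacement step from the script, isolated so it can be checked against the two addresses above:

video_id = '1724179'
false_url = '/mp4/adshort/20210323/1616511268090-15637590_adpkg-ad_hd.mp4'

# the segment before the first '-' in the file name is the fake part
temp = false_url.split('/')[-1].split('-')[0]   # '1616511268090'
right_url = false_url.replace(temp, 'cont-' + video_id)
print(right_url)  # /mp4/adshort/20210323/cont-1724179-15637590_adpkg-ad_hd.mp4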
That concludes this detailed article on Python asynchronous crawlers. For more related content, please search my earlier articles or continue to browse the related articles below. I hope you will continue to support me!