This article describes how to obtain proxies from an HTTP proxy website and filter them in Python 3.4. The details are as follows:
Recently, while writing crawlers, I tried to get by without a proxy, but my own IP was blocked within a few minutes, so I had no choice but to look for one. I assumed that any HTTP proxy website would do, but most of the proxies it returned turned out to be unusable, and only a few actually worked. The only option left was to pull a large number of proxy IPs from such websites, screen them, and keep the valid ones for later use.
The filtering principle is simple. The main program fetches the unfiltered proxy list, rawProxyList, and then tries to connect to a target website through each proxy (in this article, the Sina mobile site). If the connection succeeds within the specified time and the returned page contains the expected text, the proxy is considered valid and is added to checkedProxyList together with its response time.
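__author__ = 'multiangle'
__edition__ = 'python3.4'

import threading
import time
import urllib.request as request

rawProxyList = []      # unfiltered proxies, one 'host:port' string each
checkedProxyList = []  # (proxy, response time in seconds) for proxies that passed


class proxycheck(threading.Thread):
    def __init__(self, proxy_list):
        threading.Thread.__init__(self)
        self.proxy_list = proxy_list
        self.timeout = 3  # seconds allowed per connection attempt
        self.testurl = '/'  # prefix with the scheme and host of the test site (Sina mobile)
        self.testStr = 'Sina Mobile'  # text expected in the test page; must match its actual content

    def checkproxy(self):
        cookies = request.HTTPCookieProcessor()
        for proxy in self.proxy_list:
            # Route HTTP requests through the proxy under test.
            handler = request.ProxyHandler({'http': 'http://%s' % (proxy)})
            opener = request.build_opener(cookies, handler)
            t1 = time.time()
            try:
                req = opener.open(self.testurl, timeout=self.timeout)
                res = req.read()
                res = str(res, encoding='utf8')
                usetime = time.time() - t1
                if self.testStr in res:
                    checkedProxyList.append((proxy, usetime))
            except Exception as e:
                # Timeouts, refused connections and bad responses all mean "not usable".
                print(e)

    def run(self):
        self.checkproxy()


if __name__ == '__main__':
    thread_num = 10
    checkThrends = []
    url = 'YOUR PROXY URL'  # The website to extract proxies from.
    req = request.urlopen(url).read()
    req = str(req, encoding='utf-8')
    # The website returns a string; split it on '\r\n' and drop empty entries.
    rawProxyList = [p for p in req.split('\r\n') if p]
    print('get raw proxy')
    for i in rawProxyList:
        print(i)
    # Ceiling division, so that every proxy lands in some batch.
    batch_size = (len(rawProxyList) + thread_num - 1) // thread_num
    print(batch_size)
    for i in range(thread_num):
        t = proxycheck(rawProxyList[batch_size * i:batch_size * (i + 1)])
        checkThrends.append(t)
    for t in checkThrends:
        t.start()
    for t in checkThrends:
        t.join()
    print(len(checkedProxyList), 'useful proxies found')
    for i in checkedProxyList:
        print(i)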
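The main block above divides rawProxyList into thread_num slices of batch_size items each, where batch_size is the list length divided by thread_num and rounded up. The following is a minimal sketch of that partitioning step on its own, so it can be tried without any network access. The helper name split_into_batches and the sample addresses are hypothetical and not part of the original code. Unlike the listing, this version skips slices that would be empty.

```python
def split_into_batches(items, thread_num):
    """Split items into at most thread_num consecutive batches."""
    if thread_num < 1:
        raise ValueError("thread_num must be at least 1")
    # Ceiling division: the smallest batch size that still covers every item.
    batch_size = (len(items) + thread_num - 1) // thread_num
    if batch_size == 0:
        return []  # nothing to check
    # Step through the list one batch at a time; no empty batch is produced.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    # Hypothetical proxy addresses, for illustration only.
    sample = ["10.0.0.%d:8080" % n for n in range(1, 26)]  # 25 proxies
    batches = split_into_batches(sample, 10)
    print(len(batches), [len(b) for b in batches])
```

With 25 proxies and 10 threads the batch size is 3, so only nine batches are needed. The original loop would still start a tenth thread with an empty slice, which is harmless but unnecessary.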
I hope this article helps with your Python programming.