Today we will write a very practical tool: a script that scans for and collects available proxies.
First of all, I found a website through Baidu to use as an example.
This website publishes the IPs and ports of many proxies available both inside and outside China.
Let's analyze it as usual, and start by scanning all the domestic proxies.
Clicking into the domestic section, we find that the domestic proxy listings live under URLs of the following form:
/nn/x
Here x runs to roughly two thousand pages, so it looks like we will have to work through the pagination ourselves. . .
As usual, we first try to fetch the page directly with the simplest requests.get().
It returns 503, so we add a simple headers dict.
Now it returns 200, and we are in.
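For reference, a minimal sketch of that first request; the header value and URL below are placeholders, not the ones from the original:

import requests

# a browser-like User-Agent is usually enough to get past the 503
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

r = requests.get(url='http://example.com/nn/1', headers=headers)  # placeholder URL
print(r.status_code)  # expect 200 instead of 503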
OK, let's analyze the page first and pull out the content we want.
We find that the IP information lives inside the <tr> tags, so we can easily grab them with bs (BeautifulSoup).
But we then notice that the IP, port, and protocol sit in the 2nd, 3rd, and 6th <td> tags of each extracted <tr>.
So let's start writing; the idea goes like this:
When processing the page, we first extract the <tr> tags, and then extract the <td> tags inside each <tr>.
Therefore bs is used twice, and the second time we have to pass str(i), because a tag object is not a markup string.
After we get each <tr>, we need its 2nd, 3rd, and 6th <td>,
but when we iterate over the results with a for loop we cannot index into them as a group,
so we simply run bs a second time on each row and pick out the 2nd, 3rd, and 6th <td> directly (indices 1, 2, and 5).
After extracting them, we append .string to get the text content.
r = requests.get(url=url, headers=headers)
soup = bs(r.text, "html.parser")
# each proxy entry sits in a <tr>; the class regex matches both row styles
data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
for i in data:
    # parse the row again so we can index its <td> tags
    soup = bs(str(i), 'html.parser')
    data2 = soup.find_all(name='td')
    ip = str(data2[1].string)
    port = str(data2[2].string)
    types = str(data2[5].string).lower()
    proxy = {}
    proxy[types] = '%s:%s' % (ip, port)
This way, every loop iteration builds the corresponding proxy dictionary, which we can then use to verify the IP's availability.
One thing to note about this dictionary: we convert types to lowercase, because the protocol key used in the proxies argument of the get method must be lowercase, while the page lists the protocol in uppercase, so a case conversion is needed.
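To illustrate, the resulting dictionary looks like this (the address below is made up):

# the key is the lowercased protocol, the value is 'ip:port'
proxy = {'http': '123.123.123.123:8080'}  # made-up address for illustration
# it is passed straight to requests, e.g.:
# r = requests.get(url, proxies=proxy, timeout=6)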
So how do we verify that an IP actually works?
It's very simple: we issue a get request through our proxy to the following website:
http://1212./
This is a handy website that returns your external (public) IP.
url = 'http://1212./'
r = requests.get(url=url, proxies=proxy, timeout=6)
Here we add a timeout to weed out proxies that keep us waiting too long; I set it to 6 seconds.
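When a proxy is dead or too slow, requests raises an exception once the timeout expires, which is why the check gets wrapped in try/except in the final script. A minimal sketch of that idea, with a hypothetical helper; the test URL is a placeholder:

import requests

def is_alive(proxy, test_url='http://example.com/'):  # placeholder test URL
    # returns True if the proxy answers within 6 seconds
    try:
        requests.get(test_url, proxies=proxy, timeout=6)
        return True
    except requests.exceptions.RequestException:
        # covers timeouts, connection errors, refused proxies, etc.
        return False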
We try it with one IP and analyze the page that comes back.
The returned content is as follows:
<html>
 <head>
  <meta xxxxxxxxxxxxxxxxxx>
  <title> Your IP address </title>
 </head>
 <body style="margin:0px"><center>Your IP address is: [] From: xxxxxxxx</center></body>
</html>
Then we only need to extract the content inside the [] from the page.
If our proxy works, the proxy's own IP will be returned.
(A different address may come back here; although I am not entirely sure why, ruling that case out simply means the proxy is treated as unavailable.)
Then we can make a judgment: if the returned IP matches the IP in the proxy dictionary, we consider this an available proxy and write it to the file.
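In isolation, that check boils down to the following (the HTML string is a shortened, made-up version of the page above):

import re

html = '<center>Your IP address is: [123.123.123.123] From: xxxxxxxx</center>'  # made-up sample
ip = '123.123.123.123'  # the IP we routed the request through

found = re.findall(r'\[(.*?)\]', html)
if found and found[0] == ip:
    print('proxy works, write it down')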
That's the whole idea. Finally, we wrap the work in a Queue and threading.Thread workers to speed things up.
Here is the code:
#coding=utf-8
import requests
import re
from bs4 import BeautifulSoup as bs
import Queue
import threading

class proxyPick(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while not self._queue.empty():
            url = self._queue.get()
            proxy_spider(url)

def proxy_spider(url):
    headers = {
        .......  # elided in the original; a browser User-Agent goes here
    }
    r = requests.get(url=url, headers=headers)
    soup = bs(r.text, "html.parser")
    data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
    for i in data:
        soup = bs(str(i), 'html.parser')
        data2 = soup.find_all(name='td')
        ip = str(data2[1].string)
        port = str(data2[2].string)
        types = str(data2[5].string).lower()
        proxy = {}
        proxy[types] = '%s:%s' % (ip, port)
        try:
            proxy_check(proxy, ip)
        except Exception, e:
            print e
            pass

def proxy_check(proxy, ip):
    url = 'http://1212./'
    r = requests.get(url=url, proxies=proxy, timeout=6)
    f = open('E:/url/ip_proxy.txt', 'a+')
    soup = bs(r.text, 'html.parser')
    data = soup.find_all(name='center')
    for i in data:
        a = re.findall(r'\[(.*?)\]', i.string)
        if a[0] == ip:
            #print proxy
            f.write('%s' % proxy + '\n')
            print 'write down'
    f.close()

#proxy_spider()

def main():
    queue = Queue.Queue()
    for i in range(1, 2288):
        queue.put('/nn/' + str(i))  # site domain omitted in the original

    threads = []
    thread_count = 10

    for i in range(thread_count):
        spider = proxyPick(queue)
        threads.append(spider)

    for i in threads:
        i.start()
    for i in threads:
        i.join()

    print "It's down,sir!"

if __name__ == '__main__':
    main()
In this way, all the available proxy IPs offered by the website get written to the file ip_proxy.txt.
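Since each line of ip_proxy.txt is just the str() of a proxy dictionary, the file can be read back later with ast.literal_eval, for example (a sketch assuming the format produced above):

import ast

proxies = []
with open('E:/url/ip_proxy.txt') as f:
    for line in f:
        # each line looks like {'http': '1.2.3.4:80'}
        proxies.append(ast.literal_eval(line.strip()))

print('%d proxies loaded' % len(proxies))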
This example of scanning for proxies and collecting available proxy IPs in Python is everything I have to share with you. I hope it gives you a useful reference, and I hope you will continue to support me.