Today we will write a very practical tool: a script that scans for and collects available proxies.
First of all, I found a website through Baidu to use as an example.
This website publishes the IPs and ports of many proxies available both inside and outside China.
Let's analyze it as usual, and start by scanning all the domestic proxies.
Clicking into the domestic section, we find that the domestic proxy listings live under URLs of the following form:
/nn/x
Here x runs to roughly two thousand pages, so it looks like we will have to work through the pagination ourselves. . .
As usual, we first try to fetch the page directly with the simplest requests.get().
It returns 503, so we add a simple headers dict.
Now it returns 200, and we are in.
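For reference, a minimal sketch of that first request; the header value and URL below are placeholders, not the ones from the original:

import requests

# a browser-like User-Agent is usually enough to get past the 503
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

r = requests.get(url='http://example.com/nn/1', headers=headers)  # placeholder URL
print(r.status_code)  # expect 200 instead of 503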
OK, let's analyze the page first and pull out the content we want.
We find that the IP information lives inside the <tr> tags, so we can easily grab them with bs (BeautifulSoup).
But we then notice that the IP, port, and protocol sit in the 2nd, 3rd, and 6th <td> tags of each extracted <tr>.
So let's start writing; the idea goes like this:
When processing the page, we first extract the <tr> tags, and then extract the <td> tags inside each <tr>.
Therefore bs is used twice, and the second time we have to pass str(i), because a tag object is not a markup string.
After we get each <tr>, we need its 2nd, 3rd, and 6th <td>,
but when we iterate over the results with a for loop we cannot index into them as a group,
so we simply run bs a second time on each row and pick out the 2nd, 3rd, and 6th <td> directly (indices 1, 2, and 5).
After extracting them, we append .string to get the text content.
r = requests.get(url=url, headers=headers)
soup = bs(r.text, "html.parser")
# each proxy entry sits in a <tr>; the class regex matches both row styles
data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
for i in data:
    # parse the row again so we can index its <td> tags
    soup = bs(str(i), 'html.parser')
    data2 = soup.find_all(name='td')
    ip = str(data2[1].string)
    port = str(data2[2].string)
    types = str(data2[5].string).lower()
    proxy = {}
    proxy[types] = '%s:%s' % (ip, port)
This way, every loop iteration builds the corresponding proxy dictionary, which we can then use to verify the IP's availability.
One thing to note about this dictionary: we convert types to lowercase, because the protocol key used in the proxies argument of the get method must be lowercase, while the page lists the protocol in uppercase, so a case conversion is needed.
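To illustrate, the resulting dictionary looks like this (the address below is made up):

# the key is the lowercased protocol, the value is 'ip:port'
proxy = {'http': '123.123.123.123:8080'}  # made-up address for illustration
# it is passed straight to requests, e.g.:
# r = requests.get(url, proxies=proxy, timeout=6)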
So how do we verify that an IP actually works?
It's very simple: we issue a get request through our proxy to the following website:
http://1212./
This is a handy website that returns your external (public) IP.
url = 'http://1212./'
r = requests.get(url=url, proxies=proxy, timeout=6)
Here we add a timeout to weed out proxies that keep us waiting too long; I set it to 6 seconds.
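When a proxy is dead or too slow, requests raises an exception once the timeout expires, which is why the check gets wrapped in try/except in the final script. A minimal sketch of that idea, with a hypothetical helper; the test URL is a placeholder:

import requests

def is_alive(proxy, test_url='http://example.com/'):  # placeholder test URL
    # returns True if the proxy answers within 6 seconds
    try:
        requests.get(test_url, proxies=proxy, timeout=6)
        return True
    except requests.exceptions.RequestException:
        # covers timeouts, connection errors, refused proxies, etc.
        return False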
We try it with one IP and analyze the page that comes back.
The returned content is as follows:
<html>
 <head>
  <meta xxxxxxxxxxxxxxxxxx>
  <title> Your IP address </title>
 </head>
 <body style="margin:0px"><center>Your IP address is: [] From: xxxxxxxx</center></body>
</html>
Then we only need to extract the content inside the [] from the page.
If our proxy works, the proxy's own IP will be returned.
(A different address may come back here; although I am not entirely sure why, ruling that case out simply means the proxy is treated as unavailable.)
Then we can make a judgment: if the returned IP matches the IP in the proxy dictionary, we consider this an available proxy and write it to the file.
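In isolation, that check boils down to the following (the HTML string is a shortened, made-up version of the page above):

import re

html = '<center>Your IP address is: [123.123.123.123] From: xxxxxxxx</center>'  # made-up sample
ip = '123.123.123.123'  # the IP we routed the request through

found = re.findall(r'\[(.*?)\]', html)
if found and found[0] == ip:
    print('proxy works, write it down')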
That's the whole idea. Finally, we wrap the work in a Queue and threading.Thread workers to speed things up.
Here is the code:
#coding=utf-8
import requests
import re
from bs4 import BeautifulSoup as bs
import Queue
import threading

class proxyPick(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while not self._queue.empty():
            url = self._queue.get()
            proxy_spider(url)

def proxy_spider(url):
    headers = {
        .......  # elided in the original; a browser User-Agent goes here
    }
    r = requests.get(url=url, headers=headers)
    soup = bs(r.text, "html.parser")
    data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
    for i in data:
        soup = bs(str(i), 'html.parser')
        data2 = soup.find_all(name='td')
        ip = str(data2[1].string)
        port = str(data2[2].string)
        types = str(data2[5].string).lower()
        proxy = {}
        proxy[types] = '%s:%s' % (ip, port)
        try:
            proxy_check(proxy, ip)
        except Exception, e:
            print e
            pass

def proxy_check(proxy, ip):
    url = 'http://1212./'
    r = requests.get(url=url, proxies=proxy, timeout=6)
    f = open('E:/url/ip_proxy.txt', 'a+')
    soup = bs(r.text, 'html.parser')
    data = soup.find_all(name='center')
    for i in data:
        a = re.findall(r'\[(.*?)\]', i.string)
        if a[0] == ip:
            #print proxy
            f.write('%s' % proxy + '\n')
            print 'write down'
    f.close()

#proxy_spider()

def main():
    queue = Queue.Queue()
    for i in range(1, 2288):
        queue.put('/nn/' + str(i))  # site domain omitted in the original

    threads = []
    thread_count = 10

    for i in range(thread_count):
        spider = proxyPick(queue)
        threads.append(spider)

    for i in threads:
        i.start()
    for i in threads:
        i.join()

    print "It's down,sir!"

if __name__ == '__main__':
    main()
In this way, all the available proxy IPs offered by the website get written to the file ip_proxy.txt.
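Since each line of ip_proxy.txt is just the str() of a proxy dictionary, the file can be read back later with ast.literal_eval, for example (a sketch assuming the format produced above):

import ast

proxies = []
with open('E:/url/ip_proxy.txt') as f:
    for line in f:
        # each line looks like {'http': '1.2.3.4:80'}
        proxies.append(ast.literal_eval(line.strip()))

print('%d proxies loaded' % len(proxies))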
This example of scanning for proxies and collecting available proxy IPs in Python is everything I have to share with you. I hope it gives you a useful reference, and I hope you will continue to support me.