Development environment (computer):
Python 3.8
PyCharm
Module Usage
requests >>> pip install requests
parsel >>> pip install parsel
How to install Python third-party modules:
Press Win + R, type cmd and click OK, then enter the install command pip install module-name (e.g. pip install requests) and press Enter.
Or, in PyCharm, click Terminal at the bottom and enter the same install command.
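Either way, you can confirm from Python itself that the modules installed correctly; a minimal check, assuming the two modules above:

import requests
import parsel

print(requests.__version__)  # prints the installed requests version
print(parsel.__version__)    # prints the installed parsel version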
How to configure the Python interpreter inside PyCharm:
Select File >>> Settings >>> Project >>> Python Interpreter
Click the gear icon and select Add
Add the path to your Python installation
How to install plugins in pycharm
Select file >>> setting >>> Plugins
Click Marketplace and enter the name of the plugin you want to install, e.g.: translation plugin, enter translation / Chinese plugin, enter Chinese.
Select the appropriate plug-in and click install.
After the installation is successful, there is an option to restart pycharm. Click OK, and the restart will take effect.
Proxy IP structure:

proxies_dict = {
    "http": "http://" + ip + ":" + port,
    "https": "http://" + ip + ":" + port,
}
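Once built, this dictionary plugs straight into requests via the proxies parameter. A minimal sketch: the IP and port are placeholder values taken from the example at the end of this article, and httpbin.org is used here only as a convenient echo service:

import requests

ip = '110.189.152.86'   # placeholder proxy address (from the example later in the article)
port = '40698'
proxies_dict = {
    "http": "http://" + ip + ":" + port,
    "https": "http://" + ip + ":" + port,
}

# Route the request through the proxy; a dead proxy raises an exception instead.
response = requests.get('http://httpbin.org/ip', proxies=proxies_dict, timeout=3)
print(response.text)  # should report the proxy's IP rather than your own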
Reasoning
I. Analysis of data sources
Work out what we want and where to get it from.
II. Code Implementation Steps (sketched in code right after this list)
Send a request: send a request to the target URL.
Get the data: get the server's response data (the web page source code).
Parse the data: extract the content we want.
Save the data: for music or video crawlers this means saving files locally or to a CSV file or database... Here it means IP detection: test whether each proxy IP is usable, and save the usable ones.
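Mapped onto code, the four steps look roughly like this. A minimal sketch, where the URL is a placeholder and check_proxy is a hypothetical helper, not the article's final code:

import requests
import parsel

def check_proxy(proxies_dict, test_url='http://httpbin.org/ip', timeout=1):
    # Step 4 for this project: a proxy counts as usable if a request through it succeeds.
    try:
        response = requests.get(test_url, proxies=proxies_dict, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False

url = 'http://example.com/free-proxies'    # placeholder target URL
response = requests.get(url)               # 1. send a request
html_data = response.text                  # 2. get the page source code
selector = parsel.Selector(html_data)      # 3. parse out the content we want
# 4. save the data: test each scraped proxy with check_proxy() and keep the good ones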
- import xxx # import a whole module
- from xxx import yyy # from which module, import which method
- from xxx import * # import all methods of a module
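For example, the three forms with the modules used in this article (a minimal sketch):

# Import the whole module and call methods through its name
import requests
response = requests.get('http://httpbin.org/get')

# Import one specific name from a module and use it directly
from parsel import Selector
selector = Selector(text=response.text)

# Import every public name from a module (convenient, but can shadow other names)
from re import *
print(findall(r'\d+', 'port 8080'))  # findall() usable without the re. prefix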
Coding
# Import data request module
import requests  # data request module, third-party: pip install requests
# Import the regular expression module
import re  # built-in module
# Import the data parsing module
import parsel  # data parsing module, third-party: pip install parsel >>> a core component of the scrapy framework

lis = []    # all proxy IPs scraped from the pages
lis_1 = []  # proxy IPs that pass the availability check

# 1. Send a request to the target URL (the free-proxy listing pages under /free/)
for page in range(11, 21):
    url = f'/free/inha/{page}/'  # the request url address (the site's domain is elided in the source)
    """
    headers: request headers that disguise the Python code as a normal browser.
    """
    headers = {
        # assumption: a typical User-Agent; the original headers dict is not shown in the source
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }
    # Use the get method of the requests module to send a request to the url address,
    # and receive the returned data in the response variable.
    response = requests.get(url=url, headers=headers)  # <Response [200]>: a 200 status code means the request succeeded

    # 2. Get data: the server's response data (web page source code), i.e. the response body text
    html_data = response.text
    # print(html_data)

    # 3. Parse the data, extract the content we want.
    """
    Parsing methods:
        regex: extracts content from the string data directly
        xpath: extracts data content by tag nodes (the html string must be converted first)
        css selector: extracts data content by tag attributes (the html string must be converted first)
    Use whichever is most convenient for the page at hand.
    """
    # Regular expressions to extract the data content:
    """
    # re.findall() calls the findall method inside the re module
    # .*? matches any character (except the newline character \n), as few times as possible
    ip_list = re.findall('<td data-title="IP">(.*?)</td>', html_data)
    port_list = re.findall('<td data-title="PORT">(.*?)</td>', html_data)
    print(ip_list)
    print(port_list)
    """
    # css selector:
    """
    # css selectors need the html string data converted to a Selector object first
    # If you don't know css or xpath, copy the path from the browser's developer tools:
    #   #list > table > tbody > tr > td:nth-child(1)
    #   //*[@id="list"]/table/tbody/tr/td[1]
    selector = parsel.Selector(html_data)  # convert the html string data to a Selector object
    ip_list = selector.css('#list tbody tr td:nth-child(1)::text').getall()
    port_list = selector.css('#list tbody tr td:nth-child(2)::text').getall()
    print(ip_list)
    print(port_list)
    """
    # xpath to extract the data
    selector = parsel.Selector(html_data)  # convert the html string data to a Selector object
    ip_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[1]/text()').getall()
    port_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[2]/text()').getall()
    # print(ip_list)
    # print(port_list)

    for ip, port in zip(ip_list, port_list):
        # print(ip, port)
        proxy = ip + ':' + port
        proxies_dict = {
            "http": "http://" + proxy,
            "https": "http://" + proxy,
        }
        # print(proxies_dict)
        lis.append(proxies_dict)

        # 4. Detect IP quality: request the page again through the proxy and see whether it succeeds
        try:
            response = requests.get(url=url, proxies=proxies_dict, timeout=1)
            if response.status_code == 200:
                print('Current proxy IP: ', proxies_dict, 'can be used')
                lis_1.append(proxies_dict)
        except Exception:
            print('Current proxy IP: ', proxies_dict, 'request timed out, check failed')

print('Number of proxy IPs acquired: ', len(lis))
print('Number of available IP proxies: ', len(lis_1))
print('Available IP proxies: ', lis_1)

# An example of one usable proxy, as returned by the check above:
dit = {
    'http': 'http://110.189.152.86:40698',
    'https': 'http://110.189.152.86:40698'
}
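With the pool built, a crawler can draw a random proxy from lis_1 for each request. A minimal usage sketch, with httpbin.org as a placeholder test endpoint:

import random
import requests

# lis_1 holds the proxy dictionaries that passed the check above
proxies_dict = random.choice(lis_1)
response = requests.get('http://httpbin.org/ip', proxies=proxies_dict, timeout=3)
print(response.text)  # the reported address should be the proxy's, not yours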
This concludes this article on building your own IP proxy pool in Python. For more on creating IP pools in Python, please search my earlier articles or continue browsing the related articles below. I hope you will support me in the future!