General-Purpose Crawlers and Focused Crawlers
Depending on the usage scenario, web crawlers can be categorized into general-purpose crawlers and focused crawlers.
General-Purpose Crawler
General-purpose web crawlers are an important part of search-engine crawling systems (Baidu, Google, Yahoo, etc.). Their main purpose is to download web pages from the Internet to form a mirror backup of Internet content.
How a general search engine (Search Engine) works
A general-purpose web crawler collects web pages and information from the Internet; this information is used to build the index that supports the search engine. It determines whether the whole engine system is rich in content and whether the information is up to date, so its performance directly affects the effectiveness of the search engine.
Step 1: Crawl the page
The basic workflow of a search-engine web crawler is as follows:
1. Select a portion of seed URLs and put them into the queue of URLs to be crawled.
2. Take a URL from the queue of URLs to be crawled, resolve DNS to get the host's IP, download the web page that the URL points to, store it in the downloaded-page library, and move the URL into the queue of crawled URLs.
3. Analyze the pages corresponding to the URLs in the crawled queue, extract the other URLs they contain, and put those into the queue of URLs to be crawled, then continue with the next loop (a minimal sketch of this loop follows).
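A minimal sketch of this loop in Python is shown below; the seed URL, the naive regex link extraction, and the page limit are illustrative assumptions, not a production design:

```python
import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=10):
    """Naive breadth-first crawl: fetch pages, store them, queue newly found links."""
    to_crawl = deque(seed_urls)   # queue of URLs to be crawled
    crawled = set()               # URLs that have already been crawled
    pages = {}                    # downloaded web page library: url -> html

    while to_crawl and len(pages) < max_pages:
        url = to_crawl.popleft()
        if url in crawled:
            continue
        crawled.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue              # skip pages that fail to download
        pages[url] = html
        # analyze the page, extract further URLs and put them into the queue
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in crawled:
                to_crawl.append(link)
    return pages

if __name__ == "__main__":
    pages = crawl(["http://www.example.com/"])   # placeholder seed URL
    print(len(pages), "pages downloaded")
```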
How search engines get the URL of a new website:
1. The new website actively submits its URL to the search engine (for example via Baidu's link-submission page, /linksubmit/url).
2. Set up external links to the new site on other websites (as far as possible within the reach of search-engine crawlers).
3. Search engines cooperate with DNS resolution service providers (e.g. DNSPod), so new website domain names are crawled quickly.
However, search-engine spiders crawl according to certain rules and need to obey the instructions of some commands or files, such as links marked with nofollow, or the Robots protocol.
The Robots protocol (also called the crawler protocol or robot protocol), whose full name is the Robots Exclusion Protocol, lets a website tell search engines which pages may be crawled and which may not, for example:
Taobao: https://www.taobao.com/robots.txt
Tencent (qq.com): http://www.qq.com/robots.txt
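For a crawler that wants to respect this protocol, Python's standard urllib.robotparser can read a site's robots.txt and answer whether a given URL may be fetched; a minimal sketch with a placeholder site and User-Agent:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # placeholder site
rp.read()                                           # download and parse robots.txt

# May a crawler identifying itself as "MySpider" fetch this page?
print(rp.can_fetch("MySpider", "https://www.example.com/some/page"))
```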
Step 2: Data storage
Search engines use crawlers to collect web pages and store the data in a raw page database. The page data in it is identical to the HTML that the user's browser receives.
Search-engine spiders also perform a certain amount of duplicate-content detection while crawling a page. If they encounter a large amount of plagiarized, scraped or duplicated content on a site with very low weight, they are likely to stop crawling it.
Step 3: Pre-processing
Search engines take the pages the crawler has fetched and pre-process them in several steps:
extracting text, Chinese word segmentation, noise elimination (e.g. copyright notices, navigation bars, advertisements), index building, link-relationship calculation, special-file processing, and so on.
In addition to HTML documents, search engines can usually crawl and index a wide range of text-based file types, such as PDF, Word, WPS, XLS, PPT, TXT documents, etc. We also often see these file types in the search results.
However, search engines are not yet able to process non-textual content such as images, videos, Flash, nor can they execute scripts and programs.
Step 4: Provide search services, website ranking
After organizing and processing the collected information, the search engine provides users with a keyword search service and displays information relevant to the user's search.
At the same time, websites are ranked according to the PageRank value of their pages (a link-analysis ranking), so sites with a higher Rank value appear higher in the search results. Of course, you can also simply pay the search engine for a higher ranking, simple and crude.
However, these generalized search engines also have some limitations:
General-purpose search engines return whole web pages, and in most cases 90% of the content on a page is useless to the user.
Users from different fields and backgrounds often have different search purposes and needs, and a general search engine cannot tailor results to a specific user.
As the data forms on the World Wide Web grow richer and network technology keeps developing, images, databases, audio, video and other multimedia data appear in large quantities; general-purpose search engines cannot handle these files well and cannot discover or access them effectively.
Most general-purpose search engines provide keyword-based retrieval, which makes it difficult to support queries based on semantic information or to understand a user's specific needs accurately.
In response to these situations, focused crawling techniques are widely used.
Focused Crawler
A focused crawler is a "theme-oriented" web crawler. It differs from a general search-engine crawler in that the focused crawler processes and filters content while crawling, trying to ensure that it only crawls web pages relevant to its target topic.
Focused crawling is what we will be learning from here on.
HTTP and HTTPS
HTTP (HyperText Transfer Protocol): the protocol for publishing and receiving HTML pages.
HTTPS (HyperText Transfer Protocol over Secure Socket Layer) is simply a secure version of HTTP that adds an SSL layer beneath HTTP.
SSL (Secure Sockets Layer) is a secure transmission protocol mainly used in the Web, encrypting network connections at the transport layer to ensure the security of data transmission over the Internet.
The port number for HTTP is 80 and for HTTPS is 443
How HTTP works
The process of crawling web pages can be understood as simulating what a browser does.
The main function of a browser is to send requests to a server and display the web resource of your choice in the browser window. HTTP is a set of rules that lets computers communicate over a network.
HTTP request and response
HTTP communication consists of two parts: the client request message and the server response message.
The process by which a browser sends an HTTP request:
- When the user enters a URL in the browser's address bar and presses Enter, the browser sends an HTTP request to the HTTP server. HTTP requests are mainly divided into "GET" and "POST" methods.
- When we enter a URL in the browser, the browser sends a Request to fetch the HTML file at that address, and the server sends the Response file object back to the browser.
- The browser analyzes the HTML in the Response and finds that it references many other files, such as image files, CSS files and JS files. The browser then automatically sends further Requests to fetch those images, CSS files or JS files.
- When all the files have been downloaded successfully, the web page is displayed in full according to its HTML syntax structure. (A minimal sketch of the first request/response exchange follows this list.)
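As a rough sketch of that first Request/Response exchange, the snippet below fetches a page with Python's urllib.request and inspects the response object; the URL is a placeholder, and the GET/POST sections later in this article build on the same calls:

```python
import urllib.request

# Build the Request the way a browser would, then open it (placeholder URL)
request = urllib.request.Request("http://www.example.com/")
response = urllib.request.urlopen(request)

print(response.status)                            # e.g. 200 when the request succeeds
print(response.getheader("Content-Type"))         # the media type the server declared
html = response.read().decode("utf-8", "ignore")  # the HTML a browser would go on to parse
print(html[:200])
```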
URL (Uniform / Universal Resource Locator): a uniform resource locator, an identification method used to completely describe the address of a web page or other resource on the Internet.
Basic format: scheme://host[:port#]/path/.../[?query-string][#anchor]
- scheme: protocol (e.g. http, https, ftp)
- host: IP address or domain name of the server
- port#: the server's port (omitted when the protocol's default port is used, e.g. 80 for http)
- path: the path to access the resource
- query-string: parameters, data to be sent to the http server
- anchor: anchor (jumps to a specified anchor position on a web page)
Example:
ftp://192.168.0.116:8080/index
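As an aside, Python's urllib.parse.urlsplit pulls exactly these components out of a URL; the URL below is purely illustrative:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:8080/path/index.html?wd=python#section1")
print(parts.scheme)    # 'http'                -> scheme
print(parts.hostname)  # 'www.example.com'     -> host
print(parts.port)      # 8080                  -> port#
print(parts.path)      # '/path/index.html'    -> path
print(parts.query)     # 'wd=python'           -> query-string
print(parts.fragment)  # 'section1'            -> anchor
```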
Client HTTP request
A URL merely identifies the location of a resource, whereas HTTP is used to submit and retrieve resources. The client sends an HTTP request to the server; the request message consists of four parts: a request line, the request headers, a blank line, and the request body.
An example of a typical HTTP request
```
GET / HTTP/1.1
Host: www.baidu.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: /
Accept-Encoding: gzip, deflate, sdch, br
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: BAIDUID=04E4001F34EA74AD4601512DD3C41A7B:FG=1; BIDUPSID=04E4001F34EA74AD4601512DD3C41A7B; PSTM=1470329258
```
Request method
GET / HTTP/1.1
According to the HTTP standard, HTTP requests can use multiple request methods.
HTTP 0.9: Only basic text GET functionality.
HTTP 1.0: Refines the request/response model and completes the protocol by defining three request methods: GET, POST and HEAD methods.
HTTP 1.1: Updated from 1.0 with five new request methods: OPTIONS, PUT, DELETE, TRACE and CONNECT methods.
HTTP 2.0 (not yet universal at the time of writing): the definition of request/response headers is basically unchanged, except that all header field names must be lowercase and the request line is replaced by the pseudo-header key-value pairs :method, :scheme, :authority and :path.
No. | Method | Description |
---|---|---|
1 | GET | Requests the specified page and returns the entity body. |
2 | HEAD | Similar to a GET request, except that the returned response has no concrete content; used to obtain the headers. |
3 | POST | Submits data to the specified resource to be processed (such as submitting a form or uploading a file); the data is included in the request body. A POST request may result in the creation of a new resource and/or the modification of an existing one. |
4 | PUT | The data transmitted from the client to the server replaces the content of the specified document. |
5 | DELETE | Requests that the server delete the specified page. |
6 | CONNECT | Reserved in HTTP/1.1 for proxies that can switch the connection into a tunnel. |
7 | OPTIONS | Allows the client to query the capabilities of the server. |
8 | TRACE | Echoes back the request received by the server, mainly used for testing or diagnostics. |
HTTP requests are mainly divided into two methods: GET and POST.
- GET fetches data from the server; POST sends data to the server.
- The parameters of a GET request are visible: they appear in the browser's URL. The HTTP server generates the response according to the parameters contained in the requested URL, i.e. the parameters of a GET request are part of the URL. For example: http://www.baidu.com/s?wd=Chinese
- The parameters of a POST request are in the request body; the message length is not limited and the data is sent implicitly. POST is usually used to submit relatively large amounts of data to the HTTP server (for example when the request contains many parameters or uploads a file). The request's "Content-Type" header specifies the media type and encoding of the message body.
Note: Avoid using Get method for form submission as it may lead to security issues. For example, if you use Get in a login form, the username and password entered by the user will be exposed in the address bar.
Commonly used request headers
1. Host (host and port number)
Host: the host name and port number from the corresponding URL, used to specify the Internet host and port of the requested resource; it is usually part of the URL.
2. Connection (link type)
Connection: indicates the type of connection between client and server.
The client initiates a request containing Connection: keep-alive, which HTTP/1.1 uses as the default value.
When the server receives the request:
If the server supports keep-alive, it replies with a response containing Connection: keep-alive and does not close the connection; if it does not support keep-alive, it replies with Connection: close and closes the connection.
If the client receives a response containing Connection: keep-alive, it sends its next request over the same connection, until one side actively closes it.
keep-alive allows connections to be reused in many cases, reducing resource consumption and response time, for example when the browser needs multiple files (an HTML file and its associated image files) and does not have to open a new connection for each request.
3. Upgrade-Insecure-Requests (Upgrade to HTTPS Requests)
Upgrade-Insecure-Requests: asks for insecure requests to be upgraded, meaning that http resources are automatically requested over https when the page is loaded, so that the browser no longer shows warnings about http requests inside https pages.
HTTPS carries HTTP over a secure channel, so HTTP requests are not allowed from pages hosted over HTTPS; when they do occur, a warning or an error is raised.
4. User-Agent (browser name)
User-Agent: is the name of the client's browser, more on this later.
5. Accept (type of file transferred)
Accept: specifies the MIME (Multipurpose Internet Mail Extensions) file types that the browser or other client can accept; the server uses it to decide which file format to return.
Examples:
Accept: */*: indicates that anything can be received.
Accept: image/gif: Indicates that the client wishes to accept resources in the GIF image format;
Accept: text/html: indicates that the client wants to accept html text.
Accept: text/html, application/xhtml+xml;q=0.9, image/*;q=0.8: indicates that the MIME types supported by the browser are html text, xhtml and xml documents, and all image format resources respectively.
q is a weight factor in the range 0 <= q <= 1. The larger the q value, the more the request prefers the content type given before the ";". If no q value is specified, it defaults to 1 and the types are ranked from left to right; a value of 0 indicates that the browser does not accept that content type.
Text: used for the standardized representation of textual information; text messages can be in multiple character sets and/or multiple formats. Application: used to transfer application data or binary data.
6. Referer (page jump)
Referer: indicates the URL of the page from which the request originated, i.e. from which page the user reached the currently requested page. This attribute can be used to track which web page the request came from, which site it came from, and so on.
Sometimes when downloading images from a website you need to send the corresponding Referer, otherwise the image cannot be downloaded. That is because the site uses hotlink protection: it checks the Referer to determine whether the request comes from the site's own pages; if not, the request is rejected, and if so, the download is allowed.
7. Accept-Encoding (document coding and decoding format)
Accept-Encoding: indicates the encodings acceptable to the browser. Encoding differs from file format: it is used to compress a file and speed up its transfer. The browser decodes the web response before checking the file format, and in many cases this greatly reduces download time.
Example: Accept-Encoding:gzip;q=1.0, identity; q=0.5, *;q=0
If more than one encoding matches at the same time, they are ranked by their q values. In this example the gzip and identity compression encodings are supported, in that order; a server that supports gzip will return a gzip-encoded HTML page. If this field is not set in the request message, the server assumes the client can accept any content encoding.
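Note that urllib.request does not decompress responses for you, so a crawler that advertises Accept-Encoding: gzip has to decode the body itself; a minimal sketch with a placeholder URL:

```python
import gzip
import urllib.request

request = urllib.request.Request(
    "http://www.example.com/",               # placeholder URL
    headers={"Accept-Encoding": "gzip"}      # tell the server we accept gzip
)
response = urllib.request.urlopen(request)
data = response.read()

# Only decompress if the server actually answered with gzip
if response.getheader("Content-Encoding") == "gzip":
    data = gzip.decompress(data)

print(data.decode("utf-8", "ignore")[:200])
```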
8. Accept-Language (language type)
Accept-Language: indicates the languages the browser can accept, such as en or en-us for English and zh or zh-cn for Chinese; it is used when the server can provide more than one language version.
9. Accept-Charset (character encoding)
Accept-Charset: Indicates the character encoding that the browser can accept.
Example: Accept-Charset: iso-8859-1, gb2312, utf-8
ISO-8859-1: usually called Latin-1; it includes the additional characters indispensable for writing Western European languages, and it is the default for English-language browsers.
gb2312: the standard Simplified Chinese character set.
utf-8: a variable-length encoding of Unicode that solves the problem of displaying text in multiple languages, enabling internationalization and localization of applications.
If this field is not set in the request message, the default is that any character set is acceptable.
10. Cookie (Cookie)
Cookie: the browser uses this header to send cookies to the server. A cookie is a small piece of data kept in the browser; it can record user information related to the server and can also be used to implement sessions, which will be discussed in detail later.
11. Content-Type (POST data type)
Content-Type: the type of the content in the body of a POST request.
Example: Content-Type: text/xml; charset=gb2312
Indicates that the request's message body contains plain-text XML data with the character encoding gb2312.
Server-side HTTP response
The HTTP response also consists of four parts: a status line, a message header, a blank line, and the response body
```
HTTP/1.1 200 OK
Server: Tengine
Connection: keep-alive
Date: Wed, 30 Nov 2016 07:58:21 GMT
Cache-Control: no-cache
Content-Type: text/html;charset=UTF-8
Keep-Alive: timeout=20
Vary: Accept-Encoding
Pragma: no-cache
X-NWS-LOG-UUID: bd27210a-24e5-4740-8f6c-25dbafa9c395
Content-Length: 180945

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ...
```
Commonly used response headers (for understanding)
In theory all of the response header information should be in answer to the request headers. But for efficiency, security and other reasons, the server adds some response headers of its own, such as the ones in the example above:
1. Cache-Control: must-revalidate, no-cache, private
This value tells the client that the server does not want the client to cache the resource, and that on the next request for the resource, it must make a new request to the server and cannot get the resource from the cached copy.
Cache-Control is a very important piece of information in the response header. When the client's request header contains Cache-Control: max-age=0, it explicitly asks the server for a fresh copy rather than a cached one; as response information, Cache-Control usually comes back as no-cache, meaning "do not cache this".
When the client's request header does not contain Cache-Control, the server will often set its own caching policy, and different resources get different policies. For example, oschina's policy for caching image resources is Cache-Control: max-age=86400, which means that for 86400 seconds from the current time the client may read the resource directly from its cached copy without requesting it from the server.
2. Connection:keep-alive
This field answers the client's Connection: keep-alive, telling the client that the server's TCP connection is also a persistent connection and that the client can continue to send HTTP requests over this TCP connection.
3. Content-Encoding:gzip
Tells the client that the resource sent by the server is encoded using gzip, and that the client should use gzip to decode the resource when it sees this message.
4. Content-Type:text/html;charset=UTF-8
Tells the client the type of the resource file and its character encoding: the client should decode the resource as UTF-8 and then parse it as HTML. We often see garbled text on some websites precisely because the server did not return the correct encoding.
5. Date:Sun, 21 Sep 2016 06:18:21 GMT
This is the server's time when the resource was sent; GMT is Greenwich Mean Time. All times sent in the HTTP protocol are GMT, mainly to avoid confusion when clients and servers in different time zones request resources from each other.
6. Expires:Sun, 1 Jan 2000 01:00:00 GMT
This response header is also cache-related. It tells the client that it may use the cached copy directly until this time. Obviously this value can be problematic, because the client's and server's clocks are not necessarily the same, and a mismatch can cause errors. So this header is not as accurate as the Cache-Control: max-age=* header, because max-age is a relative time, which is both easier to understand and more precise.
7. Pragma:no-cache
This meaning is equivalent to Cache-Control.
8. Server: Tengine/1.4.6
This header simply tells the client the server software and its version.
9. Transfer-Encoding:chunked
This response header tells the client that the server is sending the resource in chunks. Chunked resources are generally generated dynamically by the server, so the total size is not known when sending begins; that is why chunked transfer is used. Each chunk is independent and carries its own length, and the last chunk has length 0; when the client reads this zero-length chunk, it can be sure the resource has been transferred completely.
10. Vary: Accept-Encoding
Tells caching servers to cache both the compressed and the uncompressed version of the file. This field is not of much significance nowadays, because modern browsers all support compression.
Response status code
The response status code consists of three digits, the first of which defines the category of the response and has five possible values.
Common Status Codes:
100~199: the server has successfully received part of the request and requires the client to continue submitting the rest in order to complete the whole process.
200~299: the server successfully received the request and completed the processing. Commonly used: 200 (OK, request successful).
300~399: to complete the request, the client needs to refine the request further. For example, the requested resource has moved to a new address. Commonly used: 302 (the requested page has temporarily moved to a new URL), 307 and 304 (use the cached resource).
400~499: the client's request contains an error. Commonly used: 404 (the server could not find the requested page), 403 (the server refused access, insufficient permissions).
500~599: an error occurred on the server side. Commonly used: 500 (the request was not completed; the server encountered an unforeseen condition).
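When using urllib, 4xx and 5xx responses are raised as urllib.error.HTTPError, whose code attribute is the status code described above; a small sketch with a placeholder URL:

```python
import urllib.error
import urllib.request

def fetch(url):
    try:
        response = urllib.request.urlopen(url, timeout=5)
        print("OK:", response.status)            # 2xx: the request succeeded
        return response.read()
    except urllib.error.HTTPError as e:
        # 4xx / 5xx: the server answered, but with an error status
        print("HTTP error:", e.code, e.reason)   # e.g. 404 Not Found, 403 Forbidden
    except urllib.error.URLError as e:
        # no HTTP answer at all: DNS failure, refused connection, ...
        print("URL error:", e.reason)

fetch("http://www.example.com/no-such-page")      # placeholder URL
```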
Cookies and sessions:
The interaction between server and client is limited to the request/response process; the connection is closed when it ends, and on the next request the server treats the client as a new one.
In order to maintain the link between them and let the server know that a request comes from a previous user, the client's information must be stored somewhere.
Cookie: Identifies the user by information recorded on the client.
Session: Identifies the user by information recorded on the server side.
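As an alternative to copying a Cookie header by hand (which the login example later in this article does), urllib can keep session cookies automatically through http.cookiejar; a minimal sketch with placeholder URLs:

```python
import urllib.request
from http import cookiejar

# An opener that remembers every Set-Cookie the server sends
cj = cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# First request: the server identifies the client and sets cookies
opener.open("http://www.example.com/login")        # placeholder URL

# Later requests through the same opener send those cookies back automatically
response = opener.open("http://www.example.com/profile")
for cookie in cj:
    print(cookie.name, cookie.value)
```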
Fiddler interface
Once set up, all local HTTP traffic goes through the 127.0.0.1:8888 proxy and is therefore intercepted by Fiddler.
Request section in detail
- Headers -- Displays the header of the HTTP request sent from the client to the server, shown as a hierarchical view containing Web client information, cookies, transmission status, and so on.
- TextView -- displays the body of the POST request as text.
- WebForms -- displays the GET parameters and POST body content of the request.
- HexView -- displays requests in hexadecimal data.
- Auth -- displays the Proxy-Authorization and Authorization headers of the request.
- Raw -- Displays the entire request as plain text.
- JSON -- displays JSON-format content.
- XML -- If the request body is in XML format, it is displayed in a hierarchical XML tree.
Response section in detail
- Transformer -- Displays information about the encoding of the response.
- Headers -- Displays the header of the response in a hierarchical view.
- TextView -- displays the response body as text.
- ImageView -- if the request was for an image resource, displays the image from the response.
- HexView -- displays the response in hexadecimal data.
- WebView -- shows a preview of the response as it would be rendered in a web browser.
- Auth -- Displays the Proxy-Authorization and Authorization information in the response header.
- Caching -- Displays caching information for this request.
- Privacy -- Displays private (P3P) information for this request.
- Raw -- Displays the entire response as plain text.
- JSON -- displays JSON-format content.
- XML -- If the response body is in XML format, it is displayed in a hierarchical XML tree.
HTTP/HTTPS GET and POST methods
urllib.parse.urlencode()
```python
# Test results in IPython
In [1]: import urllib.parse

In [2]: word = {"wd": "传智播客"}

# Convert dictionary key-value pairs into a URL-encoded string that the web server can accept,
# via the urlencode() method.
In [3]: urllib.parse.urlencode(word)
Out[3]: 'wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2'

# Convert the URL-encoded string back into the original string via the unquote() method.
In [4]: print(urllib.parse.unquote("wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2"))
wd=传智播客
```
Generally, the data submitted in an HTTP request needs to be encoded into URL-encoded format and then either attached as part of the URL or passed to the Request object as its data parameter.
Get method
GET requests are generally used to fetch data from the server. For example, we use Baidu to search for 传智播客: http://www.baidu.com/s?wd=传智播客
The browser's URL will jump to something like this:
http://www.baidu.com/s?wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2
We can see that a long string appears after /s? in the request; it is the URL-encoded form of the keyword 传智播客 that we want to query. So we can try to send the request using the default GET method.
```python
# urllib_get.py

import urllib.parse    # responsible for URL encoding
import urllib.request

url = "http://www.baidu.com/s"
word = {"wd": "传智播客"}
word = urllib.parse.urlencode(word)   # convert to URL-encoded format (a string)

newurl = url + "?" + word             # the first separator in a URL is ?

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}

request = urllib.request.Request(newurl, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())
```
Batch crawling Tieba pages
First we create a Python file. What we want to accomplish is: given a Baidu Tieba address, for example:
First page of the LOL bar on Baidu Tieba: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0
Page two: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50
Page three: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100
We find the pattern: the only thing that differs between pages of a tieba is the value of pn at the end of the URL; everything else is the same. We can exploit this pattern.
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib.parse
import urllib.request


def loadPage(url, filename):
    """
    Purpose: send a request to the url and get the server's response file
    url: the URL to crawl
    filename: the name of the file being processed
    """
    print("Downloading " + filename)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    }
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request).read()


def writePage(html, filename):
    """
    Purpose: write the html content to a local file
    html: the content of the corresponding file returned by the server
    """
    print("Saving " + filename)
    with open(filename, "wb+") as f:
        f.write(html)
    print("-" * 30)


def tiebaSpider(url, beginPage, endPage):
    """
    Purpose: tieba crawler scheduler, responsible for combining and processing each page's URL
    url: the first part of the tieba URL
    beginPage: start page
    endPage: end page
    """
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50
        filename = "Page_" + str(page) + ".html"
        fullurl = url + "&pn=" + str(pn)
        html = loadPage(fullurl, filename)
        writePage(html, filename)
    print("Thanks for using")


if __name__ == "__main__":
    kw = input("Please enter the name of the tieba to crawl: ")
    beginPage = int(input("Please enter the start page: "))
    endPage = int(input("Please enter the end page: "))

    url = "http://tieba.baidu.com/f?"
    key = urllib.parse.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
```
In fact, many websites work like this: each HTML page number corresponds to a serial number at the end of the URL, and as long as you discover the pattern you can crawl the pages in batches.
POST method:
Above we said that the Request object has a data parameter, which is used for POST: the data we want to transmit goes into this parameter. We start from a dictionary of matching key-value pairs and URL-encode it before sending.
There is an online dictionary translation website (the Youdao Dictionary in this example).
When we enter some test data and then observe with Fiddler, we see that one of the requests is a POST and that the request data sent to the server does not appear in the URL, so we can try to simulate this POST request.
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib.parse
import urllib.request

# The URL obtained by capturing the packet is not the URL shown in the browser
# (the Youdao translate endpoint is assumed here)
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"

# Full headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
}

# Form data sent to the web server (the text the user typed in)
formdata = {
    "type": "AUTO",
    "i": "I love you.",
    "doctype": "json",
    "xmlVersion": "1.6",
    "keyfrom": "",
    "ue": "UTF-8",
    "typoResult": "true",
}

# Transcode with urlencode, then encode to bytes
data = urllib.parse.urlencode(formdata).encode("utf-8")

# If the data parameter of Request() has a value, the request is a POST;
# if not, it is a GET
request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
print(html)
```
When sending a POST request, you need to pay special attention to some attributes of headers:
Content-Length: 144: means the length of the form data being sent is 144 bytes.
X-Requested-With: XMLHttpRequest: indicates an Ajax asynchronous request.
Content-Type: application/x-www-form-urlencoded: Indicates that when the browser submits a Web form, the form data will be encoded in the form of name1=value1&name2=value2 key-value pairs.
Getting AJAX loaded content
Some web pages load their content with AJAX. Just remember: AJAX generally returns JSON, so send a POST or GET directly to the AJAX address and you will get the JSON data back.
"As a crawler engineer, the most important thing you need to focus on, is the source of the data"
```python
import urllib.parse
import urllib.request

headers = {"User-Agent": "Mozilla...."}

# demo1: AJAX endpoint that returns JSON (the Douban movie chart API is assumed here)
url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"

# The variable parts are these two parameters: start from `start` and return `limit` items
formdata = {
    'start': '0',
    'limit': '10'
}
data = urllib.parse.urlencode(formdata).encode('utf-8')

request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

# demo2: process all parameters through the form data instead
url = "https://movie.douban.com/j/chart/top_list?"

formdata = {
    'type': '11',
    'interval_id': '100:90',
    'action': '',
    'start': '0',
    'limit': '10'
}
data = urllib.parse.urlencode(formdata).encode('utf-8')

request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
Question: Why is it that sometimes POST can also see data within the URL?
- The GET method accesses the resource directly through a link; the link contains all the parameters, which the server uses to obtain the values of the variables. It is an insecure choice if a password is included, but you can see intuitively what you are submitting.
- POST does not show the parameters in the URL; the server takes the submitted data from the form body. However, if the method attribute is not specified in the HTML code, the request defaults to GET, and the data submitted in the form is appended to the URL, separated from it by a ?.
- Form data can be sent as URL parameters (method="get") or as an HTTP POST (method="post"). For example, in the following HTML code, the form data will be appended to the URL because of method="get":
```html
<form action="form_action.asp" method="get">
  <p>First name: <input type="text" name="fname" /></p>
  <p>Last name: <input type="text" name="lname" /></p>
  <input type="submit" value="Submit" />
</form>
```
Simulate login with cookies
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib.request

# A profile page that can only be viewed while logged in (renren.com is assumed here)
url = "http://www.renren.com/410043129/profile"

# Headers copied from a logged-in browser session, including the Cookie
headers = {
    "Host": "www.renren.com",
    "Connection": "keep-alive",
    #"Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Referer": "/",
    #"Accept-Encoding": "gzip, deflate, sdch",  # adding this would return a compressed file
    "Cookie": "anonymid=ixrna3fysufnwv; _r01_=1; depovince=GW; jebe_key=f6fb270b-d06d-42e6-8b53-e67c3156aa7e%7Cc13c37f53bca9e1e7132d4b58ce00fa3%7C1484060607478%7C1%7C1484400895379; jebe_key=f6fb270b-d06d-42e6-8b53-e67c3156aa7e%7Cc13c37f53bca9e1e7132d4b58ce00fa3%7C1484060607478%7C1%7C1484400890914; JSESSIONID=abcX8s_OqSGsYeRg5vHMv; jebecookies=0c5f9b0d-03d8-4e6a-b7a9-3845d04a9870|||||; ick_login=8a429d6c-78b4-4e79-8fd5-33323cd9e2bc; _de=BF09EE3A28DED52E6B65F6A4705D973F1383380866D39FF5; p=0cedb18d0982741d12ffc9a0d93670e09; ap=327550029; first_login_flag=1; ln_uact=mr_mao_hacker@; ln_hurl=/photos/hdn521/20140529/1055/h_main_9A3Z_e0c300019f6a195a.jpg; t=56c0c522b5b068fdee708aeb1056ee819; societyguester=56c0c522b5b068fdee708aeb1056ee819; id=327550029; xnsid=5ea75bd6; loginfrom=syshome",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
}

request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())
```
Processing HTTPS Requests SSL Certificate Validation
Nowadays you see websites beginning with https everywhere. urllib verifies the SSL certificate for HTTPS requests just like a web browser does; if the site's SSL certificate has been issued by a CA, access proceeds normally.
If SSL certificate validation fails, or the operating system does not trust the server's security certificate, for example when the browser visits the 12306 website https://www.12306.cn/mormhweb/, the user is warned that the certificate is not trusted. (The 12306 certificate was reportedly self-signed rather than CA-issued.)
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib.request
import ssl

# Ignore SSL certificate verification
context = ssl._create_unverified_context()

url = "https://www.12306.cn/mormhweb/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}

request = urllib.request.Request(url, headers=headers)

# Pass the context parameter to urlopen
response = urllib.request.urlopen(request, context=context)
print(response.read())
```
About CA
CA (Certificate Authority) is short for digital certificate authority: a trusted third-party institution that issues, manages and revokes digital certificates, such as Beijing Digital Authentication Co., Ltd.
The role of the CA is to check the legitimacy of the identity of the certificate holder and issue certificates to prevent them from being forged or tampered with, as well as to manage the certificates and keys.
In real life you can prove your identity with an ID card; in the online world, the digital certificate is that ID card. Unlike real life, though, not every Internet user has a digital certificate; usually one is only needed when a person has to prove their identity.
Regular users generally don't need one, because websites don't care who visits them; these days they only care about traffic. On the flip side, however, websites do need to prove their own identity.
For example, there are a lot of phishing websites nowadays: the site you are actually visiting may only look like the site you intended to visit. So you need to verify the website's identity before submitting your private information, by asking the website to show a digital certificate.
Usually normal websites will take the initiative to present their own digital certificates to ensure that the communication data between the client and the web server is encrypted and secure.
This concludes this article on crawler principles and data capture. For more on crawlers and data capture, please search my earlier articles or continue to browse the related articles below, and I hope you will support me in the future!