
Python Crawler Basics

Introduction to Crawlers

According to the Baidu Encyclopedia definition: a web crawler (also known as a web spider, a web robot, or, in the FOAF community, a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less frequently used names include ant, autoindexer, simulator, and worm.

With the continuous development of big data, crawling technology has gradually come into public view. It can be said that crawlers are a product of the rise of big data; at least, it was through big data that I first came to understand this technology.

I. Request-response

When implementing a crawler in Python (Python 2 here), the two main libraries used are urllib and urllib2. First, a piece of code to illustrate:

import urllib
import urllib2

url = "http://www.baidu.com"
request = urllib2.Request(url)       # build the request object
response = urllib2.urlopen(request)  # send the request and receive the response
print response.read()                # print the page source

We know that a web page is composed of HTML as the skeleton, JavaScript as the muscle, and CSS as the clothes. The code above fetches the source of the Baidu homepage to the local machine.

Here, url is the address of the page to crawl; request wraps the request to be sent, and response receives the server's reply to that request. Finally, the read() function returns the source code of the Baidu page.
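Beyond read(), the response object returned by urllib2.urlopen exposes a few more methods for inspecting what came back. The following is a small illustrative sketch (an addition, not from the original article), reusing the Baidu URL from above:

import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.getcode()      # HTTP status code, e.g. 200
print response.geturl()       # final URL after any redirects
print response.info()         # the response headers
print response.read()[:200]   # first 200 bytes of the page source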

II. GET-POST

Both are ways of passing data to a web page. The most important difference is that the GET method puts all the parameters directly into the link itself; this is insecure if the parameters include a password, but it lets you see exactly what is being submitted.

POST, on the other hand, does not show the parameters in the URL, so it is less convenient if you want to see directly what was submitted. Choose whichever suits your needs.

POST method:

import urllib
import urllib2

values = {'username': '2680559065@', 'Password': 'XXXX'}
data = urllib.urlencode(values)          # encode the form fields
url = '/account/login?from=/my/mycsdn'   # relative path as given; supply the full login URL yourself
request = urllib2.Request(url, data)     # passing data makes this a POST request
response = urllib2.urlopen(request)
print response.read()

GET method:

import urllib
import urllib2

values = {'username': '2680559065@', 'Password': 'XXXX'}
data = urllib.urlencode(values)
url = "/account/login"                   # relative path as given; supply the full login URL yourself
geturl = url + "?" + data                # all parameters are appended to the URL itself
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
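To see what urlencode actually produces, and therefore what a GET link carries, here is a minimal sketch (the field values are placeholders, not real credentials):

import urllib

values = {'username': 'test@example.com', 'Password': 'XXXX'}
# urlencode joins the fields with '&' and percent-escapes special characters
print urllib.urlencode(values)
# e.g. Password=XXXX&username=test%40example.com
# (dictionary order is not guaranteed in Python 2)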

III. Exception handling

The try-except statement is used when handling exceptions.

import urllib2

try:
    response = urllib2.urlopen("http://www.example.com")  # placeholder URL; the original omitted it
except urllib2.URLError, e:
    print e.reason
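In practice it is worth distinguishing urllib2.HTTPError (the server answered, but with an error status such as 404) from the more general urllib2.URLError (the request never reached a server at all). Since HTTPError is a subclass of URLError, it must be caught first. A minimal sketch, again with a placeholder URL:

import urllib2

try:
    response = urllib2.urlopen("http://www.example.com/missing-page")  # placeholder URL
except urllib2.HTTPError, e:
    print "HTTP error:", e.code      # the server replied with an error status
except urllib2.URLError, e:
    print "URL error:", e.reason     # DNS failure, refused connection, etc.
else:
    print response.read()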

Summary

The above is a brief introduction to the basics of Python crawlers. I hope it helps you; if you have any questions, please leave me a message and I will reply in a timely manner. Thank you very much for your support of my website!