urllib2 is a Python module for fetching URLs. It offers a very simple interface for fetching URLs over a variety of protocols, and it also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL - for example, "ftp" is the scheme of a URL like "ftp://example.com/"), using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations, urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol (HTTP).
The most comprehensive and authoritative reference to HTTP is of course RFC 2616. It is a technical document, so it is not easy to read. The purpose of this HOWTO is to show how to use urllib2, with enough detail about HTTP to help you along. It is not a reference document for urllib2; it plays a supporting role.
Get URLs
The simplest use of urllib2 is as follows:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Many uses of urllib2 are as simple as that (note that instead of an "http:" URL we could have used a URL starting with "ftp:", "file:", etc.). However, this tutorial teaches the more complex cases of HTTP.
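For instance, a "file:" URL is fetched through exactly the same interface. A minimal sketch (the path below is hypothetical; any readable local file will do):

import urllib2

# Fetch a local file through the same urlopen interface (hypothetical path).
response = urllib2.urlopen('file:///tmp/example.txt')
print response.read()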
HTTP is based on a request-and-response mechanism - the client makes a request and the server sends a response. urllib2 mirrors this with a Request object, which represents the HTTP request you are making. In its simplest form, you create a Request object specifying the URL you want to fetch; calling urlopen with this Request object returns a response object for that request. The response is a file-like object, which means you can, for example, call .read() on the response.
import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:
req = urllib2.Request('ftp://example.com/')
In the case of HTTP, Request objects allow you to do two extra things: first, you can pass data to be sent to the server; second, you can pass extra information ("metadata") about the data or about the request itself to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.
Data
Sometimes you want to send data to a URL (often the URL refers to a CGI [Common Gateway Interface] script or other web application). With HTTP, this is often done using what's known as a POST request. This is usually what your browser does when you submit an HTML form that you have filled in.
Not all POSTs have to come from forms: you can use POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done using a function from urllib, not from urllib2.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see the HTML Specification, Form Submission (http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13) for more details).
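As a rough sketch of what such an encoding involves (urllib and urllib2 provide no helper for multipart/form-data, so the body is built by hand; the URL and field names here are hypothetical):

import urllib2

boundary = '----PythonFormBoundary'
file_data = 'some file contents'   # normally read from the file being uploaded

# Assemble a multipart/form-data body by hand.
body = '\r\n'.join([
    '--' + boundary,
    'Content-Disposition: form-data; name="file"; filename="example.txt"',
    'Content-Type: text/plain',
    '',
    file_data,
    '--' + boundary + '--',
    '',
])

req = urllib2.Request('http://www.example.com/upload')   # hypothetical URL
req.add_header('Content-Type', 'multipart/form-data; boundary=' + boundary)
req.add_data(body)
response = urllib2.urlopen(req)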
If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example, by placing an order with the website to have a hundredweight of tinned spam delivered to your door).
Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects and GET requests never to, nothing prevents a GET request from having side-effects, nor a POST request from having none. Data can also be passed in an HTTP GET request by encoding it in the URL itself, as the following example shows.
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Headers
We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP requests.
Some sites don't like being browsed by programs (as opposed to humans), or send different versions of their content to different browsers. By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, e.g. "Python-urllib/2.5"). This identity may confuse the site, or simply not work. Browsers identify themselves via the User-Agent header, and when you create a Request object you can give it a dictionary containing the headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
The response object also has two very useful methods, info and geturl; they are covered in the section info and geturl below, after we look at what happens when things go wrong.
Handle Exceptions
urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError, etc. may also be raised).
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
URLError
Typically, URLError is raised when there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception will have a 'reason' attribute, which is a tuple containing an error code and a text error message.
For example:
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.reason
>>>
(4, 'getaddrinfo failed')
HTTPError
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will deal with some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
HTTPError instances raised will have an integer 'code' attribute, which corresponds to the error number sent by the server.
Error Codes
Because the default handlers deal with redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. (The dictionary itself is not reproduced here.)
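As a quick sketch, that dictionary can be consulted interactively (the message strings shown are the ones shipped with Python 2):

>>> import BaseHTTPServer
>>> BaseHTTPServer.BaseHTTPRequestHandler.responses[404]
('Not Found', 'Nothing matches the given URI')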
When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned, which means that, as well as the code attribute, it also has read, geturl, and info methods.
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
>>>     urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.code
>>>     print e.read()
>>>
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css"
type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...
Wrapping it Up
So if you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second one.
The first one:
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
Note: the except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
The second one:
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
info and geturl
The response object returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl().
geturl -- this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect; the URL of the page fetched may not be the same as the URL requested.
info -- this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance. Typical headers include 'Content-length', 'Content-type', and others. See the Quick Reference to HTTP Headers (http://www.cs.tut.fi/~jkorpela/http.html) for a useful listing of HTTP headers with brief explanations of their meaning and use.
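A short interactive sketch of both methods (the exact header values naturally depend on the server you contact):

>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/')
>>> response.geturl()
'http://www.python.org/'
>>> response.info()['Content-Type']
'text/html'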
Openers and Handlers
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we use the default opener - via urlopen - but you can create custom openers. Openers use handlers, and all the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme, or how to handle some aspect of URL opening, for example HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly, as sketched below.
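A minimal sketch of doing this by hand, using only handlers from the standard library:

import urllib2

# Build an opener manually and register the handlers it should use.
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())
opener.add_handler(urllib2.HTTPDefaultErrorHandler())

response = opener.open('http://www.python.org/')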
Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
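For example, a sketch of an opener that handles cookies, built with build_opener and the standard cookielib module:

import urllib2
import cookielib

# build_opener keeps the default handlers and adds our cookie processor.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('http://www.example.com/')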
Other sorts of handlers you might want can deal with proxies, authentication, and other common but slightly specialised situations.
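For instance, a brief sketch of an opener that routes requests through an HTTP proxy (the proxy address is hypothetical):

import urllib2

# Route HTTP requests through a proxy at a hypothetical address.
proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:8080/'})
opener = urllib2.build_opener(proxy_support)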
install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be used to fetch URLs directly, just like the urlopen function; there's no need to call install_opener, except as a convenience.
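A short sketch showing both ways of using an opener:

import urllib2

opener = urllib2.build_opener()

# Either call the opener's open method directly...
response = opener.open('http://www.python.org/')

# ...or install it globally, after which plain urlopen uses it too.
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.python.org/')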