SoFunction
Updated on 2024-10-29

How python crawlers use cookies correctly

Very often we have to log in before we can see the content we want: answers on Zhihu, the friend list in Qzone, followed accounts and fans on Weibo, and so on. Crawling this information directly runs into a problem that is hard to solve head-on: these sites guard the content with login rules and login CAPTCHAs. But there is a way around it. The idea is this: first log in with a browser, extract the login credentials from the browser, and then hand those "credentials" to the crawler so it can impersonate the logged-in user and keep crawling. The credentials we want here are the cookie information.

This time we try to use python and cookies to grab the friends list on Qzone. The tools used are FireFox browser, FireBug and Python.

Getting cookies

Open FireFox browser, log in to Qzone, start FireBug, select the Cookies tab in FireBug, click the Cookies button menu in the tab, and select "Export cookies from this site" to complete the export of cookies.

The exported cookies are saved as a text file; below it is assumed to be named cookies.txt.

Program implementation

Next we use the acquired cookie to build a new opener, replacing the default opener used in earlier requests. Copy the exported cookie file into the program directory and write the following script:

#!python
# encoding: utf-8
from http.cookiejar import MozillaCookieJar
from urllib.request import Request, build_opener, HTTPCookieProcessor

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 360


def grab(url):
    cookie = MozillaCookieJar()
    # load the cookie file exported from the browser (file name assumed)
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
    req = Request(url, headers=DEFAULT_HEADERS)
    opener = build_opener(HTTPCookieProcessor(cookie))
    response = opener.open(req, timeout=DEFAULT_TIMEOUT)
    print(response.read().decode('utf8'))


if __name__ == '__main__':
    grab("/QQ number/myhome/friends")  # substitute your own QQ number

Since we are using a cookie file exported from the FireFox browser, the cookie jar used here is MozillaCookieJar.

Execute the script...however an error is reported:

Traceback (most recent call last):
  File "D:/pythonDevelop/spider/use_cookie.py", line 17, in <module>
    start()
  File "D:/pythonDevelop/spider/use_cookie.py", line 9, in start
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
  File "D:\Program Files\python\python35\lib\http\cookiejar.py", line 1781, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "D:\Program Files\python\python35\lib\http\cookiejar.py", line 2004, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file

The problem is with the cookie file: it doesn't look like a Netscape-format cookie file. The fix is easy; just add the following line to the beginning of the cookie file:

# Netscape HTTP Cookie File

This line tells Python's cookie parser that it is dealing with a FireFox (Netscape-format) cookie file.
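If you refresh the cookie file often, prepending the header by hand gets tedious. Here is a small helper of my own (not from the original article; the file name cookies.txt is an assumption) that adds the magic header only when it is missing:

```python
# Hypothetical helper: prepend the Netscape magic header to an exported
# cookie file so MozillaCookieJar will accept it. Safe to run repeatedly.
NETSCAPE_HEADER = "# Netscape HTTP Cookie File\n"


def add_netscape_header(path):
    with open(path, "r", encoding="utf8") as f:
        content = f.read()
    if not content.startswith("# Netscape"):
        with open(path, "w", encoding="utf8") as f:
            f.write(NETSCAPE_HEADER + content)
```

Run it on the exported file, e.g. `add_netscape_header("cookies.txt")`, before calling `cookie.load`.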

Execute it again and it still reports an error, since it's quite long I'll just post the key part:

http.cookiejar.LoadError: invalid Netscape format cookies file 'cookies.txt': '.\tTRUE\t/\tFALSE\tblabla\tdynamic'

It means that some lines in the cookie file are mis-formatted. To find out where, you first need to understand the FireFox cookie format. MozillaCookieJar expects each cookie line to contain the following fields, separated by tabs:

Field             Type          Description
domain            string        the domain name
domain_specified  boolean       (see note below)
path              string        the applicable path
secure            boolean       whether a secure protocol is required
expires           long integer  the expiration time
name              string        the cookie name
value             string        the cookie value

I'm not quite sure what domain_specified means here; I'll fill it in once I figure it out. Now let's look at a few lines of the cookie file we got:

	FALSE	/	FALSE	814849905_todaycount	0
	FALSE	/	FALSE	814849905_totalcount	0
.	TRUE	/	FALSE	1473955201	Loading	Yes
.	TRUE	/	FALSE	1789265237	QZ_FE_WEBP_SUPPORT	0

The first two lines are mis-formatted; the last two are correct. The first two lines are missing the "expires" field. The fix is simple: pick an expiration time from one of the other cookie lines and fill it in.
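When many lines are broken, hand-editing is tedious. The following is my own sketch (not from the article): it assumes a broken line has six tab-separated fields instead of seven, with "expires" missing between "secure" and "name", and inserts a borrowed timestamp there:

```python
# Hypothetical repair helper: the Mozilla cookie format is
# domain, domain_specified, path, secure, expires, name, value (7 fields).
# Lines with only 6 fields are assumed to be missing "expires" at index 4.
FILLER_EXPIRES = "1789265237"  # borrowed from a well-formed line in the file


def fix_cookie_line(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 6:  # "expires" missing between "secure" and "name"
        fields.insert(4, FILLER_EXPIRES)
    return "\t".join(fields)
```

Applying it to every line of the exported file (and writing the result back) fills in the gaps; well-formed lines pass through unchanged.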

With the cookie file repaired, the script runs without errors. But it doesn't print the friend information as expected, because the URL was wrong. Use FireBug to find the correct URL:

/proxy/domain//cgi-bin/tfriend/friend_ship_manager.cgi?uin=QQ number&do=1&rd=0.44948123599838985&fupdate=1&clean=0&g_tk=515169388

This grabs the friend list, which comes back as a JSON string.

As for how to parse json, it will be explained in the next section.

Getting cookies dynamically

Cookies have an expiration date. A crawler that runs for a long time needs to refresh its cookie every so often, and fetching it manually from the FireFox browser each time would be tedious. The browser cookie is only an entry point; after that, Python should obtain fresh cookies on its own as requests are made. The following program gets a cookie:

#!python
# encoding: utf-8
from http.cookiejar import CookieJar
from urllib.request import Request, HTTPCookieProcessor, build_opener

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 360


def get(url):
    cookie = CookieJar()
    handler = HTTPCookieProcessor(cookie)
    opener = build_opener(handler)
    req = Request(url, headers=DEFAULT_HEADERS)
    response = opener.open(req, timeout=DEFAULT_TIMEOUT)
    for item in cookie:
        print(item.name + " = " + item.value)
    response.close()

The example program shows how to obtain a cookie and prints its name and value attributes. It also shows that a fresh cookie is captured on every http request, so we can adjust the program a little: the first request uses the cookie exported from the browser, and every later request reuses the cookies captured during the previous request. The adjusted program:

#!python
# encoding: utf-8
from http.cookiejar import MozillaCookieJar
from urllib.request import Request, build_opener, HTTPCookieProcessor

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 360


def gen_login_cookie():
    cookie = MozillaCookieJar()
    # load the cookie file exported from the browser (file name assumed)
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
    return cookie


def grab(cookie, url):
    req = Request(url, headers=DEFAULT_HEADERS)
    opener = build_opener(HTTPCookieProcessor(cookie))
    response = opener.open(req, timeout=DEFAULT_TIMEOUT)
    print(response.read().decode("utf8"))
    response.close()


def start(url1, url2):
    cookie = gen_login_cookie()
    grab(cookie, url1)
    grab(cookie, url2)


if __name__ == '__main__':
    u1 = "/QQ number/myhome/friends"
    u2 = "/proxy/domain//cgi-bin/tfriend/friend_ship_manager.cgi?uin=QQ number&do=2&rd=0.44948123599838985&fupdate=1&clean=0&g_tk=515169388"
    start(u1, u2)
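As an aside (my own addition, not in the original article): after a crawl, the jar holds whatever refreshed cookies the server sent, and MozillaCookieJar can write them back to disk so the next run starts from the newest credentials:

```python
# Sketch: persist a cookie jar back to disk in Netscape format.
# MozillaCookieJar.save writes the "# Netscape HTTP Cookie File" header
# itself; ignore_discard/ignore_expires keep session cookies too.
from http.cookiejar import MozillaCookieJar


def save_cookies(cookie_jar, path="cookies.txt"):
    cookie_jar.save(path, ignore_discard=True, ignore_expires=True)
```

Calling `save_cookies(cookie)` at the end of `start` would keep the cookie file current between runs.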

That's all.

One more approach

In fact, there is another way to use cookies when logging into Qzone: by observation, you can also put the cookie information directly in the http request header.

To get the cookie from the request header: open the FireFox browser, open FireBug and activate its Network tab, log in to Qzone in FireFox, then find the login request in FireBug; the cookie information is right there in the request header.

Organize the cookie information into a single line, add it to the request header, and you can access the page directly. This method is simpler and avoids the cookie-file fixes described above.
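The header-based approach can be sketched as follows. This is my own illustration, not the article's code; the cookie string below is a made-up placeholder, and you would paste the real "Cookie" header value copied from FireBug:

```python
# Sketch: send the cookie as a raw request header instead of using a jar.
from urllib.request import Request, build_opener

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}


def build_cookie_request(url, raw_cookie):
    headers = dict(DEFAULT_HEADERS)
    headers["Cookie"] = raw_cookie  # the whole cookie string, one line, as copied
    return Request(url, headers=headers)


def grab_with_header(url, raw_cookie):
    # no cookie jar needed: the server sees the cookie in the header itself
    return build_opener().open(build_cookie_request(url, raw_cookie), timeout=360)
```

Because the cookie travels as a plain header, no HTTPCookieProcessor is involved; the trade-off is that refreshed cookies sent back by the server are not captured automatically.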

In addition, in a blog post I also found a way to log in to Qzone directly. It is said to be the best-known method: as long as Tencent does not change its login rules, a simple request is enough to get a cookie. But the post is quite old, and I don't know whether the rules still apply.

That's the detailed account of how a python crawler can use cookies correctly. For more on using cookies in python crawlers, please see my other related articles!