Very often we have to log in before we can view the content we want: answers on Zhihu, the friends list in Qzone, the people we follow and our followers on Weibo, and so on. When we try to crawl this information directly, we run into a problem that is hard to solve head-on: these sites have login rules and login CAPTCHAs. However, there is a way around it. The idea is this: first log in with a browser, extract the login credentials from the browser, and then feed those "credentials" to the crawler so it can imitate a logged-in user and keep crawling. The credentials we want here are the cookies.
In this article we will try to use Python together with cookies to grab the friends list in Qzone. The tools used are the FireFox browser, FireBug, and Python.
Getting cookies
Open the FireFox browser, log in to Qzone, start FireBug, switch to the Cookies tab in FireBug, click the Cookies button menu in that tab, and choose "Export cookies from this site" to export the cookies.
The exported cookies are saved as a text file.
Program implementation
Next we use the acquired cookies to build a new opener that replaces the default opener used for requests. Copy the exported cookie file into the program directory and write the script as follows:
```python
# encoding: utf-8
from http.cookiejar import MozillaCookieJar
from urllib.request import Request, build_opener, HTTPCookieProcessor

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 360


def grab(url):
    cookie = MozillaCookieJar()
    # 'cookies.txt' is a placeholder: use the name of the file you exported
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
    req = Request(url, headers=DEFAULT_HEADERS)
    opener = build_opener(HTTPCookieProcessor(cookie))
    response = opener.open(req, timeout=DEFAULT_TIMEOUT)
    print(response.read().decode('utf8'))


if __name__ == '__main__':
    # The host portion of this URL was lost from the original post;
    # substitute your own QQ number in the path
    grab('/QQ number/myhome/friends')
```
Since we are using a cookie file exported from the FireFox browser, the cookieJar used here is MozillaCookieJar.
Execute the script... but an error is reported:
```
Traceback (most recent call last):
  File "D:/pythonDevelop/spider/use_cookie.py", line 17, in <module>
    start()
  File "D:/pythonDevelop/spider/use_cookie.py", line 9, in start
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
  File "D:\Program Files\python\python35\lib\http\cookiejar.py", line 1781, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "D:\Program Files\python\python35\lib\http\cookiejar.py", line 2004, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file
```
The problem is with the cookie file: it doesn't look like a Netscape-format cookie file. It's easy to fix, though; just add the following line at the very beginning of the cookie file:
# Netscape HTTP Cookie File
This line tells Python's cookie parser that this is a FireFox (Netscape-format) cookie file.
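If you don't want to edit the file by hand, the fix can be scripted. Below is a minimal sketch; the file name cookies.txt and the demo cookie line are assumptions, standing in for your real export:

```python
import os
import tempfile

HEADER = "# Netscape HTTP Cookie File\n"

def add_netscape_header(path):
    """Prepend the magic first line that MozillaCookieJar checks for."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    if not content.startswith("# Netscape"):   # don't add it twice
        with open(path, "w", encoding="utf-8") as f:
            f.write(HEADER + content)

# Demonstrate on a throwaway file standing in for the real export
demo = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(demo, "w", encoding="utf-8") as f:
    f.write(".qzone.qq.com\tTRUE\t/\tFALSE\t1473955201\tfoo\tbar\n")
add_netscape_header(demo)
```

Because the function checks the first line before writing, it is safe to run it on a file that has already been fixed.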
Execute it again and it still reports an error; since the traceback is quite long, I'll just post the key part:
```
http.cookiejar.LoadError: invalid Netscape format cookies file 'cookies.txt': '.\tTRUE\t/\tFALSE\tblabla\tdynamic'
```
This means that some lines in the cookie file are badly formatted. To find out exactly where, you first need to understand the FireFox cookie format. MozillaCookieJar expects each cookie line to contain the following fields, separated by tabs:
field   | domain      | domain_specified | path            | secure                            | expires      | name   | value
type    | string      | boolean          | string          | boolean                           | long integer | string | string
meaning | domain name | —                | applicable path | whether a secure protocol is used | expiry time  | name   | value
I'm not quite sure what domain_specified means here; I'll add an explanation once I figure it out. Now let's look at a few lines of the cookie file we exported:
```
FALSE	/	FALSE	814849905_todaycount	0
FALSE	/	FALSE	814849905_totalcount	0
.	TRUE	/	FALSE	1473955201	Loading	Yes
.	TRUE	/	FALSE	1789265237	QZ_FE_WEBP_SUPPORT	0
```
The first two lines are badly formatted; the last two are fine. What the first two lines are missing is the expires field. The fix is simple: just pick an expiry time from one of the well-formed cookies and fill it in.
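Patching the malformed lines can also be automated. The sketch below assumes a well-formed line has 7 tab-separated fields and a broken one has 6, the missing field being expires; the default expiry value is simply borrowed from another cookie, as suggested above:

```python
def fix_missing_expires(lines, default_expires="1789265237"):
    """Insert a default expires field (column 5 of 7) into 6-field lines."""
    fixed = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            fixed.append(line)                  # keep comments and blank lines as-is
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 6:                    # the expires field is missing
            fields.insert(4, default_expires)   # index 4 = the expires column
        fixed.append("\t".join(fields) + "\n")
    return fixed

# A broken sample line (6 fields, no expires); the domain is illustrative
broken = ["qzone.qq.com\tFALSE\t/\tFALSE\t814849905_todaycount\t0\n"]
print(fix_missing_expires(broken)[0])
```

Read the cookie file with `readlines()`, pass the result through this function, and write it back out.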
After fixing the cookie file, running the script again works without errors. But it didn't print the friend information as expected, because the URL was wrong. Use FireBug to find the correct URL:
/proxy/domain//cgi-bin/tfriend/friend_ship_manager.cgi?uin=<QQ number>&do=1&rd=0.44948123599838985&fupdate=1&clean=0&g_tk=515169388
This grabs the friends list, which comes back as a JSON string.
As for how to parse json, it will be explained in the next section.
Getting cookies dynamically
Cookies expire. If you want to crawl for a long time, you need to refresh the cookie every so often, and fetching it manually from FireFox each time would be silly. The browser cookie is only the entry point; once the first request has been made, we let Python obtain fresh cookies itself. Here is a program that gets cookies:
```python
# encoding: utf-8
from http.cookiejar import CookieJar
from urllib.request import Request, HTTPCookieProcessor, build_opener

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 360


def get(url):
    cookie = CookieJar()
    handler = HTTPCookieProcessor(cookie)
    opener = build_opener(handler)
    req = Request(url, headers=DEFAULT_HEADERS)
    response = opener.open(req, timeout=DEFAULT_TIMEOUT)
    # The handler has stored any Set-Cookie values in the jar
    for item in cookie:
        print(item.name + " = " + item.value)
    response.close()
```
The example program shows how to obtain cookies, and prints each cookie's name and value. It turns out that cookies are re-acquired on every HTTP request, so we can adjust our program a little: the first request runs with the cookie obtained from the browser, and every subsequent request reuses the cookies picked up during the previous request. The adjusted program:
```python
# encoding: utf-8
from http.cookiejar import MozillaCookieJar
from urllib.request import Request, build_opener, HTTPCookieProcessor

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 360


def gen_login_cookie():
    cookie = MozillaCookieJar()
    # 'cookies.txt' is a placeholder: use the name of the file you exported
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
    return cookie


def grab(cookie, url):
    req = Request(url, headers=DEFAULT_HEADERS)
    opener = build_opener(HTTPCookieProcessor(cookie))
    response = opener.open(req, timeout=DEFAULT_TIMEOUT)
    print(response.read().decode("utf8"))
    response.close()


def start(url1, url2):
    # The same jar is reused, so cookies set by the first request
    # are sent with the second
    cookie = gen_login_cookie()
    grab(cookie, url1)
    grab(cookie, url2)


if __name__ == '__main__':
    # The host portions of these URLs were lost from the original post;
    # substitute your own QQ number
    u1 = "/QQ number/myhome/friends"
    u2 = "/proxy/domain//cgi-bin/tfriend/friend_ship_manager.cgi?uin=<QQ number>&do=2&rd=0.44948123599838985&fupdate=1&clean=0&g_tk=515169388"
    start(u1, u2)
```
That's all.
Other approaches
In fact, there is another way to use cookies when logging in to Qzone: by observation, you can also put the cookie information directly into the HTTP request header.
To get the cookie from the request header: open the FireFox browser, open FireBug and enable its Network tab, log in to Qzone in FireFox, then find the login request in FireBug; the cookie information is in its request headers.
Collapse the cookie information into a single line, add it to the request header, and you can access the page directly. This approach is simpler and saves you the steps of fixing up the cookie file.
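A minimal sketch of this approach follows. The Cookie value below is a made-up placeholder; the real one is the single line copied from FireBug's request headers:

```python
from urllib.request import Request, urlopen

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) "
                  "Gecko/20100101 Firefox/43.0",
    # One line, "name=value" pairs joined by "; ", exactly as the browser
    # sends it. This value is a fabricated placeholder -- paste your own.
    "Cookie": "uin=o0123456789; skey=@placeholder; p_skey=placeholder",
}

def grab(url):
    # No cookie jar needed: the header carries the credentials directly
    req = Request(url, headers=DEFAULT_HEADERS)
    with urlopen(req, timeout=360) as response:
        return response.read().decode("utf8")
```

The trade-off is that a header pasted this way is static: unlike the cookie-jar version above, it will not pick up refreshed cookies from responses.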
In addition, in a blog post I also found a way to log in to Qzone directly. It is the best method I know of: as long as Tencent doesn't change its login rules, a single simple request gets you the cookie. But the post is quite old, so I don't know whether the rules still apply.
That covers how to use cookies correctly in a Python crawler. For more on using cookies with Python crawlers, please see my other related articles!