This article shows how to grab NetEase news comments with Python and regular expressions. It is shared here for your reference, as follows:
While writing a crawler for NetEase news I found that the comments shown on a page do not appear anywhere in the page's HTML source, so I used a packet-capture tool to find the hidden address from which the comments are actually loaded (every browser has built-in developer tools that can capture requests and be used to analyse a site).
If you look through the captured requests carefully, you will notice one that stands out; that is the one you want.
Then open that link to find the relevant comment content (the first page of comments looks roughly like the sketch below).
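The comment page does not return plain JSON; it returns a JavaScript assignment such as var replyData={...}. The field names used here (hotPosts and the '1'/'f'/'b'/'a'/'v' keys) are taken from the crawler code further down, but the concrete values are made up purely for illustration. A minimal sketch of turning such a response into a Python object:

import json

# Made-up miniature of what the first comment page returns; real responses
# contain many more fields and entries, plus some escaped HTML fragments.
raw = 'var replyData={"hotPosts": [{"1": {"f": "user123", "b": "nice article", "a": "2", "v": "15"}}]}'

# Strip the JavaScript prefix so that only JSON is left, then parse it.
data = json.loads(raw.replace('var replyData=', ''))
for item in data['hotPosts']:
    print(item['1']['f'])  # commenter ID
    print(item['1']['b'])  # comment text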
The next step is the code (which I also rewrote with reference to code from more experienced people).
#coding=utf-8
import urllib2
import re
import json
import time

class WY():
    def __init__(self):
        # The User-Agent string is truncated in the source article; any reasonable value works
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.24 (KHTML, like '}
        # URL of the first page of comments (the host part was stripped from the original article)
        self.url = './data/news3_bbs/df/B9IBDHEH000146BE_1.html'

    def getpage(self, page):
        # Build the URL of a later comment page (the host part was stripped from the original article)
        full_url = './cache/newlist/news3_bbs/B9IBDHEH000146BE_' + str(page) + '.html'
        return full_url

    def gethtml(self, page):
        # Download one comment page and return its body
        try:
            req = urllib2.Request(page, None, self.headers)
            response = urllib2.urlopen(req)
            html = response.read()
            return html
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u"Connection failed.", e.reason
            return None

    # Process the string: strip the JavaScript wrapper and HTML fragments so only JSON remains
    def Process(self, data, page):
        if page == 1:
            data = data.replace('var replyData=', '')
        else:
            data = data.replace('var newPostList=', '')
        reg1 = re.compile(" \[<a href=''>")
        data = reg1.sub(' ', data)
        reg2 = re.compile('<\\\/a>\]')
        data = reg2.sub('', data)
        reg3 = re.compile('<br>')
        data = reg3.sub('', data)
        return data

    # Parse the JSON and write the comments to a text file
    def dealJSON(self):
        # The output file name was lost in the source; 'wy_comments.txt' is a placeholder
        with open("wy_comments.txt", "a") as file:
            file.write('ID' + '|' + 'Comment' + '|' + 'Down' + '|' + 'Up' + '\n')
        for i in range(1, 12):
            if i == 1:
                data = self.gethtml(self.url)
                data = self.Process(data, i)[:-1]
                value = json.loads(data)
                file = open('wy_comments.txt', 'a')
                for item in value['hotPosts']:
                    try:
                        file.write(item['1']['f'].encode('utf-8') + '|')
                        file.write(item['1']['b'].encode('utf-8') + '|')
                        file.write(item['1']['a'].encode('utf-8') + '|')
                        file.write(item['1']['v'].encode('utf-8') + '\n')
                    except:
                        continue
                file.close()
                print '--Collecting %d/12--' % i
                time.sleep(5)
            else:
                page = self.getpage(i)
                data = self.gethtml(page)
                data = self.Process(data, i)[:-2]
                value = json.loads(data)
                file = open('wy_comments.txt', 'a')
                for item in value['newPosts']:
                    try:
                        file.write(item['1']['f'].encode('utf-8') + '|')
                        file.write(item['1']['b'].encode('utf-8') + '|')
                        file.write(item['1']['a'].encode('utf-8') + '|')
                        file.write(item['1']['v'].encode('utf-8') + '\n')
                    except:
                        continue
                file.close()
                print '--Collecting %d/12--' % i
                time.sleep(5)

if __name__ == '__main__':
    WY().dealJSON()
The above is the crawler code I used.
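Since urllib2 exists only in Python 2, here is a rough sketch of how the same fetch / strip / parse loop could look on Python 3 with the standard urllib.request module. This is not the original author's code: the host part of the URLs is still missing (it was stripped from the article), the output file name wy_comments_py3.txt is a placeholder, and the regex clean-up of HTML fragments in the comment text is omitted for brevity.

# -*- coding: utf-8 -*-
# Rough Python 3 sketch of the same fetch / strip / parse loop.
import json
import time
import urllib.request

HEADERS = {'User-Agent': 'Mozilla/5.0'}
FIRST_PAGE = './data/news3_bbs/df/B9IBDHEH000146BE_1.html'        # fill in the captured host
PAGE_TMPL = './cache/newlist/news3_bbs/B9IBDHEH000146BE_%d.html'  # fill in the captured host

def fetch(url):
    # Download one comment page; adjust the encoding if the response is not UTF-8.
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8', errors='ignore')

def to_json(text, first_page):
    # Remove the JavaScript assignment prefix and any trailing semicolon/whitespace.
    prefix = 'var replyData=' if first_page else 'var newPostList='
    return json.loads(text.replace(prefix, '').strip().rstrip(';'))

def main():
    with open('wy_comments_py3.txt', 'a', encoding='utf-8') as out:
        out.write('ID|Comment|Down|Up\n')
        for i in range(1, 12):
            url = FIRST_PAGE if i == 1 else PAGE_TMPL % i
            data = to_json(fetch(url), i == 1)
            posts = data['hotPosts'] if i == 1 else data['newPosts']
            for item in posts:
                try:
                    entry = item['1']
                    out.write('|'.join(str(entry[k]) for k in ('f', 'b', 'a', 'v')) + '\n')
                except (KeyError, TypeError):
                    continue
            print('--Collecting %d/12--' % i)
            time.sleep(5)  # pause between pages, as in the original code

if __name__ == '__main__':
    main()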
PS: Here are 2 more very convenient regular expression tools for your reference:
JavaScript regular expression online test tool:
http://tools./regex/javascript
Regular expression online generation tool:
http://tools./regex/create_reg
More Python-related content can be found in this site's topics: "Python Regular Expression Usage Summary", "Python Data Structures and Algorithms Tutorial", "Python Socket Programming Tips Summary", "Summary of Python Function Usage Tips", "Summary of Python String Manipulation Techniques", "Python Introductory and Advanced Classic Tutorials", and "Summary of Python File and Directory Manipulation Techniques".
I hope this article helps you with your Python programming.