SoFunction
Updated on 2024-10-29

Example of Scraping NetEase News Comments with Python Regular Expressions

This article shows how to scrape NetEase News comments with Python and regular expressions. The details are shared below for your reference:

While writing a crawler for NetEase News, I found that the comments shown on a news page do not appear in the page's source code at all. So I used a packet capture tool to find the hidden address that the comments are loaded from. (Every browser comes with its own capture tool, typically the network panel of its developer tools, which you can use to analyze a site.)

If you look closely at the captured requests, one of them stands out from the rest; that is the one you want.

Open that link and you will find the comment content. (The original post shows a screenshot of the first page's content here.)
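Judging from the prefixes that the code in this article strips (`var replyData=` on the first page, `var newPostList=` on later pages), the comment address returns a JavaScript assignment rather than bare JSON. A minimal sketch of turning such a response into a Python object might look like the following. It is written for Python 3, the helper name `parse_js_assignment` is hypothetical, and the sample payload is invented to match the keys used in this article; it is not real NetEase data.

```python
import json

def parse_js_assignment(text, var_name):
    """Strip a leading 'var NAME=' and a trailing ';', then parse the rest as JSON."""
    prefix = 'var ' + var_name + '='
    body = text.strip()
    if body.startswith(prefix):
        body = body[len(prefix):]
    # Remove the statement terminator so that only the JSON body is left
    body = body.rstrip(';').strip()
    return json.loads(body)

# Hypothetical sample shaped like the first-page response (assumption, not real data)
sample = 'var replyData={"hotPosts":[{"1":{"f":"user1","b":"nice","a":"3","v":"10"}}]};'
data = parse_js_assignment(sample, 'replyData')
print(data['hotPosts'][0]['1']['b'])
```

Checking for the prefix and stripping the terminator by content, rather than cutting a fixed number of characters, keeps the parse working if the response gains or loses trailing whitespace.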

Next comes the code (adapted from code written by more experienced developers).

#coding=utf-8
import urllib2
import re
import json
import time
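class WY():
  def __init__(self):
    # The User-Agent string is cut off in the source and is kept as-is.
    self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.24 (KHTML, like '}
    # Address of the first comment page. The host part is missing from the
    # source; restore the full address found with the capture tool.
    self.url = './data/news3_bbs/df/B9IBDHEH000146BE_1.html'
  # Build the address of a later comment page (host part also missing)
  def getpage(self, page):
    full_url = './cache/newlist/news3_bbs/B9IBDHEH000146BE_' + str(page) + '.html'
    return full_url
  # Download one page; returns None if the connection fails
  def gethtml(self, page):
    try:
      req = urllib2.Request(page, None, self.headers)
      response = urllib2.urlopen(req)
      html = response.read()
      return html
    except urllib2.URLError, e:
      if hasattr(e, 'reason'):
        print u"Connection failed.", e.reason
      return None
  # Processing strings: strip the JavaScript variable prefix and leftover HTML
  def Process(self, data, page):
    if page == 1:
      data = data.replace('var replyData=', '')
    else:
      data = data.replace('var newPostList=', '')
    # Raw strings keep the backslashes in the patterns readable
    reg1 = re.compile(r" \[<a href=''>")
    data = reg1.sub(' ', data)
    reg2 = re.compile(r'<\\/a>\]')
    data = reg2.sub('', data)
    reg3 = re.compile('<br>')
    data = reg3.sub('', data)
    return data
  # Parsing json
  def dealJSON(self):
    # The output file name is empty in the source; put your own path in both open() calls.
    with open('', 'a') as out:
      out.write('ID' + '|' + 'Comments' + '|' + 'Stomp' + '|' + 'Top' + '\n')
    for i in range(1, 12):
      if i == 1:
        # The first page has its own address and lists its posts under 'hotPosts'
        url, key, cut = self.url, 'hotPosts', -1
      else:
        url, key, cut = self.getpage(i), 'newPosts', -2
      data = self.gethtml(url)
      if data is None:
        continue  # the request failed, so skip this page
      # Trim the trailing characters so that only the JSON body remains
      value = json.loads(self.Process(data, i)[:cut])
      with open('', 'a') as out:
        for item in value[key]:
          try:
            # f, b, a, v are written in the column order of the header line
            row = '|'.join(item['1'][k].encode('utf-8') for k in ('f', 'b', 'a', 'v'))
          except Exception:
            continue  # skip posts with missing or malformed fields
          out.write(row + '\n')
      print '--Collecting %d/12--' % i
      time.sleep(5)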
if __name__ == '__main__':
  WY().dealJSON()

That is the code I used for the crawl.
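To see what the regular expressions in `Process` actually do, here is a small standalone sketch of the same three substitutions, written for Python 3 with the patterns as raw strings. The function name `strip_markup` and the sample string are hypothetical; the sample only imitates the link markup those patterns target.

```python
import re

# The same three patterns as in Process, written as raw strings
LINK_OPEN = re.compile(r" \[<a href=''>")   # ' [<a href=''>' before a linked name
LINK_CLOSE = re.compile(r"<\\/a>\]")        # '<\/a>]' with the JSON-escaped slash
LINE_BREAK = re.compile(r"<br>")

def strip_markup(text):
    """Remove the link wrapper and <br> tags, keeping the visible text."""
    text = LINK_OPEN.sub(' ', text)
    text = LINK_CLOSE.sub('', text)
    return LINE_BREAK.sub('', text)

# Hypothetical sample (assumption): a raw string so that the backslash stays literal
sample = r"NetEase user [<a href=''>Beijing<\/a>]: first line<br>second line"
print(strip_markup(sample))
```

Note that `<br>` is removed outright rather than replaced with a space, so the text on either side of a line break is joined directly.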

PS: Here are two handy regular expression tools for your reference:

JavaScript regular expression online test tool:
http://tools./regex/javascript

Regular expression online generation tool:
http://tools./regex/create_reg

For more Python content, see these topics on this site: "Python Regular Expression Usage Summary", "Python Data Structures and Algorithms Tutorial", "Python Socket Programming Tips Summary", "Summary of Python Function Usage Tips", "Summary of Python String Manipulation Techniques", "Python Introductory and Advanced Classic Tutorials", and "Summary of Python File and Directory Manipulation Techniques".

I hope that what I have said in this article will help you in Python programming.