Preamble
Author: Luo Zhaocheng
Getting Cat's Eye interface data
As a programmer who has been cooped up at home for a long time, packet capturing of all kinds comes easily to me. Inspecting the page source in Chrome, you can clearly see the interface; its address is: /mmdb/comments/movie/?_v_=yes&offset=15
In Python, it's easy to use the requests library to send the network request and get the result back:
import requests

def getMoveinfo(url):
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)"
    }
    response = session.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
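Since the interface pages through comments with the offset parameter (15 per page), the pages can be fetched in a loop. The sketch below is my own addition: the full interface address (host and movie id) is omitted above, so it is passed in as a parameter here.

import json
import time

def fetchAllPages(base_url, pages=10):
    # base_url must contain an "offset={}" placeholder; the full
    # interface address is omitted above, so it is an argument here.
    for page in range(pages):
        html = getMoveinfo(base_url.format(page * 15))
        if html is None:
            break  # request failed or no more data
        yield json.loads(html)
        time.sleep(1)  # pause between pages to avoid hammering the interface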
Each such request returns the interface's data. It contains a lot of information, much of which we don't need, so first take a quick look at what comes back:
{ "cmts":[ { "approve":0, "approved":false, "assistAwardInfo":{ "avatar":"", "celebrityId":0, "celebrityName":"", "rank":0, "title":"" }, "authInfo":"", "cityName":"Guiyang prefecture level city in Guizhou", "content":"It must be very,A movie you have to see even if you have to borrow money。", "filmView":false, "id":1045570589, "isMajor":false, "juryLevel":0, "majorType":0, "movieId":1208282, "nick":"nick", "nickName":"nickName", "oppose":0, "pro":false, "reply":0, "score":5, "spoiler":0, "startTime":"2018-11-22 23:52:58", "supportComment":true, "supportLike":true, "sureViewed":1, "tagList":{ "fixed":[ { "id":1, "name":"Favorable review." }, { "id":4, "name":"Purchase of Tickets" } ] }, "time":"2018-11-22 23:52", "userId":1871534544, "userLevel":2, "videoDuration":0, "vipInfo":"", "vipType":0 } ] }
With so much data, the only fields we are interested in are the following:
nickName, cityName, content, startTime, score
Next comes the more important part, data processing: parsing the fields we need out of the JSON data we fetched:
import json

def parseInfo(data):
    data = json.loads(data)['cmts']
    for item in data:
        yield {
            'date': item['startTime'],
            'nickname': item['nickName'],
            'city': item['cityName'],
            'rate': item['score'],
            'comment': item['content']
        }
After getting the data, we can start the data analysis. But to avoid frequently re-requesting the data from Cat's Eye, it should be stored first. Here I use SQLite3 to put it into a database, which makes subsequent processing more convenient. The code to store the data is as follows:
import sqlite3

def saveCommentInfo(moveId, nickname, comment, rate, city, start_time):
    conn = sqlite3.connect('unknow_name.db')
    conn.text_factory = str
    cursor = conn.cursor()
    ins = "insert into comments values (?,?,?,?,?,?)"
    v = (moveId, nickname, comment, rate, city, start_time)
    cursor.execute(ins, v)
    conn.commit()
    cursor.close()
    conn.close()
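Note that the insert statement assumes a comments table already exists. A minimal sketch for creating one (the column names and types are my assumption, chosen to match the insert order above and the queries below):

import sqlite3

conn = sqlite3.connect('unknow_name.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS comments (
        movieId    INTEGER,
        nickname   TEXT,
        comment    TEXT,
        rate       REAL,
        city       TEXT,
        start_time TEXT
    )
""")
conn.commit()
conn.close()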
Data processing
Because we stored the data in a database in the previous section, we can use SQL directly to query the results we want, such as the top five cities by number of comments:
SELECT city, count(*) rate_count FROM comments GROUP BY city ORDER BY rate_count DESC LIMIT 5
The results are as follows:
From the data above, we can see that Beijing has the highest number of comments.
Not only that: more SQL statements can be used to query other results, for example the number of people at each rating, or the percentage of each rating. If you're interested, try querying the data yourself; it's that simple.
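As an illustration, here is one way such a query might look (a plain sqlite3 sketch computing each rating's count and its share of the total):

import sqlite3

conn = sqlite3.connect('unknow_name.db')
sql = """
    SELECT rate,
           COUNT(*) AS rate_count,
           ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM comments), 1) AS percent
    FROM comments
    GROUP BY rate
    ORDER BY rate DESC
"""
for rate, rate_count, percent in conn.execute(sql):
    print(rate, rate_count, "{}%".format(percent))
conn.close()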
To better present the data, we use the Pyecharts library for data visualization.
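The visualization code below reads the comments back from a '{'-separated text file with pandas rather than querying SQLite directly, so the database first has to be exported to such a file. A minimal sketch (the file name comments.csv is my own placeholder; the '{' separator matches the read_csv calls below):

import sqlite3

conn = sqlite3.connect('unknow_name.db')
with open('comments.csv', 'w', encoding='utf-8') as out:
    query = "SELECT start_time, nickname, city, rate, comment FROM comments"
    for row in conn.execute(query):
        # '{' is used as the separator because comments often contain commas
        out.write('{'.join(str(col) for col in row) + '\n')
conn.close()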
Based on the data we got from Cat's Eye, we can use Pyecharts to plot the data on a map of China by geographic location:
import pandas as pd
from pyecharts import Geo

# f is the path to the exported comment data file
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
data_map = [(city_com['city'][i], city_com['count'][i])
            for i in range(0, city_com.shape[0])]

geo = Geo("GEO Geographic Location Analysis", title_pos="center",
          width=1200, height=800)
while True:
    try:
        attr, val = geo.cast(data_map)
        geo.add("", attr, val, visual_range=[0, 300],
                visual_text_color="#fff", symbol_size=10,
                is_visualmap=True, maptype='china')
    except ValueError as e:
        # Pyecharts has no coordinates for this city: drop it and retry
        e = str(e).split("No coordinate is specified for ")[1]
        data_map = list(filter(lambda item: item[0] != e, data_map))
    else:
        break
geo.render('geo_city_location.html')
Note: with the map data provided by Pyecharts, some cities in the Cat's Eye data cannot be matched to coordinates, so in the code we catch the city name that GEO reports as invalid and drop it from the data, which filters out quite a lot of entries.
Using Python, it's that simple to generate the following map:
From the visualized data, we can see that the people who both watch and comment on movies are concentrated in eastern China, with Beijing, Shanghai, Chengdu and Shenzhen contributing the most. Although the graph already shows a lot, it is still not intuitive enough: if we want to see the distribution by province/city, the data needs further processing.
In the data obtained from Cat's Eye, the city field includes county-level cities, so a conversion step is needed: map every county-level city to its corresponding province or municipality, then sum the comment counts within each province to get the final result.
import json
import pandas as pd

def getRealName(name, jsonObj):
    # Map a county-level city to its province via the lookup table.
    # Assumed matching rule: the mapping key begins with the city name.
    for item in jsonObj:
        if item.startswith(name):
            return jsonObj[item]
    return name

def realKeys(name):
    # Strip administrative suffixes (province, city, autonomous region)
    # so that the names match Pyecharts' map of China
    return (name.replace(u"省", "").replace(u"市", "")
                .replace(u"回族自治区", "").replace(u"维吾尔自治区", "")
                .replace(u"壮族自治区", "").replace(u"自治区", ""))

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)

# Load the city -> province mapping (the file name was omitted in the original)
fo = open("", 'r')
citys_info = fo.readlines()
citysJson = json.loads(str(citys_info[0]))

data_map_all = [(getRealName(city_com['city'][i], citysJson), city_com['count'][i])
                for i in range(0, city_com.shape[0])]
data_map_list = {}
for item in data_map_all:
    if item[0] in data_map_list:
        data_map_list[item[0]] += item[1]
    else:
        data_map_list[item[0]] = item[1]
data_map = [(realKeys(key), data_map_list[key]) for key in data_map_list.keys()]
After the above data processing, use the map provided by Pyecharts to generate a map by province/city:
from pyecharts import Map

def generateMap(data_map):
    map = Map("Number of city comments", width=1200, height=800,
              title_pos="center")
    while True:
        try:
            attr, val = map.cast(data_map)
            map.add("", attr, val, visual_range=[0, 800],
                    visual_text_color="#fff", symbol_size=5,
                    is_visualmap=True, maptype='china',
                    is_map_symbol_show=False, is_label_show=True,
                    is_roam=False)
        except ValueError as e:
            # Drop names the map has no coordinates for, then retry
            e = str(e).split("No coordinate is specified for ")[1]
            data_map = list(filter(lambda item: item[0] != e, data_map))
        else:
            break
    map.render('city_rate_count.html')
Of course, we can also visualize the number of people at each rating, shown here with a bar chart:
import pandas as pd
from pyecharts import Bar

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# Group by rating
rateData = data.groupby(['rate'])
rateDataCount = rateData["date"].agg(["count"])
rateDataCount.reset_index(inplace=True)
# Reverse the order so the highest rating comes first
count = rateDataCount.shape[0] - 1
attr = [rateDataCount["rate"][count - i] for i in range(0, rateDataCount.shape[0])]
v1 = [rateDataCount["count"][count - i] for i in range(0, rateDataCount.shape[0])]

bar = Bar("Number of ratings")
bar.add("Quantity", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
        xaxis_interval=0, is_splitline_show=True)
bar.render("html/rate_count.html")
The resulting chart, shown below, indicates that in Cat's Eye's data the share of five-star ratings exceeds 50%, far higher than the 34.8% five-star share on Douban.
From the audience distribution and ratings above, we can see that viewers enjoy this film very much. We have already fetched the audience's comment data from Cat's Eye; now let's split the comments into words with jieba and build a word cloud with Wordcloud, to see what viewers think of "Nameless Generation":
import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# Join all comments into one text, then segment with jieba
comment = jieba.cut(" ".join(data['comment'].astype(str)), cut_all=False)
wl_space_split = " ".join(comment)
backgroudImage = np.array(Image.open(r"./unknow_3.png"))
stopword = STOPWORDS.copy()  # assumption: the original stopword source was omitted
wc = WordCloud(width=1920, height=1080, background_color='white',
               mask=backgroudImage,
               font_path="./",  # the Chinese font file path was omitted in the original
               stopwords=stopword, max_font_size=400,
               random_state=50)
wc.generate_from_text(wl_space_split)
plt.imshow(wc)
plt.axis("off")
wc.to_file('unknow_word_cloud.png')
Output:
That's all for this article on using Python to crawl Cat's Eye movie data and analyze "Nameless Generation". For more related content, please search my previous articles or continue browsing the related articles below. I hope you will keep supporting me!