I. Preface
According to the latest figures released on February 23, the Spring Festival film "Hello, Li Huanying" has grossed more than 4.2 billion yuan, catching up with the other New Year releases and becoming the dark horse of 2021.
From actor to director: why has Jia Ling's debut film "Hello, Li Huanying" become such a hit? Below, Rong will walk you through using Python, with the help of a movie website, to analyze the reasons for the film's high box office from several angles.
II. Movie Review Crawling and Word Cloud Analysis
Undoubtedly, Chinese film criticism is changing along with the broader social and cultural context and with shifts in venue and medium. After a century of print-based film criticism, new forms of critical discourse have emerged, each tied to a different mode of distribution: television film criticism, Internet film criticism, and new-media film criticism. The production and dissemination of movie reviews have truly entered an era of democratic pluralism.
The purpose of film criticism is to analyze, judge, and evaluate the aesthetic value, cognitive value, social significance, and cinematic language embedded in a film, and to explain the theme it expresses. By analyzing a film's successes and failures, criticism helps directors broaden their horizons and raise their craft, promoting the prosperity and development of the art of cinema; through analysis and evaluation, it also shapes the audience's understanding and appreciation of a film and raises their level of appreciation, thereby indirectly advancing the art form.
2.1 Website Selection
Python crawler in action: crawling Douban movie review data.
2.2 Crawl Ideas
The steps for crawling the Douban movie review data are as follows (a compact sketch of the whole flow appears after this list):
1. Send the page request
2. Parse the returned web page
3. Extract the movie review data
4. Save the text to a file
5. Run the word cloud analysis
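Before diving into each step, here is a compact, hedged sketch of the whole flow. It assumes Selenium 4, a chromedriver on PATH, and a CJK-capable font at the Windows path shown; the Douban URL and XPath mirror the ones used later in this section.

from selenium import webdriver
from selenium.webdriver.common.by import By
import jieba
import wordcloud

driver = webdriver.Chrome()  # assumes chromedriver and Chrome are installed and on PATH
driver.get('https://movie.douban.com/subject/34841067/comments?start=0')        # 1. send the request
spans = driver.find_elements(By.XPATH, '//*[@id="comments"]/div/div[2]/p/span')  # 2-3. parse and extract
with open('reviews.txt', 'w', encoding='utf-8') as f:                            # 4. save to a file
    for span in spans:
        f.write(span.text + '\n')
driver.quit()
text = ' '.join(jieba.lcut(open('reviews.txt', encoding='utf-8').read()))        # 5. segment the text
wc = wordcloud.WordCloud(font_path='C:/Windows/Fonts/simhei.ttf',  # assumed CJK font location
                         background_color='white')
wc.generate(text).to_file('cloud.png')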
2.2.1 Obtaining Web Requests
This example is coded with the selenium library.
Import the library:

from selenium import webdriver

Browser driver:

# Browser driver path
chromedriver = 'E:/software/chromedriver_win32/'
driver = webdriver.Chrome(chromedriver)

Open the page:

driver.get("Fill in the URL here")
2.2.2 Parsing Acquired Web Pages
Press F12 to open the browser's developer tools, locate the element that holds the data you want to extract, and copy its XPath.
2.2.3 Extracting movie review data
The movie review data is extracted with XPath:

driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/p/span')
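The complete script below formats a running index into this XPath to pick up each comment one at a time. An alternative sketch, grabbing all of a page's comments in a single call with the plural find_elements_by_xpath:

# One call returns every matching <span>, so no index formatting is needed.
spans = driver.find_elements_by_xpath('//*[@id="comments"]/div/div[2]/p/span')
for span in spans:
    print(span.text)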
2.2.4 Saving documents
# Create the folder and file
basePathDirectory = "Hudong_Coding"
if not os.path.exists(basePathDirectory):
    os.makedirs(basePathDirectory)
baiduFile = os.path.join(basePathDirectory, "")  # file name elided in the original
# Create the file if it doesn't exist, append if it does
if not os.path.exists(baiduFile):
    info = codecs.open(baiduFile, 'w', 'utf-8')
else:
    info = codecs.open(baiduFile, 'a', 'utf-8')

Write to the txt file:

info.writelines(text + '\r\n')  # 'text' stands for one extracted review string
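Equivalently, the file handling can be wrapped in a context manager, which closes the file even if the crawl raises an exception; a minimal sketch, again with 'text' standing in for one extracted review:

import codecs

# Mode 'a' creates the file when missing and appends when it exists,
# so the separate os.path.exists() check becomes unnecessary.
with codecs.open(baiduFile, 'a', 'utf-8') as info:
    info.write(text + '\r\n')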
2.2.5 Word cloud analysis
The jieba library and the wordcloud library are used for the word cloud analysis.
[Figure in the original post: how to select the path of the text file to load.]
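To make the segmentation step concrete, here is roughly what jieba.lcut produces on a short review-like string; the exact split can vary with the jieba version and dictionary.

import jieba

words = jieba.lcut('看完你好李焕英，笑着笑着就哭了')
print(words)
# e.g. ['看完', '你好', '李焕英', '，', '笑', '着', '笑', '着', '就', '哭', '了']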
2.3 The Complete Code
2.3.1 Crawling code
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import os
import codecs
from selenium import webdriver


# Crawl the review text
def getFilmReview():
    try:
        # Create the folder and file
        basePathDirectory = "DouBan_FilmReview"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "DouBan_FilmReviews.txt")
        # Create the file if it doesn't exist, append if it does
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile, 'w', 'utf-8')
        else:
            info = codecs.open(baiduFile, 'a', 'utf-8')
        # Browser driver path
        chromedriver = 'E:/software/chromedriver_win32/'
        os.environ["webdriver.chrome.driver"] = chromedriver
        driver = webdriver.Chrome(chromedriver)
        # Open the pages one by one
        for k in range(15000):  # about 15,000 pages
            k = k + 1
            g = 2 * k
            # Domain restored by assumption: Douban movie comments page
            driver.get("https://movie.douban.com/subject/34841067/comments?start={}".format(g))
            try:
                # Extract each of the roughly 20 reviews on the page
                for i in range(21):
                    elem = driver.find_element_by_xpath(
                        '//*[@id="comments"]/div[{}]/div[2]/p/span'.format(i + 1))
                    print(elem.text)
                    info.writelines(elem.text + '\r\n')
            except:
                pass
    except Exception as e:
        print('Error:', e)
    finally:
        print('\n')
        driver.close()


# main function
def main():
    print('Start crawling')
    getFilmReview()
    print('End crawl')


if __name__ == '__main__':
    main()
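A practical caveat: Douban generally serves only a limited number of comment pages to visitors who are not logged in, and it may throttle rapid page loads, so in practice the loop tends to stop well short of 15,000 pages. Adding a login step and a short delay between page loads makes the crawl more reliable.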
2.3.2 Word cloud analysis code
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import jieba      # Chinese word segmentation
import wordcloud  # word cloud drawing

# Load the review data
f = open('E:/software/PythonProject/DouBan_FilmReview/DouBan_FilmReviews.txt',
         encoding='utf-8')
txt = f.read()
txt_list = jieba.lcut(txt)
# print(txt_list)
string = ' '.join(txt_list)
print(string)
# Draw the word cloud from the segmented review data
# mk = imageio.imread(r'image path')  # optional mask image
w = wordcloud.WordCloud(width=1000,
                        height=700,
                        background_color='white',
                        font_path='C:/Windows/Fonts/',  # font file name elided in the original
                        # mask=mk,
                        scale=15,
                        stopwords={' '},
                        contour_width=5,
                        contour_color='red')
w.generate(string)
w.to_file('DouBan_FilmReviews.png')
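Two small notes on the parameters: the commented-out mk mask (and the contour_width/contour_color options, which only apply when a mask is set) takes effect only once a mask image is supplied, and stopwords={' '} filters nothing but the bare space token, so extending it with common Chinese stopwords usually yields a cleaner cloud. Also remember to append an actual font file name to font_path (for example a CJK font such as simhei.ttf), or the Chinese text will not render.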
III. Real-time box office collection
3.1 Website selection
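The figures here come from a professional box office dashboard's Ajax interface. The site's domain was stripped from the code below, but the /dashboard-ajax path, the optimusCode parameter, and the /dashboard referer match Maoyan's professional box office dashboard (piaofang.maoyan.com), so that domain is assumed in the restored URLs.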
3.2 Code Writing
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import os
import time
import datetime
import requests


class PF(object):
    def __init__(self):
        # Dashboard Ajax endpoint; domain restored by assumption (Maoyan piaofang dashboard)
        self.url = ('https://piaofang.maoyan.com/dashboard-ajax?orderType=0'
                    '&uuid=173d6dd20a2c8-0559692f1032d2-393e5b09-1fa400-173d6dd20a2c8'
                    '&riskLevel=71&optimusCode=10')
        self.headers = {
            "Referer": "https://piaofang.maoyan.com/dashboard",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
        }

    def main(self):
        while True:
            # Run this file from the dos command line so the screen can be cleared
            os.system('cls')
            result_json = self.get_parse()
            if not result_json:
                break
            results = self.parse(result_json)
            # Get the server time
            calendar = result_json['calendar']['serverTimestamp']
            t = calendar.split('.')[0].split('T')
            t = t[0] + " " + (datetime.datetime.strptime(t[1], "%H:%M:%S")
                              + datetime.timedelta(hours=8)).strftime("%H:%M:%S")
            print('Beijing time:', t)
            x_line = '-' * 155
            # Gross box office
            total_box = result_json['movieList']['data']['nationBoxInfo']['nationBoxSplitUnit']['num']
            # Gross box office units
            total_box_unit = result_json['movieList']['data']['nationBoxInfo']['nationBoxSplitUnit']['unit']
            print(f"Today's gross: {total_box} {total_box_unit}", end=f'\n{x_line}\n')
            print('Movie title'.ljust(14), 'Gross box office'.ljust(11), 'Box office share'.ljust(13),
                  'Seat occupancy'.ljust(11), 'Avg attendance'.ljust(11), 'Screenings'.ljust(12),
                  'Screening share'.ljust(12), 'Cumulative gross'.ljust(11), 'Days released',
                  sep='\t', end=f'\n{x_line}\n')
            for result in results:
                print(
                    result['movieName'][:10].ljust(9),      # movie title
                    result['boxSplitUnit'][:8].rjust(10),   # gross box office
                    result['boxRate'][:8].rjust(13),        # box office share
                    result['avgSeatView'][:8].rjust(13),    # seat occupancy
                    result['avgShowView'][:8].rjust(13),    # average attendance per screening
                    result['showCount'][:8].rjust(13),      # number of screenings
                    result['showCountRate'][:8].rjust(13),  # share of screenings
                    result['sumBoxDesc'][:8].rjust(13),     # cumulative gross
                    result['releaseInfo'][:8].rjust(13),    # days released
                    sep='\t', end='\n\n'
                )
            time.sleep(4)  # refresh roughly every four seconds
            break          # return so the outer loop re-creates the object

    def get_parse(self):
        try:
            response = requests.get(self.url, headers=self.headers)
            if response.status_code == 200:
                return response.json()
        except requests.RequestException as e:
            print("ERROR:", e)
            return None

    def parse(self, result_json):
        if result_json:
            movies = result_json['movieList']['data']['list']
            # seat occupancy, average attendance, box office share, movie title,
            # release info (days), screenings, screening share, gross box office, cumulative total
            ticks = ['avgSeatView', 'avgShowView', 'boxRate', 'movieName',
                     'releaseInfo', 'showCount', 'showCountRate', 'boxSplitUnit', 'sumBoxDesc']
            for movie in movies:
                self.info = {}
                for tick in ticks:
                    # The number and its unit arrive separately, so join them back together
                    if tick == 'boxSplitUnit':
                        movie[tick] = ''.join([str(i) for i in movie[tick].values()])
                    # These two fields sit in a nested dictionary
                    if tick == 'movieName' or tick == 'releaseInfo':
                        movie[tick] = movie['movieInfo'][tick]
                    if movie[tick] == '':
                        movie[tick] = 'This item is empty'
                    self.info[tick] = str(movie[tick])
                yield self.info


if __name__ == '__main__':
    while True:
        pf = PF()
        pf.main()
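A note on the structure: parse is a generator, so each yield hands main one movie's cleaned fields and the table prints row by row without building an intermediate list. The inner break ends main after one full table, and the outer while True then constructs a fresh PF and repeats, so together with time.sleep(4) the dashboard redraws roughly every four seconds.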
3.3 Presentation of results
IV. Crew photo crawling
4.1 Website selection
4.2 Code Writing
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import re

import requests
from bs4 import BeautifulSoup


def get_data(url):
    # Request the gallery page
    resp = requests.get(url)
    # Headers used for the image requests
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    }
    # Decode the fetched HTML bytes into a 'utf-8' string
    html = resp.content.decode('utf-8')
    # BeautifulSoup narrows down the search
    soup = BeautifulSoup(html, 'html.parser')
    # Walk the <a> hyperlinks and keep those pointing at .jpg images
    for link in soup.find_all('a'):
        a = link.get('href')
        if type(a) == str:
            b = re.findall('(.*?)jpg', a)
            try:
                img_url = b[0] + 'jpg'
                print(img_url)
                # Request the image itself
                image = requests.get(img_url, headers=headers).content
                # Save the bytes, naming the file after the last path segment
                img_name = img_url.split('/')[-1]
                with open(r'E:/IMAGES/' + img_name, 'wb') as img_file:
                    img_file.write(image)
            except:
                pass


# Crawl the target page (the gallery domain was elided in the original)
if __name__ == '__main__':
    get_data('https:///newgallery/hdpic/')
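Before running this, two things need attention: the gallery domain in the final get_data(...) call was elided in the original post and must be filled in with the actual site, and the output folder E:/IMAGES/ has to exist beforehand, or be created first with os.makedirs('E:/IMAGES', exist_ok=True).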
4.3 Display of effects
V. Summary
Watching this movie, you laugh at the beginning and cry at the end. Told from a child's perspective, it follows the choices the mother made in love and marriage; through the daughter's observation, we come to see what the mother's happiness really was, and it is not what Jia Ling's character imagines: something to be obtained by marrying the factory director's son. No matter how many times she could choose again, the mother would resolutely pick the life that is right for her rather than the life others consider happy. This indirectly tells us that in pursuing happiness we should go by our own lights instead of living a "happy" life defined by other people's eyes and words; after all, many choices in life come only once.
This concludes this article on analyzing the box office data of the movie "Hello, Li Huanying" with Python crawlers. For more on crawling movie box office data with Python, please search my previous posts or continue browsing the related articles below. I hope you will keep supporting me in the future!