I. Preface
According to the latest figures released on February 23, the Spring Festival film "Hello, Li Huanying" has grossed more than 4.2 billion yuan, catching up with the other New Year releases and becoming the dark horse of 2021.
From actor to director: why has Jia Ling's debut film "Hello, Li Huanying" become such a hit? Below, Rong will walk you through using Python, with the help of a movie website, to analyze the reasons for the film's high box office from several angles.
II. Movie Review Crawling and Word Cloud Analysis
Undoubtedly, Chinese film criticism is changing along with the broader social and cultural context and with shifts in venue and medium. After a century of print-based film criticism, new forms of critical discourse have emerged, each tied to a different mode of distribution: television film criticism, Internet film criticism, and new-media film criticism. The production and dissemination of movie reviews have truly entered an era of democratic pluralism.
The purpose of film criticism is to analyze, judge, and evaluate the aesthetic value, cognitive value, social significance, and cinematic language embedded in a film, and to explain the theme it expresses. By analyzing a film's successes and failures, criticism helps directors broaden their horizons and raise their craft, promoting the prosperity and development of the art of cinema; through analysis and evaluation, it also shapes the audience's understanding and appreciation of a film and raises their level of appreciation, thereby indirectly advancing the art form.
2.1 Website Selection
Python crawler in action: crawling Douban movie review data.
2.2 Crawl Ideas
The steps for crawling the Douban movie review data are as follows (a compact sketch of the whole flow appears after this list):
1. Send the page request
2. Parse the returned web page
3. Extract the movie review data
4. Save the text to a file
5. Run the word cloud analysis
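Before diving into each step, here is a compact, hedged sketch of the whole flow. It assumes Selenium 4, a chromedriver on PATH, and a CJK-capable font at the Windows path shown; the Douban URL and XPath mirror the ones used later in this section.

from selenium import webdriver
from selenium.webdriver.common.by import By
import jieba
import wordcloud

driver = webdriver.Chrome()  # assumes chromedriver and Chrome are installed and on PATH
driver.get('https://movie.douban.com/subject/34841067/comments?start=0')        # 1. send the request
spans = driver.find_elements(By.XPATH, '//*[@id="comments"]/div/div[2]/p/span')  # 2-3. parse and extract
with open('reviews.txt', 'w', encoding='utf-8') as f:                            # 4. save to a file
    for span in spans:
        f.write(span.text + '\n')
driver.quit()
text = ' '.join(jieba.lcut(open('reviews.txt', encoding='utf-8').read()))        # 5. segment the text
wc = wordcloud.WordCloud(font_path='C:/Windows/Fonts/simhei.ttf',  # assumed CJK font location
                         background_color='white')
wc.generate(text).to_file('cloud.png')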
2.2.1 Obtaining Web Requests
This example is coded with the selenium library.
Import the library:

from selenium import webdriver

Browser driver:

# Browser driver path
chromedriver = 'E:/software/chromedriver_win32/'
driver = webdriver.Chrome(chromedriver)

Open the page:

driver.get("Fill in the URL here")
2.2.2 Parsing Acquired Web Pages
Press F12 to open the browser's developer tools, locate the element that holds the data you want to extract, and copy its XPath.
2.2.3 Extracting movie review data
The movie review data is extracted with XPath:

driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/p/span')
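The complete script below formats a running index into this XPath to pick up each comment one at a time. An alternative sketch, grabbing all of a page's comments in a single call with the plural find_elements_by_xpath:

# One call returns every matching <span>, so no index formatting is needed.
spans = driver.find_elements_by_xpath('//*[@id="comments"]/div/div[2]/p/span')
for span in spans:
    print(span.text)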
2.2.4 Saving documents
# Create the folder and file
basePathDirectory = "Hudong_Coding"
if not os.path.exists(basePathDirectory):
    os.makedirs(basePathDirectory)
baiduFile = os.path.join(basePathDirectory, "")  # file name elided in the original
# Create the file if it doesn't exist, append if it does
if not os.path.exists(baiduFile):
    info = codecs.open(baiduFile, 'w', 'utf-8')
else:
    info = codecs.open(baiduFile, 'a', 'utf-8')

Write to the txt file:

info.writelines(text + '\r\n')  # 'text' stands for one extracted review string
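Equivalently, the file handling can be wrapped in a context manager, which closes the file even if the crawl raises an exception; a minimal sketch, again with 'text' standing in for one extracted review:

import codecs

# Mode 'a' creates the file when missing and appends when it exists,
# so the separate os.path.exists() check becomes unnecessary.
with codecs.open(baiduFile, 'a', 'utf-8') as info:
    info.write(text + '\r\n')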
2.2.5 Word cloud analysis
The jieba library and the wordcloud library are used for the word cloud analysis.
[Figure in the original post: how to select the path of the text file to load.]
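To make the segmentation step concrete, here is roughly what jieba.lcut produces on a short review-like string; the exact split can vary with the jieba version and dictionary.

import jieba

words = jieba.lcut('看完你好李焕英，笑着笑着就哭了')
print(words)
# e.g. ['看完', '你好', '李焕英', '，', '笑', '着', '笑', '着', '就', '哭', '了']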
2.3 The Complete Code
2.3.1 Crawling code
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import os
import codecs
from selenium import webdriver


# Crawl the review text
def getFilmReview():
    try:
        # Create the folder and file
        basePathDirectory = "DouBan_FilmReview"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "DouBan_FilmReviews.txt")
        # Create the file if it doesn't exist, append if it does
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile, 'w', 'utf-8')
        else:
            info = codecs.open(baiduFile, 'a', 'utf-8')
        # Browser driver path
        chromedriver = 'E:/software/chromedriver_win32/'
        os.environ["webdriver.chrome.driver"] = chromedriver
        driver = webdriver.Chrome(chromedriver)
        # Open the pages one by one
        for k in range(15000):  # about 15,000 pages
            k = k + 1
            g = 2 * k
            # Domain restored by assumption: Douban movie comments page
            driver.get("https://movie.douban.com/subject/34841067/comments?start={}".format(g))
            try:
                # Extract each of the roughly 20 reviews on the page
                for i in range(21):
                    elem = driver.find_element_by_xpath(
                        '//*[@id="comments"]/div[{}]/div[2]/p/span'.format(i + 1))
                    print(elem.text)
                    info.writelines(elem.text + '\r\n')
            except:
                pass
    except Exception as e:
        print('Error:', e)
    finally:
        print('\n')
        driver.close()


# main function
def main():
    print('Start crawling')
    getFilmReview()
    print('End crawl')


if __name__ == '__main__':
    main()
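A practical caveat: Douban generally serves only a limited number of comment pages to visitors who are not logged in, and it may throttle rapid page loads, so in practice the loop tends to stop well short of 15,000 pages. Adding a login step and a short delay between page loads makes the crawl more reliable.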
2.3.2 Word cloud analysis code
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import jieba      # Chinese word segmentation
import wordcloud  # word cloud drawing

# Load the review data
f = open('E:/software/PythonProject/DouBan_FilmReview/DouBan_FilmReviews.txt',
         encoding='utf-8')
txt = f.read()
txt_list = jieba.lcut(txt)
# print(txt_list)
string = ' '.join(txt_list)
print(string)
# Draw the word cloud from the segmented review data
# mk = imageio.imread(r'image path')  # optional mask image
w = wordcloud.WordCloud(width=1000,
                        height=700,
                        background_color='white',
                        font_path='C:/Windows/Fonts/',  # font file name elided in the original
                        # mask=mk,
                        scale=15,
                        stopwords={' '},
                        contour_width=5,
                        contour_color='red')
w.generate(string)
w.to_file('DouBan_FilmReviews.png')
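Two small notes on the parameters: the commented-out mk mask (and the contour_width/contour_color options, which only apply when a mask is set) takes effect only once a mask image is supplied, and stopwords={' '} filters nothing but the bare space token, so extending it with common Chinese stopwords usually yields a cleaner cloud. Also remember to append an actual font file name to font_path (for example a CJK font such as simhei.ttf), or the Chinese text will not render.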
III. Real-time box office collection
3.1 Website selection
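The figures here come from a professional box office dashboard's Ajax interface. The site's domain was stripped from the code below, but the /dashboard-ajax path, the optimusCode parameter, and the /dashboard referer match Maoyan's professional box office dashboard (piaofang.maoyan.com), so that domain is assumed in the restored URLs.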
3.2 Code Writing
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import os
import time
import datetime
import requests


class PF(object):
    def __init__(self):
        # Dashboard Ajax endpoint; domain restored by assumption (Maoyan piaofang dashboard)
        self.url = ('https://piaofang.maoyan.com/dashboard-ajax?orderType=0'
                    '&uuid=173d6dd20a2c8-0559692f1032d2-393e5b09-1fa400-173d6dd20a2c8'
                    '&riskLevel=71&optimusCode=10')
        self.headers = {
            "Referer": "https://piaofang.maoyan.com/dashboard",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
        }

    def main(self):
        while True:
            # Run this file from the dos command line so the screen can be cleared
            os.system('cls')
            result_json = self.get_parse()
            if not result_json:
                break
            results = self.parse(result_json)
            # Get the server time
            calendar = result_json['calendar']['serverTimestamp']
            t = calendar.split('.')[0].split('T')
            t = t[0] + " " + (datetime.datetime.strptime(t[1], "%H:%M:%S")
                              + datetime.timedelta(hours=8)).strftime("%H:%M:%S")
            print('Beijing time:', t)
            x_line = '-' * 155
            # Gross box office
            total_box = result_json['movieList']['data']['nationBoxInfo']['nationBoxSplitUnit']['num']
            # Gross box office units
            total_box_unit = result_json['movieList']['data']['nationBoxInfo']['nationBoxSplitUnit']['unit']
            print(f"Today's gross: {total_box} {total_box_unit}", end=f'\n{x_line}\n')
            print('Movie title'.ljust(14), 'Gross box office'.ljust(11), 'Box office share'.ljust(13),
                  'Seat occupancy'.ljust(11), 'Avg attendance'.ljust(11), 'Screenings'.ljust(12),
                  'Screening share'.ljust(12), 'Cumulative gross'.ljust(11), 'Days released',
                  sep='\t', end=f'\n{x_line}\n')
            for result in results:
                print(
                    result['movieName'][:10].ljust(9),      # movie title
                    result['boxSplitUnit'][:8].rjust(10),   # gross box office
                    result['boxRate'][:8].rjust(13),        # box office share
                    result['avgSeatView'][:8].rjust(13),    # seat occupancy
                    result['avgShowView'][:8].rjust(13),    # average attendance per screening
                    result['showCount'][:8].rjust(13),      # number of screenings
                    result['showCountRate'][:8].rjust(13),  # share of screenings
                    result['sumBoxDesc'][:8].rjust(13),     # cumulative gross
                    result['releaseInfo'][:8].rjust(13),    # days released
                    sep='\t', end='\n\n'
                )
            time.sleep(4)  # refresh roughly every four seconds
            break          # return so the outer loop re-creates the object

    def get_parse(self):
        try:
            response = requests.get(self.url, headers=self.headers)
            if response.status_code == 200:
                return response.json()
        except requests.RequestException as e:
            print("ERROR:", e)
            return None

    def parse(self, result_json):
        if result_json:
            movies = result_json['movieList']['data']['list']
            # seat occupancy, average attendance, box office share, movie title,
            # release info (days), screenings, screening share, gross box office, cumulative total
            ticks = ['avgSeatView', 'avgShowView', 'boxRate', 'movieName',
                     'releaseInfo', 'showCount', 'showCountRate', 'boxSplitUnit', 'sumBoxDesc']
            for movie in movies:
                self.info = {}
                for tick in ticks:
                    # The number and its unit arrive separately, so join them back together
                    if tick == 'boxSplitUnit':
                        movie[tick] = ''.join([str(i) for i in movie[tick].values()])
                    # These two fields sit in a nested dictionary
                    if tick == 'movieName' or tick == 'releaseInfo':
                        movie[tick] = movie['movieInfo'][tick]
                    if movie[tick] == '':
                        movie[tick] = 'This item is empty'
                    self.info[tick] = str(movie[tick])
                yield self.info


if __name__ == '__main__':
    while True:
        pf = PF()
        pf.main()
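A note on the structure: parse is a generator, so each yield hands main one movie's cleaned fields and the table prints row by row without building an intermediate list. The inner break ends main after one full table, and the outer while True then constructs a fresh PF and repeats, so together with time.sleep(4) the dashboard redraws roughly every four seconds.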
3.3 Presentation of results
IV. Crew photo crawling
4.1 Website selection
4.2 Code Writing
# -*- coding: utf-8 -*-
# !/usr/bin/env python
import re

import requests
from bs4 import BeautifulSoup


def get_data(url):
    # Request the gallery page
    resp = requests.get(url)
    # Headers used for the image requests
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    }
    # Decode the fetched HTML bytes into a 'utf-8' string
    html = resp.content.decode('utf-8')
    # BeautifulSoup narrows down the search
    soup = BeautifulSoup(html, 'html.parser')
    # Walk the <a> hyperlinks and keep those pointing at .jpg images
    for link in soup.find_all('a'):
        a = link.get('href')
        if type(a) == str:
            b = re.findall('(.*?)jpg', a)
            try:
                img_url = b[0] + 'jpg'
                print(img_url)
                # Request the image itself
                image = requests.get(img_url, headers=headers).content
                # Save the bytes, naming the file after the last path segment
                img_name = img_url.split('/')[-1]
                with open(r'E:/IMAGES/' + img_name, 'wb') as img_file:
                    img_file.write(image)
            except:
                pass


# Crawl the target page (the gallery domain was elided in the original)
if __name__ == '__main__':
    get_data('https:///newgallery/hdpic/')
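Before running this, two things need attention: the gallery domain in the final get_data(...) call was elided in the original post and must be filled in with the actual site, and the output folder E:/IMAGES/ has to exist beforehand, or be created first with os.makedirs('E:/IMAGES', exist_ok=True).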
4.3 Display of effects
V. Summary
Watching this movie, you laugh at the beginning and cry at the end. Told from a child's perspective, it follows the choices the mother made in love and marriage; through the daughter's observation, we come to see what the mother's happiness really was, and it is not what Jia Ling's character imagines: something to be obtained by marrying the factory director's son. No matter how many times she could choose again, the mother would resolutely pick the life that is right for her rather than the life others consider happy. This indirectly tells us that in pursuing happiness we should go by our own lights instead of living a "happy" life defined by other people's eyes and words; after all, many choices in life come only once.
This concludes this article on analyzing the box office data of the movie "Hello, Li Huanying" with Python crawlers. For more on crawling movie box office data with Python, please search my previous posts or continue browsing the related articles below. I hope you will keep supporting me in the future!