This article presents an example of using Python's Scrapy framework to crawl Douban movies, shared for your reference as follows:
1. Concepts
Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a range of applications, including data mining, information processing, and storing historical data.
The Python package management tool pip makes it easy to install Scrapy. If you get an error about missing dependencies during installation, install the missing package via pip as well.
pip install scrapy
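If the error names a specific missing package, installing it the same way usually resolves it; for example (the package name below is only an illustration of the pattern, not a dependency named in the original article):

pip install pywin32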
The components of Scrapy are as follows:
Engine (Scrapy Engine): relays signals and data between the other components and coordinates their scheduling.
Scheduler: the engine sends request links to the Scheduler, which queues the Requests and hands the first request in the queue back to the engine when the engine asks for it.
Downloader: after the engine passes a Request link to the Downloader, it downloads the corresponding data from the Internet and hands the returned Responses back to the engine.
Spiders (crawlers): the engine passes the downloaded Responses to the Spiders for parsing, so that the information we need can be extracted from the page. If new url links are found during parsing, the Spiders hand them to the engine to be placed into the Scheduler.
Item Pipeline: data crawled from the page is passed by the spider through the engine to the pipeline for further processing, filtering, storage, and so on.
Downloader Middlewares: custom extensions used to wrap operations such as setting proxies and HTTP request headers when requesting a page.
Spider Middlewares: used to make changes to the Responses entering the Spiders and the Requests going out of them.
The workflow of Scrapy: first we give the entry url to the Spider, which passes it through the engine into the Scheduler. After queuing, the Scheduler returns the first Request, which the engine hands to the Downloader to download. The downloaded data is then given back to the Spider for parsing: part of the parsed result is data, which is passed to the Item Pipeline to be cleaned and stored, and part consists of new url links, which are handed to the Scheduler again, and the cycle of crawling then repeats. A minimal sketch of this yield-driven pattern is shown below.
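The loop described above is driven entirely by what the spider yields. The sketch below is a generic illustration of that pattern (the spider name, URL, and selectors are placeholders, not the Douban spider built later in this article):

import scrapy

class ExampleSpider(scrapy.Spider):
    # Placeholder names used only to illustrate the workflow
    name = 'example'
    start_urls = ['http://example.com/list']  # entry url handed to the Scheduler

    def parse(self, response):
        # Extracted data is yielded as items and flows through the engine to the Item Pipeline
        for title in response.xpath('//h2/text()').extract():
            yield {'title': title}
        # New links are yielded as Requests and flow back to the Scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)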
2. Creating a new Scrapy project
First, open a command line in the folder where the project will be stored and type scrapy startproject followed by the project name. Scrapy will automatically create the Python files the project needs in the current folder. For example, for a project that crawls Douban movies, the directory structure is as follows:
Db_Project/
    scrapy.cfg          -- configuration file for the project
    douban/             -- the project's Python module directory, in which the Python code is written
        __init__.py     -- Python package initialization file
        items.py        -- used to define the Item data structures
        pipelines.py    -- the project's pipelines file
        settings.py     -- defines global settings for the project, e.g. download delay, concurrency
        spiders/        -- package directory that holds the spider code
            __init__.py
            ...
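The layout above is what scrapy startproject generates; with the project name Db_Project from the listing, the command is:

scrapy startproject Db_Project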
Afterwards, go into the spiders directory and enter scrapy genspider followed by the crawler name and the domain name. This generates a spider file, in which the crawling logic, extraction rules, and so on will be defined later.
scrapy genspider douban movie.douban.com
3. Defining data
The URL of the Douban movie list to crawl is https://movie.douban.com/top250. Each movie entry on that page carries several pieces of information.
We want to crawl the key information in each entry: ranking, name, introduction, star rating, number of comments, and description. So we first need to define these objects in the items.py file, similar to an ORM, defining a data type for each field through the scrapy.Field() method:
import scrapy

class DoubanItem(scrapy.Item):
    ranking = scrapy.Field()    # Ranking
    name = scrapy.Field()       # Movie title
    introduce = scrapy.Field()  # Introduction
    star = scrapy.Field()       # Star rating
    comments = scrapy.Field()   # Number of comments
    describe = scrapy.Field()   # Description
4. Data Crawling
Open the spider file created earlier in the spiders folder; its contents are shown below. Three variables and one method are created automatically: the returned response data is processed in the parse method, and we need to provide the crawler's entry address in start_urls. Note that the crawler automatically filters out domains other than those in allowed_domains, so pay attention to the value of this variable.
# spiders/douban.py
import scrapy

class MovieSpider(scrapy.Spider):
    # Crawler name
    name = 'movie'
    # Domains allowed to be crawled
    allowed_domains = ['movie.douban.com']
    # Entry url
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        pass
Before crawling the data, you should first do a little disguising of the requests: find the USER_AGENT variable in the settings.py file and modify it as follows:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
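Two other options in the same settings.py file are worth knowing about here, since they control the download delay and concurrency mentioned in the directory listing above; the values shown are illustrative assumptions, not settings taken from the original project:

# settings.py -- illustrative values, adjust as needed
DOWNLOAD_DELAY = 2        # seconds to wait between consecutive requests to the same site
CONCURRENT_REQUESTS = 8   # maximum number of concurrent requests (Scrapy's default is 16)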
The spider can be started from the command line with scrapy crawl followed by the spider's name attribute, here scrapy crawl movie; alternatively, write a startup file like the following and run it:
from scrapy import cmdline

cmdline.execute('scrapy crawl movie'.split())
The next step is to filter the crawled data. XPath rules let us conveniently select specified elements in a web page: each movie entry is wrapped in a <li> tag under <ol class="grid_view">, so the xpath //ol[@class='grid_view']/li selects all the movie entries on the page. The xpath value of an element can be obtained via an XPath plugin for Google Chrome or ChroPath for Firefox: right-clicking an element in the browser opens the developer tools, with the ChroPath panel on the far right, which shows the element's xpath value, e.g. //div[@id='wrapper']//li.
The xpath() method of the response object received by the spider can handle such xpath rule strings directly and return the matching page content as Selector objects, whose content can be refined further with additional xpath selections to pick out each movie's name, introduction, rating, star level, and so on, i.e. the fields of the DoubanItem data structure defined earlier. We loop through the movie list, crawl the exact movie information from each entry, save it as a DoubanItem object item, and finally return the item object from the Spider to the Item Pipeline with yield.
Besides extracting Item data from the page, the spider also crawls the url link that forms the Request for the next page. At the bottom of the Douban page, the link to the second page carries the parameter "?start=25&filter=", which, spliced together with the site address https://movie.douban.com/top250, gives the address of the next page. As above, this content is extracted via xpath, and if it is not empty, the Request built from the spliced url is submitted to the scheduler with yield.
The final crawler file is as follows
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem

class MovieSpider(scrapy.Spider):
    # Crawler name
    name = 'movie'
    # Domains allowed to be crawled
    allowed_domains = ['movie.douban.com']
    # Entry url
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # First grab the list of movies
        movie_list = response.xpath("//ol[@class='grid_view']/li")
        for selector in movie_list:
            # Iterate through each movie entry, grab the exact information needed and save it as an item object
            item = DoubanItem()
            item['ranking'] = selector.xpath(".//div[@class='pic']/em/text()").extract_first()
            item['name'] = selector.xpath(".//span[@class='title']/text()").extract_first()
            text = selector.xpath(".//div[@class='bd']/p[1]/text()").extract()
            intro = ""
            for s in text:                    # Put the introduction into one string
                intro += "".join(s.split())   # Remove whitespace
            item['introduce'] = intro
            item['star'] = selector.css('.rating_num::text').extract_first()
            item['comments'] = selector.xpath(".//div[@class='star']/span[4]/text()").extract_first()
            item['describe'] = selector.xpath(".//span[@class='inq']/text()").extract_first()
            # print(item)
            yield item  # Return the resulting item object to the Item Pipeline

        # Grab the next-page url from the page
        next_link = response.xpath("//span[@class='next']/a[1]/@href").extract_first()
        if next_link:
            next_link = "https://movie.douban.com/top250" + next_link
            print(next_link)
            # Submit the Request to the scheduler
            yield scrapy.Request(next_link, callback=self.parse)
XPath selectors
/ means to search the next level (direct children) of the current node, while // means to search any level of descendants under the current node.
By default the search starts from the root; . represents searching from the current node, @ is followed by a tag attribute, and the text() function takes out the text content.
//div[@id='wrapper']//li Represents finding the div tag with the id wrapper first, starting at the root, and then taking out all the li tags under it
.//div[@class='pic']/em[1]/text() Represents, starting from the current selector's node, taking the text content of the first em tag under the divs whose class is pic
string(//div[@id='endText']/p[position()>1]) Represents taking the text content of all p tags after the first one (i.e. from the second p tag onwards) under the div whose id is endText
/bookstore/book[last()-2] Selects the third-to-last book element that is a child of bookstore.
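A convenient way to try such rules out interactively (not part of the original steps, but a standard Scrapy tool) is the scrapy shell; run it inside the project directory so that the USER_AGENT configured earlier is applied:

scrapy shell https://movie.douban.com/top250
>>> movie_list = response.xpath("//ol[@class='grid_view']/li")
>>> movie_list[0].xpath(".//span[@class='title']/text()").extract_first()  # title of the first entry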
CSS selectors
Elements within a page can also be selected with CSS selectors, which express the element to pick using CSS syntax plus pseudo-elements such as ::text; they are used as follows:
# Select the text of the p tags under the div whose class is left
response.css('.left p::text').extract_first()
# Select the text of the element with class star under the element whose id is tag
response.css('#tag .star::text').extract_first()
5. Saving the data
When running the spider, the -o parameter specifies a file to save the results to; depending on the file extension, the data is saved as JSON or CSV, for example:
scrapy crawl movie -o douban.csv
The obtained Item data can be processed further in the pipelines.py file, for example to save it to a database with Python, as sketched below.
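As a rough sketch of that idea (the DoubanPipeline class name and the SQLite storage are assumptions for illustration, not code from the original article), a pipeline in pipelines.py could write each item into a local database like this:

# pipelines.py -- illustrative sketch
import sqlite3

class DoubanPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the database and create the table
        self.conn = sqlite3.connect('douban.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS movie '
                          '(ranking TEXT, name TEXT, introduce TEXT, '
                          'star TEXT, comments TEXT, description TEXT)')

    def process_item(self, item, spider):
        # Called for every item yielded by the spider: insert it as one row
        self.conn.execute('INSERT INTO movie VALUES (?, ?, ?, ?, ?, ?)',
                          (item['ranking'], item['name'], item['introduce'],
                           item['star'], item['comments'], item['describe']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()

For the pipeline to receive items it also has to be registered in settings.py, e.g. ITEM_PIPELINES = {'douban.pipelines.DoubanPipeline': 300}, where douban is the project module name from the layout above.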
6. Middleware settings
Sometimes, to cope with a website's anti-crawler mechanisms, the download middleware needs some disguise settings, including using IP proxies and proxy user agents. Here we use the user-agent approach: create a new user_agent class in the middlewares file for the request headers, put some commonly used user agents collected from the Internet into the USER_AGENT_LIST list, then randomly select one with the random function and set it as the User-Agent field of the request headers:
import random

class user_agent(object):
    def process_request(self, request, spider):
        # user agent list
        USER_AGENT_LIST = [
            'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
            'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
            'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
            'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
            'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
            'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
            'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
            'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
        ]
        # Randomly select one user agent from the list above
        agent = random.choice(USER_AGENT_LIST)
        # Set the User-Agent field of the request headers
        request.headers['User-Agent'] = agent
In the settings.py file, enable the download middleware by removing the comments from the DOWNLOADER_MIDDLEWARES section, registering the user_agent class there, and setting its priority; the smaller the number, the higher the priority. A sketch of what that section looks like follows.
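Assuming the user_agent class above is placed in the project's middlewares.py (the original does not name the file), the uncommented section in settings.py would look roughly like this, where 543 is simply the priority value suggested by Scrapy's template:

DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.user_agent': 543,
}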
I hope this article is helpful to you in your Python programming.