Updated on 2024-12-19

Practicing Python's crawler framework Scrapy: crawling the Douban Movie Top 250

Installing and Deploying Scrapy

Before installing Scrapy, make sure you have Python installed (Scrapy currently supports Python 2.5, 2.6 and 2.7). The official documentation describes three installation methods; I used easy_install. First download the Windows version of setuptools (download address: /pypi/setuptools) and just click NEXT all the way through the installer.
After installing setuptools, open CMD and run the command:

easy_install -U Scrapy

Alternatively, you can install with pip (available at /pypi/pip).
The command to install Scrapy using pip is:

pip install Scrapy

If you have previously installed Visual Studio 2008 or Visual Studio 2010 on your computer, then all is well and Scrapy is now installed. If instead you get an "Unable to find vcvarsall.bat" error, you need a workaround: either install Visual Studio and then reinstall Scrapy, or use the following method:
First install MinGW (download address: /projects/mingw/files/). In the MinGW installation directory find the bin folder, locate mingw32-make.exe, and make a renamed copy of it called cc.exe;
Add the MinGW path to the PATH environment variable. For example, if MinGW is installed to D:\MinGW\, add D:\MinGW\bin to PATH;
Open a command-line window and change to the directory of the package you want to install;
Run the command setup.py install build --compiler=mingw32 to install it.

If you get the error "'xslt-config' is not an internal or external command, nor a runnable program or batch file", the cause is usually that lxml was not installed successfully. Just go to /simple/lxml/, download an exe installer and install it.
With that out of the way, we can get to the point.

Creating a New Project
Let's use a crawler to grab information about the movies in the Douban Movie Top 250. Before we start, we create a new Scrapy project. Since I'm using Win7, I open CMD, change to the directory where I want to keep the code, and execute:

D:\WEB\Python>scrapy startproject doubanmoive

This command creates a new directory, doubanmoive, in the current directory with the following structure:

D:\WEB\Python\doubanmoive>tree /f
Folder PATH listing for volume Data
Volume serial number is 00000200 34EC:9CB9
D:.
│ scrapy.cfg
│
└─doubanmoive
 │ items.py
 │ pipelines.py
 │ settings.py
 │ __init__.py
 │
 └─spiders
   __init__.py

These files serve the following purposes:

  • doubanmoive/items.py: defines the content fields to be fetched, similar to an entity class.
  • doubanmoive/pipelines.py: the project pipeline file, used to process data crawled by the Spider.
  • doubanmoive/settings.py: the project configuration file.
  • doubanmoive/spiders: the directory for spiders.

Defining an Item

An Item is the container used to hold the scraped data; it is rather like an entity class in Java (Entity). Open doubanmoive/items.py and you can see the following code created by default.

from scrapy.item import Item, Field

class DoubanmoiveItem(Item):
  pass

We just need to add the fields we want to capture to the DoubanmoiveItem class, such as name=Field(). The completed code, based on our needs, looks like this:

from scrapy.item import Item, Field

class DoubanmoiveItem(Item):
 name=Field()            # Movie title
 year=Field()            # Year of release
 score=Field()           # Douban score
 director=Field()        # Director
 classification=Field()  # Genre
 actor=Field()           # Actors
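
An Item behaves much like a Python dict, which is why the pipelines later index it with item['name'][0]. A minimal interactive sketch (the sample values here are invented for illustration):

from doubanmoive.items import DoubanmoiveItem

item = DoubanmoiveItem()
item['name'] = [u'The Shawshank Redemption']  # extract() always returns a list
item['score'] = [u'9.6']
print item['name'][0]  # fields are read back like dict keys
print dict(item)       # an Item converts to a plain dict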

Writing a Spider

The Spider is the core class of the whole project; in it we define the crawling targets (domain, URLs) and the crawling rules. The tutorials in the official Scrapy docs are based on BaseSpider, but BaseSpider can only crawl a given list of URLs and cannot expand outward from an initial URL. Besides BaseSpider, however, there are many classes that inherit from Spider, such as scrapy.contrib.spiders.CrawlSpider.

Create a new moive_spider.py file in the doubanmoive/spiders directory and fill in the code.

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from doubanmoive.items import DoubanmoiveItem

class MoiveSpider(CrawlSpider):
 name="doubanmoive"
 allowed_domains=["movie.douban.com"]
 start_urls=["http://movie.douban.com/top250"]
 rules=[
  Rule(SgmlLinkExtractor(allow=(r'/top250\?start=\d+.*'))),
  Rule(SgmlLinkExtractor(allow=(r'/subject/\d+')),callback="parse_item"),  
 ]

 def parse_item(self,response):
  sel=Selector(response)
  item=DoubanmoiveItem()
  item['name']=sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
  item['year']=sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
  item['score']=sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
  item['director']=sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
  item['classification']=sel.xpath('//span[@property="v:genre"]/text()').extract()
  item['actor']=sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
  return item

Code explanation: MoiveSpider inherits from Scrapy's CrawlSpider. The meaning of name, allowed_domains and start_urls is clear from their names; rules is a little more complex. It defines the URL crawling rules: links matching the allow regular expression are added to the Scheduler. By analyzing the pagination URL of the Douban Movie Top 250, /top250?start=25&filter=&type=, we get the following rule:

Rule(SgmlLinkExtractor(allow=(r'/top250\?start=\d+.*'))),
What we really want to crawl is each movie's detail page. The Shawshank Redemption, for example, has the link /subject/1292052/, so only the number after subject/ changes, which gives the regular expression in the rule below. Since this is the type of link whose content we need to grab, we add a callback attribute that hands the Response to the parse_item function for processing.

Rule(SgmlLinkExtractor(allow=(r'/subject/\d+')),callback="parse_item"),     
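
To sanity-check this pattern before running the crawler, here is a quick throwaway snippet (the sample paths mirror the links discussed above):

import re

pattern = re.compile(r'/subject/\d+')
print bool(pattern.search('/subject/1292052/'))         # True: detail page, handled by parse_item
print bool(pattern.search('/top250?start=25&filter='))  # False: pagination link, only followed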
The processing logic in the parse_item function is very simple: for each link that matches the rule, fetch its page, extract the content according to certain rules, assign it to the item and return it to the Item Pipeline. To grab the content of most tags without writing complex regular expressions, we can use XPath. XPath is a language for finding information in XML documents, but it also works on HTML. The table below lists the most commonly used expressions.

Expression  Description
nodename    Selects all children of the named node.
/           Selects from the root node.
//          Selects matching nodes anywhere in the document, regardless of their position.
.           Selects the current node.
..          Selects the parent of the current node.
@           Selects attributes.

For example, //*[@id="content"]/h1/span[1]/text() gets the text content of the first span under the h1 element under any element whose id is content. We can obtain the XPath expression for a piece of content through the Chrome Developer Tools (F12): right-click the content you want to crawl and choose Inspect, the developer tools pane opens and locates the element, then right-click the highlighted element and select Copy XPath.
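
To see how Selector and these XPath expressions behave without hitting Douban at all, here is a small standalone sketch against a hand-written HTML fragment that only mimics the structure the spider expects (the fragment and its values are invented):

from scrapy.selector import Selector

html = '''
<div id="content">
 <h1>
  <span>The Shawshank Redemption</span>
  <span>(1994)</span>
 </h1>
</div>
'''

sel = Selector(text=html)
print sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()        # [u'The Shawshank Redemption']
print sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)') # [u'1994']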


Storing the Data

Once the crawler has obtained the data, we need to store it in a database. As mentioned above, this is handled by the project pipeline (pipeline), which typically performs the following operations (a minimal skeleton follows the list):

  • Cleaning HTML data
  • Validation of the parsed data (checking that the item contains the necessary fields)
  • Check for duplicate data (delete if duplicates)
  • Store parsed data in a database
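
As a rough illustration of the first three duties (the field names follow our Item; the duplicate check here is just an in-memory set, unlike the database lookups used in the real pipelines below):

from scrapy.exceptions import DropItem

class ExamplePipeline(object):
    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        # Validate: make sure the required field was extracted
        if not item.get('name'):
            raise DropItem("Missing name in %s" % item)
        # De-duplicate: drop items already seen in this run
        name = item['name'][0]
        if name in self.seen_names:
            raise DropItem("Duplicate movie: %s" % name)
        self.seen_names.add(name)
        # Storage would happen here (see the MySQL and MongoDB pipelines below)
        return item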

Since the data we get comes in a variety of formats, some of which are not convenient to store in a relational database, I wrote a MongoDB pipeline in addition to the MySQL version.

MySQL version:

# -*- coding: utf-8 -*-
from scrapy import log
from twisted.enterprise import adbapi
from scrapy.http import Request

import MySQLdb
import MySQLdb.cursors


class DoubanmoivePipeline(object):
 def __init__(self):
  # Twisted's adbapi provides an asynchronous connection pool to MySQL
  self.dbpool = adbapi.ConnectionPool('MySQLdb',
    db = 'python',
    user = 'root',
    passwd = 'root',
    cursorclass = MySQLdb.cursors.DictCursor,
    charset = 'utf8',
    use_unicode = False
  )
 def process_item(self, item, spider):
  query = self.dbpool.runInteraction(self._conditional_insert, item)
  query.addErrback(self.handle_error)
  return item

 def _conditional_insert(self,tx,item):
  # Only insert the movie if it is not already in the table
  tx.execute("select * from doubanmoive where m_name= %s",(item['name'][0],))
  result=tx.fetchone()
  log.msg(result,level=log.DEBUG)
  print result
  if result:
   log.msg("Item already stored in db:%s" % item,level=log.DEBUG)
  else:
   # Join the genre and actor lists into '/'-separated strings
   classification=actor=''
   lenClassification=len(item['classification'])
   lenActor=len(item['actor'])
   for n in xrange(lenClassification):
    classification+=item['classification'][n]
    if n<lenClassification-1:
     classification+='/'
   for n in xrange(lenActor):
    actor+=item['actor'][n]
    if n<lenActor-1:
     actor+='/'

   tx.execute(\
    "insert into doubanmoive (m_name,m_year,m_score,m_director,m_classification,m_actor) values (%s,%s,%s,%s,%s,%s)",\
    (item['name'][0],item['year'][0],item['score'][0],item['director'][0],classification,actor))
   log.msg("Item stored in db: %s" % item, level=log.DEBUG)

 def handle_error(self, e):
  log.err(e)
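
This pipeline assumes the doubanmoive table already exists; the article does not show its schema, so here is one possible way to create it with MySQLdb, with column types that are my own guess based purely on the insert statement above:

import MySQLdb

conn = MySQLdb.connect(db='python', user='root', passwd='root', charset='utf8')
cur = conn.cursor()
cur.execute("""
 CREATE TABLE IF NOT EXISTS doubanmoive (
  id INT AUTO_INCREMENT PRIMARY KEY,
  m_name VARCHAR(255),
  m_year VARCHAR(16),
  m_score VARCHAR(16),
  m_director VARCHAR(255),
  m_classification VARCHAR(255),
  m_actor VARCHAR(255)
 )
""")
conn.commit()
conn.close()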

MongoDB version:

# -*- coding: utf-8 -*-
import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log

class MongoDBPipeline(object):
 #Connect to the MongoDB database
 def __init__(self):
  connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
  db = connection[settings['MONGODB_DB']]
  self.collection = db[settings['MONGODB_COLLECTION']]

 def process_item(self, item, spider):
  #Remove invalid data
  valid = True
  for data in item:
   # drop the item if any of its fields came back empty
   if not item.get(data):
    valid = False
    raise DropItem("Missing %s in %s" % (data, item))
  if valid:
   #Insert data into database
   new_moive=[{
    "name":item['name'][0],
    "year":item['year'][0],
    "score":item['score'][0],
    "director":item['director'],
    "classification":item['classification'],
    "actor":item['actor']
   }]
   self.collection.insert(new_moive)
   log.msg("Item wrote to MongoDB database %s/%s" %
    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
    level=log.DEBUG, spider=spider)
  return item

As you can see, the basic flow of the two pipelines is the same. The inconvenient part with MySQL is that list-type data has to be joined into a delimiter-separated string, whereas MongoDB can store List, Dict and other types directly.

Configuration File

You also need to add some configuration information in doubanmoive/settings.py before running the crawler.

BOT_NAME = 'doubanmoive'
SPIDER_MODULES = ['doubanmoive.spiders']
NEWSPIDER_MODULE = 'doubanmoive.spiders'
ITEM_PIPELINES={
 'doubanmoive.mongo_pipelines.MongoDBPipeline':300,
 'doubanmoive.pipelines.DoubanmoivePipeline':400,
}
LOG_LEVEL='DEBUG'

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'python'
MONGODB_COLLECTION = 'test'

ITEM_PIPELINES defines both the MySQL and the MongoDB pipeline; the number after each one is its execution priority, in the range 0-1000. The DOWNLOAD_DELAY and related settings in the middle are there to keep the crawler from being banned by Douban by adding some random delay, a browser User-Agent and so on. The last block is the MongoDB configuration; MySQL connection settings could be written in the same way.

At this point, the crawler for grabbing Douban movies is complete. Execute scrapy crawl doubanmoive on the command line and let the spider start crawling!
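
Once the run finishes, you can spot-check what landed in MongoDB. A small sketch using the same connection values as settings.py above (the database and collection names come from that configuration):

import pymongo

connection = pymongo.Connection('localhost', 27017)
collection = connection['python']['test']

print collection.count()            # how many movies were stored
for moive in collection.find().limit(5):
    print moive['name'], moive['score']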