This article walks through an example of incremental crawling with Python's scrapy framework and analyses how it is implemented. The example code is covered in detail, so it should be a useful reference for study or work for anyone who needs it.
When I first got into crawlers I was still a Python beginner, working with requests, bs4 and pandas. Later I came across scrapy and built one or two crawlers with it. I thought the framework was much better, but unfortunately I kept no notes and forgot it all. Now I need to crawl articles from a certain site to build a recommendation system, so I have picked scrapy up again and am taking the opportunity to write things down.
The directory is as follows:
- Environment
- Local window debugging and running
- Project directory
- The XPath selector
- A simple incremental crawler example
- Configuration introduction
Environment
For my own environment I naturally use anaconda (another plug for how good anaconda is).
Local window debugging and running
During development you can use the debugging shell that scrapy provides to simulate requests, so that the request and response stay the same as they will be in the later spider code.

```
# Test a request against a website
scrapy shell URL

# Set the request header
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0" URL

# Specify the output file format for the crawled content (json, csv, etc.)
scrapy crawl SPIDER_NAME -o FILE_NAME.csv

# Create a crawler project
scrapy startproject articles  # creates a scrapy project
```
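Once the shell has fetched a page it drops you into an interactive Python session with `response` already defined, so you can try expressions before committing them to a spider. A minimal sketch (the URL and XPath here are placeholders, not from the original article):

```python
# Inside `scrapy shell URL` -- `response` is already defined for you
response.status                                   # HTTP status code of the fetched page
response.xpath("//title/text()").extract_first()  # quick check of an XPath expression
fetch("https://example.com/other-page")           # re-fetch a different URL in the same session
```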
Introduction to the new project structure

```
articles                        # project root created by `scrapy startproject articles`
├── articles
│   ├── __init__.py
│   ├── items.py                # defines the format of the data output
│   ├── middlewares.py          # sets request details (request headers, etc.)
│   ├── pipelines.py            # pipeline for data output; every packaged item passes through here
│   ├── settings.py             # global settings for the project (stores the configuration)
│   └── spiders                 # all crawlers are stored under the spiders directory
│       ├── healthy_living.py
│       ├── __init__.py
│       └── people_health.py
└── scrapy.cfg
```
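The spider shown later instantiates an `ArticlesItem`, so here is a minimal sketch of what `items.py` might look like. The field names `tag`, `title`, `date` and `article_url` come from the spider and pipeline code in this article; the `content` field is an assumption for the article body.

```python
# items.py -- a minimal sketch of the item definition used by the spider below.
# The exact fields are assumptions inferred from the spider and pipeline code.
import scrapy


class ArticlesItem(scrapy.Item):
    tag = scrapy.Field()          # category tag passed down via meta
    title = scrapy.Field()        # article title (used by the pipeline's dedup query)
    date = scrapy.Field()         # publication date (used by the pipeline's dedup query)
    article_url = scrapy.Field()  # link to the article, carried through meta
    content = scrapy.Field()      # article body (assumed field)
```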
Page parsing power tool: the XPath selector
scrapy comes with its own XPath selectors, which are very convenient. Below is a brief introduction to some commonly used patterns.

```python
# The whole-site crawling workhorse: LinkExtractor. It automatically collects all
# urls and text under a given tag (most pages of a site follow the same structure).
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths="//ul[@class='nav2_UL_1 clearFix']")
# le.extract_links(response) returns a list of links; by looping over it
# (for link in le.extract_links(response)) you get link.url and link.text

# Get the text content of all div tags whose class attribute is 'aa'
response.xpath("//div[@class='aa']/text()").extract()
# '//' selects all matches, '/' selects only the first one; other tags and attributes (e.g. ul) work the same way

# Get the links contained in all a tags whose text contains "Next Page" (the next-page link extractor)
response.xpath("//a[contains(text(),'Next Page')]/@href").extract()
```
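As a quick, self-contained illustration (the HTML fragment and URL below are made up, not taken from the target site), you can exercise `LinkExtractor` and XPath against a hand-built `HtmlResponse` without hitting a real server:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A made-up page fragment mimicking the nav structure used in this article
html = b"""
<html><body>
  <ul class="nav2_UL_1 clearFix">
    <li><a href="/healthyLiving/">Healthy Living</a></li>
    <li><a href="/peopleHealth/">People Health</a></li>
  </ul>
</body></html>
"""
response = HtmlResponse(url="http://example.com/", body=html, encoding="utf-8")

le = LinkExtractor(restrict_xpaths="//ul[@class='nav2_UL_1 clearFix']")
for link in le.extract_links(response):
    print(link.url, link.text)   # absolute url and anchor text of each link

print(response.xpath("//ul/li/a/text()").extract())
```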
A simple incremental crawl example
The idea behind incremental crawling here is very simple: the target site's data is arranged by time, so before requesting a link, first check whether that record is already in the database. If it is, stop crawling; if it is not, issue the request.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor

# The data container is imported from items.py
from ..items import ArticlesItem


class HealthyLiving(scrapy.Spider):
    # The crawler name must be globally unique; you specify it when starting from the command line
    name = "healthy_living"
    # The crawler entry points; scrapy supports multiple entry points, so this must be a list
    start_urls = ['/healthyLiving/']

    '''
    Crawl the entry links of the major category tags
    '''
    def parse(self, response):
        le = LinkExtractor(restrict_xpaths="//ul[@class='nav2_UL_1 clearFix']")
        for link in le.extract_links(response)[1:-1]:
            tag = link.text
            # Pass the information extracted at this level to the next level via meta
            # (here it is used to tag the data)
            meta = {"tag": tag}
            # Parse each link in turn and hand it to the next level for further crawling
            yield scrapy.Request(link.url, callback=self.parse_articles, meta=meta)

    '''
    Crawl the article links and the next-page link on each listing page
    '''
    def parse_articles(self, response):
        # Receive the information passed down from the previous level
        meta = response.meta
        article_links = response.xpath("//div[@class='txt']/h4/a/@href").extract()
        for link in article_links:
            # self.collection is assumed to be a pymongo collection handle set up elsewhere
            res = self.collection.find_one({"article_url": link}, {"article_url": 1})
            full_meta = dict(meta)
            # Pass the article link down to the next level
            full_meta.update({"article_url": link})
            if res is None:
                yield scrapy.Request(link, callback=self.parse_article, meta=full_meta)
            else:
                # The article is already in the database, so stop here
                return
        next_page = response.xpath("//div[@class='page']//a[contains(text(),'»')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse_articles, meta=meta)

    # Finally, parse the article page itself and output the item
    def parse_article(self, response):
        # ArticlesItem defines the data output format (see items.py)
        article_item = ArticlesItem()
        meta = response.meta
        # Extract the page information with xpath and pack it into the item
        try:
            article_item["tag"] = meta["tag"]
            # ... omitted
        finally:
            yield article_item
```
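The spider above assumes it already holds a pymongo collection handle (`self.collection`) for the dedup lookup; the original article does not show where that comes from. One way to wire it up, as a sketch that reuses the setting names introduced in the configuration section below, is to open the connection when the crawler builds the spider:

```python
# Sketch: giving the spider a MongoDB handle for the incremental check.
# MONGO_URI / MONGO_DB / MONGO_COLLECTION are the settings shown later in this article.
import pymongo
import scrapy


class HealthyLiving(scrapy.Spider):
    name = "healthy_living"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        client = pymongo.MongoClient(crawler.settings.get("MONGO_URI"))
        db = client[crawler.settings.get("MONGO_DB")]
        spider.collection = db[crawler.settings.get("MONGO_COLLECTION")]
        return spider
```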
Project configuration introduction
Set the request headers and configure the database.

```python
# middlewares.py -- set the request headers here and enable the middleware in settings.py
import random


class RandomUA(object):
    user_agents = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit"
        "/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit"
        "/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16"
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)


# pipelines.py -- handle writing the data to the database here and enable the pipeline in settings.py
import pymongo
from datetime import datetime


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        print("Start crawling", datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # MONGO_COLLECTION is assumed to be read from the project settings
        data = self.db[MONGO_COLLECTION].find_one({"title": item["title"], "date": item["date"]})
        if data is None:
            self.db[MONGO_COLLECTION].insert_one(dict(item))
        # else:
        #     self.close_spider(self, spider)
        return item

    def close_spider(self, spider):
        print("Crawl ends", datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        self.client.close()


# settings.py -- enable the request-header middleware and the database pipeline
DOWNLOADER_MIDDLEWARES = {
    # 'articles.middlewares.ArticlesDownloaderMiddleware': 543,
    'articles.middlewares.RandomUA': 543,  # 543 is the priority; the lower the number, the higher the priority
}
ITEM_PIPELINES = {
    'articles.pipelines.MongoPipeline': 300,
}

# Some other configuration
ROBOTSTXT_OBEY = True            # whether to obey the site's robots.txt
FEED_EXPORT_ENCODING = 'utf-8'   # encoding of the exported data

# Database configuration
MONGO_URI = ''
MONGO_DB = ''
MONGO_PORT = 27017
MONGO_COLLECTION = ''
```
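With the settings in place you would normally start the crawl from the command line (`scrapy crawl healthy_living`). If you prefer to launch it from a Python script instead, a minimal sketch using scrapy's own CrawlerProcess looks like this; it assumes the script sits in the project root next to scrapy.cfg so that the project settings can be found.

```python
# run.py -- sketch for launching the spider programmatically; assumed to live
# next to scrapy.cfg so get_project_settings() picks up settings.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("healthy_living")   # the globally unique spider name defined above
process.start()                   # blocks until the crawl finishes
```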
That is all the content of this article. I hope it is helpful for your study, and I hope you will continue to support me.