
Setting the Crawl Depth in Scrapy (Python)

Scrapy is a powerful Python crawler framework that lets developers scrape data from websites with very little code. To control a crawler's behavior, Scrapy provides many configuration options, and crawl depth is a key one: it determines how many levels of links the crawler follows from the starting URL. Setting the crawl depth sensibly helps you optimize crawler efficiency, avoid unnecessary deep crawls, and keep the crawler from falling into endless link loops.

This article explains in detail how to set the crawl depth in Scrapy, and how to control and monitor it to improve crawler performance and data quality.

1. What is crawl depth?

Crawl depth refers to how many levels of links the crawler follows recursively, starting from the initial (seed) URL. Suppose you have the following page hierarchy:

Page 1 (Initial URL)
├── Page 2 (Depth 1)
│   ├── Page 3 (Depth 2)
│   └── Page 4 (Depth 2)
└── Page 5 (Depth 1)

In this example:

  • Page 1 is the initial page, with a depth of 0.
  • Page 2 and Page 5 are linked directly from the initial page, with a depth of 1.
  • Page 3 and Page 4 are linked from Page 2, with a depth of 2.

By controlling the crawl depth, you can limit how many levels the crawler recurses through, avoiding overly deep link chains and crawling data more efficiently.

2. Why control the crawl depth?

Controlling the crawl depth helps manage crawler performance. Here are a few common reasons:

  • Avoid deep link loops: Many websites contain deep link loops; without a depth limit, a crawler may crawl indefinitely, wasting time and resources.
  • Improve crawling efficiency: The core content of a site usually lives at shallow levels, while deep pages are often irrelevant or unimportant. Limiting depth improves the efficiency of data collection.
  • Prevent dead loops: A depth limit keeps the crawler from getting lost in dynamically generated page structures.
  • Reduce data volume: Depth control avoids crawling too many unnecessary pages, which is especially helpful for reducing redundancy in large-scale crawls.

3. How to set the crawl depth in Scrapy?

Scrapy provides several key configuration items for controlling the crawl depth:

  • DEPTH_LIMIT: Limits the maximum crawl depth.
  • DEPTH_STATS: Enables depth statistics so you can monitor how deep the crawl goes.
  • DEPTH_PRIORITY: Together with the scheduler queue settings, controls the crawl order (depth-first or breadth-first).

3.1 Use DEPTH_LIMIT to set the maximum crawl depth

DEPTH_LIMIT is the Scrapy configuration item that limits the crawler's maximum crawl depth. You set it in settings.py. By default, Scrapy has no depth limit (DEPTH_LIMIT = 0, i.e. unlimited depth); set this parameter if you want to cap how deep the crawl goes.

Example: Restrict the crawler to at most 3 levels of pages

In the settings.py file of the Scrapy project, add the following configuration:

# settings.py

# Set the maximum crawl depth to 3
DEPTH_LIMIT = 3

In this way, the crawler will only follow links up to 3 levels away from the initial URL. If it starts from Page 1 (depth 0), it will crawl pages at depth 3 at most; requests that would go deeper are silently dropped.
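If you want to see the depth Scrapy has assigned to a response, the built-in DepthMiddleware stores it in response.meta['depth'] (0 for start URLs). Below is a minimal sketch; the spider name and start URL are placeholders:

import scrapy

class DepthCheckSpider(scrapy.Spider):
    name = 'depth_check'  # hypothetical spider name
    start_urls = ['https://example.com']  # placeholder start URL
    custom_settings = {'DEPTH_LIMIT': 3}

    def parse(self, response):
        # DepthMiddleware tracks each response's depth in meta;
        # start URLs have depth 0.
        depth = response.meta.get('depth', 0)
        self.logger.info('depth=%s url=%s', depth, response.url)
        for href in response.css('a::attr(href)').getall():
            # Links that would exceed DEPTH_LIMIT are dropped by
            # DepthMiddleware automatically, so no manual check is needed.
            yield response.follow(href, callback=self.parse)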

3.2 Enable depth statistics: DEPTH_STATS

Through the DEPTH_STATS configuration item (enabled by default), Scrapy collects depth statistics that show how many pages were crawled at each depth. This feature is very useful for understanding the crawler's depth and page distribution.

To enable depth statistics, set in settings.py:

# settings.py

# Enable depth statistics (on by default)
DEPTH_STATS = True

# Also record per-depth request counts
DEPTH_STATS_VERBOSE = True

When you enable DEPTH_STATS_VERBOSE, Scrapy includes per-depth crawl statistics in the stats dump at the end of the crawl, showing how many requests were made at each level.

Example from the final stats dump:

   'request_depth_count/0': 1,
   'request_depth_count/1': 10,
   'request_depth_count/2': 25,
   'request_depth_count/3': 30,
   'request_depth_max': 3,

This report clearly shows how many pages were crawled at each level, helping you evaluate the crawler's coverage.
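These numbers are also available programmatically through the stats collector: with DEPTH_STATS_VERBOSE on, DepthMiddleware records one request_depth_count/<n> entry per level plus request_depth_max. Here is a sketch that dumps them from a spider's closed() hook; the spider name and start URL are placeholders:

import scrapy

class DepthStatsSpider(scrapy.Spider):
    name = 'depth_stats'  # hypothetical spider name
    start_urls = ['https://example.com']  # placeholder start URL
    custom_settings = {'DEPTH_STATS_VERBOSE': True}

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

    def closed(self, reason):
        # Pull the depth-related entries out of the final crawl stats.
        stats = self.crawler.stats.get_stats()
        for key in sorted(stats):
            if key.startswith('request_depth'):
                self.logger.info('%s: %s', key, stats[key])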

3.3 Control the crawl order with DEPTH_PRIORITY

In addition to limiting the crawl depth, Scrapy lets you choose between a depth-first (DFS) and a breadth-first (BFS) strategy through DEPTH_PRIORITY, used together with the scheduler queue settings.

Setting DEPTH_PRIORITY = 1 (with FIFO queues) makes the crawler breadth-first: deeper requests get a lower priority and are processed later.
Setting a negative value such as DEPTH_PRIORITY = -1 (with LIFO queues) makes the crawler depth-first. Note that Scrapy's default (DEPTH_PRIORITY = 0 with LIFO queues) already crawls in roughly depth-first order.
For example, to make the crawler depth-first, prioritizing newly discovered pages:

# settings.py

# Depth-first: deeper requests get a higher priority
DEPTH_PRIORITY = -1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

If you instead want breadth-first order (crawling layer by layer until the maximum depth is reached), use the recipe from the Scrapy FAQ:

# settings.py

# Breadth-first: deeper requests get a lower priority
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
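Under the hood, this works because DepthMiddleware raises or lowers each request's priority in proportion to its depth. The following is a simplified paraphrase of that logic, not the actual source (see scrapy/spidermiddlewares/depth.py for the real implementation):

# Simplified paraphrase of how Scrapy's DepthMiddleware adjusts priority.
def adjust_priority(priority: int, depth: int, depth_priority: int) -> int:
    # DEPTH_PRIORITY = 1: deeper requests lose priority -> breadth-first.
    # DEPTH_PRIORITY = -1: deeper requests gain priority -> depth-first.
    return priority - depth * depth_priority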

4. Example: Controlling the crawl depth of a Scrapy crawler

Here is a simple Scrapy crawler example showing how to use DEPTH_LIMIT to control the crawl depth and enable depth statistics.

Crawler code:

import scrapy

class DepthSpider(scrapy.Spider):
    name = 'depth_spider'
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        self.logger.info(f'Crawled page: {response.url}')
        # Extract links from the page and continue crawling
        for href in response.css('a::attr(href)').getall():
            yield scrapy.Request(url=response.urljoin(href), callback=self.parse)

settings.py:

# settings.py

# Set the maximum crawl depth to 3
DEPTH_LIMIT = 3

# Enable depth statistics
DEPTH_STATS = True
DEPTH_STATS_VERBOSE = True

# Use the depth-first strategy
DEPTH_PRIORITY = -1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

Running the crawler:
Run the crawler from the command line:

scrapy crawl depth_spider

The crawler will follow a depth-first strategy, crawl at most 3 levels deep, and output per-depth statistics once the crawl completes.

5. How to dynamically adjust the crawl depth?

In addition to configuring the crawl depth in settings.py, you can set the depth limit dynamically when launching the crawler: pass DEPTH_LIMIT on the command line with the -s flag, without modifying the settings.py file.

For example, set the depth limit to 2 when running a crawler:

scrapy crawl depth_spider -s DEPTH_LIMIT=2

This approach is very flexible and well suited to quickly adjusting crawler behavior in different scenarios.
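If one spider should always use its own depth limit regardless of the project defaults, you can also bake the override into the spider itself via custom_settings (these override settings.py, while command-line -s values override both). The spider name and start URL below are placeholders:

import scrapy

class ShallowSpider(scrapy.Spider):
    name = 'shallow_spider'  # hypothetical spider name
    start_urls = ['https://example.com']  # placeholder start URL
    # Per-spider settings: override the project's settings.py,
    # but are themselves overridden by -s command-line options.
    custom_settings = {
        'DEPTH_LIMIT': 2,
        'DEPTH_STATS_VERBOSE': True,
    }

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)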

6. Summary

By controlling the crawl depth of your Scrapy crawlers sensibly, you can optimize crawl efficiency, avoid endless link loops, and keep the crawler from collecting too much irrelevant content. Scrapy's DEPTH_LIMIT, DEPTH_STATS, and DEPTH_PRIORITY configuration options let you flexibly limit the depth, monitor the crawling process, and choose an appropriate crawl order.

This is the end of this article about setting the crawl depth in Scrapy. For more related content, please search for my previous articles or continue browsing the related articles below. I hope everyone will continue to support me!