Item in the Python Scrapy framework: a detailed explanation

Item

Item is a container for storing crawled data, and it is used much like a dictionary. Compared with a plain dictionary, however, Item adds a protection mechanism that catches misspelled or undeclared field names.

To create an Item, you inherit from the scrapy.Item class and declare fields of type scrapy.Field. This is what the items file looks like when a project is first created:

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

When saving data, you could build a dictionary or some other structure each time, but the most convenient and reliable way is to use the Item data structure that comes with Scrapy.

We have learned how to extract data from pages; now we need to decide how to encapsulate the crawled data. What kind of data structure should hold these scattered fields? The first thing that comes to mind is the Python dictionary (dict).

Review the previous code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css('.tag::text').getall()
            yield {
                'text': text,
                'author': author,
                'tags': tags,
            }

In this case, we used a Python dictionary to store the information about a quote, but a dictionary has the following disadvantages:

(1) It is not clear at a glance which fields the data contains, which hurts the readability of the code.

(2) Field names are not checked, so a programmer's typo can easily slip through (see the sketch after this list).

(3) It is not convenient for carrying metadata (information passed to other components).
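
For example, here is a minimal sketch of disadvantage (2): a plain dictionary silently accepts a misspelled key, and the mistake only surfaces later.

quote = {}
quote['text'] = 'Some quote'
quote['auther'] = 'Someone'  # typo: 'auther' instead of 'author'
# the dict happily stores the misspelled key; downstream code that
# looks up quote['author'] will fail or get nothing much later
print(quote.get('author'))   # None -- the data silently went missing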

Item and Field

Scrapy provides the following two classes, which users can use to define custom data classes (such as quote information) and encapsulate crawled data:

1. Item base class

The base class for custom data structures; every user-defined data class must inherit from it.

2. Field class

Describes which fields (such as name, price, etc.) a custom data class contains.

To define a custom data class, simply inherit from Item and create a series of class attributes that are Field objects.

Taking the quote information as an example, it contains three fields, namely the quote's text, author, and tags. The code is as follows:

import scrapy

# a special dictionary-like structure used to pass data around inside Scrapy
class TutorialItem(scrapy.Item):
    # a Field behaves much like a dictionary entry and is empty by default;
    # values are assigned as with dictionary keys, but only for fields
    # declared here -- a plain dict would accept any key, an Item will not
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Item supports the dictionary interface, so TutorialItem is used much like a Python dictionary.

When a field is assigned, TutorialItem checks the field name internally; assigning to an undefined field raises an exception, which prevents errors caused by carelessness.
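
A minimal sketch of this behavior, assuming the TutorialItem defined above:

item = TutorialItem()
item['text'] = 'Some quote'      # declared fields are assigned like dict keys
item['author'] = 'Someone'
print(item['author'])            # read access also works like a dict
print(dict(item))                # an Item converts cleanly to a plain dict

item['price'] = 9.99             # undeclared field: raises KeyError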

Request and Response objects are used for crawling websites. A Request object describes an HTTP request; the parameter list of its constructor is shown below:

Request(url, callback=None, method='GET', headers=None, body=None,
        cookies=None, meta=None, encoding='utf-8', priority=0,
        dont_filter=False, errback=None, flags=None, cb_kwargs=None)
  • url (string) - the URL for this request
  • callback (callable) - the function that will be called with this request's response (once downloaded) as its first parameter. If a request does not specify a callback, the spider's parse() method is used. Note that if an exception is raised during processing, errback is called instead. (For passing extra data to a callback, see the sketch after this list.)
  • method (string) - the HTTP method for this request. The default is 'GET'.
  • meta (dict) - the initial value of the Request.meta attribute. If given, the dict passed in this parameter is shallow-copied.
  • headers (dict) - the request headers. The dict values can be strings (for single-value headers) or lists (for multi-value headers). If None is passed as a value, the HTTP header is not sent at all.
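
As a minimal sketch of the callback parameter, and of passing extra data to a callback through cb_kwargs, here is a hypothetical spider (the spider name and page number are illustrative):

import scrapy

class PageSpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # cb_kwargs forwards extra data to the callback as keyword arguments
        yield scrapy.Request(
            response.urljoin('/page/2/'),
            callback=self.parse_page,
            cb_kwargs={'page_number': 2},
        )

    def parse_page(self, response, page_number):
        # page_number arrives here as a keyword argument
        self.logger.info('parsed page %s: %s', page_number, response.url)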
import scrapy
from tutorial.items import TutorialItem  # the Item defined earlier; adjust the import path to your project

class QuotesSpider(scrapy.Spider):
    name = 'quotes_3'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css('.tag::text').getall()
            yield TutorialItem(text=text, author=author, tags=tags)
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            next_url = response.urljoin(next_page)  # builds the absolute URL
            yield scrapy.Request(next_url, callback=self.parse)
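
Assuming this spider lives inside a Scrapy project, it can be run and its Items exported with Scrapy's standard feed export, for example:

scrapy crawl quotes_3 -o quotes.json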

This concludes this article on Item and the Scrapy framework in Python. For more on Items and Scrapy, please search my earlier articles or continue browsing the related articles below. I hope you will continue to support me!