This article walks through an example of incremental crawling with Python's scrapy framework and analyses how it is implemented. The example code is covered in detail, so it should be a useful reference for study or work for anyone who needs it.
When I first got into crawlers I was still a Python beginner, working with requests, bs4 and pandas. Later I came across scrapy and built one or two crawlers with it. I thought the framework was much better, but unfortunately I kept no notes and forgot it all. Now I need to crawl articles from a certain site to build a recommendation system, so I have picked scrapy up again and am taking the opportunity to write things down.
The directory is as follows:
- Environment
- Local window debugging and running
- Project directory
- The XPath selector
- A simple incremental crawler example
- Configuration introduction
Environment
For my own environment I naturally use anaconda (another plug for how good anaconda is).
Local window debugging and running
During development you can use the debugging shell that scrapy provides to simulate requests, so that the request and response stay the same as they will be in the later spider code.

```
# Test a request against a website
scrapy shell URL

# Set the request header
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0" URL

# Specify the output file format for the crawled content (json, csv, etc.)
scrapy crawl SPIDER_NAME -o FILE_NAME.csv

# Create a crawler project
scrapy startproject articles  # creates a scrapy project
```
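Once the shell has fetched a page it drops you into an interactive Python session with `response` already defined, so you can try expressions before committing them to a spider. A minimal sketch (the URL and XPath here are placeholders, not from the original article):

```python
# Inside `scrapy shell URL` -- `response` is already defined for you
response.status                                   # HTTP status code of the fetched page
response.xpath("//title/text()").extract_first()  # quick check of an XPath expression
fetch("https://example.com/other-page")           # re-fetch a different URL in the same session
```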
Introduction to the new project structure

```
articles                        # project root created by `scrapy startproject articles`
├── articles
│   ├── __init__.py
│   ├── items.py                # defines the format of the data output
│   ├── middlewares.py          # sets request details (request headers, etc.)
│   ├── pipelines.py            # pipeline for data output; every packaged item passes through here
│   ├── settings.py             # global settings for the project (stores the configuration)
│   └── spiders                 # all crawlers are stored under the spiders directory
│       ├── healthy_living.py
│       ├── __init__.py
│       └── people_health.py
└── scrapy.cfg
```
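The spider shown later instantiates an `ArticlesItem`, so here is a minimal sketch of what `items.py` might look like. The field names `tag`, `title`, `date` and `article_url` come from the spider and pipeline code in this article; the `content` field is an assumption for the article body.

```python
# items.py -- a minimal sketch of the item definition used by the spider below.
# The exact fields are assumptions inferred from the spider and pipeline code.
import scrapy


class ArticlesItem(scrapy.Item):
    tag = scrapy.Field()          # category tag passed down via meta
    title = scrapy.Field()        # article title (used by the pipeline's dedup query)
    date = scrapy.Field()         # publication date (used by the pipeline's dedup query)
    article_url = scrapy.Field()  # link to the article, carried through meta
    content = scrapy.Field()      # article body (assumed field)
```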
Page parsing power tool: the XPath selector
scrapy comes with its own XPath selectors, which are very convenient. Below is a brief introduction to some commonly used patterns.

```python
# The whole-site crawling workhorse: LinkExtractor. It automatically collects all
# urls and text under a given tag (most pages of a site follow the same structure).
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths="//ul[@class='nav2_UL_1 clearFix']")
# le.extract_links(response) returns a list of links; by looping over it
# (for link in le.extract_links(response)) you get link.url and link.text

# Get the text content of all div tags whose class attribute is 'aa'
response.xpath("//div[@class='aa']/text()").extract()
# '//' selects all matches, '/' selects only the first one; other tags and attributes (e.g. ul) work the same way

# Get the links contained in all a tags whose text contains "Next Page" (the next-page link extractor)
response.xpath("//a[contains(text(),'Next Page')]/@href").extract()
```
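As a quick, self-contained illustration (the HTML fragment and URL below are made up, not taken from the target site), you can exercise `LinkExtractor` and XPath against a hand-built `HtmlResponse` without hitting a real server:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A made-up page fragment mimicking the nav structure used in this article
html = b"""
<html><body>
  <ul class="nav2_UL_1 clearFix">
    <li><a href="/healthyLiving/">Healthy Living</a></li>
    <li><a href="/peopleHealth/">People Health</a></li>
  </ul>
</body></html>
"""
response = HtmlResponse(url="http://example.com/", body=html, encoding="utf-8")

le = LinkExtractor(restrict_xpaths="//ul[@class='nav2_UL_1 clearFix']")
for link in le.extract_links(response):
    print(link.url, link.text)   # absolute url and anchor text of each link

print(response.xpath("//ul/li/a/text()").extract())
```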
A simple incremental crawl example
The idea behind incremental crawling here is very simple: the target site's data is arranged by time, so before requesting a link, first check whether that record is already in the database. If it is, stop crawling; if it is not, issue the request.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor

# The data container is imported from items.py
from ..items import ArticlesItem


class HealthyLiving(scrapy.Spider):
    # The crawler name must be globally unique; you specify it when starting from the command line
    name = "healthy_living"
    # The crawler entry points; scrapy supports multiple entry points, so this must be a list
    start_urls = ['/healthyLiving/']

    '''
    Crawl the entry links of the major category tags
    '''
    def parse(self, response):
        le = LinkExtractor(restrict_xpaths="//ul[@class='nav2_UL_1 clearFix']")
        for link in le.extract_links(response)[1:-1]:
            tag = link.text
            # Pass the information extracted at this level to the next level via meta
            # (here it is used to tag the data)
            meta = {"tag": tag}
            # Parse each link in turn and hand it to the next level for further crawling
            yield scrapy.Request(link.url, callback=self.parse_articles, meta=meta)

    '''
    Crawl the article links and the next-page link on each listing page
    '''
    def parse_articles(self, response):
        # Receive the information passed down from the previous level
        meta = response.meta
        article_links = response.xpath("//div[@class='txt']/h4/a/@href").extract()
        for link in article_links:
            # self.collection is assumed to be a pymongo collection handle set up elsewhere
            res = self.collection.find_one({"article_url": link}, {"article_url": 1})
            full_meta = dict(meta)
            # Pass the article link down to the next level
            full_meta.update({"article_url": link})
            if res is None:
                yield scrapy.Request(link, callback=self.parse_article, meta=full_meta)
            else:
                # The article is already in the database, so stop here
                return
        next_page = response.xpath("//div[@class='page']//a[contains(text(),'»')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse_articles, meta=meta)

    # Finally, parse the article page itself and output the item
    def parse_article(self, response):
        # ArticlesItem defines the data output format (see items.py)
        article_item = ArticlesItem()
        meta = response.meta
        # Extract the page information with xpath and pack it into the item
        try:
            article_item["tag"] = meta["tag"]
            # ... omitted
        finally:
            yield article_item
```
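The spider above assumes it already holds a pymongo collection handle (`self.collection`) for the dedup lookup; the original article does not show where that comes from. One way to wire it up, as a sketch that reuses the setting names introduced in the configuration section below, is to open the connection when the crawler builds the spider:

```python
# Sketch: giving the spider a MongoDB handle for the incremental check.
# MONGO_URI / MONGO_DB / MONGO_COLLECTION are the settings shown later in this article.
import pymongo
import scrapy


class HealthyLiving(scrapy.Spider):
    name = "healthy_living"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        client = pymongo.MongoClient(crawler.settings.get("MONGO_URI"))
        db = client[crawler.settings.get("MONGO_DB")]
        spider.collection = db[crawler.settings.get("MONGO_COLLECTION")]
        return spider
```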
Project configuration introduction
Set the request headers and configure the database.

```python
# middlewares.py -- set the request headers here and enable the middleware in settings.py
import random


class RandomUA(object):
    user_agents = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit"
        "/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit"
        "/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16"
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)


# pipelines.py -- handle writing the data to the database here and enable the pipeline in settings.py
import pymongo
from datetime import datetime


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        print("Start crawling", datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # MONGO_COLLECTION is assumed to be read from the project settings
        data = self.db[MONGO_COLLECTION].find_one({"title": item["title"], "date": item["date"]})
        if data is None:
            self.db[MONGO_COLLECTION].insert_one(dict(item))
        # else:
        #     self.close_spider(self, spider)
        return item

    def close_spider(self, spider):
        print("Crawl ends", datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        self.client.close()


# settings.py -- enable the request-header middleware and the database pipeline
DOWNLOADER_MIDDLEWARES = {
    # 'articles.middlewares.ArticlesDownloaderMiddleware': 543,
    'articles.middlewares.RandomUA': 543,  # 543 is the priority; the lower the number, the higher the priority
}
ITEM_PIPELINES = {
    'articles.pipelines.MongoPipeline': 300,
}

# Some other configuration
ROBOTSTXT_OBEY = True            # whether to obey the site's robots.txt
FEED_EXPORT_ENCODING = 'utf-8'   # encoding of the exported data

# Database configuration
MONGO_URI = ''
MONGO_DB = ''
MONGO_PORT = 27017
MONGO_COLLECTION = ''
```
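With the settings in place you would normally start the crawl from the command line (`scrapy crawl healthy_living`). If you prefer to launch it from a Python script instead, a minimal sketch using scrapy's own CrawlerProcess looks like this; it assumes the script sits in the project root next to scrapy.cfg so that the project settings can be found.

```python
# run.py -- sketch for launching the spider programmatically; assumed to live
# next to scrapy.cfg so get_project_settings() picks up settings.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("healthy_living")   # the globally unique spider name defined above
process.start()                   # blocks until the crawl finishes
```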
That is all the content of this article. I hope it is helpful for your study, and I hope you will continue to support me.