
How to add promise support for node-crawler

Background

Recently I did some crawler-related work for my team. I had originally planned to build a simple crawler myself, but after a recommendation from other developers I went with node-crawler. After using it for a while it did meet most of my needs. However, its API does not support promises, and I also need to run several crawl tasks together and process their results together. Without promises the code becomes very inelegant, so I simply wrapped a promise API around it.

Status quo

At the moment node-crawler does not support promises. Here is the usage example from its npm page:

const Crawler = require("crawler")

// Instantiate
const c = new Crawler({
  // ... some configuration can be passed in
  callback: function (error, res, done) {
    // Request callback. The callback passed in at instantiation time is the default one:
    // if a later crawl does not pass its own callback, this default callback is called.
    done();
  }
})

// Crawl
c.queue([{
  uri: '/',
  jQuery: false,

  // The global callback won't be called
  callback: function (error, res, done) {
    if (error) {
      console.log(error);
    } else {
      console.log('Grabbed', res.body.length, 'bytes');
    }
    done();
  }
}])

This callback style is not friendly when several crawl tasks need to run together and their results need to be handled together.
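For example, with callbacks alone you have to hand-roll the coordination just to know when two crawls have both finished. A minimal sketch of that pain point (the URLs and the report() helper are made up for illustration):

const Crawler = require('crawler')

const c = new Crawler({ jQuery: false })

// Hand-rolled "wait until both pages are done": count completed tasks ourselves
const results = []
let finished = 0
const onPageDone = (error, res, done) => {
  if (!error) results.push(res.body.length)
  finished += 1
  if (finished === 2) report(results) // report() is a hypothetical result handler
  done()
}

c.queue([
  { uri: 'http://example.com/a', callback: onPageDone },
  { uri: 'http://example.com/b', callback: onPageDone },
])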

Renovation

The ideal usage would look like this:

const Crawler = require('crawler')

const c = new Crawler({
  // some default configuration
})

c
.queue({
  uri: 'xxx'
})
.then(res => {
  // crawl succeeded
})
.catch(err => {
  // crawl failed
})

Renovation plan:

// utils/crawler.js
const Crawler = require('crawler')
// merge here is lodash's deep merge (any deep-merge helper works)
const merge = require('lodash/merge')

// noop is handed to the user callbacks so they cannot call done() twice
const noop = () => {}

// fetchRateLimit / fetchTimeout stand for project-level config values
const fetchRateLimit = 1000 // ms between two requests (example value)
const fetchTimeout = 10000  // ms (example value)

const defaultOptions = {
  jQuery: false,
  rateLimit: fetchRateLimit,
  retries: 0,
  timeout: fetchTimeout,
}

module.exports = class PromiseifyCrawler extends Crawler {
  // namespace is used to tell crawlers apart when their results are reported later
  constructor(namespace = 'unknown', options = {}) {
    if (typeof namespace === 'object') {
      options = namespace
      namespace = 'unknown'
    }

    options = merge({}, defaultOptions, options)

    const cb = options.callback
    options.callback = (err, res, done) => {
      typeof cb === 'function' && cb(err, res, noop)
      process.nextTick(done)
      // Decide here what counts as a failed crawl.
      // I treat any http code other than 200 as an error.
      // Success/failure statistics can also be collected here.
      if (err || res.statusCode !== 200) {
        if (!err) err = new Error(`${res.statusCode}-${res.statusMessage}`)
        err.options = res.options
        res.options.reject(err)
      } else {
        res.options.resolve(res)
      }
    }
    options.headers = Object.assign({}, options.headers, {
      'X-Requested-With': 'XMLHttpRequest',
    })
    super(options)
  }

  queue(options = {}) {
    // Every crawl is a new promise
    return new Promise((resolve, reject) => {
      // Mount resolve and reject on options
      // so the global callback defined above can reach them
      options.resolve = resolve
      options.reject = reject

      const pr = options.preRequest
      options.preRequest = (options, done) => {
        typeof pr === 'function' && pr(options, noop)
        // Common pre-crawl processing can also go here
        done()
      }

      super.queue(options)
    })
  }

  // The direct API can be wrapped in the same way
}
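The class above only wraps queue(). node-crawler also has a direct() API that skips the queue and takes a callback of the form (error, response); a method like the one below could be added inside PromiseifyCrawler to wrap it the same way. This is only a sketch of the idea, not code from the original implementation:

// Sketch: promisify direct(), assuming its callback receives (error, response)
direct(options = {}) {
  return new Promise((resolve, reject) => {
    const cb = options.callback
    options.callback = (err, res) => {
      typeof cb === 'function' && cb(err, res)
      // Same rule as queue(): any non-200 http code counts as a failure
      if (err || res.statusCode !== 200) {
        reject(err || new Error(`${res.statusCode}-${res.statusMessage}`))
      } else {
        resolve(res)
      }
    }
    super.direct(options)
  })
}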
// use
const Crawler = require('./utils/crawler')

const crawler = new Crawler('Sample crawler namespace')

crawler
.queue({
  uri: 'xxx',
  preRequest: options => log('Start crawling'),
})
.then(res => {
  log('Crawled successfully')
  return res
})
.catch(err => {
  log('Crawl failed')
  throw err
})
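Because queue() now returns a promise, the same call also reads naturally with async/await. A small sketch, reusing the placeholder uri and log() from above:

const run = async () => {
  try {
    const res = await crawler.queue({ uri: 'xxx' })
    log('Crawled successfully')
    return res
  } catch (err) {
    log('Crawl failed')
    throw err
  }
}

run()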
After the promise transformation, writing several crawl tasks that run together and are waited on together becomes much friendlier:

// crawl task 1
const fetchTask1 = () => crawler.queue({/* configuration */}).then(res => handle(res))
// crawl task 2
const fetchTask2 = () => crawler.queue({/* configuration */}).then(res => handle(res))

const fetch = () => {
  return Promise.all([
    fetchTask1(),
    fetchTask2(),
  ])
}

fetch()
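If the second crawl depends on the result of the first, the same tasks can just as easily be run strictly one after another instead of in parallel. A small sketch, reusing fetchTask1 and fetchTask2 from above:

const fetchInOrder = async () => {
  // Wait for task 1 to finish before starting task 2
  const res1 = await fetchTask1()
  const res2 = await fetchTask2()
  return [res1, res2]
}

fetchInOrder()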

This completes the promise transformation of node-crawler.
