Background

Recently I did some crawler-related work for my team. At first I planned to roll a simple crawler of my own, but on a recommendation from other developers I tried node-crawler, and after using it for a while it did cover most of my needs. Its API, however, does not support promises, and I also need to run several crawl tasks together and process their results in one place. Without promises that code gets very inelegant, so I simply wrapped a promise API around it.
Status quo

node-crawler does not currently support promises. Here is the usage example from npm:
const Crawler = require("crawler") // Instantiationconst c = new Crawler({ // ... Some configurations can be passed in callback : function (error, res, done) { // Request a callback. The callback passed in during instantiation is used as the default callback. If no callback is passed in each subsequent crawl, the default callback will be called. done(); } }) // Crawl([{ uri: '/', jQuery: false, // The global callback won't be called callback: function (error, res, done) { if(error){ (error); }else{ ('Grabbed', , 'bytes'); } done(); } }])
This callback style is not friendly when several crawl tasks have to be coordinated and their results processed together.
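To make the pain concrete, here is a rough sketch, not from the original article, of what coordinating two callback-based crawls looks like: you keep your own counter and collect results by hand (the uris and the `handleBoth` handler are made up for illustration):

```js
// Hypothetical illustration: collecting results from two callback-based crawls by hand.
// `c` is a Crawler instance as in the example above.
const results = []
let pending = 2

const collect = (error, res, done) => {
  if (!error) results.push(res.body)
  done()
  pending -= 1
  // Only once every crawl has reported back can the results be processed together
  if (pending === 0) handleBoth(results)
}

c.queue({ uri: '/page-1', callback: collect })
c.queue({ uri: '/page-2', callback: collect })
```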
Transformation

The ideal way to use it:
```js
const Crawler = require('crawler')

const c = new Crawler({
  // Some default configuration
})

c.queue({ uri: 'xxx' })
  .then(res => {
    // Crawl succeeded
  })
  .catch(err => {
    // Crawl failed
  })
```
Transformation plan:
```js
// utils/crawler.js
const Crawler = require('crawler')

// merge (a deep merge, e.g. lodash's), noop, fetchRateLimit and fetchTimeout
// are assumed to be defined or imported elsewhere in the project
const defaultOptions = {
  jQuery: false,
  rateLimit: fetchRateLimit,
  retries: 0,
  timeout: fetchTimeout,
}

module.exports = class PromiseifyCrawler extends Crawler {
  // namespace is used to tell crawl results apart when they are reported in one place later
  constructor(namespace = 'unknow', options = {}) {
    if (typeof namespace === 'object') {
      options = namespace
      namespace = 'unknow'
    }
    options = merge({}, defaultOptions, options)
    const cb = options.callback
    options.callback = (err, res, done) => {
      typeof cb === 'function' && cb(err, res, noop)
      process.nextTick(done)
      // Here you can customize what counts as a successful or failed crawl.
      // I treat any non-200 http code as an error.
      // This is also a good place to collect success/failure statistics.
      if (err || res.statusCode !== 200) {
        if (!err) err = new Error(`${res.statusCode}-${res.statusMessage}`)
        err.options = res.options
        res.options.reject(err)
      } else {
        res.options.resolve(res)
      }
    }
    options.headers = Object.assign({}, options.headers, {
      'X-Requested-With': 'XMLHttpRequest',
    })
    super(options)
  }
  queue(options = {}) {
    // Every crawl is a new promise
    return new Promise((resolve, reject) => {
      // Mount resolve and reject on options
      // so they are available in the global callback
      options.resolve = resolve
      options.reject = reject
      const pr = options.preRequest
      options.preRequest = (options, done) => {
        typeof pr === 'function' && pr(options, noop)
        // Here you can also do some general pre-crawl processing
        done()
      }
      super.queue(options)
    })
  }
  // The direct API can be wrapped the same way
}
```

Usage:

```js
const Crawler = require('./utils/crawler')

const crawler = new Crawler('Sample crawler namespace')

crawler
  .queue({
    uri: 'xxx',
    preRequest: options => log('Start crawling'),
  })
  .then(res => {
    log('Crawl succeeded')
    return res
  })
  .catch(err => {
    log('Crawl failed')
    throw err
  })
```

After the promise transformation, writing multiple crawl tasks that run together is much friendlier:

```js
// Crawl task 1
const fetchTask1 = () => crawler.queue({ /* configuration */ }).then(res => handle(res))
// Crawl task 2
const fetchTask2 = () => crawler.queue({ /* configuration */ }).then(res => handle(res))

const fetch = () => {
  return Promise.all([
    fetchTask1(),
    fetchTask2(),
  ])
}

fetch()
```
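Since queue() now returns a promise, async/await works as well. Below is a minimal sketch, not from the original article, of crawling pages one after another so the second request only starts once the first has finished (the uris are placeholders):

```js
// Hypothetical sketch: sequential crawling with the promisified queue().
// `crawler` is the PromiseifyCrawler instance from above.
const run = async () => {
  try {
    const first = await crawler.queue({ uri: 'xxx/page-1' })
    // The second crawl only starts after the first one has resolved
    const second = await crawler.queue({ uri: 'xxx/page-2' })
    return [first.body, second.body]
  } catch (err) {
    // Any failed crawl lands here
    console.error('Crawl failed', err)
    throw err
  }
}

run()
```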
This completes the promise transformation of node-crawler
That's all for this article. I hope it is helpful for your own work, and thanks for your support.