SoFunction
Updated on 2025-03-03

A complete example of implementing a timed crawler in Node.js

Cause of the incident

Two days ago, a friend asked me to help review his Bilibili captain group by looking up the captain list one entry at a time. Doing that by hand is, of course, not a programmer's first choice: hand the task over to the computer, let it do the work, and slack off in the meantime. Theory established, time to start coding.

The captain list API is already known, so the crawler simply uses Axios to hit the interface directly.

After spending "a hundred million" minutes or so, I finished writing this crawler, which I call bilibili-live-captain-tools 1.0.

const axios = require('axios')
const roomid = "146088"
const ruid = "642922"
// NOTE: the API host was lost in transcription; bilibili's live API host is assumed here
const url = `https://api.live.bilibili.com/xlive/app-room/v2/guardTab/topList?roomid=${roomid}&ruid=${ruid}&page_size=30`

const Captin = {
 1: 'Governor',
 2: 'Admiral',
 3: 'Captain'
}

const reqPromise = url => axios.get(url);

let CaptinList = []
let UserList = []

async function crawler(URL, pageNow) {
 const res = await reqPromise(URL);
 // The response shape (res.data.data.top3 / .list / .info) is assumed from the guardTab API
 if (pageNow == 1) {
  CaptinList = CaptinList.concat(res.data.data.top3);
 }
 CaptinList = CaptinList.concat(res.data.data.list);
}

function getMaxPage(res) {
 const Info = res.data.data.info
 const { page: maxPage } = Info
 return maxPage
}

function getUserList(res) {
 for (let item of res) {
  const userInfo = item
  const { uid, username, guard_level } = userInfo
  UserList.push({ uid, username, Captin: Captin[guard_level] })
 }
}

async function main(UID) {
 const maxPage = await reqPromise(`${url}&page=1`).then(getMaxPage)
 for (let pageNow = 1; pageNow < maxPage + 1; pageNow++) {
  const URL = `${url}&page=${pageNow}`;
  await crawler(URL, pageNow);
 }
 getUserList(CaptinList)
 console.log(search(UID, UserList))
 return search(UID, UserList)
}

function search(uid, UserList) {
 for (let i = 0; i < UserList.length; i++) {
  if (UserList[i].uid === uid) {
   return UserList[i];
  }
 }
 return 0
}

module.exports = {
 main
}

Obviously, this crawler can only be triggered manually, and it needs a command line and a Node environment to run, so I wrapped it in a small page service using Koa2 and wrote an extremely simple page for it.

const Koa = require('koa');
const app = new Koa();
const path = require('path')
const fs = require('fs');
const router = require('koa-router')();
const index = require('./index')
const views = require('koa-views')

// NOTE: the template directory, page file name, and query parameter name
// were lost in transcription and are assumed below
app.use(views(path.join(__dirname, './'), {
 extension: 'ejs'
}))
app.use(router.routes());

router.get('/', async ctx => {
 ctx.type = 'html';
 ctx.body = fs.createReadStream('./index.html');
})

router.get('/api/captin', async (ctx) => {
 const UID = ctx.query.uid
 console.log(UID)
 const Info = await index.main(parseInt(UID))
 await ctx.render('index', {
  Info,
 })
});

app.listen(3000);

Since the page had no debouncing or throttling, this version could only crawl in real time on every request, which made the wait long, and frequent refreshes naturally triggered Bilibili's anti-crawling mechanism, so the server IP soon got flagged by risk control.
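The fix that motivates version 2.0 is to stop crawling on demand and instead serve every request from a cache that a background job refreshes. A minimal sketch of that idea, using a plain setInterval and a stubbed fetch in place of the real crawler (the function names, shapes, and the one-minute interval are my assumptions, not code from the original tool):

```javascript
// Serve from cache, refresh in the background.
let cache = { list: [], updatedAt: null };

async function fetchCaptains() {
  // Stand-in for the real crawler's main(); here it just returns fixed data.
  return [{ uid: 1, username: 'demo', Captin: 'Captain' }];
}

async function refreshCache() {
  // Replace the cache atomically so readers never see a half-built list.
  cache = { list: await fetchCaptains(), updatedAt: new Date() };
}

function startBackgroundRefresh(intervalMs = 60 * 1000) {
  refreshCache(); // warm the cache once at startup
  // Page requests now read `cache` instantly instead of triggering a crawl.
  return setInterval(refreshCache, intervalMs);
}
```

With this in place, the `/api/captin` route would read from `cache.list` instead of awaiting the crawler, so refresh-spamming the page no longer multiplies requests to the upstream API.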

So bilibili-live-captain-tools 2.0 was released

function throttle(fn, delay) {
 var timer;
 return function () {
  var _this = this;
  var args = arguments;
  if (timer) {
   return;
  }
  timer = setTimeout(function () {
   fn.apply(_this, args);
   // Clear the timer after fn has run, so the next trigger can schedule again
   timer = null;
  }, delay);
 };
}

Besides adding throttling and debouncing, the crawler becomes pseudo-real-time: a scheduled task crawls once a minute, and requests are served from the latest result.
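The throttle above limits how often a call can fire; its debounce counterpart, which the text mentions but never shows, delays the call until the triggers stop arriving. A minimal sketch of the common idiom (not code from the original tool):

```javascript
// Debounce: postpone fn until `delay` ms have passed with no new calls.
function debounce(fn, delay) {
  let timer = null;
  return function (...args) {
    if (timer) clearTimeout(timer); // a new call resets the countdown
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args); // run once, with the latest arguments
    }, delay);
  };
}
```

Throttle guarantees at most one execution per window, which suits rate-limiting API hits; debounce waits for a quiet period, which suits collapsing a burst of page refreshes into a single crawl.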

That means the crawler script has to run on a schedule. My first thought was egg's schedule feature, but I didn't want to pull in a whole framework just for one crawler, so I searched around and landed on the following plan.

Implement timing tasks with Node Schedule

Node Schedule is a flexible cron-style and non-cron-style job scheduler for Node.js. It lets you schedule jobs (arbitrary functions) to run on specific dates, with optional recurrence rules. It uses only a single timer at any given time (rather than re-evaluating upcoming jobs every second or minute).

1. Install node-schedule

npm install node-schedule
# or: yarn add node-schedule

2. Basic usage

Let's take a look at the official examples

const schedule = require('node-schedule');

const job = schedule.scheduleJob('42 * * * *', function () {
 console.log('The answer to life, the universe, and everything!');
});

The first parameter is a cron-style rule in the following format.

Node Schedule rules are represented by the following table

*  *  *  *  *  *
┬  ┬  ┬  ┬  ┬  ┬
│  │  │  │  │  │
│  │  │  │  │  └ Day of the week, value: 0 - 7, where 0 and 7 both represent Sunday
│  │  │  │  └─── Month, value: 1 - 12
│  │  │  └────── Date, value: 1 - 31
│  │  └───────── Hour, value: 0 - 23
│  └──────────── Minute, value: 0 - 59
└─────────────── Second, value: 0 - 59 (optional)
Instead of a rule, you can also pass a specific time, for example: const date = new Date()

Understand the rules and implement one by one

const schedule = require('node-schedule');

// Define a time (note: JavaScript Date months are 0-based, so 2 means March)
let date = new Date(2021, 2, 10, 12, 0, 0);

// Define a task
let job = schedule.scheduleJob(date, () => {
 console.log("Current time:", new Date());
});

The above example prints the current time at 12:00 on March 10, 2021.

3. Advanced usage

In addition to basic usage, we can also use some more flexible methods to implement timing tasks.

3.1. Execute once every minute

const schedule = require('node-schedule');

// Define a rule
let rule = new schedule.RecurrenceRule();
rule.second = 0; // fire at second 0 of every minute

// Start the task
let job = schedule.scheduleJob(rule, () => {
 console.log(new Date());
});

The fields supported by a rule are second, minute, hour, date, dayOfWeek, month, year, and so on.

Some common rules are as follows

// Execute every second
rule.second = [0, 1, 2, 3, /* ... */ 59];

// Execute at second 0 of every minute
rule.second = 0;

// Execute at minute 30 of every hour
rule.minute = 30;
rule.second = 0;

// Execute every day at 0:00
rule.hour = 0;
rule.minute = 0;
rule.second = 0;

// Execute at 10:00 on the 1st of every month
rule.date = 1;
rule.hour = 10;
rule.minute = 0;
rule.second = 0;

// Execute every Monday, Wednesday and Friday at 0:00 and 12:00
rule.dayOfWeek = [1, 3, 5];
rule.hour = [0, 12];
rule.minute = 0;
rule.second = 0;

4. Terminate the task

You can use job.cancel() to terminate a running job. Cancel the task promptly when an exception occurs.

job.cancel();

Summarize

node-schedule is a crontab-style scheduled-task module for Node.js. Scheduled tasks can help maintain a server system by performing necessary operations at fixed intervals, and can also be used to send emails, crawl data, and so on.

This is the end of this article about implementing a timed crawler in Node.js.