SoFunction
Updated on 2025-03-02

Tutorial for crawling web pages with PhantomJS

Preface

When I want to crawl a web page with Node.js, my first instinct is to use the http module, for example to fetch the Baidu homepage:

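var http = require('http');

// Request the Baidu homepage and print the response body chunk by chunk.
var req = http.request('http://www.baidu.com/', function (res) {
 res.setEncoding('utf8');
 res.on('data', function (chunk) {
  // Response content
  console.log(chunk);
 });
});
// Send the request; the callback runs once the request has been written out.
req.end(function () {
 // console.log('Connection Close');
});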

However, this only fetches the raw HTML, which is quite limiting.

If the content you want is not in the HTML but is generated dynamically by JavaScript, the http module cannot meet your needs.

If the page is encoded in GBK, the code above decodes it as UTF-8 and produces garbled text.

If the site uses HTTPS, the code has to be changed to use the https module.
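To make the last two limitations concrete, here is a minimal sketch of the extra work the plain http approach would need. The helper names pickClient, decodeBody and fetchText are hypothetical, not part of Node.js, and the sketch assumes a Node.js build with full ICU so that the global TextDecoder recognizes the gbk label.

```javascript
var http = require('http');
var https = require('https');

// Hypothetical helper: choose the core module that matches the URL's protocol.
function pickClient(url) {
  return new URL(url).protocol === 'https:' ? https : http;
}

// Hypothetical helper: decode raw response bytes with an explicit charset.
// Assumes the runtime's TextDecoder supports the charset (e.g. 'gbk').
function decodeBody(buffer, charset) {
  return new TextDecoder(charset).decode(buffer);
}

// Hypothetical helper: fetch a page and pass the decoded text to done(err, text).
function fetchText(url, charset, done) {
  var req = pickClient(url).get(url, function (res) {
    var chunks = [];
    // Collect raw Buffers instead of calling res.setEncoding('utf8').
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () {
      done(null, decodeBody(Buffer.concat(chunks), charset));
    });
  });
  req.on('error', done);
}
```

Even with these helpers, content generated by JavaScript in the page is still out of reach, because nothing here executes the page's scripts.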

I wanted a more powerful tool that is still easy to use.

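PhantomJS

PhantomJS can solve the problems above.

PhantomJS is a headless browser, that is, a browser without a graphical interface. It loads a page and runs its JavaScript like a regular browser.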

Install

Just install PhantomJS using cnpm:

cnpm install phantomjs --save-dev

I did not install it globally. With a global installation, other people who use my source code would not know about the dependency, and the project would fail to run for them.

If you also choose a local installation, add the following entry to the scripts section of package.json:

"phantomjs":"node_modules/.bin/phantomjs"

This script entry is used later. The installation is now complete.

Write code

Create a new JavaScript file for the script (the name is up to you) with the following content:

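var webpage = require('webpage');
var page = webpage.create();

// Open the page; the callback receives 'success' or 'fail'.
page.open('http://www.baidu.com/', function (status) {
 if (status === 'fail') {
  console.log('open page fail!');
 } else {
  console.log(page.content); // Print out HTML content
 }
 page.close(); // Close the web page
 phantom.exit(); // Exit the phantomjs command line
});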

The code uses a webpage module that we never installed, so why can it be required?

Under Node.js it cannot: running this file with node fails. The webpage module is built into PhantomJS, so the file has to be run with PhantomJS instead:

npm run phantomjs <your-script-file>

Here npm run phantomjs invokes the entry added to the scripts section in the previous step. It is almost as convenient as the http module.

page.content is the HTML code of the page. The page object has many other properties and methods, which makes it far more capable than the http module.
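As one illustration of what the page object offers beyond page.content, the sketch below uses page.evaluate to run a function inside the loaded page and return a value, which is how content produced by the page's own JavaScript can be read. The helper name fetchTitle is hypothetical, and the Baidu URL is only an example target. The code is written in ES5 style and is meant to be run with PhantomJS, not with node.

```javascript
// Hypothetical helper: open a URL and report the page title to done(err, title).
function fetchTitle(page, url, done) {
  page.open(url, function (status) {
    if (status !== 'success') {
      done(new Error('open page fail!'));
      return;
    }
    // The function passed to evaluate runs inside the page, after its scripts.
    var title = page.evaluate(function () {
      return document.title;
    });
    done(null, title);
  });
}

// Run only inside PhantomJS, where the global phantom object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  fetchTitle(page, 'http://www.baidu.com/', function (err, title) {
    console.log(err ? err.message : title);
    page.close();
    phantom.exit();
  });
}
```

Passing the page object into the helper keeps the PhantomJS-specific setup in one place, so the same function can be reused for other URLs.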

At this point you have the basics. To learn more, see the documentation on the official PhantomJS website.

Summary

The above is the entire content of this article. I hope the content of this article will be of some help to your study or work. If you have any questions, you can leave a message to communicate. Thank you for your support.