
A brief analysis of the use of the colly Golang crawler framework

Golang is a language that is very well suited to writing web crawlers: it has efficient concurrency and a rich set of network programming libraries. Here is a simple example of a Golang web crawler (the target URL below is a placeholder, since the original address was omitted):

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    // Placeholder URL; the original article omitted the target address.
    resp, err := http.Get("https://example.com")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    re := regexp.MustCompile("<title>(.*)</title>")
    title := re.FindStringSubmatch(string(body))[1]
    fmt.Println("Title:", title)
}

The function of this crawler is to fetch the title of the specified website. The code uses Go's standard libraries net/http and regexp to make the HTTP request and extract the title with a regular expression. Of course, this is just a simple example. In practice, a crawler has to deal with more issues, such as anti-crawler measures, data storage, and concurrency control; a small sketch of the last two points follows.
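As a rough illustration of those points, concurrency control and a minimal anti-crawler courtesy (a custom User-Agent and a timeout) can be sketched with the standard library alone. The URLs, the User-Agent string, and the limit of 3 in-flight requests below are placeholders of mine, not from the original article:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "sync"
    "time"
)

func fetch(client *http.Client, url string) (int, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return 0, err
    }
    // Sending an explicit User-Agent is a minimal courtesy toward anti-crawler checks.
    req.Header.Set("User-Agent", "my-crawler/1.0")
    resp, err := client.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return 0, err
    }
    return len(body), nil
}

func main() {
    // Placeholder URLs; replace with the pages you actually want to crawl.
    urls := []string{"https://example.com/", "https://example.org/"}
    client := &http.Client{Timeout: 10 * time.Second}
    sem := make(chan struct{}, 3) // at most 3 requests in flight
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it
            n, err := fetch(client, u)
            if err != nil {
                fmt.Println("Error:", err)
                return
            }
            fmt.Println(u, "body length:", n)
        }(u)
    }
    wg.Wait()
}

A buffered channel used as a semaphore is a common Go idiom for capping concurrency without building a full worker pool.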

gocolly is a web crawler framework implemented in Go. The version I used for testing here is colly v2, imported as "github.com/gocolly/colly/v2" and installed with go get github.com/gocolly/colly/v2.

Gocolly's crawler is quite powerful. Let's look at how it is used through the code below.

package main

import (
  "fmt"

  colly "github.com/gocolly/colly/v2"
  "github.com/gocolly/colly/v2/debug"
)

func main() {
  mUrl := "/" // target URL (the original article omitted the actual address)
  // The core of colly is the Collector object, which manages network communication
  // and runs the attached callback functions while a job is in progress.
  c := colly.NewCollector(
    // Turn on debug logging
    colly.Debugger(&debug.LogDebugger{}),
  )
  // Executed before the request is sent
  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Here is the function executed before sending")
  })
  // Called back when the request fails
  c.OnError(func(_ *colly.Response, err error) {
    fmt.Println(err)
  })
  // Called back after the response is received
  c.OnResponse(func(r *colly.Response) {
    fmt.Println("Response body length:", len(r.Body))
  })
  // Called after OnResponse to parse the page data
  c.OnHTML("div#newsList h1 a", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
  })
  // Called after OnHTML
  c.OnScraped(func(r *colly.Response) {
    fmt.Println("Finished", r.Request.URL)
  })
  // Visit the URL
  c.Visit(mUrl)
}

The output of running it looks like this:

Here is the function executed before sending

[000001] 1 [     1 - request] map["url":"/"] (0s)
[000002] 1 [     1 - responseHeaders] map["status":"OK" "url":"/"] (64.9485ms)
Response body length:250326
Finished /
[000003] 1 [     1 - response] map["status":"OK" "url":"/"] (114.9949ms)
[000004] 1 [     1 - html] map["selector":"div#newsList h1 a" "url":"/"] (118.9926ms)
[000005] 1 [     1 - html] map["selector":"div#newsList h1 a" "url":"/"] (118.9926ms)
[000006] 1 [     1 - scraped] map["url":"/"] (118.9926ms)

To summarize:

The callback functions are called in the following order:

OnRequest is called before the request is initiated

OnError is called if an error occurs during the request process

OnResponse is called after receiving the reply

OnHTML is called after OnResponse, if the received content is HTML

OnScraped is called after OnHTML
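For completeness, here is a small sketch (my own, not from the original article) of how that same callback chain repeats when OnHTML queues further visits; the domain, selector, MaxDepth, and Parallelism values are assumptions chosen for illustration:

package main

import (
  "fmt"

  colly "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector(
    // Placeholder domain; restrict crawling so followed links stay on one site.
    colly.AllowedDomains("example.com"),
    // Limit how deep followed links are allowed to go.
    colly.MaxDepth(2),
  )
  // Throttle requests to the site (the concurrency control mentioned earlier).
  c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

  // OnRequest fires before every request, including the ones queued below.
  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })
  // For every link found, queue another visit; the full callback chain
  // runs again for each of those pages.
  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
  })
  c.OnScraped(func(r *colly.Response) {
    fmt.Println("Finished", r.Request.URL)
  })

  c.Visit("https://example.com/")
}

Each page visited this way runs through the same OnRequest, OnResponse, OnHTML, and OnScraped sequence described above.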

This concludes this brief look at using the colly crawler framework in Golang. For more on the Go colly framework, see my earlier articles or the related articles below, and I hope you will continue to support this site.