Preface
This article introduces goquery, a Go library for parsing web pages, and is shared here for reference and study.
Java has Jsoup and Node.js has Cheerio, both of which make parsing web pages convenient. Go has a similarly capable tool in goquery, and its selector syntax is the same as jQuery's.
Install
go get github.com/PuerkitoBio/goquery
Usage
The following is essentially the demo that ships with the project.
package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	// Pass the URL of the page to scrape
	doc, err := goquery.NewDocument("")
	if err != nil {
		log.Fatal(err)
	}

	// Find the review items
	doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
		// For each item found, get the band and title
		band := s.Find("a").Text()
		title := s.Find("i").Text()
		fmt.Printf("Review %d: %s - %s\n", i, band, title)
	})
}

func main() {
	ExampleScrape()
}
Garbled text problem
Chinese web pages often come out garbled because goquery expects UTF-8 input by default, while many of these pages use another encoding such as GB2312. In that case the page has to be transcoded to UTF-8 before it is parsed.
Install iconv-go
go get github.com/djimenez/iconv-go
How to use
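import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
	iconv "github.com/djimenez/iconv-go"
)

// baseUrl is assumed to be defined elsewhere as the address of the page to fetch.
func ExampleScrape() {
	res, err := http.Get(baseUrl)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	defer res.Body.Close()

	// Wrap the body in a reader that converts GB2312 to UTF-8
	utfBody, err := iconv.NewReader(res.Body, "gb2312", "utf-8")
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	doc, err := goquery.NewDocumentFromReader(utfBody)
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	// You can use doc to obtain the structured data in the web page, for example:
	doc.Find("li").Each(func(i int, s *goquery.Selection) {
		fmt.Println(i, s.Text())
	})
}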
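The example above hard-codes "gb2312" as the source encoding. Many servers declare the encoding in the Content-Type response header, so one option is to read it from there instead. The following is a minimal sketch using only the standard library; the helper name charsetFromContentType and the fallback value are assumptions made for illustration and are not part of goquery or iconv-go.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
	"unicode/utf8"
)

// charsetFromContentType (hypothetical helper) returns the lower-cased
// charset parameter of a Content-Type header value, or fallback when the
// header is missing, malformed, or declares no charset.
func charsetFromContentType(contentType, fallback string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return fallback
	}
	if cs := strings.TrimSpace(params["charset"]); cs != "" {
		return strings.ToLower(cs)
	}
	return fallback
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=GB2312", "utf-8"))
	fmt.Println(charsetFromContentType("text/html", "utf-8"))

	// The GB2312 bytes for two Chinese characters are not valid UTF-8,
	// which is why parsing them untranscoded produces garbled text.
	gb := []byte{0xD6, 0xD0, 0xCE, 0xC4}
	fmt.Println(utf8.Valid(gb))
}
```

With a real response, the header value would come from res.Header.Get("Content-Type"), and the result could be passed as the source encoding to the transcoder. Pages that declare their encoding only in a meta tag are not covered by this sketch.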
Advanced
Some websites check cookies, the Referer header, and similar values. You can set the request headers before sending the HTTP request.
This is not part of goquery. To learn more, see the net/http package in Go's standard library.
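baseUrl := ""
client := &http.Client{}
req, err := http.NewRequest("GET", baseUrl, nil)
if err != nil {
	log.Fatal(err)
}
req.Header.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
req.Header.Add("Referer", baseUrl)
req.Header.Add("Cookie", "your cookie") // You can also set cookies through req.AddCookie()
res, err := client.Do(req)
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close()
// Finally, pass res directly to goquery to parse the web page
doc, err := goquery.NewDocumentFromResponse(res)
if err != nil {
	log.Fatal(err)
}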
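To see which headers a request actually carries, it can help to send it to a local test server first. The sketch below uses net/http/httptest from the standard library. The helper names fetchWithHeaders, newEchoServer and demo, the cookie name "session", and the header values are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchWithHeaders sends a GET request with a custom User-Agent, Referer
// and optional cookie, and returns the response body as a string.
func fetchWithHeaders(client *http.Client, url, userAgent, referer string, cookie *http.Cookie) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", userAgent)
	req.Header.Set("Referer", referer)
	if cookie != nil {
		req.AddCookie(cookie) // alternative to setting the Cookie header by hand
	}

	res, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// newEchoServer starts a local server that echoes back the User-Agent,
// Referer and "session" cookie it received.
func newEchoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		session := ""
		if c, err := r.Cookie("session"); err == nil {
			session = c.Value
		}
		fmt.Fprintf(w, "ua=%s referer=%s session=%s", r.UserAgent(), r.Referer(), session)
	}))
}

// demo runs one request against the echo server and returns what it saw.
func demo() (string, error) {
	srv := newEchoServer()
	defer srv.Close()
	cookie := &http.Cookie{Name: "session", Value: "abc123"}
	return fetchWithHeaders(srv.Client(), srv.URL, "demo-agent/1.0", "http://example.com/", cookie)
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(out)
}
```

Once the headers look right, the same request can be pointed at the real site and the response handed to goquery as shown above.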
Summary
That is all for this article. I hope it is a useful reference for your study or work.
References
- https://github.com/PuerkitoBio/goquery
- https://github.com/PuerkitoBio/goquery/issues/185
- https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks#handle-non-utf8-html-pages