Introduction
I have recently been studying web crawling in Go and using the goquery library, mainly to select and find matching content in crawled HTML. goquery selectors come up constantly, and several of them are rarely used but very handy, so here is a summary for reference.
If you have done front-end development, you will be familiar with jQuery. goquery is modeled on jQuery and is effectively its Go counterpart, which makes processing HTML very convenient.
Selectors based on HTML elements
This is the simplest kind. To select basic HTML elements such as `a` or `p`, use the element name directly as the selector, for example `dom.Find("div")`.
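
    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        html := `<body>
            <div>DIV1</div>
            <div>DIV2</div>
            <span>SPAN</span>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Print the text of every div element.
        dom.Find("div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
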
The example above selects the `div` elements; `body` and `span` are not selected.
ID selector
This is the most frequently used selector. The example above has two `div` elements, but suppose we only need one of them. If we give that element a unique `id`, we can use the `id` selector to locate it precisely.
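
    func main() {
        html := `<body>
            <div id="div1">DIV1</div>
            <div>DIV2</div>
            <span>SPAN</span>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Only the element whose id is div1 is selected.
        dom.Find("#div1").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
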
Element ID selector
The `id` selector starts with `#`, followed by the element's `id` value, so the syntax is `dom.Find("#id")`. I will abbreviate this as `Find(#id)` in the examples that follow; that is just shorthand for the goquery selector.
What if the same ID appears on different kinds of HTML elements? It can be combined with an element selector. For example, to select the element that is a `div` and whose `id` is `div1`, use `Find(div#div1)`.
The syntax for this kind of selector is therefore `Find(element#id)`. This is a common way of combining selectors, and the selectors described later can be combined in the same way.
Class selector
`class` is another commonly used attribute in HTML. The `class` selector lets us quickly select the elements we need; its usage is similar to the `id` selector: `Find(".class")`.
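
    func main() {
        html := `<body>
            <div>DIV1</div>
            <div class="name">DIV2</div>
            <span>SPAN</span>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select every element whose class is name.
        dom.Find(".name").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
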
The example above selects the `div` element whose `class` is `name`.
Element class selector
Like the `id` selector, the `class` selector can be combined with an HTML element. The syntax is similar, `Find(element.class)`, which selects elements of a specific type that also have the given class.
Attribute selector
An HTML element has attributes and attribute values, so we can also select elements by attribute and value.
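
    func main() {
        html := `<body>
            <div>DIV1</div>
            <div class="name">DIV2</div>
            <span>SPAN</span>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select div elements that have a class attribute.
        dom.Find("div[class]").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
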
In this example, the selector `div[class]` selects elements that are `div` and have a `class` attribute, so the first `div` is not selected.
The example above selects on whether an attribute is present. Similarly, we can select elements whose attribute has a specific value.

    dom.Find("div[class=name]").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

This selects the `div` element whose `class` attribute has the value `name`.
`class` is only an example here; other attributes such as `href` work as well, and so do custom attributes.
Besides exact equality there are other matching operators. They are used in the same way, so they are listed together here.
Selector | Description |
---|---|
Find("div[lang]") | Selects div elements that have a lang attribute |
Find("div[lang=zh]") | Selects div elements whose lang attribute equals zh |
Find("div[lang!=zh]") | Selects div elements whose lang attribute is not equal to zh |
Find("div[lang\|=zh]") | Selects div elements whose lang attribute is zh or starts with zh- |
Find("div[lang*=zh]") | Selects div elements whose lang attribute contains the string zh |
Find("div[lang~=zh]") | Selects div elements whose lang attribute contains the word zh, where words are separated by spaces |
Find("div[lang$=zh]") | Selects div elements whose lang attribute ends with zh (case sensitive) |
Find("div[lang^=zh]") | Selects div elements whose lang attribute starts with zh (case sensitive) |
These examples all use a single attribute selector. You can also combine several, for example `Find("div[id][lang=zh]")`, by chaining the brackets. When several attribute selectors are combined, an element must satisfy all of them to be selected.
parent>child selector
To select the child elements under an element that meet some condition, use the child selector. Its syntax is `Find("parent>child")`, which selects the direct (first-level) children of parent that match child.
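
    func main() {
        html := `<body>
            <div lang="ZH">DIV1</div>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <div>DIV4</div>
            </span>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select only the div elements that are direct children of body.
        dom.Find("body>div").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
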
The example above selects the `div` elements that are direct children of `body`, giving DIV1, DIV2, and DIV3. DIV4 is also a descendant of `body`, but not a direct child, so it is not selected.
So what if we also want DIV4, that is, every `div` under `body` regardless of whether it is at the first, second, or Nth level? goquery handles this too: just replace the greater-than sign (`>`) with a space. For the example above, the selector becomes:

    dom.Find("body div").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

prev+next adjacent selector
Suppose the element we want has nothing distinctive to select on, but the element immediately before it does. In that case we can use the adjacent (next sibling) selector.
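
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <div>DIV4</div>
            </span>
            <p>P2</p>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select the p element that immediately follows div[lang=zh].
        dom.Find("div[lang=zh]+p").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
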
This example demonstrates the usage. We want `<p>P1</p>`, but it has no distinguishing feature. `<div lang="zh">DIV1</div>`, however, is easy to select, so `Find("div[lang=zh]+p")` gets us the `p` element.
The syntax of this selector is `Find("prev+next")`, with a plus sign (`+`) in the middle; the `+` is itself part of the selector.
prev~next selector
Besides the adjacent selector there is the sibling selector, which does not require the elements to be adjacent, only that they share a parent element.

    dom.Find("div[lang=zh]~p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

Taking the previous example and changing `+` to `~` also selects P2, because P1 and P2 are both siblings that follow DIV1.
The syntax of the sibling selector is `Find("prev~next")`, that is, the adjacent selector with `+` replaced by `~`.
Content filter
Sometimes, after selecting with a selector, we want to narrow the result further. That is what filters are for. There are many of them; let's start with the content filters.

    dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

`Find(":contains(text)")` selects elements that contain the given text. In this example the selected `div` must contain the text DIV2, so only the DIV2 element qualifies.
In addition, `Find(":empty")` selects only elements that have no children at all (text nodes included).
`Find(":has(selector)")` is similar to `contains`, except that what must be contained is an element node matching the selector.

    dom.Find("span:has(div)").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })

The example above selects the `span` nodes that contain a `div` element.
:first-child filter
The `:first-child` filter, with syntax `Find(":first-child")`, selects an element only if it is the first child of its parent.
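
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <div style="display:none;">DIV4</div>
                <div>DIV5</div>
            </span>
            <p>P2</p>
            <div></div>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select div elements that are the first child of their parent.
        dom.Find("div:first-child").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
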
In the example above, `Find("div")` would select every `div` element, but with `:first-child` added only DIV1 and DIV4 remain, because only those two are the first child of their parent. The other `div` elements do not qualify.
:first-of-type filter
`:first-child` is quite strict: the element must be the very first child, so it cannot be used if any other element precedes it. This is where `:first-of-type` comes in handy. It only requires the element to be the first of its type among its siblings. Let's tweak the example above.
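
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <p>P2</p>
                <div>DIV5</div>
            </span>
            <div></div>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select div elements that are the first div under their parent.
        dom.Find("div:first-of-type").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
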
The change is simple: the original DIV4 becomes P2. With `:first-child`, DIV5 could not be selected, because it is not the first child; P2 comes before it. With `:first-of-type` it is selected, because it only has to be the first of its own type. DIV5 is the first `div` under its parent, and P2 is not a `div`, so it is ignored.
:last-child and :last-of-type filters
These two are the exact opposites of `:first-child` and `:first-of-type`: they select the last one instead of the first. They work the same way, so they are easy to try yourself.
:nth-child(n) filter
This selects an element that is the nth child of its parent, where n starts at 1. So `:first-child` and `:nth-child(1)` are equivalent. By specifying `n` we can select the elements we need very flexibly.
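
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <p>P1</p>
            <div lang="zh-cn">DIV2</div>
            <div lang="en">DIV3</div>
            <span>
                <p>P2</p>
                <div>DIV5</div>
            </span>
            <div></div>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select div elements that are the third child of their parent.
        dom.Find("div:nth-child(3)").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
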
This example selects DIV2, because DIV2 is the third child of its parent `body`.
:nth-of-type(n) filter
`:nth-of-type(n)` is similar to `:nth-child(n)`, except that it counts only sibling elements of the same type, so `:nth-of-type(1)` and `:first-of-type` are equivalent. It is easy to try yourself.
:nth-last-child(n) and :nth-last-of-type(n) filters
These two are similar to the ones above, but they count in reverse, treating the last element as the first. Test them yourself; the effect is easy to see.
:only-child filter
As the name suggests, `Find(":only-child")` selects an element only if it is the sole child of its parent, that is, its parent has no other child elements.
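
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <span>
                <div>DIV5</div>
            </span>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select div elements that are the only child of their parent.
        dom.Find("div:only-child").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
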
In the example, DIV5 is selected because it is the only child of its parent `span`. DIV1 is not an only child, so it is not selected.
:only-of-type filter
What if we want to select DIV1 in the example above? We can use `Find(":only-of-type")`, because DIV1 is the only `div` element under its parent. That is what `:only-of-type` does: an element is selected as long as it is the only one of its type under its parent. Change the example above to `:only-of-type` and check that DIV1 appears.
Selector OR (|) operation
What if we want to select `div`, `span`, and other elements together? Several selectors can be combined, separated by commas (`,`). `Find("selector1, selector2, selectorN")` selects an element as long as it matches any one of the selectors, that is, a logical OR (|) of the selectors.
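
    func main() {
        html := `<body>
            <div lang="zh">DIV1</div>
            <span>
                <div>DIV5</div>
            </span>
        </body>`

        dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            log.Fatalln(err)
        }

        // Select every element that is either a div or a span.
        dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
            fmt.Println(selection.Text())
        })
    }
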
Summary
goquery is an essential tool for parsing HTML pages. When crawling, flexible use of its different selectors gets twice the result with half the effort and makes a crawler much more efficient.