XPath expressions
1. XPath syntax
The examples in this section use the following XML document:

    <bookstore>
      <book>
        <title lang="eng">Harry Potter</title>
        <price>999</price>
      </book>
      <book>
        <title lang="eng">Learning XML</title>
        <price>888</price>
      </book>
    </bookstore>

1.1 Selection of nodes
XPath uses path expressions to select nodes or sets of nodes in an XML document. These path expressions are very similar to those we see in regular computer file systems.
When you select tags with the Chrome plugin, it adds the attribute class="xh-highlight" to each selected tag. The plugin adds this attribute itself; it is not part of the original page.
The most useful expressions are listed below:
Expression | Description |
---|---|
nodename | Selects all child elements of the current node with this name. |
/ | At the start of a path, selects from the root node; between steps, separates a parent from its children. |
// | Selects matching nodes at any depth below the current node (anywhere in the document when it starts the path). |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects an attribute. |
text() | Selects text nodes. |
Examples
Path expression | Result |
---|---|
bookstore | Selects all child elements of the current node named bookstore. |
/bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element. |
bookstore/book | Selects all book elements that are children of bookstore. |
//book | Selects all book elements, regardless of their position in the document. |
bookstore//book | Selects all book elements that are descendants of the bookstore element, regardless of where they are located under the bookstore. |
//book/title/@lang | Selects the lang attribute of every title element that is a child of a book element. |
//book/title/text() | Selects the text of every title element that is a child of a book element. |
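
The expressions in the table can be tried out with lxml, the library used in the example at the end of this article. The following is a minimal sketch, assuming lxml is installed; the variable names are illustrative only.

```python
from lxml import etree

# The sample document from the start of this section
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>999</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>888</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml.strip())  # root is the <bookstore> element

books = root.xpath("/bookstore/book")       # absolute path: two <book> elements
langs = root.xpath("//book/title/@lang")    # attribute values come back as strings
titles = root.xpath("//book/title/text()")  # text nodes come back as strings
print(len(books), langs, titles)

# Relative expressions are evaluated from the element they are called on
first_book = books[0]
first_title = first_book.xpath("title/text()")    # child step by node name
first_price = first_book.xpath("./price/text()")  # "." is the current node
parent_tag = first_book.xpath("..")[0].tag        # ".." is the parent node
print(first_title, first_price, parent_tag)
```

Element results can be queried again with xpath(), while attribute and text() results are plain strings.
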
- Select the text directly under every h1
  - //h1/text()
- Get the href of every a tag
  - //a/@href
- Get the text of the title under the head in html
  - /html/head/title/text()
- Get the href of the link tag under the head in html
  - /html/head/link/@href
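
The four HTML expressions above can be checked against a small page. This is only a sketch: the page content below is made up for illustration, and it assumes lxml's etree.HTML parser, as in the example at the end of this article.

```python
from lxml import etree

# A made-up page used only to exercise the expressions
page = """
<html>
  <head>
    <title>Demo page</title>
    <link rel="stylesheet" href="/static/site.css">
  </head>
  <body>
    <h1>Welcome</h1>
    <a href="/first">first</a>
    <a href="/second">second</a>
  </body>
</html>
"""
html = etree.HTML(page)

h1_text = html.xpath("//h1/text()")                  # text directly under each h1
hrefs = html.xpath("//a/@href")                      # href of every a tag
page_title = html.xpath("/html/head/title/text()")   # text of the title in head
css_href = html.xpath("/html/head/link/@href")       # href of the link tag in head

print(h1_text, hrefs, page_title, css_href)
```
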
1.2 Finding a specific node
Path expression | Result |
---|---|
//title[@lang="eng"] | Selects all title elements with a lang attribute value of eng |
/bookstore/book[1] | Selects the first book element that is a child of bookstore. |
/bookstore/book[last()] | Selects the last book element that is a child of bookstore. |
/bookstore/book[last()-1] | Selects the penultimate book element that is a child of bookstore. |
/bookstore/book[position()>1] | Selects every book element that is a child of bookstore, starting from the second one. |
//book/title[text()='Harry Potter'] | Selects the title elements under book whose text is exactly Harry Potter. |
/bookstore/book[price>35.00]/title | Selects the title elements of every book under bookstore whose price element has a value greater than 35.00. |
Note: In XPath, positions start at 1: the first element is at position 1, the last is at last(), and the penultimate is at last()-1.
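
To see how these predicates behave on the sample document, here is a short sketch using lxml. The helper book_titles and the threshold 900 are introduced for illustration only and are not part of the original example.

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>999</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>888</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml.strip())


def book_titles(expr):
    """Return the title text of every <book> matched by expr (illustrative helper)."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = book_titles("/bookstore/book[1]")             # positions start at 1
last = book_titles("/bookstore/book[last()]")
penultimate = book_titles("/bookstore/book[last()-1]")  # equals book[1] here: only two books
after_first = book_titles("/bookstore/book[position()>1]")
above_35 = book_titles("/bookstore/book[price>35.00]")  # both prices exceed 35.00
above_900 = book_titles("/bookstore/book[price>900]")   # 900 is an arbitrary threshold

eng_titles = root.xpath("//title[@lang='eng']/text()")          # filter by attribute value
exact = root.xpath("//book/title[text()='Harry Potter']")       # filter by exact text

print(first, last, penultimate, after_first, above_35, above_900)
print(eng_titles, len(exact))
```
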
1.3 Selection of unknown nodes
XPath wildcards can be used to select unknown XML elements.
Wildcard | Description |
---|---|
* | Match any element node. |
@* | Match any attribute node. |
node() | Match any type of node. |
Examples
The following table lists some path expressions and their results:
Path expression | Result |
---|---|
/bookstore/* | Selects all child elements of the bookstore element. |
//* | Selects all elements in the document. |
//title[@*] | Selects all title elements that have at least one attribute. |
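
The difference between the three wildcards is easier to see in code. The sketch below assumes lxml and the sample document from the start of this section; all variable names are illustrative.

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>999</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>888</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

# * matches element nodes only
child_tags = [el.tag for el in root.xpath("/bookstore/*")]
all_tags = [el.tag for el in root.xpath("//*")]

# @* matches any attribute, whatever its name
titles_with_attrs = root.xpath("//title[@*]")
attr_values = root.xpath("//title/@*")

# node() matches every node type, so it also returns the whitespace
# text nodes between the child elements of the first book
element_children = root.xpath("/bookstore/book[1]/*")
all_nodes = root.xpath("/bookstore/book[1]/node()")

print(child_tags, all_tags)
print(len(titles_with_attrs), attr_values)
print(len(element_children), len(all_nodes))
```
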
1.4 Selection of a number of paths
By using the "|" operator in a path expression, you can select several paths.
Examples
The following table lists some path expressions and their results:
Path expression | Result |
---|---|
//book/title \| //book/price | Selects all title and price elements that are children of book elements. |
//title \| //price | Selects all title and price elements in the document. |
/bookstore/book/title \| //price | Selects all title elements of book elements under the bookstore element, as well as all price elements in the document. |
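
As a quick check of the "|" operator, the sketch below runs the three union expressions with lxml against the sample document from the start of this section. The variable names are illustrative. The combined result is a single node set, so a node matched by both sides appears only once.

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>999</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>888</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

# Each side of "|" is a complete path expression
book_fields = root.xpath("//book/title | //book/price")
any_fields = root.xpath("//title | //price")
mixed = root.xpath("/bookstore/book/title | //price")

field_tags = [el.tag for el in book_fields]
field_texts = [el.text for el in book_fields]

# Both sides match the same titles, so the union still holds two nodes
no_duplicates = root.xpath("//title | //book/title")

print(field_tags, field_texts)
print(len(any_fields), len(mixed), len(no_duplicates))
```
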
Example:

    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-1"><a href="" rel="external nofollow">first item</a></li>
            <li class="item-1"><a href="" rel="external nofollow">second item</a></li>
            <li class="item-inactive"><a href="" rel="external nofollow">third item</a></li>
            <li class="item-1"><a href="" rel="external nofollow">fourth item</a></li>
            <li class="item-0"><a href="" rel="external nofollow">fifth item</a>
        </ul>
    </div>
    '''

    # Parse the HTML string; the parser repairs the missing </li> on the last item
    html = etree.HTML(text)

    # Get the lists of hrefs and titles
    href_list = html.xpath("//li[@class='item-1']/a/@href")
    title_list = html.xpath("//li[@class='item-1']/a/text()")

    # Assemble into dictionaries. Pair the two lists by position with zip:
    # href_list.index(href) returns the first match, which is wrong when hrefs repeat.
    for href, title in zip(href_list, title_list):
        item = {"href": href, "title": title}
        print(item)

    # When an expression selects nodes, xpath() returns Element objects, and each
    # of them supports xpath() as well. A common pattern for data extraction is
    # to group by a tag first and then extract the fields inside each group.
    li_list = html.xpath("//li[@class='item-1']")

    # Continue data extraction within each group
    for li in li_list:
        hrefs = li.xpath("./a/@href")
        titles = li.xpath("./a/text()")
        item = {}
        item["href"] = hrefs[0] if hrefs else None
        item["title"] = titles[0] if titles else None
        print(item)