SoFunction
Updated on 2024-10-30

How to extract data with XPath expressions in Python

xpath expression

1. xpath syntax

The examples in this section use the following XML document:

<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>

1.1 Selection of nodes

XPath uses path expressions to select nodes or sets of nodes in an XML document. These path expressions are very similar to those we see in regular computer file systems.

Note: when you test expressions with the Chrome plugin, it adds the attribute class="xh-highlight" to every tag it matches. That attribute comes from the plugin and is not part of the original page.

The most useful expressions are listed below:

Expression   Description
nodename     Selects the elements with this name.
/            Selects from the root node; also separates one step of the path from the next.
//           Selects matching nodes in the document, starting from the current node, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.
text()       Selects text.

Examples

Path expression        Result
bookstore              Selects the bookstore elements that are children of the current node.
/bookstore             Selects the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element.
bookstore/book         Selects all book elements that are children of bookstore.
//book                 Selects all book elements, regardless of their position in the document.
bookstore//book        Selects all book elements that are descendants of the bookstore element, regardless of where they are located under bookstore.
//book/title/@lang     Selects the value of the lang attribute of every title under a book.
//book/title/text()    Selects the text of every title under a book.
  • Select all text under h1: //h1/text()
  • Get the href of all a tags: //a/@href
  • Get the text of the title under the head in html: /html/head/title/text()
  • Get the href of the link tag under the head in html: /html/head/link/@href
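
The basic path expressions above can be tried against the sample bookstore document with lxml. This is a minimal sketch, not the only way to run XPath in Python; the variable names (xml, root, books and so on) are chosen here for illustration.

```python
from lxml import etree

# The sample document from the start of this section.
xml = """
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
"""

root = etree.fromstring(xml)  # root is the <bookstore> element

books = root.xpath("/bookstore/book")        # absolute path from the root
all_books = root.xpath("//book")             # book elements at any depth
titles = root.xpath("//book/title/text()")   # text of each title
langs = root.xpath("//book/title/@lang")     # value of each lang attribute
parent = books[0].xpath("..")[0]             # parent of the first book

print(len(books), len(all_books))
print(titles)
print(langs)
print(parent.tag)
```

Expressions that end in text() or @attribute return strings, while expressions that end in an element name return element objects that accept further xpath() calls.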

1.2 Finding a specific node

Path expression                        Result
//title[@lang="eng"]                   Selects all title elements whose lang attribute has the value eng.
/bookstore/book[1]                     Selects the first book element that is a child of bookstore.
/bookstore/book[last()]                Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]              Selects the penultimate book element that is a child of bookstore.
/bookstore/book[position()>1]          Selects the book elements under bookstore, starting with the second one.
//book/title[text()='Harry Potter']    Selects only those title elements under a book whose text is Harry Potter.
/bookstore/book[price>35.00]/title     Selects the title elements of every book under bookstore whose price element has a value greater than 35.00.

Note: In xpath, the position of the first element is 1, the position of the last element is last(), and the penultimate element is last()-1.
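
The predicates above can be checked against the sample bookstore document. The sketch below assumes lxml and uses a price threshold of 900 instead of 35.00 so that the comparison separates the two sample books; the variable names are illustrative.

```python
from lxml import etree

# Same sample document as at the start of this section.
xml = """
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
"""
root = etree.fromstring(xml)

# Positional predicates: positions start at 1, not 0.
first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
penultimate = root.xpath("/bookstore/book[last()-1]/title/text()")
after_first = root.xpath("/bookstore/book[position()>1]/title/text()")

# Predicates on attribute values, text, and child element values.
by_attr = root.xpath("//title[@lang='eng']")
by_text = root.xpath("//book/title[text()='Harry Potter']")
expensive = root.xpath("/bookstore/book[price>900]/title/text()")

print(first, last, penultimate, after_first)
print(len(by_attr), len(by_text), expensive)
```

With only two books in the sample, book[1] and book[last()-1] select the same element.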

1.3 Selection of unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard   Description
*          Matches any element node.
@*         Matches any attribute node.
node()     Matches any node of any type.

Examples

The following table lists some path expressions and their results:

Path expression   Result
/bookstore/*      Selects all child elements of the bookstore element.
//*               Selects all elements in the document.
//title[@*]       Selects all title elements that have at least one attribute.
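
As a rough illustration of the wildcards, the following sketch runs them against the sample bookstore document with lxml. The variable names are chosen here for illustration.

```python
from lxml import etree

# Same sample document as at the start of this section.
xml = """
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
"""
root = etree.fromstring(xml)

children = root.xpath("/bookstore/*")      # every child element of bookstore
all_elements = root.xpath("//*")           # every element in the document
with_attrs = root.xpath("//title[@*]")     # titles that carry any attribute
attr_values = root.xpath("//@*")           # values of all attributes
# node() also matches text nodes, so it returns the title's text here.
title_nodes = root.xpath("/bookstore/book[1]/title/node()")

print([el.tag for el in children])
print(len(all_elements), len(with_attrs))
print(attr_values, title_nodes)
```

Note that * matches element nodes only, whereas node() also matches text, so the two can return different results for the same parent.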

1.4 Selection of a number of paths

By using the "|" operator in a path expression, you can select several paths.

Examples

The following table lists some path expressions and their results:

Path expression                    Result
//book/title | //book/price        Selects all title and price elements that are children of a book element.
//title | //price                  Selects all title and price elements in the document.
/bookstore/book/title | //price    Selects all title elements of the book elements under bookstore, as well as all price elements in the document.
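
The "|" operator can be tried on the sample bookstore document as follows. This is a small sketch assuming lxml; the variable names are illustrative.

```python
from lxml import etree

# Same sample document as at the start of this section.
xml = """
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
"""
root = etree.fromstring(xml)

# One query returns the nodes matched by either path.
both = root.xpath("//book/title | //book/price")
tags = [el.tag for el in both]

# The operator also combines text selections.
texts = root.xpath("//title/text() | //price/text()")

print(tags)
print(texts)
```

A node matched by both paths appears only once in the result, because the operator computes a set union.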

Example:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''

# Parse the string into an element tree; lxml repairs the unclosed <li>
html = etree.HTML(text)

# Get the lists of hrefs and titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")


# Assemble each pair into a dictionary.
# zip pairs the two lists by position; list.index() would always return 0
# here because every href is the same empty string.
for href, title in zip(href_list, title_list):
  item = {}
  item["href"] = href
  item["title"] = title
  print(item)

# When an expression selects nodes, xpath() returns element objects, and you can call xpath() on each of them again. This allows a two-step extraction: first group by a tag, then extract the data within each group.
li_list = html.xpath("//li[@class='item-1']")

# Continue data extraction within each group
for li in li_list:
  item = {}
  item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href"))>0 else None
  item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()"))>0 else None
  print(item)
