SoFunction
Updated on 2025-03-03

Using XPath in Python to extract data from parsed content

1. Preface

In the previous article, we showed how to fetch the raw content that contains the data we need. This article introduces one way to extract the required data from that content: XPath. Later articles will cover bs4 (BeautifulSoup) and regular expressions (re).

2. Main content

XPath uses path expressions to select nodes or node sets in HTML/XML documents. Nodes are selected by following a path, one step at a time.

In Python, XPath is available through the lxml library. Install it with: pip install lxml
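As a quick check that lxml is installed and working, the following minimal sketch parses a small HTML string and runs two XPath queries against it. The HTML fragment is made up for illustration only.

```python
from lxml import etree

# A small, made-up HTML fragment used only for illustration.
html_text = "<html><body><h1>Hello</h1><p class='intro'>XPath with lxml</p></body></html>"

# etree.HTML parses an HTML string and returns the root <html> element.
root = etree.HTML(html_text)

# For node-selecting expressions, xpath() returns a list, even for a single match.
heading = root.xpath("//h1/text()")
intro = root.xpath("//p[@class='intro']/text()")
print(heading, intro)
```

Because the result is a list, an empty list means "no match", which is worth checking before indexing with [0].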

Selecting nodes

Expression    Description
nodename      Selects all child elements named nodename of the current node.
/             Selects from the root node (steps to child nodes).
//            Selects matching nodes anywhere in the document, regardless of their location (steps to descendant nodes).
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects attributes.
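The difference between these expressions is easiest to see when a query starts from an element other than the root. The sketch below uses a made-up bookstore document and evaluates expressions relative to the second book element. Note in particular that a leading "//" searches the whole document even when called on a sub-element, while ".//" stays below the current node.

```python
from lxml import etree

# Made-up sample document, used only for illustration.
xml_text = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="fr">Le Petit Prince</title>
    <price>12.50</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(xml_text)
second_book = root.xpath("/bookstore/book")[1]

# "." is the current (context) node, ".." is its parent.
own_title = second_book.xpath("./title/text()")
parent_tag = second_book.xpath("..")[0].tag

# A leading "//" starts again from the document root ...
all_titles = second_book.xpath("//title/text()")
# ... while ".//" only looks at descendants of the current node.
local_titles = second_book.xpath(".//title/text()")

# "@" selects an attribute; the result is a list of attribute values.
lang = second_book.xpath("./title/@lang")

print(own_title, parent_tag, all_titles, local_titles, lang)
```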

Path expressions

Path expression    Result
bookstore          Selects all bookstore elements that are children of the current node.
/bookstore         Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
bookstore/book     Selects all book elements that are children of bookstore.
//book             Selects all book elements, regardless of their location in the document.
bookstore//book    Selects all book elements that are descendants of the bookstore element, regardless of where they are located under bookstore.
//@lang            Selects all attributes named lang.
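The path expressions in the table can be tried out with lxml as in the sketch below. The document is a made-up example in which one book is nested inside a hypothetical section element, so that child steps ( / ) and descendant steps ( // ) give different results. Since the queries are run on the root bookstore element, the relative paths are written with a leading slash or relative to that element.

```python
from lxml import etree

# Made-up sample: the second book is nested one level deeper on purpose.
xml_text = """
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <section>
    <book><title lang="fr">Le Petit Prince</title></book>
  </section>
</bookstore>
""".strip()

root = etree.fromstring(xml_text)


def titles(path):
    """Return the title text of every book element matched by path."""
    return [book.findtext("title") for book in root.xpath(path)]


root_tag = root.xpath("/bookstore")[0].tag  # the root element itself
direct = titles("/bookstore/book")          # only books that are children of bookstore
anywhere = titles("//book")                 # every book in the document
descendants = titles("/bookstore//book")    # every book below bookstore, at any depth
relative = titles("book")                   # relative path: children of the context node (root)
langs = root.xpath("//@lang")               # every lang attribute value

print(root_tag, direct, anywhere, descendants, relative, langs)
```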

Predicates

A predicate is used to find a specific node or a node that contains a specified value.

Predicates are written inside square brackets. Positions in XPath start at 1, not 0.

Path expression                  Result
/bookstore/book[1]               Selects the first book element that is a child of bookstore.
/bookstore/book[last()]          Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]        Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]    Selects the first two book elements that are children of bookstore.
//title[@lang]                   Selects all title elements that have an attribute named lang.
//title[@lang='eng']             Selects all title elements whose lang attribute has the value eng.
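The predicates above can be checked with a short script. This is a sketch on a made-up three-book document in which the third title deliberately has no lang attribute.

```python
from lxml import etree

# Made-up sample with three books; the last title has no lang attribute.
xml_text = """
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
  <book><title>Learning XML</title></book>
</bookstore>
""".strip()

root = etree.fromstring(xml_text)

# Positional predicates (XPath positions start at 1).
first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
second_to_last = root.xpath("/bookstore/book[last()-1]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")

# Attribute predicates: attribute exists, or attribute has a given value.
with_lang = root.xpath("//title[@lang]/text()")
english = root.xpath("//title[@lang='eng']/text()")

print(first, last, second_to_last, first_two, with_lang, english)
```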

Selecting unknown nodes

Wildcard    Description
*           Matches any element node.
@*          Matches any attribute node.
node()      Matches a node of any type.

The following table lists some path expressions and their results:

Path expression    Result
/bookstore/*       Selects all child elements of the bookstore element.
//*                Selects all elements in the document.
//title[@*]        Selects all title elements that have at least one attribute.
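A minimal sketch of the wildcards in action, again on a made-up document. Only the first title carries an attribute, so the [@*] predicate matches just that one.

```python
from lxml import etree

# Made-up sample: only the first title has an attribute.
xml_text = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(xml_text)

# "*" matches any element: here, the two book children of bookstore.
child_tags = [el.tag for el in root.xpath("/bookstore/*")]

# "//*" matches every element in the document.
element_count = len(root.xpath("//*"))

# "[@*]" keeps only elements that have at least one attribute.
titles_with_attrs = root.xpath("//title[@*]/text()")

# "@*" selects the attributes themselves (as values in lxml).
attr_values = root.xpath("//title/@*")

# "node()" matches any node type; a title's only child is its text node.
title_nodes = root.xpath("/bookstore/book[1]/title/node()")

print(child_tags, element_count, titles_with_attrs, attr_values, title_nodes)
```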

Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.

Path expression                     Result
//book/title | //book/price         Selects all title and price elements that are children of book elements.
//title | //price                   Selects all title and price elements in the document.
/bookstore/book/title | //price     Selects all title elements of book elements that are children of bookstore, and all price elements in the document.
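The "|" operator can be tried with the sketch below. The document is made up and includes a hypothetical magazine element, so that the three expressions return different numbers of nodes. The results are sorted before printing, since the example does not rely on any particular result order.

```python
from lxml import etree

# Made-up sample: the magazine has a title and price outside any book element.
xml_text = """
<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
  <magazine><title>Weekly</title><price>4.50</price></magazine>
</bookstore>
""".strip()

root = etree.fromstring(xml_text)


def texts(path):
    """Return the sorted text content of all elements matched by path."""
    return sorted(el.text for el in root.xpath(path))


# Titles and prices of books only.
book_fields = texts("//book/title | //book/price")
# Every title and every price, including the magazine's.
all_fields = texts("//title | //price")
# Book titles, plus every price in the document.
mixed = texts("/bookstore/book/title | //price")

print(book_fields, all_fields, mixed)
```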

3. Example

Here is some sample code:

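    # -*- coding:utf-8 -*-
    import requests
    from lxml import etree


    class DouGuo(object):
        def __init__(self):
            # NOTE: the scheme and host of this URL are missing in the source;
            # prepend the site's base URL before running.
            self.url = "/caipu/%E5%AE%B6%E5%B8%B8%E8%8F%9C/0/20"
            self.headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
            }

        def get_data_index(self):
            # Fetch the listing page; return its HTML text, or None on failure
            response = requests.get(self.url, headers=self.headers)
            response.encoding = "utf-8"
            if response.status_code == 200:
                return response.text
            else:
                return None

        def parse_data_index(self, response):
            # Parse the HTML text into an element tree
            html = etree.HTML(response)
            # Each matching <li> holds one entry of the list
            data_list = html.xpath('//ul[@class="cook-list"]//li[@class="clearfix"]')
            for data in data_list:
                # Extract text values
                title = data.xpath("./div/a/text()")[0]
                major = data.xpath("./div/p/text()")[0]
                # Extract attribute values
                head = data.xpath("./div/div[2]/a/img/@alt")[0]
                score = data.xpath("./div/div[1]//span/text()")[0]
                print(f"title: {title}\nmajor: {major}\nhead:{head}\nscore:{score}\n\n")

        def run(self):
            response = self.get_data_index()
            # print(response)
            # Only parse when the request succeeded
            if response is not None:
                self.parse_data_index(response)


    if __name__ == '__main__':
        spider = DouGuo()
        spider.run()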
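The XPath expressions in the sample can also be tried without any network access. The sketch below runs them against a made-up HTML fragment whose structure is only assumed to resemble the real page; the sample values and the helper first() are hypothetical. Indexing a result with [0] raises IndexError when nothing matches, so the helper returns a default value instead.

```python
from lxml import etree

# Made-up HTML; the structure is an assumption that fits the XPath in the sample.
html_text = """
<ul class="cook-list">
  <li class="clearfix">
    <div>
      <a>Tomato Egg Stir-fry</a>
      <p>tomato, egg</p>
      <div><span>4.8</span></div>
      <div><a><img alt="chef_anna"/></a></div>
    </div>
  </li>
  <li class="clearfix">
    <div>
      <a>Plain Rice</a>
    </div>
  </li>
</ul>
"""


def first(node, path, default=None):
    """Return the first XPath string result, stripped, or default if there is none."""
    result = node.xpath(path)
    return result[0].strip() if result else default


html = etree.HTML(html_text)
rows = []
for item in html.xpath('//ul[@class="cook-list"]//li[@class="clearfix"]'):
    rows.append({
        "title": first(item, "./div/a/text()"),
        "major": first(item, "./div/p/text()"),
        "head": first(item, "./div/div[2]/a/img/@alt"),
        "score": first(item, "./div/div[1]//span/text()"),
    })

for row in rows:
    print(row)
```

The second list item has no ingredients, image, or score, so those fields come back as None instead of stopping the loop with an exception.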

4. Conclusion

As an exercise, try scraping this URL: /ershoufang/

Fetching the data from page 1 is enough to start with. You can then think about how to fetch multiple pages to implement pagination.
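One possible starting point for pagination is sketched below. Everything in it is an assumption made for illustration: the function scrape_pages, the "/list/page/N" URL pattern, and the fake_site dictionary that stands in for real HTTP requests. The real site's page URL pattern has to be found by inspecting its pagination links.

```python
from lxml import etree


def scrape_pages(fetch, page_urls, item_xpath):
    """Collect item_xpath matches from each page; stop at the first empty page."""
    items = []
    for url in page_urls:
        html_text = fetch(url)
        if not html_text:
            break  # request failed or page does not exist
        found = etree.HTML(html_text).xpath(item_xpath)
        if not found:
            break  # no items on this page: assume we are past the last page
        items.extend(found)
    return items


# Stand-in for real HTTP requests (hypothetical URLs and content).
fake_site = {
    "/list/page/1": "<ul><li>a</li><li>b</li></ul>",
    "/list/page/2": "<ul><li>c</li></ul>",
    "/list/page/3": "<ul></ul>",
}

urls = [f"/list/page/{n}" for n in range(1, 5)]
items = scrape_pages(fake_site.get, urls, "//li/text()")
print(items)
```

With a real site, fetch would be a function that wraps requests.get and returns the response text or None, in the same way as get_data_index in the example.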
