1. Preface
In the previous article, we showed how to fetch the raw content that holds the data we need. Today we introduce one of the methods for extracting that data: XPath. Later articles will cover bs4 (BeautifulSoup) and re (regular expressions).
2. Text
XPath uses path expressions to select nodes or node sets in HTML/XML documents. Nodes are selected by following a path, or a series of steps.
In Python, XPath is available through the lxml library. Install it with: pip install lxml
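
As a quick check that lxml is installed and working, the sketch below parses a small HTML string and runs two XPath queries against it. The HTML fragment and the variable names are made up for illustration; only etree.HTML and the xpath() method come from lxml.

```python
from lxml import etree

# Hypothetical HTML fragment, used only for illustration.
html_text = "<html><body><h1>Hello</h1><p class='intro'>XPath demo</p></body></html>"

# etree.HTML parses an HTML string and returns the root <html> element.
root = etree.HTML(html_text)

# For node-set expressions, xpath() returns a list (empty if nothing matches).
headings = root.xpath("//h1/text()")
intro = root.xpath("//p[@class='intro']/text()")

print(headings)  # ['Hello']
print(intro)     # ['XPath demo']
```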
Select a node
Expression | Description |
---|---|
nodename | Select all child elements of the current node that are named nodename. |
/ | Select from the root node (takes child nodes). |
// | Select matching nodes at any depth below the starting node, regardless of their location (takes descendant nodes). |
. | Select the current node. |
.. | Select the parent node of the current node. |
@ | Select attributes. |
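
The basic expressions in the table can be tried out on a small document. The following is a minimal sketch; the bookstore XML is an invented sample, not data from a real site.

```python
from lxml import etree

# Invented sample document for illustration.
xml_text = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(xml_text)  # root is the <bookstore> element

books = root.xpath("/bookstore/book")   # "/"  : step from the root to child nodes
titles = root.xpath("//title")          # "//" : descendants at any depth
first_title = titles[0]
current = first_title.xpath(".")        # "."  : the current node
parent = first_title.xpath("..")        # ".." : the parent of the current node
langs = root.xpath("//title/@lang")     # "@"  : attribute values

print(len(books), len(titles))  # 2 2
print(current[0].tag)           # title
print(parent[0].tag)            # book
print(langs)                    # ['eng', 'eng']
```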
Path expressions
Path expression | Result |
---|---|
bookstore | Select all bookstore elements that are children of the current node. |
/bookstore | Select the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element. |
bookstore/book | Select all book elements that are children of bookstore. |
//book | Select all book elements, regardless of their location in the document. |
bookstore//book | Select all book elements that are descendants of the bookstore element, regardless of where they sit under bookstore. |
//@lang | Select all attributes named lang. |
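
To see the difference between a child step and a descendant step, the sketch below nests one book inside an extra shelf element. The document and the shelf element are assumptions made for this example.

```python
from lxml import etree

# Invented sample: one book directly under bookstore, one nested inside <shelf>.
xml_text = """<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <shelf>
    <book><title lang="fr">Le Petit Prince</title></book>
  </shelf>
</bookstore>"""

root = etree.fromstring(xml_text)

root_elems = root.xpath("/bookstore")          # the root element itself
child_books = root.xpath("/bookstore/book")    # direct children only
desc_books = root.xpath("/bookstore//book")    # any depth under bookstore
all_books = root.xpath("//book")               # anywhere in the document
relative_books = root.xpath("book")            # relative to root: its book children
langs = root.xpath("//@lang")                  # every lang attribute value

print(len(root_elems), len(child_books), len(desc_books), len(all_books))  # 1 1 2 2
print(len(relative_books))  # 1
print(sorted(langs))        # ['eng', 'fr']
```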
Predicates
A predicate is used to find a specific node, or a node that contains a specified value.
Predicates are written inside square brackets.
Path expression | result |
---|---|
/bookstore/book[1] | Select the first book element that is a child of the bookstore element. |
/bookstore/book[last()] | Select the last book element that is a child of the bookstore element. |
/bookstore/book[last()-1] | Select the second-to-last book element that is a child of the bookstore element. |
/bookstore/book[position()<3] | Select the first two book elements that are children of the bookstore element. |
//title[@lang] | Select all title elements that have an attribute named lang. |
//title[@lang='eng'] | Select all title elements that have a lang attribute with the value eng. |
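
The predicates in the table can be checked against a three-book document. This is only a sketch: the XML and the helper titles_of are invented for the example. Note that XPath positions start at 1, not 0.

```python
from lxml import etree

# Invented sample: three books, the third title has no lang attribute.
xml_text = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>39.95</price></book>
  <book><title>Untitled Draft</title><price>10.00</price></book>
</bookstore>"""

root = etree.fromstring(xml_text)


def titles_of(book_expr):
    """Hypothetical helper: title texts of the books matched by book_expr."""
    return root.xpath(book_expr + "/title/text()")


first = titles_of("/bookstore/book[1]")
last = titles_of("/bookstore/book[last()]")
second_to_last = titles_of("/bookstore/book[last()-1]")
first_two = titles_of("/bookstore/book[position()<3]")
with_lang = root.xpath("//title[@lang]/text()")
eng_only = root.xpath("//title[@lang='eng']/text()")

print(first)           # ['Harry Potter']
print(last)            # ['Untitled Draft']
print(second_to_last)  # ['Le Petit Prince']
print(len(first_two), len(with_lang))  # 2 2
print(eng_only)        # ['Harry Potter']
```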
Select unknown nodes
Wildcard | Description |
---|---|
* | Match any element node. |
@* | Match any attribute node. |
node() | Match any type of node. |
The following table lists some path expressions and their results:
Path expression | result |
---|---|
/bookstore/* | Select all child elements of the bookstore element. |
//* | Select all elements in the document. |
//title[@*] | Select all title elements that have at least one attribute. |
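
A short sketch of the wildcards in action, again on an invented document. It also shows that node() matches text nodes as well as elements, which is why it returns more results than * for the same parent.

```python
from lxml import etree

# Invented sample: only the first book and its title carry attributes.
xml_text = """<bookstore>
  <book category="web"><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title>Plain Title</title><price>10.00</price></book>
</bookstore>"""

root = etree.fromstring(xml_text)

child_tags = [e.tag for e in root.xpath("/bookstore/*")]   # element children only
all_elements = root.xpath("//*")                           # every element in the document
titles_with_attrs = root.xpath("//title[@*]/text()")       # titles with any attribute
book_attrs = root.xpath("//book/@*")                       # all attribute values on book
child_nodes = root.xpath("/bookstore/node()")              # elements plus whitespace text nodes

print(child_tags)          # ['book', 'book']
print(len(all_elements))   # 7
print(titles_with_attrs)   # ['Learning XML']
print(book_attrs)          # ['web']
print(len(child_nodes) > len(child_tags))  # True
```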
Select several nodes
By using the "|" operator in a path expression, you can select several paths.
Path expression | Result |
---|---|
//book/title \| //book/price | Select all title and price elements of the book element. |
//title \| //price | Select all title and price elements in the document. |
/bookstore/book/title \| //price | Select all title elements of the book element that belongs to the bookstore element, and all price elements in the document. |
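
The "|" operator can be sketched as follows. The document is invented; the extra price element outside any book is there only to show that //price reaches further than //book/price.

```python
from lxml import etree

# Invented sample: two books plus one <price> that is not inside a book.
xml_text = """<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
  <price>5.00</price>
</bookstore>"""

root = etree.fromstring(xml_text)

# Union of two paths: a node matched by either side appears once in the result.
book_parts = root.xpath("//book/title | //book/price")
doc_parts = root.xpath("//title | //price")
mixed = root.xpath("/bookstore/book/title | //price")

print(len(book_parts))  # 4
print(len(doc_parts))   # 5
print(len(mixed))       # 5
```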
3. Example
Here is some sample code:
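
    # -*- coding:utf-8 -*-
    import requests
    from lxml import etree


    class DouGuo(object):
        def __init__(self):
            # NOTE: the attribute names "url" and "headers" are reconstructed;
            # the source listing lost them. The host part of the URL is also
            # missing in the source, so add the site's scheme and host before running.
            self.url = "/caipu/%E5%AE%B6%E5%B8%B8%E8%8F%9C/0/20"
            self.headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
            }

        def get_data_index(self):
            # Fetch the list page and return its HTML text, or None on failure.
            response = requests.get(self.url, headers=self.headers)
            response.encoding = "utf-8"
            if response.status_code == 200:
                return response.text
            else:
                return None

        def parse_data_index(self, response):
            html = etree.HTML(response)
            data_list = html.xpath('//ul[@class="cook-list"]//li[@class="clearfix"]')
            for data in data_list:
                # Extract text values
                title = data.xpath("./div/a/text()")[0]
                major = data.xpath("./div/p/text()")[0]
                # Extract attribute values
                head = data.xpath("./div/div[2]/a/img/@alt")[0]
                score = data.xpath("./div/div[1]//span/text()")[0]
                print(f"title: {title}\nmajor: {major}\nhead:{head}\nscore:{score}\n\n")

        def run(self):
            response = self.get_data_index()
            # print(response)
            # Skip parsing when the request failed and returned None.
            if response is not None:
                self.parse_data_index(response)


    if __name__ == '__main__':
        spider = DouGuo()
        spider.run()
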
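
Indexing an XPath result with [0] raises IndexError when an item lacks the expected node. One possible way to guard against that is a small helper that falls back to a default value. The sketch below is an assumption, not part of the original spider: first_or_default is a hypothetical name, and the HTML is invented markup shaped like the list the spider expects, so the example runs offline.

```python
from lxml import etree


def first_or_default(node, expr, default=""):
    """Return the first XPath result, stripped, or default if nothing matched."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


# Invented markup; the second item deliberately has no <p> element.
sample_html = """
<ul class="cook-list">
  <li class="clearfix"><div><a>Braised pork</a><p>pork, soy sauce</p></div></li>
  <li class="clearfix"><div><a>Fried rice</a></div></li>
</ul>
"""

html = etree.HTML(sample_html)
rows = []
for li in html.xpath('//ul[@class="cook-list"]//li[@class="clearfix"]'):
    rows.append({
        "title": first_or_default(li, "./div/a/text()"),
        "major": first_or_default(li, "./div/p/text()", default="n/a"),
    })

for row in rows:
    print(row)
```

With this helper a missing field yields the default instead of stopping the whole loop, which tends to matter on real pages where some items are incomplete.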
4. Conclusion
For practice, try scraping this URL: /ershoufang/
Getting the data from page 1 is enough to start with. You can also think about how to fetch multiple pages to implement page turning.
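
One way to approach page turning is to build each page's URL from a template and stop when a request fails. The sketch below is only an outline under stated assumptions: iter_pages and fake_fetch are hypothetical names, the URL template is a placeholder rather than the real site's pattern, and the fetcher is a stand-in so the code runs without network access.

```python
from typing import Callable, Iterator, Optional


def iter_pages(url_template: str,
               fetch: Callable[[str], Optional[str]],
               max_pages: int) -> Iterator[str]:
    """Yield the HTML of pages 1..max_pages, stopping at the first failed fetch."""
    for page in range(1, max_pages + 1):
        html = fetch(url_template.format(page=page))
        if html is None:
            break
        yield html


def fake_fetch(url: str) -> Optional[str]:
    """Stand-in fetcher: pretends that only pages 1 and 2 exist."""
    page = int(url.rsplit("=", 1)[1])
    if page <= 2:
        return f"<html><body><p>page {page}</p></body></html>"
    return None


# Placeholder template; a real site will have its own pagination pattern.
pages = list(iter_pages("https://example.invalid/list?page={page}", fake_fetch, max_pages=5))
print(len(pages))  # 2
```

In a real spider, fake_fetch would be replaced by a function that calls requests.get and returns the response text, or None when the status code is not 200.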