Preamble
When we crawl a web page, we usually fetch the entire page, but we often need only part of its content. So how do we extract just that part? This chapter walks through the XPath plugin and shows how to extract content from a web page.
(i) What is XPath
XPath is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document. Mainstream browsers support XPath, because HTML pages are represented in the DOM as XHTML documents.
The XPath language is based on the tree structure of an XML document. It provides the ability to navigate the tree and select nodes by a variety of criteria, which lets us locate exactly the data we want.
First we need to install the xpath plugin in chrome.
It can be downloaded from the Chrome Web Store.
After installation, restart the browser and press the shortcut Ctrl + Shift + X. A black box appearing on the web page indicates success!
(ii) Basic syntax of XPath
Path queries
// : selects descendant nodes at any depth, regardless of hierarchy
/ : selects direct child nodes
Predicate query
//div[@id]
//div[@id="maincontent"]
Attribute Search
//@class
Fuzzy query
//div[contains(@id, "he")]
//div[starts-with(@id, "he")]
Content Search
//div/h1/text()
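The selectors above can be tried out in code with Python's lxml library (introduced in the next section). This is just a minimal sketch; the HTML snippet and its element ids are invented for illustration:

```python
from lxml import etree

# A tiny made-up document to query against
html = '''
<div id="maincontent">
  <h1>Hello</h1>
  <div id="header" class="nav">menu</div>
</div>
'''
tree = etree.HTML(html)

print(tree.xpath('//div[@id]/@id'))                      # predicate: divs that have an id
print(tree.xpath('//div[@id="maincontent"]/h1/text()'))  # content query -> ['Hello']
print(tree.xpath('//div[contains(@id, "he")]/@id'))      # fuzzy: id contains "he" -> ['header']
print(tree.xpath('//div[starts-with(@id, "he")]/@id'))   # fuzzy: id starts with "he"
print(tree.xpath('//@class'))                            # attribute query -> ['nav']
```

Note that `//div` matches the nested div as well as the outer one, while `/div` from the root would only match direct children.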
(iii) lxml library
lxml is a Python parsing library. It supports parsing both HTML and XML, supports XPath queries, and parses very efficiently.
We need to install the lxml library before using it. In PyCharm, just enter the command in the terminal:
pip install lxml
(Optionally, add -i <index-url>/simple to install from a PyPI mirror; the mirror URL was truncated in the original.)
Note: it must be installed in the environment we are currently using.
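A quick way to confirm lxml landed in the right environment is to import it from that environment's interpreter (a sketch; the exact version tuple depends on your install):

```python
# Sanity check: import lxml and print its version tuple
from lxml import etree

print(etree.LXML_VERSION)  # e.g. a tuple of integers; exact value varies by install
```

If the import fails, pip installed into a different interpreter than the one you are running.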
(iv) Use of the lxml library
Importing
from lxml import etree
Parsing local files
tree = etree.parse('page.html') # parse a local file ('page.html' is a placeholder filename)
Parsing server response files
tree = etree.HTML(content) # parse an HTML string from a server response
Return results
result = tree.xpath('//div/div/@aria-label')[0]
Note: xpath() always returns a list; when the result holds several values, we can use a subscript to pick the one we want.
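The list return value matters in practice: even a single match comes back wrapped in a list, and no match at all returns an empty list, so index only after checking. A small sketch (the HTML snippet is invented):

```python
from lxml import etree

html = '<ul><li>one</li><li>two</li><li>three</li></ul>'
tree = etree.HTML(html)

# xpath() always returns a list, even for a single match
items = tree.xpath('//li/text()')
print(items)     # ['one', 'two', 'three']
print(items[0])  # 'one'

# An expression that matches nothing returns [], so guard before indexing
missing = tree.xpath('//p/text()')
first = missing[0] if missing else None
print(first)     # None
```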
(v) Demonstrations
from lxml import etree
import urllib.parse
import urllib.request

url = '/s?'  # base URL truncated in the original; fill in the target site's search endpoint
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}
cre_data = {
    'wd': 'Write keywords here'
}
data = urllib.parse.urlencode(cre_data)
url = url + data
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)

# tree = etree.parse('page.html')  # parse a local file
tree = etree.HTML(content)  # parse the web response
result = tree.xpath('//div/div/@aria-label')[0]
print(result)
The above walks through the full process of a Python crawler parsing a web page with XPath using the lxml library. For more on parsing web pages with lxml and XPath, see my other related articles!