
Python crawler: parsing web pages with XPath using the lxml library (worked example)

Preamble

When we crawl a web page we usually fetch the whole page, but often we only need part of its content, so how do we extract it? This chapter will show you how to use the XPath plugin and extract content from a web page.

(i) What is XPath

XPath is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document. Mainstream browsers support XPath, because HTML pages are represented in the DOM as XHTML documents.

The XPath language is based on the tree structure of an XML document; it provides the ability to navigate the tree and select nodes by diverse criteria, thereby finding the data we want.

First we need to install the XPath plugin in Chrome.
It can be downloaded by searching the Chrome Web Store.

After installation, restart your browser and press the shortcut Ctrl + Shift + X. A black box appearing on the web page indicates success!

(ii) Basic syntax of XPath

Path queries

// : finds all descendant nodes, regardless of hierarchy
/ : finds direct child nodes

Predicate queries

//div[@id]
//div[@id="maincontent"]

Attribute queries

//@class

Fuzzy queries

//div[contains(@id, "he")]
//div[starts-with(@id, "he")]

Content queries

//div/h1/text()
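
To see these queries in action, here is a minimal sketch using the lxml library (introduced in section (iii) below) against a small made-up HTML snippet; the ids, classes, and text here are invented purely for illustration:

from lxml import etree

html = '''
<html><body>
  <div id="header" class="top">Site header</div>
  <div id="maincontent"><h1>Hello XPath</h1></div>
</body></html>
'''
tree = etree.HTML(html)

print(tree.xpath('//div[@id]'))                     # predicate: all divs that have an id
print(tree.xpath('//div[@id="maincontent"]'))       # predicate: the div with a specific id
print(tree.xpath('//@class'))                       # attribute query: ['top']
print(tree.xpath('//div[contains(@id, "he")]'))     # fuzzy: matches id="header"
print(tree.xpath('//div[starts-with(@id, "he")]'))  # fuzzy: matches id="header"
print(tree.xpath('//div/h1/text()'))                # content: ['Hello XPath']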

(iii) lxml library

lxml is a Python parsing library; it supports HTML and XML parsing as well as XPath, and its parsing efficiency is very high.
We need to install the lxml library in PyCharm before using it.
Just enter this command in the terminal:

pip install lxml

Note: it must be installed in the environment we are currently using.
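
To confirm lxml is importable from the current environment, a quick check like this (a minimal sketch) should print the installed version:

python -c "from lxml import etree; print(etree.__version__)"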

(iv) Use of the lxml library

Importing

from lxml import etree

Parsing local files

tree = etree.parse('local.html', etree.HTMLParser())  # parse a local file; 'local.html' is a placeholder filename

Parsing server response files

tree = etree.HTML(content)  # parse an HTML string from a server response

Return results

result = tree.xpath('//div/div/@aria-label')[0]

Note: the result type returned by xpath() is a list; when the result has many values we can use subscripts to get the value we want.
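
Since an expression that matches nothing returns an empty list, indexing it directly raises an IndexError; a small guard like the following is a safer pattern (the expression and the fallback value are just illustrative):

results = tree.xpath('//div/div/@aria-label')
result = results[0] if results else None  # fall back to None when nothing matched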

(v) Demonstrations

import urllib.request
import urllib.parse
from lxml import etree

# The target domain was lost in extraction; Baidu search is assumed here from the 'wd' parameter
url = 'https://www.baidu.com/s?'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}
cre_data = {
    'wd': 'Write keywords here'
}
data = urllib.parse.urlencode(cre_data)  # URL-encode the query parameters
url = url + data
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
# tree = etree.parse('local.html', etree.HTMLParser())  # parse a local file instead
tree = etree.HTML(content)  # parse the server response
result = tree.xpath('//div/div/@aria-label')[0]
print(result)

The above is the detailed example of a Python crawler parsing a web page with XPath using the lxml library. For more on parsing web pages with lxml and XPath in Python, please pay attention to my other related articles!