A detailed guide to using etree from the lxml library in Python

1. Introduction to etree

lxml is a powerful XML processing library for Python. Its etree module provides a simple and flexible API for parsing and manipulating XML/HTML documents.

Official documentation: The lxml.etree Tutorial
Installation: pip install lxml

2. Parsing HTML/XML with XPath

1. The first step is to parse the HTML/XML string or file with etree.

Syntax:

root = etree.XML(xml_code)  # parse an XML string
root = etree.HTML(html_code)  # parse an HTML string

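from lxml import etree

# Parse an XML string; the result is the root element
root = etree.XML("<root>data</root>")
print(root.tag)
# root
print(etree.tostring(root))
# b'<root>data</root>'

# Parse an HTML fragment; the parser adds the missing <html> and <body>
root = etree.HTML("<p>data</p>")
print(root.tag)
# html
print(etree.tostring(root))
# b'<html><body><p>data</p></body></html>'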
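Since the first step also mentions files, here is a minimal sketch of parsing from a file instead of a string. It assumes etree.parse() with an etree.HTMLParser(); the temporary directory and the file name sample.html are only illustrative and are not part of the original article.

```python
import os
import tempfile

from lxml import etree

# Write a small HTML document to a temporary file (illustrative file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body><p>data</p></body></html>")

    # etree.parse() returns an ElementTree; getroot() gives the root element.
    tree = etree.parse(path, etree.HTMLParser())
    file_root = tree.getroot()

print(file_root.tag)                  # html
print(file_root.xpath("//p/text()"))  # ['data']
```

For XML files, the HTMLParser argument can be left out, in which case the default XML parser is used.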

2. Locating nodes with XPath expressions

XPath uses path expressions to select nodes in HTML/XML documents. A node is selected by following a path made of steps. The most useful path expressions are listed below:

Expression	Description
/	Selects from the root node (child step)
//	Selects matching nodes anywhere below, regardless of position (descendant step)
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes
contains(@attribute, "substring")	Fuzzy match on part of an attribute value
text()	Selects text content
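The expressions in the table can be tried out on a tiny document. The following sketch uses a made-up XML snippet (the library/shelf/book element names are assumptions for illustration only) and exercises each expression once.

```python
from lxml import etree

# A made-up document, used only to exercise the expressions in the table.
doc = etree.XML(
    "<library>"
    "<shelf id='s1'><book lang='en'>Dune</book><book lang='fr'>Candide</book></shelf>"
    "<shelf id='s2'><book lang='en'>Emma</book></shelf>"
    "</library>"
)

print(doc.xpath("/library/shelf/@id"))  # ['s1', 's2']  (/ and @)
print(doc.xpath("//book/text()"))       # ['Dune', 'Candide', 'Emma']  (// and text())

first_book = doc.xpath("//book")[0]
print(first_book.xpath("./text()"))     # ['Dune']  (. is the current node)
print(first_book.xpath("../@id"))       # ['s1']    (.. is the parent node)

print(doc.xpath("//book[contains(@lang, 'f')]/text()"))  # ['Candide']
```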

① Locating by attribute with XPath

html.xpath(".//tag_name[@attribute='attribute value']")  # Note that this returns a list!
[]: a predicate, meaning the element is filtered by a condition (here, an attribute)
@: followed by an attribute name, it indicates which attribute to match on

from lxml import etree

ht = """<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>"""

html = etree.HTML(ht)

title = html.xpath(".//h1[@class='title']")[0]  # get the first element in the list
print(etree.tostring(title))
# b'<h1 class="title">Hello!</h1>\n    '
print(title.get('class'))  # read the value of the class attribute
# title
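Because xpath() returns a list, indexing with [0] raises an IndexError when nothing matches. One way to guard against that is a small helper like the one sketched below; first_or_none is a hypothetical name introduced here, not an lxml function.

```python
from lxml import etree

page = etree.HTML('<body><h1 class="title">Hello!</h1></body>')


def first_or_none(node, expression):
    """Hypothetical helper: return the first XPath match, or None if there is none."""
    matches = node.xpath(expression)
    return matches[0] if matches else None


heading = first_or_none(page, ".//h1[@class='title']")
missing = first_or_none(page, ".//h2[@class='title']")

print(heading.get("class"))  # title
print(missing)               # None
```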

② Locating by text and reading text with XPath

ele = html.xpath(".//tag_name[text()='text value']")[0]
text1 = ele.text  # method 1: ele is the located element
text2 = html.xpath("string(.//tag_name[@attribute='attribute value'])")  # method 2: returns a string
text3 = html.xpath(".//tag_name[@attribute='attribute value']/text()")  # method 3: returns a list of strings

title1 = html.xpath(".//h1[text()='Hello!']")[0]  # get the first element in the list
text1 = title1.text
print(text1)
# Hello!
text2 = html.xpath("string(.//h1[@class='title'])")
print(text2)
# Hello!
text3 = html.xpath(".//h1[@class='title']/text()")  # returns a list
print(text3)
# ['Hello!']
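The three approaches give the same result for the h1 above, but they differ once an element contains child elements. The sketch below reuses the "bold" paragraph from the sample document to show the difference; the itertext() line is an extra alternative not mentioned in the article.

```python
from lxml import etree

page = etree.HTML("<p>This is a paragraph with <b>bold</b> text in it!</p>")
para = page.xpath(".//p")[0]

# .text only covers the text before the first child element.
print(repr(para.text))             # 'This is a paragraph with '

# string() concatenates all descendant text into one string.
print(page.xpath("string(.//p)"))  # This is a paragraph with bold text in it!

# text() returns only the direct text nodes of <p>, as a list.
print(page.xpath(".//p/text()"))   # ['This is a paragraph with ', ' text in it!']

# itertext() walks all text in document order, much like string().
print("".join(para.itertext()))    # This is a paragraph with bold text in it!
```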

③ Locating by hierarchy with XPath

In actual development, the required element often has no distinguishing attributes such as id, name, or class. In that case, first locate a nearby element that can be identified, then reach the target element through its hierarchical relationship to that element.

html.xpath(".//parent_tag[@attribute='attribute value']/child_tag")  # top-down: the target is the child element
html.xpath(".//child_tag[@attribute='attribute value']/parent::parent_tag")  # bottom-up: the target is the parent element
html.xpath(".//tag_name[@attribute='attribute value']/preceding-sibling::sibling_tag")  # the target is an earlier sibling
html.xpath(".//tag_name[@attribute='attribute value']/following-sibling::sibling_tag")  # the target is a later sibling

from lxml import etree

ht = """<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p class="para">This is another paragraph, with a
      <a href="">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>"""

html = etree.HTML(ht)

ele1 = html.xpath(".//p[@class='para']/a")[0]  # parent to child
print(etree.tostring(ele1))
# b'<a href="">link</a>.'

ele2 = html.xpath(".//a[@href='']/parent::p")[0]  # child to parent
print(etree.tostring(ele2))
# b'<p class="para">This is another paragraph, with a\n      <a href="">link</a>.</p>\n    '

ele3 = html.xpath(".//p[@class='para']/preceding-sibling::p")[0]  # earlier sibling
print(etree.tostring(ele3))
# b'<p>This is a paragraph with <b>bold</b> text in it!</p>\n    '

ele4 = html.xpath(".//p[@class='para']/following-sibling::p")  # later siblings
for ele in ele4:
    print(etree.tostring(ele))
    # b'<p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>\n    '
    # b'<p>And finally an embedded XHTML fragment.</p>\n  '
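As a further sketch of the same idea, consider a form whose inputs carry no distinguishing attributes. The markup below is made up for illustration; the label text is the only reliable anchor, and the axes lead from it to the target.

```python
from lxml import etree

# Made-up markup: the two <input> elements cannot be told apart by attributes.
form = etree.HTML(
    "<form>"
    "<div><label>Name</label><input type='text'/></div>"
    "<div><label>Email</label><input type='text'/></div>"
    "</form>"
)

# Anchor on the label text, then step to the sibling input.
email_input = form.xpath(".//label[text()='Email']/following-sibling::input")[0]

# Step up to the enclosing row, then back to the row before it.
email_row = form.xpath(".//label[text()='Email']/parent::div")[0]
name_row = form.xpath(".//label[text()='Email']/parent::div/preceding-sibling::div")[0]

print(email_input.tag)                  # input
print(email_row.xpath("./label/text()"))  # ['Email']
print(name_row.xpath("./label/text()"))   # ['Name']
```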

④ Locating by index with XPath

There are two main ways to locate by index when combining etree with XPath. The first works because xpath() returns a list, which can be indexed like any Python list.

Syntax 1:

html.xpath("xpath expression")[0]  # get the first element in the list
html.xpath("xpath expression")[-1]  # get the last element in the list
html.xpath("xpath expression")[-2]  # get the penultimate element in the list

ele1 = html.xpath(".//body/p")[0]
print(etree.tostring(ele1))
# b'<p>This is a paragraph with <b>bold</b> text in it!</p>\n    '

ele1 = html.xpath(".//body/p")[-1]
print(etree.tostring(ele1))
# b'<p>And finally an embedded XHTML fragment.</p>\n  '

Syntax 2:

html.xpath("xpath expression[1]")[0]  # get the first element
html.xpath("xpath expression[last()]")[0]  # get the last element
html.xpath("xpath expression[last()-1]")[0]  # get the penultimate element

Note: this differs from Python list indexing. XPath positions start at 1, while Python list indexes start at 0.
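One subtlety worth illustrating: an XPath position predicate applies per parent, so p[1] means "the first p within each parent", while wrapping the expression in parentheses indexes the combined result. The markup below is made up for this sketch.

```python
from lxml import etree

page = etree.HTML(
    "<body>"
    "<div><p>a1</p><p>a2</p></div>"
    "<div><p>b1</p><p>b2</p></div>"
    "</body>"
)

# [1] is evaluated inside each <div>, so one <p> per div is returned.
print(page.xpath("//div/p[1]/text()"))         # ['a1', 'b1']

# Parentheses make the index apply to the whole node set.
print(page.xpath("(//div/p)[1]/text()"))       # ['a1']
print(page.xpath("(//div/p)[last()]/text()"))  # ['b2']

# The Python-side equivalent of the last line, using list indexing.
print(page.xpath("//div/p/text()")[-1])        # b2
```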

⑤ Fuzzy matching with XPath

Sometimes an attribute value is too long to match in full. In that case, fuzzy matching lets us match on only part of the attribute value.

html.xpath(".//tag_name[starts-with(@attribute, 'start of value')]")  # match the beginning
html.xpath(".//tag_name[ends-with(@attribute, 'end of value')]")  # match the end
html.xpath(".//tag_name[contains(text(), 'part of text')]")  # contains partial text

Note: ends-with is an XPath 2.0 function, and etree only supports XPath 1.0, so an expression that uses it fails in lxml with an XPathEvalError.

ele1 = html.xpath(".//p[starts-with(@class,'par')]")[0]  # match the beginning
print(etree.tostring(ele1))
# b'<p class="para">This is another paragraph, with a\n      <a href="">link</a>.</p>\n    '

# Match the end: left commented out because lxml only supports XPath 1.0,
# so ends-with raises an XPathEvalError
# ele2 = html.xpath(".//p[ends-with(@class, 'ara')]")[0]
# print(etree.tostring(ele2))

ele3 = html.xpath(".//p[contains(text(),'is a paragraph with')]")[0]  # contains "is a paragraph with"
print(etree.tostring(ele3))
# b'<p>This is a paragraph with <b>bold</b> text in it!</p>\n    '
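Since ends-with is unavailable, one possible workaround is to emulate it with the XPath 1.0 functions substring() and string-length(). The sketch below is one way to do this, not the article's method; the suffix variable is passed to xpath() as a keyword argument, and the markup is made up.

```python
from lxml import etree

page = etree.HTML('<body><p class="para">one</p><p class="note">two</p></body>')

# ends-with is XPath 2.0, so lxml rejects it at evaluation time.
try:
    page.xpath(".//p[ends-with(@class, 'ara')]")
except etree.XPathEvalError as exc:
    print("ends-with is not available:", exc)

# XPath 1.0 emulation: compare the tail of @class with the suffix.
expression = (
    ".//p[substring(@class, string-length(@class) - string-length($suffix) + 1)"
    " = $suffix]"
)
matches = page.xpath(expression, suffix="ara")
print([p.text for p in matches])  # ['one']
```

Passing the suffix as an XPath variable avoids building the expression by string concatenation, which also sidesteps quoting problems when the value contains quotes.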

Summary

This article covered the steps for using etree from the lxml library in Python: parsing HTML/XML, then locating nodes with XPath by attribute, text, hierarchy, index, and fuzzy matching.