Practical Guide to Efficiently Parsing and Manipulating XML/HTML in Python

Preface

In the Python ecosystem, lxml is a powerful and widely used library for efficiently parsing and manipulating XML and HTML documents. Whether you are dealing with simple HTML pages or complex XML data structures, lxml provides a rich tool set, including XPath, XSLT transformations, and CSS selector support. This article starts with installing lxml, then explains step by step how to parse documents, extract data, and modify document structure, and finally covers advanced operations such as processing large documents and working with namespaces. Whether you are just starting out with lxml or want a closer look at its advanced features, this article aims to serve as a practical reference.

1. Installation of lxml

Installing the lxml module is simple and can be done with pip. The specific installation steps are as follows:

(I) Install using pip

If you are using pip, the Python package manager, you can run the following command directly in the terminal or command prompt:

pip install lxml

(II) Install using conda

If you are using Anaconda or Miniconda, you can install lxml with conda:

conda install lxml

(III) Problems that may be encountered during installation

Compilation issues: lxml depends on the C libraries libxml2 and libxslt. If you encounter errors during installation, the system may be missing these dependencies. In most cases pip handles this automatically by installing a prebuilt binary package; if installation still fails, you can install these libraries manually.
Windows users: the Windows builds of lxml generally bundle the necessary binary dependencies, so installation on Windows does not require special configuration. If you encounter problems, check that pip is installing a precompiled binary package, which it normally does automatically.

(IV) Verify installation

After the installation is complete, you can import lxml in the Python interpreter to verify that the installation succeeded:

import lxml

If no error is reported, the installation is successful.
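
As a slightly more informative check, the following sketch prints the version of lxml and of the underlying C libraries it is using. It relies only on the version constants exposed by lxml.etree.

```python
from lxml import etree

# Version tuples exposed by lxml.etree
print("lxml:", etree.LXML_VERSION)
print("libxml2:", etree.LIBXML_VERSION)
print("libxslt:", etree.LIBXSLT_VERSION)
```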

2. Getting started with the lxml module

lxml is a powerful Python library used mainly to parse and manipulate XML and HTML documents. It is efficient and easy to use, and it supports features such as XPath and XSLT. The following is a quick-start guide to lxml.

(I) Basic usage

1. Parsing HTML documents

lxml can parse HTML documents from strings or files.

from lxml import etree

html_string = """
<html>
  <body>
    <h1>Welcome to lxml!</h1>
    <div class="content">This is a test.</div>
  </body>
</html>
"""

# Use the HTML parser
parser = etree.HTMLParser()
tree = etree.fromstring(html_string, parser)

# Print the parsed HTML document
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))

This example shows how to parse a document tree from an HTML string.
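
One reason to use the HTML parser rather than the XML parser for web pages is that it tolerates markup that is not well-formed. The sketch below is illustrative (the broken_html string is a made-up example) and uses etree.HTML(), a shortcut for parsing a string with the HTML parser.

```python
from lxml import etree

# Made-up fragment: no <html>/<body> wrapper and unclosed <p> tags
broken_html = "<p>Hello<p>World"

root = etree.HTML(broken_html)

# The parser supplies the missing wrapper elements and closes the paragraphs
print(root.tag)
paragraph_texts = [p.text for p in root.iter("p")]
print(paragraph_texts)
```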

2. Parsing XML documents

lxml parses XML documents in the same way.

xml_string = """
<root>
  <element key="value">This is an element</element>
</root>
"""

# Parse the XML string
tree = etree.fromstring(xml_string)

# Print the parsed XML document
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))
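
Once a document is parsed, each node is an element object whose tag, attributes, and text can be read directly. The following sketch reuses the same small document and also shows that malformed XML raises etree.XMLSyntaxError; the bad_xml string is an illustrative assumption.

```python
from lxml import etree

xml_string = """
<root>
  <element key="value">This is an element</element>
</root>
"""

root = etree.fromstring(xml_string)
first = root[0]             # children are accessed like list items

print(first.tag)            # element name
print(first.get("key"))     # attribute value
print(first.text)           # text content

# Unlike the HTML parser, the XML parser rejects malformed input
bad_xml = "<root><unclosed></root>"
try:
    etree.fromstring(bad_xml)
    parsed_ok = True
except etree.XMLSyntaxError as exc:
    parsed_ok = False
    print("Parse error:", exc)
```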

3. Parsing from file

In addition to parsing from strings, you can also read and parse documents directly from files:

# Parse an HTML file (replace the empty string with the path to your file)
tree = etree.parse("", parser)

# Parse an XML file
tree = etree.parse("")
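
The following self-contained sketch first writes a small XML file to a temporary directory and then parses it. The file name sample.xml and its contents are assumptions made for illustration. Note that etree.parse() returns an ElementTree object, whose root element is obtained with getroot().

```python
import os
import tempfile

from lxml import etree

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file created just for this example
    path = os.path.join(tmp_dir, "sample.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<root><element key='value'>From a file</element></root>")

    doc = etree.parse(path)     # ElementTree object
    root = doc.getroot()        # root Element

    print(root.tag)
    print(root[0].text)
```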

(II) Use XPath to extract data

lxml supports XPath, which is well suited to extracting specific information from documents.

# Extract the text of div elements with class="content" (tree is the HTML document parsed earlier)
div_content = tree.xpath("//div[@class='content']/text()")
print(div_content)  # Output: ['This is a test.']

# Extract the text of the h1 element
h1_content = tree.xpath("//h1/text()")
print(h1_content)  # Output: ['Welcome to lxml!']
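
An XPath expression does not always return a list of strings: depending on the expression, xpath() can return elements, attribute values, numbers, or booleans. The sketch below illustrates this on a small document of its own.

```python
from lxml import etree

html_string = """
<html>
  <body>
    <h1>Welcome to lxml!</h1>
    <div class="content">This is a test.</div>
  </body>
</html>
"""
tree = etree.fromstring(html_string, etree.HTMLParser())

divs = tree.xpath("//div")                  # list of elements
classes = tree.xpath("//div/@class")        # list of attribute values
div_count = tree.xpath("count(//div)")      # float
has_h1 = tree.xpath("boolean(//h1)")        # bool
heading = tree.xpath("string(//h1)")        # single string

print(divs[0].tag, classes, div_count, has_h1, heading)
```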

(III) Create and modify XML/HTML documents

1. Create a new document

You can use lxml to create a new XML/HTML document and add elements and attributes to it:

# Create the root element
root = etree.Element("root")

# Add a child element
child = etree.SubElement(root, "child")
child.text = "This is a child element."

# Set an attribute
child.set("class", "highlight")

# Print the generated XML document
print(etree.tostring(root, pretty_print=True).decode("utf-8"))
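
The same calls scale naturally to building a document from data. The sketch below builds a small catalog from a Python list (the books data and the element names are invented for illustration) and serializes it with an XML declaration.

```python
from lxml import etree

# Invented sample data
books = [("1", "Python Programming"), ("2", "Advanced Mathematics")]

library = etree.Element("library")
for book_id, title in books:
    # Keyword arguments to SubElement become attributes
    book = etree.SubElement(library, "book", id=book_id)
    etree.SubElement(book, "title").text = title

xml_bytes = etree.tostring(
    library, pretty_print=True, xml_declaration=True, encoding="UTF-8"
)
print(xml_bytes.decode("utf-8"))
```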

2. Modify existing documents

You can modify the document after parsing it, such as adding new elements or changing the text content:

# Add a new div element to the <body> of the HTML document parsed earlier
new_div = etree.Element("div")
new_div.text = "This is a new div."
tree.find("body").append(new_div)

# Print the modified document
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))
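
append() always adds the new element as the last child. When position matters, insert() takes an index, and existing elements can be updated through set() and the text attribute. A minimal self-contained sketch, with made-up markup:

```python
from lxml import etree

tree = etree.fromstring(
    "<html><body><h1>Title</h1><div class='content'>Old text</div></body></html>",
    etree.HTMLParser(),
)
body = tree.find("body")

# Insert a new element as the first child of <body>
banner = etree.Element("div")
banner.text = "This is a new div."
body.insert(0, banner)

# Update an existing element in place
content = body.find("div[@class='content']")
content.text = "New text"
content.set("id", "main")

print(etree.tostring(tree, pretty_print=True).decode("utf-8"))
```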

(IV) Write to the file

You can also write parsed or modified content to a file:

# Write the tree to a file (tree must be an ElementTree, e.g. from etree.parse(); supply your own output path)
tree.write("", pretty_print=True, method="html", encoding="utf-8")
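
write() is a method of ElementTree objects, so an element returned by etree.fromstring() has to be wrapped first. The sketch below writes to a temporary file (the name output.xml is an assumption) and reads the result back to confirm the round trip.

```python
import os
import tempfile

from lxml import etree

root = etree.fromstring("<root><child>text</child></root>")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "output.xml")  # hypothetical file name

    # Wrap the Element in an ElementTree to get access to write()
    etree.ElementTree(root).write(
        out_path, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )

    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```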

(V) Summary of getting started with lxml

lxml is a very efficient tool for parsing and processing XML/HTML. With the basic operations above, you can quickly get started and use it to parse, extract, create, and modify documents.

3. In-depth practice of lxml

To master the lxml module in depth, you need to understand its advanced features, such as more complex XPath queries, CSS selectors, processing large XML/HTML documents, and XSLT transformations. Here are some examples for in-depth practice.

(I) Advanced XPath Query

In practical use, we may need to write more complex XPath queries to extract specific data. Here are some exercise examples:

from lxml import etree

html_string = """
<html>
  <body>
    <div class="content">
        <p class="intro">Welcome to lxml!</p>
        <p class="text">lxml is powerful.</p>
        <a href="">Example</a>
    </div>
    <div class="footer">
        <p>Contact us at: info@</p>
    </div>
  </body>
</html>
"""

parser = etree.HTMLParser()
tree = etree.fromstring(html_string, parser)

# 1. Extract the text of all <p> elements
paragraphs = tree.xpath("//p/text()")
print(paragraphs)

# 2. Extract the text of the <p> element whose class attribute is 'intro'
intro_paragraph = tree.xpath("//p[@class='intro']/text()")
print(intro_paragraph)

# 3. Extract the href attributes of all links
links = tree.xpath("//a/@href")
print(links)
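
XPath predicates can go beyond exact attribute matches. The following sketch shows a string function, a positional predicate, and an XPath variable, which lxml accepts as a keyword argument to xpath(). The markup is a made-up sample.

```python
from lxml import etree

html_string = """
<div class="content">
  <p class="intro lead">Welcome to lxml!</p>
  <p class="text">lxml is powerful.</p>
  <p class="text">lxml is fast.</p>
</div>
"""
tree = etree.fromstring(html_string, etree.HTMLParser())

# contains(): match part of an attribute value
intro = tree.xpath("//p[contains(@class, 'intro')]/text()")

# last(): positional predicate selecting the final <p> in the div
last_p = tree.xpath("//div[@class='content']/p[last()]/text()")

# $cls is an XPath variable bound through a keyword argument
by_class = tree.xpath("//p[@class=$cls]/text()", cls="text")

print(intro, last_p, by_class)
```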

(II) Use CSS selector

lxml also supports CSS selectors through the cssselect module, which provides a jQuery-like way of querying. First, make sure you have installed cssselect:

pip install cssselect

Then you can use it as follows:

from lxml import etree

html_string = """
<html>
  <body>
    <div class="content">
        <p class="intro">Welcome to lxml!</p>
        <p class="text">lxml is powerful.</p>
        <a href="">Example</a>
    </div>
  </body>
</html>
"""

parser = etree.HTMLParser()
tree = etree.fromstring(html_string, parser)

# Select all <p> elements
paragraphs = tree.cssselect("p")
for p in paragraphs:
    print(p.text)

# Select the <p> element with class="intro"
intro_paragraph = tree.cssselect("p.intro")
print(intro_paragraph[0].text)

# Select all links
links = tree.cssselect("a")
for link in links:
    print(link.get("href"))
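
A CSS class selector such as p.intro matches elements whose class attribute contains intro as one of several space-separated names, which a plain @class='intro' XPath test does not. As a sketch that needs no extra package, the same match can be written directly in XPath; the markup is a made-up sample.

```python
from lxml import etree

html_string = """
<div>
  <p class="intro lead">Welcome to lxml!</p>
  <p class="introduction">Not a match.</p>
  <p class="text">lxml is powerful.</p>
</div>
"""
tree = etree.fromstring(html_string, etree.HTMLParser())

# Exact comparison misses the first paragraph because its class has two names
exact = tree.xpath("//p[@class='intro']")

# Pad the class list with spaces and look for ' intro ' as a whole word
token_match = tree.xpath(
    "//p[contains(concat(' ', normalize-space(@class), ' '), ' intro ')]"
)

print(len(exact), [p.text for p in token_match])
```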

(III) Processing large XML documents

For large XML documents, you can use iterparse to parse the document incrementally, handling each element as soon as it has been read, which saves memory and improves efficiency.

import io

large_xml_string = """
<root>
  <item id="1"><name>Item 1</name></item>
  <item id="2"><name>Item 2</name></item>
  <item id="3"><name>Item 3</name></item>
  <!-- More content -->
</root>
"""

# iterparse reads from a file or file-like object, so wrap the bytes in BytesIO
context = etree.iterparse(io.BytesIO(large_xml_string.encode('utf-8')), events=('end',), tag='item')

for event, elem in context:
    # Print the content of each item
    name = elem.find("name").text
    item_id = elem.get("id")
    print(f"ID: {item_id}, Name: {name}")

    # Clear processed elements to free memory
    elem.clear()
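
elem.clear() empties each processed element, but the emptied elements remain attached to the root, so the tree still grows slowly on very large inputs. A common refinement, sketched here, also deletes siblings that have already been processed. The generated input data is an assumption for illustration.

```python
import io

from lxml import etree

# Generate a sample document with many <item> elements
items = "".join(f'<item id="{i}"><name>Item {i}</name></item>' for i in range(1000))
source = io.BytesIO(f"<root>{items}</root>".encode("utf-8"))

names = []
for event, elem in etree.iterparse(source, events=("end",), tag="item"):
    names.append(elem.findtext("name"))

    # Free the element's own content
    elem.clear()
    # Drop references to earlier siblings that are already processed
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(len(names), names[0], names[-1])
```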

(IV) Use XSLT to convert

lxml supports XSLT (Extensible Stylesheet Language Transformations) for transforming XML documents. This is useful when XML data needs to be converted into another format, such as HTML.

xslt_string = """
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h2>Transformed XML Data</h2>
        <ul>
          <xsl:for-each select="root/item">
            <li>
              <xsl:value-of select="name"/>
            </li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
"""

xml_string = """
<root>
  <item><name>Item 1</name></item>
  <item><name>Item 2</name></item>
  <item><name>Item 3</name></item>
</root>
"""

# Parse the XML and the XSLT stylesheet
xml_doc = etree.fromstring(xml_string)
xslt_doc = etree.fromstring(xslt_string)

# Create an XSLT transformer
transform = etree.XSLT(xslt_doc)
result_tree = transform(xml_doc)

# Print the transformed result
print(str(result_tree))
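
A stylesheet can also take parameters at call time. In the sketch below, the stylesheet and the parameter name heading are made up for illustration; etree.XSLT.strparam() is used to pass a Python string as an XSLT string value.

```python
from lxml import etree

xslt_doc = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="heading"/>
  <xsl:template match="/">
    <xsl:value-of select="$heading"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="count(root/item)"/>
  </xsl:template>
</xsl:stylesheet>
""")

xml_doc = etree.fromstring("<root><item/><item/><item/></root>")

transform = etree.XSLT(xslt_doc)
# Keyword arguments are passed to the stylesheet as parameters
result = transform(xml_doc, heading=etree.XSLT.strparam("Items"))

summary = str(result)
print(summary)
```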

(V) Modify and reconstruct XML documents

You can use lxml to traverse and modify existing documents, for example by inserting new nodes, deleting nodes, or modifying attributes.

# Modify an XML document
xml_string = """
<library>
  <book id="1" available="yes"><title>Python Programming</title></book>
  <book id="2" available="no"><title>Advanced Mathematics</title></book>
</library>
"""

tree = etree.fromstring(xml_string)

# Add an <author> element to every book
for book in tree.xpath("//book"):
    author = etree.Element("author")
    author.text = "Unknown"
    book.append(author)

# Modify the title of the book with id="2"
book_to_modify = tree.xpath("//book[@id='2']/title")[0]
book_to_modify.text = "Advanced Calculus"

# Delete all books with available="no"
for book in tree.xpath("//book[@available='no']"):
    book.getparent().remove(book)

# Print the final XML
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))
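
One detail to watch when deleting nodes: in lxml, text that follows an element is stored on that element as its tail, so removing the element removes that text too. The sketch below shows the effect and one possible way to keep the text; the helper name remove_keep_tail is hypothetical.

```python
from lxml import etree


def remove_keep_tail(elem):
    """Remove elem from its parent but keep the text that followed it."""
    parent = elem.getparent()
    if elem.tail:
        previous = elem.getprevious()
        if previous is not None:
            # Attach the tail to the preceding sibling
            previous.tail = (previous.tail or "") + elem.tail
        else:
            # No preceding sibling: attach it to the parent's leading text
            parent.text = (parent.text or "") + elem.tail
    parent.remove(elem)


# Plain remove(): the trailing " world" disappears with <b>
plain = etree.fromstring("<p>Hello <b>bold</b> world</p>")
plain.remove(plain.find("b"))
lost = etree.tostring(plain).decode("utf-8")

# Helper: the trailing text is preserved
kept_root = etree.fromstring("<p>Hello <b>bold</b> world</p>")
remove_keep_tail(kept_root.find("b"))
kept = etree.tostring(kept_root).decode("utf-8")

print(lost)
print(kept)
```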

(VI) Handle namespace

lxml can handle namespaces in XML documents, which is useful when parsing complex XML documents.

xml_string = """
<root xmlns:h="http://www.w3.org/TR/html4/">
  <h:table>
    <h:tr>
      <h:td>Cell 1</h:td>
      <h:td>Cell 2</h:td>
    </h:tr>
  </h:table>
</root>
"""

# Define the namespace mapping (prefix -> URI)
ns = {'h': 'http://www.w3.org/TR/html4/'}

tree = etree.fromstring(xml_string)

# Extract the text of all h:td elements
cells = tree.xpath("//h:td/text()", namespaces=ns)
print(cells)  # Output: ['Cell 1', 'Cell 2']
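
Two related points are worth a sketch. find() and findall() accept the namespace URI directly in {uri}tag form, and a default namespace (one declared without a prefix) still needs a prefix of your choosing in XPath. The document and the prefix d below are made up for illustration.

```python
from lxml import etree

xml_string = """
<table xmlns="http://www.w3.org/TR/html4/">
  <tr><td>Cell 1</td><td>Cell 2</td></tr>
</table>
"""
root = etree.fromstring(xml_string)
uri = "http://www.w3.org/TR/html4/"

# Unprefixed names do not match elements in a default namespace
no_match = root.xpath("//td")

# Bind the URI to any prefix for use in XPath
cells = root.xpath("//d:td/text()", namespaces={"d": uri})

# Or spell out the namespace in {uri}tag notation
cell_elems = root.findall(f".//{{{uri}}}td")

print(no_match, cells, len(cell_elems))
```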

(VII) Summary of in-depth practice with lxml

lxml is a very powerful library suitable for handling a variety of XML and HTML documents. By mastering XPath, CSS selectors, XSLT transformations, incremental parsing of large documents, and related features, you can handle different data structures flexibly and efficiently. I hope these in-depth exercises help you further understand and apply lxml.

4. Summary

lxml is an efficient, flexible, and powerful Python library suitable for processing a wide range of XML and HTML documents. By mastering the basic usage of lxml, you can quickly parse documents, extract data, and create and modify document structure. With further study, you can also use XPath, XSLT, and CSS selectors to handle complex data queries and transformations, and optimize the parsing of large files. Hopefully, the examples and exercises in this article help you better understand and apply lxml in data processing and document parsing.
