A Summary of Ways to Parse and Process XML in the Shell

Preface

A few days ago at work I needed to parse and process some XML files. Because the logic seemed fairly complicated, I wrote it in Java. The requirement changes often, though. After every change I have to dig up the source code for the jar again, and after modifying it I have to replace the old jar. This is inconvenient in three ways: the code is awkward to modify, it is hard to keep the source in one place, and it is hard to see what the jar actually does.

In fact, for a task this flexible, the most convenient and efficient choice is a scripting language such as Python or Ruby, which are quick to develop in and can still handle fairly complex logic. However, for various reasons, some machines at work do not have interpreters for these languages installed. So, as a last resort, I looked into ways of parsing XML with shell scripts.

The shell is not well suited to complex logic, but for simple search-and-replace requirements it is quite convenient.

I mainly use the following three tools:

xmllint
xpath
xml2

The usage of each tool is summarized below for later reference.

xmllint

Brief description

xmllint is a small utility built on libxml2, a library written in C. As a result it is relatively efficient, well supported across systems, and fairly complete in features. It usually ships in the libxml2-utils package, so it can be installed with a command such as sudo apt install libxml2-utils.

Function

xmllint supports at least the following commonly used features:

XPath queries
An interactive, shell-like query mode
Checking that XML is well formed
Validating XML against a DTD or XSD
Encoding conversion
XML formatting
Whitespace removal
Timing statistics

In practice, the four most commonly used are XPath queries, whitespace removal, formatting, and validation.

For example, suppose we have the following file:

<books>
  <book id="1">
    <name>book1</name>
    <price>100</price>
  </book>
  <book id="2">
    <name>book2</name>
    <price>200</price>
  </book>
  <book id="3"><name>book3</name><price>300</price>
  </book>
</books>

Run an XPath query:

myths@business:~$ xmllint --xpath "//book[@id=2]/name/text()" 
book2
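
In a script, the query result usually ends up in a variable. The following is a minimal sketch, assuming a file laid out like the sample above. The helper name book_field and the file name books.xml in the usage note are made up for illustration. Wrapping the path in XPath's string() function makes xmllint print the text value of the first match followed by a newline.

```shell
#!/usr/bin/env bash
# book_field FILE ID FIELD
# Hypothetical helper: print the text of one child element (e.g. name or price)
# of the <book> whose id attribute equals ID. FILE is assumed to follow the
# <books>/<book id="..."> layout of the sample document.
book_field() {
  local file=$1 id=$2 field=$3
  xmllint --xpath "string(//book[@id='${id}']/${field})" "$file"
}
```

For example, price=$(book_field books.xml 2 price) would store the price of the second book. If nothing matches, string() yields an empty string, so the caller should check for an empty result.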

Remove whitespace:

myths@business:~$ xmllint --noblanks 
<?xml version="1.0"?>
<books><book id="1"><name>book1</name><price>100</price><license/></book><book id="2"><name>book2</name><price>200</price></book><book id="3"><name>book3</name><price>300</price></book></books>

Format:

myths@business:~$ xmllint --format 
<?xml version="1.0"?>
<books>
  <book id="1">
    <name>book1</name>
    <price>100</price>
    <license/>
  </book>
  <book id="2">
    <name>book2</name>
    <price>200</price>
  </book>
  <book id="3">
    <name>book3</name>
    <price>300</price>
  </book>
</books>

XSD validation:

myths@business:~$ cat 
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
 <xs:element name="books" msdata:IsDataSet="true" msdata:Locale="en-US">
 <xs:complexType>
  <xs:choice minOccurs="0" maxOccurs="unbounded">
  <xs:element name="book">
   <xs:complexType>
   <xs:sequence>
    <xs:element name="name" type="xs:string" minOccurs="0" msdata:Ordinal="0" />
    <xs:element name="price" type="xs:string" minOccurs="0" msdata:Ordinal="1" />
   </xs:sequence>
   <xs:attribute name="id" type="xs:string" />
   </xs:complexType>
  </xs:element>
  </xs:choice>
 </xs:complexType>
 </xs:element>
</xs:schema>
 
myths@business:~$ xmllint --noout --schema  
 validates

Note: The validation result is written to stderr. By default the tool also echoes the original file to stdout; add the --noout option to turn that echo off.
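
Since the messages go to stderr, a script is usually better off testing xmllint's exit status than parsing its output. The following is a rough sketch. The function names is_well_formed and matches_schema are hypothetical, and it relies only on xmllint exiting with 0 on success and a non-zero status on a parse or validation error.

```shell
#!/usr/bin/env bash
# is_well_formed FILE
# Hypothetical helper: succeed only if FILE parses as XML.
is_well_formed() {
  xmllint --noout "$1" 2>/dev/null
}

# matches_schema XSD FILE
# Hypothetical helper: succeed only if FILE validates against the schema XSD.
# --noout suppresses the echo of the document; stderr is discarded because
# only the exit status is of interest here.
matches_schema() {
  xmllint --noout --schema "$1" "$2" 2>/dev/null
}
```

These can then be used directly in conditions, for example: if matches_schema books.xsd books.xml; then echo ok; fi (both file names are placeholders).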

Streaming:

By default xmllint takes file names as arguments. To feed it data through a pipe instead, pass - as the file name:

myths@business:~$ cat  |xmllint --format -
<?xml version="1.0"?>
<books>
  <book id="1">
    <name>book1</name>
    <price>100</price>
    <license/>
  </book>
  <book id="2">
    <name>book2</name>
    <price>200</price>
  </book>
  <book id="3">
    <name>book3</name>
    <price>300</price>
  </book>
</books>
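
The same - argument works with --xpath, which is handy when the XML comes from another command rather than a file. Here is a small sketch, with xml_value as a made-up helper name:

```shell
#!/usr/bin/env bash
# xml_value XPATH
# Hypothetical helper: read an XML document from stdin and print the string
# value of the first node matched by XPATH.
xml_value() {
  xmllint --xpath "string($1)" -
}
```

For instance, some_command | xml_value '//book[1]/name' would print the first book's name, where some_command stands for whatever produces the XML.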

xpath

Brief description

The xpath tool is a packaged Perl script of only about 200 lines. It is single-purpose: all it does is run XPath queries. It usually ships in the libxml-xpath-perl package, so it can be installed with a command such as sudo apt install libxml-xpath-perl. Some systems, such as SUSE, come with it preinstalled.

Function

The version installed may differ from system to system, but the basic functionality is similar:

myths@business:~$ xpath -e '//book/name/text()' 
Found 3 nodes in :
-- NODE --
book1
-- NODE --
book2
-- NODE --
book3

By default, the query results go to stdout and the descriptive messages go to stderr. To make the results easier to collect, redirect stderr to /dev/null or add the -q option:

myths@business:~$ xpath -e '//book/name/text()'  2>/dev/null
book1
book2
book3
myths@business:~$ xpath -q -e '//book/name/text()' 
book1
book2
book3

Note one small difference between xpath and xmllint's XPath support. When a query matches multiple results, xpath prints each one on its own line, while xmllint runs them together on a single line:

myths@business:~$ xmllint --xpath "//book/name/text()" 
book1book2book3
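
If only xmllint is available, one workaround is to ask for the matches one at a time. The sketch below uses XPath's count() to find the number of matches and then string() on each position. The name xpath_lines is hypothetical, and the separator xmllint puts between matches may differ with the libxml2 version installed, which is another reason not to depend on it.

```shell
#!/usr/bin/env bash
# xpath_lines FILE NODE_PATH
# Hypothetical helper: print the text value of every node matched by NODE_PATH,
# one per line. Runs xmllint once per match, so it suits small documents.
xpath_lines() {
  local file=$1 path=$2 count i
  # count() returns a number, which xmllint prints on a line of its own.
  count=$(xmllint --xpath "count(${path})" "$file") || return
  for ((i = 1; i <= count; i++)); do
    # (path)[i] selects the i-th match in document order.
    xmllint --xpath "string((${path})[${i}])" "$file"
  done
}
```

Called as xpath_lines books.xml '//book/name' (books.xml being a placeholder name), this prints each name on its own line, the same shape as the output of xpath -q.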

xml2

Brief description

I don't think many people know about xml2, but in some scenarios it works wonders in combination with other commands. The developer's blog seems to be gone, but the tool appears to be written in C on top of libxml2. It usually ships in the xml2 package, so it can be installed with a command such as sudo apt install xml2.

Function

The package provides six commands: xml2, 2xml, html2, 2html, csv2, and 2csv. Its job is very much in the Unix spirit: converting XML, HTML, and CSV to and from what the author calls a "flat format". For example:

myths@business:~$ cat  |xml2
/books/book/@id=1
/books/book/name=book1
/books/book/price=100
/books/book
/books/book/@id=2
/books/book/name=book2
/books/book/price=200
/books/book
/books/book/@id=3
/books/book/name=book3
/books/book/price=300
myths@business:~$ cat  |xml2|2xml
<books><book id="1"><name>book1</name><price>100</price></book><book id="2"><name>book2</name><price>200</price></book><book id="3"><name>book3</name><price>300</price></book></books>

This custom format is simple and clever. Some lines start a new node (/books/book), some assign a value to a node (/books/book/name=book1), and some assign a value to a node attribute (/books/book/@id=1). The notation resembles XPath but is not exactly the same. Chaining the two corresponding commands round-trips the document, as the xml2|2xml example above shows.

So what is this conversion good for? We often need to create XML files, but generating XML markup dynamically is tedious. In that case the flat format makes a convenient intermediate step:
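
Because every line of the flat format is a path=value pair, ordinary text tools can query or edit it. The following is a minimal sketch; the helper names flat_values and set_all_prices are made up, and the paths assume the books layout used in this article.

```shell
#!/usr/bin/env bash
# flat_values FLAT_PATH
# Hypothetical helper: read XML on stdin and print the value of every
# flat-format line whose path equals FLAT_PATH, one value per line.
flat_values() {
  xml2 | grep "^$1=" | cut -d= -f2-
}

# set_all_prices NEW_PRICE
# Hypothetical helper: read XML on stdin, replace every book price with
# NEW_PRICE in the flat format, and write the resulting XML to stdout.
# NEW_PRICE must not contain characters special to sed's replacement text,
# such as | or &.
set_all_prices() {
  xml2 | sed "s|^/books/book/price=.*|/books/book/price=$1|" | 2xml
}
```

For example, piping the sample file into flat_values /books/book/name lists the book names, and piping it into set_all_prices 50 yields the same document with every price set to 50.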

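#!/usr/bin/env bash
# Build the flat-format description in a temporary file, then convert it to XML.
tempFile=$(mktemp)

# addBook ID NAME PRICE: append one <book> element in flat format.
function addBook() {
  local id=$1 name=$2 price=$3
  {
    echo "/books/book"               # start a new <book> node
    echo "/books/book/@id=$id"       # set its id attribute
    echo "/books/book/name=$name"
    echo "/books/book/price=$price"
  } >> "$tempFile"
}

function main() {
  addBook 1 book1 100
  addBook 2 book2 200
  addBook 3 book3 300
  # 2xml turns the flat format back into XML; xmllint pretty-prints it.
  2xml < "$tempFile" | xmllint --format --output new_sample.xml -
  rm -f "$tempFile"
}

main "$@"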

The script above writes new_sample.xml, which has the same content as the sample file shown earlier.

Summary

That is all for this article. I hope it is a useful reference for your study or work. If you have any questions, feel free to leave a comment.