SoFunction
Updated on 2025-04-07

How to get XML document size

XML documents vary widely in both format and size. Some have only a few lines, while others run to several megabytes or more. You may wonder whether you really need to know how large an XML document is. Once performance becomes the primary concern, knowing the document size is essential.
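As a minimal sketch of how the size might be obtained in Python, the two helpers below cover a document stored on disk and a document held in memory. The function names are hypothetical and chosen only for illustration.

```python
import os


def xml_size_bytes(path: str) -> int:
    """Return the size of an XML file on disk, in bytes, without parsing it."""
    return os.path.getsize(path)


def xml_text_size_bytes(xml_text: str, encoding: str = "utf-8") -> int:
    """Return the size of an in-memory XML string once it is encoded."""
    return len(xml_text.encode(encoding))
```

Checking the size first is cheap, so a program can use it to decide how to parse the document before it reads any of the content.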


From a performance perspective, there are two ways to process XML documents. Batch processing parses documents in groups, with no need to respond to each one immediately. Real-time processing handles each document as it arrives. Batch performance is measured by how many documents are processed within a given period of time (throughput), while real-time performance is measured by how long it takes to process a single document (latency).
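The two measurements can be sketched with a simple timer. This is only an illustration: the `measure` function is a hypothetical name, and it assumes the documents are already in memory as strings and are parsed with DOM.

```python
import time
import xml.dom.minidom


def measure(documents):
    """Parse every document with DOM.

    Returns (documents per second, average seconds per document).
    """
    if not documents:
        raise ValueError("at least one document is required")
    start = time.perf_counter()
    for text in documents:
        xml.dom.minidom.parseString(text)
    elapsed = time.perf_counter() - start
    return len(documents) / elapsed, elapsed / len(documents)


throughput, latency = measure(["<order/>"] * 100)
print(f"{throughput:.0f} documents/s, {latency * 1000:.3f} ms per document")
```

A batch system would watch the first number, and a real-time system the second.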


Scenarios
Imagine you have a system that works in real time, such as a web server. This system needs to receive orders sent by customers in real time and respond to each order immediately.

Such a system obviously cannot use batch processing. As a rough estimate, assume a very simple order with only ten items: the generated XML documents will be relatively small, about 4KB each. In this case, you can use DOM to parse each received document.
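For a small order like this, DOM parsing might look like the sketch below. The order layout (`order` and `item` elements with `sku` and `quantity` attributes) is invented for the example and is not taken from any real system.

```python
from xml.dom.minidom import parseString

# Hypothetical order document; a real order would follow your own schema.
ORDER_XML = """<order id="1001">
  <item sku="A1" quantity="2"/>
  <item sku="B7" quantity="1"/>
</order>"""


def total_quantity(order_xml: str) -> int:
    """Load the whole order into memory with DOM and sum the item quantities."""
    document = parseString(order_xml)
    items = document.getElementsByTagName("item")
    return sum(int(item.getAttribute("quantity")) for item in items)
```

Because the whole document is only a few kilobytes, holding the complete tree in memory costs very little.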

If you receive only a few orders per hour, system performance is not a problem. But over time, order volume may grow to the point where you realize that performance must be improved.

Now you start thinking about how to improve performance to handle the growing load. Your order documents are already very small, and merging them into larger documents makes no practical sense. You can scale vertically by increasing the processing capacity of the existing system, or scale horizontally by adding more systems to spread the load.

Now consider a completely different situation: a large data warehouse. It is nothing like a web server. You use FTP to transfer XML documents with an average size of 300MB. If you still parse these documents with DOM, you will quickly run into serious trouble, because DOM loads the whole document into memory. SAX is a much better choice here, since it parses incoming XML documents as a stream without loading them entirely into memory first.
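A minimal SAX sketch using Python's standard `xml.sax` module is shown below. The `record` element name and the `RecordCounter` class are assumptions made for illustration; a real handler would do whatever work each record requires.

```python
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Count elements with a given tag as the parser streams past them."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def startElement(self, name, attrs):
        if name == self.tag:
            self.count += 1


def count_records(source, tag="record"):
    """Stream-parse a file name or file object and count matching elements."""
    handler = RecordCounter(tag)
    xml.sax.parse(source, handler)
    return handler.count


sample = io.BytesIO(b"<data><record/><record/><other/></data>")
print(count_records(sample))
```

The handler keeps only a counter, so memory use stays flat regardless of how large the input file is.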


Change the document size
Sometimes you will run into special situations where you need to change the size of your XML documents. Imagine that, as before, you have a web server that processes XML documents in real time, but now every document is 400MB instead of 4KB. You cannot use DOM because it would take up too much memory. Yet because this is a real-time system, performance is critical. You can use SAX, but that takes time and a powerful processor.

In this case, you can improve performance by changing the document size. For example, you can split the 400MB document into ten 40MB documents or forty 10MB documents, which can be processed more efficiently than a single 400MB document. You can then use DOM to read each file into memory and respond to each document request in a timely manner. You can also discard documents that are not relevant.
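One possible way to do the split is sketched below with `xml.etree.ElementTree.iterparse`, which reads the large document incrementally. The sketch assumes the large document is a flat list of record elements directly under the root; the function name, the `chunk` wrapper element, and the `order` tag in the usage lines are all hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def split_records(source, record_tag, per_chunk, root_tag="chunk"):
    """Yield small XML documents (bytes), each with up to per_chunk records.

    Assumes every record is a direct child of the root element.
    """
    chunk = ET.Element(root_tag)
    root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = element  # the first "start" event is the document root
        elif event == "end" and element.tag == record_tag:
            chunk.append(element)
            root.clear()  # detach parsed records from the big tree
            if len(chunk) == per_chunk:
                yield ET.tostring(chunk)
                chunk = ET.Element(root_tag)
    if len(chunk):
        yield ET.tostring(chunk)  # final, possibly smaller, chunk


big = io.BytesIO(b"<orders><order/><order/><order/><order/><order/></orders>")
for small_document in split_records(big, "order", per_chunk=2):
    print(len(small_document), "bytes")
```

Each yielded document is small enough to hand to a DOM parser, and a caller can simply skip the chunks it does not need.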

Batch processing has similar cases. Imagine you are batch processing thousands of 4KB documents with DOM. A better approach is to merge every thousand files into one 4MB file, because loading each document incurs a fixed overhead, whether you use DOM or SAX. By combining a thousand documents into one, you load only one document instead of a thousand, so that loading overhead drops to roughly one thousandth of the original.
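The merge can be sketched as follows. An XML document can have only one root element, so the small documents are placed under a new wrapper element; the `batch` wrapper name and the `merge_documents` function are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def merge_documents(documents, root_tag="batch"):
    """Combine many small XML documents into one document under a new root."""
    batch = ET.Element(root_tag)
    for text in documents:
        batch.append(ET.fromstring(text))
    return ET.tostring(batch, encoding="unicode")


merged = merge_documents(['<order id="1"/>', '<order id="2"/>'])
print(merged)
```

The merged result is loaded once, and the original documents are then available as the children of the wrapper element.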