
A detailed explanation of how to read and process very large files in Python 3

Requirement:

Xiao Ming is a Python beginner. After learning how to read files in Python, he wants to try a small exercise: count the number of numeric characters (0-9) in a file.

Scenario 1: Small file processing

Suppose we have a small test file, small_file.txt, which contains a few lines of random strings:

feiowe9322nasd9233rl
aoeijfiowejf8322kaf9a
...

Code example: file_process.py

def count_digits(fname):
    """Count how many numeric characters are contained in the file"""
    count = 0
    with open(fname) as file:
        for line in file:
            for s in line:
                if s.isdigit():
                    count += 1
    return count
 
 
fname = "./small_file.txt"
print(count_digits(fname))

Running results:

# Run the script
python3 ./file_process.py

# Output result
13

Scenario 2: Large file processing

Suppose our large file big_file.txt is 5 GB in size and all of its text is on a single line.

big_file.txt

df2if283rkwefh... <Remaining 5 GB size> ...

But when he ran the same program on the large file, Xiao Ming found that it took more than a minute to produce a result, and the execution consumed all 4 GB of the laptop's memory.

Problem analysis:

Why does the same code become so much less efficient on a large file? The reason is hidden in Xiao Ming's way of reading the file.

The file-reading approach used in the code can be regarded as the "standard practice" in Python: first use the with open(file_name) context-manager syntax to obtain a file object, then iterate over it with a for loop to get the file's content line by line. Why has this way of reading files become the standard? Because it offers two benefits:

(1) the with context manager automatically closes the file descriptor;

(2) when iterating over the file object, its content is returned line by line and does not occupy too much memory.

However, although this standard practice is good, it is not without shortcomings. If the file being read contains no newline characters, the second benefit listed above no longer holds. Without newlines, the program has no point at which it can break off while traversing the file object, so in the end it has to build one huge string object in a single pass, which wastes a great deal of time and memory. This is why count_digits() becomes extremely slow when processing big_file.txt.
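To reproduce the problem on a smaller scale, here is a minimal sketch (the file name one_line_file.txt and the roughly 64 MB size are just example values, not from the original article) that generates a test file whose entire content sits on a single line:

import random
import string

# Hypothetical helper: write ~64 MB of random letters and digits without a
# single newline, so that iterating the file "line by line" must load it all
chars = string.ascii_lowercase + string.digits
with open("./one_line_file.txt", "w") as f:
    for _ in range(1024):  # 1024 chunks of 64 KB each, all on one line
        f.write("".join(random.choices(chars, k=64 * 1024)))

Running count_digits() on a file generated this way should show the same pattern: memory usage grows with the file size, because the whole "line" is materialized at once.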

To solve this problem, we need to put this "standard practice" of reading files aside for the time being.

Solution:

Use a while loop together with the read() method to read the file in chunks.

In addition to traversing the file object directly to read the file's content line by line, we can also call the lower-level file.read() method. Unlike iterating over the file object with a loop, each call to file.read(chunk_size) immediately reads a chunk of chunk_size characters from the current cursor position, without waiting for any newline character to appear. With the help of the read() method, the optimized code looks like this:

def count_digits_v2(fname):
    """Count how many numeric characters are contained in the file, reading 8 KB at a time"""
    count = 0
    block_size = 1024 * 8
    with open(fname) as file:
        while True:
            chunk = file.read(block_size)
            # When there is no more content in the file, the read call returns an empty string ''
            if not chunk:
                break
            for s in chunk:
                if s.isdigit():
                    count += 1
    return count
 
 
fname = "./big_file.txt"
print(count_digits_v2(fname))

In the new function, we use a while loop to read the file content, at most 8 KB at a time. The program no longer needs to assemble a string that is gigabytes long in memory, so memory usage drops dramatically.

(A gigabyte is a data storage unit usually used to describe the capacity of large storage devices. It equals 1024^3 (1,073,741,824) bytes, or 1,024 megabytes. In computing it is commonly used to describe the size of large files, programs, or data sets, as well as hard disk and memory capacity.)
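As a further refinement (this variant is not from the code above, just a sketch of the same chunk-reading technique), the reading logic can be pulled out into a generator function so that the counting code goes back to a simple for loop:

def chunked_file_reader(file, block_size=1024 * 8):
    """Generator: yield the file's content chunk by chunk until EOF."""
    while True:
        chunk = file.read(block_size)
        # read() returns an empty string '' once the end of the file is reached
        if not chunk:
            break
        yield chunk


def count_digits_v3(fname):
    """Count numeric characters, delegating the chunked reading to the generator."""
    count = 0
    with open(fname) as file:
        for chunk in chunked_file_reader(file):
            count += sum(1 for s in chunk if s.isdigit())
    return count

The memory behaviour is the same as count_digits_v2(); the generator only changes how the code is organized.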

Extension: Use Python to read some lines in a very large file

Reading files is a very common task in Python, and the usual approach is to call readlines() to load all the lines at once.
But for very large files (such as a 100 GB tsv), loading all the lines directly is very slow.

If you want to traverse the entire file and process each line, you don't actually need to load all lines at once.
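To make the difference concrete, here is a small sketch (the file name all_items.tsv is reused from below; the processing step is only a placeholder) contrasting loading everything at once with letting the file object produce one line at a time:

# Loads every line into a list at once -- memory grows with the file size
with open('all_items.tsv') as f:
    all_lines = f.readlines()

# Produces one line at a time -- memory stays roughly constant
with open('all_items.tsv') as f:
    for line in f:
        pass  # process the line here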

Here is an iterator method to read the file:

file_name = 'all_items.tsv'
start_line = 110000
end_line = 120000

with open(file_name) as f:
    # Skip the first start_line lines
    for i in range(0, start_line):
        next(f)
    # Read the lines in the [start_line, end_line) interval
    lines = [next(f) for i in range(start_line, end_line)]
print(len(lines))

For the large file all_items.tsv, we only read and process the lines within a certain interval. We first use the iterator to skip to the start_line position, and then start reading.
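As an alternative (not part of the code above, just a sketch using the same file name and line numbers), the standard library's itertools.islice can express the same interval read in one step:

from itertools import islice

file_name = 'all_items.tsv'
start_line = 110000
end_line = 120000

with open(file_name) as f:
    # islice lazily skips the first start_line lines and stops at end_line,
    # so only one line at a time is held in memory while skipping
    lines = list(islice(f, start_line, end_line))
print(len(lines))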
If instead we want to use an iterator to traverse all the lines of this file (and process each line):

file_name = 'all_items.tsv'

def process_line(line):
    # Placeholder: do something useful with the line here
    return line

with open(file_name) as f:
    while True:
        try:
            line = next(f)
            process_line(line)
        except StopIteration:
            # next() raises StopIteration once the end of the file is reached
            break

Use a while loop to complete the traversal from beginning to end.
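For reference, because all_items.tsv does contain newlines, the same traversal can also be written more idiomatically by iterating over the file object directly, just as in Scenario 1; this sketch reuses the file_name and process_line names from the snippet above:

# Equivalent to the while/next loop above, written as a plain for loop
with open(file_name) as f:
    for line in f:
        process_line(line)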

The above is a detailed explanation of how to read and process very large files in Python 3. For more information about handling very large files in Python 3, please pay attention to my other related articles!