Handling large text files is a common challenge for programmers, especially when a TXT file of hundreds of MB or even several GB has to be split into small pieces; doing it by hand is obviously unrealistic. Today we will look at how to automate this task in Python, and in particular how to accurately control the size of each split file to 4KB.
Why do you need to split TXT files
In actual development, we may encounter these situations:
- Some old systems can only process small text files
- Log files need to be split before uploading to cloud storage
- Data has to be sliced for distributed processing
- Small files make for quick tests when debugging
4KB is a very common split size because it happens to be the default memory page size on many systems, which makes it very efficient to process. So the question is: how do we implement this in Python?
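Before diving in, one quick aside: you can check the actual page size on your own machine. This is a minimal check using the standard mmap module; the splitting code does not depend on it:

```python
import mmap

# The memory page size of the current system; 4096 bytes on most platforms
print(mmap.PAGESIZE)
```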
Basic version: split by line
Let's first look at the simplest implementation method:
```python
def split_by_line(input_file, output_prefix, chunk_size=4000):
    with open(input_file, 'r', encoding='utf-8') as f:
        file_count = 1
        current_size = 0
        output_file = None
        for line in f:
            # Measure the line in encoded bytes, not characters
            line_size = len(line.encode('utf-8'))
            if current_size + line_size > chunk_size:
                if output_file:
                    output_file.close()
                output_file = open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8')
                file_count += 1
                current_size = 0
            if not output_file:
                output_file = open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8')
                file_count += 1
            output_file.write(line)
            current_size += line_size
        if output_file:
            output_file.close()
```
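A minimal usage sketch (the file name and prefix here are just placeholders):

```python
# Produces part_1.txt, part_2.txt, ... of roughly 4000 bytes each
split_by_line('big.txt', 'part', chunk_size=4000)
```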
This script splits the file by line, trying to keep each output file under the specified size. But there is a problem: it cannot guarantee that each file is exactly 4KB, and when a single line is particularly long, an output file may exceed the limit.
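You can reproduce that weakness with a quick experiment (the file names are hypothetical):

```python
import os

# A file whose only line is about 10KB cannot be kept under a
# 4000-byte limit by any line-based splitter
with open('one_long_line.txt', 'w', encoding='utf-8') as f:
    f.write('x' * 10000 + '\n')

split_by_line('one_long_line.txt', 'long', chunk_size=4000)
print(os.path.getsize('long_1.txt'))  # ~10001 bytes, well over 4000
```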
Advanced version: Accurately control file size
To achieve more precise control, we need to work in bytes rather than lines:
```python
def split_by_size(input_file, output_prefix, chunk_size=4096):
    with open(input_file, 'rb') as f:
        file_count = 1
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                out_file.write(chunk)
            file_count += 1
```
Notice! Here we open the file in binary mode ('rb'), which lets us control the number of bytes read exactly. However, the output may be garbled for UTF-8-encoded Chinese files, because a Chinese character can be truncated in the middle.
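You can see the problem directly in the interpreter:

```python
data = "你好".encode('utf-8')  # 6 bytes in total, 3 per Chinese character
data[:4].decode('utf-8')       # raises UnicodeDecodeError: the second character is cut in half
```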
Perfect solution: Support UTF-8 encoding
To solve the garbled-Chinese problem, we need a smarter approach:
```python
def split_utf8_safely(input_file, output_prefix, chunk_size=4096):
    buffer = ""
    file_count = 1
    current_size = 0
    with open(input_file, 'r', encoding='utf-8') as f:
        while True:
            char = f.read(1)
            if not char:
                # End of file: flush whatever is left in the buffer
                if buffer:
                    with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                        out_file.write(buffer)
                break
            char_size = len(char.encode('utf-8'))
            if current_size + char_size > chunk_size:
                with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                    out_file.write(buffer)
                file_count += 1
                buffer = ""
                current_size = 0
            buffer += char
            current_size += char_size
```
This method reads the file character by character, so multibyte characters are never truncated. It is slower, but every split file will display Chinese content correctly.
Performance optimization: Use buffers
When processing large files, reading character by character is far too slow. We can use a buffer to improve performance:
```python
def split_with_buffer(input_file, output_prefix, chunk_size=4096, buffer_size=1024):
    buffer = ""
    file_count = 1
    with open(input_file, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                # End of file: flush the remainder
                if buffer:
                    with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                        out_file.write(buffer)
                break
            buffer += chunk
            while len(buffer.encode('utf-8')) >= chunk_size:
                # Find the longest prefix that does not exceed chunk_size bytes
                split_pos = 0
                for i in range(1, len(buffer) + 1):
                    if len(buffer[:i].encode('utf-8')) <= chunk_size:
                        split_pos = i
                    else:
                        break
                with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                    out_file.write(buffer[:split_pos])
                file_count += 1
                buffer = buffer[split_pos:]
```
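One caveat: the prefix search above re-encodes a growing slice of the buffer on every iteration, which gets expensive for large buffers. A faster alternative, shown below as a sketch that is not part of the original code, works on encoded bytes directly and backs the cut point up past UTF-8 continuation bytes (which always have the bit pattern 10xxxxxx). It assumes the input is valid UTF-8 and writes binary output like split_by_size:

```python
def find_safe_split(data: bytes, limit: int) -> int:
    """Largest cut position <= limit that does not break a UTF-8 sequence."""
    pos = min(limit, len(data))
    # Back up while the byte at the cut point is a continuation byte (0b10xxxxxx)
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

def split_fast(input_file, output_prefix, chunk_size=4096):
    buffer = b""
    file_count = 1
    with open(input_file, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            buffer += chunk
            # Flush once we hold more than one chunk's worth of bytes, so the
            # byte *after* the cut point is available for the boundary check;
            # after EOF (empty chunk) flush whatever remains
            while len(buffer) > chunk_size or (not chunk and buffer):
                pos = find_safe_split(buffer, chunk_size)
                with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                    out_file.write(buffer[:pos])
                file_count += 1
                buffer = buffer[pos:]
            if not chunk:
                break
```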
Handle special cases
In practical applications, we also need to consider a few special cases:
- File headers: the header lines of the original file may need to be kept in every split file
- Line integrity: some scenarios require lines to stay intact
- Memory limits: very large files call for memory optimization
- Progress display: a progress bar makes long-running tasks friendlier (see the sketch after the header example below)
Here is an implementation that preserves the file header:
```python
def split_with_header(input_file, output_prefix, chunk_size=4096, header_lines=1):
    # Read the header lines first
    with open(input_file, 'r', encoding='utf-8') as f:
        header = [next(f) for _ in range(header_lines)]
    header_size = len(''.join(header).encode('utf-8'))

    buffer = ""
    file_count = 1
    current_size = header_size
    with open(input_file, 'r', encoding='utf-8') as f:
        # Skip the header that was already read
        for _ in range(header_lines):
            next(f)
        while True:
            char = f.read(1)
            if not char:
                if buffer:
                    with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                        out_file.writelines(header)
                        out_file.write(buffer)
                break
            char_size = len(char.encode('utf-8'))
            if current_size + char_size > chunk_size:
                with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                    out_file.writelines(header)
                    out_file.write(buffer)
                file_count += 1
                buffer = ""
                current_size = header_size
            buffer += char
            current_size += char_size
```
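For the progress display mentioned in the list above, here is a minimal sketch that reports a percentage based on bytes consumed. It wraps the byte-based approach; a library such as tqdm would work just as well, but this version avoids the dependency:

```python
import os

def split_with_progress(input_file, output_prefix, chunk_size=4096):
    total = os.path.getsize(input_file) or 1  # avoid dividing by zero on an empty file
    done = 0
    file_count = 1
    with open(input_file, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                out_file.write(chunk)
            file_count += 1
            done += len(chunk)
            print(f"\rProgress: {done * 100 // total}%", end="", flush=True)
    print()
```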
Summary
We have covered several ways to split TXT files in Python:
- Line-based splitting: the simplest, good for files with a clear line structure
- Byte-based splitting: the most efficient, but not UTF-8-safe
- The UTF-8-safe version: suitable for Chinese text
- The buffered version: balances performance and accuracy
- Special requirements such as preserving file headers need extra handling
Remember: which method to choose depends on your specific needs. If you are dealing with GB-scale files, the buffered scheme is recommended, and advanced techniques such as memory mapping are worth considering. Hopefully this guide helps you solve your file-splitting problems!
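As a pointer for the memory-mapping idea, here is a minimal sketch using the standard mmap module. Note that, like the plain byte-based version, it cuts at raw byte offsets and would need something like find_safe_split above to stay UTF-8-safe; it also assumes a non-empty input file, since mmap cannot map an empty one:

```python
import mmap

def split_with_mmap(input_file, output_prefix, chunk_size=4096):
    with open(input_file, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            file_count = 1
            # Slicing an mmap returns bytes without first reading the whole file into memory
            for start in range(0, len(mm), chunk_size):
                with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                    out_file.write(mm[start:start + chunk_size])
                file_count += 1
```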
This concludes the article on splitting a large TXT file into 4KB pieces with Python. For more on handling large files in Python, please search my previous articles or browse the related articles below, and I hope you will continue to support me!