Handling large text files is a common challenge for programmers, especially when a TXT file of hundreds of MB or even several GB has to be split into small pieces; doing it by hand is obviously unrealistic. Today we will look at how to automate this task in Python, and in particular how to accurately control the size of each split file to 4KB.
Why do you need to split TXT files
In actual development, we may encounter these situations:
- Some old systems can only process small text files
- Log files need to be split before uploading to cloud storage
- Data has to be sliced for distributed processing
- Small files make for quick tests when debugging
4KB is a very common split size because it happens to be the default memory page size on many systems, which makes it very efficient to process. So the question is: how do we implement this in Python?
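Before diving in, one quick aside: you can check the actual page size on your own machine. This is a minimal check using the standard mmap module; the splitting code does not depend on it:

```python
import mmap

# The memory page size of the current system; 4096 bytes on most platforms
print(mmap.PAGESIZE)
```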
Basic version: split by line
Let's first look at the simplest implementation method:
```python
def split_by_line(input_file, output_prefix, chunk_size=4000):
    with open(input_file, 'r', encoding='utf-8') as f:
        file_count = 1
        current_size = 0
        output_file = None
        for line in f:
            # Measure the line in encoded bytes, not characters
            line_size = len(line.encode('utf-8'))
            if current_size + line_size > chunk_size:
                if output_file:
                    output_file.close()
                output_file = open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8')
                file_count += 1
                current_size = 0
            if not output_file:
                output_file = open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8')
                file_count += 1
            output_file.write(line)
            current_size += line_size
        if output_file:
            output_file.close()
```
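A minimal usage sketch (the file name and prefix here are just placeholders):

```python
# Produces part_1.txt, part_2.txt, ... of roughly 4000 bytes each
split_by_line('big.txt', 'part', chunk_size=4000)
```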
This script splits the file by line, trying to keep each output file under the specified size. But there is a problem: it cannot guarantee that each file is exactly 4KB, and when a single line is particularly long, an output file may exceed the limit.
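You can reproduce that weakness with a quick experiment (the file names are hypothetical):

```python
import os

# A file whose only line is about 10KB cannot be kept under a
# 4000-byte limit by any line-based splitter
with open('one_long_line.txt', 'w', encoding='utf-8') as f:
    f.write('x' * 10000 + '\n')

split_by_line('one_long_line.txt', 'long', chunk_size=4000)
print(os.path.getsize('long_1.txt'))  # ~10001 bytes, well over 4000
```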
Advanced version: Accurately control file size
To achieve more precise control, we need to work in bytes rather than lines:
```python
def split_by_size(input_file, output_prefix, chunk_size=4096):
    with open(input_file, 'rb') as f:
        file_count = 1
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                out_file.write(chunk)
            file_count += 1
```
Notice! Here we open the file in binary mode ('rb'), which lets us control the number of bytes read exactly. However, the output may be garbled for UTF-8-encoded Chinese files, because a Chinese character can be truncated in the middle.
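You can see the problem directly in the interpreter:

```python
data = "你好".encode('utf-8')  # 6 bytes in total, 3 per Chinese character
data[:4].decode('utf-8')       # raises UnicodeDecodeError: the second character is cut in half
```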
Perfect solution: Support UTF-8 encoding
To solve the garbled-Chinese problem, we need a smarter approach:
```python
def split_utf8_safely(input_file, output_prefix, chunk_size=4096):
    buffer = ""
    file_count = 1
    current_size = 0
    with open(input_file, 'r', encoding='utf-8') as f:
        while True:
            char = f.read(1)
            if not char:
                # End of file: flush whatever is left in the buffer
                if buffer:
                    with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                        out_file.write(buffer)
                break
            char_size = len(char.encode('utf-8'))
            if current_size + char_size > chunk_size:
                with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                    out_file.write(buffer)
                file_count += 1
                buffer = ""
                current_size = 0
            buffer += char
            current_size += char_size
```
This method reads the file character by character, so multibyte characters are never truncated. It is slower, but every split file will display Chinese content correctly.
Performance optimization: Use buffers
When processing large files, reading character by character is far too slow. We can use a buffer to improve performance:
```python
def split_with_buffer(input_file, output_prefix, chunk_size=4096, buffer_size=1024):
    buffer = ""
    file_count = 1
    with open(input_file, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                # End of file: flush the remainder
                if buffer:
                    with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                        out_file.write(buffer)
                break
            buffer += chunk
            while len(buffer.encode('utf-8')) >= chunk_size:
                # Find the longest prefix that does not exceed chunk_size bytes
                split_pos = 0
                for i in range(1, len(buffer) + 1):
                    if len(buffer[:i].encode('utf-8')) <= chunk_size:
                        split_pos = i
                    else:
                        break
                with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                    out_file.write(buffer[:split_pos])
                file_count += 1
                buffer = buffer[split_pos:]
```
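One caveat: the prefix search above re-encodes a growing slice of the buffer on every iteration, which gets expensive for large buffers. A faster alternative, shown below as a sketch that is not part of the original code, works on encoded bytes directly and backs the cut point up past UTF-8 continuation bytes (which always have the bit pattern 10xxxxxx). It assumes the input is valid UTF-8 and writes binary output like split_by_size:

```python
def find_safe_split(data: bytes, limit: int) -> int:
    """Largest cut position <= limit that does not break a UTF-8 sequence."""
    pos = min(limit, len(data))
    # Back up while the byte at the cut point is a continuation byte (0b10xxxxxx)
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

def split_fast(input_file, output_prefix, chunk_size=4096):
    buffer = b""
    file_count = 1
    with open(input_file, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            buffer += chunk
            # Flush once we hold more than one chunk's worth of bytes, so the
            # byte *after* the cut point is available for the boundary check;
            # after EOF (empty chunk) flush whatever remains
            while len(buffer) > chunk_size or (not chunk and buffer):
                pos = find_safe_split(buffer, chunk_size)
                with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                    out_file.write(buffer[:pos])
                file_count += 1
                buffer = buffer[pos:]
            if not chunk:
                break
```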
Handle special cases
In practical applications, we also need to consider a few special cases:
- File headers: the header lines of the original file may need to be kept in every split file
- Line integrity: some scenarios require lines to stay intact
- Memory limits: very large files call for memory optimization
- Progress display: a progress bar makes long-running tasks friendlier (see the sketch after the header example below)
Here is an implementation that preserves the file header:
```python
def split_with_header(input_file, output_prefix, chunk_size=4096, header_lines=1):
    # Read the header lines first
    with open(input_file, 'r', encoding='utf-8') as f:
        header = [next(f) for _ in range(header_lines)]
    header_size = len(''.join(header).encode('utf-8'))

    buffer = ""
    file_count = 1
    current_size = header_size
    with open(input_file, 'r', encoding='utf-8') as f:
        # Skip the header that was already read
        for _ in range(header_lines):
            next(f)
        while True:
            char = f.read(1)
            if not char:
                if buffer:
                    with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                        out_file.writelines(header)
                        out_file.write(buffer)
                break
            char_size = len(char.encode('utf-8'))
            if current_size + char_size > chunk_size:
                with open(f"{output_prefix}_{file_count}.txt", 'w', encoding='utf-8') as out_file:
                    out_file.writelines(header)
                    out_file.write(buffer)
                file_count += 1
                buffer = ""
                current_size = header_size
            buffer += char
            current_size += char_size
```
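For the progress display mentioned in the list above, here is a minimal sketch that reports a percentage based on bytes consumed. It wraps the byte-based approach; a library such as tqdm would work just as well, but this version avoids the dependency:

```python
import os

def split_with_progress(input_file, output_prefix, chunk_size=4096):
    total = os.path.getsize(input_file) or 1  # avoid dividing by zero on an empty file
    done = 0
    file_count = 1
    with open(input_file, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                out_file.write(chunk)
            file_count += 1
            done += len(chunk)
            print(f"\rProgress: {done * 100 // total}%", end="", flush=True)
    print()
```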
Summary
We have covered several ways to split TXT files in Python:
- Line-based splitting: the simplest, good for files with a clear line structure
- Byte-based splitting: the most efficient, but not UTF-8-safe
- The UTF-8-safe version: suitable for Chinese text
- The buffered version: balances performance and accuracy
- Special requirements such as preserving file headers need extra handling
Remember: which method to choose depends on your specific needs. If you are dealing with GB-scale files, the buffered scheme is recommended, and advanced techniques such as memory mapping are worth considering. Hopefully this guide helps you solve your file-splitting problems!
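As a pointer for the memory-mapping idea, here is a minimal sketch using the standard mmap module. Note that, like the plain byte-based version, it cuts at raw byte offsets and would need something like find_safe_split above to stay UTF-8-safe; it also assumes a non-empty input file, since mmap cannot map an empty one:

```python
import mmap

def split_with_mmap(input_file, output_prefix, chunk_size=4096):
    with open(input_file, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            file_count = 1
            # Slicing an mmap returns bytes without first reading the whole file into memory
            for start in range(0, len(mm), chunk_size):
                with open(f"{output_prefix}_{file_count}.txt", 'wb') as out_file:
                    out_file.write(mm[start:start + chunk_size])
                file_count += 1
```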
This concludes the article on splitting a large TXT file into 4KB pieces with Python. For more on handling large files in Python, please search my previous articles or browse the related articles below, and I hope you will continue to support me!