1. Extract all JSONL files in a directory with Python, traversing a field in each file for extraction and merging.
Implement code
import os
import json
import time
from tqdm import tqdm  # needs: pip install tqdm

def process_files():
    # Set directory path
    dir_path = r"D:\daku\Keyword Recognition\1623-0000001\zh"

    # Get the file list, sorted by file size in descending order
    file_list = sorted(
        [f for f in os.listdir(dir_path) if f.lower().endswith('.jsonl')],
        key=lambda x: os.path.getsize(os.path.join(dir_path, x)),
        reverse=True)

    # Progress statistics
    total_files = len(file_list)
    processed_files = 0
    total_lines = sum(
        1 for f in file_list
        for _ in open(os.path.join(dir_path, f), 'r', encoding='utf-8'))
    processed_lines = 0
    start_time = time.time()

    # Output file settings
    output_file = os.path.join(dir_path, "combined_contents.txt")
    with open(output_file, "w", encoding="utf-8") as outfile:
        with tqdm(total=total_lines, desc="Merge progress", unit="line") as pbar:
            for filename in file_list:
                file_path = os.path.join(dir_path, filename)
                try:
                    with open(file_path, "r", encoding="utf-8") as infile:
                        file_size = os.path.getsize(file_path)
                        # Dynamically adjust the read block size
                        chunk_size = max(1024 * 1024, file_size // 100)
                        while True:
                            lines = infile.readlines(chunk_size)
                            if not lines:
                                break
                            for line in lines:
                                line = line.strip()
                                if not line:
                                    continue
                                try:
                                    data = json.loads(line)
                                    # Remove newline characters inside the content
                                    content = data.get("content", "").replace("\n", " ")
                                    # Separate records with a blank line
                                    outfile.write(content + "\n\n")
                                    processed_lines += 1
                                except json.JSONDecodeError:
                                    print(f"\nJSON parsing failed: {filename}, line {processed_lines + 1}")
                                except Exception as e:
                                    print(f"\nProcessing error: {filename}, line {processed_lines + 1} - {str(e)}")
                                # Progress update
                                pbar.update(1)
                                if processed_lines % 1000 == 0:
                                    elapsed = time.time() - start_time
                                    speed = processed_lines / (elapsed + 1e-5)
                                    remaining = (total_lines - processed_lines) / (speed + 1e-5)
                                    pbar.set_postfix({
                                        'speed': f"{speed:.1f} lines/s",
                                        'remaining': f"{remaining // 3600:.0f}h {remaining % 3600 // 60:.0f}m"
                                    })
                    processed_files += 1
                except Exception as e:
                    print(f"\nUnable to read file {filename}: {str(e)}")

    # Print a summary report
    end_time = time.time()
    print(f"\nMerge completed! Processed {processed_files}/{total_files} files")
    print(f"Total records: {processed_lines:,}")
    print(f"Elapsed time: {end_time - start_time:.2f} s")
    print(f"Output file path: {output_file}")

if __name__ == "__main__":
    process_files()
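The heart of the script above is the per-line transform: parse each line as JSON, pull out the "content" field, and flatten internal newlines. A minimal sketch of just that step (the field name "content" matches the script; the helper name is my own):

```python
import json
from typing import Optional

def extract_content(line: str) -> Optional[str]:
    """Parse one JSONL line and return its flattened "content" field,
    or None if the line is blank or not valid JSON."""
    line = line.strip()
    if not line:
        return None
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        return None
    # Replace embedded newlines with spaces, as the merge script does
    return data.get("content", "").replace("\n", " ")

record = '{"content": "first line\\nsecond line"}'
print(extract_content(record))  # first line second line
```

Isolating the transform like this makes it easy to unit-test independently of the file walking and progress reporting.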
Knowledge extension:
The difference between json file and jsonl file
A JSON file stores data in JSON (JavaScript Object Notation) format: a structured text format that represents data as key-value pairs. A JSON file usually contains a single root object or array, which can hold nested objects, arrays, and primitive data types.
A JSONL (JSON Lines) file is a text format with one independent JSON object per line. Unlike a typical JSON file, which wraps its records in a top-level list, JSONL has no enclosing list: each line is a self-contained object, and records are separated only by newlines, with no commas or other delimiters. This makes JSONL lighter and well suited to streaming: a program can read and parse one line at a time instead of loading every record of a list into memory at once, which saves memory and keeps the data readable. The pip-installable jsonlines package provides convenient helpers for the format, though the standard json module handles it line by line just as well.
Example of contents of JSON files:
[{"name": "John", "age": 30}, {"name": "Jane", "age": 25}, {"name": "Bob", "age": 40}]
Example of contents of JSONL files:
{"name": "John", "age": 30}
{"name": "Jane", "age": 25}
{"name": "Bob", "age": 40}
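The JSONL sample above can be consumed one record at a time with nothing but the standard json module; here an in-memory StringIO stands in for a real file handle:

```python
import io
import json

# The JSONL example from above, one object per line
jsonl_text = (
    '{"name": "John", "age": 30}\n'
    '{"name": "Jane", "age": 25}\n'
    '{"name": "Bob", "age": 40}\n'
)

people = []
with io.StringIO(jsonl_text) as f:  # stands in for open("people.jsonl")
    for line in f:
        line = line.strip()
        if line:
            people.append(json.loads(line))

print([p["name"] for p in people])  # ['John', 'Jane', 'Bob']
```

Each `json.loads` call parses a single record, so only one line needs to be valid and in memory at any moment.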
The main differences are as follows:
JSON file:
- Use braces {} to represent objects and square brackets [] to represent arrays.
- The entire file is a valid JSON object or array.
- Suitable for storing structured data, such as configuration files, API responses, etc.
- Read the entire file at once, parse it into a JSON object, and the data in it can be accessed randomly.
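The whole-file, random-access style of the JSON format can be seen with the example content from above (parsed here from a string; `json.load` on a file object works the same way):

```python
import json

# The JSON example from earlier: one top-level array holding every record
json_text = '[{"name": "John", "age": 30}, {"name": "Jane", "age": 25}, {"name": "Bob", "age": 40}]'

records = json.loads(json_text)  # the entire document is parsed in one call
print(records[2]["name"])        # Bob — any element is accessible immediately
```

Because the parse happens once up front, any record can be indexed directly afterward, at the cost of holding the whole structure in memory.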
JSONL file:
- Each line is an independent valid JSON object.
- There are no commas or other separators between each line.
- Suitable for data recorded one item per line, such as log entries and sensor readings.
- Read the file line by line, parsing and processing one JSON record at a time.
JSONL files are suitable for:
- When data is stored one record per line, with no separator between records other than the newline.
- When data needs to be processed line by line to save memory and improve processing speed.
- When the amount of data is very large and cannot be loaded into memory at one time, the JSONL format provides a way to stream data.
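The streaming style the list above describes is naturally expressed as a generator: only one line is ever held in memory, no matter how large the file is. A sketch (the file name and records here are made up for the demo):

```python
import json
import os
import tempfile
from typing import Iterator

def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSONL record at a time, keeping memory use flat."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Demo on a small temporary file standing in for a large events.jsonl
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "events.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"event": "login"}\n{"event": "logout"}\n')
    events = [rec["event"] for rec in iter_jsonl(path)]

print(events)  # ['login', 'logout']
```

Because `iter_jsonl` is lazy, it can be chained with filters or writers to process files far larger than available RAM.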
By comparison, JSON files are better suited to storing and transmitting structured documents as a whole, while JSONL files are better suited to data that is recorded and processed one line at a time.
This covers the difference between JSON and JSONL files in Python; the next section applies the same merging idea to plain text files.
2. Extract multiple text format contents for merging
That is, extract all the text files in a directory, then merge and deduplicate their lines.
Implement code
import os
from chardet import detect  # needs: pip install chardet

def get_safe_encoding(encoding):
    """Convert detected encodings to safer compatible encodings"""
    encoding = encoding.lower()
    if encoding in ['gb2312', 'gbk']:
        return 'gb18030'  # the most comprehensive Chinese encoding
    return encoding

def get_file_encoding(file_path):
    """Detect the file encoding and automatically upgrade it to a safer superset"""
    with open(file_path, 'rb') as f:
        raw_data = f.read(10000)
    result = detect(raw_data)
    # Discard low-confidence detections (confidence < 0.8 is considered untrustworthy)
    if result['confidence'] < 0.8:
        return 'gb18030'
    return get_safe_encoding(result['encoding'])

def merge_files(directory, output_filename='merged.txt'):
    seen_lines = set()
    output_path = os.path.join(directory, output_filename)
    txt_files = [os.path.join(directory, f) for f in os.listdir(directory)
                 if f.endswith('.txt') and f != output_filename]
    with open(output_path, 'w', encoding='utf-8', errors='ignore') as outfile:
        for file_path in txt_files:
            try:
                # Get a safe encoding, with error handling for undecodable bytes
                file_enc = get_file_encoding(file_path)
                with open(file_path, 'r', encoding=file_enc,
                          errors='backslashreplace') as infile:  # keep characters that cannot be decoded
                    for line_idx, line in enumerate(infile, 1):
                        try:
                            stripped_line = line.rstrip('\n')
                            if stripped_line not in seen_lines:
                                outfile.write(line)
                                seen_lines.add(stripped_line)
                        except Exception as line_err:
                            print(f"File {os.path.basename(file_path)}, line {line_idx} error: {str(line_err)}")
                            continue
            except Exception as file_err:
                print(f"File {os.path.basename(file_path)} could not be read: {str(file_err)}")
                continue

if __name__ == '__main__':
    target_directory = r'D:\daku\Keyword Recognition\stop6931'
    merge_files(target_directory)
    print(f'Merge completed, output file: {os.path.join(target_directory, "merged.txt")}')
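The deduplication in merge_files hinges on one detail: lines are compared with the trailing newline stripped, but written out with it intact, so the last line of a file (which may lack a newline) still deduplicates against earlier copies. That core logic in isolation, on made-up sample lines:

```python
lines = ["alpha\n", "beta\n", "alpha\n", "gamma\n", "beta\n"]

seen = set()
merged = []
for line in lines:
    key = line.rstrip("\n")  # compare without the trailing newline, as merge_files does
    if key not in seen:
        seen.add(key)
        merged.append(line)  # but write the original line, newline included

print(merged)  # ['alpha\n', 'beta\n', 'gamma\n']
```

Note that the `seen` set grows with the number of distinct lines, so for very large inputs a disk-backed structure or sorting-based dedup would be needed instead.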
This concludes the article on extracting and merging JSONL data fields with Python. I hope you find it useful.