Python data cleaning: extracting jsonl file data fields and merging

This article uses Python to process every ".jsonl" file in a directory: each file is read line by line, one field is extracted from every record, and the extracted values are merged into a single output file.

Implementation code

import os
import json
import time
from tqdm import tqdm  # Install first: pip install tqdm


def process_files():
    # Set the directory path
    dir_path = r"D:\daku\Keyword Recognition\1623-0000001\zh"

    # Get the file list, sorted by file size in descending order
    file_list = sorted([f for f in os.listdir(dir_path) if f.endswith('.jsonl')],
                       key=lambda x: os.path.getsize(os.path.join(dir_path, x)),
                       reverse=True)

    # Progress statistics
    total_files = len(file_list)
    processed_files = 0
    total_lines = sum(1 for f in file_list for _ in open(os.path.join(dir_path, f), 'r', encoding='utf-8'))
    processed_lines = 0
    start_time = time.time()

    # Output file settings
    output_file = os.path.join(dir_path, "combined_contents.txt")

    with open(output_file, "w", encoding="utf-8") as outfile:
        with tqdm(total=total_lines, desc="Merge progress", unit="line") as pbar:
            for filename in file_list:
                file_path = os.path.join(dir_path, filename)
                try:
                    with open(file_path, "r", encoding="utf-8") as infile:
                        file_size = os.path.getsize(file_path)
                        chunk_size = max(1024 * 1024, file_size // 100)  # Dynamically adjust the read block size
                        while True:
                            # readlines(hint) returns whole lines totalling roughly chunk_size bytes
                            lines = infile.readlines(chunk_size)
                            if not lines:
                                break

                            for line_num, line in enumerate(lines, 1):
                                line = line.strip()
                                if not line:
                                    continue

                                try:
                                    data = json.loads(line)
                                    content = data.get("content", "").replace("\n", " ")  # Remove newlines inside the content
                                    outfile.write(content + "\n\n")  # Separate records with a blank line
                                    processed_lines += 1
                                except json.JSONDecodeError:
                                    print(f"\nJSON parsing failed: {filename}, record {processed_lines + 1}")
                                except Exception as e:
                                    print(f"\nProcessing error: {filename}, record {processed_lines + 1} - {str(e)}")

                                # Progress update
                                pbar.update(1)
                                if processed_lines % 1000 == 0:
                                    elapsed = time.time() - start_time
                                    speed = processed_lines / (elapsed + 1e-5)
                                    remaining = (total_lines - processed_lines) / (speed + 1e-5)
                                    pbar.set_postfix({
                                        'speed': f"{speed:.1f} lines/s",
                                        'remaining': f"{remaining // 3600:.0f}h {remaining % 3600 // 60:.0f}m"
                                    })

                    processed_files += 1
                except Exception as e:
                    print(f"\nUnable to read the file {filename}: {str(e)}")

    # Print a summary report
    end_time = time.time()
    print(f"\nMerge completed! Processed {processed_files}/{total_files} files")
    print(f"Total records: {processed_lines:,}")
    print(f"Elapsed time: {end_time - start_time:.2f} seconds")
    print(f"Output file path: {output_file}")


if __name__ == "__main__":
    process_files()
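
The core of the script, reading a line, parsing it, and pulling out one field, can also be isolated into a small generator that is easier to test on its own. The following is only a minimal sketch, not the structure of the script above: the helper name iter_field and the decision to skip malformed lines silently are assumptions made for illustration.

```python
import json
import os
import tempfile


def iter_field(path, field="content"):
    """Yield the value of `field` from each valid JSON line in `path`.

    Hypothetical helper: blank lines, malformed lines, and lines that are
    not JSON objects are skipped without being reported.
    """
    with open(path, "r", encoding="utf-8") as infile:
        for line in infile:  # the file object hands out one line at a time
            line = line.strip()
            if not line:
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                continue
            if isinstance(data, dict):
                # Same clean-up as the script: flatten newlines inside the field
                yield str(data.get(field, "")).replace("\n", " ")


# Small self-check on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.jsonl")
    with open(sample, "w", encoding="utf-8") as f:
        f.write('{"content": "first\\nrecord"}\n')
        f.write("not json\n")
        f.write("\n")
        f.write('{"content": "second record"}\n')
    extracted = list(iter_field(sample))

print(extracted)
```

Iterating over the file object directly avoids choosing a chunk size, at the cost of the per-file size tuning that the full script performs.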

Knowledge extension:

The difference between JSON files and JSONL files

A JSON file stores data in the JSON (JavaScript Object Notation) format, a structured text format that represents data as key-value pairs. A JSON file usually contains a single root value, an object or an array, which can hold nested objects, arrays, and primitive values.

A JSONL (JSON Lines) file is a text file in which each line is an independent, valid JSON value, usually an object. Where a JSON file typically wraps its records in one list of dicts, a JSONL file has no enclosing list: it holds one dict per line, separated only by line breaks, with no commas between records. The advantage is that records can be read one line at a time instead of loading the whole list at once as with JSON, which saves memory. Keeping one record per line also makes the file easier to scan by eye than a large JSON document. Python's built-in json module is enough to read JSONL, as the code above does; the third-party jsonlines package (pip install jsonlines) is an optional convenience.

Example of the contents of a JSON file:

[{"name": "John", "age": 30},
{"name": "Jane", "age": 25},
{"name": "Bob", "age": 40}]

Example of the contents of a JSONL file:

{"name": "John", "age": 30}
{"name": "Jane", "age": 25}
{"name": "Bob", "age": 40}
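
To make the contrast concrete, the two samples above can be parsed with the standard json module. This is a small sketch for illustration; the variable names are arbitrary.

```python
import json

json_text = '''[{"name": "John", "age": 30},
{"name": "Jane", "age": 25},
{"name": "Bob", "age": 40}]'''

jsonl_text = '''{"name": "John", "age": 30}
{"name": "Jane", "age": 25}
{"name": "Bob", "age": 40}'''

# JSON: the whole text is one document and is parsed in a single call
records_from_json = json.loads(json_text)

# JSONL: every non-empty line is its own document and is parsed separately
records_from_jsonl = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Parsing the JSONL text as one JSON document fails ("Extra data")
try:
    json.loads(jsonl_text)
    whole_parse_failed = False
except json.JSONDecodeError:
    whole_parse_failed = True

print(records_from_json == records_from_jsonl)  # True
print(whole_parse_failed)  # True
```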

The main differences are as follows:

JSON file:

Use braces {} to represent objects and square brackets [] to represent arrays.
The entire file is a valid JSON object or array.
Suitable for storing structured data, such as configuration files, API responses, etc.
The entire file is read and parsed at once, after which any part of the data can be accessed directly.

JSONL file:

Each line is an independent, valid JSON object.
No commas or other separators are needed between lines; the line break is the only delimiter.
Suitable for data in which each line is an independent record, such as logs and sensor readings.
The file is read and parsed line by line, so only one record is processed at a time.

JSONL files are suitable for:

When each line of data is an independent record and no separator other than the line break is needed.
When data needs to be processed line by line to save memory and improve processing speed.
When the amount of data is very large and cannot be loaded into memory at one time, the JSONL format provides a way to stream data.
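
The streaming use case can be sketched as follows. This is an illustrative example under stated assumptions, not part of the scripts in this article: the helper names append_record and stream_records and the file name people.jsonl are hypothetical.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def stream_records(path):
    """Yield records one at a time instead of loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "people.jsonl")
    people = [{"name": "John", "age": 30},
              {"name": "Jane", "age": 25},
              {"name": "Bob", "age": 40}]
    for person in people:
        append_record(log_path, person)
    # Filter while streaming; the full list of records is never built
    over_28 = [r["name"] for r in stream_records(log_path) if r["age"] > 28]

print(over_28)
```

Appending works because a new record never requires rewriting earlier lines, whereas adding an element to a JSON array means rewriting the closing bracket at least.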

In short, JSON files are better suited to storing and transmitting structured data as a whole, while JSONL files are better suited to storing and processing data in which each line is an independent record.


2. Merging the contents of multiple text files

That is, read all the text (.txt) files in a directory, merge their lines into one file, and remove duplicate lines.

Implementation code

import os
from chardet import detect  # Install first: pip install chardet


def get_safe_encoding(encoding):
    """Convert a detected encoding to a safer, compatible encoding"""
    encoding = encoding.lower()
    if encoding in ['gb2312', 'gbk']:
        return 'gb18030'  # The most comprehensive Chinese encoding
    return encoding


def get_file_encoding(file_path):
    """Detect the file encoding and upgrade it to a safer version"""
    with open(file_path, 'rb') as f:
        raw_data = f.read(10000)
    result = detect(raw_data)
    # Treat missing or low-confidence results (confidence < 0.8) as untrustworthy
    if not result['encoding'] or result['confidence'] < 0.8:
        return 'gb18030'
    return get_safe_encoding(result['encoding'])


# Note: the default output file name is empty in the source; pass one explicitly
def merge_files(directory, output_filename=''):
    seen_lines = set()
    output_path = os.path.join(directory, output_filename)

    txt_files = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.txt')]

    with open(output_path, 'w', encoding='utf-8', errors='ignore') as outfile:
        for file_path in txt_files:
            try:
                # Get a safe encoding and add error handling
                file_enc = get_file_encoding(file_path)
                with open(file_path, 'r',
                          encoding=file_enc,
                          errors='backslashreplace') as infile:  # Keep undecodable bytes as escape sequences
                    for line_idx, line in enumerate(infile, 1):
                        try:
                            stripped_line = line.rstrip('\n')
                            if stripped_line not in seen_lines:
                                # Always end with a newline, even for a file's last line
                                outfile.write(stripped_line + '\n')
                                seen_lines.add(stripped_line)
                        except Exception as line_err:
                            print(f"File {os.path.basename(file_path)}, line {line_idx}: processing error: {str(line_err)}")
                            continue
            except Exception as file_err:
                print(f"File {os.path.basename(file_path)}: read failed: {str(file_err)}")
                continue


if __name__ == '__main__':
    target_directory = r'D:\daku\Keyword Recognition\stop6931'
    merge_files(target_directory)
    print(f'Merge completed, output file: {os.path.join(target_directory, "")}')
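
The reason get_safe_encoding maps gb2312 and gbk to gb18030 is that GB18030 is designed to be backward compatible with both while covering far more characters, so decoding with it is rarely worse and often better. The check below is a small sketch using only the standard library codecs; the sample strings are arbitrary.

```python
text = "数据清洗"

# Bytes produced by the older encodings decode unchanged with gb18030
legacy_ok = all(
    text.encode(legacy).decode("gb18030") == text
    for legacy in ("gb2312", "gbk")
)

# A character outside the GBK repertoire cannot be encoded with gbk ...
emoji = "😀"
try:
    emoji.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# ... but it round-trips through gb18030
gb18030_ok = emoji.encode("gb18030").decode("gb18030") == emoji

print(legacy_ok, gbk_ok, gb18030_ok)  # True False True
```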
