
A summary and comparison of Python processing solutions for different text lengths

The code and comments are posted directly below. Questions and discussion are welcome; the results are still being verified.

1. Short text processing (<500 tokens)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # Small 384-dimensional model

def process_short(text):
    """Encode the full text directly"""
    return model.encode(text, convert_to_tensor=True)

# Example
short_text = "The basic concept of natural language processing"  # About 15 tokens
vector = process_short(short_text)

2. Medium-length text processing (500-2000 tokens)

from langchain_text_splitters import RecursiveCharacterTextSplitter

def process_medium(text):
    """Overlapping chunking strategy"""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", "。", "!", "?"]
    )
    chunks = splitter.split_text(text)
    return [model.encode(chunk) for chunk in chunks]  # model from section 1

# Example
medium_text = "History of the development of machine learning... (about 1500 words)"  # About 1800 tokens
chunk_vectors = process_medium(medium_text)

3. Long text processing (2000-20000 tokens)

import spacy
from sentence_transformers import SentenceTransformer

def process_long(text):
    """Semantic chunking + summary enhancement"""
    # Load the semantic segmentation model
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)

    # Split at sentence boundaries
    chunks = [sent.text for sent in doc.sents]

    # Generate a summary vector for each chunk
    summary_model = SentenceTransformer('uer/sbert-base-chinese-nli')
    summaries = [summary_model.encode(chunk[:200]) for chunk in chunks]

    return chunks, summaries

# Example
long_text = "Artificial Intelligence Technology White Paper... (about 20,000 words)"  # About 20,000 tokens
text_chunks, summary_vecs = process_long(long_text)

4. Extra-long text processing (20000-200000 tokens)

import faiss
import numpy as np

class HierarchicalIndex:
    def __init__(self):
        # Two-level index structure: flat summary layer, IVF-PQ chunk layer
        self.summary_index = faiss.IndexFlatL2(384)
        self.chunk_index = faiss.IndexIVFPQ(
            faiss.IndexFlatL2(384), 384, 100, 16, 8
        )
        self.chunks = []

    def add_document(self, text):
        # Generate chunks and paragraph-level summaries (process_long from section 3)
        chunks, summaries = process_long(text)

        # Build the index (model is the encoder from section 1)
        summary_vecs = np.array(summaries).astype('float32')
        chunk_vecs = np.array([model.encode(c) for c in chunks]).astype('float32')

        # An IVF-PQ index must be trained before vectors can be added
        if not self.chunk_index.is_trained:
            self.chunk_index.train(chunk_vecs)

        self.summary_index.add(summary_vecs)
        self.chunk_index.add(chunk_vecs)
        self.chunk_index.make_direct_map()  # Needed for reconstruct() below
        self.chunks.extend(chunks)

    def search(self, query, k=5):
        # Coarse screen: search the summary layer first
        query_vec = model.encode(query).astype('float32')
        _, sum_indices = self.summary_index.search(np.array([query_vec]), 10)

        # Fine search: retrieve the chunks related to the summary hits
        target_chunks = [self.chunk_index.reconstruct(int(i)) for i in sum_indices[0]]
        target_chunks = np.array(target_chunks).astype('float32')
        _, chunk_indices = self.chunk_index.search(target_chunks, k)

        return [self.chunks[i] for i in chunk_indices.flatten()]

# Usage example
hindex = HierarchicalIndex()
hindex.add_document("Technical documents in a certain field... (about 150,000 words)")  # About 200,000 tokens
results = hindex.search("The application of deep learning in medical imaging")

5. Massive text processing (>200,000 tokens)

import dask.dataframe as dd
from dask.distributed import Client

def process_extreme(file_path):
    """Distributed processing solution"""
    client = Client(n_workers=4)  # Start a local Dask cluster

    # Chunked reading
    df = dd.read_parquet(file_path, chunksize=100000)

    # Parallel encoding (model is the sentence-transformers encoder from section 1)
    df['vector'] = df['text'].map_partitions(
        lambda s: s.apply(model.encode),
        meta=('vector', object)
    )

    # Persist the encoded data for downstream index building
    df.to_parquet("encoded_data.parquet", engine="pyarrow")
    
# Example (processing 1 million texts)
process_extreme("massive_data.parquet")
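The script above stops at persisting the encoded vectors; they still need an index. Below is a minimal hypothetical sketch of how the encoded parquet output could feed a sharded index. The faiss.IndexShards wrapper and the single-file loop are assumptions for illustration, not part of the original code:

import faiss
import numpy as np
import pandas as pd

d = 384
combined = faiss.IndexShards(d)  # Fans each search out across all shards

# Hypothetical follow-up: build one shard per encoded partition file
for part in ["encoded_data.parquet"]:  # In practice, iterate over every partition
    vecs = np.stack(pd.read_parquet(part)['vector'].to_list()).astype('float32')
    shard = faiss.IndexFlatL2(d)
    shard.add(vecs)
    combined.add_shard(shard)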

Performance optimization comparison table

Text length     Processing strategy                 Index type         Response time   Memory consumption
<500            Direct encoding                     FlatIndex          <10 ms          1 MB
500-2000        Overlapping chunking                IVF+PQ             50-100 ms       50 MB
2000-20000      Semantic chunking + summary index   Two-level index    200-500 ms      300 MB
20000-200000    Hierarchical index                  IVF-OPQ + PQ       1-2 s           2 GB
>200000         Distributed processing              Sharded index      10 s+           Cluster resources
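For the "IVF-OPQ + PQ" row, faiss can compose the OPQ rotation, IVF partitioning, and PQ compression through its index factory. A minimal sketch follows; the factory string is an assumed sensible configuration, not taken from this article's code:

import faiss

# OPQ16: learned rotation to improve PQ quality; IVF100: 100 coarse cells; PQ16: 16-byte codes
index = faiss.index_factory(384, "OPQ16,IVF100,PQ16")
# Like IndexIVFPQ above, this index must be trained on sample vectors before .add()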

Key processing techniques

  • Sliding window: preserves context continuity via chunk_overlap
  • Semantic chunking: uses spaCy for sentence-boundary detection
  • Hierarchical index: the summary layer accelerates the coarse screen, the chunk layer ensures accuracy
  • Quantization compression: the PQ algorithm cuts memory usage at some cost in precision (see the sketch below)
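To make the quantization point concrete, here is a minimal sketch comparing the serialized size of a flat index against a PQ-compressed one; the random vectors are stand-ins for real embeddings:

import faiss
import numpy as np

d = 384
vecs = np.random.rand(10000, d).astype('float32')  # Stand-in for real embeddings

flat = faiss.IndexFlatL2(d)  # Stores raw float32 vectors: 10000 * 384 * 4 bytes ~= 15 MB
flat.add(vecs)

pq = faiss.IndexPQ(d, 16, 8)  # 16 sub-quantizers x 8 bits = 16 bytes per vector
pq.train(vecs)
pq.add(vecs)

print(len(faiss.serialize_index(flat)))  # ~15 MB
print(len(faiss.serialize_index(pq)))    # ~0.5 MB including codebooks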

That concludes this summary and comparison of Python processing solutions for different text lengths. For more on Python text processing, please see my other related articles!