
A summary and comparison of Python processing solutions for different text lengths

The code and comments are posted directly below. Questions and discussion are welcome; the results are still being verified.

1. Short text processing (<500 tokens)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # Small 384-dimensional model

def process_short(text):
    """Encode the full text directly"""
    return model.encode(text, convert_to_tensor=True)

# Example
short_text = "The basic concept of natural language processing"  # About 15 tokens
vector = process_short(short_text)

2. Medium-length text processing (500-2000 tokens)

from langchain_text_splitters import RecursiveCharacterTextSplitter

def process_medium(text):
    """Overlapping chunking strategy"""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", "。", "!", "?"]
    )
    chunks = splitter.split_text(text)
    return [model.encode(chunk) for chunk in chunks]  # model from section 1

# Example
medium_text = "History of the development of machine learning... (about 1500 words)"  # About 1800 tokens
chunk_vectors = process_medium(medium_text)

3. Long text processing (2000-20000 tokens)

import spacy
from sentence_transformers import SentenceTransformer

def process_long(text):
    """Semantic chunking + summary enhancement"""
    # Load the semantic segmentation model
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)

    # Split at sentence boundaries
    chunks = [sent.text for sent in doc.sents]

    # Generate a summary vector for each chunk
    summary_model = SentenceTransformer('uer/sbert-base-chinese-nli')
    summaries = [summary_model.encode(chunk[:200]) for chunk in chunks]

    return chunks, summaries

# Example
long_text = "Artificial Intelligence Technology White Paper... (about 20,000 words)"  # About 20,000 tokens
text_chunks, summary_vecs = process_long(long_text)

4. Extra-long text processing (20000-200000 tokens)

import faiss
import numpy as np

class HierarchicalIndex:
    def __init__(self):
        # Two-level index structure: flat summary layer, IVF-PQ chunk layer
        self.summary_index = faiss.IndexFlatL2(384)
        self.chunk_index = faiss.IndexIVFPQ(
            faiss.IndexFlatL2(384), 384, 100, 16, 8
        )
        self.chunks = []

    def add_document(self, text):
        # Generate chunks and paragraph-level summaries (process_long from section 3)
        chunks, summaries = process_long(text)

        # Build the index (model is the encoder from section 1)
        summary_vecs = np.array(summaries).astype('float32')
        chunk_vecs = np.array([model.encode(c) for c in chunks]).astype('float32')

        # An IVF-PQ index must be trained before vectors can be added
        if not self.chunk_index.is_trained:
            self.chunk_index.train(chunk_vecs)

        self.summary_index.add(summary_vecs)
        self.chunk_index.add(chunk_vecs)
        self.chunk_index.make_direct_map()  # Needed for reconstruct() below
        self.chunks.extend(chunks)

    def search(self, query, k=5):
        # Coarse screen: search the summary layer first
        query_vec = model.encode(query).astype('float32')
        _, sum_indices = self.summary_index.search(np.array([query_vec]), 10)

        # Fine search: retrieve the chunks related to the summary hits
        target_chunks = [self.chunk_index.reconstruct(int(i)) for i in sum_indices[0]]
        target_chunks = np.array(target_chunks).astype('float32')
        _, chunk_indices = self.chunk_index.search(target_chunks, k)

        return [self.chunks[i] for i in chunk_indices.flatten()]

# Usage example
hindex = HierarchicalIndex()
hindex.add_document("Technical documents in a certain field... (about 150,000 words)")  # About 200,000 tokens
results = hindex.search("The application of deep learning in medical imaging")

5. Massive text processing (>200,000 tokens)

import dask.dataframe as dd
from dask.distributed import Client

def process_extreme(file_path):
    """Distributed processing solution"""
    client = Client(n_workers=4)  # Start a local Dask cluster

    # Chunked reading
    df = dd.read_parquet(file_path, chunksize=100000)

    # Parallel encoding (model is the sentence-transformers encoder from section 1)
    df['vector'] = df['text'].map_partitions(
        lambda s: s.apply(model.encode),
        meta=('vector', object)
    )

    # Persist the encoded data for downstream index building
    df.to_parquet("encoded_data.parquet", engine="pyarrow")
    
# Example (processing 1 million texts)
process_extreme("massive_data.parquet")
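The script above stops at persisting the encoded vectors; they still need an index. Below is a minimal hypothetical sketch of how the encoded parquet output could feed a sharded index. The faiss.IndexShards wrapper and the single-file loop are assumptions for illustration, not part of the original code:

import faiss
import numpy as np
import pandas as pd

d = 384
combined = faiss.IndexShards(d)  # Fans each search out across all shards

# Hypothetical follow-up: build one shard per encoded partition file
for part in ["encoded_data.parquet"]:  # In practice, iterate over every partition
    vecs = np.stack(pd.read_parquet(part)['vector'].to_list()).astype('float32')
    shard = faiss.IndexFlatL2(d)
    shard.add(vecs)
    combined.add_shard(shard)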

Performance optimization comparison table

Text length     Processing strategy                 Index type         Response time   Memory consumption
<500            Direct encoding                     FlatIndex          <10 ms          1 MB
500-2000        Overlapping chunking                IVF+PQ             50-100 ms       50 MB
2000-20000      Semantic chunking + summary index   Two-level index    200-500 ms      300 MB
20000-200000    Hierarchical index                  IVF-OPQ + PQ       1-2 s           2 GB
>200000         Distributed processing              Sharded index      10 s+           Cluster resources
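For the "IVF-OPQ + PQ" row, faiss can compose the OPQ rotation, IVF partitioning, and PQ compression through its index factory. A minimal sketch follows; the factory string is an assumed sensible configuration, not taken from this article's code:

import faiss

# OPQ16: learned rotation to improve PQ quality; IVF100: 100 coarse cells; PQ16: 16-byte codes
index = faiss.index_factory(384, "OPQ16,IVF100,PQ16")
# Like IndexIVFPQ above, this index must be trained on sample vectors before .add()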

Key processing techniques

  • Sliding window: preserves context continuity via chunk_overlap
  • Semantic chunking: uses spaCy for sentence-boundary detection
  • Hierarchical index: the summary layer accelerates the coarse screen, the chunk layer ensures accuracy
  • Quantization compression: the PQ algorithm cuts memory usage at some cost in precision (see the sketch below)
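To make the quantization point concrete, here is a minimal sketch comparing the serialized size of a flat index against a PQ-compressed one; the random vectors are stand-ins for real embeddings:

import faiss
import numpy as np

d = 384
vecs = np.random.rand(10000, d).astype('float32')  # Stand-in for real embeddings

flat = faiss.IndexFlatL2(d)  # Stores raw float32 vectors: 10000 * 384 * 4 bytes ~= 15 MB
flat.add(vecs)

pq = faiss.IndexPQ(d, 16, 8)  # 16 sub-quantizers x 8 bits = 16 bytes per vector
pq.train(vecs)
pq.add(vecs)

print(len(faiss.serialize_index(flat)))  # ~15 MB
print(len(faiss.serialize_index(pq)))    # ~0.5 MB including codebooks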

That concludes this summary and comparison of Python processing solutions for different text lengths. For more on Python text processing, please see my other related articles!