Jumping straight into the code with comments. Feel free to reach out if you're interested in discussing; the effectiveness of these approaches is still being verified.
1. Short text processing (<500 tokens)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # small 384-dimensional model

def process_short(text):
    """Encode the full text directly."""
    return model.encode(text, convert_to_tensor=True)

# Example
short_text = "The basic concept of natural language processing"  # roughly 15 tokens
vector = process_short(short_text)
```
2. Medium-length text processing (500-2000 tokens)
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def process_medium(text):
    """Overlapping chunking strategy."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", "。", "!", "?"]
    )
    chunks = splitter.split_text(text)
    return [model.encode(chunk) for chunk in chunks]

# Example
medium_text = "History of the development of machine learning... (about 1,500 words)"  # roughly 1,800 tokens
chunk_vectors = process_medium(medium_text)
```
3. Long text processing (2000-20000 tokens)
```python
import spacy
from sentence_transformers import SentenceTransformer

def process_long(text):
    """Semantic chunking + summary enhancement."""
    # Load the spaCy model for sentence boundary detection
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)

    # Split into sentence-level chunks
    chunks = [sent.text for sent in doc.sents]

    # Generate a summary vector for each chunk (first 200 characters)
    summary_model = SentenceTransformer('uer/sbert-base-chinese-nli')
    summaries = [summary_model.encode(chunk[:200]) for chunk in chunks]

    return chunks, summaries

# Example
long_text = "Artificial Intelligence Technology White Paper... (about 20,000 words)"  # roughly 20,000 tokens
text_chunks, summary_vecs = process_long(long_text)
```
4. Extra-long text processing (20000-200000 tokens)
```python
import faiss
import numpy as np

class HierarchicalIndex:
    def __init__(self):
        # Two-level index structure: summary layer + chunk layer
        self.summary_index = faiss.IndexFlatL2(384)
        self.chunk_index = faiss.IndexIVFPQ(
            faiss.IndexFlatL2(384),  # coarse quantizer
            384,                     # vector dimension
            100,                     # number of inverted lists (nlist)
            16,                      # PQ sub-quantizers (m)
            8                        # bits per sub-quantizer
        )
        self.chunks = []

    def add_document(self, text):
        # Generate chunk-level texts and summary vectors
        chunks, summaries = process_long(text)

        # Build both index layers
        summary_vecs = np.array(summaries).astype('float32')
        chunk_vecs = np.array([model.encode(c) for c in chunks]).astype('float32')

        self.chunk_index.train(chunk_vecs)   # IVF-PQ must be trained before adding
        self.summary_index.add(summary_vecs)
        self.chunk_index.add(chunk_vecs)
        self.chunk_index.make_direct_map()   # enable reconstruct() on the IVF index
        self.chunks.extend(chunks)

    def search(self, query, k=5):
        # Coarse screening in the summary layer first
        query_vec = model.encode(query).astype('float32')
        _, sum_indices = self.summary_index.search(np.array([query_vec]), 10)

        # Refine in the chunk layer around the matched chunks
        target_vecs = np.array(
            [self.chunk_index.reconstruct(int(i)) for i in sum_indices[0]]
        ).astype('float32')
        _, chunk_indices = self.chunk_index.search(target_vecs, k)
        return [self.chunks[int(i)] for i in np.unique(chunk_indices) if i >= 0]

# Usage example
hindex = HierarchicalIndex()
hindex.add_document("Technical documents in a certain field... (about 150,000 words)")  # roughly 200,000 tokens
results = hindex.search("The application of deep learning in medical imaging")
```
5. Massive text processing (>200,000 tokens)
```python
import dask.dataframe as dd
from dask.distributed import Client

def process_extreme(file_path):
    """Distributed processing solution."""
    client = Client(n_workers=4)  # start a local Dask cluster

    # Chunked reading (Dask partitions the file automatically)
    df = dd.read_parquet(file_path)

    # Parallel encoding, partition by partition
    df['vector'] = df['text'].map_partitions(
        lambda s: s.apply(model.encode),
        meta=('vector', object)
    )

    # Persist the encoded data; a distributed index can be built downstream
    df.to_parquet("encoded_data.parquet", engine="pyarrow")

# Example (processing 1 million texts)
process_extreme("massive_data.parquet")
```
Performance comparison table
| Text length (tokens) | Processing strategy | Index type | Response time | Memory consumption |
|---|---|---|---|---|
| <500 | Direct encoding | FlatIndex | <10ms | 1MB |
| 500-2000 | Overlapping chunking | IVF+PQ | 50-100ms | 50MB |
| 2000-20000 | Semantic chunking + summary index | Two-level index | 200-500ms | 300MB |
| 20000-200000 | Hierarchical index | IVF-OPQ + product quantization | 1-2s | 2GB |
| >200000 | Distributed processing | Sharded index | 10s+ | Cluster resources |
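To make the table actionable, here is a minimal sketch of a dispatcher that routes a text to the right strategy by token count. The `count_tokens` parameter is an assumed helper (e.g. a tokenizer's encode-and-count), and the thresholds simply mirror the tiers above; this is not a fixed API.

```python
# Hypothetical dispatcher based on the tiers above; count_tokens is an
# assumed helper (e.g. len(tokenizer.encode(text))), not a library API.
def route_by_length(text, count_tokens):
    n = count_tokens(text)
    if n < 500:
        return process_short(text)       # direct encoding
    elif n < 2000:
        return process_medium(text)      # overlapping chunking
    elif n < 20000:
        return process_long(text)        # semantic chunking + summaries
    elif n < 200000:
        hindex = HierarchicalIndex()     # hierarchical two-level index
        hindex.add_document(text)
        return hindex
    else:
        raise ValueError("Use the distributed pipeline (process_extreme) for >200k tokens")
```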
Key processing techniques
- Sliding window: preserve context continuity via `chunk_overlap`
- Semantic chunking: use spaCy for sentence boundary detection
- Hierarchical index: the summary layer speeds up coarse screening, the chunk layer preserves accuracy
- Quantization compression: the PQ algorithm reduces memory usage (at the cost of some precision loss); see the sketch after this list
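As a rough illustration of the quantization point, here is a minimal sketch comparing a flat faiss index with a PQ-compressed one. The dataset is synthetic and the parameters (10,000 vectors, 16 sub-quantizers) are assumptions chosen for demonstration only.

```python
import faiss
import numpy as np

# Minimal sketch: flat index vs. PQ-compressed index (assumed toy parameters).
d = 384                                          # dimension, matching the models above
xb = np.random.rand(10000, d).astype('float32')  # synthetic corpus

flat = faiss.IndexFlatL2(d)   # stores full float32 vectors: ~1536 bytes/vector
flat.add(xb)

pq = faiss.IndexPQ(d, 16, 8)  # 16 sub-quantizers, 8 bits each: ~16 bytes/vector
pq.train(xb)                  # PQ must be trained before adding
pq.add(xb)

# Same query against both; PQ trades a little precision for ~100x less memory
query = np.random.rand(1, d).astype('float32')
_, flat_ids = flat.search(query, 5)
_, pq_ids = pq.search(query, 5)
print(flat_ids, pq_ids)
```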
That concludes this summary and comparison of Python processing solutions for different text lengths. For more on Python text processing, see my other related articles!