Inside a search engine's "black box" there is a piece of magic that lets the right information surface in an instant: the inverted index. This seemingly dry technical concept works like the classification cards in a library, allowing every book to be located quickly. This article uses Python as the key to open the door to the world of inverted indexes.
1. The past and present of the inverted index
Imagine you run a library with millions of books. A traditional index is arranged by shelf number: to find "Python Programming" you would have to walk from Area A to Area Z. An inverted index is like a magic card cabinet in which each drawer is labeled with a tag such as "Programming", "Python", or "Algorithm", and opening a drawer immediately shows you the location of every related book.
Technology evolution:
- 1950s: the concept of the inverted index takes shape in early information-retrieval research
- 1990s: the Lucene project brings it into the world of open-source search
- 2010s: Distributed search engines such as Elasticsearch/Solr push it into the era of big data
Core advantages:
- Query speed improves by 100-1000x compared with a sequential scan
- Supports complex Boolean queries (AND/OR/NOT)
- Fits naturally with TF-IDF and other ranking algorithms
2. The three stages of a Python implementation
We will evolve an industrial-grade inverted index step by step through three versions.
Basic version: a dictionary of lists
```python
def build_index(docs):
    index = {}
    for doc_id, content in enumerate(docs):
        words = content.split()
        for word in words:
            if word not in index:
                index[word] = []
            index[word].append(doc_id)
    return index

# Usage example
docs = ["hello world", "python is fun", "hello python"]
print(build_index(docs))
# Output: {'hello': [0, 2], 'world': [0], 'python': [1, 2], ...}
```
Problems: repeated words are not deduplicated, and query cost grows linearly with the amount of data.
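A quick illustrative check makes the duplication problem visible (the sample documents here are only for demonstration):

```python
# Illustrative only: the basic index records a doc ID once per occurrence,
# so a word repeated within one document produces duplicate entries.
docs = ["python loves python", "hello world"]
print(build_index(docs))
# {'python': [0, 0], 'loves': [0], 'hello': [1], 'world': [1]}
```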
Improved version: deduplication, sorted lists, and binary search
```python
from collections import defaultdict

def build_optimized_index(docs):
    index = defaultdict(list)
    for doc_id, content in enumerate(docs):
        seen = set()
        words = content.split()
        for word in words:
            if word not in seen:
                seen.add(word)
                index[word].append(doc_id)
    # Sort each posting list
    for word in index:
        index[word].sort()
    return index

# Query helper
def search(index, word):
    if word in index:
        return index[word]
    return []

# Binary search over a sorted posting list
def binary_search(arr, target):
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

# Example: find the documents containing "python"
docs = ["hello world", "python is fun", "hello python", "python tutorial"]
index = build_optimized_index(docs)
print(search(index, "python"))  # Output: [1, 2, 3]
```
Key improvements:
- A set is used to deduplicate words within each document, reducing storage
- Posting lists are kept sorted, enabling binary search (O(log n) lookups; a bisect-based sketch follows this list)
- Query efficiency improves by roughly 5-10x
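As a minimal sketch of the binary-search idea, the standard-library bisect module can test membership in a sorted posting list in O(log n); the `contains_doc` helper below is illustrative and continues the example above:

```python
import bisect

def contains_doc(postings, doc_id):
    """Return True if doc_id appears in the sorted posting list."""
    pos = bisect.bisect_left(postings, doc_id)
    return pos < len(postings) and postings[pos] == doc_id

postings = search(index, "python")  # [1, 2, 3] from the example above
print(contains_doc(postings, 2))    # True
print(contains_doc(postings, 0))    # False
```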
Final version: compact storage + Boolean queries
```python
import bisect
from typing import List, Dict

class InvertedIndex:
    def __init__(self):
        self.index = {}       # Dict[str, List[int]]: term -> sorted posting list
        self.doc_counts = {}  # Dict[str, int]: term -> document frequency

    def add_document(self, doc_id: int, content: str):
        words = content.split()
        seen = set()
        for word in words:
            if word not in seen:
                seen.add(word)
                if word not in self.index:
                    self.index[word] = []
                    self.doc_counts[word] = 0
                # Insert with bisect to keep the posting list sorted
                bisect.insort(self.index[word], doc_id)
                self.doc_counts[word] += 1

    def search(self, query: str) -> List[int]:
        if " AND " in query:
            terms = query.split(" AND ")
            results = self._search_single(terms[0])
            for term in terms[1:]:
                results = self._intersect(results, self._search_single(term))
            return results
        elif " OR " in query:
            terms = query.split(" OR ")
            results = []
            for term in terms:
                results = self._union(results, self._search_single(term))
            return results
        else:
            return self._search_single(query)

    def _search_single(self, term: str) -> List[int]:
        if term in self.index:
            return self.index[term]
        return []

    def _intersect(self, a: List[int], b: List[int]) -> List[int]:
        # Two-pointer intersection of sorted lists
        i = j = 0
        result = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result

    def _union(self, a: List[int], b: List[int]) -> List[int]:
        # Merge-style union of sorted lists
        result = []
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                result.append(a[i])
                i += 1
            else:
                result.append(b[j])
                j += 1
        result += a[i:]
        result += b[j:]
        return sorted(set(result))  # deduplicate and re-sort

# Usage example
index = InvertedIndex()
docs = [
    "Python is great for data science",
    "Java is popular for enterprise applications",
    "JavaScript rules the web development",
    "Python and JavaScript are both scripting languages",
]
for doc_id, doc in enumerate(docs):
    index.add_document(doc_id, doc)

print(index.search("Python AND scripting"))  # Output: [3]
print(index.search("Python OR Java"))        # Output: [0, 1, 3]
```
Industrial-grade optimizations:
- Compressed storage: delta (gap) encoding of posting lists, e.g. storing [1, 3, 5] as the gaps [1, 2, 2] (see the sketch after this list)
- Bloom filter: quickly check whether a term exists at all
- Distributed storage: shard posting lists, for example by the first letter of the term
- Caching: an LRU cache for popular query results
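Here is a minimal sketch of two of these ideas: delta (gap) encoding of posting lists, and an LRU cache for query results. The helper names are illustrative, and the cached query assumes the `index` instance built in the example above:

```python
from functools import lru_cache

def delta_encode(postings):
    """Store the gaps between successive doc IDs instead of the IDs themselves."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def delta_decode(gaps):
    """Rebuild absolute doc IDs by accumulating the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

print(delta_encode([1, 3, 5]))  # [1, 2, 2]
print(delta_decode([1, 2, 2]))  # [1, 3, 5]

# Cache popular query results with an LRU policy.
@lru_cache(maxsize=1024)
def cached_search(query: str):
    return tuple(index.search(query))  # tuples are hashable, hence cacheable

print(cached_search("Python OR Java"))  # (0, 1, 3)
```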
3. Performance comparison and selection suggestions
| Implementation | Build time | Query time | Memory usage | Suitable scenarios |
|---|---|---|---|---|
| Basic: dictionary of lists | O(n) | O(n) | High | Small datasets / teaching demos |
| Improved: sorted lists + binary search | O(n log n) | O(log n) | Medium | Medium scale / simple queries |
| Final: compact storage + Boolean queries | O(n log n) | O(k log n) | Low | Production / complex queries |
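The complexities in the table are rough orders of magnitude; actual timings depend on the data. As a small, illustrative benchmark (reusing `build_optimized_index` and `search` from the improved version, on a synthetic corpus), compare a sequential scan with an index lookup:

```python
import time

corpus = [f"word{i} common filler" for i in range(100_000)]
idx = build_optimized_index(corpus)

# Sequential scan: look at every document.
start = time.perf_counter()
scan_hits = [i for i, doc in enumerate(corpus) if "word99999" in doc.split()]
scan_time = time.perf_counter() - start

# Index lookup: a single dictionary access.
start = time.perf_counter()
index_hits = search(idx, "word99999")
index_time = time.perf_counter() - start

print(scan_hits == index_hits)                       # True
print(f"scan: {scan_time:.4f}s, index: {index_time:.6f}s")
```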
Selection suggestions:
- Personal projects: the basic version is sufficient
- Small and medium-sized applications: the improved version plus a Bloom filter
- Big-data scenarios: the final version plus distributed storage (e.g. a Redis cluster)
4. Practical application: build a mini search engine
```python
from typing import List  # reuses the InvertedIndex class defined above

class SimpleSearchEngine:
    def __init__(self):
        self.index = InvertedIndex()
        self.documents = []

    def add_document(self, content: str):
        doc_id = len(self.documents)
        self.documents.append(content)
        self.index.add_document(doc_id, content)

    def search(self, query: str) -> List[str]:
        doc_ids = self.index.search(query)
        return [self.documents[doc_id] for doc_id in doc_ids]

# Usage example
engine = SimpleSearchEngine()
engine.add_document("Python is a versatile language")
engine.add_document("JavaScript dominates web development")
engine.add_document("Python and machine learning go hand in hand")
print(engine.search("Python AND machine"))
# Output: ['Python and machine learning go hand in hand']
```
Expansion directions:
- Add TF-IDF ranking (a sketch follows this list)
- Support phrase queries (treating "machine learning" as a unit)
- Integrate Chinese word segmentation (e.g. with the jieba library)
- Implement pagination of results
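As a starting point for the first item, here is a minimal TF-IDF ranking sketch layered on top of SimpleSearchEngine. `tfidf_rank` is an illustrative helper, not part of the class above, and it reuses the `engine` instance and attribute names from the previous example:

```python
import math

def tfidf_rank(engine, term):
    """Rank the documents matching `term` by a simple TF-IDF score."""
    matching = engine.index.search(term)
    if not matching:
        return []
    n_docs = len(engine.documents)
    idf = math.log((n_docs + 1) / (len(matching) + 1)) + 1  # smoothed IDF
    scored = []
    for doc_id in matching:
        words = engine.documents[doc_id].split()
        tf = words.count(term) / len(words)  # term frequency in this document
        scored.append((tf * idf, engine.documents[doc_id]))
    scored.sort(reverse=True)
    return [doc for _, doc in scored]

# The shorter document ranks first because "Python" makes up a larger share of it.
print(tfidf_rank(engine, "Python"))
```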
5. Philosophical reflections on the inverted index
The essence of the inverted index is the classic trade of space for time. By precomputing and storing the mapping from terms to documents, it turns an O(n) scan over all documents into an O(1) or O(log n) lookup. The same idea appears everywhere in computing:
- Database indexes (B+ trees)
- Caching (Redis)
- CDN content distribution
- Merkle trees in blockchains
Mastering how an inverted index is implemented not only helps you understand search engines, but also builds an instinct for the general "precompute, store, query fast" design pattern.
Conclusion
From a simple dictionary implementation to an industrial-grade solution supporting complex queries, we have seen Python's flexibility and power for building inverted indexes. The next time you type keywords into a search box, picture the inverted indexes working silently behind it: countless classification card cabinets navigating the ocean of data with precision. Python is one of the best tools for building these magic cabinets.