Inside a search engine's "black box" there is a piece of magic that lets the right information surface in an instant: the inverted index. This seemingly dry technical concept works like the classification cards in a library, allowing every book to be located quickly. This article uses Python as the key to open the door to the world of inverted indexes.
1. The past and present of the inverted index
Imagine you run a library with millions of books. A traditional index is arranged by shelf number: to find "Python Programming" you would have to walk from Area A to Area Z. An inverted index is like a magic card cabinet in which each drawer is labeled with a tag such as "Programming", "Python", or "Algorithm", and opening a drawer immediately shows you the location of every related book.
Technology evolution:
- 1950s: the concept of the inverted index takes shape in early information-retrieval research
- 1990s: the Lucene project brings it into the world of open-source search
- 2010s: Distributed search engines such as Elasticsearch/Solr push it into the era of big data
Core advantages:
- Query speed improves by 100-1000x compared with a sequential scan
- Supports complex Boolean queries (AND/OR/NOT)
- Fits naturally with TF-IDF and other ranking algorithms
2. The three stages of a Python implementation
We will evolve an industrial-grade inverted index step by step through three versions.
Basic version: a dictionary of lists
```python
def build_index(docs):
    index = {}
    for doc_id, content in enumerate(docs):
        words = content.split()
        for word in words:
            if word not in index:
                index[word] = []
            index[word].append(doc_id)
    return index

# Usage example
docs = ["hello world", "python is fun", "hello python"]
print(build_index(docs))
# Output: {'hello': [0, 2], 'world': [0], 'python': [1, 2], ...}
```
Problems: repeated words are not deduplicated, and query cost grows linearly with the amount of data.
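A quick illustrative check makes the duplication problem visible (the sample documents here are only for demonstration):

```python
# Illustrative only: the basic index records a doc ID once per occurrence,
# so a word repeated within one document produces duplicate entries.
docs = ["python loves python", "hello world"]
print(build_index(docs))
# {'python': [0, 0], 'loves': [0], 'hello': [1], 'world': [1]}
```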
Improved version: deduplication, sorted lists, and binary search
```python
from collections import defaultdict

def build_optimized_index(docs):
    index = defaultdict(list)
    for doc_id, content in enumerate(docs):
        seen = set()
        words = content.split()
        for word in words:
            if word not in seen:
                seen.add(word)
                index[word].append(doc_id)
    # Sort each posting list
    for word in index:
        index[word].sort()
    return index

# Query helper
def search(index, word):
    if word in index:
        return index[word]
    return []

# Binary search over a sorted posting list
def binary_search(arr, target):
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

# Example: find the documents containing "python"
docs = ["hello world", "python is fun", "hello python", "python tutorial"]
index = build_optimized_index(docs)
print(search(index, "python"))  # Output: [1, 2, 3]
```
Key improvements:
- A set is used to deduplicate words within each document, reducing storage
- Posting lists are kept sorted, enabling binary search (O(log n) lookups; a bisect-based sketch follows this list)
- Query efficiency improves by roughly 5-10x
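As a minimal sketch of the binary-search idea, the standard-library bisect module can test membership in a sorted posting list in O(log n); the `contains_doc` helper below is illustrative and continues the example above:

```python
import bisect

def contains_doc(postings, doc_id):
    """Return True if doc_id appears in the sorted posting list."""
    pos = bisect.bisect_left(postings, doc_id)
    return pos < len(postings) and postings[pos] == doc_id

postings = search(index, "python")  # [1, 2, 3] from the example above
print(contains_doc(postings, 2))    # True
print(contains_doc(postings, 0))    # False
```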
Final version: compact storage + Boolean queries
```python
import bisect
from typing import List, Dict

class InvertedIndex:
    def __init__(self):
        self.index = {}       # Dict[str, List[int]]: term -> sorted posting list
        self.doc_counts = {}  # Dict[str, int]: term -> document frequency

    def add_document(self, doc_id: int, content: str):
        words = content.split()
        seen = set()
        for word in words:
            if word not in seen:
                seen.add(word)
                if word not in self.index:
                    self.index[word] = []
                    self.doc_counts[word] = 0
                # Insert with bisect to keep the posting list sorted
                bisect.insort(self.index[word], doc_id)
                self.doc_counts[word] += 1

    def search(self, query: str) -> List[int]:
        if " AND " in query:
            terms = query.split(" AND ")
            results = self._search_single(terms[0])
            for term in terms[1:]:
                results = self._intersect(results, self._search_single(term))
            return results
        elif " OR " in query:
            terms = query.split(" OR ")
            results = []
            for term in terms:
                results = self._union(results, self._search_single(term))
            return results
        else:
            return self._search_single(query)

    def _search_single(self, term: str) -> List[int]:
        if term in self.index:
            return self.index[term]
        return []

    def _intersect(self, a: List[int], b: List[int]) -> List[int]:
        # Two-pointer intersection of sorted lists
        i = j = 0
        result = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result

    def _union(self, a: List[int], b: List[int]) -> List[int]:
        # Merge-style union of sorted lists
        result = []
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                result.append(a[i])
                i += 1
            else:
                result.append(b[j])
                j += 1
        result += a[i:]
        result += b[j:]
        return sorted(set(result))  # deduplicate and re-sort

# Usage example
index = InvertedIndex()
docs = [
    "Python is great for data science",
    "Java is popular for enterprise applications",
    "JavaScript rules the web development",
    "Python and JavaScript are both scripting languages",
]
for doc_id, doc in enumerate(docs):
    index.add_document(doc_id, doc)

print(index.search("Python AND scripting"))  # Output: [3]
print(index.search("Python OR Java"))        # Output: [0, 1, 3]
```
Industrial-grade optimizations:
- Compressed storage: delta (gap) encoding of posting lists, e.g. storing [1, 3, 5] as the gaps [1, 2, 2] (see the sketch after this list)
- Bloom filter: quickly check whether a term exists at all
- Distributed storage: shard posting lists, for example by the first letter of the term
- Caching: an LRU cache for popular query results
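Here is a minimal sketch of two of these ideas: delta (gap) encoding of posting lists, and an LRU cache for query results. The helper names are illustrative, and the cached query assumes the `index` instance built in the example above:

```python
from functools import lru_cache

def delta_encode(postings):
    """Store the gaps between successive doc IDs instead of the IDs themselves."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def delta_decode(gaps):
    """Rebuild absolute doc IDs by accumulating the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

print(delta_encode([1, 3, 5]))  # [1, 2, 2]
print(delta_decode([1, 2, 2]))  # [1, 3, 5]

# Cache popular query results with an LRU policy.
@lru_cache(maxsize=1024)
def cached_search(query: str):
    return tuple(index.search(query))  # tuples are hashable, hence cacheable

print(cached_search("Python OR Java"))  # (0, 1, 3)
```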
3. Performance comparison and selection suggestions
| Implementation | Build time | Query time | Memory usage | Suitable scenarios |
|---|---|---|---|---|
| Basic: dictionary of lists | O(n) | O(n) | High | Small datasets / teaching demos |
| Improved: sorted lists + binary search | O(n log n) | O(log n) | Medium | Medium scale / simple queries |
| Final: compact storage + Boolean queries | O(n log n) | O(k log n) | Low | Production / complex queries |
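The complexities in the table are rough orders of magnitude; actual timings depend on the data. As a small, illustrative benchmark (reusing `build_optimized_index` and `search` from the improved version, on a synthetic corpus), compare a sequential scan with an index lookup:

```python
import time

corpus = [f"word{i} common filler" for i in range(100_000)]
idx = build_optimized_index(corpus)

# Sequential scan: look at every document.
start = time.perf_counter()
scan_hits = [i for i, doc in enumerate(corpus) if "word99999" in doc.split()]
scan_time = time.perf_counter() - start

# Index lookup: a single dictionary access.
start = time.perf_counter()
index_hits = search(idx, "word99999")
index_time = time.perf_counter() - start

print(scan_hits == index_hits)                       # True
print(f"scan: {scan_time:.4f}s, index: {index_time:.6f}s")
```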
Selection suggestions:
- Personal projects: the basic version is sufficient
- Small and medium-sized applications: the improved version plus a Bloom filter
- Big-data scenarios: the final version plus distributed storage (e.g. a Redis cluster)
4. Practical application: build a mini search engine
```python
from typing import List  # reuses the InvertedIndex class defined above

class SimpleSearchEngine:
    def __init__(self):
        self.index = InvertedIndex()
        self.documents = []

    def add_document(self, content: str):
        doc_id = len(self.documents)
        self.documents.append(content)
        self.index.add_document(doc_id, content)

    def search(self, query: str) -> List[str]:
        doc_ids = self.index.search(query)
        return [self.documents[doc_id] for doc_id in doc_ids]

# Usage example
engine = SimpleSearchEngine()
engine.add_document("Python is a versatile language")
engine.add_document("JavaScript dominates web development")
engine.add_document("Python and machine learning go hand in hand")
print(engine.search("Python AND machine"))
# Output: ['Python and machine learning go hand in hand']
```
Expansion directions:
- Add TF-IDF ranking (a sketch follows this list)
- Support phrase queries (treating "machine learning" as a unit)
- Integrate Chinese word segmentation (e.g. with the jieba library)
- Implement pagination of results
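As a starting point for the first item, here is a minimal TF-IDF ranking sketch layered on top of SimpleSearchEngine. `tfidf_rank` is an illustrative helper, not part of the class above, and it reuses the `engine` instance and attribute names from the previous example:

```python
import math

def tfidf_rank(engine, term):
    """Rank the documents matching `term` by a simple TF-IDF score."""
    matching = engine.index.search(term)
    if not matching:
        return []
    n_docs = len(engine.documents)
    idf = math.log((n_docs + 1) / (len(matching) + 1)) + 1  # smoothed IDF
    scored = []
    for doc_id in matching:
        words = engine.documents[doc_id].split()
        tf = words.count(term) / len(words)  # term frequency in this document
        scored.append((tf * idf, engine.documents[doc_id]))
    scored.sort(reverse=True)
    return [doc for _, doc in scored]

# The shorter document ranks first because "Python" makes up a larger share of it.
print(tfidf_rank(engine, "Python"))
```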
5. Philosophical reflections on the inverted index
The essence of the inverted index is the classic trade of space for time. By precomputing and storing the mapping from terms to documents, it turns an O(n) scan over all documents into an O(1) or O(log n) lookup. The same idea appears everywhere in computing:
- Database indexes (B+ trees)
- Caching (Redis)
- CDN content distribution
- Merkle trees in blockchains
Mastering how an inverted index is implemented not only helps you understand search engines, but also builds an instinct for the general "precompute, store, query fast" design pattern.
Conclusion
From a simple dictionary implementation to an industrial-grade solution supporting complex queries, we have seen Python's flexibility and power for building inverted indexes. The next time you type keywords into a search box, picture the inverted indexes working silently behind it: countless classification card cabinets navigating the ocean of data with precision. Python is one of the best tools for building these magic cabinets.