SoFunction
Updated on 2025-03-04

Implementation of Python3 multi-pattern matching problem

In Python 3,Aho-Corasickis an efficient multi-modal string matching algorithm that can find multiple pattern strings in text at one time. This algorithm searches with a linear time complexity, which is very suitable for dealing with multi-keyword matching problems, such as sensitive word detection, text filtering, URL parsing in web crawlers, etc.

Third-party libraries can be used in Pythonahocorasickto implement this algorithm.

Main features

  • Multi-pattern matching: Find multiple patterns in text at once.
  • Quickly build a dictionary: The constructed automaton can store pattern strings.
  • Linear time matching: The time complexity of the search is (O(n + m)), where (n) is the text length and (m) is the sum of all pattern string lengths.
  • Typical Applications
    • Sensitive word detection
    • Pattern recognition of log or streaming data
    • Text filtering or replacement

Install ahocorasick library

pip install pyahocorasick

Example of usage

The following example shows how to useahocorasickPerform multi-pattern matching:

1. Build an Aho-Corasick Automation

import ahocorasick

# Initialize the automatonautomaton = ()

# Add pattern stringpatterns = ["he", "she", "his", "hers"]
for idx, pattern in enumerate(patterns):
    automaton.add_word(pattern, (idx, pattern))

# Build an automatonautomaton.make_automaton()

2. Match pattern string

text = "ushers"

# Find patterns in textfor end_index, (idx, pattern) in (text):
    start_index = end_index - len(pattern) + 1
    print(f"Found pattern '{pattern}' from index {start_index} to {end_index}")

Output:

Found pattern 'she' from index 1 to 3
Found pattern 'he' from index 2 to 3
Found pattern 'hers' from index 2 to 5

3. Check whether a string exists

if "his" in automaton:
    print("Pattern 'his' exists in the automaton")

4. Match sensitive words (actual use cases)

sensitive_words = ["bad", "ugly", "harm"]
automaton = ()

for word in sensitive_words:
    automaton.add_word(word, word)

automaton.make_automaton()

# Detect sensitive wordstext = "This is a bad example of an ugly behavior."
matches = []
for _, word in (text):
    (word)

print("Sensitive words found:", matches)

Output:

Sensitive words found: ['bad', 'ugly']

Ahocorasick’s core method

  • add_word(word, value): Add a pattern string to the automaton.
  • make_automaton(): To build an automaton, it must be called after adding all the patterns.
  • iter(text): Find the pattern in the given text, returning the matching end position and the corresponding value of the pattern.
  • get(item): Get the value of a certain pattern.
  • __contains__(word): Check whether a pattern exists in the automaton.

Summarize

ahocorasickIt is a tool that efficiently solves multi-pattern matching problem, especially suitable for scenarios where large-scale texts need to be quickly matched and searched. If you need to deal with similar problems, Aho-Corasick is one of the algorithms that is well worth learning and applying.

This is the end of this article about the implementation of Python 3 multi-pattern matching problem. For more related Python 3 multi-pattern matching content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!