In Python 3,Aho-Corasick
is an efficient multi-modal string matching algorithm that can find multiple pattern strings in text at one time. This algorithm searches with a linear time complexity, which is very suitable for dealing with multi-keyword matching problems, such as sensitive word detection, text filtering, URL parsing in web crawlers, etc.
Third-party libraries can be used in Pythonahocorasick
to implement this algorithm.
Main features
- Multi-pattern matching: Find multiple patterns in text at once.
- Quickly build a dictionary: The constructed automaton can store pattern strings.
- Linear time matching: The time complexity of the search is (O(n + m)), where (n) is the text length and (m) is the sum of all pattern string lengths.
-
Typical Applications:
- Sensitive word detection
- Pattern recognition of log or streaming data
- Text filtering or replacement
Install ahocorasick library
pip install pyahocorasick
Example of usage
The following example shows how to useahocorasick
Perform multi-pattern matching:
1. Build an Aho-Corasick Automation
import ahocorasick # Initialize the automatonautomaton = () # Add pattern stringpatterns = ["he", "she", "his", "hers"] for idx, pattern in enumerate(patterns): automaton.add_word(pattern, (idx, pattern)) # Build an automatonautomaton.make_automaton()
2. Match pattern string
text = "ushers" # Find patterns in textfor end_index, (idx, pattern) in (text): start_index = end_index - len(pattern) + 1 print(f"Found pattern '{pattern}' from index {start_index} to {end_index}")
Output:
Found pattern 'she' from index 1 to 3
Found pattern 'he' from index 2 to 3
Found pattern 'hers' from index 2 to 5
3. Check whether a string exists
if "his" in automaton: print("Pattern 'his' exists in the automaton")
4. Match sensitive words (actual use cases)
sensitive_words = ["bad", "ugly", "harm"] automaton = () for word in sensitive_words: automaton.add_word(word, word) automaton.make_automaton() # Detect sensitive wordstext = "This is a bad example of an ugly behavior." matches = [] for _, word in (text): (word) print("Sensitive words found:", matches)
Output:
Sensitive words found: ['bad', 'ugly']
Ahocorasick’s core method
-
add_word(word, value)
: Add a pattern string to the automaton. -
make_automaton()
: To build an automaton, it must be called after adding all the patterns. -
iter(text)
: Find the pattern in the given text, returning the matching end position and the corresponding value of the pattern. -
get(item)
: Get the value of a certain pattern. -
__contains__(word)
: Check whether a pattern exists in the automaton.
Summarize
ahocorasick
It is a tool that efficiently solves multi-pattern matching problem, especially suitable for scenarios where large-scale texts need to be quickly matched and searched. If you need to deal with similar problems, Aho-Corasick is one of the algorithms that is well worth learning and applying.
This is the end of this article about the implementation of Python 3 multi-pattern matching problem. For more related Python 3 multi-pattern matching content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!