Introduction
This article uses Python to implement a simple homophone-based correction for Chinese words (the units produced by word segmentation). In the current version, only one character in a word may be wrong. If you are interested, you can keep optimizing it. The steps are as follows:
- First prepare a file with one Chinese word per line. In the code below the file is /Users/wys/Desktop/; change it to your own path before running the code.
- Build a prefix tree (trie) class with an insert function that inserts all the standard words into the tree, and a search function that looks up a word.
- For each character in the misspelled input word, find up to 10 homophones and substitute each one in turn. This yields at most n*10 candidate words, where n is the length of the word. The count can be lower because some pronunciations have fewer than 10 homophones.
- Look up each candidate in the prefix tree. If a candidate is found, return it as the corrected word.
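The steps above can be sketched end to end without any external library. This is only an illustrative sketch: the `HOMOPHONES` table is hand-made and hypothetical (the article's code gets its homophones from the pinyin and Pinyin2Hanzi packages), and the helper names `build_trie`, `in_trie`, `candidates`, and `correct` are assumptions chosen for this example.

```python
# Hypothetical hand-made homophone table, for illustration only.
HOMOPHONES = {
    "转": ["转", "专", "砖"],
    "专": ["专", "转", "砖"],
    "塘": ["塘", "堂", "糖"],
    "街": ["街", "接", "姐"],
    "姐": ["姐", "街", "接"],
    "道": ["道", "到", "倒"],
    "到": ["到", "道", "倒"],
}


def build_trie(words):
    """Build a nested-dict trie; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def in_trie(root, word):
    """Return True only if `word` was inserted as a complete word."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


def candidates(word, limit=10):
    """Swap one character at a time for up to `limit` homophones."""
    result = set()
    for i, ch in enumerate(word):
        for h in HOMOPHONES.get(ch, [ch])[:limit]:
            result.add(word[:i] + h + word[i + 1:])
    return result


def correct(word, root):
    """Return the first candidate found in the trie, or None."""
    for cand in sorted(candidates(word)):
        if in_trie(root, cand):
            return cand
    return None


trie = build_trie(["转塘街道"])
print(correct("专塘街道", trie))
print(correct("转塘姐道", trie))
# The candidate count never exceeds n * limit
print(len(candidates("转塘姐道")) <= 4 * 10)
```

Because each candidate differs from the input in at most one position, a word with two wrong characters is not corrected, which matches the one-character limit stated above.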
Code
import re

import pinyin
from Pinyin2Hanzi import DefaultDagParams, dag


class corrector():
    def __init__(self):
        # Matches a single Chinese character
        self.re_compile = re.compile(r'[\u4e00-\u9fff]')
        self.dag_params = DefaultDagParams()

    # Read the words from the file
    def getData(self):
        words = []
        # Path as given in the article; point it at your own word file
        with open("/Users/wys/Desktop/", encoding="utf-8") as f:
            for line in f:
                word = line.strip().split(" ")[0]
                if word and len(word) > 2:
                    res = self.re_compile.findall(word)
                    # Keep only words made up entirely of Chinese characters
                    if len(res) == len(word):
                        words.append(word)
        return words

    # Convert a pinyin list into up to 10 candidate homophone characters
    def pinyin_2_hanzi(self, pinyinList):
        result = []
        words = dag(self.dag_params, pinyinList, path_num=10)
        for item in words:
            res = item.path  # conversion result
            result.append(res[0])
        return result

    # Build the set of candidate words for a phrase
    def getCandidates(self, phrase):
        chars = {}
        for c in phrase:
            chars[c] = self.pinyin_2_hanzi(
                pinyin.get(c, format='strip', delimiter=',').split(','))
        replaces = []
        for c in phrase:
            for x in chars[c]:
                replaces.append(phrase.replace(c, x))
        return set(replaces)

    # Return the first candidate found in the prefix tree for each input word
    def getCorrection(self, words, tree):
        result = []
        for word in words:
            for candidate in self.getCandidates(word):
                if tree.search(candidate):
                    result.append(candidate)
                    break
        return result


class Node:
    def __init__(self):
        self.is_word = False  # True if a word ends at this node
        self.children = {}    # character -> child Node


class Trie(object):
    def __init__(self):
        self.root = Node()

    def insert(self, words):
        for word in words:
            cur = self.root
            for w in word:
                if w not in cur.children:
                    cur.children[w] = Node()
                cur = cur.children[w]
            cur.is_word = True

    def search(self, word):
        cur = self.root
        for w in word:
            if w not in cur.children:
                return False
            cur = cur.children[w]
        return cur.is_word


if __name__ == '__main__':
    # Initialize the corrector
    c = corrector()
    # Get the words
    words = c.getData()
    # Initialize the prefix tree
    Tree = Trie()
    # Insert all the words into the prefix tree
    Tree.insert(words)
    # Test (the inputs must be Chinese-character strings; they appear here in translated form)
    print(c.getCorrection(['Zhaotang Street','Zhuantang Sister','Chuantang Street'], Tree))
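One detail of the candidate step is worth a closer look. If characters are substituted with `str.replace`, every occurrence of a character changes at once, so a word containing a repeated character never gets a candidate in which only one copy is swapped. Below is a minimal sketch of a position-based alternative. The helper names `replace_at` and `single_position_candidates` and the small `table` are hypothetical and exist only for illustration.

```python
def replace_all(phrase, old, new):
    """Substitute every occurrence of `old` (what str.replace does)."""
    return phrase.replace(old, new)


def replace_at(phrase, index, new):
    """Substitute only the character at position `index`."""
    return phrase[:index] + new + phrase[index + 1:]


def single_position_candidates(phrase, homophones):
    """Return one candidate per (position, homophone) pair.

    `homophones` maps a character to a list of same-sounding
    characters; here it is a hypothetical hand-made table.
    """
    result = set()
    for i, ch in enumerate(phrase):
        for h in homophones.get(ch, []):
            result.add(replace_at(phrase, i, h))
    return result


table = {"天": ["添", "田"]}
print(replace_all("天天向上", "天", "添"))  # both copies change together
print(sorted(single_position_candidates("天天向上", table)))
```

With position-based substitution the candidate count is still bounded by n*10, but each candidate differs from the input in exactly one position, which fits the one-wrong-character assumption more directly.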
Result
The printed output is:
['Zhuantang Street', 'Zhuantang Street', 'Zhuantang Street']
All three inputs were corrected successfully, so the approach works to some extent. Further optimization is left for later.