Updated on 2025-03-01

A simple implementation of Chinese error correction in Python

Introduction

This article uses Python to implement simple homophone-based error correction for Chinese words (the tokens produced by word segmentation). In the current version only one character per word may be wrong. If you are interested, you can optimize it further. The steps are as follows:

  • First prepare a file with one Chinese word per line. In the code below the path is /Users/wys/Desktop/; replace it with the path to your own file before running the code.
  • Build a prefix tree (trie) class with an insert function that stores all the standard words and a search function that looks a word up.
  • For each character in the misspelled input word, find up to 10 homophones and substitute each one for that character. This yields at most n*10 candidate words, where n is the length of the word; the count can be lower because some pronunciations have fewer than 10 homophones.
  • Look up each candidate in the prefix tree. The first candidate that is found is returned as the correction.
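The last two steps can be sketched without any third-party package. This is a minimal illustration under stated assumptions: the `HOMOPHONES` table and the `DICTIONARY` set are hypothetical stand-ins (the article's code derives homophones with the pinyin and Pinyin2Hanzi packages and stores the dictionary in a prefix tree), and the sample words are chosen only for demonstration.

```python
# Hypothetical homophone table, for illustration only. The article's code
# obtains these candidates from the pinyin and Pinyin2Hanzi packages.
HOMOPHONES = {
    "专": ["转", "砖", "专"],
    "姐": ["街", "接", "姐"],
    "到": ["道", "倒", "到"],
}

# Stand-in for the word file: a set of known-correct words.
DICTIONARY = {"转塘街道"}


def get_candidates(phrase):
    """Replace each character with each of its homophones, as the article does."""
    out = set()
    for ch in phrase:
        for alt in HOMOPHONES.get(ch, []):
            out.add(phrase.replace(ch, alt))
    return out


def correct(phrase):
    """Return the first candidate present in the dictionary, or None."""
    # sorted() only makes the iteration order deterministic for this demo.
    for cand in sorted(get_candidates(phrase)):
        if cand in DICTIONARY:
            return cand
    return None


print(correct("专塘街道"))  # 转塘街道
print(correct("转塘姐道"))  # 转塘街道
print(correct("转塘街到"))  # 转塘街道
```

A word with no candidate in the dictionary yields `None` here, whereas the article's getCorrection simply leaves such a word out of the result list.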

Code

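import re

import pinyin
from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag


class corrector():
    def __init__(self):
        # Matches a single Chinese character
        self.re_compile = re.compile(r'[\u4e00-\u9fff]')
        # Default parameters for Pinyin2Hanzi's pinyin-to-character conversion
        self.dag_params = DefaultDagParams()

    # Read the words in the file
    def getData(self):
        words = []
        # Replace this path with the location of your own word file
        with open("/Users/wys/Desktop/") as f:
            for line in f:
                # strip() removes the trailing newline so the length check below works
                word = line.strip().split(" ")[0]
                # Only words longer than two characters are kept
                if word and len(word) > 2:
                    res = self.re_compile.findall(word)
                    # Keep the word only if every character is a Chinese character
                    if len(res) == len(word):
                        words.append(word)
        return words

    # Convert a pinyin list into up to 10 candidate Chinese characters (homophones)
    def pinyin_2_hanzi(self, pinyinList):
        result = []
        words = dag(self.dag_params, pinyinList, path_num=10)
        for item in words:
            res = item.path  # Conversion result: a list of characters
            result.append(res[0])
        return result

    # Build the candidate words for a phrase
    def getCandidates(self, phrase):
        chars = {}
        for c in phrase:
            chars[c] = self.pinyin_2_hanzi(pinyin.get(c, format='strip', delimiter=',').split(','))
        replaces = []
        for c in phrase:
            for x in chars[c]:
                # Note: str.replace swaps every occurrence of c in the phrase
                replaces.append(phrase.replace(c, x))
        return set(replaces)

    # Return the corrected word for each input word that can be corrected
    def getCorrection(self, words):
        result = []
        for word in words:
            for candidate in self.getCandidates(word):
                # Tree is the module-level prefix tree built in the main block
                if Tree.search(candidate):
                    result.append(candidate)
                    break
        return result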

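class Node:
    def __init__(self):
        # True if the path from the root to this node spells a complete word
        self.is_word = False
        # Maps a character to the child node reached through it
        self.children = {}


class Trie(object):
    def __init__(self):
        self.root = Node()

    # Insert every word of the list into the prefix tree
    def insert(self, words):
        for word in words:
            cur = self.root
            for w in word:
                if w not in cur.children:
                    cur.children[w] = Node()
                cur = cur.children[w]

            cur.is_word = True

    # Return True only if the exact word was inserted (a bare prefix does not count)
    def search(self, word):
        cur = self.root
        for w in word:
            if w not in cur.children:
                return False
            cur = cur.children[w]

        return cur.is_word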

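if __name__ == '__main__':
    # Initialize the corrector
    c = corrector()
    # Get the words
    words = c.getData()
    # Initialize the prefix tree (getCorrection looks up this module-level name)
    Tree = Trie()
    # Insert all words into the prefix tree
    Tree.insert(words)
    # Test (the inputs must be Chinese strings; they appear here in English translation)
    print(c.getCorrection(['Zhaotang Street','Zhuantang Sister','Chuantang Street']))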
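To make the lookup behaviour of the prefix tree concrete, here is a small standalone sketch. It is not the article's exact class: the names `TrieNode`, `SimpleTrie`, `is_word`, and `children` are assumptions chosen for this example, and `insert` takes one word at a time. It shows that search succeeds only for a complete inserted word, not for a mere prefix of one.

```python
class TrieNode:
    def __init__(self):
        self.is_word = False   # True when a complete word ends at this node
        self.children = {}     # character -> TrieNode


class SimpleTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add one word, creating nodes along its path as needed."""
        cur = self.root
        for ch in word:
            cur = cur.children.setdefault(ch, TrieNode())
        cur.is_word = True

    def search(self, word):
        """Return True only if this exact word was inserted."""
        cur = self.root
        for ch in word:
            cur = cur.children.get(ch)
            if cur is None:
                return False
        return cur.is_word


trie = SimpleTrie()
trie.insert("转塘街道")

print(trie.search("转塘街道"))  # True
print(trie.search("转塘街"))    # False: only a prefix of an inserted word
print(trie.search("专塘街道"))  # False: first character is not in the tree
```

Because the corrector only asks whether a whole candidate exists, the word-end flag is what keeps partial matches from being returned as corrections.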

Result

The print result is:
['Zhuantang Street', 'Zhuantang Street', 'Zhuantang Street']

All three inputs were corrected successfully, so the approach works for these cases. Further optimization is left for later.
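One possible refinement, offered as a sketch and not as the author's plan: the candidate step in the code uses str.replace, which changes every occurrence of a character at once. For a word that contains the same character twice, that rules out candidates where only one of the two positions is wrong. Replacing by position avoids this. The function names and the homophone table below are hypothetical.

```python
def candidates_replace_all(phrase, homophones):
    """Same idea as the article's code: str.replace swaps every occurrence."""
    return {
        phrase.replace(ch, alt)
        for ch in phrase
        for alt in homophones.get(ch, [])
    }


def candidates_by_position(phrase, homophones):
    """Swap exactly one position at a time, leaving the other characters alone."""
    out = set()
    for i, ch in enumerate(phrase):
        for alt in homophones.get(ch, []):
            out.add(phrase[:i] + alt + phrase[i + 1:])
    return out


# Hypothetical homophone table; real candidates would come from Pinyin2Hanzi.
homophones = {"形": ["星", "行", "形"]}

# "形形" stands in for a mistyped "星形", where only the first character is wrong.
print("星形" in candidates_replace_all("形形", homophones))  # False
print("星形" in candidates_by_position("形形", homophones))  # True
```

The positional version produces more candidates for words with repeated characters, so it costs a few more prefix-tree lookups in exchange for covering that case.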
