SoFunction
Updated on 2024-10-30

Python Text Similarity: Edit Distance in Detail

Edit Distance

Edit distance, also known as Levenshtein distance, is the minimum number of single-character edits needed to turn one string into another. The permitted edits are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.

For example, converting the word kitten to sitting takes three edits, so the edit distance between 'kitten' and 'sitting' is 3:

     kitten → sitten (substitute k with s)

     sitten → sittin (substitute e with i)

     sittin → sitting (insert g at the end)

The Levenshtein package for Python makes it easy to calculate edit distances.

Package installation: pip install python-Levenshtein

Let's use it:

# -*- coding:utf-8 -*-
import Levenshtein

# The two strings differ only in the separator: a space versus a bullet.
texta = '艾伦 图灵传'
textb = '艾伦•图灵传'
print(Levenshtein.distance(texta, textb))

Under Python 2 the program above prints 3, even though only one character differs. Why does this happen?

The reason is that Python 2 treats these two literals as byte strings (type str), and under the default utf-8 encoding a non-ASCII character such as a Chinese character occupies three bytes. The distance is therefore computed byte by byte rather than character by character.

The solution is to make both strings Unicode with the u prefix, which returns the correct result of 1. In Python 3, str is already Unicode, so the prefix is not needed.

# -*- coding:utf-8 -*-
import Levenshtein

# The u prefix makes both literals Unicode strings, so the comparison is per character.
texta = u'艾伦 图灵传'
textb = u'艾伦•图灵传'
print(Levenshtein.distance(texta, textb))
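The gap between byte-level and character-level comparison can be checked without the library. The following is a minimal sketch, assuming Python 3 and a hypothetical pair of strings that differ only in one separator character; it simply counts code points and UTF-8 bytes.

```python
# Hypothetical sample strings: they differ only in the separator
# (an ASCII space versus a bullet character).
texta = "艾伦 图灵传"
textb = "艾伦•图灵传"

# Character-level view: both strings contain six code points.
char_lengths = (len(texta), len(textb))

# Byte-level view: each Chinese character and the bullet take three bytes
# in UTF-8, while the ASCII space takes a single byte.
byte_lengths = (len(texta.encode("utf-8")), len(textb.encode("utf-8")))

print(char_lengths)
print(byte_lengths)
```

Replacing the one-byte space with a three-byte bullet changes three bytes, which is why a byte-wise comparison reports 3 while a character-wise comparison reports 1.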

The following describes what several functions in the package do:

Levenshtein.distance(str1, str2)

Calculates the edit distance (also known as the Levenshtein distance): the minimum number of operations needed to transform one string into another, where the operations are insertion, deletion, and substitution. Algorithm implementation: dynamic programming.
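To make the dynamic programming idea concrete, here is a minimal pure-Python sketch. The function name edit_distance is hypothetical, and this is an illustration of the standard recurrence, not the package's own implementation.

```python
def edit_distance(s1, s2):
    """Levenshtein distance computed row by row with dynamic programming."""
    # previous[j] is the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into the empty string costs i
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2, or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Keeping only two rows at a time is a design choice that reduces memory from the full table to one row per string length.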

Levenshtein.hamming(str1, str2)

Calculates the Hamming distance: the number of positions at which two equal-length strings have different characters. str1 and str2 must be of the same length.
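A minimal pure-Python sketch of the same idea follows. The name hamming_distance is hypothetical, and raising ValueError on unequal lengths is an assumption of this sketch rather than a statement about the package's behaviour.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Assumption of this sketch: reject unequal lengths explicitly.
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_distance("karolin", "kathrin"))
```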

Levenshtein.ratio(str1, str2)

Calculates the Levenshtein ratio: r = (sum - ldist) / sum, where sum is the total length of str1 and str2 and ldist is a modified edit distance. In this modified distance, deletion and insertion still cost 1, but substitution costs 2.
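The formula can be worked through with a small pure-Python sketch. The names weighted_distance and similarity_ratio are hypothetical, and returning 1.0 for two empty strings is an assumption of this sketch; it follows the description above rather than the package's source.

```python
def weighted_distance(s1, s2):
    """Edit distance where insert/delete cost 1 and substitution costs 2."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 2
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (cost 2) or match
            ))
        previous = current
    return previous[-1]


def similarity_ratio(s1, s2):
    """r = (sum - ldist) / sum, with sum the total length of both strings."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - weighted_distance(s1, s2)) / total


print(round(similarity_ratio("kitten", "sitting"), 3))
```

For kitten and sitting, sum is 13 and the modified distance is 5 (two substitutions at cost 2 plus one insertion), so the ratio is 8/13.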

Levenshtein.jaro(s1, s2)

Calculates the Jaro Distance. The Jaro Distance is said to be used to decide whether two names on health records are the same, and it is also said to be used in census work. Its definition is as follows.

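The Jaro Distance of two given strings s1 and s2 is:

$$d_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases}$$

where m is the number of matching characters between s1 and s2, and t is the number of transpositions (half the number of matching characters that appear in a different order).

Two characters from s1 and s2, respectively, are considered matching only if they are the same and are separated by no more than the following number of positions:

$$\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$$

For example, all the characters of MARTHA and MARHTA match (m = 6), but T and H have to be transposed to change MARTHA into MARHTA, so T and H are matching characters in a different order and t = 2/2 = 1.

The Jaro Distance of the two strings is:

$$d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) = 0.944$$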
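A minimal pure-Python sketch of the Jaro definition may help here. The function name jaro is hypothetical and the code follows the commonly cited definition (matching window of half the longer length minus one, transpositions counted as half the out-of-order matches); it is not the package's implementation.

```python
def jaro(s1, s2):
    """Jaro similarity following the commonly cited definition."""
    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare the matched characters in order; each out-of-order pair
    # counts as half a transposition.
    seq1 = [ch for ch, hit in zip(s1, matched1) if hit]
    seq2 = [ch for ch, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 3))
```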


Levenshtein.jaro_winkler(s1, s2)

Calculates the Jaro-Winkler Distance. Jaro-Winkler gives a higher score to strings that share the same starting part. It defines a prefix scaling factor p, and if the two strings share a common prefix of length ℓ, the Jaro-Winkler Distance is:

$$d_w = d_j + \ell p (1 - d_j)$$

d_j is the Jaro Distance of the two strings

ℓ is the length of the common prefix, up to a maximum of 4

p is a constant scaling factor for adjusting the score. It must not exceed 0.25, or else d_w may be greater than 1. Winkler's standard value for this constant is 0.1

Thus, for MARTHA and MARHTA mentioned above, which share the common prefix MAR of length 3, the Jaro-Winkler Distance is:

$$d_w = 0.944 + 3 \times 0.1 \times (1 - 0.944) = 0.961$$
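The calculation above can be reproduced with a short pure-Python sketch. The names jaro and jaro_winkler, and the parameters p and max_prefix, are hypothetical choices for this illustration and do not describe the package's signature.

```python
def jaro(s1, s2):
    """Jaro similarity following the commonly cited definition."""
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    seq1 = [ch for ch, hit in zip(s1, matched1) if hit]
    seq2 = [ch for ch, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score by the common prefix length (capped at 4)."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
```

With p = 0.1 and a prefix capped at 4, the boost is at most 0.4 of the remaining gap to 1, so the result stays within the range from 0 to 1.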

Points where I personally feel the algorithm could be refined:

Removal of stop words (mainly the effect of punctuation)

When analyzing Chinese, is it better to compare by word than by character?
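As a rough illustration of the first point, punctuation can be stripped before the strings are compared. The following is a minimal sketch, assuming Python 3; the helper name strip_punctuation and the sample strings are hypothetical, and the choice to drop separator characters such as spaces along with punctuation is an assumption of this sketch.

```python
import unicodedata


def strip_punctuation(text):
    """Drop punctuation (category P*) and separators (category Z*)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "Z"))
    )


# Hypothetical sample strings that differ only in the separator character.
texta = "艾伦 图灵传"
textb = "艾伦•图灵传"

print(strip_punctuation(texta) == strip_punctuation(textb))
```

After this normalization the two sample strings become identical, so any of the distances discussed in this article would treat them as the same text.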

Summary

That is all for this article. I hope it helps you learn or use Python. If you have questions, feel free to leave a comment.

Other references:

/wiki/Jaro%E2%80%93Winkler_distance

/courses/LT1/2011/slides/#Levenshtein-inverse