1. The longest common subsequence
The longest common subsequence (LCS) of two or more sequences is the longest sequence of elements that appears in all of them in the same relative order, although not necessarily contiguously (a contiguous match is the longest common substring, which is a different problem). For example, the LCS of "ABCBDAB" and "BDCABA" has length 4 (one such subsequence is "BCBA"). In computer science, finding the longest common subsequence is a classic problem, usually solved with dynamic programming.
The dynamic programming solution to the longest common subsequence problem works as follows:
Initialize the state array: create a two-dimensional array dp of size (m+1)×(n+1), where m and n are the lengths of the two sequences. dp[i][j] represents the length of the longest common subsequence of the first i characters of sequence 1 and the first j characters of sequence 2.
Fill the state array: traverse both sequences; for each pair of characters, if they are the same, dp[i][j] = dp[i-1][j-1] + 1; if they differ, dp[i][j] = max(dp[i-1][j], dp[i][j-1]).
Read off the result: the last element of the dp array, dp[m][n], is the length of the longest common subsequence of the two sequences. The subsequence itself can be recovered by backtracking through the dp table (see the sketch after the code below).
The Python code is as follows:
## Longest common subsequence: computes the length of the longest common subsequence
## (a subsequence need not be contiguous, unlike a substring)
def LCS(str_a, str_b):
    if len(str_a) == 0 or len(str_b) == 0:
        return 0
    # Rolling one-dimensional dp array over the columns of the full table
    dp = [0 for _ in range(len(str_b) + 1)]
    for i in range(1, len(str_a) + 1):
        left_up = 0
        dp[0] = 0
        for j in range(1, len(str_b) + 1):
            left = dp[j - 1]   # corresponds to dp[i][j-1]
            up = dp[j]         # corresponds to dp[i-1][j]
            if str_a[i - 1] == str_b[j - 1]:
                dp[j] = left_up + 1
            else:
                dp[j] = max([left, up])
            left_up = up       # becomes dp[i-1][j-1] for the next j
    return dp[len(str_b)]

## Convert the LCS length to a value between 0 and 1; the closer the result is to 1, the greater the similarity
def LCS_Score(str_a, str_b):
    return round(LCS(str_a, str_b) * 2 / (len(str_a) + len(str_b)), 2)

## LCS_Score(str_a, str_b)

## Calculate similarity for 2 columns of a dataframe according to the longest common subsequence
## df: data source
## col_name1, col_name2: the 2 column names used to calculate similarity
## simarity_score_name: column name of the returned similarity result
## Returns the dataframe, where simarity_score_name holds the similarity of the text in the 2 columns
def df_simarity_lcs(df, col_name1, col_name2, simarity_score_name):
    df[simarity_score_name] = list(map(lambda str_a, str_b: LCS_Score(str_a, str_b), df[col_name1], df[col_name2]))
    return df
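Step 3 above mentions recovering the subsequence itself by backtracking. The rolling array used in LCS above discards the rows needed for that, so here is a minimal sketch that keeps the full dp table; LCS_traceback is a hypothetical helper name, not part of the original code:
# Minimal sketch: recover one LCS by backtracking through a full (m+1) x (n+1) table.
# LCS_traceback is a hypothetical name added for illustration.
def LCS_traceback(str_a, str_b):
    m, n = len(str_a), len(str_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str_a[i - 1] == str_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back from dp[m][n], collecting the matched characters
    i, j, chars = m, n, []
    while i > 0 and j > 0:
        if str_a[i - 1] == str_b[j - 1]:
            chars.append(str_a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))

# e.g. LCS_traceback("ABCBDAB", "BDCABA") returns one longest common subsequence of length 4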
2. Jaccard Similarity
Jaccard similarity is the ratio of the size of the intersection of two sets to the size of their union.
It is suitable for comparing short texts or keyword lists.
## Use the set method to calculate the similarity of 2 sets
def similarity(a, b):
    try:
        return len(a & b) / len(a | b)
    except ZeroDivisionError:
        # Both sets are empty; return a small negative sentinel value
        return -1e-4

## Use the set method to calculate the text similarity of 2 columns in a dataframe
## df: data source
## col_name1, col_name2: the 2 column names used to calculate similarity
## simarity_score_name: column name of the returned similarity result
## Returns the dataframe, where simarity_score_name holds the similarity of the text in the 2 columns
def df_simarity_jh(df, col_name1, col_name2, simarity_score_name):
    df[simarity_score_name] = list(map(lambda str_a, str_b: similarity(set(str_a), set(str_b)), df[col_name1], df[col_name2]))
    return df
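Note that set(str_a) turns a raw string into a set of characters; for word-level Jaccard similarity, tokenize first. A small usage sketch (the sample sentences are only illustrative):
# Character-level Jaccard similarity on raw strings
print(similarity(set("python programming"), set("programming in python")))

# Word-level Jaccard similarity after a simple whitespace tokenization
tokens_a = set("I love Python programming".lower().split())
tokens_b = set("Python programming is great".lower().split())
print(similarity(tokens_a, tokens_b))  # 2 shared words out of 6 distinct words -> ~0.33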
3. Cosine Similarity
Cosine similarity evaluates how similar two texts are by computing the cosine of the angle between their vector representations in space.
It is usually used in combination with a bag-of-words (BOW) model or TF-IDF.
## vec1, vec2: the vectors to be compared
## Returns the cosine similarity of the 2 vectors
def cosine_simi(vec1, vec2):
    from scipy import spatial
    # scipy returns the cosine *distance*, so subtract it from 1 to get similarity
    return 1 - spatial.distance.cosine(vec1, vec2)

## Calculate cosine similarity for 2 columns of a dataframe
## df: data source
## col_name1, col_name2: the 2 column names used to calculate similarity; each cell must already contain a vector (e.g. BOW or TF-IDF)
## simarity_score_name: column name of the returned similarity result
## Returns the dataframe, where simarity_score_name holds the similarity of the vectors in the 2 columns
def df_simarity_cosine(df, col_name1, col_name2, simarity_score_name):
    df[simarity_score_name] = list(map(lambda vec_a, vec_b: cosine_simi(vec_a, vec_b), df[col_name1], df[col_name2]))
    return df
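Because cosine_simi works on vectors rather than raw strings, the text has to be vectorized first, for example with the bag-of-words model mentioned above. A minimal sketch using scikit-learn's CountVectorizer (the sample texts are only illustrative):
from sklearn.feature_extraction.text import CountVectorizer

# Turn the two texts into bag-of-words count vectors over a shared vocabulary
texts = ["I love Python programming", "Python programming is great"]
bow = CountVectorizer().fit_transform(texts).toarray()

# Feed the dense count vectors into cosine_simi defined above
print(cosine_simi(bow[0], bow[1]))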
4. Supplementary methods
TF-IDF
TF-IDF is a statistical method for evaluating how important a word is to a document within a document collection. It represents each text as a weighted vector, on which cosine similarity can then be computed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_tfidf_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    corpus = [text1, text2]
    vectors = vectorizer.fit_transform(corpus)
    similarity = cosine_similarity(vectors)
    return similarity[0][1]

text1 = "I love Python programming"
text2 = "Python programming is great"
tfidf_cosine_similarity = calculate_tfidf_cosine_similarity(text1, text2)
print(tfidf_cosine_similarity)
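To stay consistent with the dataframe helpers used earlier, the same score can be applied column-wise. df_simarity_tfidf below is a hypothetical helper written in the same style, not part of the original code; note that it refits a vectorizer for every row pair, so for large dataframes it is more efficient to fit one vectorizer over all texts first.
## Hypothetical helper in the same style as df_simarity_lcs / df_simarity_jh:
## applies the TF-IDF cosine score row by row to 2 text columns.
def df_simarity_tfidf(df, col_name1, col_name2, simarity_score_name):
    df[simarity_score_name] = list(map(
        lambda str_a, str_b: calculate_tfidf_cosine_similarity(str_a, str_b),
        df[col_name1], df[col_name2]))
    return df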
Word2Vec
Word2Vec is a model that represents words as vectors capturing the semantic relationships between them. With a pre-trained word-vector model, you can compute the similarity between texts.
import gensim.downloader as api
from gensim import matutils
import numpy as np

def calculate_word2vec_similarity(text1, text2):
    # Download/load the pre-trained Google News vectors (large; cached after first call)
    model = api.load("word2vec-google-news-300")
    tokens1 = text1.lower().split()
    tokens2 = text2.lower().split()
    # Average the word vectors of the tokens found in the vocabulary
    vec1 = np.mean([model[token] for token in tokens1 if token in model], axis=0)
    vec2 = np.mean([model[token] for token in tokens2 if token in model], axis=0)
    # Cosine similarity of the two averaged vectors
    return np.dot(matutils.unitvec(vec1), matutils.unitvec(vec2))

text1 = "I love Python programming"
text2 = "Python programming is great"
word2vec_similarity = calculate_word2vec_similarity(text1, text2)
print(word2vec_similarity)
Doc2Vec
Doc2Vec is a model that represents a whole document as a vector capturing the semantic relationships between documents. Similar to Word2Vec, you can use a Doc2Vec model to calculate similarity between texts (for simplicity, the example below trains a small model directly on the two texts instead of loading a pre-trained one).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

def calculate_doc2vec_similarity(text1, text2):
    # Tag each text so its document vector can be looked up after training
    corpus = [TaggedDocument(text1.lower().split(), ["text1"]),
              TaggedDocument(text2.lower().split(), ["text2"])]
    model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)
    vec1 = model.dv["text1"]   # model.docvecs["text1"] in older gensim versions
    vec2 = model.dv["text2"]
    # Cosine similarity of the two document vectors
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

text1 = "I love Python programming"
text2 = "Python programming is great"
doc2vec_similarity = calculate_doc2vec_similarity(text1, text2)
print(doc2vec_similarity)
These methods can be selected and combined according to specific needs, providing powerful text similarity capabilities for natural language processing tasks. In practice you will encounter scenarios such as recommendation systems, question answering, and text clustering, and in each of them choosing an appropriate similarity measure is crucial.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based pre-trained model that produces context-sensitive word representations. Text can be encoded into a vector with a BERT model and the cosine similarity then computed.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_bert_similarity(text1, text2):
    model = SentenceTransformer("bert-base-nli-mean-tokens")
    # Encode both texts into sentence embeddings
    embeddings = model.encode([text1, text2])
    similarity = cosine_similarity(embeddings)
    return similarity[0][1]

text1 = "I love Python programming"
text2 = "Python programming is great"
bert_similarity = calculate_bert_similarity(text1, text2)
print(bert_similarity)
This concludes this detailed look at common text similarity calculation methods in Python. For more on computing text similarity in Python, please search my previous articles or continue browsing the related articles below, and I hope you will keep supporting my work!