
Three ways to detect similarity between two text files in Python

Detecting the similarity between two text files is a common task, useful in scenarios such as text deduplication and plagiarism detection. Python offers a variety of ways to implement it, based on string matching, word frequency statistics, or machine learning. Below are several commonly used methods and their implementations.

1. Methods based on string matching

1.1 Levenshtein Distance

Principle: Calculate the edit distance between two strings, i.e. the number of insert, delete, and replace operations needed to turn one into the other. For example, turning "kitten" into "sitting" takes three edits, so the normalized similarity is 1 - 3/7 ≈ 0.57.

Advantages: Simple and intuitive.

Disadvantages: Computationally expensive; not suitable for long texts.

import Levenshtein

def similarity_levenshtein(text1, text2):
    # Normalized similarity: 1 minus the edit distance divided by the longer length
    distance = Levenshtein.distance(text1, text2)
    max_len = max(len(text1), len(text2))
    return 1 - (distance / max_len)

# Read the two files (fill in the paths of the files you want to compare)
with open("", "r") as f1, open("", "r") as f2:
    text1 = f1.read()
    text2 = f2.read()

similarity = similarity_levenshtein(text1, text2)
print(f"Similarity (Levenshtein): {similarity:.2f}")

1.2 Jaccard Similarity

Principle: Calculate the ratio of the intersection to the union of the two word sets. For example, the sets {a, b, c} and {b, c, d} share 2 words out of 4 distinct words, giving a similarity of 0.5.

Advantages: Suitable for short texts or word-level similarity.

Disadvantages: Ignores word order and semantics.

Case 1:

def similarity_jaccard(text1, text2):
    # Treat each text as a set of words and compare intersection to union
    set1 = set(text1.split())
    set2 = set(text2.split())
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)

# Read the two files (fill in the paths of the files you want to compare)
with open("", "r") as f1, open("", "r") as f2:
    text1 = f1.read()
    text2 = f2.read()

similarity = similarity_jaccard(text1, text2)
print(f"Similarity (Jaccard): {similarity:.2f}")

Case 2:

Jaccard similarity measures how alike two sets are by the ratio of their intersection to their union; for text, the words can be treated as the set elements. The two methods above measure text similarity from different perspectives, so choose whichever fits your needs, and remember to replace the placeholder file paths with the files you actually want to compare. (A file-based Jaccard variant in the same style is sketched after the code below.)

import Levenshtein

def compare_text_files_edit_distance(file1_path, file2_path):
    try:
        with open(file1_path, 'r', encoding='utf-8') as file1:
            text1 = file1.read()
        with open(file2_path, 'r', encoding='utf-8') as file2:
            text2 = file2.read()

        # Normalized similarity: 1 minus the edit distance divided by the longer length
        distance = Levenshtein.distance(text1, text2)
        max_length = max(len(text1), len(text2))
        similarity = 1 - (distance / max_length)
        return similarity
    except FileNotFoundError:
        print("Error: File not found!")
    except Exception as e:
        print(f"Error: an unexpected error occurred: {e}")
    return None

if __name__ == "__main__":
    file1_path = ''  # Fill in the path of the first file
    file2_path = ''  # Fill in the path of the second file
    similarity = compare_text_files_edit_distance(file1_path, file2_path)
    if similarity is not None:
        print(f"The similarity between the two files is: {similarity:.2f}")
    

2. Methods based on word frequency statistics

2.1 Cosine similarity

Principle: Represent each text as a word-frequency vector and calculate the cosine similarity between the vectors.

Advantages: Suitable for long texts; takes word frequency into account.

Disadvantages: Ignores word order and semantics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_cosine(text1, text2):
    # Build word-count vectors for both texts and compare them with cosine similarity
    vectorizer = CountVectorizer().fit_transform([text1, text2])
    vectors = vectorizer.toarray()
    return cosine_similarity([vectors[0]], [vectors[1]])[0][0]

# Read the two files (fill in the paths of the files you want to compare)
with open("", "r") as f1, open("", "r") as f2:
    text1 = f1.read()
    text2 = f2.read()

similarity = similarity_cosine(text1, text2)
print(f"Similarity (Cosine): {similarity:.2f}")

2.2 TF-IDF similarity

Principle: Represent each text as a TF-IDF vector and calculate the cosine similarity between the vectors.

Advantages: Takes the importance of words into account; suitable for long texts.

Disadvantages: Ignores word order and semantics.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_tfidf(text1, text2):
    # Build TF-IDF vectors for both texts and compare them with cosine similarity
    vectorizer = TfidfVectorizer().fit_transform([text1, text2])
    vectors = vectorizer.toarray()
    return cosine_similarity([vectors[0]], [vectors[1]])[0][0]

# Read the two files (fill in the paths of the files you want to compare)
with open("", "r") as f1, open("", "r") as f2:
    text1 = f1.read()
    text2 = f2.read()

similarity = similarity_tfidf(text1, text2)
print(f"Similarity (TF-IDF): {similarity:.2f}")

3. Semantics-based methods

3.1 Word2Vec + cosine similarity

Principle: Represent each text as the average of its word vectors and calculate the cosine similarity between the vectors.

Advantages: Takes semantic information into account.

Disadvantages: Requires a pre-trained word vector model.

from gensim.models import KeyedVectors
import numpy as np

# Load the pretrained word vector model (fill in the path to your model file)
word2vec_model = KeyedVectors.load_word2vec_format("path/to/", binary=True)

def text_to_vector(text):
    # Average the vectors of all words that appear in the model's vocabulary
    words = text.split()
    vectors = [word2vec_model[word] for word in words if word in word2vec_model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.vector_size)

def similarity_word2vec(text1, text2):
    vec1 = text_to_vector(text1)
    vec2 = text_to_vector(text2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Read the two files (fill in the paths of the files you want to compare)
with open("", "r") as f1, open("", "r") as f2:
    text1 = f1.read()
    text2 = f2.read()

similarity = similarity_word2vec(text1, text2)
print(f"Similarity (Word2Vec): {similarity:.2f}")

3.2 BERT + Cosine Similarity

Principle: Use a pre-trained BERT model to encode each text as a vector, then calculate the cosine similarity between the vectors.

Advantages: Takes contextual semantic information into account.

Disadvantages: Computationally expensive; benefits from GPU acceleration (see the note after the code below).

from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load the pretrained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def text_to_bert_vector(text):
    # Encode the text and use the mean of the last hidden states as a sentence vector
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).detach().numpy()

def similarity_bert(text1, text2):
    vec1 = text_to_bert_vector(text1)[0]  # flatten the (1, hidden_size) array
    vec2 = text_to_bert_vector(text2)[0]
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Read the two files (fill in the paths of the files you want to compare)
with open("", "r") as f1, open("", "r") as f2:
    text1 = f1.read()
    text2 = f2.read()

similarity = similarity_bert(text1, text2)
print(f"Similarity (BERT): {similarity:.2f}")

4. Summary

Choose the right method according to your needs; a small side-by-side sketch follows the list below.

If you need to quickly calculate the similarity of short texts, you can use Levenshtein distance or Jaccard similarity.

If you need to process long text and take into account word frequency information, you can use cosine similarity or TF-IDF similarity.

If you need to consider semantic information, you can use Word2Vec or BERT.
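
As a rough usage sketch (assuming the functions similarity_levenshtein, similarity_jaccard, and similarity_tfidf from the earlier sections are all defined in the same script, and with file1.txt / file2.txt as placeholder paths), you can run the lightweight methods side by side and compare their scores:

# Compare the lightweight methods on the same pair of files (placeholder paths)
with open("file1.txt", "r", encoding="utf-8") as f1, open("file2.txt", "r", encoding="utf-8") as f2:
    text1 = f1.read()
    text2 = f2.read()

methods = [
    ("Levenshtein", similarity_levenshtein),
    ("Jaccard", similarity_jaccard),
    ("TF-IDF", similarity_tfidf),
]
for name, func in methods:
    print(f"{name}: {func(text1, text2):.2f}")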

This concludes this article on three ways to detect the similarity between two text files in Python. For more on detecting text similarity with Python, please search my earlier articles or continue browsing the related articles below, and I hope you will continue to support me in the future!