Finding keyword similarity across many product reviews to detect duplicates

Question  Votes: 0  Answers: 1

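I have a collection of product reviews from several websites and am trying to identify reviews that are likely duplicates (that is, the wording is very similar). I know there is a lot of room for ambiguity here, but I am hoping for a reasonable solution. I am not interested in sentiment analysis itself, such as how many of these reviews are positive.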

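I have found some dupes with SQL queries, but that only catches reviews that are identical, which is not ideal. Is there a simple way to use a library like spaCy (which I have used a little in the past) or tiktoken to tokenize the reviews and compute a similarity score?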

spacy openai-api
1 Answer
0
votes

There are several ways to approach this.

1. Use a Hugging Face sentence similarity model

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util
sentences = ["I'm happy", "I'm full of happiness"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Compute an embedding for each of the two sentences
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)

# Cosine similarity between the two embeddings (1.0 means same direction)
print(util.cos_sim(embedding_1, embedding_2))
# tensor([[0.6003]])

2. You can use n-gram matching.

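Preprocess the text first, for example by removing stop words, lowercasing, and normalizing spelling. Then measure how many n-grams each pair of reviews shares and choose a threshold above which a pair counts as a likely duplicate.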
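A minimal sketch of that n-gram idea, using only the standard library. The answer does not say which overlap measure to use, so Jaccard similarity over word bigrams is an assumption here, as are the tiny `STOP_WORDS` list, the helper names, and the 0.5 threshold; tune all of them on your own data.

```python
import re

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "the", "is", "it", "i", "this", "and", "to", "of"}

def word_ngrams(text, n=2):
    """Lowercase, drop stop words, and return the set of word n-grams."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in STOP_WORDS]
    if len(tokens) < n:
        # Too short for a full n-gram: fall back to one tuple of all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_near_duplicates(reviews, n=2, threshold=0.5):
    """Return (i, j, score) for every pair whose n-gram overlap meets the threshold."""
    grams = [word_ngrams(r, n) for r in reviews]
    pairs = []
    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):  # each unordered pair once
            score = jaccard(grams[i], grams[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs

if __name__ == "__main__":
    sample = [
        "The product is great! I loved it.",
        "The product is great, I loved it!!",
        "Worst thing ever! Do not recommend.",
    ]
    for i, j, score in find_near_duplicates(sample):
        print(i, j, round(score, 2))
```

Word bigrams are sensitive to word order, so a reordered review may score low; setting `n=1` compares plain word sets instead. The pairwise loop is quadratic in the number of reviews, which is fine for a few thousand reviews but would need a different strategy for much larger collections.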

3. You can also try cosine similarity matching on TF-IDF vectors

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example: Replace this with your dataset
reviews = [
    "The product is great! I loved it.",
    "I loved the product. It's great!",
    "This is the worst thing I've ever purchased.",
    "Amazing product, will buy again!",
    "Worst thing ever! Do not recommend.",
]

# 1. Convert text to numerical vectors using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english')  # Remove stop words for cleaner vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews)    # Create the TF-IDF matrix

# 2. Compute cosine similarity for all pairs
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 3. Flatten the matrix and create a DataFrame of similarity scores
pairs = []
for i in range(len(reviews)):
    for j in range(i + 1, len(reviews)):  # Compare only unique pairs
        pairs.append((i, j, cosine_sim_matrix[i, j]))

similarity_df = pd.DataFrame(pairs, columns=["Review1", "Review2", "Cosine Similarity"])
print(similarity_df)

# 4. Filter pairs with high cosine similarity (e.g., > 0.85)
threshold = 0.85
potential_duplicates = similarity_df[similarity_df["Cosine Similarity"] > threshold]
print("\nPotential Duplicates:")
print(potential_duplicates)