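I have a set of product reviews collected from several websites, and I am trying to identify reviews that are probably duplicates (that is, the wording is very similar). I know there is a lot of room for ambiguity here, but I am hoping for a reasonable solution. I am not interested in sentiment analysis as such, for example how many of the reviews are positive.
I found some duplicates with SQL queries, but those only catch exact matches, which is not ideal. Is there a simple way to tokenize the reviews and score their similarity using a library such as spaCy (which I have used a little in the past) or tiktoken?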
There are several ways to approach this.
1. Use a Hugging Face sentence-similarity model
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
sentences = ["I'm happy", "I'm full of happiness"]
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Compute an embedding for each sentence
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)
# Cosine similarity between the two embeddings (closer to 1 means more similar)
print(util.pytorch_cos_sim(embedding_1, embedding_2))
## tensor([[0.6003]])
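2. You can use n-gram matching.
Preprocess the text first, for example by removing stop words, lowercasing, and normalizing punctuation. Then compute the n-gram overlap between each pair of reviews and choose a threshold above which a pair counts as a likely duplicate.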
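As a rough sketch of the n-gram approach, assuming word-level n-grams and Jaccard overlap as the score (both are my choices for illustration, not the only option), something like the following could work. The stop-word list and the helper names `word_ngrams`, `jaccard`, and `ngram_similarity` are made up for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption; use a fuller list on real data).
STOP_WORDS = {"a", "an", "the", "is", "it", "its", "i", "this", "and", "to", "of"}

def word_ngrams(text, n=2):
    """Lowercase, drop punctuation and stop words, return the set of word n-grams."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in STOP_WORDS]
    if len(tokens) < n:
        # Text shorter than n tokens: treat the whole text as a single gram.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def ngram_similarity(text_a, text_b, n=2):
    """Score in [0, 1]; higher means more shared n-grams."""
    return jaccard(word_ngrams(text_a, n), word_ngrams(text_b, n))

if __name__ == "__main__":
    a = "The product is great! I loved it."
    b = "I loved the product. It's great!"
    print(ngram_similarity(a, b, n=1))  # unigrams ignore word order
    print(ngram_similarity(a, b, n=2))  # bigrams are sensitive to word order
```

Note that bigrams and longer n-grams penalize reordered sentences, so n=1 or character n-grams may suit reviews that were lightly rephrased. The threshold has to be tuned on your own data.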
3. You can also try cosine similarity matching on TF-IDF vectors
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# Example: Replace this with your dataset
reviews = [
"The product is great! I loved it.",
"I loved the product. It's great!",
"This is the worst thing I've ever purchased.",
"Amazing product, will buy again!",
"Worst thing ever! Do not recommend.",
]
# 1. Convert text to numerical vectors using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english') # Remove stop words for cleaner vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews) # Create the TF-IDF matrix
# 2. Compute cosine similarity for all pairs
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
# 3. Flatten the matrix and create a DataFrame of similarity scores
pairs = []
for i in range(len(reviews)):
    for j in range(i + 1, len(reviews)):  # Compare only unique pairs
        pairs.append((i, j, cosine_sim_matrix[i, j]))
similarity_df = pd.DataFrame(pairs, columns=["Review1", "Review2", "Cosine Similarity"])
print(similarity_df)
# 4. Filter pairs with high cosine similarity (e.g., > 0.85)
threshold = 0.85
potential_duplicates = similarity_df[similarity_df["Cosine Similarity"] > threshold]
print("\nPotential Duplicates:")
print(potential_duplicates)