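I'm writing a function that recommends 5 products. I use cosine similarity as the similarity measure, and each product is represented by a length-2 array of its t-SNE feature values, i.e. its x and y coordinates.
My input is the name of a product. I want to iterate over the data frame, compute the cosine similarity between the input product and every product, and then use df.at to write that value into each row's 'dist' column. point_1 has shape (1,2) and point_2 has shape (2,).
However, when I call the function with an example, I get the following error message: ValueError: Must have equal len keys and value when setting with an iterable
How should I modify my function to fix this?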
def yesstyle_recommender(product, label, df):
    #filter df by label
    filtered_df = df[df['label'] == label].reset_index().drop('index', axis = 1)
    #extract the product name, has to exactly match
    myItem = filtered_df[filtered_df['name'].str.contains(product, case=False)]
    if myItem.empty:
        print("Product not found.")
        return None
    # extract tsne values for the target item
    X = myItem['X'].values
    Y = myItem['Y'].values
    point_1 = np.array([X,Y]).T
    #instantiate 'dist' column to 0
    filtered_df['dist'] = 0.0
    #iterate through df and calculate cos sim
    for i in range(len(filtered_df)):
        point_2 = np.array([filtered_df['X'][i], filtered_df['Y'][i]])
        filtered_df.at[i, 'dist'] = np.dot(point_1, point_2) / (norm(point_1) * norm(point_2))
    #sort by 'dist' in descending order
    filtered_df = filtered_df.sort_values('dist', ascending=False)
    top_5_recommendations = filtered_df[['product','brand','price','dist']]
    return top_5_recommendations

#call function using an example product
yesstyle_recommender('Relief Sun','spf', yesstyle)
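The error comes from the fact that X and Y are each arrays with a single element, so np.array([X, Y]) has shape (2,1) and, after the transpose, point_1 has shape (1,2). np.dot(point_1, point_2) then returns a one-element array instead of a scalar, and df.at cannot assign an array to a single cell. You need to make sure the two points compared in the cosine similarity calculation are both 1-D arrays of shape (2,), which you can do by taking the first element of X and Y. Unless your data looks very different from the example I created, the following version should work.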
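To make the shape issue concrete, here is a minimal sketch that reproduces it outside the DataFrame. The values of `X`, `Y` and `point_2` are made up for illustration; only the shapes matter.

```python
import numpy as np
from numpy.linalg import norm

# Made-up values standing in for a single matched row:
# myItem['X'].values is a 1-element array, not a scalar.
X = np.array([1.0])
Y = np.array([1.0])
point_2 = np.array([2.0, 1.5])  # shape (2,)

# Construction from the question: 2-D array of shape (1, 2).
point_1_2d = np.array([X, Y]).T
sim_2d = np.dot(point_1_2d, point_2) / (norm(point_1_2d) * norm(point_2))

# Construction with the first element taken: 1-D array of shape (2,).
point_1_1d = np.array([X[0], Y[0]])
sim_1d = np.dot(point_1_1d, point_2) / (norm(point_1_1d) * norm(point_2))

print(point_1_2d.shape, np.shape(sim_2d))  # 2-D input gives an array result
print(point_1_1d.shape, np.shape(sim_1d))  # 1-D input gives a scalar result
```

The first result is an array of shape (1,), which is what df.at refuses to store in one cell; the second is a plain scalar. The corrected function below takes the first element of each array for this reason.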
import numpy as np
import pandas as pd
from numpy.linalg import norm

def yesstyle_recommender(product, label, df):
    # keep only the rows with the requested label and renumber them from 0
    filtered_df = df[df['label'] == label].reset_index().drop('index', axis=1)
    # case-insensitive substring match on the product name
    myItem = filtered_df[filtered_df['name'].str.contains(product, case=False)]
    if myItem.empty:
        print("Product not found.")
        return None
    # take the first match as scalars so that point_1 has shape (2,)
    X = myItem['X'].values[0]
    Y = myItem['Y'].values[0]
    point_1 = np.array([X, Y])
    filtered_df['dist'] = 0.0
    # cosine similarity between the target and every row
    for i in range(len(filtered_df)):
        point_2 = np.array([filtered_df['X'][i], filtered_df['Y'][i]])
        filtered_df.at[i, 'dist'] = np.dot(point_1, point_2) / (norm(point_1) * norm(point_2))
    # most similar first, then keep the top 5
    filtered_df = filtered_df.sort_values('dist', ascending=False)
    top_5_recommendations = filtered_df[['name', 'brand', 'price', 'dist']].head(5)
    return top_5_recommendations

# example data
data = {
    'name': ['Relief Sun', 'Sun Stick', 'Moisture Cream', 'UV Shield', 'Sun Gel', 'SPF Lotion'],
    'label': ['spf', 'spf', 'spf', 'spf', 'spf', 'spf'],
    'brand': ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandE', 'BrandF'],
    'price': [15.99, 20.99, 25.99, 30.99, 35.99, 40.99],
    'X': [1.0, 2.0, 1.5, 3.0, 4.0, 5.0],
    'Y': [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
}
yesstyle = pd.DataFrame(data)
recommendations = yesstyle_recommender('Relief Sun', 'spf', yesstyle)
print(recommendations)
This gives:
             name   brand  price      dist
0      Relief Sun  BrandA  15.99  1.000000
3       UV Shield  BrandD  30.99  0.995893
1       Sun Stick  BrandB  20.99  0.989949
2  Moisture Cream  BrandC  25.99  0.989949
4         Sun Gel  BrandE  35.99  0.989949
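As an optional alternative to the row-by-row loop, the same similarities can be computed for all rows at once with NumPy, which also avoids assigning through df.at entirely. This is only a sketch: `cosine_to_target` is a hypothetical helper name, and it assumes the 'X' and 'Y' columns are numeric and that no row sits exactly at the origin (a zero norm would cause a division by zero).

```python
import numpy as np
import pandas as pd

def cosine_to_target(df, target_xy):
    """Hypothetical helper: cosine similarity of each (X, Y) row to target_xy."""
    points = df[['X', 'Y']].to_numpy(dtype=float)  # shape (n, 2)
    target = np.asarray(target_xy, dtype=float)    # shape (2,)
    # Dot product of every row with the target, divided by the product of norms.
    return points @ target / (np.linalg.norm(points, axis=1) * np.linalg.norm(target))

# Same example coordinates as in the answer above.
sample = pd.DataFrame({
    'name': ['Relief Sun', 'Sun Stick', 'Moisture Cream', 'UV Shield', 'Sun Gel', 'SPF Lotion'],
    'X': [1.0, 2.0, 1.5, 3.0, 4.0, 5.0],
    'Y': [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
})
sample['dist'] = cosine_to_target(sample, (1.0, 1.0))
# A stable sort keeps tied rows in their original order.
top = sample.sort_values('dist', ascending=False, kind='stable').head(5)
print(top)
```

Inside the recommender, the loop could be replaced by a single assignment of this form, with `target_xy` being the scalar X and Y of the matched product.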