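I would like to know how to improve my conflict-resolution function. My idea is that when a merge gives me three different values, I compute a score for each retrieved value (the sum of its similarity ratios against the other values: value_x, value_y, value_z, and so on), and the value with the highest score is the one placed in the DataFrame.
Let's take an example: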
value_x    value_y    value_z
mariana    muntean    munte
score_value_x = fuzz.ratio('mariana', 'muntean') + fuzz.ratio('mariana', 'munte') = 76
score_value_y = fuzz.ratio('muntean', 'mariana') + fuzz.ratio('muntean', 'munte') = 126
score_value_z = fuzz.ratio('munte', 'mariana') + fuzz.ratio('munte', 'muntean') = 116
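So value_y = muntean has the highest score, and "muntean" is what gets placed in the "value" column. Is this a good approach or a bad one? Any suggestions? I'm sure there must be a smarter way to do this.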
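The scoring rule in the example above can be sketched as a small standalone function. This is only an illustration: `ratio` and `consensus` are hypothetical helper names, and `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio` so the snippet runs without extra packages (the two can disagree slightly depending on how fuzzywuzzy is installed).

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio (an assumption): 0-100 similarity via difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def consensus(values):
    """Return (best_value, scores).

    Each score is the sum of a value's similarity to every other value.
    On a tie, the first value with the maximum score wins.
    """
    scores = [
        sum(ratio(v, other) for j, other in enumerate(values) if j != i)
        for i, v in enumerate(values)
    ]
    best = values[scores.index(max(scores))]
    return best, scores


best, scores = consensus(['mariana', 'muntean', 'munte'])
print(best, scores)
```

For the three names in the example this picks "muntean", the same winner as the hand calculation.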
import pandas as pd
from fuzzywuzzy import fuzz

# DataFrames A, B, and C
A = pd.DataFrame({'id': ['1', '2', '3', '4', '5', '6'],
                  'name': ['robert', 'mariana', 'jhon', 'bogdan', 'alex', '9'],
                  'occupation': ['constructor', 'witch', 'buldozer', 'pirate', 'doctor', '']})
B = pd.DataFrame({'id': ['1', '2', '3', '4', '5', '6'],
                  'name': ['dragos', 'muntean', 'palavra', 'javra', 'caine', '10'],
                  'age': ['19', '22', '66', '23', '55', '']})
C = pd.DataFrame({'id': ['1', '2', '3', '4', '5', '6'],
                  'name': ['robi', 'munte', 'pala', 'juva', 'catel', '19'],
                  'timeLeft': ['30', '28', '36', '5', '100', '']})

# Create a new DataFrame with columns from A, B, and C
nume = pd.DataFrame({
    'name_A': A['name'],  # Names from DataFrame A
    'name_B': B['name'],  # Names from DataFrame B
    'name_C': C['name']   # Names from DataFrame C
})

scores = []
for _, record in nume.iterrows():
    row_scores = []
    # For each name in the row, calculate the similarity score by comparing it with the other two names
    for i, name in enumerate(record):
        # Initialize score for the current name
        score = 0
        # Compare the current name with all the other names in the row
        for j, other_name in enumerate(record):
            if i != j:
                score += fuzz.ratio(name, other_name)
        row_scores.append(score)
    scores.append(row_scores)

score_df = pd.DataFrame(scores, columns=['score_A', 'score_B', 'score_C'])

max_scores_names = []
for index, row in score_df.iterrows():
    max_index = row.idxmax()
    max_col_index = score_df.columns.get_loc(max_index)
    max_name = nume.iloc[index, max_col_index]  # Get the corresponding name
    max_scores_names.append(max_name)

# Add the names with maximum scores to the nume DataFrame
nume['max_score_name'] = max_scores_names

# Print the names DataFrame and scores DataFrame
print(nume)
print()
print(score_df)
    name_A   name_B name_C max_score_name
0   robert   dragos   robi           robi
1  mariana  muntean  munte        muntean
2     jhon  palavra   pala        palavra
3   bogdan    javra   juva           juva
4     alex    caine  catel          caine
5        9       10     19             19

   score_A  score_B  score_C
0       93       73      100
1       76      126      116
2        0       73       73
3       38       62       87
4       88      104      104
5       67       50      117
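One detail visible in the output above: rows 2 and 4 have tied top scores (73/73 and 104/104), and `idxmax` returns the first column that reaches the maximum, so `name_B` wins in both rows. A minimal sketch of flagging such rows follows; the score table is copied from the output above, and `tied_rows` is a hypothetical variable name.

```python
import pandas as pd

# Score table copied from the printed output
score_df = pd.DataFrame(
    [[93, 73, 100], [76, 126, 116], [0, 73, 73],
     [38, 62, 87], [88, 104, 104], [67, 50, 117]],
    columns=['score_A', 'score_B', 'score_C'],
)

# A row is tied when more than one column equals that row's maximum
row_max = score_df.max(axis=1)
is_tied = score_df.eq(row_max, axis=0).sum(axis=1) > 1
tied_rows = score_df.index[is_tied].tolist()

# idxmax keeps the first column holding the maximum
winners = score_df.idxmax(axis=1)

print(tied_rows)
print(winners[tied_rows])
```

Whether a tie should fall back to a fixed source priority, the longest string, or manual review is a design decision this rule does not make on its own.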
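I'm not sure what you are trying to do here, but there is definitely a better way, simply by avoiding the explicit iteration over every row. I would first merge the three DataFrames on "id" and then apply your score calculation. Something like this: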
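# Merge A and B first: their clashing 'name' columns become 'name_x' and 'name_y'.
# C's 'name' column then no longer clashes, so it keeps the plain name 'name'.
df = pd.merge(
    left=pd.merge(
        left=A,
        right=B,
        on='id'
    ),
    right=C,
    on='id')

# Row-wise similarity between the names coming from A and B
df.apply(lambda x: fuzz.ratio(x['name_x'], x['name_y']), axis=1)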
If you need a more detailed solution, please provide more details.
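As one possible way to extend the merge-then-apply idea to all three names, the sketch below renames the `name` columns before merging and picks the highest-scoring name per row. It rests on several assumptions: the data is trimmed to two rows, the column names `name_A`/`name_B`/`name_C` and the helpers `ratio` and `pick_name` are made up for illustration, and `difflib.SequenceMatcher` stands in for `fuzz.ratio` so the snippet has no dependency beyond pandas.

```python
from difflib import SequenceMatcher
from functools import reduce

import pandas as pd


def ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio (an assumption): 0-100 similarity via difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Two-row excerpt of the question's data
A = pd.DataFrame({'id': ['1', '2'], 'name': ['robert', 'mariana']})
B = pd.DataFrame({'id': ['1', '2'], 'name': ['dragos', 'muntean']})
C = pd.DataFrame({'id': ['1', '2'], 'name': ['robi', 'munte']})

# Rename the conflicting column up front so the merged names are unambiguous
frames = [
    frame.rename(columns={'name': f'name_{tag}'})
    for frame, tag in zip([A, B, C], 'ABC')
]
merged = reduce(lambda left, right: pd.merge(left, right, on='id'), frames)

name_cols = ['name_A', 'name_B', 'name_C']


def pick_name(row: pd.Series) -> str:
    """Return the name whose summed similarity to the other names is highest."""
    names = [row[col] for col in name_cols]
    totals = [
        sum(ratio(n, other) for j, other in enumerate(names) if j != i)
        for i, n in enumerate(names)
    ]
    return names[totals.index(max(totals))]  # first maximum wins on a tie


merged['name'] = merged.apply(pick_name, axis=1)
print(merged)
```

Note that `apply(..., axis=1)` still calls the function once per row, so the gain here is mostly in readability and in keeping everything in one merged frame rather than in raw speed.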