How to efficiently resolve conflicts when merging datasets

Question

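I would like to know how to improve my conflict-resolution function. My idea is that when a merge yields three different values, I compute a score for each retrieved value (the sum of its similarity ratios against the other values: value_x, value_y, value_z, and so on), and the value with the highest score is the one placed in the DataFrame.

Let's take an example: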

value_x  value_y  value_z
mariana  muntean  munte

score_value_x = fuzz.ratio('mariana', 'muntean') + fuzz.ratio('mariana', 'munte') = 76
score_value_y = fuzz.ratio('muntean', 'mariana') + fuzz.ratio('muntean', 'munte') = 126
score_value_z = fuzz.ratio('munte', 'mariana') + fuzz.ratio('munte', 'muntean') = 116

So value_y = muntean has the highest score, and "muntean" is placed in the "value" column. Is this a good approach? A bad one? Any suggestions? I'm sure there must be a smarter way to do this.
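The arithmetic in the example can be checked with the standard library alone. This is only a sketch: `ratio` and `consensus_scores` are hypothetical helper names, and `difflib.SequenceMatcher` is used as a stand-in for `fuzz.ratio` (an assumption: fuzzywuzzy falls back to `SequenceMatcher` when python-Levenshtein is not installed, and scores may differ slightly when it is).

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio (assumption): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def consensus_scores(values):
    """For each value, sum its similarity to every other value in the list."""
    return [
        sum(ratio(value, other) for j, other in enumerate(values) if j != i)
        for i, value in enumerate(values)
    ]


values = ["mariana", "muntean", "munte"]
scores = consensus_scores(values)
best = values[scores.index(max(scores))]  # ties go to the first candidate
print(scores, best)
```

For the three names above this reproduces the scores 76, 126 and 116 from the example, so "muntean" is selected.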

import pandas as pd
from fuzzywuzzy import fuzz

# DataFrames A, B, and C
A = pd.DataFrame({'id': ['1', '2', '3', '4', '5', '6'],
                  'name': ['robert', 'mariana', 'jhon', 'bogdan', 'alex', '9'],
                  'occupation': ['constructor', 'witch', 'buldozer', 'pirate', 'doctor','']})

B = pd.DataFrame({'id': ['1', '2', '3', '4', '5', '6'],
                  'name': ['dragos', 'muntean', 'palavra', 'javra', 'caine', '10'],
                  'age': ['19', '22', '66', '23', '55', '']})

C = pd.DataFrame({'id': ['1', '2', '3', '4', '5', '6'],
                  'name': ['robi', 'munte', 'pala', 'juva', 'catel', '19'],
                  'timeLeft': ['30', '28', '36', '5', '100', '']})

# Create a new DataFrame with columns from A, B, and C
nume = pd.DataFrame({
    'name_A': A['name'],  # Names from DataFrame A
    'name_B': B['name'],  # Names from DataFrame B
    'name_C': C['name']   # Names from DataFrame C
})
scores = []
for _, record in nume.iterrows():
    row_scores = []
    # For each name in the row, calculate the similarity score by comparing it with the other two names
    for i, name in enumerate(record):
        # Initialize score for the current name
        score = 0
        # Compare the current name with all the other names in the row
        for j, other_name in enumerate(record):
            if i != j:
                score += fuzz.ratio(name, other_name)
        row_scores.append(score)
    scores.append(row_scores)
score_df = pd.DataFrame(scores, columns=['score_A', 'score_B', 'score_C'])
max_scores_names = []
for index, row in score_df.iterrows():
    max_index = row.idxmax()  
    max_col_index = score_df.columns.get_loc(max_index)  
    max_name = nume.iloc[index, max_col_index]  # Get the corresponding name
    max_scores_names.append(max_name)

# Add the names with maximum scores to the nume DataFrame
nume['max_score_name'] = max_scores_names

# Print the names DataFrame and scores DataFrame
print(nume)
print()
print(score_df)

Output:

    name_A   name_B name_C max_score_name
0   robert   dragos   robi           robi
1  mariana  muntean  munte        muntean
2     jhon  palavra   pala        palavra
3   bogdan    javra   juva           juva
4     alex    caine  catel          caine
5        9       10     19             19

   score_A  score_B  score_C
0       93       73      100
1       76      126      116
2        0       73       73
3       38       62       87
4       88      104      104
5       67       50      117

Answer 1

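I'm not sure exactly what you are trying to do here, but there is certainly a better way that avoids the explicit loops over all the rows. I would first merge the three DataFrames on 'id' and then apply your score calculation. Something like this: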

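# Merge A and B first (their 'name' columns become name_x and name_y),
# then merge the result with C (whose 'name' column keeps its name).
df = pd.merge(
    left=pd.merge(left=A, right=B, on='id'),
    right=C,
    on='id',
)
# Row-wise similarity between the names coming from A and B
df.apply(lambda x: fuzz.ratio(x['name_x'], x['name_y']), axis=1)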

If you need a more detailed solution, please provide more details.
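As a rough illustration of how the merge-then-score idea could be carried through to a single resolved column, here is a minimal sketch. It makes several assumptions: the helpers `ratio` and `pick_consensus` are hypothetical names, `difflib.SequenceMatcher` stands in for `fuzz.ratio` so the example has no dependency beyond pandas, the sample data is trimmed to the first two rows, and the `name` columns are renamed per source before merging so that no `_x`/`_y` suffixes are produced.

```python
from difflib import SequenceMatcher
from functools import reduce

import pandas as pd


def ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio (assumption): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def pick_consensus(names):
    """Return the name with the highest summed similarity to the others."""
    names = list(names)
    totals = [
        sum(ratio(name, other) for j, other in enumerate(names) if j != i)
        for i, name in enumerate(names)
    ]
    return names[totals.index(max(totals))]  # ties go to the first candidate


# Two-row versions of the question's DataFrames
A = pd.DataFrame({'id': ['1', '2'], 'name': ['robert', 'mariana'],
                  'occupation': ['constructor', 'witch']})
B = pd.DataFrame({'id': ['1', '2'], 'name': ['dragos', 'muntean'],
                  'age': ['19', '22']})
C = pd.DataFrame({'id': ['1', '2'], 'name': ['robi', 'munte'],
                  'timeLeft': ['30', '28']})

# Rename 'name' per source so the merged columns are unambiguous.
frames = {'A': A, 'B': B, 'C': C}
renamed = [df.rename(columns={'name': f'name_{key}'}) for key, df in frames.items()]
merged = reduce(lambda left, right: pd.merge(left, right, on='id'), renamed)

# Resolve the conflict row by row and store the winner in a single column.
name_cols = [f'name_{key}' for key in frames]
merged['name'] = [pick_consensus(row)
                  for row in merged[name_cols].itertuples(index=False)]
print(merged[['id', 'name']])
```

This still visits every row in Python, since the string comparison itself is not vectorized, but it does so in one pass and keeps the other columns (`occupation`, `age`, `timeLeft`) aligned with the resolved name. On the two sample rows it picks "robi" and "muntean", the same choices as in the question's output.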
