Compare two DataFrames and check whether combinations from the first df are in the second

Question

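I have two DataFrames of city data, each with many rows and a few columns. I'm trying to find a way to check whether the values in dfA are in dfB, then print, in one list, the values from dfA together with the index of the matching values in dfB, and print, in another list, the values that have no match. The order of the information within a row does not have to be the same in both DataFrames, but taken as a whole the row must contain the same information. For example, dfA index 1 (New York) would match dfB index 3, and since there is no row for Atlanta in dfA but there is one in dfB, it would be printed in the second list.

For example: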

dfA

Index	Column 1	Column 2	Column 3
0	Albuquerque	NM	87101
1	New York	NY	10009
2	Miami	FL	33101

dfB

Index	Column 1	Column 2	Column 3
0	NM	Albuquerque	87101
1	Atlanta	GA	30033
2	San Francisco	CA	94016
3	10009	NY	New York

Answer 1

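One way to achieve this is to define a function that treats each row of dfA as a collection of values (a set, so that order is ignored) and compares it with each row of dfB treated the same way: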

import pandas as pd

# Sample data from the question; every value is stored as a string.
data_A = {'Index': [0, 1, 2], 'Column 1': ['Albuquerque', 'New York', 'Miami'],
          'Column 2': ['NM', 'NY', 'FL'], 'Column 3': ['87101', '10009', '33101']}
dfA = pd.DataFrame(data_A).set_index('Index')

data_B = {'Index': [0, 1, 2, 3], 'Column 1': ['NM', 'Atlanta', 'San Francisco', '10009'],
          'Column 2': ['Albuquerque', 'GA', 'CA', 'NY'], 'Column 3': ['87101', '30033', '94016', 'New York']}
dfB = pd.DataFrame(data_B).set_index('Index')
print(dfA)
print(dfB)

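def match_rows(dfA, dfB):
    in_both = []       # (index, values) for dfA rows that have a counterpart in dfB
    not_in_both = []   # (index, values) for rows of either frame without a counterpart

    # Pass 1: look for each dfA row in dfB, ignoring the order of values in the row.
    for index_a, row_a in dfA.iterrows():
        match_found = False
        for _, row_b in dfB.iterrows():
            if set(row_a) == set(row_b):
                in_both.append((index_a, row_a.to_list()))
                match_found = True
                break
        if not match_found:
            not_in_both.append((index_a, row_a.to_list()))

    # Pass 2: collect the dfB rows that no dfA row matches.
    for index_b, row_b in dfB.iterrows():
        if not any(set(row_b) == set(row_a) for _, row_a in dfA.iterrows()):
            not_in_both.append((index_b, row_b.to_list()))

    return in_both, not_in_both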

matches, non_matches = match_rows(dfA, dfB)

matches, non_matches


def format_output(matches, non_matches):
    # Rows of dfA as plain lists, used to tell which frame a non-matching row came from.
    # The index label alone cannot decide this: labels 1 and 2 exist in both dfA and dfB.
    rows_a = dfA.values.tolist()
    formatted_matches = [
        f"Matching from dfA, Index {index}: {values}" for index, values in matches
    ]
    formatted_non_matches = [
        f"Not Matching from {'dfA' if values in rows_a else 'dfB'}, Index {index}: {values}"
        for index, values in non_matches
    ]
    return formatted_matches, formatted_non_matches

formatted_matches, formatted_non_matches = format_output(matches, non_matches)

formatted_matches, formatted_non_matches

I introduced a second function here to format the output in a more readable way:

(["Matching from dfA, Index 0: ['Albuquerque', 'NM', '87101']",
  "Matching from dfA, Index 1: ['New York', 'NY', '10009']"],
 ["Not Matching from dfA, Index 2: ['Miami', 'FL', '33101']",
  "Not Matching from dfB, Index 1: ['Atlanta', 'GA', '30033']",
  "Not Matching from dfB, Index 2: ['San Francisco', 'CA', '94016']"])

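The question also asks for the dfB index of each match, which the lists above do not carry. The following is a minimal sketch of one way that could be done, not part of the original answer. The names `row_key` and `match_with_positions` are hypothetical, and the sketch assumes that comparing values as strings is acceptable (so the integer 10009 and the string '10009' would count as equal). It uses a sorted tuple rather than a set as the key, so repeated values within a row are not collapsed, and it builds a lookup of dfB rows once instead of rescanning dfB for every dfA row.

```python
import pandas as pd
from collections import defaultdict


def row_key(row):
    # Order-insensitive key: the row's values as strings, sorted into a tuple.
    # Unlike set(), this keeps duplicates, so ('a', 'a', 'b') != ('a', 'b', 'b').
    return tuple(sorted(str(v) for v in row))


def match_with_positions(df_a, df_b):
    # Map each key to the list of df_b index labels whose row has that key.
    b_lookup = defaultdict(list)
    for idx_b, row_b in df_b.iterrows():
        b_lookup[row_key(row_b)].append(idx_b)

    matched = []     # (df_a index, df_a values, [matching df_b indexes])
    only_in_a = []   # (df_a index, df_a values)
    keys_a = set()
    for idx_a, row_a in df_a.iterrows():
        key = row_key(row_a)
        keys_a.add(key)
        if key in b_lookup:
            matched.append((idx_a, row_a.to_list(), b_lookup[key]))
        else:
            only_in_a.append((idx_a, row_a.to_list()))

    # df_b rows whose key never appeared in df_a.
    only_in_b = [
        (idx_b, row_b.to_list())
        for idx_b, row_b in df_b.iterrows()
        if row_key(row_b) not in keys_a
    ]
    return matched, only_in_a, only_in_b


# Same sample data as in the answer above.
df_a = pd.DataFrame({
    "Column 1": ["Albuquerque", "New York", "Miami"],
    "Column 2": ["NM", "NY", "FL"],
    "Column 3": ["87101", "10009", "33101"],
})
df_b = pd.DataFrame({
    "Column 1": ["NM", "Atlanta", "San Francisco", "10009"],
    "Column 2": ["Albuquerque", "GA", "CA", "NY"],
    "Column 3": ["87101", "30033", "94016", "New York"],
})

matched, only_in_a, only_in_b = match_with_positions(df_a, df_b)
print(matched)
print(only_in_a)
print(only_in_b)
```

With the sample data, dfA index 0 pairs with dfB index 0 and dfA index 1 pairs with dfB index 3, while Miami ends up in `only_in_a` and Atlanta and San Francisco end up in `only_in_b`. Keeping the two non-matching groups in separate lists avoids having to work out afterwards which frame a row came from.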