Pandas: comparing two DataFrames row by row using an index column

Question

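I have two CSV files with the same column names, and I want to get the row-wise differences and write them to a CSV file path.

Both files/DataFrames are indexed on the 'ID' column.

Sample DataFrames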

import pandas as pd

data1 = {
    'ID': [100, 21, 32, 42, 51, 81],
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'State': ['TX', 'FL', 'FL', 'CA', 'CA', 'TX']
}
data2 = {
    'ID': [100, 21, 32, 42, 51, 81],
    'Name': ['A', 'BB', 'C', 'DD', 'E', 'F'],  # Difference in the 2nd, 4th row
    'State': ['TX', 'TX', 'FL', 'CA', 'CA', 'TX']
}

df1 = pd.DataFrame(data1)  
df2 = pd.DataFrame(data2)

# Indexed by 'ID'

df1 = df1.set_index('ID')
df2 = df2.set_index('ID')

My logic gives me a boolean error. I have tried several approaches, but none of them seem to work.

Approach 1

# Find common indices between DataFrames
common_index = df1.index.intersection(df2.index)

# Save differences to an output file
output_file_path = 'row_wise_differences.txt'
with open(output_file_path, 'w') as file:
    for idx in common_index:
        differences = []
        for col in df1.columns:
            if df1.loc[idx, col] != df2.loc[idx, col]:
                differences.append(f"{col}: {df1.loc[idx, col]} <> {df2.loc[idx, col]}")
        if differences:
            file.write(f"Index ID: {idx}, Differences: {', '.join(differences)}\n")
        else:
            file.write(f"Index ID: {idx}, No Differences\n")

print(f"Differences saved to {output_file_path}")

Approach 2

common_index = df1.index.intersection(df2.index)

# Save differences to an output file
output_file_path = 'row_wise_differences.txt'
with open(output_file_path, 'w') as file:
    for idx in common_index:
        differences = [f"{col}: {df1.loc[idx, col]} <> {df2.loc[idx, col]}" for col in df1.columns if df1.loc[idx, col] != df2.loc[idx, col]]
        if differences:
            file.write(f"Index ID: {idx}, Differences: {', '.join(differences)}\n")
        else:
            file.write(f"Index ID: {idx}, No Differences\n")

print(f"Differences saved to {output_file_path}")

Approach 3

# Create a DataFrame showing differences as 'ID: Column: Value1 <> Value2'
diff_df = df1.loc[common_index][differences].stack().reset_index()
diff_df.columns = ['ID', 'Column', 'Difference']
diff_df['Difference'] = diff_df['Column'] + ': ' + diff_df['Difference'].astype(str)

# Save differences to an output CSV file
output_file_path = 'row_wise_differences.csv'
diff_df.to_csv(output_file_path, index=False)

print(f"Differences saved to {output_file_path}")

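Expected output

Index ID: 21, Differences: Name: B <> BB, State: FL <> TX
Index ID: 42, Differences: Name: D <> DD, State: CA <> CA

The output format doesn't matter, as long as it captures the df1 and df2 names and the differences between the rows.

Please help me with the comparison logic. All of my approaches hit the error shown below.

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks in advance for taking the time to help me!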

Answer 1

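I don't get the error you describe at all, but judging from your comments, the only real problem seems to be that you need all of the columns in the output.

You can do this by adding a little logic that makes sure every column value ends up in the output:

Setting up the data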

import pandas as pd

data1 = {
    'ID': [100, 21, 32, 42, 51, 81],
    'Name': ['A', 'B', 'C', 'D','E','F'],
    'State': ['TX', 'FL', 'FL', 'CA', 'CA', 'TX' ]
}
data2 = {
    'ID': [100, 21, 32, 42, 51, 81],
    'Name': ['A', 'BB', 'C', 'DD','E','F'],  # Difference in the 2nd,4th row
    'State': ['TX', 'TX', 'FL', 'CA', 'CA', 'TX']
}

df1 = pd.DataFrame(data1).set_index("ID")
df2 = pd.DataFrame(data2).set_index("ID")
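The question starts from two CSV files rather than from in-memory dicts. As a minimal sketch of that loading step, assuming both files have an 'ID' column, read_csv can set the index directly. The io.StringIO objects and the names file_df1 and file_df2 are stand-ins invented for this illustration; with real data you would pass file paths.

```python
import io
import pandas as pd

# Stand-ins for the two files (hypothetical content); pass real paths instead.
csv1 = io.StringIO("ID,Name,State\n100,A,TX\n21,B,FL\n")
csv2 = io.StringIO("ID,Name,State\n100,A,TX\n21,BB,TX\n")

# index_col='ID' makes ID the index at load time, so no set_index call is needed.
file_df1 = pd.read_csv(csv1, index_col='ID')
file_df2 = pd.read_csv(csv2, index_col='ID')

# Both frames should expose the same columns before they are compared.
same_columns = file_df1.columns.equals(file_df2.columns)
print(same_columns)
```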

Solution

# Find common indices between DataFrames
common_index = df1.index.intersection(df2.index)

# Save differences to an output file
found = False
output_file_path = 'row_wise_differences.txt'
with open(output_file_path, 'w') as file:
    for idx in common_index:
        differences = []
        found = False
        for col in df1.columns:
            if df1.loc[idx, col] != df2.loc[idx, col]:
                found = True
                break
        if found:
            for col in df1.columns:
                differences.append(f"{col}: {df1.loc[idx, col]}/{df2.loc[idx, col]}")
            file.write(f"Index ID: {idx}, Differences: {', '.join(differences)}\n")
        else:
            file.write(f"Index ID: {idx}, No Differences\n")
print(f"Differences saved to {output_file_path}")
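The scalar test in the loop above only works when each ID appears exactly once. One plausible cause of the ValueError in the question is duplicate IDs in the real CSV files. This is an assumption, since the sample data has no duplicates and does not reproduce the error. With a duplicated label, df.loc[idx, col] returns a Series rather than a single value, and using that in an if statement raises the error. The frame dup below is hypothetical data made up to show this.

```python
import pandas as pd

# Hypothetical data: ID 21 appears twice, unlike the sample in the question.
dup = pd.DataFrame({'ID': [21, 21, 32], 'Name': ['B', 'BB', 'C']}).set_index('ID')

value = dup.loc[21, 'Name']      # a two-row Series, not a scalar
print(type(value).__name__)

try:
    if value != value:           # same shape of test as in the loop
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Checking for duplicate labels up front makes the cause visible.
has_duplicate_ids = dup.index.has_duplicates
print(has_duplicate_ids)
```

If this check reports duplicates in your data, the IDs would need to be deduplicated, or combined with another key column, before a label-by-label comparison makes sense.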

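Alternative solution

# compare() keeps only the rows that differ; keep_equal=True shows unchanged values instead of NaN
result = df1.compare(df2, align_axis=1, keep_equal=True, result_names=('DF1', 'DF2'))
# Flatten the (column, source) MultiIndex into labels such as 'DF1-Name'
result.columns = [f'{col[1]}-{col[0]}' for col in result.columns.values]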
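As an end-to-end sketch of the compare() route, the snippet below rebuilds the sample frames, restricts both to their shared IDs, and serializes the result as CSV. Two points are assumptions rather than part of the answer: the result_names parameter needs pandas 1.5 or later, and restricting to common_index is added because compare() raises an error when the two frames are not identically labelled. The names left, right and csv_text are invented for the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': [100, 21, 32, 42, 51, 81],
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'State': ['TX', 'FL', 'FL', 'CA', 'CA', 'TX'],
}).set_index('ID')
df2 = pd.DataFrame({
    'ID': [100, 21, 32, 42, 51, 81],
    'Name': ['A', 'BB', 'C', 'DD', 'E', 'F'],
    'State': ['TX', 'TX', 'FL', 'CA', 'CA', 'TX'],
}).set_index('ID')

# compare() needs identically labelled frames, so keep only the shared IDs.
common_index = df1.index.intersection(df2.index)
left, right = df1.loc[common_index], df2.loc[common_index]

# Rows with at least one difference are kept; equal cells keep their value.
result = left.compare(right, keep_equal=True, result_names=('DF1', 'DF2'))
result.columns = [f'{side}-{col}' for col, side in result.columns]

# With no path, to_csv returns the CSV text; pass a path to write a file.
csv_text = result.to_csv()
print(csv_text)
```

For the sample data this leaves the rows for IDs 21 and 42, with one DF1 and one DF2 column for each of Name and State.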
