I want to merge two DataFrames that contain duplicate records in the merge column. Here is an example with sample DataFrames:
import pandas as pd
import numpy as np

# Sample data for df1 and df2
df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
                    'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
                    'tr2': [15, 45, 55, 35, 95, 75, 65]})
df = df1.merge(df2, how='outer')
This produces the following result:
df
id tr1 tr2
1 10 15.0
1 10 45.0
1 10 55.0
2 80 35.0
2 80 95.0
2 50 35.0
2 50 95.0
2 40 35.0
2 40 95.0
3 60 75.0
3 60 65.0
3 20 75.0
3 20 65.0
3 70 75.0
3 70 65.0
3 90 75.0
3 90 65.0
4 30 NaN
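However, this is not quite right. For example, the tr1 value 10 for id = 1 should not be repeated for every tr2 value. To work around this, I thought of adding record numbers to distinguish the rows and then discarding those numbers after the merge: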
df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
                    'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30],
                    'n1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
                    'tr2': [15, 45, 55, 35, 95, 75, 65],
                    'n2': [1, 2, 3, 4, 5, 6, 7]})
df = df1.merge(df2, how='outer')

# Remove duplicates based on record numbers
df['tr1'] = np.where(df.duplicated('n1', keep='first'), np.nan, df['tr1'])
df['tr2'] = np.where(df.duplicated('n2', keep='first'), np.nan, df['tr2'])
df = df[['id', 'tr1', 'tr2']]

# Drop rows where both 'tr1' and 'tr2' are NaN
df = df.dropna(subset=['tr1', 'tr2'], how='all')
This results in:
id tr1 tr2
1 10.0 15.0
1 NaN 45.0
1 NaN 55.0
2 80.0 35.0
2 NaN 95.0
2 50.0 NaN
2 40.0 NaN
3 60.0 75.0
3 NaN 65.0
3 20.0 NaN
3 70.0 NaN
3 90.0 NaN
4 30.0 NaN
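This is an improvement, but it is still not what I want. For example, tr1 in the second row for id = 2 should not be NaN. The rows for id = 2 should be: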
2 80.0 35.0
2 50.0 95.0
2 40.0 NaN
And for id = 3:
3 60.0 75.0
3 20.0 65.0
3 70.0 NaN
3 90.0 NaN
The desired output is:
id tr1 tr2
1 10 15
1 nan 45
1 nan 55
2 80 35
2 50 95
2 40 nan
3 60 75
3 20 65
3 70 nan
3 90 nan
4 30 nan
Any idea how I can achieve this?
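You can solve this by adding a cumulative count to both DataFrames, which uniquely identifies each repeated value in the merge column. Then merge the DataFrames on both the merge column and the cumulative count.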
import pandas as pd

def merge_with_repeated_values(df1, df2, merge_column):
    """
    Merge DataFrames with repeated values using cumulative count.

    Args:
        df1, df2 (pd.DataFrame): DataFrames to merge.
        merge_column (str): Column to merge on.

    Returns:
        pd.DataFrame: Merged DataFrame with all aligned rows.
    """
    # Add cumulative count to handle repeated values.
    # assign() returns copies, so the caller's DataFrames are not modified.
    temp_count_col = '__temp_count__'
    left = df1.assign(**{temp_count_col: df1.groupby(merge_column).cumcount()})
    right = df2.assign(**{temp_count_col: df2.groupby(merge_column).cumcount()})
    # Merge on both original column and count
    merged_df = pd.merge(left, right, on=[merge_column, temp_count_col], how='outer')
    # Drop the temporary count column
    return merged_df.drop(columns=[temp_count_col])
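
To check the idea against the sample data from the question, here is a minimal self-contained sketch of the same cumulative-count merge written inline. The helper column name `occurrence` is only an illustrative choice, and the explicit `sort_values` step is an addition so that the row order does not depend on how the outer merge orders its keys.

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
                    'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
                    'tr2': [15, 45, 55, 35, 95, 75, 65]})

# Number each row within its id group (0, 1, 2, ...).
# 'occurrence' is a hypothetical helper column, not part of the original data.
left = df1.assign(occurrence=df1.groupby('id').cumcount())
right = df2.assign(occurrence=df2.groupby('id').cumcount())

# Pair the n-th row of each id in df1 with the n-th row of the same id in df2,
# keeping unmatched rows from either side.
result = (
    left.merge(right, on=['id', 'occurrence'], how='outer')
        .sort_values(['id', 'occurrence'], ignore_index=True)
        .drop(columns='occurrence')
)
print(result)
```

With this data, the rows line up as in the desired output: id = 2 gives (80, 35), (50, 95), (40, NaN), and id = 3 gives (60, 75), (20, 65), (70, NaN), (90, NaN). Both tr1 and tr2 come back as floats because they contain NaN.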