Pandas: merging two DataFrames with repeated values in the merge column


I want to merge two DataFrames whose merge column contains repeated values. Here is an example with sample DataFrames:

import pandas as pd
import numpy as np

# Sample data for df1 and df2
df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
                    'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30]})

df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
                    'tr2': [15, 45, 55, 35, 95, 75, 65]})

df = df1.merge(df2, how='outer')

This produces the following result:

df
    id   tr1   tr2
    1   10  15.0
    1   10  45.0
    1   10  55.0
    2   80  35.0
    2   80  95.0
    2   50  35.0
    2   50  95.0
    2   40  35.0
    2   40  95.0
    3   60  75.0
    3   60  65.0
    3   20  75.0
    3   20  65.0
    3   70  75.0
    3   70  65.0
    3   90  75.0
    3   90  65.0
    4   30   NaN

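However, this is not quite what I want. For example, the tr1 value 10 for id = 1 should not be repeated against every tr2 value. To work around this, I thought of adding record numbers to tell the rows apart and then discarding those numbers after the merge: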

df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
                    'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30],
                    'n1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
                    'tr2': [15, 45, 55, 35, 95, 75, 65],
                    'n2': [1, 2, 3, 4, 5, 6, 7]})

df = df1.merge(df2, how='outer')

# Remove duplicates based on record numbers
df['tr1'] = np.where(df.duplicated('n1', keep='first'), np.nan, df['tr1'])
df['tr2'] = np.where(df.duplicated('n2', keep='first'), np.nan, df['tr2'])
df = df[['id', 'tr1', 'tr2']]

# Drop rows where both 'tr1' and 'tr2' are NaN
df = df.dropna(subset=['tr1', 'tr2'], how='all')

This results in:

    id   tr1   tr2
    1  10.0  15.0
    1   NaN  45.0
    1   NaN  55.0
    2  80.0  35.0
    2   NaN  95.0
    2  50.0   NaN
    2  40.0   NaN
    3  60.0  75.0
    3   NaN  65.0
    3  20.0   NaN
    3  70.0   NaN
    3  90.0   NaN
    4  30.0   NaN

This is an improvement, but still not what I want. For example, tr1 in the second row for id = 2 should not be NaN. The rows for id = 2 should be:

2  80.0  35.0
2  50.0  95.0
2  40.0   NaN

And for id = 3:

3  60.0  75.0
3  20.0  65.0
3  70.0   NaN
3  90.0   NaN

The expected output is:

id  tr1 tr2
1   10  15
1   nan 45
1   nan 55
2   80  35
2   50  95
2   40  nan
3   60  75
3   20  65
3   70  nan
3   90  nan
4   30  nan

Any idea how I can achieve this?

pandas merge
1 Answer

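You can solve this by adding a cumulative count to both DataFrames, so that each repeated value in the merge column is uniquely identified by its position within its group. Then merge the DataFrames on both the merge column and the cumulative count.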
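As a quick illustration of the cumulative count, here is a minimal sketch using df1 from the question (the variable name `occurrence` is only for this example). Within each id, the first row gets 0, the second 1, and so on, which is what lets the n-th row of an id in df1 pair with the n-th row of the same id in df2:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
                    'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30]})

# cumcount() numbers the rows inside each id group, starting at 0
occurrence = df1.groupby('id').cumcount()
print(occurrence.tolist())
# [0, 0, 1, 2, 0, 1, 2, 3, 0]
```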

import pandas as pd

def merge_with_repeated_values(df1, df2, merge_column):
    """
    Merge DataFrames with repeated values using cumulative count.

    The n-th occurrence of a value in df1 is paired with the n-th
    occurrence of the same value in df2. The inputs are not modified.

    Args:
        df1, df2 (pd.DataFrame): DataFrames to merge.
        merge_column (str): Column to merge on.

    Returns:
        pd.DataFrame: Merged DataFrame with all aligned rows.
    """
    # Add cumulative count to handle repeated values.
    # assign() returns copies, so the caller's DataFrames keep their columns.
    temp_count_col = '__temp_count__'
    left = df1.assign(**{temp_count_col: df1.groupby(merge_column).cumcount()})
    right = df2.assign(**{temp_count_col: df2.groupby(merge_column).cumcount()})

    # Merge on both original column and count
    merged_df = pd.merge(left, right, on=[merge_column, temp_count_col], how='outer')

    # Drop the temporary count column
    return merged_df.drop(columns=[temp_count_col])
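
To check the approach against the question's data, here is a self-contained sketch. The helper name `merge_by_occurrence` and the temporary column `_occ` are hypothetical names chosen for this example. It uses the same cumulative-count technique as the function above, but copies the inputs with `assign()` and sorts the result explicitly instead of relying on the merge's output order.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
                    'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
                    'tr2': [15, 45, 55, 35, 95, 75, 65]})

def merge_by_occurrence(left, right, key):
    # Hypothetical helper name; assumes neither frame already has an '_occ' column.
    # assign() returns new DataFrames, so the inputs are left untouched.
    left = left.assign(_occ=left.groupby(key).cumcount())
    right = right.assign(_occ=right.groupby(key).cumcount())
    # Pair the n-th occurrence of each key on the left with the n-th on the right
    merged = left.merge(right, on=[key, '_occ'], how='outer')
    # Sort explicitly by key and occurrence, then discard the helper column
    merged = merged.sort_values([key, '_occ'], ignore_index=True)
    return merged.drop(columns='_occ')

result = merge_by_occurrence(df1, df2, 'id')
print(result)
```

The result has 11 rows, one per row of the desired output in the question. tr1 and tr2 come back as floats because the unmatched positions are filled with NaN.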