Pandas: complex grouping using matching criteria from another table


I'm having trouble describing this problem generally enough to give the question a useful title, but here it is. I'm trying to merge/group the rows of a table based on ids in its columns, where a separate table declares which ids belong to which groups. The situation is complicated by the fact that the ids are composite objects spread across two columns. Here is a toy version of the problem, together with a working but very inefficient solution:

# Inputs:

columns = ["time", "id1_list", "id2_list", "id3"]
data = [(1, ("A", "B"), (1, 2), 1),
        (1, ("A", "B"), (2, 3), 2),
        (1, ("A", "B"), (4, 5), 3),
        (1, ("A", "B"), (6, 7), 4),        
        (1, ("A", "C"), (1, 1), 5),
        (2, ("A", "B"), (1, 3), 1),
        (2, ("A", "B"), (2, 3), 2),        
        (2, ("A", "B"), (4, 3), 3),
        (2, ("A", "C"), (1, 1), 4)]

merge_cols = ["time", "id1", "id2_lists"]
merge_data = [(1, "A", ((1, 2), (3, 4))),
              (1, "B", ((3, 5),)),
              (2, "A", ((1, 2), (3, 4))),
              (2, "B", ((3, 5),))]

output_columns = ["time", "id3_lists"]
expected_output = [(1, ((1, 2, 3, 5), (4,))),
                   (2, ((1, 2, 3, 4),))]

The inefficient solution:

import itertools

import pandas as pd

# Group by time
df_g = pd.DataFrame(data, columns=columns).groupby("time")
df_merge_data_g = pd.DataFrame(merge_data, columns=merge_cols).groupby("time")


def match(g, id3_A, id3_B, df_merge_data_t):
    # Get id match info
    rowA = g.query("id3==@id3_A").iloc[0]
    rowB = g.query("id3==@id3_B").iloc[0]
    id1sA = rowA["id1_list"]
    id1sB = rowB["id1_list"]
    id2sA = rowA["id2_list"]
    id2sB = rowB["id2_list"]
    matched = False
    for id1_A, id2_A in zip(id1sA, id2sA):
        if matched: break           
        for id1_B, id2_B in zip(id1sB, id2sB):
            if matched: break
            if id1_A==id1_B:
                # print(id1_A, id2_A, id1_B, id2_B)
                match_groups = df_merge_data_t.query("id1==@id1_A")["id2_lists"].iloc[0]                 
                # print(match_groups)
                for match_g in match_groups:
                    # print(match_g)
                    # print(id2_A in match_g, id2_B in match_g)
                    if id2_A in match_g and id2_B in match_g:
                        matched = True
                        break
    # print("matched:", matched)
    return matched


def merge(data):
    for x in set(data):
        for y in set(data):
            if x == y:
                continue
            if not x.isdisjoint(y):
                data.remove(x)
                data.remove(y)
                data.add(x.union(y))
                return merge(data)
    return data


def get_match_groups(g):
    #print(g)    
    df_merge_data_t = df_merge_data_g.get_group(g.name)
    
    # Form all pairings of items to be matched
    pairs = list(itertools.combinations(g.id3, 2))
    
    # Check each pair for match
    matched_pairs = set(frozenset(pair) for pair in pairs if match(g, *pair, df_merge_data_t))

    print("matched_pairs")
    print(matched_pairs)
    
    # Merge pairs with common elements to get connected groups of matches
    merged_matches = merge(matched_pairs)
    
    # Add back any items that weren't matched with anything as singleton groups
    unused = set(frozenset((id3,)) for id3 in set(g.id3) if not any(id3 in m for m in merged_matches))
    merged_matches.update(unused)
    
    return merged_matches


out = df_g.apply(get_match_groups, include_groups=False)

Output:

time
1    {(1, 2, 3, 5), (4)}
2         {(1, 2, 3, 4)}
dtype: object

Expected output:

pd.DataFrame(expected_output, columns=output_columns)["id3_lists"]

0    ((1, 2, 3, 5), (4,))
1         ((1, 2, 3, 4),)
Name: id3_lists, dtype: object

In words: we have the data table. I want to form groups of id3 values based on matching criteria involving id1_list and id2_list. Essentially, each row describes a composite object built from other objects, whose id information is listed in id1_list and id2_list. id1_list and id2_list are aligned, so the pairs produced by zip(id1_list, id2_list) describe the contributing objects.
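For example, the first row of data expands to these contributing-object pairs:

row = data[0]                      # (1, ("A", "B"), (1, 2), 1)
pairs = list(zip(row[1], row[2]))  # [("A", 1), ("B", 2)]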

Next, we have the merge_data table. This table describes equivalences between those contributing objects (the objects labelled by (id1, id2) pairs). So, for example, the first row of merge_data:

(1, "A", ((1, 2), (3, 4)))

says that at time 1, among the objects with id1="A", those whose id2 is in the set (1, 2) should be considered "matched", and likewise for (3, 4). If it helps, these sets are always disjoint.
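To make the criterion concrete, here is a minimal sketch of the matching rule as I read it; pairs_match is a hypothetical helper name, not part of the data or code above:

def pairs_match(time, pair_a, pair_b, merge_rows):
    # Return True if two (id1, id2) pairs are equivalent according to
    # merge_rows, which is the merge_data list of (time, id1, id2_lists) tuples.
    (id1_a, id2_a), (id1_b, id2_b) = pair_a, pair_b
    if id1_a != id1_b:
        return False
    for t, id1, id2_lists in merge_rows:
        if t == time and id1 == id1_a:
            # Matched only if both id2 values fall in the same declared set
            return any(id2_a in s and id2_b in s for s in id2_lists)
    return False

pairs_match(1, ("A", 1), ("A", 2), merge_data)  # True: 1 and 2 share the set (1, 2)
pairs_match(1, ("A", 1), ("A", 3), merge_data)  # False: 1 and 3 are in different sets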

So the goal is: for each time t, collect all the rows of data and merge_data at that time, then use merge_data to work out which rows of data "match". Then output a table of id3 groups formed by this criterion per time group, merging items into a single group whenever there is any chain of "links" between them.

I would be happy to restructure this data so that some clever pandas merge operation could do all the work efficiently, or perhaps some graph algorithm (see the restructuring sketch below). But so far all I have come up with is the very slow brute-force solution above.
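For reference, one possible restructuring (a minimal sketch, not a full solution; it assumes pandas 1.3+ for multi-column explode, and the names df_long, dfm_long and grp are only illustrative) is to explode the aligned id lists into long format and attach a per-(time, id1) group label from merge_data. The transitive merging of matched rows would still need a union-find or connected-components pass afterwards:

import pandas as pd

df = pd.DataFrame(data, columns=columns)
dfm = pd.DataFrame(merge_data, columns=merge_cols)

# One row per contributing (id1, id2) pair of each composite object
df_long = (df.explode(["id1_list", "id2_list"])
             .rename(columns={"id1_list": "id1", "id2_list": "id2"}))

# One row per (time, id1, id2) with the index of the id2 set it belongs to
dfm_long = (dfm.explode("id2_lists")
               .assign(grp=lambda d: d.groupby(["time", "id1"]).cumcount())
               .explode("id2_lists")
               .rename(columns={"id2_lists": "id2"}))

# Label each contributing pair with its merge group (NaN if unmatched)
labeled = df_long.merge(dfm_long, on=["time", "id1", "id2"], how="left")
print(labeled)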

python pandas dataframe group-by pandas-merge
1 Answer

0 votes

You can use NetworkX to treat the connectivity as a graph problem. This reduces the computational complexity compared to the brute-force approach!

import pandas as pd
import networkx as nx

# Inputs
columns = ["time", "id1_list", "id2_list", "id3"]
data = [(1, ("A", "B"), (1, 2), 1),
        (1, ("A", "B"), (2, 3), 2),
        (1, ("A", "B"), (4, 5), 3),
        (1, ("A", "B"), (6, 7), 4),        
        (1, ("A", "C"), (1, 1), 5),
        (2, ("A", "B"), (1, 3), 1),
        (2, ("A", "B"), (2, 3), 2),        
        (2, ("A", "B"), (4, 3), 3),
        (2, ("A", "C"), (1, 1), 4)]

merge_cols = ["time", "id1", "id2_lists"]
merge_data = [(1, "A", ((1, 2), (3, 4))),
              (1, "B", ((3, 5),)),
              (2, "A", ((1, 2), (3, 4))),
              (2, "B", ((3, 5),))]

# Create DataFrames
df_data = pd.DataFrame(data, columns=columns)
df_merge_data = pd.DataFrame(merge_data, columns=merge_cols)

def find_connected_components(group):
    G = nx.Graph()

    # Merge criteria for this time step, keyed by id1
    merge_t = df_merge_data.query("time == @group.name").set_index("id1")["id2_lists"]

    # Build a bipartite graph: id3 values on one side, (id1, id2-set) "group"
    # nodes on the other. Two id3 values end up in the same connected component
    # whenever they share at least one declared id2 set.
    for _, row in group.iterrows():
        id3 = row["id3"]
        G.add_node(id3)  # keep unmatched rows as singleton components
        for id1, id2 in zip(row["id1_list"], row["id2_list"]):
            if id1 not in merge_t.index:
                continue
            for id2_set in merge_t.loc[id1]:
                if id2 in id2_set:
                    G.add_edge(id3, (id1, id2_set))

    # Keep only the id3 nodes of each component
    id3s = set(group["id3"])
    return [sorted(component & id3s) for component in nx.connected_components(G)]

# Group by time and apply the function
output = df_data.groupby("time").apply(find_connected_components, include_groups=False)

# Prepare the final output format
result = [(time, tuple(tuple(c) for c in sorted(components)))
          for time, components in output.items()]

# Finally convert to DataFrame
final_output = pd.DataFrame(result, columns=["time", "id3_lists"])

# And display the output
print(final_output)
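With these fixes, final_output should reproduce the grouping expected in the question, printed roughly as:

   time             id3_lists
0     1  ((1, 2, 3, 5), (4,))
1     2       ((1, 2, 3, 4),)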