I'm having trouble describing this problem in a general enough way to make the question title useful, but here it is. I'm trying to merge or group rows of a table based on ids in its columns, where a separate table declares which ids belong to which groups. It's complicated by the fact that the ids are composite objects spread across two columns. Here is a toy version of the problem along with a working, but very inefficient, solution:
# Inputs:
columns = ["time", "id1_list", "id2_list", "id3"]
data = [(1, ("A", "B"), (1, 2), 1),
        (1, ("A", "B"), (2, 3), 2),
        (1, ("A", "B"), (4, 5), 3),
        (1, ("A", "B"), (6, 7), 4),
        (1, ("A", "C"), (1, 1), 5),
        (2, ("A", "B"), (1, 3), 1),
        (2, ("A", "B"), (2, 3), 2),
        (2, ("A", "B"), (4, 3), 3),
        (2, ("A", "C"), (1, 1), 4)]
merge_cols = ["time", "id1", "id2_lists"]
merge_data = [(1, "A", ((1, 2), (3, 4))),
              (1, "B", ((3, 5),)),
              (2, "A", ((1, 2), (3, 4))),
              (2, "B", ((3, 5),))]
output_columns = ["time", "id3_lists"]
expected_output = [(1, ((1, 2, 3, 5), (4,))),
                   (2, ((1, 2, 3, 4),))]
The inefficient solution:
import itertools
import pandas as pd

# Group by time
df_g = pd.DataFrame(data, columns=columns).groupby("time")
df_merge_data_g = pd.DataFrame(merge_data, columns=merge_cols).groupby("time")

def match(g, id3_A, id3_B, df_merge_data_t):
    # Get id match info for the two rows being compared
    rowA = g.query("id3 == @id3_A").iloc[0]
    rowB = g.query("id3 == @id3_B").iloc[0]
    id1sA = rowA["id1_list"]
    id1sB = rowB["id1_list"]
    id2sA = rowA["id2_list"]
    id2sB = rowB["id2_list"]
    matched = False
    for id1_A, id2_A in zip(id1sA, id2sA):
        if matched: break
        for id1_B, id2_B in zip(id1sB, id2sB):
            if matched: break
            if id1_A == id1_B:
                match_groups = df_merge_data_t.query("id1 == @id1_A")["id2_lists"].iloc[0]
                for match_g in match_groups:
                    if id2_A in match_g and id2_B in match_g:
                        matched = True
                        break
    return matched

def merge(data):
    for x in set(data):
        for y in set(data):
            if x == y:
                continue
            if not x.isdisjoint(y):
                data.remove(x)
                data.remove(y)
                data.add(x.union(y))
                return merge(data)
    return data

def get_match_groups(g):
    df_merge_data_t = df_merge_data_g.get_group(g.name)
    # Form all pairings of items to be matched
    pairs = list(itertools.combinations(g.id3, 2))
    # Check each pair for match
    matched_pairs = set(frozenset(pair) for pair in pairs if match(g, *pair, df_merge_data_t))
    # Merge pairs with common elements to get connected groups of matches
    merged_matches = merge(matched_pairs)
    # Add back any items that weren't matched with anything as singleton groups
    unused = set(frozenset((id3,)) for id3 in set(g.id3) if not any(id3 in m for m in merged_matches))
    merged_matches.update(unused)
    return merged_matches

out = df_g.apply(get_match_groups, include_groups=False)
Output:
time
1 {(1, 2, 3, 5), (4)}
2 {(1, 2, 3, 4)}
dtype: object
Expected output:
pd.DataFrame(expected_output, columns=output_columns)["id3_lists"]
0 ((1, 2, 3, 5), (4,))
1 ((1, 2, 3, 4))
Name: id3_lists, dtype: object
In words: we have the table data. I want to form groups of id3 based on matching criteria involving id1_list and id2_list. Essentially, each row describes a composite object built from other objects, whose id information is listed in id1_list and id2_list. id1_list and id2_list are aligned, so the pairs produced by zip(id1_list, id2_list) describe the contributing objects.
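Concretely, taking the first row of data above, the pairing looks like this:

```python
# First row of `data` at time 1: id1_list=("A", "B"), id2_list=(1, 2)
id1_list = ("A", "B")
id2_list = (1, 2)

# zip aligns the two lists elementwise, giving the contributing objects
pairs = list(zip(id1_list, id2_list))
print(pairs)  # [('A', 1), ('B', 2)]
```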
Next, we have the merge_data table. This table describes equivalences between those contributing objects (the objects labeled by (id1, id2) pairs). So, for example, the first row of merge_data:
(1, "A", ((1, 2), (3, 4)))
says that at time 1, for objects with id1="A", those whose id2 is in the set (1, 2) should be considered "matched". Likewise for (3, 4). In case it helps, these sets are always disjoint.
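To make the matching rule concrete, here is a minimal sketch of the pairwise check. The dict is built by hand from the two time-1 rows of merge_data, and pairs_match is a hypothetical helper name, not part of the solution below:

```python
# Toy lookup: (time, id1) -> tuple of id2 groups, from the first two rows of merge_data
merge_lookup = {(1, "A"): ((1, 2), (3, 4)),
                (1, "B"): ((3, 5),)}

def pairs_match(time, a, b):
    # a and b are (id1, id2) pairs; they can only match if they share id1
    id1_a, id2_a = a
    id1_b, id2_b = b
    if id1_a != id1_b:
        return False
    groups = merge_lookup.get((time, id1_a), ())
    # Matched if some group contains both id2 values
    return any(id2_a in g and id2_b in g for g in groups)

print(pairs_match(1, ("A", 1), ("A", 2)))  # True: both id2s are in group (1, 2)
print(pairs_match(1, ("A", 1), ("A", 3)))  # False: the id2s fall in different groups
```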
So the goal is, for each time t, to collect all the rows of data and merge_data at that time, and then use merge_data to work out which rows of data "match". The output should be a table of id3 groups formed by this criterion within each time group, merging matches into the same group whenever any "link" between them exists.
I'm happy to restructure the data so that some clever pandas merge operation can do all the work efficiently. Or perhaps some graph algorithm. But so far all I've come up with is the very slow brute-force solution above.
You can use NetworkX to treat the connectivity as a graph problem. This reduces the computational complexity compared to the brute-force approach.
import pandas as pd
import networkx as nx

# Inputs
columns = ["time", "id1_list", "id2_list", "id3"]
data = [(1, ("A", "B"), (1, 2), 1),
        (1, ("A", "B"), (2, 3), 2),
        (1, ("A", "B"), (4, 5), 3),
        (1, ("A", "B"), (6, 7), 4),
        (1, ("A", "C"), (1, 1), 5),
        (2, ("A", "B"), (1, 3), 1),
        (2, ("A", "B"), (2, 3), 2),
        (2, ("A", "B"), (4, 3), 3),
        (2, ("A", "C"), (1, 1), 4)]
merge_cols = ["time", "id1", "id2_lists"]
merge_data = [(1, "A", ((1, 2), (3, 4))),
              (1, "B", ((3, 5),)),
              (2, "A", ((1, 2), (3, 4))),
              (2, "B", ((3, 5),))]

# Create DataFrames
df_data = pd.DataFrame(data, columns=columns)
df_merge_data = pd.DataFrame(merge_data, columns=merge_cols)

# Fast lookup: (time, id1) -> tuple of id2 groups
merge_lookup = df_merge_data.set_index(["time", "id1"])["id2_lists"].to_dict()

def find_connected_components(group):
    G = nx.Graph()
    # Build the graph
    for _, row in group.iterrows():
        id3 = row["id3"]
        # Ensure unmatched rows still appear, as singleton components
        G.add_node(("id3", id3))
        # Each aligned (id1, id2) pair is a contributing object; link the
        # row's id3 to every merge group that contains the pair's id2
        for id1, id2 in zip(row["id1_list"], row["id2_list"]):
            for grp in merge_lookup.get((group.name, id1), ()):
                if id2 in grp:
                    G.add_edge(("id3", id3), ("grp", id1, grp))
    # Two id3s land in the same component whenever any chain of shared
    # merge groups links them; keep only the id3 labels
    components = (tuple(sorted(n[1] for n in comp if n[0] == "id3"))
                  for comp in nx.connected_components(G))
    return tuple(sorted(c for c in components if c))

# Group by time and apply the function
output = df_data.groupby("time").apply(find_connected_components, include_groups=False)

# Finally convert to the expected output format and display it
final_output = output.rename("id3_lists").reset_index()
print(final_output)
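If you'd rather avoid the NetworkX dependency, the connected-components step can also be done with a small union-find (disjoint-set) structure. A minimal sketch, with the hypothetical helper name merge_sets, that merges any collection of overlapping sets:

```python
# Union-find sketch for merging overlapping sets, as an alternative
# to nx.connected_components. Takes an iterable of sets and returns a
# list of merged sets, one per connected group.
def merge_sets(sets):
    parent = {}

    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Union every element of each set with that set's first element
    for s in sets:
        items = list(s)
        for other in items[1:]:
            union(items[0], other)
    # Collect elements by their root
    out = {}
    for s in sets:
        for x in s:
            out.setdefault(find(x), set()).add(x)
    return list(out.values())

print(merge_sets([{1, 2}, {2, 3}, {4}]))  # [{1, 2, 3}, {4}]
```

This is near-linear in the total number of elements, so it scales better than the recursive pairwise merge() in the question.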