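I have two DataFrames that capture the hierarchy of the same dataset. Df1 is more complete than Df2, so I want to use Df1 as the reference to check whether the hierarchy in Df2 is correct. However, both DataFrames present the hierarchy in an awkward way, so it is hard to see the full structure by reading them row by row.
For example, company A may have subsidiaries B, C, D, and E, where A owns B, B owns C, C owns D, and D owns E. In Df1 this might appear as: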
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | B | C |
| B | C | D | --> new
| C | D | E |
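So when the data is analysed row by row, the same entity can appear both as an "Ultimate Parent" and as a "Child", which makes things complicated.
Df2, on the other hand, is incomplete, so it will not contain all the entities (A, B, C, D, E). It contains only some of them, e.g. A, D, and E in this case, so the DataFrame looks like this: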
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | D | E |
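Now I want to (1) use Df1 to obtain the correct, complete hierarchy and (2) compare Df1 with Df2 and identify the gaps between them. The logic is as follows.
If A owns B, which owns C, which owns D, which owns E, and Df1 looks like this: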
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | B | C |
| C | D | E |
I want to add one column that puts all related entities together, ordered from the ultimate parent down to the last child:
| Ultimate Parent | Parent | Child | Hierarchy |
| --------------- | ------ |-------|-------------|
| A | B | C |A, B, C, D, E|
| C | D | E |A, B, C, D, E|
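Then I want to compare this Df1 with Df2 and add a column to Df2 that flags the gaps. Ideally (but optionally) there would be another column stating the reason for the error.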
| Ultimate Parent | Parent | Child | Right/Wrong| Reason |
| --------------- | ------ |-------|------------|-----------------|
| A | D | E | Right | |
| C | B | A | Wrong | wrong hierarchy |
| C | A | B | Wrong | wrong hierarchy | --> new
| G | A | B | Wrong | wrong entities | --> new
| A | F | G | Wrong | wrong entities |
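I have tried several string-matching approaches, but I am stuck at the step where I believe order matters: I don't know how to compare strings in order when the related entities are scattered across different rows.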
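Basically, you need to build a network graph from `df1` to get a complete picture of the hierarchy. Once that is done, you compare the hierarchy in `df2` against the one from `df1` and validate it. You can define functions for this. You will create a new column `Hierarchy` for `df1`, and new columns `Right/Wrong` and `Reason` for `df2`.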

    import pandas as pd
    import networkx as nx

    # Reference data: the more complete hierarchy
    data1 = {
        'Ultimate Parent': ['A', 'C'],
        'Parent': ['B', 'D'],
        'Child': ['C', 'E']
    }
    df1 = pd.DataFrame(data1)

    # Data to validate: the incomplete hierarchy
    data2 = {
        'Ultimate Parent': ['A', 'C', 'A'],
        'Parent': ['D', 'B', 'F'],
        'Child': ['E', 'A', 'G']
    }
    df2 = pd.DataFrame(data2)

    # Build a directed ownership graph from df1
    G = nx.DiGraph()
    for _, row in df1.iterrows():
        G.add_edge(row['Parent'], row['Child'])
        if row['Ultimate Parent'] != row['Parent']:
            G.add_edge(row['Ultimate Parent'], row['Parent'])

    def complete_hierarchy(node, graph):
        # The node plus everything reachable from it, joined in alphabetical order
        descendants = nx.descendants(graph, node)
        descendants.add(node)
        return ', '.join(sorted(descendants))

    df1['Hierarchy'] = df1['Ultimate Parent'].apply(lambda x: complete_hierarchy(x, G))

    def validate_row(row, hierarchy_df, graph):
        # Find the hierarchy recorded in df1 for this row's ultimate parent
        filtered_hierarchy = hierarchy_df[hierarchy_df['Ultimate Parent'] == row['Ultimate Parent']]
        if filtered_hierarchy.empty:
            return pd.Series(["Wrong", "wrong entities"])
        full_hierarchy = filtered_hierarchy.iloc[0]['Hierarchy']
        hierarchy_elements = set(full_hierarchy.split(', '))
        # Both entities must be known to the reference graph
        if set([row['Parent'], row['Child']]).issubset(graph.nodes()):
            if row['Parent'] not in hierarchy_elements or row['Child'] not in hierarchy_elements:
                return pd.Series(["Wrong", "wrong hierarchy"])
            # Parent and Child must appear next to each other, in that order
            elif f"{row['Parent']}, {row['Child']}" not in full_hierarchy:
                return pd.Series(["Wrong", "wrong hierarchy"])
            else:
                return pd.Series(["Right", ""])
        else:
            return pd.Series(["Wrong", "wrong entities"])

    df2[['Right/Wrong', 'Reason']] = df2.apply(lambda row: validate_row(row, df1, G), axis=1)

    print("Df1 - Complete Hierarchy:")
    print(df1)
    print("\nDf2 - Validation Results:")
    print(df2)

This gives you:

    Df1 - Complete Hierarchy:
      Ultimate Parent Parent Child      Hierarchy
    0               A      B     C  A, B, C, D, E
    1               C      D     E        C, D, E

    Df2 - Validation Results:
      Ultimate Parent Parent Child Right/Wrong           Reason
    0               A      D     E       Right
    1               C      B     A       Wrong  wrong hierarchy
    2               A      F     G       Wrong   wrong entities

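One caveat with `complete_hierarchy` is that it joins the entities in alphabetical order, which happens to match the ownership order for A to E. The following is a minimal sketch of an alternative that orders the entities by the graph itself, using `nx.topological_sort` on the subgraph of descendants. The helper name `ordered_hierarchy` and the entity names `Zeta`, `Alpha`, and `Mid` are hypothetical and only serve to show the difference.

```python
import networkx as nx

def ordered_hierarchy(node, graph):
    """Return the node plus its descendants, ordered from parent to child."""
    members = nx.descendants(graph, node) | {node}
    # Topological order places every owner before the entities it owns
    return ', '.join(nx.topological_sort(graph.subgraph(members)))

# Hypothetical names whose alphabetical order differs from the ownership order:
# Zeta owns Alpha, and Alpha owns Mid
chain = nx.DiGraph([('Zeta', 'Alpha'), ('Alpha', 'Mid')])

alphabetical = ', '.join(sorted(nx.descendants(chain, 'Zeta') | {'Zeta'}))
by_ownership = ordered_hierarchy('Zeta', chain)

print(alphabetical)   # Alpha, Mid, Zeta
print(by_ownership)   # Zeta, Alpha, Mid
```

Topological sorting assumes the ownership graph contains no cycles.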
With `networkx`, build a directed graph from `df1`, then simply check whether the edges or nodes exist:

    import networkx as nx

    # create a graph from df1
    pairs = (['Ultimate Parent', 'Parent'], ['Parent', 'Child'])
    G1 = nx.compose(*(nx.from_pandas_edgelist(df1, source=x[0], target=x[1],
                                              create_using=nx.DiGraph)
                      for x in pairs))

    # for each edge in df2, check if it exists in the graph
    df2['Right/Wrong'] = ['Right' if e in G1.edges else 'Wrong'
                          for e in zip(df2['Parent'], df2['Child'])]

    # for each "wrong", check if the two nodes are in the graph
    # if they are, then it's a hierarchy issue
    # if they are not, then it's an entity issue
    m = df2['Right/Wrong'].eq('Wrong')
    df2.loc[m, 'Reason'] = [
        'wrong hierarchy' if set(e).issubset(G1) else 'wrong entities'
        for e in zip(df2.loc[m, 'Parent'], df2.loc[m, 'Child'])
    ]

Output:

      Ultimate Parent Parent Child Right/Wrong           Reason
    0               A      D     E       Right              nan
    1               C      B     A       Wrong  wrong hierarchy
    2               A      F     G       Wrong   wrong entities

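The edge check above looks only at the Parent/Child pair, so rows from the question such as `C | A | B` or `G | A | B` would pass because the edge A → B exists. A possible extension, sketched below, also takes the Ultimate Parent into account. The helper `check_row` and its rule (all three entities must be known, the Ultimate Parent must equal or be an ancestor of the Parent, and Parent → Child must be a direct edge) are assumptions inferred from the expected table in the question, not a confirmed specification.

```python
import networkx as nx
import pandas as pd

df1 = pd.DataFrame({'Ultimate Parent': ['A', 'C'],
                    'Parent': ['B', 'D'],
                    'Child': ['C', 'E']})
# The five rows of the expected table in the question
df2 = pd.DataFrame({'Ultimate Parent': ['A', 'C', 'C', 'G', 'A'],
                    'Parent': ['D', 'B', 'A', 'A', 'F'],
                    'Child': ['E', 'A', 'B', 'B', 'G']})

# Reference graph: Ultimate Parent -> Parent and Parent -> Child edges from df1
G = nx.DiGraph()
for up, p, c in zip(df1['Ultimate Parent'], df1['Parent'], df1['Child']):
    if up != p:  # avoid a self-loop when an entity is its own ultimate parent
        G.add_edge(up, p)
    G.add_edge(p, c)

def check_row(up, parent, child, graph):
    """Hypothetical rule: return a (Right/Wrong, Reason) pair for one df2 row."""
    if not {up, parent, child} <= set(graph.nodes):
        return 'Wrong', 'wrong entities'
    # The ultimate parent must sit at or above the parent in the hierarchy
    up_ok = up == parent or nx.has_path(graph, up, parent)
    if up_ok and graph.has_edge(parent, child):
        return 'Right', ''
    return 'Wrong', 'wrong hierarchy'

results = [check_row(u, p, c, G)
           for u, p, c in zip(df2['Ultimate Parent'], df2['Parent'], df2['Child'])]
df2['Right/Wrong'] = [status for status, _ in results]
df2['Reason'] = [reason for _, reason in results]
print(df2)
```

With this rule, `A | D | E` is the only row marked `Right`, which agrees with the expected table in the question.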
Adding the hierarchy to `df1`:

    cc = {frozenset(c): ','.join(nx.topological_sort(G1.subgraph(c)))
          for c in nx.weakly_connected_components(G1)}
    df1['Hierarchy'] = df1['Child'].map({n: x for c,x in cc.items() for n in c})

Output:

      Ultimate Parent Parent Child  Hierarchy
    0               A      B     C  A,B,C,D,E
    1               C      D     E  A,B,C,D,E

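`nx.topological_sort` only works on acyclic graphs, so a circular ownership record in the reference data would make the mapping above fail. The sketch below shows one way to guard against that by checking each weakly connected component separately. The helper name `hierarchy_labels` and the entities `X` and `Y` are hypothetical, and returning `None` for a cyclic component is just one possible choice.

```python
import networkx as nx

def hierarchy_labels(graph):
    """Map each node to the topological order of its weakly connected component."""
    labels = {}
    for comp in nx.weakly_connected_components(graph):
        sub = graph.subgraph(comp)
        if nx.is_directed_acyclic_graph(sub):
            label = ','.join(nx.topological_sort(sub))
        else:
            label = None  # circular ownership: no valid top-down order exists
        for node in comp:
            labels[node] = label
    return labels

# The chain from df1 plus a hypothetical circular pair X <-> Y
G = nx.DiGraph([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E'),
                ('X', 'Y'), ('Y', 'X')])
labels = hierarchy_labels(G)

print(labels['C'])  # A,B,C,D,E
print(labels['X'])  # None
```

The resulting dictionary can be passed to `Series.map` in the same way as the dictionary built from `cc`.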
Reproducible input:

    df1 = pd.DataFrame({'Ultimate Parent': ['A', 'C'],
                        'Parent': ['B', 'D'],
                        'Child': ['C', 'E']})
    df2 = pd.DataFrame({'Ultimate Parent': ['A', 'C', 'A'],
                        'Parent': ['D', 'B', 'F'],
                        'Child': ['E', 'A', 'G']
                        })

Graph (`df1`): A → B → C → D → E