I have a dataset like this:
Emp Mgr
0 E1 M1
1 M1 M2
2 M3 M5
3 M2 M5
So for each user (row), I need the management hierarchy as:
Emp Mgr Level_01 Level_02 Level_03 Level_04
0 E1 M1 M5 M2 M1 E1
1 M1 M2 M5 M2 M1
2 M3 M5 M5 M3
3 M2 M5 M5 M2
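The output looks like:
Emp > Managers (the last level being his direct manager).
For example, for EmpA: Mgr1 (CEO) - Mgr2 (Director) - M3 (Senior Manager) - M4 (Emp's direct manager)
I am using networkx, as described in this answer. There are 177K records with 2 root nodes. Generating this hierarchy takes more than 6 hours in total. How can I significantly reduce the time the script takes?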
G = nx.from_pandas_edgelist(df, source='Mgr', target='Emp',
                            create_using=nx.DiGraph)

# find roots (= top managers)
roots = [n for n,d in G.in_degree() if d==0]

df2 = (pd.DataFrame([next((p for root in roots for p in nx.all_simple_paths(G, root, node)), [])[:-1]
                     for node in df['Emp']], index=df.index)
         .rename(columns=lambda x: f'Level_{x+1:02d}')
      )
I rewrote the logic using a combination of networkx and pure Python with a recursive function:
import networkx as nx
import pandas as pd

df = pd.read_csv('demodata2.csv', skiprows=3, usecols=[0, 1])

# build a directed graph manager -> employee, ignoring rows without a manager
G = nx.from_pandas_edgelist(df.dropna(subset='MgrUPN'), source='MgrUPN', target='EmpUPN',
                            create_using=nx.DiGraph)

# drop employees listed as their own manager
G.remove_edges_from(nx.selfloop_edges(G))

# map each node to a single parent (its manager)
parent = {}
# uncomment the prints to see which nodes have no or multiple parents
for n in nx.dfs_postorder_nodes(G):
    p = list(G.predecessors(n))
    if len(p) == 0:
        #print(f'"{n}" has no parent')
        pass
    else:
        if len(p)>1:
            #print(f'"{n}" has multiple parents ({p}), picking "{p[0]}"')
            pass
        parent[n] = p[0]

def get_parents(n):
    # yield the chain from the root down to n (n itself comes last)
    try:
        yield from get_parents(parent[n])
    except KeyError:
        pass
    yield n

out = df.join(pd.DataFrame([list(get_parents(node)) for node in df['EmpUPN']], index=df.index)
                .rename(columns=lambda x: f'Level_{x+1:02d}')
              )

print(out)
Output:
EmpUPN MgrUPN Level_01 Level_02 Level_03 Level_04 Level_05 Level_06 Level_07 Level_08 Level_09 Level_10 Level_11
0 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] None None None None
1 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] None None None
2 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] None None None None None None None
3 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] None None None None
4 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ...
177920 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] None None
177921 [email protected] NaN [email protected] None None None None None None None None None None
177922 [email protected] NaN [email protected] None None None None None None None None None None
177923 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] None None None
177924 [email protected] NaN [email protected] None None None None None None None None None None
[177925 rows x 13 columns]
Running time on the 177K rows:
1.44 s ± 255 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
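The same parent-map idea can also be sketched without networkx and without recursion, which may help if a chain is deep enough to hit Python's recursion limit or if the data contains a cycle. This is only a minimal sketch under assumptions, not the code timed above: the helper names `build_parent_map` and `chain_to_root` are hypothetical, the input is the 4-row sample from the question, and when an employee has several managers the first one seen is kept.

```python
def build_parent_map(pairs):
    """Map each employee to one manager; skip missing managers and self-loops."""
    parent = {}
    for emp, mgr in pairs:
        # mgr != mgr is True only for NaN
        if mgr is None or mgr != mgr or mgr == emp:
            continue
        # assumption: if an employee has several managers, keep the first seen
        parent.setdefault(emp, mgr)
    return parent


def chain_to_root(node, parent):
    """Return [root, ..., direct manager, node] by walking upward in a loop."""
    chain = [node]
    seen = {node}
    while chain[-1] in parent:
        nxt = parent[chain[-1]]
        if nxt in seen:  # guard against cycles in the data
            break
        seen.add(nxt)
        chain.append(nxt)
    chain.reverse()
    return chain


# sample (Emp, Mgr) pairs from the question
pairs = [("E1", "M1"), ("M1", "M2"), ("M3", "M5"), ("M2", "M5")]
parent = build_parent_map(pairs)
chains = {emp: chain_to_root(emp, parent) for emp, _ in pairs}
print(chains["E1"])  # ['M5', 'M2', 'M1', 'E1']
```

With a DataFrame, the pairs could come from `zip(df['EmpUPN'], df['MgrUPN'])`, and the resulting lists can be joined back with the same `pd.DataFrame(...).rename(...)` pattern used above.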