I have the following dummy df:
import pandas as pd
data = {
'address': [1234, 24389, 4384, 4484, 1234, 24389, 4384, 188],
'old_account': [200, 200, 200, 300, 200, 494, 400, 100],
'new_account': [300, 100, 494, 200, 400, 200, 200, 200]
}
df = pd.DataFrame(data)
print(df)
   address  old_account  new_account
0     1234          200          300
1    24389          200          100
2     4384          200          494
3     4484          300          200
4     1234          200          400
5    24389          494          200
6     4384          400          200
7      188          100          200
A) I want to sort it so that 200 appears in old_account and then, directly in the next row, in new_account:
200 xxx
xxx 200
B) I also want to order the non-200 values, so that I start with 300, go through the whole df looking for 300, and alternate:
200 300
300 200
200 300
...
Only when there are no more 300s do I move on to the next value, say 400:
200 300
300 200
200 300
...
200 400
400 200
200 400
...
For the df above, the result should look like this:
   address  old_account  new_account
0     1234          200          300
1     4484          300          200
2    24389          200          100
3      188          100          200
4     4384          200          494
5    24389          494          200
6     1234          200          400
7     4384          400          200
As you can see, the 200s sit diagonally to each other, and so do the non-200 values.
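The "diagonal" property can be stated precisely: each row's new_account equals the next row's old_account. A minimal sketch of that check follows; the helper name `is_chained` is made up for illustration and is not part of pandas.

```python
import pandas as pd


def is_chained(frame: pd.DataFrame) -> bool:
    """Return True if each row's new_account equals the next row's old_account."""
    new_vals = frame['new_account'].to_numpy()[:-1]
    next_old = frame['old_account'].to_numpy()[1:]
    return bool((new_vals == next_old).all())


# The desired ordering shown above
expected = pd.DataFrame({
    'address': [1234, 4484, 24389, 188, 4384, 24389, 1234, 4384],
    'old_account': [200, 300, 200, 100, 200, 494, 200, 400],
    'new_account': [300, 200, 100, 200, 494, 200, 400, 200],
})

# The unsorted dummy data
original = pd.DataFrame({
    'address': [1234, 24389, 4384, 4484, 1234, 24389, 4384, 188],
    'old_account': [200, 200, 200, 300, 200, 494, 400, 100],
    'new_account': [300, 100, 494, 200, 400, 200, 200, 200],
})

print(is_chained(expected))  # True
print(is_chained(original))  # False
```

Note that this only checks the chaining from A); it says nothing about the order in which the non-200 values are visited, which is requirement B).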
The code below only handles A); I have not managed to account for B) as well. Here is what I have:
import pandas as pd

# Create the initial DataFrame
df = pd.read_csv('dummy_data.csv', sep=';')

# Initiate sorted df
sorted_df = pd.DataFrame(columns=df.columns)

while not df.empty:
    # Find the first row where '200' is in 'old_account'
    idx_old = df.index[df['old_account'] == 200].min()
    if pd.notna(idx_old):
        # Add the corresponding row to the sorted result
        sorted_df = pd.concat([sorted_df, df.loc[[idx_old]]], ignore_index=True)
        # Remove the row from the original DataFrame
        df = df.drop(index=idx_old)
        # Find the matching row where '200' is in 'new_account'
        idx_new = df.index[df['new_account'] == 200].min()
        if pd.notna(idx_new):
            # Add the corresponding row to the sorted result
            sorted_df = pd.concat([sorted_df, df.loc[[idx_new]]], ignore_index=True)
            # Remove the row from the original DataFrame
            df = df.drop(index=idx_new)
        else:
            break  # If no matching row is found, exit the loop
    else:
        break  # If no more '200' in 'old_account' is found, exit the loop

# Reset the index of the sorted DataFrame
sorted_df.reset_index(drop=True, inplace=True)
print(sorted_df)
It looks like you are trying to find an Eulerian path in a directed graph.
You might want to use networkx:
import networkx as nx

G = nx.from_pandas_edgelist(df, source='old_account', target='new_account',
                            create_using=nx.MultiDiGraph)
tmp = pd.DataFrame(nx.eulerian_circuit(G, keys=True),
                   columns=['old_account', 'new_account', 'n'])
out = (tmp.merge(df.assign(n=df.groupby(['old_account', 'new_account']).cumcount()))
          [df.columns]
      )
Output:
   address  old_account  new_account
0     1234          200          400
1     4384          400          200
2     4384          200          494
3    24389          494          200
4    24389          200          100
5      188          100          200
6     1234          200          300
7     4484          300          200
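The circuit above satisfies A), but the order in which the non-200 accounts are visited is not something you control (here it starts with 400, not 300). If the order from B) matters, a pandas-only alternative may be enough. The following is a minimal sketch, not a general solution: it assumes every row has 200 on exactly one side, that each `200 -> x` row has a matching `x -> 200` row, and that the non-200 accounts should be visited in order of first appearance. The names `PIVOT`, `partner`, `direction`, `partner_rank` and `occurrence` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'address': [1234, 24389, 4384, 4484, 1234, 24389, 4384, 188],
    'old_account': [200, 200, 200, 300, 200, 494, 400, 100],
    'new_account': [300, 100, 494, 200, 400, 200, 200, 200],
})

PIVOT = 200  # assumption: the account that every row goes through

# For each row: the non-200 account, and the direction (0: 200 -> x, 1: x -> 200)
is_outgoing = df['old_account'].eq(PIVOT)
partner = df['new_account'].where(is_outgoing, df['old_account']).rename('partner')
direction = (~is_outgoing).astype(int).rename('direction')

# Rank partners by first appearance; number repeated (partner, direction) rows
partner_rank = pd.factorize(partner)[0]
occurrence = df.groupby([partner, direction]).cumcount().to_numpy()

# np.lexsort treats the LAST key as the primary one:
# sort by partner, then by occurrence, then outgoing before incoming
order = np.lexsort((direction.to_numpy(), occurrence, partner_rank))
out = df.iloc[order].reset_index(drop=True)
print(out)
```

On the dummy data this reproduces the desired ordering from the question (addresses 1234, 4484, 24389, 188, 4384, 24389, 1234, 4384). If the two directions are not balanced for some account, the rows are still grouped by partner but the chaining from A) breaks at that point, so the networkx approach remains the more robust option for general graphs.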