I am trying to summarize/aggregate a dataframe as shown below. While the code gives the correct result, it is very repetitive and I would like to avoid that. I think I need to use groupby, agg, apply, etc., but cannot find a way to do it. The goal is to compute df_summ at the end. I believe I am using too many intermediate dataframes with row selection, and too many merge calls to put the results together. I feel there must be a simpler way, but cannot figure it out.
The real df_stats input dataframe has millions of rows, and the df_summ output dataframe has dozens of columns. The input shown below is just a minimal reproducible example.
import io
import pandas as pd
TESTDATA="""
enzyme regions N length
AaaI all 10 238045
AaaI all 20 170393
AaaI all 30 131782
AaaI all 40 103790
AaaI all 50 81246
AaaI all 60 62469
AaaI all 70 46080
AaaI all 80 31340
AaaI all 90 17188
AaaI captured 10 292735
AaaI captured 20 229824
AaaI captured 30 193605
AaaI captured 40 163710
AaaI captured 50 138271
AaaI captured 60 116122
AaaI captured 70 95615
AaaI captured 80 73317
AaaI captured 90 50316
AagI all 10 88337
AagI all 20 19144
AagI all 30 11030
AagI all 40 8093
AagI all 50 6394
AagI all 60 4991
AagI all 70 3813
AagI all 80 2759
AagI all 90 1666
AagI captured 10 34463
AagI captured 20 19220
AagI captured 30 15389
AagI captured 40 12818
AagI captured 50 10923
AagI captured 60 9261
AagI captured 70 7753
AagI captured 80 6201
AagI captured 90 4495
"""
df_stats = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')  # raw string avoids the invalid '\s' escape warning
df_cap_N90 = df_stats[(df_stats['N'] == 90) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])
df_cap_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])
df_all_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'all') ].drop(columns=['regions', 'N'])
df_summ_cap_N50_all_N50 = pd.merge(df_cap_N50, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N50', '_all_N50'))
df_summ_cap_N50_all_N50['cap_N50_all_N50'] = (df_summ_cap_N50_all_N50['length_cap_N50'] -
df_summ_cap_N50_all_N50['length_all_N50'])
print(df_summ_cap_N50_all_N50)
df_summ_cap_N90_all_N50 = pd.merge(df_cap_N90, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N90', '_all_N50'))
df_summ_cap_N90_all_N50['cap_N90_all_N50'] = df_summ_cap_N90_all_N50['length_cap_N90'] - df_summ_cap_N90_all_N50['length_all_N50']
print(df_summ_cap_N90_all_N50)
df_summ = pd.merge(df_summ_cap_N50_all_N50.drop(columns=['length_cap_N50', 'length_all_N50']),
df_summ_cap_N90_all_N50.drop(columns=['length_cap_N90', 'length_all_N50']),
on='enzyme', how='inner')
print(df_summ)
This prints:
enzyme length_cap_N50 length_all_N50 cap_N50_all_N50
0 AaaI 138271 81246 57025
1 AagI 10923 6394 4529
enzyme length_cap_N90 length_all_N50 cap_N90_all_N50
0 AaaI 50316 81246 -30930
1 AagI 4495 6394 -1899
enzyme cap_N50_all_N50 cap_N90_all_N50
0 AaaI 57025 -30930
1 AagI 4529 -1899
A note on the bioinformatics background behind this question:
(feel free to skip this part; it only describes the domain knowledge behind the Python code)
The code above is one step of a multi-step bioinformatics project in which I try to find optimal restriction enzymes based on how they cut DNA.
As input to this step, I have a table of restriction enzymes (their names are stored in the enzyme column). I want to rank the enzymes by statistical properties of how they cut DNA. The regions column stores two different types of DNA regions that I want to distinguish using the enzymes. The N column holds the name of a statistic (N10, ..., N90) measuring how finely the DNA is cut, and length is the value of that statistic. The N statistics summarize the distribution of DNA fragment lengths (measured in nucleotides), similar in spirit to quantiles (10%, ..., 90%). When comparing enzymes, I want to perform simple operations such as cap_N90_all_N50 = { captured N90 } - { all N50 }, and then rank the enzymes by a combination of cap_N50_all_N50 and the like.
You haven't described the logic, so I don't understand why you don't compute df_all_N90 as well.
pivot / sub:
piv = (df_stats.loc[df_stats["N"].isin([50, 90])]  # keep only the N values used in the summary
       .pivot(index="enzyme", columns=["regions", "N"], values="length"))
out = (piv["captured"]                    # the captured N50 and N90 columns
       .sub(piv[("all", 50)], axis=0)     # minus the all-N50 baseline, aligned on enzyme
       .add_prefix("cap_N").add_suffix("_all_N50").reset_index())
Output:
print(out)
N enzyme cap_N50_all_N50 cap_N90_all_N50
0 AaaI 57025 -30930
1 AagI 4529 -1899
[2 rows x 3 columns]
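Since the real df_summ has dozens of columns, the same pivot / sub idea can be extended without listing each pair by hand. As a rough sketch (the cap_N{n}_all_N50 naming convention is my assumption, not something stated in the question), this builds the captured-minus-all-N50 difference for every N value at once:
# sketch: generalize the pivot/sub approach to every N statistic at once
piv = df_stats.pivot(index="enzyme", columns=["regions", "N"], values="length")
# each captured N10..N90 statistic minus the all-N50 baseline
diffs = piv["captured"].sub(piv[("all", 50)], axis=0)
diffs.columns = [f"cap_N{n}_all_N50" for n in diffs.columns]
df_summ = diffs.reset_index()
print(df_summ)
Dropping the .isin filter keeps all N levels in the pivot, so each resulting column is one captured statistic minus the same baseline; other baselines can be subtracted the same way and joined on enzyme.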