I have some code that works correctly (it gives the expected answer), but it is inefficient and unnecessarily complex. It uses loops that I would like to simplify and speed up, possibly with vectorized operations. It also converts a DataFrame to a Series and then back to a DataFrame - another block that needs work. In other words, I would like to make this code more Pythonic.

I have marked the problematic places in the code (below) with comments starting with # TODO:.

The goal of the code is to summarize and aggregate the input DataFrame df, which holds the length distribution of DNA fragments for two types of regions: all and captured. This is a bioinformatics problem, part of a larger project that ranks enzymes by their ability to cut certain DNA regions into fragments of specified lengths. For this question, the only relevant facts are that length is an integer and that the DNA regions come in two types: all and captured. The aim is to produce a DataFrame df_pur of purity vs. length_cutoff (the cutoff on length used when purifying the DNA). The steps are:

1. For each regions type, compute the fraction of the total length contributed by fragments at or above each of the length_cutoffs.
2. Compute the ratio captured / all of those fractions at each of the length_cutoffs, and store the result in a DataFrame.

import io
import pandas as pd
# This is a minimal reproducible example. The real dataset has 2
# columns and 10s of millions of rows. Column 1 is integer, column 2
# has 2 values: 'all' and 'captured':
TESTDATA="""
1 all
49 all
200 all
20 captured
480 captured
2000 captured
"""
df = pd.read_csv(io.StringIO(TESTDATA),
                 sep=r'\s+', header=None, names='length regions'.split())
# This is a minimal reproducible example. The real list has ~10
# integer values (cutoffs):
length_cutoffs = [10, 100, 1000]
df_tot_length = pd.DataFrame(columns=['tot_length'])
df_tot_length['tot_length'] = df.groupby(['regions']).length.sum()
df_tot_length.reset_index(inplace=True)
print(df_tot_length)
# regions tot_length
# 0 all 250
# 1 captured 2500
df_frc_tot = pd.DataFrame(columns=['regions', 'length_cutoff', 'sum_lengths'])
regions = df['regions'].unique()
df_index = pd.DataFrame({'regions': regions}).set_index('regions')
# TODO: simplify this loop (vectorize?):
for length_cutoff in length_cutoffs:
    df_cur = (pd.DataFrame({'length_cutoff': length_cutoff,
                            'sum_lengths': df[df['length'] >= length_cutoff]
                            .groupby(['regions']).length.sum()},
                           # Prevent dropping rows where no elements
                           # are selected by the above
                           # condition. Re-insert the dropped rows,
                           # use for those sum_lengths = NaN
                           index=df_index.index)
              # Correct the above sum_lengths = NaN to 0:
              .fillna(0)).reset_index()
    # Undo the effect of `fillna(0)` above, which casts the
    # integer column as float:
    df_cur['sum_lengths'] = df_cur['sum_lengths'].astype('int')
    # TODO: simplify this loop (vectorize?):
    for region in regions:
        df_cur.loc[df_cur['regions'] == region, 'frc_tot_length'] = (
            df_cur.loc[df_cur['regions'] == region, 'sum_lengths'] /
            df_tot_length.loc[df_tot_length['regions'] == region, 'tot_length'])
    df_frc_tot = pd.concat([df_frc_tot, df_cur], axis=0)
df_frc_tot.reset_index(inplace=True, drop=True)
print(df_frc_tot)
# regions length_cutoff sum_lengths frc_tot_length
# 0 all 10 249 0.996
# 1 captured 10 2500 1.000
# 2 all 100 200 0.800
# 3 captured 100 2480 0.992
# 4 all 1000 0 0.000
# 5 captured 1000 2000 0.800
# TODO: simplify the next 2 statements:
ser_pur = (df_frc_tot.loc[df_frc_tot['regions'] == 'captured', 'frc_tot_length']
           .reset_index(drop=True) /
           df_frc_tot.loc[df_frc_tot['regions'] == 'all', 'frc_tot_length']
           .reset_index(drop=True))
df_pur = pd.DataFrame({'length_cutoff': length_cutoffs, 'purity': ser_pur})
print(df_pur)
# length_cutoff purity
# 0 10 1.004016
# 1 100 1.240000
# 2 1000 inf
IIUC, you can do it like this:
import numpy as np

length_cutoffs = [10, 100, 1000]
df["bins"] = pd.cut(
    df["length"],
    pd.IntervalIndex.from_breaks([-np.inf] + length_cutoffs + [np.inf], closed="left"),
)
out = df.pivot_table(index=["regions", "bins"], values="length", aggfunc="sum")
g = out.groupby(level=0)
out["frc_tot_length"] = (
    g["length"].transform(lambda x: [x.iloc[i:].sum() for i in range(len(x))])
) / g["length"].sum()
print(out)
print()
Prints:
length frc_tot_length
regions bins
all [-inf, 10.0) 1 1.000
[10.0, 100.0) 49 0.996
[100.0, 1000.0) 200 0.800
[1000.0, inf) 0 0.000
captured [-inf, 10.0) 0 1.000
[10.0, 100.0) 20 1.000
[100.0, 1000.0) 480 0.992
[1000.0, inf) 2000 0.800
Then:
x = out.unstack(level=0)
x = x[("frc_tot_length", "captured")] / x[("frc_tot_length", "all")]
print(x)
Prints:
bins
[-inf, 10.0) 1.000000
[10.0, 100.0) 1.004016
[100.0, 1000.0) 1.240000
[1000.0, inf) inf
dtype: float64
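If you want the question's df_pur layout (length_cutoff vs. purity), the ratio Series can be reshaped back into a DataFrame. A sketch, using a stand-in Series with the values printed above; the leading (-inf, 10) bin is dropped because each remaining bin's lower edge is exactly one of the cutoffs:

```python
import numpy as np
import pandas as pd

length_cutoffs = [10, 100, 1000]
# Stand-in for the ratio Series x computed above (one value per bin).
x = pd.Series([1.000000, 1.004016, 1.240000, np.inf])

# Drop the open-ended first bin and pair each remaining ratio with its
# cutoff to recover the df_pur layout from the question.
df_pur = pd.DataFrame({"length_cutoff": length_cutoffs,
                       "purity": x.iloc[1:].to_numpy()})
print(df_pur)
```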