I have some code that works correctly (it gives the expected answer), but it is inefficient and unnecessarily complex. It uses loops that I would like to simplify and speed up, possibly with vectorized operations. It also converts a DataFrame to a Series and then back to a DataFrame - another block that needs work. In other words, I would like to make this code more Pythonic.

I have marked the problematic places in the code (below) with comments starting with # TODO:.

The goal of the code is to summarize and aggregate the input DataFrame df, which holds the length distribution of DNA fragments for two types of regions: all and captured. This is a bioinformatics problem, part of a larger project that ranks enzymes by their ability to cut certain DNA regions into fragments of specified lengths. For this question, the only relevant facts are that length is an integer and that the DNA regions come in two types: all and captured. The aim is to produce a DataFrame df_pur of purity vs. length_cutoff (the cutoff on length used when purifying the DNA). The steps are:

1. For each regions type, compute the fraction of the total length contributed by fragments at or above each of the length_cutoffs.
2. Compute the ratio captured / all of those fractions at each of the length_cutoffs, and store the result in a DataFrame.

import io
import pandas as pd
# This is a minimal reproducible example. The real dataset has 2
# columns and 10s of millions of rows. Column 1 is integer, column 2
# has 2 values: 'all' and 'captured':
TESTDATA="""
1 all
49 all
200 all
20 captured
480 captured
2000 captured
"""
df = pd.read_csv(io.StringIO(TESTDATA),
                 sep=r'\s+', header=None, names='length regions'.split())
# This is a minimal reproducible example. The real list has ~10
# integer values (cutoffs):
length_cutoffs = [10, 100, 1000]
df_tot_length = pd.DataFrame(columns=['tot_length'])
df_tot_length['tot_length'] = df.groupby(['regions']).length.sum()
df_tot_length.reset_index(inplace=True)
print(df_tot_length)
# regions tot_length
# 0 all 250
# 1 captured 2500
df_frc_tot = pd.DataFrame(columns=['regions', 'length_cutoff', 'sum_lengths'])
regions = df['regions'].unique()
df_index = pd.DataFrame({'regions': regions}).set_index('regions')
# TODO: simplify this loop (vectorize?):
for length_cutoff in length_cutoffs:
    df_cur = (pd.DataFrame({'length_cutoff': length_cutoff,
                            'sum_lengths': df[df['length'] >= length_cutoff]
                            .groupby(['regions']).length.sum()},
                           # Prevent dropping rows where no elements
                           # are selected by the above
                           # condition. Re-insert the dropped rows,
                           # use for those sum_lengths = NaN
                           index=df_index.index)
              # Correct the above sum_lengths = NaN to 0:
              .fillna(0)).reset_index()
    # Undo the effect of `fillna(0)` above, which casts the
    # integer column as float:
    df_cur['sum_lengths'] = df_cur['sum_lengths'].astype('int')
    # TODO: simplify this loop (vectorize?):
    for region in regions:
        df_cur.loc[df_cur['regions'] == region, 'frc_tot_length'] = (
            df_cur.loc[df_cur['regions'] == region, 'sum_lengths'] /
            df_tot_length.loc[df_tot_length['regions'] == region, 'tot_length'])
    df_frc_tot = pd.concat([df_frc_tot, df_cur], axis=0)
df_frc_tot.reset_index(inplace=True, drop=True)
print(df_frc_tot)
# regions length_cutoff sum_lengths frc_tot_length
# 0 all 10 249 0.996
# 1 captured 10 2500 1.000
# 2 all 100 200 0.800
# 3 captured 100 2480 0.992
# 4 all 1000 0 0.000
# 5 captured 1000 2000 0.800
# TODO: simplify the next 2 statements:
ser_pur = (df_frc_tot.loc[df_frc_tot['regions'] == 'captured', 'frc_tot_length']
           .reset_index(drop=True) /
           df_frc_tot.loc[df_frc_tot['regions'] == 'all', 'frc_tot_length']
           .reset_index(drop=True))
df_pur = pd.DataFrame({'length_cutoff': length_cutoffs, 'purity': ser_pur})
print(df_pur)
# length_cutoff purity
# 0 10 1.004016
# 1 100 1.240000
# 2 1000 inf
IIUC, you can do it like this:
import numpy as np

length_cutoffs = [10, 100, 1000]
df["bins"] = pd.cut(
    df["length"],
    pd.IntervalIndex.from_breaks([-np.inf] + length_cutoffs + [np.inf], closed="left"),
)
out = df.pivot_table(index=["regions", "bins"], values="length", aggfunc="sum")
g = out.groupby(level=0)
out["frc_tot_length"] = (
    g["length"].transform(lambda x: [x.iloc[i:].sum() for i in range(len(x))])
) / g["length"].sum()
print(out)
print()
Prints:
length frc_tot_length
regions bins
all [-inf, 10.0) 1 1.000
[10.0, 100.0) 49 0.996
[100.0, 1000.0) 200 0.800
[1000.0, inf) 0 0.000
captured [-inf, 10.0) 0 1.000
[10.0, 100.0) 20 1.000
[100.0, 1000.0) 480 0.992
[1000.0, inf) 2000 0.800
Then:
x = out.unstack(level=0)
x = x[("frc_tot_length", "captured")] / x[("frc_tot_length", "all")]
print(x)
Prints:
bins
[-inf, 10.0) 1.000000
[10.0, 100.0) 1.004016
[100.0, 1000.0) 1.240000
[1000.0, inf) inf
dtype: float64
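If you want the question's df_pur layout (length_cutoff vs. purity), the ratio Series can be reshaped back into a DataFrame. A sketch, using a stand-in Series with the values printed above; the leading (-inf, 10) bin is dropped because each remaining bin's lower edge is exactly one of the cutoffs:

```python
import numpy as np
import pandas as pd

length_cutoffs = [10, 100, 1000]
# Stand-in for the ratio Series x computed above (one value per bin).
x = pd.Series([1.000000, 1.004016, 1.240000, np.inf])

# Drop the open-ended first bin and pair each remaining ratio with its
# cutoff to recover the df_pur layout from the question.
df_pur = pd.DataFrame({"length_cutoff": length_cutoffs,
                       "purity": x.iloc[1:].to_numpy()})
print(df_pur)
```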