Pandas memory error when creating new columns with apply() and a custom function

Question    Votes: 1    Answers: 1

A function that computes the mean log(1 + TPM) of two replicates:

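import numpy as np

def average_TPM(a,b):
    # log(1 + TPM) for each replicate
    log_a = np.log(1+a)
    log_b = np.log(1+b)
    # average only when both replicates exceed the 0.1 threshold
    if log_a > 0.1 and log_b > 0.1:
        avg = np.mean([log_a,log_b])
    else:
        avg = np.nan
    return avg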

Applying the function to df to create the new columns:

df.loc[:,'leaf'] = df.apply(lambda row:  average_TPM(row['leaf1'],row['leaf2']),axis=1)
df.loc[:,'flag_leaf'] = df.apply(lambda row:  average_TPM(row['flag_leaf1'],row['flag_leaf2']),axis=1)
df.loc[:,'anther'] = df.apply(lambda row:  average_TPM(row['anther1'],row['anther2']),axis=1)
df.loc[:,'premeiotic'] = df.apply(lambda row:  average_TPM(row['premeiotic1'],row['premeiotic2']),axis=1)
df.loc[:,'leptotene'] = df.apply(lambda row:  average_TPM(row['leptotene1'],row['leptotene2']),axis=1)
df.loc[:,'zygotene'] = df.apply(lambda row:  average_TPM(row['zygotene1'],row['zygotene2']),axis=1)
df.loc[:,'pachytene'] = df.apply(lambda row:  average_TPM(row['pachytene1'],row['pachytene2']),axis=1)
df.loc[:,'diplotene'] = df.apply(lambda row:  average_TPM(row['diplotene1'],row['diplotene2']),axis=1)
df.loc[:,'metaphase_I'] = df.apply(lambda row:  average_TPM(row['metaphaseI_1'],row['metaphaseI_2']),axis=1)
df.loc[:,'metaphase_II'] = df.apply(lambda row:  average_TPM(row['metaphaseII_1'],row['metaphaseII_2']),axis=1)
df.loc[:,'pollen'] = df.apply(lambda row:  average_TPM(row['pollen1'],row['pollen2']),axis=1)
python pandas memory-management vectorization apply
1 Answer
Votes: 1

I'm not sure why you get a memory error, but the problem can be vectorized:

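import numpy as np
import pandas as pd

# dummy data
np.random.seed(2)  # call seed(); `np.random.seed = 2` would overwrite the function without seeding
df = pd.DataFrame(np.random.random(8*4).reshape(8,-1), columns=['a1','a2','b1','b2'])
print(df)  # the table below is from the original, unseeded run, so your values will differ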
         a1        a2        b1        b2
0  0.416493  0.964483  0.089547  0.218952
1  0.655331  0.468490  0.272494  0.652915
2  0.680433  0.461191  0.919223  0.552074
3  0.077158  0.138839  0.385818  0.462848
4  0.149198  0.912372  0.893708  0.081125
5  0.255422  0.143502  0.466123  0.524544
6  0.842095  0.486603  0.628405  0.686393
7  0.329461  0.714052  0.176126  0.566491

Define the list of columns you want to create, then apply np.log1p to the whole data at once:

col_create = ['a','b']  # what you need to redefine for your problem
col_get = [f'{col}{i}' for col in col_create for i in range(1,3)]  # to ensure the order of columns
arr_log = np.log1p(df[col_get].to_numpy())

Now you can use np.where to vectorize the comparison, and assign to create the new columns.
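The code for the last step did not survive on this page, so here is a minimal sketch of how np.where and assign could be combined. It is an assumption about what the answer intended, not the answerer's original code. The toy data, the rep1/rep2 names and the use of default_rng are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real TPM table (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8, 4)), columns=["a1", "a2", "b1", "b2"])

col_create = ["a", "b"]  # new columns to create
# Replicate columns ordered as a1, a2, b1, b2 so that pairs sit side by side.
col_get = [f"{col}{i}" for col in col_create for i in range(1, 3)]

# One log1p call over the whole block instead of one call per row.
arr_log = np.log1p(df[col_get].to_numpy())
rep1 = arr_log[:, ::2]   # first replicate of each pair
rep2 = arr_log[:, 1::2]  # second replicate of each pair

# Mean of the two logs where both exceed 0.1, NaN otherwise.
avg = np.where((rep1 > 0.1) & (rep2 > 0.1), (rep1 + rep2) / 2, np.nan)

# Attach each column of `avg` under its new name.
df = df.assign(**dict(zip(col_create, avg.T)))
```

In the question's table, some replicate names (for example metaphaseI_1 and metaphaseI_2) do not follow the f'{col}{i}' pattern, so col_get would have to be written out explicitly, or built from a mapping of new column name to replicate pair.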