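I have a dataframe containing customer IDs and their expenses for 2014-2018. What I want is a column in the dataframe with the mean of each ID's 2014-2018 expenses. There is one condition, however: if any of the cells in a row (2014-2018) is empty, NaN should be returned. So I only want the mean computed when all 5 cells in the 2014-2018 columns of a row hold a numeric value.
Initial dataframe: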
2014 2015 2016 2017 2018 ID
100 122.0 324 632 NaN 12.0
120 159.0 54 452 541.0 96.0
NaN 164.0 687 165 245.0 20.0
180 421.0 512 184 953.0 73.0
110 654.0 913 173 103.0 84.0
130 NaN 754 124 207.0 26.0
170 256.0 843 97 806.0 87.0
140 754.0 95 101 541.0 64.0
80 985.0 184 84 90.0 11.0
96 65.0 127 130 421.0 34.0
Desired output:
2014 2015 2016 2017 2018 ID mean
100 122.0 324 632 NaN 12.0 NaN
120 159.0 54 452 541.0 96.0 265.20
NaN 164.0 687 165 245.0 20.0 NaN
180 421.0 512 184 953.0 73.0 450.00
110 654.0 913 173 103.0 84.0 390.60
130 NaN 754 124 207.0 26.0 NaN
170 256.0 843 97 806.0 87.0 434.40
140 754.0 95 101 541.0 64.0 326.20
80 985.0 184 84 90.0 11.0 284.60
96 65.0 127 130 421.0 34.0 167.80
The code I tried is below. It only computes the mean and ignores the NaN condition. Is there a short lambda function that could add this condition to the code?
import pandas as pd
import numpy as np
data = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"2014": [100,120,np.nan,180,110,130,170,140,80,96],
"2015": [122,159,164,421,654,np.nan,256,754,985,65],
"2016": [324,54,687,512,913,754,843,95,184,127],
"2017": [632,452,165,184,173,124,97,101,84,130],
"2018": [np.nan,541,245,953,103,207,806,541,90,421]})
print(data)
# If a cell in any of these columns is empty (NaN) for a row, the new 'mean' column should be NaN for that row.
# I only want the mean when all 5 cells in the row have a numeric value.
fiveyear = ["2014", "2015", "2016", "2017", "2018"]
data.loc[:, 'mean'] = data[fiveyear].mean(axis=1)
print(data)
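Use dropna to remove the incomplete rows before computing the mean. Because pandas aligns on the index when the result is assigned back, and those rows are missing from the result, they end up as NaN in the new column (df here is your dataframe, called data in the question).
df['mean'] = df[fiveyear].dropna(how='any').mean(axis=1)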
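To make the index alignment visible, here is a minimal sketch that uses only the first four rows of the question's data. The intermediate names complete_rows and row_means are just illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# First four rows of the question's data: rows 0 and 2 each contain one NaN.
df = pd.DataFrame({
    "ID": [12, 96, 20, 73],
    "2014": [100, 120, np.nan, 180],
    "2015": [122, 159, 164, 421],
    "2016": [324, 54, 687, 512],
    "2017": [632, 452, 165, 184],
    "2018": [np.nan, 541, 245, 953],
})
fiveyear = ["2014", "2015", "2016", "2017", "2018"]

# Keep only the rows where all five year columns are filled in.
complete_rows = df[fiveyear].dropna(how="any")

# The row means keep the original index labels of the surviving rows (1 and 3).
row_means = complete_rows.mean(axis=1)

# Assignment aligns on the index, so labels 0 and 2 have no match and become NaN.
df["mean"] = row_means
print(df)
```

Printing row_means on its own shows that it has only two entries, which is why the other two rows end up as NaN after the assignment.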
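Alternatively, compute the mean for every row and then mask the result wherever a row has any null value (axis is passed by keyword, since recent pandas versions no longer accept it positionally for any):
df['mean'] = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))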
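As a sketch of the masking idea, again on the first four rows of the question's data, the snippet below builds the mask explicitly and compares it with a third option that is not in the answer above: passing skipna=False to mean, which makes any NaN in a row propagate into that row's result. The names has_gap, masked and strict are illustrative.

```python
import numpy as np
import pandas as pd

# First four rows of the question's data: rows 0 and 2 each contain one NaN.
df = pd.DataFrame({
    "ID": [12, 96, 20, 73],
    "2014": [100, 120, np.nan, 180],
    "2015": [122, 159, 164, 421],
    "2016": [324, 54, 687, 512],
    "2017": [632, 452, 165, 184],
    "2018": [np.nan, 541, 245, 953],
})
fiveyear = ["2014", "2015", "2016", "2017", "2018"]

# True for every row that has at least one missing year.
has_gap = df[fiveyear].isna().any(axis=1)

# Mean over the available values, then blank out the rows flagged above.
masked = df[fiveyear].mean(axis=1).mask(has_gap)

# Same outcome without a mask: do not skip NaN while averaging.
strict = df[fiveyear].mean(axis=1, skipna=False)

df["mean"] = masked
print(df)
```

On this sample both approaches give NaN for rows 0 and 2 and the ordinary mean for the other rows, so which one to use is mostly a matter of readability.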