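To make meaningful comparisons across regions, I want to normalize COVID-19 confirmed cases by the number of days since the outbreak began in each country. For any territory, the day it reaches or exceeds 10 confirmed cases is treated as "day 0 of the outbreak".
Sample dataframe: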
[in]
import pandas as pd
confirmed_cases = {'Date':['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'], 'Australia':[0, 0, 0, 30, 50], 'Albania':[0, 20, 25, 30, 50], 'Algeria':[25, 40, 50, 50, 70]}
df = pd.DataFrame(confirmed_cases)
df
[out]
Date Australia Albania Algeria
0 1/22/20 0 0 25
1 1/23/20 0 20 40
2 1/24/20 0 25 50
3 1/25/20 30 30 50
4 1/26/20 50 50 70
Desired result:
Day Since Outbreak Australia Albania Algeria
0 0 30 20 25
1 1 50 25 40
2 2 NaN 30 50
3 3 NaN 50 50
4 4 NaN NaN 70
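Is there a way to do this with a few simple lines of Python / pandas code?
Determine how many shifts each column needs, based on its initial run of values < 10. Then shift them. cummin ensures that a later, intermittent value < 10 does not count toward the shift.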
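df = df.drop(columns='Date')  # Won't need it
# Count the leading rows below 10 in each column
s = df.lt(10).cummin().sum()
# Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
for col, shift in s.items():
    df[col] = df[col].shift(-shift)
df['Days Since'] = range(len(df))  # Duplicative with index...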
Australia Albania Algeria Days Since
0 30.0 20.0 25 0
1 50.0 25.0 40 1
2 NaN 30.0 50 2
3 NaN 50.0 50 3
4 NaN NaN 70 4
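To see what cummin contributes, here is a minimal sketch on a made-up column (the name "Example" and its values are hypothetical, not from the question) where the count dips below 10 again after the outbreak has started:

```python
import pandas as pd

# Hypothetical column: the count dips below 10 after day 0.
cases = pd.DataFrame({"Example": [0, 3, 12, 8, 15]})

below = cases.lt(10)      # True wherever the count is under 10
leading = below.cummin()  # stays True only until the first False
n_shift = leading.sum()   # leading sub-threshold rows per column

print(below["Example"].tolist())    # [True, True, False, True, False]
print(leading["Example"].tolist())  # [True, True, False, False, False]
print(int(n_shift["Example"]))      # 2
```

Without cummin, the sum would be 3 and the column would be shifted one row too far.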
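For each country, find the index of the first non-zero value and shift each column up by that much. In the sample data the first non-zero value is also the first value of at least 10; to apply the threshold from the question, x.ge(10) can replace x > 0.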
df[['Australia', 'Albania', 'Algeria']].apply(lambda x: x.shift(-(x > 0).idxmax()))
Australia Albania Algeria
0 30.0 20.0 25
1 50.0 25.0 40
2 NaN 30.0 50
3 NaN 50.0 50
4 NaN NaN 70
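A possible variant of the same idea that uses the 10-case threshold from the question and also handles a column that never reaches it (idxmax would return the first row in that case, leaving the column unshifted). The function name align_to_outbreak and its threshold parameter are illustrative choices, not part of the answer above:

```python
import pandas as pd

def align_to_outbreak(cases: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Shift each column up so that row 0 is its first value >= threshold."""
    def shift_one(col: pd.Series) -> pd.Series:
        reached = col.ge(threshold).to_numpy()
        if not reached.any():
            # No outbreak yet in this column: return all NaN
            return pd.Series(float("nan"), index=col.index)
        # argmax gives the position of the first True, independent of the index labels
        return col.shift(-int(reached.argmax()))

    out = cases.apply(shift_one)
    out.index.name = "Day Since Outbreak"
    return out

sample = pd.DataFrame({
    "Australia": [0, 0, 0, 30, 50],
    "Albania": [0, 20, 25, 30, 50],
    "Algeria": [25, 40, 50, 50, 70],
})
aligned = align_to_outbreak(sample)
print(aligned)
```

This assumes the rows are already sorted by date and that the Date column has been dropped or moved to the index beforehand.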
Here's a NumPy-based approach:
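import numpy as np
df_sl = df.loc[:,'Australia':].values
m = df_sl==0  # True where there are no cases yet
# A stable sort of the mask moves non-zero rows to the top and keeps their order
out = df_sl[m.argsort(0, kind='stable'), np.arange(df_sl.shape[1])].astype(float)
out[out==0] = np.nan
df.loc[:,'Australia':] = out
print(df)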
Date Australia Albania Algeria
0 1/22/20 30.0 20.0 25.0
1 1/23/20 50.0 25.0 40.0
2 1/24/20 NaN 30.0 50.0
3 1/25/20 NaN 50.0 50.0
4 1/26/20 NaN NaN 70.0
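The argsort step may be easier to follow on a single column. A small sketch using the Australia values from the sample data; the stable sort kind is an explicit choice here so that the non-zero entries keep their original order:

```python
import numpy as np

col = np.array([0, 0, 0, 30, 50])

mask = col == 0                      # [True, True, True, False, False]
order = mask.argsort(kind="stable")  # False sorts before True

print(order.tolist())       # [3, 4, 0, 1, 2]
print(col[order].tolist())  # [30, 50, 0, 0, 0]
```

Note that this treats every zero as "before the outbreak", so it relies on the counts being cumulative and follows the non-zero rule rather than the 10-case threshold.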