Pandas: How do I normalize a COVID-19 dataframe when countries have different outbreak start days?

Question  Votes: 1  Answers: 4

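To make meaningful comparisons across territories, I want to normalize confirmed COVID-19 cases by the date on which the outbreak began in each country. For any territory, the day on which it reaches or exceeds 10 confirmed cases is treated as "day 0 of the outbreak".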

Example dataframe:

[in]
import pandas as pd
confirmed_cases = {'Date':['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'], 'Australia':[0, 0, 0, 30, 50], 'Albania':[0, 20, 25, 30, 50], 'Algeria':[25, 40, 50, 50, 70]}
df = pd.DataFrame(confirmed_cases)
df

[out]
    Date    Australia   Albania     Algeria
0   1/22/20        0         0          25
1   1/23/20        0        20          40
2   1/24/20        0        25          50
3   1/25/20       30        30          50
4   1/26/20       50        50          70

Desired result:

    Day Since Outbreak     Australia    Albania     Algeria
0           0                    30         20          25
1           1                    50         25          40
2           2                   NaN         30          50
3           3                   NaN         50          50
4           4                   NaN        NaN          70

Is there a way to do this with a few simple lines of Python / pandas code?

python pandas dataframe normalization database-normalization
4 Answers
0
votes

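Using the leading run of values < 10, determine how far each column needs to be shifted, then shift it. cummin ensures that intermittent values < 10 later in the series are not counted toward the shift.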
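df = df.drop(columns='Date')  # Won't need it
# Count the leading run of values < 10 in each column
s = df.lt(10).cummin().sum()

# items() replaces iteritems(), which was removed in pandas 2.0
for col, shift in s.items():
    df[col] = df[col].shift(-shift)

df['Days Since'] = range(len(df))  # Duplicates the index...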

   Australia  Albania  Algeria  Days Since
0       30.0     20.0       25           0
1       50.0     25.0       40           1
2        NaN     30.0       50           2
3        NaN     50.0       50           3
4        NaN      NaN       70           4
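
The same idea can be wrapped in a small helper so that the threshold is a parameter and the index carries the day count. This is only a sketch: the function name `align_to_outbreak` and its parameters are made up for illustration and are not part of the answer above. A column that never reaches the threshold comes out entirely NaN.

```python
import pandas as pd

def align_to_outbreak(df, threshold=10, date_col="Date"):
    """Shift each country column up so that row 0 is its first day at or above `threshold`.

    Hypothetical helper, assuming one row per day and one column per country.
    """
    cases = df.drop(columns=date_col)
    # Per column: length of the leading run of values below the threshold.
    lead = cases.lt(threshold).cummin().sum()
    aligned = pd.DataFrame(
        {col: cases[col].shift(-int(n)) for col, n in lead.items()}
    )
    aligned.index.name = "Day Since Outbreak"
    return aligned

confirmed_cases = {
    'Date': ['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'],
    'Australia': [0, 0, 0, 30, 50],
    'Albania': [0, 20, 25, 30, 50],
    'Algeria': [25, 40, 50, 50, 70],
}
df = pd.DataFrame(confirmed_cases)
result = align_to_outbreak(df)
print(result)
```

Because the original frame is not modified, the helper can be called again with a different threshold for comparison.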

0
votes

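For each country, find the index of the first non-zero value and shift each column up by that many rows. This relies on the default integer index and uses a threshold of > 0 rather than the question's ≥ 10; the two give the same result on the example data.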

df[['Australia', 'Albania', 'Algeria']].apply(lambda x: x.shift(-(x > 0).idxmax()))

   Australia  Albania  Algeria
0       30.0     20.0       25
1       50.0     25.0       40
2        NaN     30.0       50
3        NaN     50.0       50
4        NaN      NaN       70
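
A variant of this apply-based approach that uses the question's threshold of 10 might look like the sketch below. One thing to keep in mind is that `idxmax` (like `argmax`) returns the first position when no value satisfies the condition, so a column that never reaches the threshold would be left unshifted; the sketch guards against that. The function name `shift_to_day_zero` and the `Testland` column are invented for illustration only.

```python
import pandas as pd

THRESHOLD = 10  # "day 0" definition from the question

def shift_to_day_zero(col, threshold=THRESHOLD):
    """Shift one column up to its first value >= threshold (hypothetical helper)."""
    reached = (col >= threshold).to_numpy()
    if not reached.any():
        # No outbreak day yet: return all NaN instead of an unshifted column.
        return pd.Series(float("nan"), index=col.index)
    first = int(reached.argmax())  # position of the first True
    return col.shift(-first)

df = pd.DataFrame({
    'Date': ['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'],
    'Australia': [0, 0, 0, 30, 50],
    'Albania': [0, 20, 25, 30, 50],
    'Algeria': [25, 40, 50, 50, 70],
    'Testland': [0, 0, 1, 2, 5],  # invented column that never reaches 10
})
aligned = df.drop(columns='Date').apply(shift_to_day_zero)
print(aligned)
```

Using the position from `argmax` rather than the label from `idxmax` means the shift does not depend on the frame having a default integer index.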

0
votes

Here is a NumPy-based approach:

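import numpy as np

df_sl = df.loc[:,'Australia':].values
m = df_sl==0
# A stable sort moves zeros to the bottom while keeping the remaining values in order
out = df_sl[m.argsort(0, kind='stable'), np.arange(df_sl.shape[1])].astype(float)
out[out==0] = np.nan
df.loc[:,'Australia':] = out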

print(df)

      Date  Australia  Albania  Algeria
0  1/22/20       30.0     20.0     25.0
1  1/23/20       50.0     25.0     40.0
2  1/24/20        NaN     30.0     50.0
3  1/25/20        NaN     50.0     50.0
4  1/26/20        NaN      NaN     70.0
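
The approach above moves every zero to the bottom of its column, which matches the example data but is not quite the "first day with at least 10 cases" rule from the question. A NumPy sketch that applies that rule directly could look like this; the variable names are illustrative, and it assumes one row per day with the dates in the first column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'],
    'Australia': [0, 0, 0, 30, 50],
    'Albania': [0, 20, 25, 30, 50],
    'Algeria': [25, 40, 50, 50, 70],
})

vals = df.drop(columns='Date').to_numpy(dtype=float)
n_rows, n_cols = vals.shape

hit = vals >= 10                    # True where the threshold is met
first = hit.argmax(axis=0)          # first qualifying row per column
# Source row for every output cell: output row i reads from row i + first[j].
rows = np.arange(n_rows)[:, None] + first
cols = np.broadcast_to(np.arange(n_cols), vals.shape)
valid = rows < n_rows               # cells that still fall inside the data

out = np.full(vals.shape, np.nan)
out[valid] = vals[rows[valid], cols[valid]]
# argmax returns 0 for columns with no True at all, so blank those explicitly.
out[:, ~hit.any(axis=0)] = np.nan

aligned = pd.DataFrame(out, columns=df.columns[1:])
print(aligned)
```

Building a new frame from `out` also avoids assigning float values back into the integer columns of `df`.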