Fill N/A with the previous day's data


I have a DataFrame that only has data for working days. Here is a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'BAS_DT': ['2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-05', '2023-01-05', '2023-01-06', '2023-01-07'], 
                   'CUS_NO': ['', '', '900816636', '900816636', '900816946', '900816931', '', '']})
df

    BAS_DT      CUS_NO
0   2023-01-02  
1   2023-01-03  
2   2023-01-04  900816636
3   2023-01-05  900816636
4   2023-01-05  900816946
5   2023-01-05  900816931
6   2023-01-06  
7   2023-01-07  

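I want to fill 2023-01-06 and 2023-01-07 with the same CUS_NO values as 2023-01-05. I tried ffill, but it only copied down the single row closest to the NaN rows. Here is my desired output: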

    BAS_DT      CUS_NO
0   2023-01-02  
1   2023-01-03  
2   2023-01-04  900816636
3   2023-01-05  900816636
4   2023-01-05  900816946
5   2023-01-05  900816931
6   2023-01-06  900816636
7   2023-01-06  900816946
8   2023-01-06  900816931
9   2023-01-07  900816636
10  2023-01-07  900816946
11  2023-01-07  900816931       

Thank you.

python pandas
1 Answer

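The forward fill (ffill) method does not seem to work as expected, because the 'CUS_NO' field for the dates '2023-01-06' and '2023-01-07' is not filled with the values from '2023-01-05'. This is probably because empty strings are not recognized as NA values that can be forward filled.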
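
As a minimal sketch of that behaviour (the Series `s` below is made up for illustration and only mimics the tail of the 'CUS_NO' column), ffill leaves empty strings alone, and once they are converted to NA it copies down only the last valid value, which matches what the question observed:

```python
import pandas as pd

# Hypothetical stand-in for the end of the CUS_NO column
s = pd.Series(['900816636', '900816946', '900816931', '', ''])

# Empty strings are ordinary values, so ffill has nothing to fill
unchanged = s.ffill()

# After converting '' to NA, ffill repeats only the last valid value
filled = s.replace('', pd.NA).ffill()

print(unchanged.tolist())
print(filled.tolist())
```

So converting to NA is necessary but not sufficient: ffill can never turn one empty row into three rows.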

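What we need to do is first replace the empty strings with actual NA values (None or pd.NA), and then apply the fill to the rows where BAS_DT is '2023-01-06' or '2023-01-07'. Because '2023-01-05' has three rows while each of those dates has only one, a plain forward fill is not enough: the '2023-01-05' rows also have to be duplicated once for each date being filled. I will make this correction and show you the updated DataFrame.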


import pandas as pd

# Assuming 'df' is your initial DataFrame

# Parse the date strings so they can be compared with Timestamps
df['BAS_DT'] = pd.to_datetime(df['BAS_DT'])

# Replace empty strings with NA so missing values are recognized as such
df['CUS_NO'] = df['CUS_NO'].replace('', pd.NA)

# Duplicate the rows for '2023-01-05' once for each date that needs filling
dates_to_fill = pd.to_datetime(['2023-01-06', '2023-01-07'])
rows_to_duplicate = df[df['BAS_DT'] == pd.Timestamp('2023-01-05')]
rows_to_add = pd.concat([rows_to_duplicate.assign(BAS_DT=d) for d in dates_to_fill], ignore_index=True)

# Drop the empty placeholder rows for the dates being filled
df = df[~df['BAS_DT'].isin(dates_to_fill)]

# Combine the remaining rows with the new rows; a stable sort keeps the customer order within each date
result_df = pd.concat([df, rows_to_add]).sort_values(by='BAS_DT', kind='stable').reset_index(drop=True)

# Display the final dataframe
print(result_df)

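Run this code after setting up the initial DataFrame and it should give you the desired output, with the 'CUS_NO' values for '2023-01-06' and '2023-01-07' filled in from the 'CUS_NO' values of '2023-01-05'. The rows for earlier dates that have no customer show up as <NA> rather than as blanks.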
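
If the dates to fill are not known in advance, the same idea can be generalized. The following is only a sketch, not part of the original answer: the helper name `fill_from_previous_day` is hypothetical, and it assumes that every date is either entirely empty or entirely filled and that 'BAS_DT' holds ISO-formatted strings (which sort chronologically).

```python
import pandas as pd

def fill_from_previous_day(df, date_col='BAS_DT', value_col='CUS_NO'):
    """Replace empty dates with copies of the most recent non-empty date's rows."""
    parts = []
    last_filled = None  # rows of the most recent date that had values
    # groupby sorts the groups by date and keeps row order within each group
    for date, group in df.groupby(date_col, sort=True):
        if group[value_col].ne('').any():
            last_filled = group
            parts.append(group)
        elif last_filled is not None:
            # Reuse the previous day's rows under the current date
            parts.append(last_filled.assign(**{date_col: date}))
        else:
            parts.append(group)  # nothing earlier to copy from
    return pd.concat(parts, ignore_index=True)

df = pd.DataFrame({'BAS_DT': ['2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
                              '2023-01-05', '2023-01-05', '2023-01-06', '2023-01-07'],
                   'CUS_NO': ['', '', '900816636', '900816636', '900816946',
                              '900816931', '', '']})
result = fill_from_previous_day(df)
print(result)
```

This version leaves the empty strings for '2023-01-02' and '2023-01-03' as they are, since there is no earlier day to copy from.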
