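I have a dictionary of hourly measurements in which some entries are missing (gaps). My current approach is to create a DataFrame with an hourly datetime index, prefilled with NaN. I then replace the values in the DataFrame with those from gasDict (see below). Afterwards the DataFrame is interpolated to eliminate the NaNs.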
import pandas as pd
import numpy as np
dataRange = pd.date_range(pd.to_datetime('2023-01-01 01:00:00'), pd.to_datetime('2023-01-01 05:00:00'), freq='H')
df = pd.DataFrame(np.nan, index=dataRange, columns=['gas'])
df['gas'] = pd.to_numeric(df['gas'], errors='coerce')
gasDict = {'2023-01-01 01:00:00' : 40,
'2023-01-01 03:00:00' : 20
}
# these 3 methods do not work here
# methods from stackoverflow remap-values-in-pandas-column-with-a-dict-preserve-nans
df1 = df['gas'].map(gasDict).fillna(df['gas'])
print(df1)
df2 = df['gas'].map(gasDict)
print(df2)
df3 = df.replace({'gas': gasDict})
print(df3)
# this code is correct but slow:
for key, value in gasDict.items():
    df.at[pd.to_datetime(key), 'gas'] = value
print(df)
Results (only the last one is correct!):
2023-01-01 01:00:00 NaN
2023-01-01 02:00:00 NaN
2023-01-01 03:00:00 NaN
2023-01-01 04:00:00 NaN
2023-01-01 05:00:00 NaN
Freq: H, Name: gas, dtype: float64
2023-01-01 01:00:00 NaN
2023-01-01 02:00:00 NaN
2023-01-01 03:00:00 NaN
2023-01-01 04:00:00 NaN
2023-01-01 05:00:00 NaN
Freq: H, Name: gas, dtype: float64
gas
2023-01-01 01:00:00 NaN
2023-01-01 02:00:00 NaN
2023-01-01 03:00:00 NaN
2023-01-01 04:00:00 NaN
2023-01-01 05:00:00 NaN
gas
2023-01-01 01:00:00 40.0
2023-01-01 02:00:00 NaN
2023-01-01 03:00:00 20.0
2023-01-01 04:00:00 NaN
2023-01-01 05:00:00 NaN
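But the last method is very slow (gasDict has about 10000 entries). What is the right way to do this?
I think it would be better to start from a DataFrame built from the dictionary and then expand its index. To create a DataFrame from a dictionary, you can use DataFrame.from_dict: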
df = pd.DataFrame.from_dict(gasDict, orient='index', columns=['gas'])
Then convert the index to a datetime type:
df.index = df.index.astype("datetime64[ns]")
After that, use the reindex method to expand your index:
df = df.reindex(dataRange)
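Putting the steps above together, a minimal end-to-end sketch might look like the following. It reuses the sample gasDict and dataRange from the question. The pd.to_datetime call for the index conversion, the lowercase 'h' frequency alias, and the variable name filled are my own choices rather than part of the answer above. The final interpolate call is only there because the question mentions interpolating afterwards, and it assumes the default linear method is what you want.

```python
import pandas as pd

# Sample data copied from the question; the real gasDict has about 10000 entries.
gasDict = {'2023-01-01 01:00:00': 40,
           '2023-01-01 03:00:00': 20}

# Full hourly range the final frame should cover.
# Lowercase 'h' is used here because recent pandas versions deprecate 'H'.
dataRange = pd.date_range('2023-01-01 01:00:00', '2023-01-01 05:00:00', freq='h')

# Build the frame from the dictionary in one step instead of a Python loop.
df = pd.DataFrame.from_dict(gasDict, orient='index', columns=['gas'])

# The dictionary keys are strings, so convert the index to datetimes
# before aligning it with dataRange.
df.index = pd.to_datetime(df.index)

# Expand to the full hourly range; hours missing from gasDict become NaN.
df = df.reindex(dataRange)

# Assumption: default linear interpolation is acceptable for filling the gaps.
filled = df['gas'].interpolate()

print(df)
print(filled)
```

The earlier map and replace attempts return only NaN because they look up the values of the gas column (all NaN) in the dictionary, while the dictionary keys correspond to the index. Building the frame from the dictionary and reindexing avoids that lookup altogether.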