I have cumulative sums of data from 198x to 2016, currently in the following form:
State Year Month Value
TN 1987 1 24410.0
TN 1987 2 24410.0
TN 1987 3 24410.0
TN 1987 4 24410.0
.
.
TN 1996 1 24410.0
TN 1996 2 24410.0
TN 1996 3 24410.0
TN 1996 4 24410.0
TN 1996 5 37109.0
TN 1996 6 37109.0
TN 1996 7 37109.0
TN 1996 8 37109.0
TN 1996 9 37109.0
TN 1996 10 37109.0
TN 1996 11 37109.0
TN 1996 12 37109.0
TN 2016 1 49808.0
TN 2016 2 49808.0
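There is actually a gap in the data between 1996 and 2016 (that is TN's case; it varies by state). I need a generic way to fill all the holes in the data, because some years are missing entirely (2010-2015), and I want to fill them in so that the output runs all the way through 2018.
I want each missing value filled with the value that precedes it, to get output like the following: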
TN 1996 4 24410.0
TN 1996 5 37109.0
TN 1996 6 37109.0
.
.
TN 2010 1 37109.0
TN 2010 2 37109.0
TN 2010 3 37109.0
.
.
TN 2016 1 37109.0
TN 2016 2 37109.0
.
.
TN 2016 11 49808.0
TN 2016 12 49808.0
.
.
TN 2017 1 49808.0
TN 2017 2 49808.0
TN 2017 3 49808.0
TN 2017 4 49808.0
.
.
TN 2018 1 49808.0
TN 2018 2 49808.0
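How about pandas' interpolate (DataFrame.interpolate)? It fills missing values by interpolating according to a chosen method.
See the 'interpolate' section here: https://pandas.pydata.org/pandas-docs/stable/missing_data.html
An existing example posted previously: Pandas interpolate() backwards in dataframe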
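Since the question asks for the previous value to be carried forward, it may help to see how interpolation differs from a forward fill. The following is only a minimal sketch on a made-up four-point series (the variable names are assumptions, not part of the question's data): linear interpolation produces intermediate values, while `ffill` repeats the last known value.

```python
import numpy as np
import pandas as pd

# Hypothetical series: two known cumulative values with a two-step gap between them.
s = pd.Series([24410.0, np.nan, np.nan, 37109.0])

# Linear interpolation draws a straight line between the known points.
linear = s.interpolate(method="linear")

# Forward fill repeats the last known value until the next observation.
carried = s.ffill()

print(linear.tolist())
print(carried.tolist())
```

For the step-shaped output shown in the question, the forward fill is the closer match; interpolation would invent values that never occurred in the cumulative series.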
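You can create a dataframe that contains the missing months and merge your data with it:
import pandas as pd
# Month starts from the first to the last year in df (pandas >= 1.4 spells `closed` as `inclusive`)
dates = pd.date_range(start='1/1/%d' % df['Year'].min(),
                      end='1/08/%d' % df['Year'].max(),
                      freq='MS', inclusive='left')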
>> dates
DatetimeIndex(['1987-02-01', '1987-03-01', '1987-04-01', '1987-05-01',
'1987-06-01', '1987-07-01', '1987-08-01', '1987-09-01',
'1987-10-01', '1987-11-01',
...
'2015-04-01', '2015-05-01', '2015-06-01', '2015-07-01',
'2015-08-01', '2015-09-01', '2015-10-01', '2015-11-01',
'2015-12-01', '2016-01-01'],
dtype='datetime64[ns]', length=348, freq='MS')
Then you can build a dataframe that contains every month:
# dates is already in chronological order, so no extra sort is needed
all_months = pd.DataFrame({'Year': dates.year, 'Month': dates.month})
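Then merge it with the original dataframe and forward-fill. The rows must be in chronological order for the forward fill to be meaningful; in the output below, rows 17 onward were appended after the original rows and therefore all carry 49808.0: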
df.merge(all_months, how='right').ffill()
State Year Month Value
0 TN 1987.0 1.0 24410.0
1 TN 1987.0 2.0 24410.0
2 TN 1987.0 3.0 24410.0
3 TN 1987.0 4.0 24410.0
4 TN 1996.0 1.0 24410.0
5 TN 1996.0 2.0 24410.0
6 TN 1996.0 3.0 24410.0
7 TN 1996.0 4.0 24410.0
8 TN 1996.0 5.0 37109.0
9 TN 1996.0 6.0 37109.0
10 TN 1996.0 7.0 37109.0
11 TN 1996.0 8.0 37109.0
12 TN 1996.0 9.0 37109.0
13 TN 1996.0 10.0 37109.0
14 TN 1996.0 11.0 37109.0
15 TN 1996.0 12.0 37109.0
16 TN 2016.0 1.0 49808.0
17 TN 1987.0 5.0 49808.0
18 TN 1987.0 6.0 49808.0
19 TN 1987.0 7.0 49808.0
20 TN 1987.0 8.0 49808.0
21 TN 1987.0 9.0 49808.0
22 TN 1987.0 10.0 49808.0
23 TN 1987.0 11.0 49808.0
24 TN 1987.0 12.0 49808.0
25 TN 1988.0 1.0 49808.0
26 TN 1988.0 2.0 49808.0
27 TN 1988.0 3.0 49808.0
28 TN 1988.0 4.0 49808.0
29 TN 1988.0 5.0 49808.0
.. ... ... ... ...
319 TN 2013.0 7.0 49808.0
320 TN 2013.0 8.0 49808.0
321 TN 2013.0 9.0 49808.0
322 TN 2013.0 10.0 49808.0
323 TN 2013.0 11.0 49808.0
324 TN 2013.0 12.0 49808.0
325 TN 2014.0 1.0 49808.0
326 TN 2014.0 2.0 49808.0
327 TN 2014.0 3.0 49808.0
328 TN 2014.0 4.0 49808.0
329 TN 2014.0 5.0 49808.0
330 TN 2014.0 6.0 49808.0
331 TN 2014.0 7.0 49808.0
332 TN 2014.0 8.0 49808.0
333 TN 2014.0 9.0 49808.0
334 TN 2014.0 10.0 49808.0
335 TN 2014.0 11.0 49808.0
336 TN 2014.0 12.0 49808.0
337 TN 2015.0 1.0 49808.0
338 TN 2015.0 2.0 49808.0
339 TN 2015.0 3.0 49808.0
340 TN 2015.0 4.0 49808.0
341 TN 2015.0 5.0 49808.0
342 TN 2015.0 6.0 49808.0
343 TN 2015.0 7.0 49808.0
344 TN 2015.0 8.0 49808.0
345 TN 2015.0 9.0 49808.0
346 TN 2015.0 10.0 49808.0
347 TN 2015.0 11.0 49808.0
348 TN 2015.0 12.0 49808.0
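In the merged output above, rows 17 onward were appended after the original rows, so the forward fill gave every one of them 49808.0 from the 2016-01 row. The following is a minimal sketch of the same merge idea that sorts chronologically before filling and extends the calendar through 2018, as the question asks. The small sample frame, the name `filled`, and the hard-coded end date are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample mirroring the question's layout (one state, gaps in the months).
df = pd.DataFrame({
    "State": ["TN"] * 5,
    "Year":  [1996, 1996, 1996, 2016, 2016],
    "Month": [4, 5, 6, 1, 2],
    "Value": [24410.0, 37109.0, 37109.0, 49808.0, 49808.0],
})

# Every month start from the first observed year through December 2018.
dates = pd.date_range(start=f"{df['Year'].min()}-01-01", end="2018-12-01", freq="MS")
all_months = pd.DataFrame({"Year": dates.year, "Month": dates.month})

# Left-merge the data onto the full calendar and make the order explicit.
filled = (
    all_months.merge(df, on=["Year", "Month"], how="left")
    .sort_values(["Year", "Month"])
    .reset_index(drop=True)
)

# Forward-fill only the columns that should be carried over.
filled[["State", "Value"]] = filled[["State", "Value"]].ffill()
filled = filled[["State", "Year", "Month", "Value"]]

print(filled[filled["Year"] == 2016].head(3))
```

Months that precede the first observation (January to March 1996 in this sample) stay empty, because there is no earlier value to carry forward.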
Another solution is to index by date and then resample:
df['Day'] = 1
df1 = df.assign(date= lambda x:pd.to_datetime(x[['Year', 'Month', 'Day']])).set_index('date')
>> df1
State Year Month Value Day
date
1987-01-01 TN 1987.0 1.0 24410.0 1
1987-02-01 TN 1987.0 2.0 24410.0 1
1987-03-01 TN 1987.0 3.0 24410.0 1
1987-04-01 TN 1987.0 4.0 24410.0 1
1996-01-01 TN 1996.0 1.0 24410.0 1
1996-02-01 TN 1996.0 2.0 24410.0 1
1996-03-01 TN 1996.0 3.0 24410.0 1
1996-04-01 TN 1996.0 4.0 24410.0 1
1996-05-01 TN 1996.0 5.0 37109.0 1
1996-06-01 TN 1996.0 6.0 37109.0 1
1996-07-01 TN 1996.0 7.0 37109.0 1
1996-08-01 TN 1996.0 8.0 37109.0 1
1996-09-01 TN 1996.0 9.0 37109.0 1
1996-10-01 TN 1996.0 10.0 37109.0 1
1996-11-01 TN 1996.0 11.0 37109.0 1
1996-12-01 TN 1996.0 12.0 37109.0 1
2016-01-01 TN 2016.0 1.0 49808.0 1
2016-02-01 TN 2016.0 2.0 49808.0 1
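Then you can resample by month. Note that the forward fill also carries Year and Month over from the last observed row, so in the filled rows those two columns no longer match the date index (see 1987-05-31 in the output):
# 'M' is the month-end alias; pandas 2.2+ prefers 'ME'
res = df1.resample('M').first().ffill()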
>> res
State Year Month Value Day
date
1987-01-31 TN 1987.0 1.0 24410.0 1.0
1987-02-28 TN 1987.0 2.0 24410.0 1.0
1987-03-31 TN 1987.0 3.0 24410.0 1.0
1987-04-30 TN 1987.0 4.0 24410.0 1.0
1987-05-31 TN 1987.0 4.0 24410.0 1.0
1987-06-30 TN 1987.0 4.0 24410.0 1.0
1987-07-31 TN 1987.0 4.0 24410.0 1.0
1987-08-31 TN 1987.0 4.0 24410.0 1.0
1987-09-30 TN 1987.0 4.0 24410.0 1.0
1987-10-31 TN 1987.0 4.0 24410.0 1.0
1987-11-30 TN 1987.0 4.0 24410.0 1.0
1987-12-31 TN 1987.0 4.0 24410.0 1.0
1988-01-31 TN 1987.0 4.0 24410.0 1.0
1988-02-29 TN 1987.0 4.0 24410.0 1.0
1988-03-31 TN 1987.0 4.0 24410.0 1.0
1988-04-30 TN 1987.0 4.0 24410.0 1.0
1988-05-31 TN 1987.0 4.0 24410.0 1.0
1988-06-30 TN 1987.0 4.0 24410.0 1.0
1988-07-31 TN 1987.0 4.0 24410.0 1.0
1988-08-31 TN 1987.0 4.0 24410.0 1.0
1988-09-30 TN 1987.0 4.0 24410.0 1.0
1988-10-31 TN 1987.0 4.0 24410.0 1.0
1988-11-30 TN 1987.0 4.0 24410.0 1.0
1988-12-31 TN 1987.0 4.0 24410.0 1.0
1989-01-31 TN 1987.0 4.0 24410.0 1.0
1989-02-28 TN 1987.0 4.0 24410.0 1.0
1989-03-31 TN 1987.0 4.0 24410.0 1.0
1989-04-30 TN 1987.0 4.0 24410.0 1.0
1989-05-31 TN 1987.0 4.0 24410.0 1.0
1989-06-30 TN 1987.0 4.0 24410.0 1.0
... ... ... ... ... ...
2013-09-30 TN 1996.0 12.0 37109.0 1.0
2013-10-31 TN 1996.0 12.0 37109.0 1.0
2013-11-30 TN 1996.0 12.0 37109.0 1.0
2013-12-31 TN 1996.0 12.0 37109.0 1.0
2014-01-31 TN 1996.0 12.0 37109.0 1.0
2014-02-28 TN 1996.0 12.0 37109.0 1.0
2014-03-31 TN 1996.0 12.0 37109.0 1.0
2014-04-30 TN 1996.0 12.0 37109.0 1.0
2014-05-31 TN 1996.0 12.0 37109.0 1.0
2014-06-30 TN 1996.0 12.0 37109.0 1.0
2014-07-31 TN 1996.0 12.0 37109.0 1.0
2014-08-31 TN 1996.0 12.0 37109.0 1.0
2014-09-30 TN 1996.0 12.0 37109.0 1.0
2014-10-31 TN 1996.0 12.0 37109.0 1.0
2014-11-30 TN 1996.0 12.0 37109.0 1.0
2014-12-31 TN 1996.0 12.0 37109.0 1.0
2015-01-31 TN 1996.0 12.0 37109.0 1.0
2015-02-28 TN 1996.0 12.0 37109.0 1.0
2015-03-31 TN 1996.0 12.0 37109.0 1.0
2015-04-30 TN 1996.0 12.0 37109.0 1.0
2015-05-31 TN 1996.0 12.0 37109.0 1.0
2015-06-30 TN 1996.0 12.0 37109.0 1.0
2015-07-31 TN 1996.0 12.0 37109.0 1.0
2015-08-31 TN 1996.0 12.0 37109.0 1.0
2015-09-30 TN 1996.0 12.0 37109.0 1.0
2015-10-31 TN 1996.0 12.0 37109.0 1.0
2015-11-30 TN 1996.0 12.0 37109.0 1.0
2015-12-31 TN 1996.0 12.0 37109.0 1.0
2016-01-31 TN 2016.0 1.0 49808.0 1.0
2016-02-29 TN 2016.0 2.0 49808.0 1.0
You can get back to the original structure with:
>> res.reset_index(drop=True).drop(['Day'], axis=1).head(9)
State Year Month Value
0 TN 1987.0 1.0 24410.0
1 TN 1987.0 2.0 24410.0
2 TN 1987.0 3.0 24410.0
3 TN 1987.0 4.0 24410.0
4 TN 1987.0 4.0 24410.0
5 TN 1987.0 4.0 24410.0
6 TN 1987.0 4.0 24410.0
7 TN 1987.0 4.0 24410.0
8 TN 1987.0 4.0 24410.0
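In the output above, the filled rows repeat Year 1987 and Month 4 because those columns were forward-filled along with Value. One way around that, sketched below under some assumptions, is to forward-fill only Value on a complete monthly index and then rebuild Year and Month from that index. The sketch also handles several states and runs through 2018. The helper `fill_state`, its `end` parameter, and the "KY" rows are hypothetical, and it assumes each (State, Year, Month) combination appears at most once.

```python
import pandas as pd

# Hypothetical two-state sample; "KY" and its values are made up for illustration.
df = pd.DataFrame({
    "State": ["TN", "TN", "TN", "KY", "KY"],
    "Year":  [1996, 1996, 2016, 2000, 2001],
    "Month": [4, 5, 1, 1, 6],
    "Value": [24410.0, 37109.0, 49808.0, 100.0, 250.0],
})


def fill_state(group: pd.DataFrame, end: str = "2018-12-01") -> pd.DataFrame:
    """Forward-fill one state's Value over every month from its first row to `end`."""
    # Build a month-start timestamp for each row from Year and Month.
    date = pd.to_datetime(group[["Year", "Month"]].assign(Day=1))
    values = pd.Series(group["Value"].to_numpy(), index=pd.DatetimeIndex(date)).sort_index()

    # Complete monthly calendar for this state, then carry the last value forward.
    full = pd.date_range(values.index.min(), end, freq="MS")
    values = values.reindex(full).ffill()

    # Rebuild Year and Month from the index so they match each filled row.
    return pd.DataFrame({
        "State": group["State"].iloc[0],
        "Year": full.year,
        "Month": full.month,
        "Value": values.to_numpy(),
    })


result = pd.concat(
    [fill_state(g) for _, g in df.groupby("State")],
    ignore_index=True,
)

print(result[(result["State"] == "TN") & (result["Year"] == 2017)].head(3))
```

Filling each state separately keeps one state's last value from leaking into the next state's first rows, which a single forward fill over the whole frame would do.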