Filling in missing past and future data for a partial dataset in Python

Question (votes: 0, answers: 2)

I have cumulative-sum data running from 198x to 2016, currently in the following form:

State   Year    Month   Value
TN      1987    1       24410.0
TN      1987    2       24410.0
TN      1987    3       24410.0
TN      1987    4       24410.0
.
.
TN      1996    1       24410.0
TN      1996    2       24410.0
TN      1996    3       24410.0
TN      1996    4       24410.0
TN      1996    5       37109.0
TN      1996    6       37109.0
TN      1996    7       37109.0
TN      1996    8       37109.0
TN      1996    9       37109.0
TN      1996    10      37109.0
TN      1996    11      37109.0
TN      1996    12      37109.0
TN      2016    1       49808.0
TN      2016    2       49808.0

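The data actually skips from 1996 to 2016 (that is the case for TN; it varies by state). I need a general way to fill all the gaps in the data, because some years are absent (2010-2015), and I want to fill them so that the output runs all the way to 2018.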

I want each missing value filled with the value that precedes it, to get output like the following:

TN      1996    4       24410.0
TN      1996    5       37109.0
TN      1996    6       37109.0
.
.
TN      2010    1       37109.0
TN      2010    2       37109.0
TN      2010    3       37109.0
.
.
TN      2016    1       37109.0
TN      2016    2       37109.0
.
.
TN      2016    11      49808.0
TN      2016    12      49808.0
.
.
TN      2017    1       49808.0
TN      2017    2       49808.0
TN      2017    3       49808.0
TN      2017    4       49808.0
.
.
TN      2018    1       49808.0
TN      2018    2       49808.0
python python-3.x pandas dataframe missing-data
2 Answers

0 votes

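How about pandas' DataFrame.interpolate? It interpolates missing values according to different methods.

See the 'interpolate' section here: https://pandas.pydata.org/pandas-docs/stable/missing_data.html

An existing example posted previously: Pandas interpolate() backwards in dataframe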
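Whether interpolation fits depends on the kind of fill wanted. As a minimal sketch, using a made-up four-element Series built from two of the question's values, the default linear interpolation can be contrasted with a forward fill. Both only fill NaN entries that already exist, so months that are absent as rows would have to be added first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: two known values with two gaps in between.
s = pd.Series([24410.0, np.nan, np.nan, 37109.0])

# Default method is 'linear': gaps get evenly spaced intermediate values.
linear = s.interpolate()

# Forward fill: gaps repeat the last known value (the step shape the question asks for).
stepped = s.ffill()

print(linear.tolist())
print(stepped.tolist())
```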


0 votes

You can create a dataframe containing the missing months and merge your data with it:

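import pandas as pd
# Month-start dates from the first to the last year (pandas < 1.4 used closed='left')
dates = pd.date_range(start='1/1/%d' % df['Year'].min(),
                      end='1/08/%d' % df['Year'].max(),
                      freq='MS', inclusive='left')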

>> dates

DatetimeIndex(['1987-02-01', '1987-03-01', '1987-04-01', '1987-05-01',
               '1987-06-01', '1987-07-01', '1987-08-01', '1987-09-01',
               '1987-10-01', '1987-11-01',
               ...
               '2015-04-01', '2015-05-01', '2015-06-01', '2015-07-01',
               '2015-08-01', '2015-09-01', '2015-10-01', '2015-11-01',
               '2015-12-01', '2016-01-01'],
              dtype='datetime64[ns]', length=348, freq='MS')

Then you can create a dataframe containing all the months:

# One row per month, already in chronological order
all_months = pd.DataFrame({'Year': dates.year, 'Month': dates.month})

Then merge it with the original dataframe and forward-fill:

df.merge(all_months, how='right').ffill()

    State    Year  Month    Value
0      TN  1987.0    1.0  24410.0
1      TN  1987.0    2.0  24410.0
2      TN  1987.0    3.0  24410.0
3      TN  1987.0    4.0  24410.0
4      TN  1996.0    1.0  24410.0
5      TN  1996.0    2.0  24410.0
6      TN  1996.0    3.0  24410.0
7      TN  1996.0    4.0  24410.0
8      TN  1996.0    5.0  37109.0
9      TN  1996.0    6.0  37109.0
10     TN  1996.0    7.0  37109.0
11     TN  1996.0    8.0  37109.0
12     TN  1996.0    9.0  37109.0
13     TN  1996.0   10.0  37109.0
14     TN  1996.0   11.0  37109.0
15     TN  1996.0   12.0  37109.0
16     TN  2016.0    1.0  49808.0
17     TN  1987.0    5.0  49808.0
18     TN  1987.0    6.0  49808.0
19     TN  1987.0    7.0  49808.0
20     TN  1987.0    8.0  49808.0
21     TN  1987.0    9.0  49808.0
22     TN  1987.0   10.0  49808.0
23     TN  1987.0   11.0  49808.0
24     TN  1987.0   12.0  49808.0
25     TN  1988.0    1.0  49808.0
26     TN  1988.0    2.0  49808.0
27     TN  1988.0    3.0  49808.0
28     TN  1988.0    4.0  49808.0
29     TN  1988.0    5.0  49808.0
..    ...     ...    ...      ...
319    TN  2013.0    7.0  49808.0
320    TN  2013.0    8.0  49808.0
321    TN  2013.0    9.0  49808.0
322    TN  2013.0   10.0  49808.0
323    TN  2013.0   11.0  49808.0
324    TN  2013.0   12.0  49808.0
325    TN  2014.0    1.0  49808.0
326    TN  2014.0    2.0  49808.0
327    TN  2014.0    3.0  49808.0
328    TN  2014.0    4.0  49808.0
329    TN  2014.0    5.0  49808.0
330    TN  2014.0    6.0  49808.0
331    TN  2014.0    7.0  49808.0
332    TN  2014.0    8.0  49808.0
333    TN  2014.0    9.0  49808.0
334    TN  2014.0   10.0  49808.0
335    TN  2014.0   11.0  49808.0
336    TN  2014.0   12.0  49808.0
337    TN  2015.0    1.0  49808.0
338    TN  2015.0    2.0  49808.0
339    TN  2015.0    3.0  49808.0
340    TN  2015.0    4.0  49808.0
341    TN  2015.0    5.0  49808.0
342    TN  2015.0    6.0  49808.0
343    TN  2015.0    7.0  49808.0
344    TN  2015.0    8.0  49808.0
345    TN  2015.0    9.0  49808.0
346    TN  2015.0   10.0  49808.0
347    TN  2015.0   11.0  49808.0
348    TN  2015.0   12.0  49808.0
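In the output shown above, the months added by the merge are listed after the original rows, so ffill() carries 49808.0 into 1987 and 1988. A minimal sketch, assuming the intent is a chronological fill, sorts by Year and Month before filling instead of relying on the row order the merge returns. The small df is a stand-in built from a few of the question's rows, and the date range (every month of the first through last observed year) is an assumption.

```python
import pandas as pd

# Toy stand-in for the question's data (values copied from its sample rows).
df = pd.DataFrame({
    'State': ['TN'] * 5,
    'Year':  [1996, 1996, 1996, 2016, 2016],
    'Month': [4, 5, 6, 1, 2],
    'Value': [24410.0, 37109.0, 37109.0, 49808.0, 49808.0],
})

# Every month from January of the first year to December of the last year.
dates = pd.date_range(start=f"{df['Year'].min()}-01-01",
                      end=f"{df['Year'].max()}-12-01", freq='MS')
all_months = pd.DataFrame({'Year': dates.year, 'Month': dates.month})

# Merge, put the rows in time order, and only then forward-fill.
filled = (df.merge(all_months, on=['Year', 'Month'], how='right')
            .sort_values(['Year', 'Month'])
            .ffill()
            .reset_index(drop=True))

print(filled.head(8))
```

Months that come before the first observation (here January to March 1996) stay NaN, because a forward fill has nothing earlier to copy from.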

Using pandas resample

Another solution is to index the data by date and then resample on that index:

# Add a day column so Year/Month/Day can be assembled into a date
df['Day'] = 1

df1 = df.assign(date=lambda x: pd.to_datetime(x[['Year', 'Month', 'Day']])).set_index('date')

>> df1

           State    Year  Month    Value  Day
date                                         
1987-01-01    TN  1987.0    1.0  24410.0    1
1987-02-01    TN  1987.0    2.0  24410.0    1
1987-03-01    TN  1987.0    3.0  24410.0    1
1987-04-01    TN  1987.0    4.0  24410.0    1
1996-01-01    TN  1996.0    1.0  24410.0    1
1996-02-01    TN  1996.0    2.0  24410.0    1
1996-03-01    TN  1996.0    3.0  24410.0    1
1996-04-01    TN  1996.0    4.0  24410.0    1
1996-05-01    TN  1996.0    5.0  37109.0    1
1996-06-01    TN  1996.0    6.0  37109.0    1
1996-07-01    TN  1996.0    7.0  37109.0    1
1996-08-01    TN  1996.0    8.0  37109.0    1
1996-09-01    TN  1996.0    9.0  37109.0    1
1996-10-01    TN  1996.0   10.0  37109.0    1
1996-11-01    TN  1996.0   11.0  37109.0    1
1996-12-01    TN  1996.0   12.0  37109.0    1
2016-01-01    TN  2016.0    1.0  49808.0    1
2016-02-01    TN  2016.0    2.0  49808.0    1

Then you can resample by month:

    # 'M' bins by month end; pandas 2.2+ spells this alias 'ME'
    res = df1.resample('M').first().ffill()

    >> res 

               State    Year  Month    Value  Day
    date                                         
    1987-01-31    TN  1987.0    1.0  24410.0  1.0
    1987-02-28    TN  1987.0    2.0  24410.0  1.0
    1987-03-31    TN  1987.0    3.0  24410.0  1.0
    1987-04-30    TN  1987.0    4.0  24410.0  1.0
    1987-05-31    TN  1987.0    4.0  24410.0  1.0
    1987-06-30    TN  1987.0    4.0  24410.0  1.0
    1987-07-31    TN  1987.0    4.0  24410.0  1.0
    1987-08-31    TN  1987.0    4.0  24410.0  1.0
    1987-09-30    TN  1987.0    4.0  24410.0  1.0
    1987-10-31    TN  1987.0    4.0  24410.0  1.0
    1987-11-30    TN  1987.0    4.0  24410.0  1.0
    1987-12-31    TN  1987.0    4.0  24410.0  1.0
    1988-01-31    TN  1987.0    4.0  24410.0  1.0
    1988-02-29    TN  1987.0    4.0  24410.0  1.0
    1988-03-31    TN  1987.0    4.0  24410.0  1.0
    1988-04-30    TN  1987.0    4.0  24410.0  1.0
    1988-05-31    TN  1987.0    4.0  24410.0  1.0
    1988-06-30    TN  1987.0    4.0  24410.0  1.0
    1988-07-31    TN  1987.0    4.0  24410.0  1.0
    1988-08-31    TN  1987.0    4.0  24410.0  1.0
    1988-09-30    TN  1987.0    4.0  24410.0  1.0
    1988-10-31    TN  1987.0    4.0  24410.0  1.0
    1988-11-30    TN  1987.0    4.0  24410.0  1.0
    1988-12-31    TN  1987.0    4.0  24410.0  1.0
    1989-01-31    TN  1987.0    4.0  24410.0  1.0
    1989-02-28    TN  1987.0    4.0  24410.0  1.0
    1989-03-31    TN  1987.0    4.0  24410.0  1.0
    1989-04-30    TN  1987.0    4.0  24410.0  1.0
    1989-05-31    TN  1987.0    4.0  24410.0  1.0
    1989-06-30    TN  1987.0    4.0  24410.0  1.0
    ...          ...     ...    ...      ...  ...
    2013-09-30    TN  1996.0   12.0  37109.0  1.0
    2013-10-31    TN  1996.0   12.0  37109.0  1.0
    2013-11-30    TN  1996.0   12.0  37109.0  1.0
    2013-12-31    TN  1996.0   12.0  37109.0  1.0
    2014-01-31    TN  1996.0   12.0  37109.0  1.0
    2014-02-28    TN  1996.0   12.0  37109.0  1.0
    2014-03-31    TN  1996.0   12.0  37109.0  1.0
    2014-04-30    TN  1996.0   12.0  37109.0  1.0
    2014-05-31    TN  1996.0   12.0  37109.0  1.0
    2014-06-30    TN  1996.0   12.0  37109.0  1.0
    2014-07-31    TN  1996.0   12.0  37109.0  1.0
    2014-08-31    TN  1996.0   12.0  37109.0  1.0
    2014-09-30    TN  1996.0   12.0  37109.0  1.0
    2014-10-31    TN  1996.0   12.0  37109.0  1.0
    2014-11-30    TN  1996.0   12.0  37109.0  1.0
    2014-12-31    TN  1996.0   12.0  37109.0  1.0
    2015-01-31    TN  1996.0   12.0  37109.0  1.0
    2015-02-28    TN  1996.0   12.0  37109.0  1.0
    2015-03-31    TN  1996.0   12.0  37109.0  1.0
    2015-04-30    TN  1996.0   12.0  37109.0  1.0
    2015-05-31    TN  1996.0   12.0  37109.0  1.0
    2015-06-30    TN  1996.0   12.0  37109.0  1.0
    2015-07-31    TN  1996.0   12.0  37109.0  1.0
    2015-08-31    TN  1996.0   12.0  37109.0  1.0
    2015-09-30    TN  1996.0   12.0  37109.0  1.0
    2015-10-31    TN  1996.0   12.0  37109.0  1.0
    2015-11-30    TN  1996.0   12.0  37109.0  1.0
    2015-12-31    TN  1996.0   12.0  37109.0  1.0
    2016-01-31    TN  2016.0    1.0  49808.0  1.0
    2016-02-29    TN  2016.0    2.0  49808.0  1.0
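In the output above, the Year and Month columns are forward-filled along with Value, so the row for 1987-05-31 still reads 1987 / 4. A minimal sketch, assuming only State and Value should be carried forward, resamples those two columns and rebuilds Year and Month from the date index. The small df is a stand-in built from a few of the question's rows, and 'MS' (month start) is used here instead of the month-end alias.

```python
import pandas as pd

# Toy stand-in for the question's data (values copied from its sample rows).
df = pd.DataFrame({
    'State': ['TN'] * 4,
    'Year':  [1996, 1996, 2016, 2016],
    'Month': [4, 5, 1, 2],
    'Value': [24410.0, 37109.0, 49808.0, 49808.0],
})

# Index each row by the first day of its month.
idx = pd.to_datetime(df[['Year', 'Month']].assign(Day=1))
df1 = df.set_index(idx)

# Carry forward only the columns that should repeat ...
res = df1[['State', 'Value']].resample('MS').first().ffill()

# ... and derive Year/Month from the resampled index rather than filling them.
res['Year'] = res.index.year
res['Month'] = res.index.month
res = res.reset_index(drop=True)[['State', 'Year', 'Month', 'Value']]

print(res.head())
```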

You can get back to the original structure with:

>> res.reset_index(drop=True).drop(['Day'], axis=1).head(9)

        State    Year  Month    Value
    0      TN  1987.0    1.0  24410.0
    1      TN  1987.0    2.0  24410.0
    2      TN  1987.0    3.0  24410.0
    3      TN  1987.0    4.0  24410.0
    4      TN  1987.0    4.0  24410.0
    5      TN  1987.0    4.0  24410.0
    6      TN  1987.0    4.0  24410.0
    7      TN  1987.0    4.0  24410.0
    8      TN  1987.0    4.0  24410.0
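The results above stop at the last observed month, while the question asks for output through 2018 and notes that the gaps differ by state. A minimal sketch of one way to cover both, assuming at most one row per state and month: reindex each state's values onto a monthly grid that ends at a chosen date, then forward-fill. The helper name fill_months, its end parameter, and the second state 'KY' with its value are hypothetical and only serve the illustration.

```python
import pandas as pd

def fill_months(df, end='2018-12-01'):
    """Forward-fill Value per state on a complete monthly grid up to `end`."""
    df = df.assign(date=pd.to_datetime(df[['Year', 'Month']].assign(Day=1)))
    pieces = []
    for state, grp in df.groupby('State'):
        # Grid from this state's first observed month through `end`.
        grid = pd.date_range(grp['date'].min(), end, freq='MS')
        value = grp.set_index('date')['Value'].reindex(grid).ffill()
        pieces.append(pd.DataFrame({
            'State': state,
            'Year': grid.year,
            'Month': grid.month,
            'Value': value.to_numpy(),
        }))
    return pd.concat(pieces, ignore_index=True)

# Toy input: TN rows taken from the question, KY row invented for the example.
df = pd.DataFrame({
    'State': ['TN', 'TN', 'TN', 'KY'],
    'Year':  [1996, 1996, 2016, 2017],
    'Month': [4, 5, 1, 6],
    'Value': [24410.0, 37109.0, 49808.0, 100.0],
})
out = fill_months(df)
print(out.tail())
```

Each state starts at its own first observation, so no leading NaN rows are produced; a fixed common start date could be passed to date_range instead if every state should cover the same span.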