我有以下时间序列数据帧。我想用之前的值填充缺失的值。但是我只想填充first_valid_index和last_valid索引之间的缺失值。所以我想要填充的列对于每一行都是不同的。我怎样才能做到这一点?
所以,给定这个数据帧。
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2 ,3,np.nan,5], [1, 3 , np.nan , 4 , np.nan], [4, np.nan , 7 , np.nan,np.nan]], columns=[2007,2008,2009,2010,2011])
输入数据帧:
2007 2008 2009 2010 2011
1 2 3 NaN 5
1 3 NaN 4 NaN
4 Nan 7 NaN NaN
输出数据帧:
2007 2008 2009 2010 2011
1 2 3 3 5
1 3 3 4 NaN
4 4 7 NaN NaN
我想为first_valid_index和last_valid_index创建新列,然后使用.apply(),但是如何在每行填充不同的列?
def fillMissing(x):
first_valid = int(x["first_valid"])
last_valid = int(x["last_valid"])
for i in range(first_valid,last_valid + 1):
missing.append(i)
#What should i do here since the following is not valid
#x[missing] = x[missing].fillna(method='ffill', axis=1)
df.apply(fillMissing , axis=1)
您可以使用iloc
执行此操作,但我更喜欢使用Numpy执行此操作。基本上,使用ffill
来转发填充值,然后将NaN
的值一直掩盖到最后。
v = df.values
mask = np.logical_and.accumulate(
np.isnan(v)[:, ::-1], axis=1)[:, ::-1]
df.ffill(axis=1).mask(mask)
2007 2008 2009 2010 2011
0 1.0 2.0 3.0 3.0 5.0
1 1.0 3.0 3.0 4.0 NaN
2 4.0 4.0 7.0 NaN NaN
以下是两款完全基于NumPy的产品,灵感来自this post
-
def app1(df):
# Same as in the linked post
arr = df.values
m,n = arr.shape
r = np.arange(n)
mask = np.isnan(arr)
idx = np.where(~mask,r,0)
idx = np.maximum.accumulate(idx,axis=1)
out = arr[np.arange(m)[:,None], idx]
# Additional part to keep the trailing NaN islands and output a dataframe
out[(n - mask[:,::-1].argmin(1))[:,None] <= r] = np.nan
return pd.DataFrame(out, columns=df.columns)
def app2(df):
arr = df.values
m,n = arr.shape
r = np.arange(m)
mask = np.isnan(arr)
idx = np.where(~mask,np.arange(n),0)
put_idx = n - mask[:,::-1].argmin(1)
v = put_idx < n
rv = r[v]
idx[rv,put_idx[v]] = idx[rv,(put_idx-1)[v]]+1
idx = np.maximum.accumulate(idx,axis=1)
out = arr[r[:,None], idx]
return pd.DataFrame(out, columns=df.columns)
样品运行 -
In [246]: df
Out[246]:
2007 2008 2009 2010 2011
0 1 2.0 3.0 NaN 5.0
1 1 3.0 NaN 4.0 NaN
2 4 NaN 7.0 NaN NaN
In [247]: app1(df)
Out[247]:
2007 2008 2009 2010 2011
0 1.0 2.0 3.0 3.0 5.0
1 1.0 3.0 3.0 4.0 NaN
2 4.0 4.0 7.0 NaN NaN
In [248]: app2(df)
Out[248]:
2007 2008 2009 2010 2011
0 1.0 2.0 3.0 3.0 5.0
1 1.0 3.0 3.0 4.0 NaN
2 4.0 4.0 7.0 NaN NaN
填写df
NaNs的更大50%
的运行时测试 -
In [249]: df = pd.DataFrame(np.random.randint(1,9,(5000,5000)).astype(float))
In [250]: idx = np.random.choice(df.size, df.size//2, replace=0)
In [251]: df.values.ravel()[idx] = np.nan
# @piRSquared's soln
In [252]: %%timeit
...: v = df.values
...: mask = np.logical_and.accumulate(
...: np.isnan(v)[:, ::-1], axis=1)[:, ::-1]
...: df.ffill(axis=1).mask(mask)
1 loop, best of 3: 473 ms per loop
In [253]: %timeit app1(df)
1 loop, best of 3: 353 ms per loop
In [254]: %timeit app2(df)
1 loop, best of 3: 330 ms per loop