I have the following Pandas DataFrame (called df).
+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John | Dogs | 2000 | 2 |
| John | Dogs | 2001 | 2 |
| John | Dogs | 2002 | 2 |
| John | Dogs | 2003 | 2 |
| John | Dogs | 2004 | 2 |
| John | Dogs | 2005 | 2 |
| John | Cats | 2000 | 1 |
| John | Cats | 2001 | NaN |
| John | Cats | 2002 | NaN |
| John | Cats | 2003 | 4 |
| John | Cats | 2004 | 5 |
| John | Cats | 2005 | 6 |
| Peter | Dogs | 2000 | NaN |
| Peter | Dogs | 2001 | 1 |
| Peter | Dogs | 2002 | NaN |
| Peter | Dogs | 2003 | 5 |
| Peter | Dogs | 2004 | 5 |
| Peter | Dogs | 2005 | 5 |
| Peter | Cats | 2000 | NaN |
| Peter | Cats | 2001 | 4 |
| Peter | Cats | 2002 | 4 |
| Peter | Cats | 2003 | 4 |
| Peter | Cats | 2004 | 4 |
| Peter | Cats | 2005 | 4 |
+--------+--------+------+--------+
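My goal is to get the following, which means using the interpolate method to fill the NaN values, but based on the values of the other columns. In other words, df should be partitioned by the Person and Animal columns and ordered by Year (ascending).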
+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John | Dogs | 2000 | 2 |
| John | Dogs | 2001 | 2 |
| John | Dogs | 2002 | 2 |
| John | Dogs | 2003 | 2 |
| John | Dogs | 2004 | 2 |
| John | Dogs | 2005 | 2 |
| John | Cats | 2000 | 1 |
| John | Cats | 2001 | 2 |
| John | Cats | 2002 | 3 |
| John | Cats | 2003 | 4 |
| John | Cats | 2004 | 5 |
| John | Cats | 2005 | 6 |
| Peter | Dogs | 2000 | NaN |
| Peter | Dogs | 2001 | 1 |
| Peter | Dogs | 2002 | 3 |
| Peter | Dogs | 2003 | 5 |
| Peter | Dogs | 2004 | 5 |
| Peter | Dogs | 2005 | 5 |
| Peter | Cats | 2000 | NaN |
| Peter | Cats | 2001 | 4 |
| Peter | Cats | 2002 | 4 |
| Peter | Cats | 2003 | 4 |
| Peter | Cats | 2004 | 4 |
| Peter | Cats | 2005 | 4 |
+--------+--------+------+--------+
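What I have tried
I could filter by each person and each animal and then apply the interpolate method. Finally, I would merge everything back together, but that sounds tedious and long if you have many columns.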
You can try:
df['Number'] = (df.sort_values('Year', ascending=True)
                  .groupby(['Person', 'Animal'])['Number']
                  .transform(lambda x: x.interpolate()))
print(df)
# Output
Person Animal Year Number
0 John Dogs 2000 2.0
1 John Dogs 2001 2.0
2 John Dogs 2002 2.0
3 John Dogs 2003 2.0
4 John Dogs 2004 2.0
5 John Dogs 2005 2.0
6 John Cats 2000 1.0
7 John Cats 2001 2.0 # interpolate
8 John Cats 2002 3.0 # interpolate
9 John Cats 2003 4.0
10 John Cats 2004 5.0
11 John Cats 2005 6.0
12 Peter Dogs 2000 NaN
13 Peter Dogs 2001 1.0
14 Peter Dogs 2002 3.0
15 Peter Dogs 2003 5.0
16 Peter Dogs 2004 5.0
17 Peter Dogs 2005 5.0
18 Peter Cats 2000 NaN
19 Peter Cats 2001 4.0
20 Peter Cats 2002 4.0
21 Peter Cats 2003 4.0
22 Peter Cats 2004 4.0
23 Peter Cats 2005 4.0
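To make the snippet above reproducible end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same grouped interpolation. The way the frame is constructed is an assumption about how the data might be loaded; only the `sort_values` / `groupby` / `transform` chain comes from the answer.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Rebuild the sample frame from the question: 2 persons x 2 animals x 6 years.
df = pd.DataFrame({
    "Person": ["John"] * 12 + ["Peter"] * 12,
    "Animal": (["Dogs"] * 6 + ["Cats"] * 6) * 2,
    "Year": list(range(2000, 2006)) * 4,
    "Number": [2, 2, 2, 2, 2, 2,
               1, nan, nan, 4, 5, 6,
               nan, 1, nan, 5, 5, 5,
               nan, 4, 4, 4, 4, 4],
})

# Sort by Year so each (Person, Animal) group is interpolated chronologically.
# Assigning the result back aligns on the original index, so row order is kept.
df["Number"] = (
    df.sort_values("Year")
      .groupby(["Person", "Animal"])["Number"]
      .transform(lambda s: s.interpolate())
)

# Rows that are still NaN after interpolation (the leading gaps for Peter).
print(df[df["Number"].isna()])
```

The two remaining NaN values are the year-2000 rows for Peter: with its default settings, `interpolate` fills forward only, so a gap with no earlier value in its group is left untouched, which matches the expected output in the question.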
For multiple columns, just use the same operation:
cols = ['Number']  # list of columns
df[cols] = (df.sort_values('Year', ascending=True)
              .groupby(['Person', 'Animal'])[cols]
              .transform(lambda x: x.interpolate()))
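As a rough illustration of the multi-column form, the sketch below uses a small made-up frame with a second numeric column. The `Weight` column and all values in `df2` are hypothetical and not part of the question's data; the rows are deliberately left out of Year order to show why the sort matters.

```python
import numpy as np
import pandas as pd

# "Weight" is a hypothetical second column, added only for illustration.
df2 = pd.DataFrame({
    "Person": ["John"] * 3 + ["Peter"] * 3,
    "Animal": ["Cats"] * 6,
    "Year": [2002, 2000, 2001, 2000, 2001, 2002],  # deliberately unsorted
    "Number": [3.0, 1.0, np.nan, 4.0, np.nan, 8.0],
    "Weight": [30.0, 10.0, np.nan, 5.0, np.nan, 7.0],
})

cols = ["Number", "Weight"]

# Each listed column is interpolated independently within its group,
# in Year order; the result is aligned back onto the original row order.
df2[cols] = (
    df2.sort_values("Year")
       .groupby(["Person", "Animal"])[cols]
       .transform(lambda s: s.interpolate())
)

print(df2)
```

Without the `sort_values` step, John's 2001 gap would sit after the 2000 row in storage order with nothing following it, so it would be filled from the wrong neighbours rather than from the 2000 and 2002 values.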