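I'm looking for a faster approach to improve the performance of my solution to the following problem: a DataFrame has two columns that contain a few NaN values. The challenge is to replace these NaNs with values from a secondary DataFrame.
Below I share the data and code used to implement my approach. Let me explain the scenario: merged_df is the original DataFrame. It has several columns, and some of them have rows with NaN values.
The columns day_of_week and holiday_flg are of particular interest. I would like to fill the NaN values in these columns by looking into a second DataFrame called date_info_df, which has the columns calendar_date, day_of_week and holiday_flg.
By using the values of the visit_date column in merged_df, it is possible to search the second DataFrame on calendar_date and find the matching entries. This gives the values of day_of_week and holiday_flg from the second DataFrame.
The end result of this exercise is merged_df with the NaN values in those two columns filled in.
You'll notice that my approach relies on apply() to execute a custom function on every row of merged_df. The function:
searches for NaN values in day_of_week and holiday_flg;
when a NaN is found, uses the date in that row's visit_date to look up the matching entry in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
after a successful match, copies the value from date_info_df['day_of_week'] into merged_df['day_of_week'] and the value from date_info_df['holiday_flg'] into merged_df['holiday_flg'].
Here is working source code: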
import math
import pandas as pd
import numpy as np
from IPython.display import display
### Data for df
data = { 'air_store_id':     [ 'air_a1', 'air_a2', 'air_a3', 'air_a4' ],
         'area_name':        [ 'Tokyo', np.nan, np.nan, np.nan ],
         'genre_name':       [ 'Japanese', np.nan, np.nan, np.nan ],
         'hpg_store_id':     [ 'hpg_h1', np.nan, np.nan, np.nan ],
         'latitude':         [ 1234, np.nan, np.nan, np.nan ],
         'longitude':        [ 5678, np.nan, np.nan, np.nan ],
         'reserve_datetime': [ '2017-04-22 11:00:00', np.nan, np.nan, np.nan ],
         'reserve_visitors': [ 25, 35, 45, np.nan ],
         'visit_datetime':   [ '2017-05-23 12:00:00', np.nan, np.nan, np.nan ],
         'visit_date':       [ '2017-05-23' , '2017-05-24', '2017-05-25', '2017-05-27' ],
         'day_of_week':      [ 'Tuesday', 'Wednesday', np.nan, np.nan ],
         'holiday_flg':      [ 0, np.nan, np.nan, np.nan ]
       }
merged_df = pd.DataFrame(data)
display(merged_df)
### Data for date_info_df
data = { 'calendar_date': [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ],
         'day_of_week':   [ 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday' ],
         'holiday_flg':   [ 0, 0, 0, 0, 1, 1 ]
       }
date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
display(date_info_df)
# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']
    holiday = row['holiday_flg']
    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']
        #print(' --> weekday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        weekday = date_info_df.at[idx,'day_of_week']
        #print(' --> weekday search_date=', search_date, 'is', weekday)
        row['day_of_week'] = weekday
    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']
        #print(' --> holiday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        holiday = date_info_df.at[idx,'holiday_flg']
        #print(' --> holiday search_date=', search_date, 'is', holiday)
        row['holiday_flg'] = int(holiday)
    return row
# send every row to fix_weekday_and_holiday
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1)
# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)
display(merged_df)
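I took some measurements so you can understand the struggle. On three DataFrames of increasing size:
apply() takes 3.01 ms;
apply() takes 2 min 51 s;
apply() takes 4 min 2 s.
How can I improve the performance of this task?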
You can use an Index to speed up the lookup and combine_first() to fill the NaN values:
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
print(merged_df[cols])
Result:
  day_of_week  holiday_flg
0     Tuesday          0.0
1   Wednesday          0.0
2    Thursday          0.0
3    Saturday          1.0
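To make the role of combine_first() in the snippet above more concrete, here is a minimal sketch on toy data. The frames `left` and `right` are hypothetical and are not part of the question's dataset; the only assumption is that both share the same index and column names.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical data, not the question's dataset).
left = pd.DataFrame({"day_of_week": ["Tuesday", np.nan, np.nan],
                     "holiday_flg": [0.0, np.nan, 1.0]})
right = pd.DataFrame({"day_of_week": ["Monday", "Wednesday", "Thursday"],
                      "holiday_flg": [1.0, 0.0, 0.0]})

# Values already present in `left` win; only its NaN cells are taken
# from the cell with the same index label and column name in `right`.
filled = left.combine_first(right)
print(filled)
```

Because alignment happens on the index, the answer above re-labels the looked-up rows with `set_index(merged_df.index)` before combining.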
Here is one solution. It should be efficient, as there is no explicit merge or apply.
merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date'])
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
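s = date_info_df.set_index('calendar_date')['day_of_week']
# Look up holiday_flg by date, not by weekday: a weekday name does not
# determine whether a date is a holiday, and repeated weekdays would make
# the lookup index non-unique.
t = date_info_df.set_index('calendar_date')['holiday_flg']
merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['visit_date'].map(t))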
Result
  air_store_id area_name day_of_week genre_name  holiday_flg hpg_store_id  \
0       air_a1     Tokyo     Tuesday   Japanese          0.0       hpg_h1
1       air_a2       NaN   Wednesday        NaN          0.0          NaN
2       air_a3       NaN    Thursday        NaN          0.0          NaN
3       air_a4       NaN    Saturday        NaN          1.0          NaN

   latitude  longitude     reserve_datetime  reserve_visitors visit_date  \
0    1234.0     5678.0  2017-04-22 11:00:00              25.0 2017-05-23
1       NaN        NaN                  NaN              35.0 2017-05-24
2       NaN        NaN                  NaN              45.0 2017-05-25
3       NaN        NaN                  NaN               NaN 2017-05-27

        visit_datetime
0  2017-05-23 12:00:00
1                  NaN
2                  NaN
3                  NaN
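Explanation
s is a pd.Series that maps calendar_date to day_of_week from date_info_df. pd.Series.map (with a pd.Series as input) looks up each visit_date in it, and fillna uses the result to fill the missing values wherever a match exists.

Edit: one can also use merge to solve the problem. It is 10 times faster than the old approach. (You need to make sure "visit_date" and "calendar_date" have the same format.)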
# don't need to `set_index` for date_info_df but select columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
                left_on="visit_date",
                right_on="calendar_date",
                how="left")  # outer should also work
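The expected results end up in the "day_of_week_y" and "holiday_flg_y" columns. In this approach and in the map approach, we do not use the old "day_of_week" and "holiday_flg" at all; we only need to map the results from date_info_df onto merged_df.
merge can do the job because the entries in date_info_df are unique, so no duplicate rows are created.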
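As a rough sketch of how the merged columns might be consumed, the example below runs the same left merge on a cut-down version of the sample data and then coalesces the `_x` and `_y` columns. Two things here are assumptions rather than part of the answer: the use of `validate="many_to_one"` to enforce the uniqueness of calendar_date, and the choice to keep existing values and only fill the gaps instead of taking the `_y` columns wholesale.

```python
import numpy as np
import pandas as pd

# Cut-down sample data; both key columns are already datetime.
merged_df = pd.DataFrame({
    "visit_date": pd.to_datetime(["2017-05-23", "2017-05-25", "2017-05-27"]),
    "day_of_week": ["Tuesday", np.nan, np.nan],
    "holiday_flg": [0.0, np.nan, np.nan],
})
date_info_df = pd.DataFrame({
    "calendar_date": pd.to_datetime(["2017-05-23", "2017-05-25", "2017-05-27"]),
    "day_of_week": ["Tuesday", "Thursday", "Saturday"],
    "holiday_flg": [0, 0, 1],
})

# validate="many_to_one" makes merge raise if calendar_date contains
# duplicates, so the left merge cannot multiply rows of merged_df.
out = merged_df.merge(date_info_df, left_on="visit_date",
                      right_on="calendar_date", how="left",
                      validate="many_to_one")

# Overlapping column names get the default suffixes _x (left) and _y (right).
for col in ["day_of_week", "holiday_flg"]:
    out[col] = out[col + "_x"].fillna(out[col + "_y"])

out = out[["visit_date", "day_of_week", "holiday_flg"]]
print(out)
```

Taking the `_y` columns directly, as the answer describes, is simpler when every date is known to be present in date_info_df.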
You can also try pandas.Series.map. What it does is:
Map values of Series using input correspondence (which can be a dict, Series, or function)
# set "calendar_date" as the index such that
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")
# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])
Note that merged_df.visit_date is originally of string type. Hence, we use
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
to convert it to datetime.
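For readers unfamiliar with map, a small sketch of its lookup behaviour with a Series argument may help. The names `lookup` and `dates` are made up for this illustration.

```python
import pandas as pd

# Lookup Series: the index holds the keys, the values are what map() returns.
lookup = pd.Series(["Tuesday", "Wednesday"],
                   index=pd.to_datetime(["2017-05-23", "2017-05-24"]))

dates = pd.Series(pd.to_datetime(["2017-05-24", "2017-05-23", "2017-06-01"]))

# Each value in `dates` is looked up in the index of `lookup`;
# a date with no match (2017-06-01 here) becomes NaN.
mapped = dates.map(lookup)
print(mapped)
```

This is also why the dtypes have to agree: string dates would not match datetime keys in the lookup index.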
Timings with the date_info_df dataset and merged_df provided by karlphillip.
date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
# merge method I propose at the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
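As can be seen, HYRY's method runs 3 times faster when the result is assigned back to merged_df. This is why HYRY's method looked faster than mine at first glance. I suspect this is due to the nature of combine_first: I guess the speed of HYRY's method depends on how sparse the columns in merged_df are. Once the result has been assigned back, the columns are full, so later runs are faster.
The merge and combine_first methods perform almost equally well. There may be cases where one is faster than the other, so each user should run some tests on their own dataset.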
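One way to keep repeated timing runs comparable is to start every run from the same sparse frame. The sketch below uses synthetic stand-in data (the real CSV files are not available here) and a hypothetical helper `fill`; it illustrates the measurement idea only and makes no claim about the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in data: 100 calendar dates, 5000 visits.
dates = pd.date_range("2017-01-01", periods=100)
date_info_df = pd.DataFrame({"calendar_date": dates,
                             "holiday_flg": np.arange(100) % 2})
base = pd.DataFrame({"visit_date": np.tile(dates, 50)})
base["holiday_flg"] = np.nan  # every value starts out missing

lookup = date_info_df.set_index("calendar_date")["holiday_flg"]

def fill(df):
    # Hypothetical helper: fill the gaps in place and return the frame.
    df["holiday_flg"] = df["holiday_flg"].fillna(df["visit_date"].map(lookup))
    return df

# Copy inside the timed call so each run starts from the same sparse frame
# rather than from one that an earlier run already filled.
# The cost of the copy is included in the measurement.
elapsed = timeit.timeit(lambda: fill(base.copy()), number=20)
print(f"{elapsed / 20 * 1e3:.2f} ms per run")
```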
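Another difference between the two methods is that the merge method assumes every date in merged_df is also contained in date_info_df. If some dates are in merged_df but not in date_info_df, the lookup returns NaN for them, and that NaN can overwrite parts of merged_df that originally contained values! This is when the combine_first method should be preferred. See the discussion by MaxU in Pandas replace, multi column criteria.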
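The overwrite hazard can be reproduced on two rows. In this sketch the data is hypothetical: 2017-06-01 is deliberately left out of the lookup table, and the row for that date already has a value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2017-06-01 is missing from the lookup table.
merged_df = pd.DataFrame({
    "visit_date": pd.to_datetime(["2017-05-23", "2017-06-01"]),
    "day_of_week": [np.nan, "Thursday"],
})
lookup = pd.Series(["Tuesday"], index=pd.to_datetime(["2017-05-23"]))

looked_up = merged_df["visit_date"].map(lookup)

# Using the looked-up column as-is loses the existing "Thursday".
overwritten = looked_up
# fillna keeps existing values and only fills the gaps.
preserved = merged_df["day_of_week"].fillna(looked_up)

print(overwritten.tolist())
print(preserved.tolist())
```

combine_first behaves like the `fillna` line in this respect, which is why it is the safer choice when date_info_df may not cover every date.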