OK, I've been working on this for a while and I have a solution, but it isn't working optimally. Here's a sample of the dataframe:
print(month_df[['timestamp','date','hvac_state']].head(100))
timestamp date hvac_state
0 2017-11-09 18:43:45 12-09-17 off
1 2017-11-09 20:15:27 12-09-17 heating
2 2017-11-09 22:29:00 12-09-17 heating
3 2017-11-09 23:42:34 12-09-17 off
4 2017-11-10 00:09:40 12-10-17 heating
5 2017-11-10 00:23:14 12-10-17 heating
6 2017-11-10 03:32:17 12-10-17 off
7 2017-11-10 10:59:24 12-10-17 heating
8 2017-11-10 11:12:59 12-10-17 off
9 2017-11-10 13:49:09 12-10-17 off
10 2017-11-10 16:58:11 12-10-17 heating
11 2017-11-10 17:11:45 12-10-17 heating
12 2017-11-10 17:25:19 12-10-17 off
13 2017-11-10 23:53:44 12-10-17 off
14 2017-11-11 00:25:22 12-11-17 off
15 2017-11-11 03:29:53 12-11-17 heating
16 2017-11-11 03:43:26 12-11-17 heating
17 2017-11-11 04:01:31 12-11-17 off
There are other fields in the month_df dataframe, but these are the three I'm working with. Any change appends a row. Sometimes the item that changed is hvac_state; sometimes it's a different column. That's why you occasionally see another entry even though the state didn't change.
I'd like to aggregate the amount of time spent in each hvac_state, per day. I found some articles about groupby and using shift (like this one), and that's what I implemented, but it's not perfect because the daily cutoff isn't exactly 00:00:00-23:59:59. I can see this in my aggregated data, because my totals add up to more than 24 hours. Also, since I'm using both the 'timestamp' and 'date' columns to do this, it's obviously not efficient.
Here's the approach I'm currently using:
from collections import defaultdict
import pandas as pd

def remove_consecutive_duplicates(a):
    return a.loc[a.shift() != a]

# Get the left data frame ready, with timestamps associated specifically with state changes.
left = remove_consecutive_duplicates(month_df.set_index('timestamp')['hvac_state']).reset_index()
# Then delta from change to change, shifted by -1 so each state gets its own duration.
left['delta'] = left.timestamp.diff().fillna(0).astype(int).shift(-1).fillna(0)
# Now prep the right dataframe by dropping hvac_state so we don't end up with two copies.
right = month_df.drop(['hvac_state'], axis=1)
# Perform the merge, dropping the stuff that isn't in the left side.
result = pd.merge(left, right, how='left', on='timestamp')
# Now we can store that month's hourly usage by day.
grouped = (result.groupby(['date', 'hvac_state'])[['delta']].sum() / 3600000).round(2)
results = defaultdict(lambda: defaultdict(dict))
for index, value in grouped.itertuples():
    for i, key in enumerate(index):
        if i == 0:
            nested = results[key]
        elif i == len(index) - 1:
            nested[key] = value
        else:
            nested = nested[key]
results
defaultdict(<function __main__.<lambda>>,
{'12-09-17': defaultdict(dict, {'heating': 3.84, 'off': 10.24}),
'12-10-17': defaultdict(dict, {'heating': 8.36, 'off': 14.39}),
'12-11-17': defaultdict(dict, {'heating': 10.17, 'off': 14.91}),
'12-12-17': defaultdict(dict, {'heating': 9.34, 'off': 13.56}),
'12-13-17': defaultdict(dict, {'heating': 10.49, 'off': 13.59}),
'12-14-17': defaultdict(dict, {'heating': 9.58, 'off': 14.72}),
'12-15-17': defaultdict(dict, {'heating': 6.03, 'off': 24.38}),
'12-16-17': defaultdict(dict, {'heating': 0.0})})
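For illustration, the shift-based de-duplication used above can be checked on a toy Series (a minimal sketch, not the original data):

```python
import pandas as pd

def remove_consecutive_duplicates(a):
    # Keep a row only where it differs from the previous row.
    return a.loc[a.shift() != a]

s = pd.Series(['off', 'heating', 'heating', 'off', 'off', 'heating'])
print(remove_consecutive_duplicates(s).tolist())
# ['off', 'heating', 'off', 'heating']
```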
As you can see from this example, on the 15th my 'off' time is 24.38 hours while my 'heating' time is 6.03 hours.
What I'd like to end up with is a standard dict (for use as JSON), keyed by date, with the states as subkeys pointing to the amount of time spent in each state. The state values should add up to 24 hours. Ideally, something like this:
{
'12-12-17': {'heating': 5.23, 'off': 18.77},
'12-13-17': {'heating': 7.85, 'off': 16.15},
'12-14-17': {'heating': 7.91, 'off': 16.09},
'12-15-17': {'heating': 6.22, 'off': 17.78},
'12-16-17': {'heating': 5.11, 'off': 18.89},
'12-17-17': {'heating': 9.17, 'off': 14.83}
}
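That "adds up to 24 hours" requirement can be checked directly on the result. A minimal sketch (`desired` below is just the example dict from above):

```python
# Sanity check: each day's state durations should total 24 hours,
# within a small tolerance since values are rounded to 2 decimals.
desired = {
    '12-12-17': {'heating': 5.23, 'off': 18.77},
    '12-13-17': {'heating': 7.85, 'off': 16.15},
}
for day, states in desired.items():
    assert abs(sum(states.values()) - 24.0) < 0.01, day
print("all days total 24 h")
```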
Consider appending midnight records to the dataframe, forward-filling the hvac_state from the last row (the first record is back-filled). Each step of the process is printed out below:
Initial data
from io import StringIO
import pandas as pd
txt = """ timestamp date hvac_state
0 "2017-11-09 18:43:45" "12-09-17" off
1 "2017-11-09 20:15:27" "12-09-17" heating
2 "2017-11-09 22:29:00" "12-09-17" heating
3 "2017-11-09 23:42:34" "12-09-17" off
4 "2017-11-10 00:09:40" "12-10-17" heating
5 "2017-11-10 00:23:14" "12-10-17" heating
6 "2017-11-10 03:32:17" "12-10-17" off
7 "2017-11-10 10:59:24" "12-10-17" heating
8 "2017-11-10 11:12:59" "12-10-17" off
9 "2017-11-10 13:49:09" "12-10-17" off
10 "2017-11-10 16:58:11" "12-10-17" heating
11 "2017-11-10 17:11:45" "12-10-17" heating
12 "2017-11-10 17:25:19" "12-10-17" off
13 "2017-11-10 23:53:44" "12-10-17" off
14 "2017-11-11 00:25:22" "12-11-17" off
15 "2017-11-11 03:29:53" "12-11-17" heating
16 "2017-11-11 03:43:26" "12-11-17" heating
17 "2017-11-11 04:01:31" "12-11-17" off"""
month_df = pd.read_table(StringIO(txt), sep=r"\s+", index_col=0, parse_dates=[0, 1])
Append midnights
midnights_df = pd.DataFrame({'timestamp':pd.date_range(month_df['timestamp'].min().normalize(),
month_df['timestamp'].max()),
'date': pd.date_range(month_df['date'].min(),
month_df['date'].max())})
print(midnights_df)
# date timestamp
# 0 2017-12-09 2017-11-09
# 1 2017-12-10 2017-11-10
# 2 2017-12-11 2017-11-11
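The `normalize()` call above truncates a timestamp to midnight, which is what makes the generated range start exactly on day boundaries. A quick sketch:

```python
import pandas as pd

ts = pd.Timestamp('2017-11-09 18:43:45')
print(ts.normalize())
# 2017-11-09 00:00:00
print(len(pd.date_range(ts.normalize(), '2017-11-11')))
# 3  -> one midnight per day: 11-09, 11-10, 11-11
```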
# Append one final midnight row for the day after the last record, then
# forward-fill hvac_state into all of the new midnight rows.
# (.append was removed in pandas 2.0, so a one-row concat is used instead.)
last_row = pd.DataFrame({'date': [month_df['date'].max()],
                         'timestamp': [month_df['timestamp'].max().normalize() +
                                       pd.DateOffset(days=1)]})
month_df = (pd.concat([month_df, midnights_df], ignore_index=True)
              .sort_values(['date', 'timestamp']))
month_df = pd.concat([month_df, last_row], ignore_index=True).ffill().reset_index(drop=True)
# BACK FILL
month_df.loc[0, 'hvac_state'] = month_df.loc[1, 'hvac_state']
print(month_df)
# date hvac_state timestamp
# 0 2017-12-09 off 2017-11-09 00:00:00
# 1 2017-12-09 off 2017-11-09 18:43:45
# 2 2017-12-09 heating 2017-11-09 20:15:27
# 3 2017-12-09 heating 2017-11-09 22:29:00
# 4 2017-12-09 off 2017-11-09 23:42:34
# 5 2017-12-10 off 2017-11-10 00:00:00
# 6 2017-12-10 heating 2017-11-10 00:09:40
# 7 2017-12-10 heating 2017-11-10 00:23:14
# 8 2017-12-10 off 2017-11-10 03:32:17
# 9 2017-12-10 heating 2017-11-10 10:59:24
# 10 2017-12-10 off 2017-11-10 11:12:59
# 11 2017-12-10 off 2017-11-10 13:49:09
# 12 2017-12-10 heating 2017-11-10 16:58:11
# 13 2017-12-10 heating 2017-11-10 17:11:45
# 14 2017-12-10 off 2017-11-10 17:25:19
# 15 2017-12-10 off 2017-11-10 23:53:44
# 16 2017-12-11 off 2017-11-11 00:00:00
# 17 2017-12-11 off 2017-11-11 00:25:22
# 18 2017-12-11 heating 2017-11-11 03:29:53
# 19 2017-12-11 heating 2017-11-11 03:43:26
# 20 2017-12-11 off 2017-11-11 04:01:31
# 21 2017-12-11 off 2017-11-12 00:00:00
Join shifted timestamps (with time-delta calculation)
From there, consider joining the main dataframe with a shifted version of itself, then running an inline time subtraction to get the pieces for the final groupby sum.
join_df = month_df.join(month_df.shift(-1), lsuffix='', rsuffix='_')
join_df['time_diff'] = (join_df['timestamp_'] - join_df['timestamp']).dt.total_seconds() / 3600.0
print(join_df)
# date hvac_state timestamp date_ hvac_state_ timestamp_ time_diff
# 0 2017-12-09 off 2017-11-09 00:00:00 2017-12-09 off 2017-11-09 18:43:45 18.729167
# 1 2017-12-09 off 2017-11-09 18:43:45 2017-12-09 heating 2017-11-09 20:15:27 1.528333
# 2 2017-12-09 heating 2017-11-09 20:15:27 2017-12-09 heating 2017-11-09 22:29:00 2.225833
# 3 2017-12-09 heating 2017-11-09 22:29:00 2017-12-09 off 2017-11-09 23:42:34 1.226111
# 4 2017-12-09 off 2017-11-09 23:42:34 2017-12-10 off 2017-11-10 00:00:00 0.290556
# 5 2017-12-10 off 2017-11-10 00:00:00 2017-12-10 heating 2017-11-10 00:09:40 0.161111
# 6 2017-12-10 heating 2017-11-10 00:09:40 2017-12-10 heating 2017-11-10 00:23:14 0.226111
# 7 2017-12-10 heating 2017-11-10 00:23:14 2017-12-10 off 2017-11-10 03:32:17 3.150833
# 8 2017-12-10 off 2017-11-10 03:32:17 2017-12-10 heating 2017-11-10 10:59:24 7.451944
# 9 2017-12-10 heating 2017-11-10 10:59:24 2017-12-10 off 2017-11-10 11:12:59 0.226389
# 10 2017-12-10 off 2017-11-10 11:12:59 2017-12-10 off 2017-11-10 13:49:09 2.602778
# 11 2017-12-10 off 2017-11-10 13:49:09 2017-12-10 heating 2017-11-10 16:58:11 3.150556
# 12 2017-12-10 heating 2017-11-10 16:58:11 2017-12-10 heating 2017-11-10 17:11:45 0.226111
# 13 2017-12-10 heating 2017-11-10 17:11:45 2017-12-10 off 2017-11-10 17:25:19 0.226111
# 14 2017-12-10 off 2017-11-10 17:25:19 2017-12-10 off 2017-11-10 23:53:44 6.473611
# 15 2017-12-10 off 2017-11-10 23:53:44 2017-12-11 off 2017-11-11 00:00:00 0.104444
# 16 2017-12-11 off 2017-11-11 00:00:00 2017-12-11 off 2017-11-11 00:25:22 0.422778
# 17 2017-12-11 off 2017-11-11 00:25:22 2017-12-11 heating 2017-11-11 03:29:53 3.075278
# 18 2017-12-11 heating 2017-11-11 03:29:53 2017-12-11 heating 2017-11-11 03:43:26 0.225833
# 19 2017-12-11 heating 2017-11-11 03:43:26 2017-12-11 off 2017-11-11 04:01:31 0.301389
# 20 2017-12-11 off 2017-11-11 04:01:31 2017-12-11 off 2017-11-11 23:59:59 19.974444
# 21 2017-12-11 off 2017-11-12 00:00:00 NaT NaN NaT NaN
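The self-join with `shift(-1)` pairs each row with the one after it, so each state's duration is simply the next timestamp minus the current one. A minimal sketch of that pattern on toy data (not the frame above):

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2017-11-10 00:00:00',
                                 '2017-11-10 06:00:00',
                                 '2017-11-10 18:00:00']),
    'hvac_state': ['off', 'heating', 'off'],
})
# rsuffix distinguishes the shifted columns; the last row pairs with NaT.
joined = df.join(df.shift(-1), rsuffix='_')
joined['time_diff'] = (joined['timestamp_'] - joined['timestamp']).dt.total_seconds() / 3600.0
print(joined['time_diff'].tolist())
# [6.0, 12.0, nan]
```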
Aggregate (by date and hvac_state)
grp_df = join_df.groupby(['date', 'hvac_state'])[['time_diff']].sum().reset_index()
print(grp_df)
# date hvac_state time_diff
# 0 2017-12-09 heating 3.451944
# 1 2017-12-09 off 20.548056
# 2 2017-12-10 heating 4.055556
# 3 2017-12-10 off 19.944444
# 4 2017-12-11 heating 0.527222
# 5 2017-12-11 off 23.472500
Pivot (to get the final desired JSON)
pvt_df = grp_df.pivot(index='date', columns='hvac_state', values='time_diff')
pvt_df.index = pvt_df.index.astype('str')
print(pvt_df)
# hvac_state heating off
# date
# 2017-12-09 3.451944 20.548056
# 2017-12-10 4.055556 19.944444
# 2017-12-11 0.527222 23.472500
json_data = pvt_df.to_json(orient='index')
print(json_data)
# {"2017-12-09":{"heating":3.4519444444,"off":20.5480555556},
# "2017-12-10":{"heating":4.0555555556,"off":19.9444444444},
# "2017-12-11":{"heating":0.5272222222,"off":23.4725}
# }
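Since the goal was a standard Python dict rather than a JSON string, `json.loads` on the `to_json` output (or `DataFrame.to_dict(orient='index')` directly) gets you the rest of the way. A minimal sketch with a stand-in frame in place of `pvt_df`:

```python
import json
import pandas as pd

pvt = pd.DataFrame({'heating': [3.45], 'off': [20.55]}, index=['2017-12-09'])
as_dict = json.loads(pvt.to_json(orient='index'))
print(as_dict)
# {'2017-12-09': {'heating': 3.45, 'off': 20.55}}
# Equivalent without the JSON round trip:
print(pvt.to_dict(orient='index'))
```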