OK, I've been working on this for a while and I have a solution, but it isn't working optimally. Here's a sample of the dataframe:
print(month_df[['timestamp','date','hvac_state']].head(100))
timestamp date hvac_state
0 2017-11-09 18:43:45 12-09-17 off
1 2017-11-09 20:15:27 12-09-17 heating
2 2017-11-09 22:29:00 12-09-17 heating
3 2017-11-09 23:42:34 12-09-17 off
4 2017-11-10 00:09:40 12-10-17 heating
5 2017-11-10 00:23:14 12-10-17 heating
6 2017-11-10 03:32:17 12-10-17 off
7 2017-11-10 10:59:24 12-10-17 heating
8 2017-11-10 11:12:59 12-10-17 off
9 2017-11-10 13:49:09 12-10-17 off
10 2017-11-10 16:58:11 12-10-17 heating
11 2017-11-10 17:11:45 12-10-17 heating
12 2017-11-10 17:25:19 12-10-17 off
13 2017-11-10 23:53:44 12-10-17 off
14 2017-11-11 00:25:22 12-11-17 off
15 2017-11-11 03:29:53 12-11-17 heating
16 2017-11-11 03:43:26 12-11-17 heating
17 2017-11-11 04:01:31 12-11-17 off
There are other fields in the month_df dataframe, but these are the three I'm working with. Any change appends a row. Sometimes the item that changed is hvac_state; sometimes it's a different column. That's why you occasionally see another entry even though the state didn't change.
I'd like to aggregate the amount of time spent in each hvac_state, per day. I found some articles about groupby and using shift (like this one), and that's what I implemented, but it's not perfect because the daily cutoff isn't exactly 00:00:00-23:59:59. I can see this in my aggregated data, because my totals add up to more than 24 hours. Also, since I'm using both the 'timestamp' and 'date' columns to do this, it's obviously not efficient.
Here's the approach I'm currently using:
from collections import defaultdict
import pandas as pd

def remove_consecutive_duplicates(a):
    return a.loc[a.shift() != a]

# Get the left data frame ready, with timestamps associated specifically with state changes.
left = remove_consecutive_duplicates(month_df.set_index('timestamp')['hvac_state']).reset_index()
# Then delta from change to change, shifted by -1 so each state gets its own duration.
left['delta'] = left.timestamp.diff().fillna(0).astype(int).shift(-1).fillna(0)
# Now prep the right dataframe by dropping hvac_state so we don't end up with two copies.
right = month_df.drop(['hvac_state'], axis=1)
# Perform the merge, dropping the stuff that isn't in the left side.
result = pd.merge(left, right, how='left', on='timestamp')
# Now we can store that month's hourly usage by day.
grouped = (result.groupby(['date', 'hvac_state'])[['delta']].sum() / 3600000).round(2)
results = defaultdict(lambda: defaultdict(dict))
for index, value in grouped.itertuples():
    for i, key in enumerate(index):
        if i == 0:
            nested = results[key]
        elif i == len(index) - 1:
            nested[key] = value
        else:
            nested = nested[key]
results
defaultdict(<function __main__.<lambda>>,
{'12-09-17': defaultdict(dict, {'heating': 3.84, 'off': 10.24}),
'12-10-17': defaultdict(dict, {'heating': 8.36, 'off': 14.39}),
'12-11-17': defaultdict(dict, {'heating': 10.17, 'off': 14.91}),
'12-12-17': defaultdict(dict, {'heating': 9.34, 'off': 13.56}),
'12-13-17': defaultdict(dict, {'heating': 10.49, 'off': 13.59}),
'12-14-17': defaultdict(dict, {'heating': 9.58, 'off': 14.72}),
'12-15-17': defaultdict(dict, {'heating': 6.03, 'off': 24.38}),
'12-16-17': defaultdict(dict, {'heating': 0.0})})
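For illustration, the shift-based de-duplication used above can be checked on a toy Series (a minimal sketch, not the original data):

```python
import pandas as pd

def remove_consecutive_duplicates(a):
    # Keep a row only where it differs from the previous row.
    return a.loc[a.shift() != a]

s = pd.Series(['off', 'heating', 'heating', 'off', 'off', 'heating'])
print(remove_consecutive_duplicates(s).tolist())
# ['off', 'heating', 'off', 'heating']
```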
As you can see from this example, on the 15th my 'off' time is 24.38 hours while my 'heating' time is 6.03 hours.
What I'd like to end up with is a standard dict (for use as JSON), keyed by date, with the states as subkeys pointing to the amount of time spent in each state. The state values should add up to 24 hours. Ideally, something like this:
{
'12-12-17': {'heating': 5.23, 'off': 18.77},
'12-13-17': {'heating': 7.85, 'off': 16.15},
'12-14-17': {'heating': 7.91, 'off': 16.09},
'12-15-17': {'heating': 6.22, 'off': 17.78},
'12-16-17': {'heating': 5.11, 'off': 18.89},
'12-17-17': {'heating': 9.17, 'off': 14.83}
}
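That "adds up to 24 hours" requirement can be checked directly on the result. A minimal sketch (`desired` below is just the example dict from above):

```python
# Sanity check: each day's state durations should total 24 hours,
# within a small tolerance since values are rounded to 2 decimals.
desired = {
    '12-12-17': {'heating': 5.23, 'off': 18.77},
    '12-13-17': {'heating': 7.85, 'off': 16.15},
}
for day, states in desired.items():
    assert abs(sum(states.values()) - 24.0) < 0.01, day
print("all days total 24 h")
```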
Consider appending midnight records to the dataframe, forward-filling the hvac_state from the last row (the first record is back-filled). Each step of the process is printed out below:
Initial data
from io import StringIO
import pandas as pd
txt = """ timestamp date hvac_state
0 "2017-11-09 18:43:45" "12-09-17" off
1 "2017-11-09 20:15:27" "12-09-17" heating
2 "2017-11-09 22:29:00" "12-09-17" heating
3 "2017-11-09 23:42:34" "12-09-17" off
4 "2017-11-10 00:09:40" "12-10-17" heating
5 "2017-11-10 00:23:14" "12-10-17" heating
6 "2017-11-10 03:32:17" "12-10-17" off
7 "2017-11-10 10:59:24" "12-10-17" heating
8 "2017-11-10 11:12:59" "12-10-17" off
9 "2017-11-10 13:49:09" "12-10-17" off
10 "2017-11-10 16:58:11" "12-10-17" heating
11 "2017-11-10 17:11:45" "12-10-17" heating
12 "2017-11-10 17:25:19" "12-10-17" off
13 "2017-11-10 23:53:44" "12-10-17" off
14 "2017-11-11 00:25:22" "12-11-17" off
15 "2017-11-11 03:29:53" "12-11-17" heating
16 "2017-11-11 03:43:26" "12-11-17" heating
17 "2017-11-11 04:01:31" "12-11-17" off"""
month_df = pd.read_table(StringIO(txt), sep=r"\s+", index_col=0, parse_dates=[0, 1])
Append midnights
midnights_df = pd.DataFrame({'timestamp':pd.date_range(month_df['timestamp'].min().normalize(),
month_df['timestamp'].max()),
'date': pd.date_range(month_df['date'].min(),
month_df['date'].max())})
print(midnights_df)
# date timestamp
# 0 2017-12-09 2017-11-09
# 1 2017-12-10 2017-11-10
# 2 2017-12-11 2017-11-11
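The `normalize()` call above truncates a timestamp to midnight, which is what makes the generated range start exactly on day boundaries. A quick sketch:

```python
import pandas as pd

ts = pd.Timestamp('2017-11-09 18:43:45')
print(ts.normalize())
# 2017-11-09 00:00:00
print(len(pd.date_range(ts.normalize(), '2017-11-11')))
# 3  -> one midnight per day: 11-09, 11-10, 11-11
```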
# Append one final midnight row for the day after the last record, then
# forward-fill hvac_state into all of the new midnight rows.
# (.append was removed in pandas 2.0, so a one-row concat is used instead.)
last_row = pd.DataFrame({'date': [month_df['date'].max()],
                         'timestamp': [month_df['timestamp'].max().normalize() +
                                       pd.DateOffset(days=1)]})
month_df = (pd.concat([month_df, midnights_df], ignore_index=True)
              .sort_values(['date', 'timestamp']))
month_df = pd.concat([month_df, last_row], ignore_index=True).ffill().reset_index(drop=True)
# BACK FILL
month_df.loc[0, 'hvac_state'] = month_df.loc[1, 'hvac_state']
print(month_df)
# date hvac_state timestamp
# 0 2017-12-09 off 2017-11-09 00:00:00
# 1 2017-12-09 off 2017-11-09 18:43:45
# 2 2017-12-09 heating 2017-11-09 20:15:27
# 3 2017-12-09 heating 2017-11-09 22:29:00
# 4 2017-12-09 off 2017-11-09 23:42:34
# 5 2017-12-10 off 2017-11-10 00:00:00
# 6 2017-12-10 heating 2017-11-10 00:09:40
# 7 2017-12-10 heating 2017-11-10 00:23:14
# 8 2017-12-10 off 2017-11-10 03:32:17
# 9 2017-12-10 heating 2017-11-10 10:59:24
# 10 2017-12-10 off 2017-11-10 11:12:59
# 11 2017-12-10 off 2017-11-10 13:49:09
# 12 2017-12-10 heating 2017-11-10 16:58:11
# 13 2017-12-10 heating 2017-11-10 17:11:45
# 14 2017-12-10 off 2017-11-10 17:25:19
# 15 2017-12-10 off 2017-11-10 23:53:44
# 16 2017-12-11 off 2017-11-11 00:00:00
# 17 2017-12-11 off 2017-11-11 00:25:22
# 18 2017-12-11 heating 2017-11-11 03:29:53
# 19 2017-12-11 heating 2017-11-11 03:43:26
# 20 2017-12-11 off 2017-11-11 04:01:31
# 21 2017-12-11 off 2017-11-12 00:00:00
Join shifted timestamps (with time-delta calculation)
From there, consider joining the main dataframe with a shifted version of itself, then running an inline time subtraction to get the pieces for the final groupby sum.
join_df = month_df.join(month_df.shift(-1), lsuffix='', rsuffix='_')
join_df['time_diff'] = (join_df['timestamp_'] - join_df['timestamp']).dt.total_seconds() / 3600.0
print(join_df)
# date hvac_state timestamp date_ hvac_state_ timestamp_ time_diff
# 0 2017-12-09 off 2017-11-09 00:00:00 2017-12-09 off 2017-11-09 18:43:45 18.729167
# 1 2017-12-09 off 2017-11-09 18:43:45 2017-12-09 heating 2017-11-09 20:15:27 1.528333
# 2 2017-12-09 heating 2017-11-09 20:15:27 2017-12-09 heating 2017-11-09 22:29:00 2.225833
# 3 2017-12-09 heating 2017-11-09 22:29:00 2017-12-09 off 2017-11-09 23:42:34 1.226111
# 4 2017-12-09 off 2017-11-09 23:42:34 2017-12-10 off 2017-11-10 00:00:00 0.290556
# 5 2017-12-10 off 2017-11-10 00:00:00 2017-12-10 heating 2017-11-10 00:09:40 0.161111
# 6 2017-12-10 heating 2017-11-10 00:09:40 2017-12-10 heating 2017-11-10 00:23:14 0.226111
# 7 2017-12-10 heating 2017-11-10 00:23:14 2017-12-10 off 2017-11-10 03:32:17 3.150833
# 8 2017-12-10 off 2017-11-10 03:32:17 2017-12-10 heating 2017-11-10 10:59:24 7.451944
# 9 2017-12-10 heating 2017-11-10 10:59:24 2017-12-10 off 2017-11-10 11:12:59 0.226389
# 10 2017-12-10 off 2017-11-10 11:12:59 2017-12-10 off 2017-11-10 13:49:09 2.602778
# 11 2017-12-10 off 2017-11-10 13:49:09 2017-12-10 heating 2017-11-10 16:58:11 3.150556
# 12 2017-12-10 heating 2017-11-10 16:58:11 2017-12-10 heating 2017-11-10 17:11:45 0.226111
# 13 2017-12-10 heating 2017-11-10 17:11:45 2017-12-10 off 2017-11-10 17:25:19 0.226111
# 14 2017-12-10 off 2017-11-10 17:25:19 2017-12-10 off 2017-11-10 23:53:44 6.473611
# 15 2017-12-10 off 2017-11-10 23:53:44 2017-12-11 off 2017-11-11 00:00:00 0.104444
# 16 2017-12-11 off 2017-11-11 00:00:00 2017-12-11 off 2017-11-11 00:25:22 0.422778
# 17 2017-12-11 off 2017-11-11 00:25:22 2017-12-11 heating 2017-11-11 03:29:53 3.075278
# 18 2017-12-11 heating 2017-11-11 03:29:53 2017-12-11 heating 2017-11-11 03:43:26 0.225833
# 19 2017-12-11 heating 2017-11-11 03:43:26 2017-12-11 off 2017-11-11 04:01:31 0.301389
# 20 2017-12-11 off 2017-11-11 04:01:31 2017-12-11 off 2017-11-11 23:59:59 19.974444
# 21 2017-12-11 off 2017-11-12 00:00:00 NaT NaN NaT NaN
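The self-join with `shift(-1)` pairs each row with the one after it, so each state's duration is simply the next timestamp minus the current one. A minimal sketch of that pattern on toy data (not the frame above):

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2017-11-10 00:00:00',
                                 '2017-11-10 06:00:00',
                                 '2017-11-10 18:00:00']),
    'hvac_state': ['off', 'heating', 'off'],
})
# rsuffix distinguishes the shifted columns; the last row pairs with NaT.
joined = df.join(df.shift(-1), rsuffix='_')
joined['time_diff'] = (joined['timestamp_'] - joined['timestamp']).dt.total_seconds() / 3600.0
print(joined['time_diff'].tolist())
# [6.0, 12.0, nan]
```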
Aggregate (by date and hvac_state)
grp_df = join_df.groupby(['date', 'hvac_state'])[['time_diff']].sum().reset_index()
print(grp_df)
# date hvac_state time_diff
# 0 2017-12-09 heating 3.451944
# 1 2017-12-09 off 20.548056
# 2 2017-12-10 heating 4.055556
# 3 2017-12-10 off 19.944444
# 4 2017-12-11 heating 0.527222
# 5 2017-12-11 off 23.472500
Pivot (to get the final desired JSON)
pvt_df = grp_df.pivot(index='date', columns='hvac_state', values='time_diff')
pvt_df.index = pvt_df.index.astype('str')
print(pvt_df)
# hvac_state heating off
# date
# 2017-12-09 3.451944 20.548056
# 2017-12-10 4.055556 19.944444
# 2017-12-11 0.527222 23.472500
json_data = pvt_df.to_json(orient='index')
print(json_data)
# {"2017-12-09":{"heating":3.4519444444,"off":20.5480555556},
# "2017-12-10":{"heating":4.0555555556,"off":19.9444444444},
# "2017-12-11":{"heating":0.5272222222,"off":23.4725}
# }
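Since the goal was a standard Python dict rather than a JSON string, `json.loads` on the `to_json` output (or `DataFrame.to_dict(orient='index')` directly) gets you the rest of the way. A minimal sketch with a stand-in frame in place of `pvt_df`:

```python
import json
import pandas as pd

pvt = pd.DataFrame({'heating': [3.45], 'off': [20.55]}, index=['2017-12-09'])
as_dict = json.loads(pvt.to_json(orient='index'))
print(as_dict)
# {'2017-12-09': {'heating': 3.45, 'off': 20.55}}
# Equivalent without the JSON round trip:
print(pvt.to_dict(orient='index'))
```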