我有一个约150万行和数百列的数据表结构,代表带有赛马成绩的日期-这将用于预测模型,但是必须先进行特征工程以计算各个实体的罢工率创建前一天参加每场比赛的记录。
“罢工率”可以用多种方式定义,但是一个简单的方法是任何给定的马匹,教练员,骑师等的获胜次数与奔跑次数的比率。包括“今天”的结果,因为这对于构建模型来说是无稽之谈。
无论如何,简化的数据结构(从网上的一些示例改编而来,就足以说明。
生成数据如下:
n <- 90
dt <- data.table(
date=rep(seq(as.Date('2010-01-01'), as.Date('2015-01-01'), by='year'), n/6),
finish=c(1:5),
trainer=sort(rep(letters[1:5], n/5))
)
想象一下,在每个日期,每个教练员都有一名跑步者,其在比赛中的终点位置以“完成”表示。对于序列中的新日期(但不在此数据中),可以这样计算到目前为止赢得的时间比例:
dt[order(trainer, date), .(strike_rate = sum(finish==1)/.N), by=trainer]
但是,为每个培训员显示的结果strike_rate变量仅对不在此数据集中的序列中的新日期(例如“ 2015-01-02”或我们的样本不足集)有效。
[为了建立模型,我们需要每天和每位培训师(以及许多其他实体,但现在让我们继续使用培训师)的罢工率保持一致。
我已经使用了shift
函数和数据表结构,但无法使其适用于此特定问题-但是,在循环上下文中,它的运行正常,尽管显示得非常好。
为了说明所需的输出,此示例代码(尽管我确定它并不优雅!)可以正常工作:
#order dates most recent to oldest so that the loop works backwards in time:
dt <- dt[order(-date)]
#find unique dates (converting to character as something weird with date)
dates = as.character(unique(dt$date))
for (d in dates) {
#find unique trainers on this date
trainers = unique(dt$trainer[dt$date==d])
for (t in trainers) {
trainer_past_form = dt[trainer==t & date < d]
strike_rate = sum(trainer_past_form$finish==1)/nrow(trainer_past_form)
# save this strike rate for this day and this trainer
dt$strike_rate[dt$trainer==t & dt$date==d] <- strike_rate
}
}
并给出所需的输出:
date finish trainer strike_rate
1: 2015-01-01 1 a 0.2000000
2: 2015-01-01 2 a 0.2000000
3: 2015-01-01 3 a 0.2000000
4: 2015-01-01 4 b 0.2000000
5: 2015-01-01 5 b 0.2000000
6: 2015-01-01 1 b 0.2000000
7: 2015-01-01 2 c 0.2000000
8: 2015-01-01 3 c 0.2000000
9: 2015-01-01 4 c 0.2000000
10: 2015-01-01 5 d 0.2000000
11: 2015-01-01 1 d 0.2000000
12: 2015-01-01 2 d 0.2000000
13: 2015-01-01 3 e 0.2000000
14: 2015-01-01 4 e 0.2000000
15: 2015-01-01 5 e 0.2000000
16: 2014-01-01 5 a 0.1666667
17: 2014-01-01 1 a 0.1666667
18: 2014-01-01 2 a 0.1666667
19: 2014-01-01 3 b 0.2500000
20: 2014-01-01 4 b 0.2500000
21: 2014-01-01 5 b 0.2500000
22: 2014-01-01 1 c 0.1666667
23: 2014-01-01 2 c 0.1666667
24: 2014-01-01 3 c 0.1666667
25: 2014-01-01 4 d 0.1666667
26: 2014-01-01 5 d 0.1666667
27: 2014-01-01 1 d 0.1666667
28: 2014-01-01 2 e 0.2500000
29: 2014-01-01 3 e 0.2500000
30: 2014-01-01 4 e 0.2500000
31: 2013-01-01 4 a 0.1111111
32: 2013-01-01 5 a 0.1111111
33: 2013-01-01 1 a 0.1111111
34: 2013-01-01 2 b 0.3333333
35: 2013-01-01 3 b 0.3333333
36: 2013-01-01 4 b 0.3333333
37: 2013-01-01 5 c 0.1111111
38: 2013-01-01 1 c 0.1111111
39: 2013-01-01 2 c 0.1111111
40: 2013-01-01 3 d 0.2222222
41: 2013-01-01 4 d 0.2222222
42: 2013-01-01 5 d 0.2222222
43: 2013-01-01 1 e 0.2222222
44: 2013-01-01 2 e 0.2222222
45: 2013-01-01 3 e 0.2222222
46: 2012-01-01 3 a 0.1666667
47: 2012-01-01 4 a 0.1666667
48: 2012-01-01 5 a 0.1666667
49: 2012-01-01 1 b 0.3333333
50: 2012-01-01 2 b 0.3333333
51: 2012-01-01 3 b 0.3333333
52: 2012-01-01 4 c 0.0000000
53: 2012-01-01 5 c 0.0000000
54: 2012-01-01 1 c 0.0000000
55: 2012-01-01 2 d 0.3333333
56: 2012-01-01 3 d 0.3333333
57: 2012-01-01 4 d 0.3333333
58: 2012-01-01 5 e 0.1666667
59: 2012-01-01 1 e 0.1666667
60: 2012-01-01 2 e 0.1666667
61: 2011-01-01 2 a 0.3333333
62: 2011-01-01 3 a 0.3333333
63: 2011-01-01 4 a 0.3333333
64: 2011-01-01 5 b 0.3333333
65: 2011-01-01 1 b 0.3333333
66: 2011-01-01 2 b 0.3333333
67: 2011-01-01 3 c 0.0000000
68: 2011-01-01 4 c 0.0000000
69: 2011-01-01 5 c 0.0000000
70: 2011-01-01 1 d 0.3333333
71: 2011-01-01 2 d 0.3333333
72: 2011-01-01 3 d 0.3333333
73: 2011-01-01 4 e 0.0000000
74: 2011-01-01 5 e 0.0000000
75: 2011-01-01 1 e 0.0000000
76: 2010-01-01 1 a NaN
77: 2010-01-01 2 a NaN
78: 2010-01-01 3 a NaN
79: 2010-01-01 4 b NaN
80: 2010-01-01 5 b NaN
81: 2010-01-01 1 b NaN
82: 2010-01-01 2 c NaN
83: 2010-01-01 3 c NaN
84: 2010-01-01 4 c NaN
85: 2010-01-01 5 d NaN
86: 2010-01-01 1 d NaN
87: 2010-01-01 2 d NaN
88: 2010-01-01 3 e NaN
89: 2010-01-01 4 e NaN
90: 2010-01-01 5 e NaN
在数据表中“适当”执行此操作的任何帮助将不胜感激。可以看出,我已经开始使用该库,但是遇到了此类问题。我了解循环的逻辑,但在150万行上效率不高,因为要在所有变量上执行很多这种类型的计算。
dt[, strike_rate :=
.SD[.SD, on=.(trainer, date<date), by=.EACHI, sum(finish==1)/.N]$V1
]
另一个应该更快的选项:
dt[order(trainer, date), strike_rate := { ri <- rleid(date) firstd <- which(diff(ri) != 0) + 1L cs <- replace(rep(NA_real_, .N), firstd, cumsum(finish==1L)[firstd - 1L]) k <- replace(rep(NA, .N), firstd, (1L:.N)[firstd - 1L]) nafill(cs, "locf") / nafill(k, "locf") }, trainer]
setorder(dt, -date, trainer, finish)[]
的输出:
date finish trainer strike_rate 1: 2015-01-01 1 a 0.2000000 2: 2015-01-01 2 a 0.2000000 3: 2015-01-01 3 a 0.2000000 4: 2015-01-01 1 b 0.2000000 5: 2015-01-01 4 b 0.2000000 6: 2015-01-01 5 b 0.2000000 7: 2015-01-01 2 c 0.2000000 8: 2015-01-01 3 c 0.2000000 9: 2015-01-01 4 c 0.2000000 10: 2015-01-01 1 d 0.2000000 11: 2015-01-01 2 d 0.2000000 12: 2015-01-01 5 d 0.2000000 13: 2015-01-01 3 e 0.2000000 14: 2015-01-01 4 e 0.2000000 15: 2015-01-01 5 e 0.2000000 16: 2014-01-01 1 a 0.1666667 17: 2014-01-01 2 a 0.1666667 18: 2014-01-01 5 a 0.1666667 19: 2014-01-01 3 b 0.2500000 20: 2014-01-01 4 b 0.2500000 21: 2014-01-01 5 b 0.2500000 22: 2014-01-01 1 c 0.1666667 23: 2014-01-01 2 c 0.1666667 24: 2014-01-01 3 c 0.1666667 25: 2014-01-01 1 d 0.1666667 26: 2014-01-01 4 d 0.1666667 27: 2014-01-01 5 d 0.1666667 28: 2014-01-01 2 e 0.2500000 29: 2014-01-01 3 e 0.2500000 30: 2014-01-01 4 e 0.2500000 31: 2013-01-01 1 a 0.1111111 32: 2013-01-01 4 a 0.1111111 33: 2013-01-01 5 a 0.1111111 34: 2013-01-01 2 b 0.3333333 35: 2013-01-01 3 b 0.3333333 36: 2013-01-01 4 b 0.3333333 37: 2013-01-01 1 c 0.1111111 38: 2013-01-01 2 c 0.1111111 39: 2013-01-01 5 c 0.1111111 40: 2013-01-01 3 d 0.2222222 41: 2013-01-01 4 d 0.2222222 42: 2013-01-01 5 d 0.2222222 43: 2013-01-01 1 e 0.2222222 44: 2013-01-01 2 e 0.2222222 45: 2013-01-01 3 e 0.2222222 46: 2012-01-01 3 a 0.1666667 47: 2012-01-01 4 a 0.1666667 48: 2012-01-01 5 a 0.1666667 49: 2012-01-01 1 b 0.3333333 50: 2012-01-01 2 b 0.3333333 51: 2012-01-01 3 b 0.3333333 52: 2012-01-01 1 c 0.0000000 53: 2012-01-01 4 c 0.0000000 54: 2012-01-01 5 c 0.0000000 55: 2012-01-01 2 d 0.3333333 56: 2012-01-01 3 d 0.3333333 57: 2012-01-01 4 d 0.3333333 58: 2012-01-01 1 e 0.1666667 59: 2012-01-01 2 e 0.1666667 60: 2012-01-01 5 e 0.1666667 61: 2011-01-01 2 a 0.3333333 62: 2011-01-01 3 a 0.3333333 63: 2011-01-01 4 a 0.3333333 64: 2011-01-01 1 b 0.3333333 65: 2011-01-01 2 b 0.3333333 66: 2011-01-01 5 b 0.3333333 67: 2011-01-01 3 c 0.0000000 68: 2011-01-01 4 c 0.0000000 69: 2011-01-01 5 c 0.0000000 70: 2011-01-01 1 d 0.3333333 71: 2011-01-01 2 d 0.3333333 72: 2011-01-01 3 d 0.3333333 73: 2011-01-01 1 e 0.0000000 74: 2011-01-01 4 e 0.0000000 75: 2011-01-01 5 e 0.0000000 76: 2010-01-01 1 a NA 77: 2010-01-01 2 a NA 78: 2010-01-01 3 a NA 79: 2010-01-01 1 b NA 80: 2010-01-01 4 b NA 81: 2010-01-01 5 b NA 82: 2010-01-01 2 c NA 83: 2010-01-01 3 c NA 84: 2010-01-01 4 c NA 85: 2010-01-01 1 d NA 86: 2010-01-01 2 d NA 87: 2010-01-01 5 d NA 88: 2010-01-01 3 e NA 89: 2010-01-01 4 e NA 90: 2010-01-01 5 e NA date finish trainer strike_rate
for
循环。我在这里使用magrittr::%>%
主要是因为我认为它有助于打破操作流程。它不是必需的,并且可以轻松将其转换为data.table
管道或类似的首选项。library(data.table)
library(magrittr)
dt %>%
.[ order(date), ] %>%
.[, c("rate", "n") := .(cumsum(finish == 1), seq_len(.N)), by = .(trainer) ] %>%
.[, .(rate = max(rate) / max(n)), by = .(date, trainer) ] %>%
.[, date := shift(date, type = "lead"), by = .(trainer) ] %>%
merge(dt, ., by = c("trainer", "date"), all.x = TRUE) %>%
.[ order(-date), ]
# trainer date finish rate
# 1: a 2015-01-01 1 0.2000000
# 2: a 2015-01-01 2 0.2000000
# 3: a 2015-01-01 3 0.2000000
# 4: b 2015-01-01 4 0.2000000
# 5: b 2015-01-01 5 0.2000000
# 6: b 2015-01-01 1 0.2000000
# 7: c 2015-01-01 2 0.2000000
# 8: c 2015-01-01 3 0.2000000
# 9: c 2015-01-01 4 0.2000000
# 10: d 2015-01-01 5 0.2000000
# 11: d 2015-01-01 1 0.2000000
# 12: d 2015-01-01 2 0.2000000
# 13: e 2015-01-01 3 0.2000000
# 14: e 2015-01-01 4 0.2000000
# 15: e 2015-01-01 5 0.2000000
# 16: a 2014-01-01 5 0.1666667
# 17: a 2014-01-01 1 0.1666667
# 18: a 2014-01-01 2 0.1666667
# 19: b 2014-01-01 3 0.2500000
# 20: b 2014-01-01 4 0.2500000
# 21: b 2014-01-01 5 0.2500000
# 22: c 2014-01-01 1 0.1666667
# 23: c 2014-01-01 2 0.1666667
# 24: c 2014-01-01 3 0.1666667
# 25: d 2014-01-01 4 0.1666667
# 26: d 2014-01-01 5 0.1666667
# 27: d 2014-01-01 1 0.1666667
# 28: e 2014-01-01 2 0.2500000
# 29: e 2014-01-01 3 0.2500000
# 30: e 2014-01-01 4 0.2500000
# 31: a 2013-01-01 4 0.1111111
# 32: a 2013-01-01 5 0.1111111
# 33: a 2013-01-01 1 0.1111111
# 34: b 2013-01-01 2 0.3333333
# 35: b 2013-01-01 3 0.3333333
# 36: b 2013-01-01 4 0.3333333
# 37: c 2013-01-01 5 0.1111111
# 38: c 2013-01-01 1 0.1111111
# 39: c 2013-01-01 2 0.1111111
# 40: d 2013-01-01 3 0.2222222
# 41: d 2013-01-01 4 0.2222222
# 42: d 2013-01-01 5 0.2222222
# 43: e 2013-01-01 1 0.2222222
# 44: e 2013-01-01 2 0.2222222
# 45: e 2013-01-01 3 0.2222222
# 46: a 2012-01-01 3 0.1666667
# 47: a 2012-01-01 4 0.1666667
# 48: a 2012-01-01 5 0.1666667
# 49: b 2012-01-01 1 0.3333333
# 50: b 2012-01-01 2 0.3333333
# 51: b 2012-01-01 3 0.3333333
# 52: c 2012-01-01 4 0.0000000
# 53: c 2012-01-01 5 0.0000000
# 54: c 2012-01-01 1 0.0000000
# 55: d 2012-01-01 2 0.3333333
# 56: d 2012-01-01 3 0.3333333
# 57: d 2012-01-01 4 0.3333333
# 58: e 2012-01-01 5 0.1666667
# 59: e 2012-01-01 1 0.1666667
# 60: e 2012-01-01 2 0.1666667
# 61: a 2011-01-01 2 0.3333333
# 62: a 2011-01-01 3 0.3333333
# 63: a 2011-01-01 4 0.3333333
# 64: b 2011-01-01 5 0.3333333
# 65: b 2011-01-01 1 0.3333333
# 66: b 2011-01-01 2 0.3333333
# 67: c 2011-01-01 3 0.0000000
# 68: c 2011-01-01 4 0.0000000
# 69: c 2011-01-01 5 0.0000000
# 70: d 2011-01-01 1 0.3333333
# 71: d 2011-01-01 2 0.3333333
# 72: d 2011-01-01 3 0.3333333
# 73: e 2011-01-01 4 0.0000000
# 74: e 2011-01-01 5 0.0000000
# 75: e 2011-01-01 1 0.0000000
# 76: a 2010-01-01 1 NA
# 77: a 2010-01-01 2 NA
# 78: a 2010-01-01 3 NA
# 79: b 2010-01-01 4 NA
# 80: b 2010-01-01 5 NA
# 81: b 2010-01-01 1 NA
# 82: c 2010-01-01 2 NA
# 83: c 2010-01-01 3 NA
# 84: c 2010-01-01 4 NA
# 85: d 2010-01-01 5 NA
# 86: d 2010-01-01 1 NA
# 87: d 2010-01-01 2 NA
# 88: e 2010-01-01 3 NA
# 89: e 2010-01-01 4 NA
# 90: e 2010-01-01 5 NA
# trainer date finish rate
这一点是成功率取决于尝试次数中的获胜次数。为此,按
trainer
分组,收集尝试次数(seq_len(.N)
)和获胜次数(cumsum(finish == 1)
);- 按
date, trainer
分组,用最大获胜次数与最大尝试次数之比汇总每个组,确保我们有“最后一天的结束”;]]- 移动
date
,以便我们最终可以...- [
merge
(合并)回到原始数据中,使“最后一个已知日期”数据显示到今天,所以今天的比赛不影响今天的罢工率- 临时(-C0之前)可能很有见地,并且
显示
merge
(带日期的偏移)而不是替换它,如上所述。知道prevdate
是原始数据的prevdate
上连接的内容:date
dt %>%
.[ order(date), ] %>%
.[, c("rate", "n") := .(cumsum(finish == 1), seq_len(.N)), by = .(trainer) ] %>%
# .[, c("rate", "n") := .(cumsum(finish == 1), .I), by = .(trainer) ] %>%
.[, .(rate = max(rate) / max(n)), by = .(date, trainer) ] %>%
.[, prevdate := shift(date, type = "lead"), by = .(trainer) ]
# date trainer rate prevdate
# 1: 2010-01-01 a 0.3333333 2011-01-01
# 2: 2010-01-01 b 0.3333333 2011-01-01
# 3: 2010-01-01 c 0.0000000 2011-01-01
# 4: 2010-01-01 d 0.3333333 2011-01-01
# 5: 2010-01-01 e 0.0000000 2011-01-01
# 6: 2011-01-01 a 0.1666667 2012-01-01
# 7: 2011-01-01 b 0.3333333 2012-01-01
# 8: 2011-01-01 c 0.0000000 2012-01-01
# 9: 2011-01-01 d 0.3333333 2012-01-01
# 10: 2011-01-01 e 0.1666667 2012-01-01
# 11: 2012-01-01 a 0.1111111 2013-01-01
# 12: 2012-01-01 b 0.3333333 2013-01-01
# 13: 2012-01-01 c 0.1111111 2013-01-01
# 14: 2012-01-01 d 0.2222222 2013-01-01
# 15: 2012-01-01 e 0.2222222 2013-01-01
# 16: 2013-01-01 a 0.1666667 2014-01-01
# 17: 2013-01-01 b 0.2500000 2014-01-01
# 18: 2013-01-01 c 0.1666667 2014-01-01
# 19: 2013-01-01 d 0.1666667 2014-01-01
# 20: 2013-01-01 e 0.2500000 2014-01-01
# 21: 2014-01-01 a 0.2000000 2015-01-01
# 22: 2014-01-01 b 0.2000000 2015-01-01
# 23: 2014-01-01 c 0.2000000 2015-01-01
# 24: 2014-01-01 d 0.2000000 2015-01-01
# 25: 2014-01-01 e 0.2000000 2015-01-01
# 26: 2015-01-01 a 0.2222222 <NA> ### data this point and below are "lost"
# 27: 2015-01-01 b 0.2222222 <NA> ### when merged, because there are no
# 28: 2015-01-01 c 0.1666667 <NA> ### dates after it to join onto
# 29: 2015-01-01 d 0.2222222 <NA>
# 30: 2015-01-01 e 0.1666667 <NA>
# date trainer rate prevdate
(不要与split.data.table
混淆),以便在一个循环中处理日期/培训员子集: