我正在使用大约50,000行的数据框游戏来处理这个篮球游戏数据。我试图比较每个游戏中每个团队(A和B)的统计数据。
我有另一个名为teamStats的数据框,每个季节有大约3000行,每个团队都有团队。
到目前为止,我已经汇编了以下代码:
for (i in 1:nrow(games)) {
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i]))) == 1) {
selectTeamA <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i])),4:45]
} else {
selectTeamA <- as.numeric(rep(NA, ncol(differences)))
}
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i]))) == 1) {
selectTeamB <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i])),4:45]
} else {
selectTeamB <- as.numeric(rep(NA, ncol(differences)))
}
differences[i,] <- selectTeamA - selectTeamB
}
基本上,在对正确的季节进行子集化之后,此代码会为每个团队A和B搜索正确的teamID。由于每个赛季的每支球队都没有出现在队员队伍中,所以我现在已经填补了NA的缺失行。 “差异”数据框是一个空的数据框,它将填充我的团队A和B的统计数据与for循环的差异。
为了让您了解数据:
游戏 - 前6行
Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100
108126 2010 1107 1393 B 43 75
108127 2010 1143 1178 A 95 61
teamStats - 前6行,前6行仅用于空间 - 在完整数据帧中具有不同统计数据的列数。代码找到teamID的正确行,然后减去stat列,例如G W L等
School Year teamID G W L
1 abilene christian 2018 1101 32 16 16
2 air force 2018 1102 31 12 19
3 akron 2018 1103 32 14 18
4 alabama a&m 2018 1105 31 3 28
5 alabama-birmingham 2018 1412 33 20 13
关闭这篇很长的帖子,我的问题。我的for循环代码工作并填充差异数据帧。问题是运行此代码需要20-30分钟。使用这么多数据我不是很有经验。有一种我不知道的技术吗?如何以更有效的方式重写此代码?
这是一种使用tidyverse
包的方法,我期望它应该比OP中的循环解决方案快得多。速度(我期望)来自更依赖于数据库连接操作(例如,基础merge
或dplyr的left_join
)来连接两个表。
library(tidyverse)
# First, use the first few columns from the `games` table, and convert to long format with
# a row for each team, and a label column `team_cat` telling us if it's a teamA or teamB.
stat_differences <- games %>%
select(row, Season, teamA, teamB) %>%
gather(team_cat, teamID, teamA:teamB) %>%
# Join to the teamStats table to bring in the team's total stats for that year
left_join(teamStats %>% select(-row), # We don't care about this "row"
by = c("teamID", "Season" = "Year")) %>%
# Now I want to reverse the stats' sign if it's a teamB. To make this simpler, I gather
# all the stats into long format so that we can do the reversal on all of them, and
# then spread back out.
gather(stat, value, G:L) %>%
mutate(value = if_else(team_cat == "teamB", value * -1, value * 1)) %>%
spread(stat, value) %>%
# Get the difference in stats for each row in the original games table.
group_by(row) %>%
summarise_at(vars(G:W), sum)
# Finally, add the output to the original table
output <- games %>%
left_join(stat_differences)
为了测试这一点,我更改了给定的样本数据,以便两个表相互关联:
games <- read.table(header = T, stringsAsFactors = F,
text = "row Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100")
teamStats <- read.table(header = T, stringsAsFactors = F,
text = "row School Year teamID G W L
1 abilene_christian 2010 1143 32 16 16
2 air_force 2010 1293 31 12 19
3 akron 2010 1314 32 14 18
4 alabama_a&m 2010 1198 31 3 28
5 alabama-birmingham 2010 1108 33 20 13
6 made_up_team 2018 1326 160 150 10 # To confirm getting right season
7 made_up_team 2010 1326 60 50 10"
)
然后我得到以下输出,这似乎是有道理的。 (我刚才意识到我应用的收集/变异/传播改变了列的顺序;如果我有时间,我可能会尝试使用mutate_if来保存顺序。)
> output
row Season teamA teamB winner scoreA scoreB G L W
1 108123 2010 1143 1293 A 75 70 1 -3 4
2 108124 2010 1198 1314 B 72 88 -1 10 -11
3 108125 2010 1108 1326 B 60 100 -27 3 -30
一种方法是合并games
和teamStats
,作为跨行迭代的替代方法。
一些代码来复制您的设置,以创建一个最小的工作示例:
library(dplyr)
library(purrr)
set.seed(123)
n_games <- 50000
n_teams <- 400
n_years <- 10
games <- data.frame(Season = rep(2005:(2005 + n_years - 1),
each = n_games / n_years)) %>%
mutate(teamA = sample(1000:(1000 + n_teams - 1), n_games, r = TRUE),
teamB = map_int(teamA, ~sample(setdiff(1000:(1000 + n_teams - 1), .), 1)),
scoreA = as.integer(rnorm(n_games, 80, 20)),
scoreB = as.integer(rnorm(n_games, 80, 20)),
scoreB = ifelse(scoreA == scoreB, scoreA + sample(c(-1, 1), n_games, r = TRUE), scoreB),
winner = ifelse(scoreA > scoreB, "A", "B"))
gen_random_string <- function(...) {
paste(sample(c(letters, " "), rpois(1, 10), r = TRUE), collapse = "")
}
schools_ids <- data.frame(teamID = 1000:(1000 + n_teams - 1)) %>%
mutate(School = map_chr(teamID, gen_random_string))
teamStats <- data.frame(Year = rep(2005:(2005 + n_years - 1), each = 300)) %>%
mutate(teamID = as.vector(replicate(n_years, sample(schools_ids$teamID, 300))),
G = 32, W = rpois(length(teamID), 16), L = G - W) %>%
left_join(schools_ids)
我们有50k行的games
和3k行的teamStats。现在,我们通过teamStats
和Year
将teamID
摧毁成一个tibble:
teamStats <- teamStats %>%
group_by(Year, teamID) %>%
nest()
# # A tibble: 3,000 x 3
# Year teamID data
# <int> <int> <list>
# 1 2005 1321 <tibble [1 x 4]>
# 2 2005 1192 <tibble [1 x 4]>
# 3 2005 1074 <tibble [1 x 4]>
# <snip>
制作一个小的便利函数来计算差异:
calculate_diff <- function(x, y) {
if (is.null(x) | is.null(y)) {
data.frame(G = NA, W = NA, L = NA)
} else {
x[, 1:3] - y[, 1:3]
}
}
现在,我们(1)使用games
连接(或合并)teamStats
,(2)使用连接数据集计算差异,以及(3)unnest
(或解除崩溃)数据帧。
start <- Sys.time()
differences <- games %>%
left_join(teamStats, c("Season" = "Year", "teamA" = "teamID")) %>%
rename(teamA_stats = data) %>%
left_join(teamStats, c("Season" = "Year", "teamB" = "teamID")) %>%
rename(teamB_stats = data) %>%
mutate(diff = map2(teamA_stats, teamB_stats, calculate_diff)) %>%
select(Season, teamA, teamB, diff) %>%
unnest()
difftime(Sys.time(), start)
# Time difference of 11.27832 secs
结果
head(differences)
# Season teamA teamB G W L
# 1 2005 1115 1085 NA NA NA
# 2 2005 1315 1177 NA NA NA
# 3 2005 1163 1051 0 -9 9
# 4 2005 1353 1190 0 -4 4
# 5 2005 1376 1286 NA NA NA
# 6 2005 1018 1362 0 -1 1