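I want to know how a specific soccer player (e.g. player "P1") performs compared with all other players (by sum of goals). I have a data frame with the goals of each player per year. The time frame for my result is not fixed: all possible time frames are of interest, so I can see in which years the player performed well or poorly.
With 10 years there are roughly 10*10/2 possible time frames. The figure below shows that between 2003 and 2008 there were n=10 players in the database, and the player we are looking at was the best one, with 85 goals in total.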
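As a quick check on the number of time frames, here is a minimal sketch. The helper `count_windows` is a hypothetical name, not part of the original code. It assumes a window is any from/to pair with `to >= from`, which is what the filter in the code below keeps, so single-period windows are included and the exact count is slightly above 10*10/2.

```r
# Hypothetical helper: number of from/to windows with to >= from.
count_windows <- function(n_periods) n_periods * (n_periods + 1) / 2

count_windows(10)  # 55 windows for 10 yearly periods

# Cross-check by enumerating every from/to pair explicitly.
pairs <- expand.grid(from = 1:10, to = 1:10)
sum(pairs$to >= pairs$from)  # 55
```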
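Calculating the result is not a problem, but I want to switch from yearly to monthly data (about 120*120/2 time frames in 10 years). The result itself is small. But for 5,000 or even 20,000 players, my approach to getting the result uses too much memory for a 32-bit environment (the problem is the dplyr inner_join):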
Error: cannot allocate vector of size 1.1 Gb
Is there a "smarter" way to calculate a player's relative rank for each time frame? (memory-wise)
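To make the memory pressure concrete, the sketch below estimates the size of the self-join before any filtering. The helpers `join_rows` and `column_gb` are hypothetical names. The estimate assumes every player has a row for every period and that each column costs 8 bytes per value (a double), so it is a rough lower bound per column rather than a measurement of the actual session.

```r
# Rows produced by joining the data frame to itself on "player":
# each player contributes n_periods * n_periods combinations.
join_rows <- function(n_players, n_periods) n_players * n_periods^2

# Assumption: 8 bytes per value, i.e. one double-precision column.
column_gb <- function(n_players, n_periods) {
  join_rows(n_players, n_periods) * 8 / 1024^3
}

join_rows(1000, 120)   # 14.4 million rows for the 1k example
column_gb(5000, 120)   # roughly 0.54 GB for a single column
column_gb(20000, 120)  # roughly 2.15 GB for a single column
```

Since the joined table carries several such columns at once, a single-column allocation of this order already strains a 32-bit R session.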
library(tidyverse)
library(plotly)
library(lubridate)
set.seed(1)
# config example data
analyse_for_player <- "P1"
number_of_players <- 1000 # 1k works, 10k would be a problem
number_of_years <- 120 # still named "years", will be months later
# generate example data
player <- rep(paste0("P", seq_len(number_of_players)), each = number_of_years)
date <- rep(seq(as.Date("1900/12/31"), by = "years", length.out = number_of_years), number_of_players)
goals <- round(runif(n = number_of_players * number_of_years, min = 0, max = 20), 0)
df <- data.frame(player, date, goals)
df <- df %>%
  # self join to get every possible from-to combination
  inner_join(df, by = "player", suffix = c(".from", ".to"), relationship = "many-to-many") %>%
  # only the triangle is needed (one half of the square)
  filter(date.to >= date.from) %>%
  group_by(player, date.from) %>%
  arrange(date.to) %>%
  # important: if there was a gap (player didn't play each year), don't use it for calculations
  # note: this compares years, so a monthly version needs a month-based difference instead
  filter(year(date.to) - year(date.from) == row_number() - 1) %>%
  # running total of goals from date.from up to each date.to
  mutate(goalsum = cumsum(goals.to)) %>%
  ungroup() %>%
  # rank all players against each other within each from-to window
  group_by(date.from, date.to) %>%
  mutate(RANK_PCT = percent_rank(goalsum), n_players = n()) %>%
  ungroup() %>%
  # keep only the player of interest
  filter(player == analyse_for_player) %>%
  select(player, date.from, date.to, goalsum, RANK_PCT, n_players)
plot_ly(
  x = df$date.from,
  y = df$date.to,
  z = df$RANK_PCT,
  text = paste0(
    "From: ", df$date.from, "\n",
    "To: ", df$date.to, "\n",
    "goals: ", df$goalsum, "\n",
    "n: ", df$n_players, "\n",
    "rank (pct): ", df$RANK_PCT * 100, "\n"
  ),
  type = "heatmap",
  hoverinfo = "text"
)
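One possible lower-memory direction, offered as a sketch and not a tested solution at the 20,000-player scale: keep the goals in a players x periods matrix and walk through the windows one start period at a time, holding only a single running-sum vector in memory instead of the full self-join. The function `rank_windows` and the matrix `goals_mat` are hypothetical names. The sketch assumes `NA` marks a period in which a player did not play, so a gap removes that player from all longer windows with the same start, and it computes the percent rank as `(min_rank - 1) / (n - 1)`, which is the definition `percent_rank` uses.

```r
# Hypothetical helper: rank one player over all from/to windows.
# goals_mat: numeric matrix, one row per player (rownames = player ids),
#            one column per period, NA where the player did not play.
rank_windows <- function(goals_mat, target) {
  n_periods <- ncol(goals_mat)
  out <- vector("list", n_periods)
  for (i in seq_len(n_periods)) {
    # running window sums for all players, starting at period i
    running <- rep(0, nrow(goals_mat))
    names(running) <- rownames(goals_mat)
    res <- matrix(NA_real_, nrow = n_periods - i + 1, ncol = 3,
                  dimnames = list(NULL, c("goalsum", "rank_pct", "n_players")))
    for (j in i:n_periods) {
      # NA propagates, so a gap drops the player from longer windows
      running <- running + goals_mat[, j]
      valid <- running[!is.na(running)]
      if (!is.na(running[target])) {
        res[j - i + 1, ] <- c(
          running[target],
          (rank(valid, ties.method = "min")[target] - 1) / (length(valid) - 1),
          length(valid)
        )
      }
    }
    out[[i]] <- data.frame(from = i, to = i:n_periods, res)
  }
  do.call(rbind, out)
}

# Tiny made-up example: three players, three periods, P3 missing period 3.
goals_mat <- rbind(P1 = c(5, 0, 1), P2 = c(1, 1, 1), P3 = c(0, 9, NA))
small <- rank_windows(goals_mat, "P1")
```

For the example data above, a matrix of this shape could be built with `tapply(df$goals, list(df$player, df$date), sum)` on the original data frame (before it is overwritten by the join), which leaves `NA` for player/date combinations that do not occur. Windows in which the target player has a gap come back as `NA` rows here, whereas the dplyr version drops them.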