Calculate relative rank for every possible time frame (memory-efficiently)


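I want to know how a specific football player (e.g. player "P1") performs compared with all other players, measured by the sum of goals. I have a data frame with the goals per player per year. The time frame for my result is not fixed: every possible time frame is of interest, because I want to see in which years the player performed well or badly.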

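With 10 years there are roughly 10*10/2 possible time frames (exactly 10*11/2 = 55 if single-year spans are counted). In the heatmap of the results, for example, the cell for 2003 to 2008 shows that there were n=10 players in the database and that the player we are looking at was the best of them, with 85 goals in total.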
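As a quick check of that count, here is a minimal sketch in base R. The helper name `n_windows` is made up for illustration; it counts every from-to pair with from <= to and compares the closed form with a brute-force enumeration.

```r
# Number of contiguous from-to windows over n periods (from <= to).
# `n_windows` is a hypothetical helper, not part of the original code.
n_windows <- function(n_periods) n_periods * (n_periods + 1) / 2

# Brute-force count for 10 years: keep the upper triangle including the diagonal.
grid <- expand.grid(from = 1:10, to = 1:10)
brute_force_10 <- sum(grid$to >= grid$from)

n_windows(10)   # yearly windows over 10 years
n_windows(120)  # monthly windows over 10 years
```

The number of windows grows quadratically with the number of periods, but it does not depend on the number of players, which is why the final result stays small.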



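Computing the result is not the problem, but I want to switch from yearly to monthly data (120*120/2 time frames over 10 years). The result itself is small. However, with 5,000 or even 20,000 players, my approach uses too much memory for a 32-bit environment (the culprit is the dplyr inner_join):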

Error: cannot allocate vector of size 1.1 Gb
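To see where that much memory goes, here is a rough back-of-the-envelope sketch. It assumes the self join on `player` produces one row per player and per (from, to) pair before the triangle filter, and that each numeric or date column is stored as 8-byte doubles. The helper names `join_rows` and `column_gib` are invented for this illustration.

```r
# Rows produced by the self join: every period paired with every period, per player.
join_rows <- function(n_players, n_periods) n_players * n_periods^2

# Approximate size in GiB of ONE double column (8 bytes per value) of the joined table.
column_gib <- function(n_players, n_periods) {
  join_rows(n_players, n_periods) * 8 / 2^30
}

join_rows(1000, 120)    # 14.4 million rows for 1,000 players
column_gib(10000, 120)  # roughly 1.07 GiB for a single column with 10,000 players
```

Under these assumptions a single column for 10,000 players and 120 periods comes to about 1.07 GiB, which is at least consistent with the 1.1 Gb in the error message. The joined table holds several such columns at once, and a 32-bit process cannot address more than 4 GB in total.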

Is there a "smarter" (more memory-efficient) way to calculate a player's relative rank for every time frame?

library(tidyverse)
library(plotly)
library(lubridate)

set.seed(1)

# config example data
analyse_for_player <- "P1"
number_of_players <- 1000 # 1k works, 10k would be a problem
number_of_years <- 120 # still named "years", will be months later

# generate example data
player <- rep(paste0("P", seq(1, number_of_players)), rep(c(number_of_years), number_of_players))
date <- rep(seq(as.Date("1900/12/31"), by = "years", length.out = number_of_years), number_of_players)
goals <- round(runif(n = number_of_players * number_of_years, min = 0, max = 20), 0)

df <- data.frame(player, date, goals)

df <- df %>%
  # self join to get every possible from-to-combination
  # (this is the memory-heavy step: players * periods^2 rows)
  inner_join(df, by = "player", suffix = c(".from", ".to"), relationship = "many-to-many") %>%
  # only need triangle = 1 half of square
  filter(date.to >= date.from) %>%
  group_by(player, date.from) %>%
  arrange(date.to) %>%
  # important: if there was a gap (player didn't play each year), don't use it for calculations
  filter(year(date.to) - year(date.from) == row_number() - 1) %>%
  mutate(goalsum = cumsum(goals.to)) %>%
  ungroup() %>%
  group_by(date.from, date.to) %>%
  mutate(RANK_PCT = percent_rank(goalsum), n_players = n()) %>%
  ungroup() %>%
  filter(player == analyse_for_player) %>%
  select(player, date.from, date.to, goalsum, RANK_PCT, n_players)

plot_ly(
  x = df$date.from, y = df$date.to, z = df$RANK_PCT, text = paste0(
    "From: ",
    df$date.from, "\n", "To: ", df$date.to, "\n", "goals: ", df$goalsum,
    "\n", "n: ", df$n_players, "\n", "rank (pct): ", df$RANK_PCT * 100, "\n"
  ), type = "heatmap",
  hoverinfo = "text"
)
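One possible direction, sketched here under assumptions rather than offered as a definitive answer: avoid the self join entirely by keeping a players x periods matrix of goals and taking prefix sums, so that the goal sum of any window is a difference of two columns. Memory then scales with players * periods instead of players * periods^2. The function names `prefix_sums` and `window_ranks` and the integer `period` column are hypothetical. Gaps are treated as in the code above (a window only counts for a player who has a row in every period of it), and the rank follows the `percent_rank()` definition, i.e. (number of strictly smaller values) / (n - 1).

```r
# Cumulative sums per row, with a leading zero column:
# column k + 1 holds the sum over periods 1..k.
prefix_sums <- function(m) {
  cs <- matrix(0, nrow(m), ncol(m) + 1)
  for (j in seq_len(ncol(m))) cs[, j + 1] <- cs[, j] + m[, j]
  cs
}

# goals: numeric matrix, players in rows (with rownames), periods in columns,
#        NA where a player has no data for that period.
window_ranks <- function(goals, target) {
  ti <- match(target, rownames(goals))
  stopifnot(!is.na(ti))
  n_periods <- ncol(goals)
  played <- !is.na(goals)
  goals[!played] <- 0
  cs_goals  <- prefix_sums(goals)
  cs_played <- prefix_sums(played)

  max_w <- n_periods * (n_periods + 1) / 2
  from_v <- to_v <- goalsum <- rank_pct <- n_v <- numeric(max_w)
  k <- 0
  for (from in seq_len(n_periods)) {
    for (to in from:n_periods) {
      # A player counts for this window only with data in every period of it.
      complete <- (cs_played[, to + 1] - cs_played[, from]) == (to - from + 1)
      # If the target has a gap here, every longer window from this start has it too.
      if (!complete[ti]) break
      sums <- cs_goals[, to + 1] - cs_goals[, from]
      n <- sum(complete)
      k <- k + 1
      from_v[k] <- from
      to_v[k] <- to
      goalsum[k] <- sums[ti]
      rank_pct[k] <- if (n > 1) sum(sums[complete] < sums[ti]) / (n - 1) else NaN
      n_v[k] <- n
    }
  }
  idx <- seq_len(k)
  data.frame(from = from_v[idx], to = to_v[idx], goalsum = goalsum[idx],
             RANK_PCT = rank_pct[idx], n_players = n_v[idx])
}

# Small example: 50 players, 12 periods, one artificial gap for player P2.
set.seed(1)
long <- data.frame(
  player = rep(paste0("P", 1:50), each = 12),
  period = rep(1:12, times = 50),
  goals  = round(runif(50 * 12, min = 0, max = 20))
)
long <- long[!(long$player == "P2" & long$period == 3), ]

# Wide matrix; tapply() leaves NA where a player/period combination is missing.
goal_matrix <- tapply(long$goals, list(long$player, long$period), sum)
res <- window_ranks(goal_matrix, "P1")
head(res)
```

The `from` and `to` columns are period indices here. Mapping them back to dates for the heatmap would be a lookup against the sorted unique dates. Only two players x (periods + 1) matrices and a few vectors of length players are alive at any time, so 20,000 players and 120 months should need on the order of tens of megabytes, though this has not been measured on a 32-bit build.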
r dataframe dplyr memory-management