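First off, I'm new to programming in R, so many of my problems may come from misunderstanding the basics.
I'm researching college football and trying to automate calculating the standard deviation within each conference, by year.
My current data frame has these variables:
Year, Conference, School, Wins, Losses, and w_pct (calculated from wins and losses).
Here is a sample:
My question is mainly about grouping and then calculating the standard deviation of w_pct within each group (year/conference).
I've tried group_by many times, but when I add the stats::SD function to it, it either returns an error or computes a single standard deviation for the whole data set rather than one per year and conference.
Is there a better/simpler/more efficient way to do this? Or do I really need to create a separate data frame for each year/conference?
Any help is greatly appreciated!
Thanks!
Chris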
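Since we don't have the real data, I'll make up some sample data. These won't be "true" win-loss records, which would need to define the winning and losing team of each game and so on, but the idea and the code should work on proper win-loss records just as they do on random data.
First, create five years of data for four conferences with six teams each. Each year therefore has 24 entries, for 120 rows in total.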
set.seed(14L)
# Five years with 24 entries each
Year <- rep(2020:2024, each = 24L)
# Four conferences with six teams each for the five years
Conference <- rep(c("North", "South", "East", "West"), each = 6L, times = 5L)
# 24 teams
Teams <- LETTERS[1:24]
# Wins for each team in each year (120 in total)
Wins <- sample(0:8, 120, replace = TRUE)
Losses <- 8 - Wins
WP <- Wins / 8
Hist <- data.frame(Year = Year,
                   Conference = Conference,
                   Teams = Teams,
                   Wins = Wins,
                   Losses = Losses,
                   WP = WP)
Excerpt:
> head(Hist)
  Year Conference Teams Wins Losses    WP
1 2020      North     A    8      0 1.000
2 2020      North     B    8      0 1.000
3 2020      North     C    3      5 0.375
4 2020      North     D    3      5 0.375
5 2020      North     E    0      8 0.000
6 2020      North     F    8      0 1.000
> tail(Hist)
    Year Conference Teams Wins Losses    WP
115 2024       West     S    5      3 0.625
116 2024       West     T    8      0 1.000
117 2024       West     U    2      6 0.250
118 2024       West     V    8      0 1.000
119 2024       West     W    8      0 1.000
120 2024       West     X    1      7 0.125
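In my opinion, the simplest approach is to use data.table. The calls below need Hist to be a data.table, so load the package with library(data.table) and convert the data frame with setDT(Hist) first.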
> Hist[, .(mean = mean(WP), sd = sd(WP)), keyby = Year]
   Year      mean        sd
1: 2020 0.5104167 0.3356464
2: 2021 0.5052083 0.3387942
3: 2022 0.5416667 0.2823299
4: 2023 0.5416667 0.3142163
5: 2024 0.5104167 0.3553104
>
> Hist[, .(mean = mean(WP), sd = sd(WP)), keyby = Conference]
   Conference      mean        sd
1:       East 0.4666667 0.3043034
2:      North 0.5958333 0.3500051
3:      South 0.4833333 0.3021713
4:       West 0.5416667 0.3255190
> Hist[, .(mean = mean(WP), sd = sd(WP)), keyby = c("Year", "Conference")]
    Year Conference      mean        sd
 1: 2020       East 0.6250000 0.2622022
 2: 2020      North 0.6250000 0.4330127
 3: 2020      South 0.2916667 0.2922613
 4: 2020       West 0.5000000 0.2958040
 5: 2021       East 0.4791667 0.2671220
 6: 2021      North 0.4791667 0.4063301
 7: 2021      South 0.6458333 0.3825626
 8: 2021       West 0.4166667 0.3322900
 9: 2022       East 0.5625000 0.2931510
10: 2022      North 0.4375000 0.2709935
11: 2022      South 0.4583333 0.2188988
12: 2022       West 0.7083333 0.3227486
13: 2023       East 0.4583333 0.3322900
14: 2023      North 0.8541667 0.2002602
15: 2023      South 0.4375000 0.2931510
16: 2023       West 0.4166667 0.2457980
17: 2024       East 0.2083333 0.2813657
18: 2024      North 0.5833333 0.3415650
19: 2024      South 0.5833333 0.2700309
20: 2024       West 0.6666667 0.4005205
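As a self-contained check of the grouped call, here is a minimal sketch on a tiny made-up table. The object toy and its values are hypothetical and not part of the data above, and the sketch assumes the data.table package is installed. The .N column is added only to show the size of each group.

```r
library(data.table)

# Hypothetical toy data: two years, two conferences, three teams per group
toy <- data.table(
  Year       = rep(c(2020L, 2021L), each = 6L),
  Conference = rep(c("North", "South"), each = 3L, times = 2L),
  WP         = c(1.000, 0.500, 0.000, 0.250, 0.750, 0.500,
                 0.625, 0.375, 0.500, 0.125, 0.875, 0.500)
)

# One row per Year/Conference pair; keyby also sorts the result by the groups
by_group <- toy[, .(n = .N, mean = mean(WP), sd = sd(WP)),
                keyby = .(Year, Conference)]
print(by_group)
```

Using by instead of keyby gives the same numbers but keeps the groups in order of first appearance rather than sorting them.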
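In base R, one can use aggregate. The syntax is a bit cumbersome because the grouping variables have to be passed as a list. Also, I believe only one function can be passed at a time. The calls below use sd.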
> aggregate(Hist$WP, by = list(Hist$Year), sd)
  Group.1         x
1    2020 0.3356464
2    2021 0.3387942
3    2022 0.2823299
4    2023 0.3142163
5    2024 0.3553104
> aggregate(Hist$WP, by = list(Hist$Conference), sd)
  Group.1         x
1    East 0.3043034
2   North 0.3500051
3   South 0.3021713
4    West 0.3255190
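Note that in the aggregate output the first grouping variable varies fastest, whereas in the data.table output the first grouping variable varies slowest.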
> aggregate(Hist$WP, by = list(Hist$Conference, Hist$Year), sd)
   Group.1 Group.2         x
1     East    2020 0.2622022
2    North    2020 0.4330127
3    South    2020 0.2922613
4     West    2020 0.2958040
5     East    2021 0.2671220
6    North    2021 0.4063301
7    South    2021 0.3825626
8     West    2021 0.3322900
9     East    2022 0.2931510
10   North    2022 0.2709935
11   South    2022 0.2188988
12    West    2022 0.3227486
13    East    2023 0.3322900
14   North    2023 0.2002602
15   South    2023 0.2931510
16    West    2023 0.2457980
17    East    2024 0.2813657
18   North    2024 0.3415650
19   South    2024 0.2700309
20    West    2024 0.4005205
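The list-based call can be avoided with aggregate's formula interface, which also keeps the original column names instead of Group.1 and x. The sketch below is an alternative, not what was run above; it builds a small hypothetical data frame (toy) so that it runs on its own. Passing a function that returns a named vector is one way to get several statistics out of a single FUN.

```r
# Hypothetical toy data: two years, two conferences, three teams per group
toy <- data.frame(
  Year       = rep(c(2020L, 2021L), each = 6L),
  Conference = rep(c("North", "South"), each = 3L, times = 2L),
  WP         = c(1.000, 0.500, 0.000, 0.250, 0.750, 0.500,
                 0.625, 0.375, 0.500, 0.125, 0.875, 0.500)
)

# Formula interface: response ~ grouping variables
sd_by_group <- aggregate(WP ~ Year + Conference, data = toy, FUN = sd)

# Sort so that Year varies slowest, matching the data.table row order
sd_by_group <- sd_by_group[order(sd_by_group$Year, sd_by_group$Conference), ]

# FUN may return a named vector; the result is then a matrix column,
# which do.call(data.frame, ...) flattens into WP.mean and WP.sd
stats_by_group <- do.call(
  data.frame,
  aggregate(WP ~ Year + Conference, data = toy,
            FUN = function(x) c(mean = mean(x), sd = sd(x)))
)

print(sd_by_group)
print(stats_by_group)
```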
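If the structure of the result is less important, tapply can also be used. It returns a named vector (or a matrix for two grouping variables) rather than a data frame.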
> tapply(Hist$WP, Hist$Year, sd)
     2020      2021      2022      2023      2024
0.3356464 0.3387942 0.2823299 0.3142163 0.3553104
> tapply(Hist$WP, Hist$Conference, sd)
     East     North     South      West
0.3043034 0.3500051 0.3021713 0.3255190
> tapply(Hist$WP, list(Hist$Year, Hist$Conference), sd)
          East     North     South      West
2020 0.2622022 0.4330127 0.2922613 0.2958040
2021 0.2671220 0.4063301 0.3825626 0.3322900
2022 0.2931510 0.2709935 0.2188988 0.3227486
2023 0.3322900 0.2002602 0.2931510 0.2457980
2024 0.2813657 0.3415650 0.2700309 0.4005205
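Since the question mentions group_by, here is a minimal dplyr sketch of the same grouped summary. It assumes dplyr (version 1.0 or later, for the .groups argument) is installed and uses a hypothetical toy table rather than the real data. Two details may explain the behaviour described in the question, though I can't tell without seeing the code: R is case-sensitive and the function is stats::sd (there is no stats::SD), and a summarise from another attached package such as plyr can mask dplyr's and return one overall value, which calling dplyr::summarise explicitly avoids.

```r
library(dplyr)

# Hypothetical toy data: two years, two conferences, three teams per group
toy <- data.frame(
  Year       = rep(c(2020L, 2021L), each = 6L),
  Conference = rep(c("North", "South"), each = 3L, times = 2L),
  WP         = c(1.000, 0.500, 0.000, 0.250, 0.750, 0.500,
                 0.625, 0.375, 0.500, 0.125, 0.875, 0.500)
)

# One row per Year/Conference pair with the standard deviation of WP
sd_by_group <- toy %>%
  group_by(Year, Conference) %>%
  dplyr::summarise(sd_wpct = stats::sd(WP), .groups = "drop")

print(sd_by_group)
```

No separate data frame per year or conference is needed with any of these approaches; the grouping does that split internally.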