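I have this data, where each row is a year, with separate columns holding details of the Best Picture, Best Actor, and Best Actress speeches.
I need to reshape the dataset so that each year has 3 rows, with a new column, type, identifying which speech "type" the row corresponds to (see the output below). The thanksM and thanksW counts also need to be combined (they end up in the single thanks column shown below).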
## wcnt: number of words in the Best Picture acceptance speech
## year: movie release year (broadcast occurs in year+1)
## budget: total unadjusted budget in US dollars
## inflate: Inflation rate with respect to Fall 2018
## thanksP: number of "thanks" in the Best Picture speech
## man: number of words in the Best Leading Actor speech
## woman: number of words in the Best Leading Actress speech
## thanksM: number of "thanks" in the Best Leading Actor speech
## thanksW: number of "thanks" in the Best Leading Actress speech
oscars <- read.table(header = TRUE, sep = ",", text = "
wcnt, year, budget, inflate, thanksP, man, woman, thanksM, thanksW, time
212, 1942 , 1344000 , 16.06, 3, 101, 452 , 1 , 2 , 108
119, 1946 ,2100000 , 13.85, 1, 56, 218 , 2 , 1 , 101
176, 1947 ,2000000 ,11.73, 5, 96, 220 , 1 , 1 , 172
50, 1949 , 0 ,10.51, 4, 29 , 31 , 3 , 1 , 118
34, 1950 , 1400000, 10.73, 4 , 208 , 46 , 3 , 1 ,110
31, 1951 , 2723903, 9.93, 3 , 73 , 43 , 1 , 1 ,138
156, 1952 , 4000000, 9.51, 3 , 159 , 100 , 0 , 4 , 113
97, 1953 , 1650000, 9.48, 3 , 4, 33 , 2 , 1 , 93
46, 1954 , 910000, 9.37, 1 , 64, 33 , 1 , 2 , 118
70, 1955 , 343000, 9.44, 1 , 61, 71 , 4 , 1 , 108
35, 1956 , 6000000, 9.41, 2 , 22 , 132 , 1 , 3 , 90
91, 1957 , 3000000, 9.14, 1 , 79, 41 , 2 , 3 , 188
20, 1958 , 3319355, 8.82, 1 , 36 , 39 , 2 , 4 , 161
81, 1959 ,15900000, 8.69, 1 , 131, 78 , 3 , 4 , 115
70, 1960 , 3000000 , 8.61, 1 , 76 , 30 , 3 , 2 , 125
125, 1961 , 6000000, 8.46, 2 , 104, 71 , 1 , 0 , 130
90, 1962 ,15000000 , 8.40, 2 , 74 , 28 , 5 , 1 , 150
64, 1963 , 1000000, 8.29, 1 , 52 , 55 , 1 , 3 , 128
159, 1964 ,17000000, 8.16, 6 , 81 , 97 , 2 , 6 , 170
69, 1965 , 8200000, 8.08 , 4 , 46 , 24 , 4 , 2 , 174
4, 1966 , 2000000, 7.93 , 1 , 62 , 36 , 1 , 2 , 151
99, 1967 , 2000000 , 7.66 , 3 , 120 , 44 , 11 , 2 , 110
62, 1968 ,10000000 , 7.39 , 2 , 44 , 50 , 2 , 1 , 153
37, 1969 , 3600000 , 7.08 , 3 ,127 , 74 , 3 , 2 , 145
51, 1970 ,12000000, 6.67 ,5 , 44 , 41 , 0 , 2 , 172
66, 1971 , 1800000, 6.34 , 2 , 143 , 41 , 5 , 4 , 104
217, 1972 , 6000000, 6.13 , 2 , 141 , 58 , 1 , 4 , 158
127, 1973 , 5500000 , 5.92 , 4 , 240 , 119 , 3 , 5 , 203
73, 1974 ,13000000 , 5.41 , 7 , 59 , 57 , 3 , 4 , 200
236, 1975 , 4400000 , 4.84 , 3 , 106 , 131 , 3 , 3 , 192
125, 1976 , 960000 , 4.53 , 5 , 193 , 82 , 7 , 4 , 218
216, 1977 , 4000000 , 4.31 , 3 , 77 , 60 , 1 , 3 , 210
68, 1978 ,15000000 , 4.03 , 5 , 317 , 367 , 8 , 11 , 215
208, 1979 , 8000000 , 3.69 , 1 , 362 , 287 , 4 , 3 ,192
162, 1980 , 6000000 , 3.24 , 5 , 240 , 137 , 3 , 2 , 193
188, 1981 , 5500000, 2.90 , 4 , 590 , 0 , 6 , 0 , 204
427, 1982 ,22000000, 2.67 , 1 , 123 , 231 , 1 , 6 , 195
192, 1983 , 8000000, 2.58 , 2 ,265 , 359 , 3 , 3 , 222
248, 1984 ,18000000 , 2.47 , 4, 127 , 144 , 1 , 2 , 190
48, 1985 ,31000000 , 2.39 , 3 , 55 , 119 , 2 , 5 , 182
279, 1986 , 6000000 , 2.30 , 5 , 97 , 104 , 1 , 5 , 199
118, 1987 ,23000000 , 2.27 , 4 , 316 , 184 , 8 , 5 , 213
207, 1988 ,25000000 , 2.18 , 5 , 326 , 140 , 11 , 3 , 199
213, 1989 , 7500000 , 2.08 , 9 , 111 , 100 , 1 , 2 , 217
258, 1990 ,22000000 , 1.98 , 3 , 126 , 189 , 8 , 9 , 215
236, 1991 ,19000000 , 1.87 , 7 ,159 , 278 , 3 , 9 , 213
123, 1992 ,14400000 , 1.83 , 5, 472 , 185 , 11 , 3 , 210
282, 1993 ,22000000 , 1.77, 8 , 414 , 264 , 0 , 5 , 198
423, 1994 ,55000000 , 1.72, 9 , 228 , 201 , 3 , 3 , 215
145, 1995 ,72000000 , 1.68, 9 , 184 , 317 , 4 , 12 , 218
243, 1996 ,27000000 , 1.63, 6 , 226 , 200 , 5 , 1 , 214
594, 1997 ,200000000 , 1.58, 5 , 193 , 271 , 3 , 6 , 227
386, 1998 ,25000000 , 1.56, 8 , 198 , 363 , 7 , 11 , 242
321, 1999 ,15000000 , 1.53, 9 ,260 , 385 , 7 , 9 , 249
314, 2000 ,103000000 , 1.49, 10, 253 , 396 , 4 , 5 , 203
378 ,2001, 58000000 , 1.44 , 11 , 302 , 528 , 4 , 32 , 263
232, 2002, 45000000 , 1.42, 2 , 462 , 234 , 10 , 2 , 210
436, 2003, 94000000 , 1.39 , 4 , 139 , 287 , 3 , 15 , 224
265, 2004, 30000000 , 1.36 , 6 , 490 , 354 , 15 , 11 , 194
193, 2005, 6500000 , 1.32 , 12 , 208 , 436 , 8 , 11 , 213
257, 2006, 90000000 , 1.27 , 8 ,297 , 192 , 8 , 6 , 231
181, 2007, 25000000 , 1.25 , 6 , 199 , 72 , 6 , 6 , 201
241, 2008, 15000000 , 1.19 , 5 , 300 , 328 , 4 , 4 , 210
271, 2009, 15000000 , 1.19 , 8 , 302 , 468 , 12 , 11 , 217
273, 2010, 15000000 , 1.16 , 9 , 319 , 361 , 2 , 6 , 195
263, 2011, 15000000 , 1.14 , 8 , 122 , 270 , 7 , 11 , 194
634, 2012, 44500000 , 1.11 , 22 , 254 , 118 , 2 , 7 , 215
380, 2013, 20000000 , 1.09 , 14 ,549 , 513 , 12 , 11 , 214
431, 2014, 18000000 , 1.08 , 10 ,195 , 324 , 5 , 8 , 223
148, 2015, 20000000 , 1.08 ,4, 402 , 178 , 10 , 10 , 217
283, 2016, 1500000 , 1.06 , 9 , 218 , 294 , 4 , 9 , 229
213, 2017, 19400000 , 1.04 , 4 , 293 , 264 , 8 , 3 , 233")
Desired output:
year words thanks type
1942 212 3 BestPicture
1942 101 1 Actor
1942 452 2 Actress
1946 119 1 BestPicture
1946 56 2 Actor
1946 218 1 Actress
1947 176 5 BestPicture
1947 96 1 Actor
1947 220 1 Actress
A different tidyverse possibility could be:
library(dplyr)
library(tidyr)

bind_cols(
  # word counts in long form: one row per year and speech type
  oscars %>%
    select(-budget, -inflate, -time, -contains("thanks")) %>%
    gather(type, words, -c(year)) %>%
    mutate(type = ifelse(type == "wcnt", "BestPicture",
                         ifelse(type == "man", "Actor", "Actress"))) %>%
    arrange(year, type),
  # "thanks" counts, sorted the same way so the rows line up by position
  oscars %>%
    select(-budget, -inflate, -time, -wcnt, -man, -woman) %>%
    gather(temp, thanks, -c(year)) %>%
    mutate(temp = ifelse(temp == "thanksP", "BestPicture",
                         ifelse(temp == "thanksM", "Actor", "Actress"))) %>%
    arrange(year, temp) %>%
    select(-year, -temp)
)
Output (first rows):
year type words thanks
1 1942 Actor 101 1
2 1942 Actress 452 2
3 1942 BestPicture 212 3
4 1946 Actor 56 2
5 1946 Actress 218 1
6 1946 BestPicture 119 1
7 1947 Actor 96 1
8 1947 Actress 220 1
9 1947 BestPicture 176 5
10 1949 Actor 29 3
11 1949 Actress 31 1
12 1949 BestPicture 50 4
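Note that bind_cols() pairs rows purely by position, so the approach above depends on both halves being sorted identically. As a hedged alternative, here is a minimal sketch that joins the two long tables on year and type instead. The names oscars_small, labels, words_long, thanks_long and joined are made up for illustration, and oscars_small holds only the first three years of the question's data so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Illustrative subset: first three years of the question's data
oscars_small <- data.frame(
  wcnt    = c(212, 119, 176),
  year    = c(1942, 1946, 1947),
  thanksP = c(3, 1, 5),
  man     = c(101, 56, 96),
  woman   = c(452, 218, 220),
  thanksM = c(1, 2, 1),
  thanksW = c(2, 1, 1)
)

# Lookup from source column name to speech type (hypothetical helper)
labels <- c(wcnt = "BestPicture", man = "Actor", woman = "Actress",
            thanksP = "BestPicture", thanksM = "Actor", thanksW = "Actress")

# Word counts in long form
words_long <- oscars_small %>%
  select(year, wcnt, man, woman) %>%
  gather(type, words, -year) %>%
  mutate(type = unname(labels[type]))

# "thanks" counts in long form
thanks_long <- oscars_small %>%
  select(year, starts_with("thanks")) %>%
  gather(type, thanks, -year) %>%
  mutate(type = unname(labels[type]))

# Match rows by key rather than by position
joined <- inner_join(words_long, thanks_long, by = c("year", "type")) %>%
  arrange(year, type)

print(joined)
```

Because the rows are matched on year and type, the result does not depend on how either intermediate table happens to be ordered.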
We can use melt from data.table:
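library(data.table)
# as.data.table() copies, so oscars keeps its original column names
# (setDT() + setnames() would rename the columns of oscars by reference)
DT <- as.data.table(oscars)
setnames(DT, c("wcnt", "man", "woman"), c("wcntP", "wcntM", "wcntW"))
output <- melt(DT[, .SD, .SDcols = names(DT) %like% "year|^thanks|^wcnt"],
               id.vars = "year", measure.vars = patterns("^thanks", "^wcnt"),
               variable.name = "type", value.name = c("thanks", "words"))[order(year)]
# melt numbers the groups 1, 2, 3 in column order (P, M, W); relabel them
levels(output$type) <- c("BestPicture", "Actor", "Actress")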
Output:
year type thanks words
1: 1942 BestPicture 3 212
2: 1942 Actor 1 101
3: 1942 Actress 2 452
4: 1946 BestPicture 1 119
5: 1946 Actor 2 56
---
212: 2016 Actor 4 218
213: 2016 Actress 9 294
214: 2017 BestPicture 4 213
215: 2017 Actor 8 293
216: 2017 Actress 3 264
We can also use gather from tidyr together with dplyr, but it looks less efficient than data.table::melt:
library(dplyr)
library(tidyr)
oscars %>%
  select(year, starts_with("thanks"), wcnt, man, woman) %>%
  gather(type, thanks, starts_with("thanks")) %>%
  gather(type2, words, wcnt, man, woman) %>%
  arrange(year) %>%
  filter((type == "thanksP" & type2 == "wcnt") |
         (type == "thanksM" & type2 == "man") |
         (type == "thanksW" & type2 == "woman")) %>%
  mutate(type = case_when(type == "thanksP" ~ "BestPicture",
                          type == "thanksM" ~ "Actor",
                          TRUE ~ "Actress")) %>%
  select(year, words, thanks, type)
Output:
year words thanks type
1 1942 212 3 BestPicture
2 1942 101 1 Actor
3 1942 452 2 Actress
4 1946 119 1 BestPicture
5 1946 56 2 Actor
6 1946 218 1 Actress
7 1947 176 5 BestPicture
8 1947 96 1 Actor
9 1947 220 1 Actress
10 1949 50 4 BestPicture
11 1949 29 3 Actor
12 1949 31 1 Actress
13 1950 34 4 BestPicture
14 1950 208 3 Actor
15 1950 46 1 Actress
16 1951 31 3 BestPicture
17 1951 73 1 Actor
18 1951 43 1 Actress
19 1952 156 3 BestPicture
20 1952 159 0 Actor
...
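In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), which can split the column names into a value part and a type part in one step and so avoids the double gather plus filter. This is a different technique from the answer above, shown only as a minimal sketch: it assumes tidyr >= 1.0.0, and oscars_small, type_labels and long are illustrative names, with oscars_small holding just the first three years of the question's data.

```r
library(dplyr)
library(tidyr)

# Illustrative subset: first three years of the question's data
oscars_small <- data.frame(
  wcnt    = c(212, 119, 176),
  year    = c(1942, 1946, 1947),
  thanksP = c(3, 1, 5),
  man     = c(101, 56, 96),
  woman   = c(452, 218, 220),
  thanksM = c(1, 2, 1),
  thanksW = c(2, 1, 1)
)

# Lookup from the P/M/W suffix to the speech type (hypothetical helper)
type_labels <- c(P = "BestPicture", M = "Actor", W = "Actress")

long <- oscars_small %>%
  # give the word-count columns the same P/M/W suffix as the thanks* columns
  rename(wordsP = wcnt, wordsM = man, wordsW = woman) %>%
  pivot_longer(
    cols = -year,
    # ".value" turns the first regex group into output column names
    names_to = c(".value", "type"),
    names_pattern = "^(words|thanks)([PMW])$"
  ) %>%
  mutate(type = unname(type_labels[type])) %>%
  select(year, words, thanks, type)

print(long)
```

Each input row becomes three output rows, ordered BestPicture, Actor, Actress within a year, which matches the layout requested in the question.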