df <- structure(list(stop1 = c("New York", "Milwaukee", "New York",
"Los Angeles", NA, "Milwaukee"),
stop2 = c(NA, "New York", "Los Angeles",
"New York", NA, "New York"),
stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"),
stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)),
class = "data.frame", row.names = c(NA, -6L))
df
        stop1       stop2 stop1_apple stop1_pear stop2_apple stop2_pear
1    New York        <NA>       apple       pear        <NA>       <NA>
2   Milwaukee    New York       apple       pear       apple       pear
3    New York Los Angeles        <NA>       pear       apple       pear
4 Los Angeles    New York       apple       <NA>        <NA>       pear
5        <NA>        <NA>        <NA>       <NA>        <NA>       <NA>
6   Milwaukee    New York       apple       pear       apple       <NA>
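Each row is one traveler. For example, row 3 is a traveler who stopped in New York and then Los Angeles. At her first stop, New York, she ate a pear. At her second stop, Los Angeles, she ate an apple and a pear.
First, I want to count the total number of people who passed through each city. For example, 5 people stopped in New York (2 at stop1 and 3 at stop2). Second, I want to count the apples and pears eaten in each city. For example, 3 apples and 4 pears were eaten in New York.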
Desired output:
Location     NStops  NApples  NPears
Los Angeles       2        2       1
Milwaukee         2        2       2
New York          5        3       4
library(tidyverse)
df %>%
  pivot_longer(c(stop1, stop2)) %>%
  rename(Location = value, Stop = name) %>%
  add_count(Location, name = "NStops") %>%
  pivot_longer(c(stop1_apple, stop1_pear, stop2_apple, stop2_pear)) %>%
  group_by(Location, value, NStops) %>%
  summarise(NFruits = n()) %>%
  pivot_wider(names_from = value, values_from = NFruits) %>%
  rename(NApples = apple, NPears = pear) %>%
  select(-`NA`) %>%
  filter(!is.na(Location))
Output:
Location     NStops  NApples  NPears
Los Angeles       2        2       3
Milwaukee         2        4       3
New York          5        7       7
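The number of stops per city is correct, but the fruit counts do not match the desired output shown above. Note also that my real data has more than 2 stops, more than 2 fruits, and dozens of cities. Finally, I would prefer a tidyverse solution if possible.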
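One plausible reading of why the fruit counts are inflated: after the first `pivot_longer()`, every stop row still carries all four fruit columns, so the second `pivot_longer()` credits the fruit from stop2 to the city of stop1 and vice versa. The sketch below reproduces both sets of numbers from the question's data. The column names `fruit_col` and `fruit` and the objects `long`, `naive`, and `matched` are illustrative names, not part of the original code.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  stop1 = c("New York", "Milwaukee", "New York", "Los Angeles", NA, "Milwaukee"),
  stop2 = c(NA, "New York", "Los Angeles", "New York", NA, "New York"),
  stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
  stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
  stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"),
  stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)
)

# Two pivots in a row: each (traveler, stop) row is paired with
# the fruit columns of BOTH stops.
long <- df %>%
  pivot_longer(c(stop1, stop2), names_to = "Stop", values_to = "Location") %>%
  pivot_longer(matches("^stop\\d+_"), names_to = "fruit_col", values_to = "fruit") %>%
  filter(!is.na(Location), !is.na(fruit))

# Counting every pairing reproduces the inflated numbers.
naive <- count(long, Location, fruit)

# Keeping only fruit whose column prefix equals the stop name
# ("stop1_apple" -> "stop1") gives the desired numbers.
matched <- long %>%
  filter(sub("_.*$", "", fruit_col) == Stop) %>%
  count(Location, fruit)

naive
matched
```

Comparing `naive` with `matched` shows that the stop and fruit columns need to be reshaped together, so that each fruit stays attached to the stop where it was eaten.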
library(dplyr)
library(tidyr)
df <- structure(list(stop1 = c("New York", "Milwaukee", "New York",
"Los Angeles", NA, "Milwaukee"),
stop2 = c(NA, "New York", "Los Angeles",
"New York", NA, "New York"),
stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"),
stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)),
class = "data.frame", row.names = c(NA, -6L))
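df %>%
  # Give the city columns the same "stopN_<name>" shape as the fruit columns
  rename(
    stop1_stop = stop1,
    stop2_stop = stop2
  ) %>%
  # One row per traveler and stop; the part after "stopN_" becomes the column name
  pivot_longer(
    everything(),
    names_pattern = "stop\\d_(.*)",
    names_to = ".value"
  ) %>%
  # Count visits and fruit per city
  summarize(
    n_stop = n(),
    n_apple = sum(apple == "apple", na.rm = TRUE),
    n_pear = sum(pear == "pear", na.rm = TRUE),
    .by = stop
  ) %>%
  # Drop the group of missing stops
  filter(!is.na(stop))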
#> # A tibble: 3 × 4
#>   stop        n_stop n_apple n_pear
#>   <chr>        <int>   <int>  <int>
#> 1 New York         5       3      4
#> 2 Milwaukee        2       2      2
#> 3 Los Angeles      2       2      1
Created on 2024-10-09 with reprex v2.1.1
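Since the question mentions more than 2 stops and more than 2 fruits, the same idea can probably be generalized so that no column is named by hand. The following is a sketch under two assumptions that are not confirmed by the question: the city columns are named exactly `stop<N>`, and any non-missing value in a fruit column means the fruit was eaten. The object name `result` is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  stop1 = c("New York", "Milwaukee", "New York", "Los Angeles", NA, "Milwaukee"),
  stop2 = c(NA, "New York", "Los Angeles", "New York", NA, "New York"),
  stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
  stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
  stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"),
  stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)
)

result <- df %>%
  # Assumption: city columns are named "stop1", "stop2", ...; append "_stop"
  rename_with(~ paste0(.x, "_stop"), .cols = matches("^stop\\d+$")) %>%
  # "\\d+" also covers stop numbers with more than one digit
  pivot_longer(
    everything(),
    names_pattern = "stop\\d+_(.*)",
    names_to = ".value"
  ) %>%
  filter(!is.na(stop)) %>%
  summarize(
    # across() comes first so that it only sees the fruit columns;
    # assumption: a non-missing value means the fruit was eaten
    across(everything(), ~ sum(!is.na(.x)), .names = "n_{.col}"),
    n_stop = n(),
    .by = stop
  ) %>%
  relocate(n_stop, .after = stop)

result
```

Placing `across()` before `n_stop = n()` is deliberate: inside `summarize()`, `across()` also picks up columns created earlier in the same call, so the count column would otherwise be summarized as if it were a fruit.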