I have a data frame like this:
# A tibble: 4 x 5
category month comment score email
<chr> <chr> <chr> <dbl> <chr>
1 neutro 2020-01 "" 8 xxx
2 promotor 2020-04 "ok" 9 xxx
3 promotor 2020-04 "very cool" 9 xxx
4 promotor 2020-05 "i really liked it" 9 xxx
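Unfortunately, the survey had a bug: customers could answer more than once.
So now I am trying to keep only the last answer in each group.
When I use
dplyr::distinct()
it keeps the first occurrence: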
df %>%
distinct(category, month, score, email, .keep_all = TRUE)
# A tibble: 3 x 5
category month comment score email
<chr> <chr> <chr> <dbl> <chr>
1 neutro 2020-01 "" 8 xxx
2 promotor 2020-04 "ok" 9 xxx
3 promotor 2020-05 "i really liked it" 9 xxx
But I want to keep the last one, so this is my desired result:
# A tibble: 3 x 5
category month comment score email
<chr> <chr> <chr> <dbl> <chr>
1 neutro 2020-01 "" 8 xxx
2 promotor 2020-04 "very cool" 9 xxx
3 promotor 2020-05 "i really liked it" 9 xxx
Note: as mentioned in the title, I cannot arrange (sort) the grouping columns.
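Can you use group_by()? Grouping by the key columns and taking the last row of each group with slice_tail() keeps the last answer: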
library(dplyr)
df %>%
group_by(category, month, score, email) %>% # Also group_by(across(-comment)) would work with the example
slice_tail(n = 1) %>% # keep only the last row of each group
ungroup()
Output:
# A tibble: 3 x 5
category month comment score email
<fct> <fct> <fct> <int> <fct>
1 neutro 2020-01 "" 8 xxx
2 promotor 2020-04 "very cool" 9 xxx
3 promotor 2020-05 "i really liked it" 9 xxx
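If loading dplyr is not an option, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: it rebuilds the example data by hand (the name `last_answers` is made up here) and relies on `duplicated()` with `fromLast = TRUE`, which marks every row except the last occurrence of each key combination.

```r
# Rebuild the example data from the question (assumed to match the original df)
df <- data.frame(
  category = c("neutro", "promotor", "promotor", "promotor"),
  month    = c("2020-01", "2020-04", "2020-04", "2020-05"),
  comment  = c("", "ok", "very cool", "i really liked it"),
  score    = c(8, 9, 9, 9),
  email    = "xxx",
  stringsAsFactors = FALSE
)

# Columns that identify a "group" (everything except the comment)
key_cols <- c("category", "month", "score", "email")

# fromLast = TRUE scans from the bottom, so earlier repeats are flagged
# as duplicates and the last answer in each group survives
last_answers <- df[!duplicated(df[key_cols], fromLast = TRUE), ]
print(last_answers)
```

The rows stay in their original order and nothing has to be sorted, which seems to fit the constraint mentioned in the question.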
Another option is the
collapse
package, which can be noticeably faster:
library(collapse)
df %>%
fgroup_by(category, month, score, email) %>%
flast(., na.rm = FALSE) %>%
fungroup()
# category month score email comment
# 1 neutro 2020-01 8 xxx
# 2 promotor 2020-04 9 xxx very cool
# 3 promotor 2020-05 9 xxx i really liked it
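Since the question started from distinct(), here is one more hedged sketch that stays with it: because distinct() keeps the first occurrence, reversing the row order first makes it keep what was originally the last one. The data is rebuilt by hand and the name `last_answers` is hypothetical; the second reversal is only there to restore the original row order.

```r
library(dplyr)

# Rebuild the example data from the question (assumed to match the original df)
df <- tibble(
  category = c("neutro", "promotor", "promotor", "promotor"),
  month    = c("2020-01", "2020-04", "2020-04", "2020-05"),
  comment  = c("", "ok", "very cool", "i really liked it"),
  score    = c(8, 9, 9, 9),
  email    = "xxx"
)

last_answers <- df %>%
  slice(n():1) %>%   # reverse the rows so the last answers come first
  distinct(category, month, score, email, .keep_all = TRUE) %>%
  slice(n():1)       # reverse again to restore the original order

print(last_answers)
```

This only reverses row positions and never sorts by the grouping columns, so it should respect the "cannot arrange" constraint as well.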