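I am looking for a simple and elegant way to achieve this.
Say I have the data set x, where the relationships are A -> B -> Z -> Y and D -> H -> G, and I want to create the data set y, in which every row points to the final element of its chain. Unfortunately, the rows are not necessarily in order: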
> x <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("B", "E", "Z", "H", "G", "Y")))
>
> y <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("Y", "E", "Y", "G", "G", "Y")))
>
> x
  from to
1    A  B
2    E  E
3    B  Z
4    D  H
5    H  G
6    Z  Y
> y
  from to
1    A  Y
2    E  E
3    B  Y
4    D  G
5    H  G
6    Z  Y
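I have a fairly large data set (currently 500k rows, and it will grow), so I do care about performance. I am not sure whether there is a way to do this without a for loop, or whether the process can be vectorized or parallelized.
I am considering splitting off and removing all rows where from == to, or creating pointers to skip certain rows, so that the loop does not have to pass over the whole data set every time.
I would also like to know what the break condition should be if I do write a loop; I am not sure how to define when the loop should stop.
Any suggestions would be appreciated. Thanks!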
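We can use dplyr: create a grouping variable by comparing adjacent elements of 'to' and 'from', then change the values in 'to' to the last element of 'to' within each group. Because only neighbouring rows are compared, this assumes the links of each chain appear on consecutive rows.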
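library(dplyr)
x %>%
  # a new group starts whenever the previous row's 'to' differs from this row's 'from'
  group_by(grp = cumsum(lag(lead(from, default = last(from)) !=
                            as.character(to), default = TRUE))) %>%
  # every row in a group takes the group's final 'to'
  mutate(to = last(to)) %>%
  ungroup() %>%
  select(-grp)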
# A tibble: 4 x 2
# from to
# <fctr> <fctr>
#1 A D
#2 B D
#3 C D
#4 E E
Another solution can be achieved using lag from dplyr and fill from tidyr:
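library(tidyverse)
x %>% arrange(from) %>%
  # 1 if this row continues the chain of the previous row, 0 otherwise (NA on row 1)
  mutate(samegroup = ifelse(from == lag(to), 1, 0)) %>%
  # label the first row of each chain with its row number, then fill downwards
  mutate(group = ifelse(samegroup == 0 | is.na(samegroup), row_number(), NA)) %>%
  fill(group) %>%
  group_by(group) %>%
  mutate(to = last(to)) %>%
  ungroup() %>%
  select(-samegroup, -group)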
# A tibble: 6 x 2
# from to
# <chr> <chr>
#1 A D
#2 B D
#3 C D
#4 E E
#5 F H
#6 G H
Data:
x <- data.frame(from = as.character(c("A", "B", "F", "C", "G", "E")),
                to = as.character(c("B", "C", "G", "D", "H", "E")),
                stringsAsFactors = FALSE)
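Both pipelines above compare neighbouring rows, so they rely on the links of a chain sitting next to each other (directly, or after sorting by from). When that does not hold, as with the x in the question, one possible alternative is an iterated lookup with base R's match(). It also suggests a stopping rule for the loop asked about: stop once a full pass changes nothing. The following is only a minimal sketch and has not been benchmarked on 500k rows; resolve_chains and max_iter are hypothetical names, and the code assumes each from value occurs at most once.

```r
# Hypothetical helper (not from the answers above): replace each 'to' by the
# end of its chain, regardless of row order.
# Assumption: every value in 'from' appears at most once.
resolve_chains <- function(from, to, max_iter = 100L) {
  from <- as.character(from)
  to   <- as.character(to)
  for (i in seq_len(max_iter)) {
    idx <- match(to, from)        # row whose 'from' equals the current 'to'
    hit <- !is.na(idx)            # FALSE where 'to' is already a chain end
    nxt <- to
    nxt[hit] <- to[idx[hit]]      # jump to that row's current target
    if (identical(nxt, to)) {
      return(to)                  # fixed point: nothing changed, so stop
    }
    to <- nxt
  }
  # Reached only if the values never settle, e.g. a cycle such as A -> B -> C -> A
  warning("no fixed point within max_iter passes; the data may contain a cycle")
  to
}

# Data from the question
x <- data.frame(from = c("A", "E", "B", "D", "H", "Z"),
                to   = c("B", "E", "Z", "H", "G", "Y"),
                stringsAsFactors = FALSE)

result <- resolve_chains(x$from, x$to)
print(result)
# [1] "Y" "E" "Y" "G" "G" "Y"
```

Each pass is vectorized, and because a row jumps to the already updated target of the row it points at, the distance covered doubles per pass, so the number of passes grows roughly with the logarithm of the longest chain rather than with the number of rows.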