Loop to replace matching values

Question Votes: 1 Answers: 2

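I am looking for a simple and elegant way to achieve this. If I have dataset x, where the relationships are A -> B -> Z -> Y and D -> H -> G, I want to create dataset y. Unfortunately, the rows are not necessarily in order: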

> x <- data.frame(
+     from = as.character(c("A", "E", "B", "D", "H", "Z")), 
+     to = as.character(c("B", "E", "Z", "H", "G", "Y")))
> 
> y <- data.frame(
+     from = as.character(c("A", "E", "B", "D", "H", "Z")), 
+     to = as.character(c("Y", "E", "Y", "G", "G", "Y")))
> 
> x
  from to
1    A  B
2    E  E
3    B  Z
4    D  H
5    H  G
6    Z  Y
> y
  from to
1    A  Y
2    E  E
3    B  Y
4    D  G
5    H  G
6    Z  Y

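I have a fairly large dataset (currently 500k rows; it will grow in the future) and I do care about performance; I am not sure whether there is a way to do this without a for loop, or even to vectorise/parallelise the process. I was thinking of splitting off and dropping all rows where from == to, or creating pointers to skip certain rows, so that the loop does not have to go through the whole dataset every time. I would also like to know what the break condition should be if I do write a loop; I am not sure how to define when the loop should stop. Any suggestions would be greatly appreciated. Thank you!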
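One possible stopping rule for the loop idea above is a fixed point: keep stepping every `to` value one link forward and stop as soon as a full pass changes nothing. The sketch below is only an illustration under stated assumptions, not something taken from the answers: `resolve_chains` and `max_iter` are hypothetical names, each `from` value is assumed to appear once, and the data is assumed to contain no cycles other than self-links such as E -> E. The `max_iter` guard exists because a longer cycle could otherwise keep the loop from settling.

```r
# Example data from the question (stringsAsFactors = FALSE keeps plain strings).
x <- data.frame(
  from = c("A", "E", "B", "D", "H", "Z"),
  to   = c("B", "E", "Z", "H", "G", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow the from -> to links until nothing changes.
# max_iter is an arbitrary safety guard (an assumption, not a tuned value).
resolve_chains <- function(df, max_iter = 64L) {
  from <- as.character(df$from)
  to   <- as.character(df$to)
  for (i in seq_len(max_iter)) {
    # Row in which each current target appears as `from`; NA for end points.
    idx <- match(to, from)
    has_next <- !is.na(idx)
    # Jump to the target's own current target. Because `to` is updated in
    # every pass, the distance covered roughly doubles per iteration.
    nxt <- to
    nxt[has_next] <- to[idx[has_next]]
    # Stopping condition: a whole pass left every value unchanged.
    if (identical(nxt, to)) {
      df$to <- to
      return(df)
    }
    to <- nxt
  }
  stop("no fixed point after ", max_iter, " iterations; the data may contain a cycle")
}

resolve_chains(x)
# `to` column of the result: Y E Y G G Y
```

Each pass is a single vectorised `match()` over the whole column, so row order does not matter and no explicit per-row loop is needed; whether this is fast enough for 500k rows would have to be measured on the real data.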

r loops for-loop
2 Answers
1
vote

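We can use dplyr to create a grouping variable by comparing adjacent elements of 'to' and 'from', and then change the values in 'to' to the last element of 'to' within each group: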

library(dplyr)
x %>% 
    group_by(grp = cumsum(lag(lead(from, default = last(from)) != 
      as.character(to), default = TRUE))) %>% 
    mutate(to = last(to)) %>%
    ungroup %>%
    select(-grp)
# A tibble: 4 x 2
#  from   to    
# <fctr> <fctr>
#1 A      D     
#2 B      D     
#3 C      D     
#4 E      E    

1
vote

Another solution can be achieved using lag from dplyr and fill from tidyr:

library(tidyverse)

x %>% arrange(from) %>%
  mutate(samegroup = ifelse(from == lag(to), 1, 0)) %>%
  mutate(group = ifelse(samegroup == 0 | is.na(samegroup), row_number(), NA)) %>%
  fill(group) %>%
  group_by(group) %>%
  mutate(to = last(to)) %>%
  ungroup() %>%
  select(-samegroup, - group)

# A tibble: 6 x 2
#  from  to   
#  <chr> <chr>
#1 A     D    
#2 B     D    
#3 C     D    
#4 E     E    
#5 F     H 
#6 G     H 

Data used

x <- data.frame(from = as.character(c("A", "B", "F", "C", "G", "E")), 
   to = as.character(c("B", "C", "G", "D", "H", "E")), 
   stringsAsFactors = FALSE)