I need to remove duplicate rows, based on both duplication and a condition. Please find a sample data frame below.
Sr. Invoice Status Name
1 XXX Booked ABC
2 YYY Booked DEF
3 YYY Cancelled DEF
4 ZZZ Booked GHI
5 ZZZ Changed GHI
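I need to remove both instances of a duplicated invoice when the status of one of them is Cancelled.
This is what the data frame should look like: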
Sr. Invoice Status Name
1 XXX Booked ABC
2 ZZZ Booked GHI
3 ZZZ Changed GHI
I also need a separate data frame for the removed set, that is:
Sr. Invoice Status Name
1 YYY Booked DEF
2 YYY Cancelled DEF
Given a dataset like this:
> d
Sr Invoice Status Name
1 1 XXX Book AB
2 2 YYY Book DE
3 3 YYY Cancelled DE
4 4 ZZZ Book GH
5 5 ZZZ Changed GH
This expression finds all invoice codes that have a "Cancelled" status:
> d$Invoice[d$Status=="Cancelled"]
[1] "YYY"
This expression then returns TRUE or FALSE for every row, depending on whether its invoice is one of those codes:
> d$Invoice %in% d$Invoice[d$Status=="Cancelled"]
[1] FALSE TRUE TRUE FALSE FALSE
You can then use that logical vector to split the data frame. For example, use split to get a list of two elements:
> split(d, d$Invoice %in% d$Invoice[d$Status=="Cancelled"])
$`FALSE`
Sr Invoice Status Name
1 1 XXX Book AB
4 4 ZZZ Book GH
5 5 ZZZ Changed GH
$`TRUE`
Sr Invoice Status Name
2 2 YYY Book DE
3 3 YYY Cancelled DE
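As a minimal sketch of how the two pieces might then be pulled out of that list, the following rebuilds the sample data and extracts each element by name. The names `has_cancel`, `parts`, `kept` and `removed` are only illustrative and are not part of the original answer.

```r
# Rebuild the sample data shown above.
d <- data.frame(
  Sr      = 1:5,
  Invoice = c("XXX", "YYY", "YYY", "ZZZ", "ZZZ"),
  Status  = c("Book", "Book", "Cancelled", "Book", "Changed"),
  Name    = c("AB", "DE", "DE", "GH", "GH"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose invoice has at least one "Cancelled" row.
has_cancel <- d$Invoice %in% d$Invoice[d$Status == "Cancelled"]

# split() names the list elements "FALSE" and "TRUE".
parts   <- split(d, has_cancel)
kept    <- parts[["FALSE"]]  # invoices with no cancellation
removed <- parts[["TRUE"]]   # every row of each cancelled invoice
```

If no invoice is cancelled, the list has no "TRUE" element and `parts[["TRUE"]]` is `NULL`, so plain logical indexing (`d[!has_cancel, ]` and `d[has_cancel, ]`) may be the safer choice in that case.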
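Using ave we build a boolean vector per invoice (returned as the strings "TRUE" and "FALSE", because Status is a character column), and using split we use it to create 2 separate data frames: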
split(df1,ave(df1$Status, df1$Invoice, FUN = function(x) tail(x,1) != "Cancelled"))
# $`FALSE`
# Sr. Invoice Status Name
# 2 2 YYY Booked DEF
# 3 3 YYY Cancelled DEF
#
# $`TRUE`
# Sr. Invoice Status Name
# 1 1 XXX Booked ABC
# 4 4 ZZZ Booked GHI
# 5 5 ZZZ Changed GHI
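This answer takes the comments on the question into account; they lead me to believe that only an occurrence of Cancelled as the last element of an invoice is relevant.
Data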
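To illustrate where the "last element" rule and the "any element" rule diverge, here is a small sketch on hypothetical data. The invoice `WWW` and its rows are invented for illustration and do not come from the question; the variable names are illustrative too.

```r
# Hypothetical data: invoice WWW is cancelled and later booked again.
x <- data.frame(
  Invoice = c("WWW", "WWW", "YYY", "YYY"),
  Status  = c("Cancelled", "Booked", "Booked", "Cancelled"),
  stringsAsFactors = FALSE
)

# Rule 1: flag an invoice if any of its rows is "Cancelled".
any_cancel <- x$Invoice %in% x$Invoice[x$Status == "Cancelled"]

# Rule 2: flag an invoice only if its last row is "Cancelled".
# ave() returns a character vector here, hence the comparison with "TRUE".
last_cancel <- ave(x$Status, x$Invoice,
                   FUN = function(s) tail(s, 1) == "Cancelled") == "TRUE"

any_cancel   # TRUE  TRUE  TRUE TRUE
last_cancel  # FALSE FALSE TRUE TRUE
```

Under the first rule both invoices would be removed; under the second only `YYY` would be. Which rule is wanted depends on what a later status is supposed to mean for an earlier cancellation.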
df1 <- read.table(header=TRUE,stringsAsFactors=FALSE,text="Sr. Invoice Status Name
1 XXX Booked ABC
2 YYY Booked DEF
3 YYY Cancelled DEF
4 ZZZ Booked GHI
5 ZZZ Changed GHI")
You can also do it like this:
library(dplyr)
df %>%
group_by(Invoice) %>%
mutate(Cancellation = +(any(Status == 'Cancelled'))) %>%
split(., .$Cancellation) %>%
setNames(., c("NoCancellations", "Cancellations")) %>%
list2env(., .GlobalEnv)
This drops 2 new data frames into your environment, named NoCancellations and Cancellations (you can rename them as needed).
NoCancellations
# A tibble: 3 x 5
# Groups: Invoice [2]
Sr Invoice Status Name Cancellation
<int> <chr> <chr> <chr> <int>
1 1 XXX Book AB 0
2 4 ZZZ Book GH 0
3 5 ZZZ Changed GH 0
Cancellations
# A tibble: 2 x 5
# Groups: Invoice [1]
Sr Invoice Status Name Cancellation
<int> <chr> <chr> <chr> <int>
1 2 YYY Book DE 1
2 3 YYY Cancelled DE 1
The new data frames also contain a column named Cancellation, which was used for the split; if you want, you can drop it, e.g.:
df %>%
group_by(Invoice) %>%
mutate(Cancellation = +(any(Status == 'Cancelled'))) %>%
split(., .$Cancellation) %>%
lapply(., function(x) { x["Cancellation"] <- NULL; x }) %>%
setNames(., c("NoCancellations", "Cancellations")) %>%
list2env(., .GlobalEnv)
Instead of lapply, you could also use purrr::map(., ~ (.x %>% select(-Cancellation))) on that line.
I think there is a simpler way using tidyverse. Create the groups, then use a basic filter with all and any at the group level.
library(tidyverse) # Load library
To remove the groups with a Cancelled status:
df %>%
group_by(Invoice) %>%
filter(all(Status != "Cancelled"))
# A tibble: 3 x 4
# Groups: Invoice [2]
Sr. Invoice Status Name
<dbl> <chr> <chr> <chr>
1 1 XXX Booked ABC
2 4 ZZZ Booked GHI
3 5 ZZZ Changed GHI
To separate out the groups with a Cancelled status:
df %>%
group_by(Invoice) %>%
filter(any(Status == "Cancelled"))
# A tibble: 2 x 4
# Groups: Invoice [1]
Sr. Invoice Status Name
<dbl> <chr> <chr> <chr>
1 2 YYY Booked DEF
2 3 YYY Cancelled DEF
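Since `all(Status != "Cancelled")` is the logical negation of `any(Status == "Cancelled")`, the two filters split the rows into two non-overlapping sets. A minimal sketch that stores both results, assuming the sample data from the question and using the illustrative names `kept` and `removed` (with the column called `Sr` rather than `Sr.` for simplicity):

```r
library(dplyr)

# Sample data from the question.
df <- data.frame(
  Sr      = 1:5,
  Invoice = c("XXX", "YYY", "YYY", "ZZZ", "ZZZ"),
  Status  = c("Booked", "Booked", "Cancelled", "Booked", "Changed"),
  Name    = c("ABC", "DEF", "DEF", "GHI", "GHI"),
  stringsAsFactors = FALSE
)

# Invoices without any "Cancelled" row.
kept <- df %>%
  group_by(Invoice) %>%
  filter(!any(Status == "Cancelled")) %>%
  ungroup()

# Invoices with at least one "Cancelled" row.
removed <- df %>%
  group_by(Invoice) %>%
  filter(any(Status == "Cancelled")) %>%
  ungroup()
```

The `ungroup()` call is optional; it is added here only so that later operations on `kept` and `removed` are not silently carried out per invoice.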