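I have some GIS data containing origins and destinations (OD), along with information about the time of day of each OD. I plan to map this and color each OD by its time of day.
One issue is that some ODs can occur during both day and night, possibly with origin and destination in a different order. I would like to label these differently, e.g. "Day/Night".
Is there a simple way to do this? My MWE contains only one such OD, but I need to identify it among several others. I can find the duplicates regardless of order, but I don't know how to check whether both time cases are present and how to replace them with "Day/Night".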
library(data.table)
Origin<-c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination<-c("Paris", "London", "Berlin","Lisbon", "Lisbon", "Paris")
Time=factor(c("Day", "Night", "Day", "Day/Night","Day", "Day/Night"))
dt<-data.table(Origin=Origin, Destination=Destination, Time=Time)
#duplicates regardless of order
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast=TRUE),]
You can use the dplyr package, as shown below.
Feel free to change the conditions to suit your needs.
library(data.table)
library(dplyr)
# Creating data
dt <-
  data.table(
    Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
    Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
    Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
  )
dt
# Origin Destination Time
# London Paris Day
# Paris London Night
# Italy Norway Day
# Spain Portugal <NA>
# Portugal Spain <NA>
# Poland Spain <NA>
dt %>%
  # pmin and pmax are used to sort the 2 columns
  # in order to group by them regardless of their order
  group_by(Origin2 = pmin(Origin, Destination),
           Destination2 = pmax(Origin, Destination)) %>%
  mutate(count = n(),        # To check if Origin/Destination are repeated or not
         row = row_number(), # Placeholder to know if it was first to repeat or second
         # If not repeated then make Time = Day
         # If repeated and first occurrence then Time = Day
         # If repeated and second occurrence then Time = Night
         Time = case_when(count == 1 ~ "Day",
                          count == 2 & row == 1 ~ "Day",
                          count == 2 & row == 2 ~ "Night")) %>%
  ungroup() %>%
  select(Origin, Destination, Time)
# Origin Destination Time
# <chr> <chr> <chr>
# 1 London Paris Day
# 2 Paris London Night
# 3 Italy Norway Day
# 4 Spain Portugal Day
# 5 Portugal Spain Night
# 6 Poland Spain Day
Thanks to @Nareman Darwisch for the dplyr solution, which inspired my data.table solution.
I create a new variable that serves as a unique ID for each origin-destination pair:
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt.temp<-data.table(dat.sort)
dt.temp[,unique.name:=paste(V1,V2)]
dt$unique.name<-factor(dt.temp$unique.name)
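As a possible alternative to the apply/sort step above, the order-independent ID could probably be built directly with pmin() and pmax(), the same idea the dplyr answer uses. This is only a sketch on the question's example data; the column name pair.key is my own choice and not part of the original code.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  Origin      = c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London"),
  Destination = c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris"),
  Time        = factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
)

# pmin()/pmax() return the alphabetically smaller/larger name for each row,
# so "London"/"Paris" and "Paris"/"London" end up with the same key.
# pair.key is a hypothetical column name used only for this illustration.
dt[, pair.key := paste(pmin(Origin, Destination), pmax(Origin, Destination))]
```

This avoids the row-wise apply() and the temporary table, which may matter on larger data.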
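Then, by group, I can either count the number of distinct values of the factor, or check whether more than one row matches any of the 3 levels. Based on that, whenever the count is > 1 or the other condition is TRUE, I can recode the label to the "Day/Night" level.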
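# number of distinct Time values observed within each pair
dt[, No.levels := length(unique(Time)), by = unique.name]
# as.integer() gives the level codes; since R 4.1.0, c() on a factor returns a factor
dt[, No.levels.logi := sum(as.integer(Time) %in% 1:3) > 1, by = unique.name]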
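The recoding step described above is not shown, so here is a minimal sketch of how it might look on the question's example data. It assumes that any pair with more than one distinct Time value should become "Day/Night"; the column Time.label is a hypothetical name, and the label is kept as character to sidestep factor-level handling.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  Origin      = c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London"),
  Destination = c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris"),
  Time        = factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
)

# Order-independent ID for each origin-destination pair
dt[, unique.name := paste(pmin(Origin, Destination), pmax(Origin, Destination))]

# Number of distinct Time values observed within each pair
dt[, No.levels := uniqueN(Time), by = unique.name]

# Assumption: more than one distinct value means the pair occurs at both times.
# Time.label is a hypothetical output column; the original Time is left untouched.
dt[, Time.label := ifelse(No.levels > 1, "Day/Night", as.character(Time))]
```

Writing to a separate column makes it easy to compare the recoded label with the original Time before overwriting anything.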
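I would also like to understand how to use a logical condition in the spirit of looking at the levels by group and comparing them with the desired cases: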
dt[,No.levels.logi:=sum(levels(Time) %in% c("Day", "Night"))>1 , by=unique.name]
But I suppose levels() always gives me all three levels, whichever group it is called on.
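That suspicion seems right: levels() reports the full level set of the factor, not the values present in a group. A small sketch on the question's data, comparing it with a check on the observed values instead; the names n.levels and has.both are hypothetical and only used for illustration.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  Origin      = c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London"),
  Destination = c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris"),
  Time        = factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
)

# Order-independent ID for each origin-destination pair
dt[, unique.name := paste(pmin(Origin, Destination), pmax(Origin, Destination))]

# levels() ignores which rows are in the group: every pair reports all levels
lev <- dt[, .(n.levels = length(levels(Time))), by = unique.name]

# Comparing against the observed values instead: TRUE only when both
# "Day" and "Night" actually occur among the rows of the pair
dt[, has.both := all(c("Day", "Night") %in% as.character(Time)), by = unique.name]
```

Whether a pair that only carries "Day/Night" should also count as a match depends on the data, so that case is deliberately left out here.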
If I understand correctly, the OP wants