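Currently I have two data frames. The first, df1, contains two columns that represent network ties. The other, df2, contains a single column listing all the cases for which I have attribute data.
I want to look up the df2 cases in df1, and if they are missing from one or both of the df1 columns, I want to delete that row. I would then end up with a df1 that contains only network ties between cases for which I have attribute data.
df1 has about 2.4 million ties (an edge list) and df2 has 34k cases.
This is what I tried after googling for a while:
First, I duplicated the two columns of df1 as a backup.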
library(dplyr)
df3 <- df1 %>%
  mutate(friendid = friendid %in% df2$V1,
         friendid = friendid * 1.0) #converts boolean to numeric
df3 <- df3 %>%
  mutate(tieid = tieid %in% df2$V1,
         tieid = tieid * 1.0)
#So what I think is going on here is that the value is turned into 0 if the number is not found and into 1 if it is present. I do this for the two original columns from df1.
#Then I attempt to delete the rows by searching for 0 values on each column (2 and 3, which contain the edgelist).
df3<-df3[apply(df3[2],1,function(z) !any(z==0)),]
df3<-df3[apply(df3[3],1,function(z) !any(z==0)),]
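This process deletes a lot of rows, but I end up with only about 2k cases, which does not seem right. I tried a similar approach in Excel, but it limits how many rows can be loaded at once. After splitting the data set into three separate files and using Kutools, I ended up with about 74k cases. But since that involved a lot of manual work, I am fairly sure the Excel result contains errors. R lets me load all the data at once, which should help me get a more reliable result.
Any help would be much appreciated. Thanks.
Edited to provide more information: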
> head(df1)
ID steamid friendid daysknown years el1 el2
1 NA 7.65612e+16 7.65612e+16 2156 5.902806 7.65612e+16 7.65612e+16
2 NA 7.65612e+16 7.65612e+16 3480 9.527721 7.65612e+16 7.65612e+16
3 NA 7.65612e+16 7.65612e+16 1588 4.347707 7.65612e+16 7.65612e+16
4 NA 7.65612e+16 7.65612e+16 501 1.371663 7.65612e+16 7.65612e+16
5 NA 7.65612e+16 7.65612e+16 858 2.349076 7.65612e+16 7.65612e+16
6 NA 7.65612e+16 7.65612e+16 686 1.878166 7.65612e+16 7.65612e+16
> head(df2)
V1
1 76561197960265800
2 76561197960266000
3 76561197960266100
4 76561197960267800
5 76561197960268100
6 76561197960268400
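Both the steamid and the friendid column in df1 need to have their ID present in df2$V1. If only one ID of the pair is present, the row has to be deleted, and likewise if neither is present. The final df will contain only pairs in which both IDs can be found in df2.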
You can do it like this:
df2$flag <- 1 #create a lookup column
#left join on friendid; rows without a match get flag = NA
df_temp <- merge(df1, df2, by.x = "friendid", by.y = "V1", all.x = TRUE)
names(df_temp)[names(df_temp) == "flag"] <- "flag_1"
#left join on tieid
df_new <- merge(df_temp, df2, by.x = "tieid", by.y = "V1", all.x = TRUE)
names(df_new)[names(df_new) == "flag"] <- "flag_2"
#keep rows where at least one ID matched (use & instead of | to require both)
df_final <- subset(df_new, flag_1 == 1 | flag_2 == 1)
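First, you check which friendid values match between df1 and df2. Then you check which tieid values match between the new data frame and df2. Finally, you subset the newly created data frame to keep only the rows where at least one of the flags is 1.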
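To make the flag logic concrete, here is a minimal sketch on made-up toy data. The column names friendid, tieid and V1 follow the answer above; the values are invented for illustration. It assumes the IDs in df2 are unique, since duplicated keys would multiply rows in merge().

```r
# Toy data (invented values); column names follow the answer above
df1 <- data.frame(friendid = c("a", "b", "c", "d"),
                  tieid    = c("f", "g", "h", "a"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("a", "g", "d"), stringsAsFactors = FALSE)
df2$flag <- 1

# Left joins: rows without a match get NA in the flag column
df_temp <- merge(df1, df2, by.x = "friendid", by.y = "V1", all.x = TRUE)
names(df_temp)[names(df_temp) == "flag"] <- "flag_1"
df_new <- merge(df_temp, df2, by.x = "tieid", by.y = "V1", all.x = TRUE)
names(df_new)[names(df_new) == "flag"] <- "flag_2"

# is.na() turns the NA flags into plain TRUE/FALSE before combining them
either <- df_new[!is.na(df_new$flag_1) | !is.na(df_new$flag_2), ]
both   <- df_new[!is.na(df_new$flag_1) & !is.na(df_new$flag_2), ]
```

Here either keeps every tie with at least one known ID, while both keeps only the ties whose two IDs are both listed in df2.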
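Hello Juan Arroyo Flores, welcome to Stack Overflow.
I am not sure I understood you correctly, but I think you can solve this with the %in% operator.
df$variable1 %in% df2$variable checks, for each element of df$variable1, whether it occurs in df2$variable.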
df1 = data.frame("name1" = c("a", "b", "c", "d"), "name2" = c("f", "g", "h", "i"), stringsAsFactors = F)
df2 = data.frame("names" = c("a", "g", "i"), stringsAsFactors = F)
df1
df2
# name1 name2
# 1 a f
# 2 b g
# 3 c h
# 4 d i
# > df2
# names
# 1 a
# 2 g
# 3 i
# so we want row 1 selected (because of a), row 2 (because of g) and row 4 (because of i)
# row 3 gets deleted
# let's use %in%
df1$name1 %in% df2$names
# > df1$name1 %in% df2$names
# [1] TRUE FALSE FALSE FALSE
df1$name2 %in% df2$names
# > df1$name2 %in% df2$names
# [1] FALSE TRUE FALSE TRUE
# to combine both, an OR is needed
df1$name1 %in% df2$names | df1$name2 %in% df2$names
# > df1$name1 %in% df2$names | df1$name2 %in% df2$names
# [1] TRUE TRUE FALSE TRUE
# with which() you can get the row indices
select_index = which(df1$name1 %in% df2$names | df1$name2 %in% df2$names)
select_index
# > select_index
# [1] 1 2 4
# now this can be used to select the desired rows
df1[select_index,]
# > df1[select_index,]
# name1 name2
# 1 a f
# 2 b g
# 4 d i
# you could as well just use
df1[df1$name1 %in% df2$names | df1$name2 %in% df2$names,]
# > df1[df1$name1 %in% df2$names | df1$name2 %in% df2$names,]
# name1 name2
# 1 a f
# 2 b g
# 4 d i
# or with dplyr
library(dplyr)
filter(df1, name1 %in% df2$names | name2 %in% df2$names)
# > filter(df1, name1 %in% df2$names | name2 %in% df2$names)
# name1 name2
# 1 a f
# 2 b g
# 3 d i
Not sure if this is what you are looking for?
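If both IDs of a pair have to be present, as the edit to the question suggests, the same pattern should work with & in place of |. A minimal sketch, assuming toy data like the example above; the last row is changed to g/i (an invented pair) so that one row actually passes the stricter test:

```r
# Toy data as above, except that row 4 is g/i so both of its names are in df2
df1 <- data.frame(name1 = c("a", "b", "c", "g"),
                  name2 = c("f", "g", "h", "i"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(names = c("a", "g", "i"), stringsAsFactors = FALSE)

# TRUE only where name1 AND name2 are both found in df2$names
keep <- df1$name1 %in% df2$names & df1$name2 %in% df2$names
df_both <- df1[keep, ]

# number of rows that survive the filter
sum(keep)
```

Checking sum(keep) before subsetting is a cheap way to see how many ties survive the filter.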
This is what I ended up with, though I am not sure it is correct. With the help of the code from SmitM and TinglTanglBob, I came up with this:
#This looks for the id number on the steamid column and returns a new variable tf1 with a logical value of T or F. The same goes for the friendid column returning results to tf2
df1$tf1<-df1$steamid %in% df2$V1
df1$tf2<-df1$friendid %in% df2$V1
#Then I do two subsets: first a subset of df1 where tf1 and tf2 are both TRUE, and then a second subset out of that one where tf2 is TRUE
df3<-subset(df1, tf1 & tf2)
df4<-subset(df3, tf2) #redundant: df3 already requires tf2, so df4 equals df3
Sadly, I end up with far less data than I expected. At least if I did it right.
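One possible reason for unexpected counts, judging only from the head() output in the question: the IDs print as 7.65612e+16, which suggests they are stored as doubles. A double represents integers exactly only up to 2^53 (about 9.0e15), so distinct 17-digit IDs can collapse to the same value, and %in% may then report matches that are not real. The sketch below uses two made-up IDs to show the effect and reads a small inline CSV with colClasses = "character" so the IDs stay text. Whether this applies to the actual files is an assumption.

```r
# Two made-up 17-digit IDs that differ only in the last digit
id_a <- "76561197960265793"
id_b <- "76561197960265794"

same_as_text   <- id_a == id_b                          # compared as text
same_as_double <- as.numeric(id_a) == as.numeric(id_b)  # compared as doubles

# Reading IDs as character keeps every digit
# (the inline CSV is a stand-in for the real edge-list file)
csv_text <- "steamid,friendid\n76561197960265793,76561197960265794\n"
edges <- read.csv(text = csv_text, colClasses = "character")
attrs <- data.frame(V1 = id_a, stringsAsFactors = FALSE)

# Only steamid is listed in attrs, so the pair should not count as complete
both_present <- edges$steamid %in% attrs$V1 & edges$friendid %in% attrs$V1
```

As text the two IDs differ, but once converted to doubles they compare equal, so a numeric comparison would treat the pair as fully matched even though only one ID is listed.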