Check whether values in df1 exist in df2, and delete the row if they do not, in R

Question. Votes: -1. Answers: 3

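Currently I have two data frames. The first, df1, contains two columns representing network ties. The other, df2, contains a single column listing all the cases for which I have attribute data.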

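I want to look up the cases from df2 in df1, and if the IDs in one or both columns of a df1 row are not among them, I want to delete that row. I would then end up with a df1 that contains only network ties between cases for which I have attribute data.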

df1 has about 2.4 million ties (an edge list) and df2 has 34k cases.

This is what I tried after googling for a while:

First, I make a copy of the two columns in df1 as a backup.

library(dplyr)
df3 <- df1 %>%
     mutate(friendid = friendid %in% df2$V1,
            friendid = friendid * 1.0) # converts boolean to numeric
df3 <- df3 %>%
     mutate(tieid = tieid %in% df2$V1,
            tieid = tieid * 1.0)
# What I think is going on here: if the ID is not found in df2$V1 the value becomes 0, and it becomes 1 if it is present. I do this for the two original columns from df1.

#Then I attempt to delete the rows by searching for 0 values on each column (2 and 3, which contain the edgelist).

df3<-df3[apply(df3[2],1,function(z) !any(z==0)),] 

df3<-df3[apply(df3[3],1,function(z) !any(z==0)),]

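This process deletes a bunch of rows, but I end up with only about 2k cases. That does not seem right. I tried a similar approach in Excel, but Excel limits the number of rows it can load at once. After splitting the dataset into three separate files and using Kutools, I ended up with about 74k cases. Since that involved a lot of manual work, I am fairly sure the Excel result contains errors. R lets me load all the data at once, which should help produce a more reliable result.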

Any help would be greatly appreciated. Thanks.

Edited to provide more information:

> head(df1)
    ID     steamid    friendid daysknown    years         el1         el2
1    NA 7.65612e+16 7.65612e+16      2156 5.902806 7.65612e+16 7.65612e+16
2    NA 7.65612e+16 7.65612e+16      3480 9.527721 7.65612e+16 7.65612e+16
3    NA 7.65612e+16 7.65612e+16      1588 4.347707 7.65612e+16 7.65612e+16
4    NA 7.65612e+16 7.65612e+16       501 1.371663 7.65612e+16 7.65612e+16
5    NA 7.65612e+16 7.65612e+16       858 2.349076 7.65612e+16 7.65612e+16
6    NA 7.65612e+16 7.65612e+16       686 1.878166 7.65612e+16 7.65612e+16
> head(df2)
                 V1
1 76561197960265800
2 76561197960266000
3 76561197960266100
4 76561197960267800
5 76561197960268100
6 76561197960268400
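
The head(df1) output above prints the IDs as 7.65612e+16, which suggests they are stored as doubles. A double carries 53 bits of precision, so at this magnitude distinct 17-digit integers can collapse into the same value, and a lookup on them may then match or miss unexpectedly. Whether this actually affects the data in this post is an assumption; the two IDs below are made up for illustration. A minimal sketch:

```r
# Two hypothetical 17-digit IDs that differ only in their last digits.
id_a <- "76561197960265728"
id_b <- "76561197960265730"

# Compared as character strings, the IDs are distinct.
same_as_text <- id_a == id_b

# Converted to doubles, both land on the same representable value,
# because doubles of this size are spaced 16 apart.
same_as_number <- as.numeric(id_a) == as.numeric(id_b)

print(c(text = same_as_text, number = same_as_number))
```

If that turns out to be the cause, reading the ID columns as character when importing the files would be one way to sidestep it.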

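Both the steamid and friendid columns in df1 need to have their IDs present in df2$V1. If only one ID of the pair is present, the row must be deleted, and likewise if neither is present. The final df will contain only pairs of IDs that can both be found in df2.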
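
The requirement stated above (keep a row only when both IDs are in df2$V1) can be sketched in base R as follows. The toy values are made up, and the IDs are stored as character strings purely for illustration; only the column names steamid, friendid and V1 come from the post.

```r
# Toy edge list and attribute list; all values here are made up.
df1 <- data.frame(
  steamid  = c("a", "b", "c", "d"),
  friendid = c("b", "x", "a", "y"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(V1 = c("a", "b", "c"), stringsAsFactors = FALSE)

# TRUE only when both ends of the tie appear in df2$V1.
both_present <- df1$steamid %in% df2$V1 & df1$friendid %in% df2$V1

# Keep the qualifying rows; the ID columns themselves are left unchanged.
df_kept <- df1[both_present, ]
print(df_kept)
```

Unlike overwriting the ID columns with 0/1 flags, this keeps the original IDs in the result, so the filtered edge list can be used directly.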

r dataframe
3 Answers
0
votes

You can do it like this:

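df2$flag <- 1 # create a lookup column

# assumes df1 holds only the columns friendid and tieid;
# merge() puts the key column first, then the rest of df1, then flag
df_temp <- merge(df1, df2, by.x = "friendid", by.y = "V1", all.x = TRUE)
names(df_temp) <- c("friendid", "tieid", "flag_1")
df_new <- merge(df_temp, df2, by.x = "tieid", by.y = "V1", all.x = TRUE)
names(df_new) <- c("tieid", "friendid", "flag_1", "flag_2")

# keeps rows where at least one ID was found; use & instead of | to require both
df_final <- subset(df_new, flag_1 == 1 | flag_2 == 1)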

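First you check which friendid values match between df1 and df2. Then you check which tieid values match between the new data frame and df2. Finally, you subset the newly created data frame to keep only the rows where at least one of the two flags is 1.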
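
A variant of the same merge idea, as a sketch only: two inner merges (without all.x = TRUE) drop non-matching rows directly, so no flag columns are needed, and the result keeps only ties where both IDs are found. The toy values and the helper name lookup are made up for illustration.

```r
# Toy data; all values are made up. df2 deliberately contains a duplicate.
df1 <- data.frame(
  friendid = c("a", "b", "c", "d"),
  tieid    = c("b", "x", "a", "y"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(V1 = c("a", "b", "c", "c"), stringsAsFactors = FALSE)

# De-duplicate the lookup so that a merge cannot multiply rows of df1.
lookup <- unique(df2["V1"])

# Inner merge on friendid drops rows whose friendid is not in the lookup.
step1 <- merge(df1, lookup, by.x = "friendid", by.y = "V1")

# Inner merge on tieid then drops rows whose tieid is not in the lookup.
df_both <- merge(step1, lookup, by.x = "tieid", by.y = "V1")
print(df_both)
```

Note that merge() sorts the result by the key column and moves that column to the front, so row and column order differ from df1.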


0
votes

Hello Juan Arroyo Flores, welcome to Stack Overflow.

I'm not sure I understood you correctly, but I think you can use the %in% operator to solve this.

df$variable1 %in% df2$variable checks, for each element of df$variable1, whether it is present in df2$variable.

    df1 = data.frame("name1" = c("a", "b", "c", "d"), "name2" = c("f", "g", "h", "i"), stringsAsFactors = F)
    df2 = data.frame("names" = c("a", "g", "i"), stringsAsFactors = F)

    df1
    df2


    # name1 name2
    # 1     a     f
    # 2     b     g
    # 3     c     h
    # 4     d     i
    # > df2
    # names
    # 1     a
    # 2     g
    # 3     i

    # so we want row 1 selected (because of a), row 2 (because of g) and row 4 (because of i)
    # row 3 gets deleted

    # let's use %in%

    df1$name1 %in% df2$names

    # > df1$name1 %in% df2$names
    # [1]  TRUE FALSE FALSE FALSE

    df1$name2 %in% df2$names

    # > df1$name2 %in% df2$names
    # [1] FALSE  TRUE FALSE  TRUE

    # to combine both, an OR is needed

    df1$name1 %in% df2$names | df1$name2 %in% df2$names

    # > df1$name1 %in% df2$names | df1$name2 %in% df2$names
    # [1]  TRUE  TRUE FALSE  TRUE

    # with which() you can get the indices of the matching rows
    select_index = which(df1$name1 %in% df2$names | df1$name2 %in% df2$names)
    select_index

    # > select_index
    # [1] 1 2 4

    # now this can be used to select the desired rows
    df1[select_index,]

    # > df1[select_index,]
    # name1 name2
    # 1     a     f
    # 2     b     g
    # 4     d     i

    # you could as well just use 
    df1[df1$name1 %in% df2$names | df1$name2 %in% df2$names,]

    # > df1[df1$name1 %in% df2$names | df1$name2 %in% df2$names,]
    # name1 name2
    # 1     a     f
    # 2     b     g
    # 4     d     i

or with dplyr

library(dplyr)
filter(df1, name1 %in% df2$names | name2 %in% df2$names)

# > filter(df1, name1 %in% df2$names | name2 %in% df2$names)
# name1 name2
# 1     a     f
# 2     b     g
# 3     d     i

Not sure whether this is what you wanted?


0
votes

This is what I ended up with, though I'm not sure it is correct. With the help of SmitM's and TinglTanglBob's code, I came up with this:

# This looks up each ID in the steamid column and stores the result in a new logical variable tf1 (TRUE or FALSE). The same goes for the friendid column, with results stored in tf2.
df1$tf1 <- df1$steamid %in% df2$V1
df1$tf2 <- df1$friendid %in% df2$V1

# Then I do two subsets: first a subset of df1 where tf1 is TRUE, and then a second subset of that one where tf2 is TRUE
df3 <- subset(df1, tf1)
df4 <- subset(df3, tf2)

Sadly, I have far less data than I thought. At least if I did this right.
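
One way to check whether the small result is plausible is to count how many rows pass each test before subsetting. The sketch below uses made-up toy values and illustrative variable names (match_counts, n_kept, share_found); only steamid, friendid and V1 come from the post.

```r
# Toy data; all values are made up.
df1 <- data.frame(
  steamid  = c("a", "b", "c", "d"),
  friendid = c("b", "x", "a", "y"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(V1 = c("a", "b", "c"), stringsAsFactors = FALSE)

tf1 <- df1$steamid %in% df2$V1
tf2 <- df1$friendid %in% df2$V1

# Cross-tabulate the two tests to see where rows are being lost.
match_counts <- table(steamid_found = tf1, friendid_found = tf2)
print(match_counts)

# Number of rows that survive when both IDs must be present.
n_kept <- sum(tf1 & tf2)

# Share of the distinct IDs in the edge list that appear in df2 at all.
all_ids <- unique(c(df1$steamid, df1$friendid))
share_found <- mean(all_ids %in% df2$V1)
print(c(n_kept = n_kept, share_found = share_found))
```

If most rows fail on only one of the two columns, a small result may simply reflect the data; if almost nothing matches in either column, the ID values themselves (type, formatting, rounding) are worth inspecting.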
