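I have two data frames that both have a Last_Name column. The first data frame has a column Contains_First_Name, and the second has a column called First_Name. I'd like to join the two on an exact match of Last_Name and a substring match between Contains_First_Name and First_Name (where First_Name is a substring of Contains_First_Name). Please see the example below.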
library(dplyr)
library(stringr)
# Create df1
Last_Name <- c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee")
Contains_First_Name <- c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick", "McAdams Jennifer Marie", "Kirk", "Kirk B")
Account_Number <- c("123", "345", "678", "901", "234", "567", "890")
df1 <- data.frame(Last_Name, Contains_First_Name, Account_Number)
# Create df2
Last_Name <- c("Smith", "Jones", "Adams", "Lee", "Lee")
First_Name <- c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")
df2 <- data.frame(Last_Name, First_Name)
The resulting data frames:
> df1
Last_Name Contains_First_Name Account_Number
1 Smith Kimberly Nicole 123
2 Jones Patrick L 345
3 Adams Johnson Ann 678
4 Rogers Rick 901
5 Lee McAdams Jennifer Marie 234
6 Lee Kirk 567
7 Lee Kirk B 890
> df2
Last_Name First_Name
1 Smith Kimberly
2 Jones Patrick
3 Adams Ann
4 Lee Jennifer
5 Lee Kirk
The final result I want is:
> df3
Last_Name Contains_First_Name Account_Number First_Name
1 Smith Kimberly Nicole 123 Kimberly
2 Jones Patrick L 345 Patrick
3 Adams Johnson Ann 678 Ann
4 Lee McAdams Jennifer Marie 234 Jennifer
5 Lee Kirk 567 Kirk
6 Lee Kirk B 890 Kirk
I tried this:
df3 <-
filter(df1,
Last_Name %in% df2$Last_Name,
str_detect(Contains_First_Name, paste(df2$First_Name, collapse = "|")))
which produces the following error:
Error in match.arg(method) : 'arg' must be NULL or a character vector
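I also explored the fuzzyjoin library, but couldn't figure out how to join on two variables with two different join types (exact and substring). I saw a similar question, but it doesn't seem to have an answer: Merge two data frames based on an exact match in one column and an inexact match in another column in R.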
Any input is greatly appreciated. Thank you.
I'd say you have two options: either do an equi-join on the first column only and filter afterwards, or use fuzzyjoin, as you described:
# Approach 1: Match all, filter later
inner_join(df1, df2, join_by(Last_Name), relationship = "many-to-many") |>
filter(str_detect(Contains_First_Name, First_Name))
#> # A tibble: 6 × 4
#> Last_Name Contains_First_Name Account_Number First_Name
#> <chr> <chr> <chr> <chr>
#> 1 Smith Kimberly Nicole 123 Kimberly
#> 2 Jones Patrick L 345 Patrick
#> 3 Adams Johnson Ann 678 Ann
#> 4 Lee McAdams Jennifer Marie 234 Jennifer
#> 5 Lee Kirk 567 Kirk
#> 6 Lee Kirk B 890 Kirk
# Approach 2: fuzzyjoin
fuzzyjoin::fuzzy_inner_join(
df1,
df2,
by = c("Last_Name" = "Last_Name", "Contains_First_Name" = "First_Name"),
match_fun = list(`==`, \(x, y) str_detect(x, y))
) |>
select(!Last_Name.y) |>
rename(Last_Name = Last_Name.x)
#> # A tibble: 6 × 4
#> Last_Name Contains_First_Name Account_Number First_Name
#> <chr> <chr> <chr> <chr>
#> 1 Smith Kimberly Nicole 123 Kimberly
#> 2 Jones Patrick L 345 Patrick
#> 3 Adams Johnson Ann 678 Ann
#> 4 Lee McAdams Jennifer Marie 234 Jennifer
#> 5 Lee Kirk 567 Kirk
#> 6 Lee Kirk B 890 Kirk
Created on 2024-01-06 with reprex v2.0.2
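One caveat with both approaches: str_detect() interprets its pattern as a regular expression by default, so a First_Name containing characters such as "." or "(" may match more than intended or raise an error. If your names should be compared literally, wrapping the pattern in stringr's fixed() is one way to do that. The sketch below uses small made-up data frames (the "A." and "Ann Marie" rows are hypothetical and not from the question) to show the difference for the first approach:

```r
library(dplyr)
library(stringr)

# Hypothetical data: "A." is a first name containing a regex metacharacter
df1 <- data.frame(
  Last_Name = c("Smith", "Lee", "Lee"),
  Contains_First_Name = c("Kimberly Nicole", "Ann Marie", "A. Marie"),
  Account_Number = c("123", "234", "567")
)
df2 <- data.frame(
  Last_Name = c("Smith", "Lee"),
  First_Name = c("Kimberly", "A.")
)

# Exact join on Last_Name first; the substring filter comes afterwards
joined <- inner_join(df1, df2, join_by(Last_Name), relationship = "many-to-many")

# Regex matching: "A." means "A followed by any character", so "Ann Marie" is kept too
regex_res <- filter(joined, str_detect(Contains_First_Name, First_Name))

# Literal matching: only strings that really contain "A." are kept
fixed_res <- filter(joined, str_detect(Contains_First_Name, fixed(First_Name)))
```

Here regex_res keeps all three rows, while fixed_res drops the "Ann Marie" row. The same fixed() wrapper should work inside the match_fun lambda of the fuzzyjoin approach, i.e. \(x, y) str_detect(x, fixed(y)).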