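How can I use dplyr to line up the individual entries within a column? I was hoping I could group by email and then sort each individual column A-Z, but I can't figure out how to do this without sorting the entire data frame. Thanks very much in advance!
Sample data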
df <- data.frame(
cleanname = c("Steven Smith", "Rob Tan", 'Zachary', "Matthew"),
dirtyname = c('rob Tan', 'stevesmith','zach', "Matthew"),
email = c('[email protected]', '[email protected]', '[email protected]', '[email protected]')
)
Desired end result
desireddf <- data.frame(
cleanname = c("Rob Tan", "Steven Smith", "Zachary", "Matthew"),
dirtyname = c('rob Tan', 'stevesmith','zach', 'Matthew'),
email = c('[email protected]', '[email protected]', '[email protected]', '[email protected]')
)
Edit
Thanks to Sotos for pointing out that my problem can be solved with fuzzy name matching.
You can use the amatch function from the stringdist package:
library(dplyr)
library(stringdist)
df %>%
  # compute the match index once, before dirtyname is overwritten
  mutate(idx = amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3),
         dirtyname = dirtyname[idx],
         email = email[idx]) %>%
  select(-idx)
This gives:
     cleanname  dirtyname           email
1 Steven Smith stevesmith [email protected]
2      Rob Tan    rob Tan [email protected]
3      Zachary       zach [email protected]
4      Matthew    Matthew [email protected]
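To see what the matching step does on its own, here is a minimal sketch that runs amatch on the two name vectors from the sample data, outside of any data frame. It assumes the default "osa" distance method; the variable name idx is only illustrative.

```r
library(stringdist)

cleanname <- c("Steven Smith", "Rob Tan", "Zachary", "Matthew")
dirtyname <- c("rob Tan", "stevesmith", "zach", "Matthew")

# For each clean name, the position of the closest dirty name
# that is at most maxDist edits away (case is ignored via tolower)
idx <- amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)
idx
# 2 1 3 4

# Indexing with idx puts the dirty names in clean-name order
dirtyname[idx]
# "stevesmith" "rob Tan" "zach" "Matthew"
```

Any other column that belongs to dirtyname (such as email) can be reordered with the same idx, which keeps each dirty name paired with its own email.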
The same logic with data.table:
library(data.table)
setDT(df)[, `:=` (dirtyname = dirtyname[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)],
email = email[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)])]
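One thing to watch for with either version: when no candidate lies within maxDist, amatch returns NA for that element, so the reordered columns get NA in that row. A small sketch with made-up names (clean and dirty are hypothetical vectors, not part of the question's data):

```r
library(stringdist)

# Hypothetical names: "alexander" is 5 deletions away from "alex"
clean <- c("Zachary", "Alexander")
dirty <- c("zach", "alex")

idx <- amatch(tolower(clean), tolower(dirty), maxDist = 3)
idx
# 1 NA

# Rows without a close enough match end up as NA
dirty[idx]
# "zach" NA
```

It may be worth checking anyNA(idx) before overwriting columns, and raising maxDist or cleaning the names further if unmatched rows appear.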
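If the rows of a data frame represent distinct observations, it is not appropriate to sort each column independently, because sorting the vectors independently means a row no longer represents a single observation.
A vector can be sorted in several ways, for example with the order() function.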
dirtyname <- dirtyname[order(dirtyname)]
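As a rough illustration of the difference, the sketch below uses a small made-up data frame (obs, with hypothetical name and score columns) and compares reordering whole rows by one column with sorting each column separately.

```r
# Hypothetical data: each row is one observation
obs <- data.frame(
  name  = c("zach", "matthew", "rob"),
  score = c(10, 30, 20),
  stringsAsFactors = FALSE
)

# Reordering whole rows keeps each name with its own score
by_row <- obs[order(obs$name), ]
by_row$score
# 30 20 10

# Sorting each column separately breaks the pairing
scrambled <- data.frame(name = sort(obs$name), score = sort(obs$score))
scrambled$score
# 10 20 30
```

In by_row, "matthew" still has a score of 30; in scrambled, "matthew" is paired with 10, which was never observed.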