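I have a column of names; some include a middle name or a middle initial. I want to remove these middle names/initials from the `fullname` column and create a new column next to it to store them.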
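According to my research, this post offers some solutions for removing middle names. How can I move the middle names/initials to a new column instead?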
Below is a 41-row sample of my data, produced with `dput()`. The full data has 43 million rows and is about 3.8 GB.
data<- structure(list(id = c(439116595, 439317458, 439373574, 439434694,
439508848, 439632143, 439778306, 439917155, 440009485, 440147556,
440207880, 441479247, 442115059, 441569787, 438192228, 438215998,
438307365, 438317476, 438389110, 438409963, 438479736, 438509859,
438634859, 438662407, 438764944, 438846700, 438884094, 438954147,
439227370, 439243020, 439248564, 439272667, 439357884, 439403127,
439446363, 439511276, 439546441, 439586141, 439804213, 439862550,
439889286), fullname = c("Shawn Chase", "Steven Hofer", "Stephen Paradise",
"Shengho Yang", "Nelson Carvalho", "RICK GILLICK", "Marie Jhoanne Morilla",
"Sanjay Kulkarni", "Sam Bunn", "Iran Murphy", "Kathryn Cutler",
"Diane Sik", "Donna Yee", "Christine Coltrain", "Maher Dakkouri",
"Ray Perl", "Abid Khalil", "Ian Crombie", "Allen Carr", "Daniel Angeline",
"Jimmy Tan", "Thierry LAMBERT", "Diene Faye", "Greg Greene",
"Laura Holsopple", "Roberta Minkus", "Bridget Chenette", "Joshua Polite",
"John Liberty", "David Smith", "Igor Baratta", "Pierre Schmitz",
"alejandra salvanes", "Malcolm K Knight", "Xiaoyan Hu", "Joe Pawl",
"Bryan Armstrong", "Christina Spezio", "Robert Gibson", "Peter Head",
"Mike Russo"), degree = c("Bachelor", "Bachelor", "MBA",
"", "", "", "", "Master", "", "MBA", "", "", "Bachelor", "Doctor",
"Bachelor", "Master", "Master", "", "", "", "", "", "", "Bachelor",
"Master", "", "Master", "", "Bachelor", "Master", "", "", "",
"", "Bachelor", "Associate", "", "Bachelor", "", "", "Associate"
)), row.names = 20:60, class = "data.frame")
I tried the following on a small subset of the data:
str_match(data, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]
However, it produced a warning:
Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing
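The warning most likely arises because the whole data frame, rather than a single character column, is passed as the string argument. A minimal base R sketch of the distinction follows; the toy data frame `df` is a made-up stand-in for the real data.

```r
# Hypothetical toy data frame standing in for the real data
df <- data.frame(id = 1:2,
                 fullname = c("Malcolm K Knight", "Joe Pawl"),
                 stringsAsFactors = FALSE)

# A data frame is a list of columns, not an atomic vector, so a string
# function has to coerce it before it can match anything
df_is_atomic <- is.atomic(df)            # FALSE
col_is_atomic <- is.atomic(df$fullname)  # TRUE

# Passing the column itself gives the function a plain character vector
col_is_character <- is.character(df$fullname)
```

Selecting the column with `data$fullname` before matching is what avoids the coercion.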
Update (and solution):
With help from @Ben Bolker, I was able to adapt the approach above into the following, which runs very quickly!
names<- stringr::str_match(data$fullname, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]
names<-data.frame(no_mid_name = paste(names[,1], names[,3]),
mid_name = names[,2])
I then used `cbind()` to add the resulting columns to my original dataset.
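A small sketch of that last step, assuming the split result has one row per input row and is in the same order as the original data. The toy data frame `toy` and the base R `regexec()`/`regmatches()` calls are stand-ins chosen only so the example runs without extra packages; they are not the code used on the real data.

```r
# Toy stand-in for the real data (hypothetical values)
toy <- data.frame(id = 1:3,
                  fullname = c("Malcolm K Knight", "Joe Pawl",
                               "Marie Jhoanne Morilla"),
                  stringsAsFactors = FALSE)

# Same pattern as above, run through base R instead of stringr.
# Assumes every name matches; a non-matching name would need handling.
pat <- "^(\\S+)\\s*(.*?)\\s*(\\S+)$"
m <- regmatches(toy$fullname, regexec(pat, toy$fullname, perl = TRUE))
parts <- do.call(rbind, m)[, -1, drop = FALSE]  # drop the full-match column

split_names <- data.frame(no_mid_name = paste(parts[, 1], parts[, 3]),
                          mid_name = parts[, 2],
                          stringsAsFactors = FALSE)

# Option 1: bind the new columns next to the originals
result <- cbind(toy, split_names)

# Option 2: assign the new columns directly to the existing data frame
toy$no_mid_name <- split_names$no_mid_name
toy$mid_name <- split_names$mid_name
```

With 43 million rows it may be worth comparing the two options for memory use, since `cbind()` builds a new data frame.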
Perhaps you can try this:
transform(data,
no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))
)
which gives
          id              fullname    degree        no_mid_name mid_name
20 439116595           Shawn Chase  Bachelor        Shawn Chase
21 439317458          Steven Hofer  Bachelor       Steven Hofer
22 439373574      Stephen Paradise       MBA   Stephen Paradise
23 439434694          Shengho Yang                 Shengho Yang
24 439508848       Nelson Carvalho              Nelson Carvalho
25 439632143          RICK GILLICK                 RICK GILLICK
26 439778306 Marie Jhoanne Morilla                Marie Morilla  Jhoanne
27 439917155       Sanjay Kulkarni    Master    Sanjay Kulkarni
28 440009485              Sam Bunn                     Sam Bunn
29 440147556           Iran Murphy       MBA        Iran Murphy
30 440207880        Kathryn Cutler               Kathryn Cutler
31 441479247             Diane Sik                    Diane Sik
32 442115059             Donna Yee  Bachelor          Donna Yee
33 441569787    Christine Coltrain    Doctor Christine Coltrain
34 438192228        Maher Dakkouri  Bachelor     Maher Dakkouri
35 438215998              Ray Perl    Master           Ray Perl
36 438307365           Abid Khalil    Master        Abid Khalil
37 438317476           Ian Crombie                  Ian Crombie
38 438389110            Allen Carr                   Allen Carr
39 438409963       Daniel Angeline              Daniel Angeline
40 438479736             Jimmy Tan                    Jimmy Tan
41 438509859       Thierry LAMBERT              Thierry LAMBERT
42 438634859            Diene Faye                   Diene Faye
43 438662407           Greg Greene  Bachelor        Greg Greene
44 438764944       Laura Holsopple    Master    Laura Holsopple
45 438846700        Roberta Minkus               Roberta Minkus
46 438884094      Bridget Chenette    Master   Bridget Chenette
47 438954147         Joshua Polite                Joshua Polite
48 439227370          John Liberty  Bachelor       John Liberty
49 439243020           David Smith    Master        David Smith
50 439248564          Igor Baratta                 Igor Baratta
51 439272667        Pierre Schmitz               Pierre Schmitz
52 439357884    alejandra salvanes           alejandra salvanes
53 439403127      Malcolm K Knight               Malcolm Knight        K
54 439446363            Xiaoyan Hu  Bachelor         Xiaoyan Hu
55 439511276              Joe Pawl Associate           Joe Pawl
56 439546441       Bryan Armstrong              Bryan Armstrong
57 439586141      Christina Spezio  Bachelor   Christina Spezio
58 439804213         Robert Gibson                Robert Gibson
59 439862550            Peter Head                   Peter Head
60 439889286            Mike Russo Associate         Mike Russo
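One caveat that may be worth checking before running this on the full data: the patterns assume at least two words per name. The sketch below uses a few made-up names (not taken from the sample) and sets `perl = TRUE` so that the lazy quantifier follows PCRE semantics; treat it as a check to adapt, not as a statement about what the real data contains.

```r
# Made-up edge cases: a single word, four words, and a middle initial
x <- c("Cher", "Mary Ann Van Dyke", "Malcolm K Knight")

no_mid <- sub("^(\\w+).*?(\\w+)$", "\\1 \\2", x, perl = TRUE)
mid <- trimws(sub("^\\w+(.*?)\\w+$", "\\1", x, perl = TRUE))

# A single-word name is split in two ("Cher" becomes "Che r"), and
# everything between the first and last word lands in mid_name
data.frame(fullname = x, no_mid_name = no_mid, mid_name = mid)

# One way to flag rows that need a closer look: count the words
n_words <- lengths(strsplit(trimws(x), "\\s+"))
needs_review <- x[n_words < 2]
```

Rows flagged this way could be left unchanged or handled separately, depending on what the data calls for.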
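The OP's `str_match` solution is actually faster than @ThomasIsCoding's (probably because it avoids running two regex searches). The full problem is about a million times larger than the example given here (43 million vs. 41 rows). Since one iteration takes only about 200 µs, we would expect the whole job to finish in about 200 seconds, assuming the time scales linearly with the number of rows (and that I have done the arithmetic correctly). So something else may be going on ... one might need to (1) run some experiments to see how these solutions scale as the problem size increases, and (2) possibly be more careful about memory management and the like, e.g. by using data.table ... when looking for a solution.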
thomas <- function(data) with(data,
data.frame(no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))))
OP <- function(data) {
ss <- stringr::str_match(data$fullname, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]
data.frame(no_mid_name = paste(ss[,1], ss[,3]),
mid_name = ss[,2])
}
dd <- data  # the benchmark output below refers to the sample data as dd
bench::mark(thomas(dd), OP(dd)) |>
dplyr::select(expression, median, `itr/sec`)
##   expression   median `itr/sec`
##   <bch:expr> <bch:tm>     <dbl>
## 1 thomas(dd)    289µs     3351.
## 2 OP(dd)        187µs     5266.
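The back-of-the-envelope estimate and the scaling experiment suggested in point (1) could be sketched roughly as follows. The sizes, the three names reused from the sample, and the helper `split_base()` (the `sub()`-based approach wrapped in a hypothetical function) are illustrative assumptions; timings depend on the machine, so none are shown.

```r
# Linear extrapolation from the benchmark: 41 rows took about 187 µs
rows_full <- 43e6
rows_sample <- 41
est_seconds <- rows_full / rows_sample * 187e-6
est_seconds  # a bit under 200 seconds, if scaling really is linear

# Hypothetical helper wrapping the sub()-based approach
split_base <- function(x) {
  data.frame(no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", x),
             mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", x)))
}

# Time the helper on inputs of growing size
sample_names <- c("Shawn Chase", "Marie Jhoanne Morilla", "Malcolm K Knight")
sizes <- c(1e3, 1e4, 1e5)
elapsed <- sapply(sizes, function(n) {
  x <- rep_len(sample_names, n)
  system.time(split_base(x))[["elapsed"]]
})

# If seconds_per_row stays roughly constant, the scaling is close to linear
scaling <- data.frame(rows = sizes, seconds = elapsed,
                      seconds_per_row = elapsed / sizes)
scaling
```

If the per-row time grows noticeably with size, memory pressure or copying is a more likely culprit than the regex itself.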