Remove middle names and initials from a column and save them in a separate column in R


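I have a column of names; some include a middle name or a middle initial. I want to remove these middle names/initials from the fullname column and store them in a new column next to it.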

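From what I have found, this post offers some solutions for removing middle names. How can I move the middle names/initials into a new column instead?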

Below is a 41-row sample of my data, produced with dput(). The full dataset has 43 million rows and is about 3.8 GB.

data<- structure(list(id = c(439116595, 439317458, 439373574, 439434694, 
439508848, 439632143, 439778306, 439917155, 440009485, 440147556, 
440207880, 441479247, 442115059, 441569787, 438192228, 438215998, 
438307365, 438317476, 438389110, 438409963, 438479736, 438509859, 
438634859, 438662407, 438764944, 438846700, 438884094, 438954147, 
439227370, 439243020, 439248564, 439272667, 439357884, 439403127, 
439446363, 439511276, 439546441, 439586141, 439804213, 439862550, 
439889286), fullname = c("Shawn Chase", "Steven Hofer", "Stephen Paradise", 
"Shengho Yang", "Nelson Carvalho", "RICK GILLICK", "Marie Jhoanne Morilla", 
"Sanjay Kulkarni", "Sam Bunn", "Iran Murphy", "Kathryn Cutler", 
"Diane Sik", "Donna Yee", "Christine Coltrain", "Maher Dakkouri", 
"Ray Perl", "Abid Khalil", "Ian Crombie", "Allen Carr", "Daniel Angeline", 
"Jimmy Tan", "Thierry LAMBERT", "Diene Faye", "Greg Greene", 
"Laura Holsopple", "Roberta Minkus", "Bridget Chenette", "Joshua Polite", 
"John Liberty", "David Smith", "Igor Baratta", "Pierre Schmitz", 
"alejandra salvanes", "Malcolm K Knight", "Xiaoyan Hu", "Joe Pawl", 
"Bryan Armstrong", "Christina Spezio", "Robert Gibson", "Peter Head", 
"Mike Russo"), degree = c("Bachelor", "Bachelor", "MBA", 
"", "", "", "", "Master", "", "MBA", "", "", "Bachelor", "Doctor", 
"Bachelor", "Master", "Master", "", "", "", "", "", "", "Bachelor", 
"Master", "", "Master", "", "Bachelor", "Master", "", "", "", 
"", "Bachelor", "Associate", "", "Bachelor", "", "", "Associate"
)), row.names = 20:60, class = "data.frame")

I tried the following on a small subset of the data, but it produced a warning:

str_match(data, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]

Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing

Update (and solution):

With help from @Ben Bolker I was able to change the approach above into the following, which runs very quickly!

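# Columns after dropping the full match: 1 = first word, 2 = middle part, 3 = last word
names <- stringr::str_match(data$fullname, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[, -1]
# Rebuild the name without the middle part and keep the middle part separately
names <- data.frame(no_mid_name = paste(names[, 1], names[, 3]),
                    mid_name = names[, 2])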

I then used cbind() to add the resulting columns to my original dataset.
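
The cbind() step itself is not shown above. A minimal sketch of it, on three names taken from the sample data, might look like the following. The object names small, parts, new_cols and result are made up for illustration, and the stringr package is assumed to be installed.

```r
# Three rows from the sample data (illustrative subset, not the full dataset).
small <- data.frame(
  id = c(439116595, 439778306, 439403127),
  fullname = c("Shawn Chase", "Marie Jhoanne Morilla", "Malcolm K Knight"),
  stringsAsFactors = FALSE
)

# One regex pass; after dropping the full match,
# column 1 = first word, column 2 = middle part, column 3 = last word.
parts <- stringr::str_match(small$fullname, "^(\\S+)\\s*(.*?)\\s*(\\S+)$")[, -1]

new_cols <- data.frame(
  no_mid_name = paste(parts[, 1], parts[, 3]),
  mid_name = parts[, 2],
  stringsAsFactors = FALSE
)

# Bind the new columns onto the original rows; row order is unchanged.
result <- cbind(small, new_cols)
print(result)
```

One caveat with this sketch: for a single-row input, [, -1] drops the matrix to a plain vector, so [, -1, drop = FALSE] would be needed there.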

r string data-cleaning
2 Answers
2
votes

Perhaps you can try this

transform(data,
  no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
  mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))
)

which gives

          id              fullname    degree        no_mid_name mid_name
20 439116595           Shawn Chase  Bachelor        Shawn Chase
21 439317458          Steven Hofer  Bachelor       Steven Hofer
22 439373574      Stephen Paradise       MBA   Stephen Paradise
23 439434694          Shengho Yang                 Shengho Yang
24 439508848       Nelson Carvalho              Nelson Carvalho
25 439632143          RICK GILLICK                 RICK GILLICK
26 439778306 Marie Jhoanne Morilla                Marie Morilla  Jhoanne
27 439917155       Sanjay Kulkarni    Master    Sanjay Kulkarni
28 440009485              Sam Bunn                     Sam Bunn
29 440147556           Iran Murphy       MBA        Iran Murphy
30 440207880        Kathryn Cutler               Kathryn Cutler
31 441479247             Diane Sik                    Diane Sik
32 442115059             Donna Yee  Bachelor          Donna Yee
33 441569787    Christine Coltrain    Doctor Christine Coltrain
34 438192228        Maher Dakkouri  Bachelor     Maher Dakkouri
35 438215998              Ray Perl    Master           Ray Perl
36 438307365           Abid Khalil    Master        Abid Khalil
37 438317476           Ian Crombie                  Ian Crombie
38 438389110            Allen Carr                   Allen Carr
39 438409963       Daniel Angeline              Daniel Angeline
40 438479736             Jimmy Tan                    Jimmy Tan
41 438509859       Thierry LAMBERT              Thierry LAMBERT
42 438634859            Diene Faye                   Diene Faye
43 438662407           Greg Greene  Bachelor        Greg Greene
44 438764944       Laura Holsopple    Master    Laura Holsopple
45 438846700        Roberta Minkus               Roberta Minkus
46 438884094      Bridget Chenette    Master   Bridget Chenette
47 438954147         Joshua Polite                Joshua Polite
48 439227370          John Liberty  Bachelor       John Liberty
49 439243020           David Smith    Master        David Smith
50 439248564          Igor Baratta                 Igor Baratta
51 439272667        Pierre Schmitz               Pierre Schmitz
52 439357884    alejandra salvanes           alejandra salvanes
53 439403127      Malcolm K Knight               Malcolm Knight        K
54 439446363            Xiaoyan Hu  Bachelor         Xiaoyan Hu
55 439511276              Joe Pawl Associate           Joe Pawl
56 439546441       Bryan Armstrong              Bryan Armstrong
57 439586141      Christina Spezio  Bachelor   Christina Spezio
58 439804213         Robert Gibson                Robert Gibson
59 439862550            Peter Head                   Peter Head
60 439889286            Mike Russo Associate         Mike Russo

1
vote

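The OP's str_match solution is actually faster than @ThomasIsCoding's (probably because it avoids doing two regex searches). The full problem is roughly a million times larger than the example given here (43 million vs. 41 rows). Since one iteration takes only about 200 µs, we would expect the whole job to finish in roughly 200 seconds, assuming the time scales linearly with the number of rows (and that I have done the arithmetic correctly). So something else may be going on... One might need to (1) run some experiments to see how these solutions scale as the problem size increases, and (2) possibly be more careful about memory management, e.g. by using data.table...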
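
As a rough illustration of point (2), the sketch below adds the two columns with data.table's := operator, which assigns by reference instead of building a copy of the table. This is only an assumption about what "more careful memory management" could look like, not something benchmarked here. It reuses the two sub() calls from the transform() answer, the object name dt is made up, and the data.table package is assumed to be installed.

```r
library(data.table)

# Three rows from the sample data. An existing data.frame could instead be
# converted in place with setDT(data).
dt <- data.table(
  id = c(439116595, 439778306, 439403127),
  fullname = c("Shawn Chase", "Marie Jhoanne Morilla", "Malcolm K Knight")
)

# `:=` adds both columns to dt by reference, without copying the table.
dt[, `:=`(
  no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
  mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))
)]

print(dt)
```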

thomas <- function(data) with(data,
      data.frame(no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
                 mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))))

OP <- function(data) {
    ss <- stringr::str_match(data$fullname, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]
    data.frame(no_mid_name = paste(ss[,1], ss[,3]),
               mid_name = ss[,2])
}

bench::mark(thomas(data), OP(data))  |>
 dplyr::select(expression, median, `itr/sec`)

##   expression     median `itr/sec`
##   <bch:expr>   <bch:tm>     <dbl>
## 1 thomas(data)    289µs     3351.
## 2 OP(data)        187µs     5266.
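
To check the linear-scaling assumption, one could time the same operation at a few increasing sizes and compare the per-row cost. The sketch below does this in base R with the sub()-based approach; the helper split_names, the chosen sizes, and the recycled four-name vector are all illustrative assumptions, and repeated names are unlikely to be representative of the real 43-million-row data. It also reproduces the back-of-envelope estimate from the text.

```r
# Hypothetical helper wrapping the two sub() calls from the transform() answer.
split_names <- function(x) {
  data.frame(
    no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", x),
    mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", x)),
    stringsAsFactors = FALSE
  )
}

base_names <- c("Shawn Chase", "Marie Jhoanne Morilla",
                "Malcolm K Knight", "Xiaoyan Hu")

# Time the split at increasing input sizes (names are recycled to length n).
sizes <- c(1e3, 1e4, 1e5)
elapsed <- sapply(sizes, function(n) {
  x <- rep_len(base_names, n)
  system.time(split_names(x))[["elapsed"]]
})

# If scaling is linear, sec_per_row should stay roughly constant across sizes.
timings <- data.frame(n = sizes, elapsed_sec = elapsed,
                      sec_per_row = elapsed / sizes)
print(timings)

# Back-of-envelope estimate from the text: 43 million rows vs. 41 rows,
# at about 200 microseconds per run on the 41-row sample.
scale_factor <- 43e6 / 41
predicted_sec <- scale_factor * 200e-6
print(predicted_sec)
```

If sec_per_row grows noticeably with n, the slowdown on the full data is more likely due to memory pressure than to the regex work itself.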