Using Regex in R to Get Twitter @Username

Question

How can I use a regular expression in R to extract Twitter usernames from a string of text?

I tried this:

library(stringr)

theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'

str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')

But I end up with @foobar, @foo and (@bar, which includes an unwanted parenthesis.

How can I get just @foobar, @foo and @bar as output?

Answer 1

Here is one approach that works in R:

theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = T)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)"

If you want to use @Jerry's answer in R:

regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = T)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)"

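However, both approaches keep the parentheses you don't want, because grep() returns the whole matching token rather than only the matched text.

Update: this returns the usernames without parentheses or any other punctuation (except underscores, since those are allowed in usernames):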

theString <- '@foobar Foobar! and @fo_o (@bar) but not [email protected]'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
regex2 <- "[^[:alnum:]@_]"             # remove all punctuation except _ and @
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = T)])
users

[1] "@foobar" "@fo_o"   "@bar"
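One caveat with stripping punctuation afterwards is that a token such as "@baz's" would collapse to "@bazs". A minimal sketch of an alternative, assuming the same `regex1` filter as above, pulls the username out with a capture group instead. The token vector and the e-mail address in it are made up for illustration.

```r
# Hypothetical tokens, as if produced by strsplit(text, " ")
tokens <- c("@foobar", "and", "@fo_o", "(@bar)", "@baz's", "someone@example.com")

# Same filter as in the answer: keep tokens that contain a standalone @username
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b"
hits <- grep(regex1, tokens, perl = TRUE, value = TRUE)

# Keep only "@" plus the captured username, dropping everything around it
users <- sub(".*@(\\w{1,15}).*", "@\\1", hits, perl = TRUE)
print(users)
```

Here `value = TRUE` makes grep() return the matching tokens directly, so the separate indexing step is not needed.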

Answer 2

Try using a negative lookbehind, so that the preceding character is checked but not consumed by the match:

(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)
      ^^^

Edit: since lookbehind doesn't seem to work in R (I read somewhere that lookbehinds do work in R, but apparently not...), try this:

@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)

Edit: double-escaped the dot.

Edit v3: try turning on PCRE:

str_extract_all(string=theString, perl("(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"))
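As a rough sketch of the same lookbehind idea without stringr, base R's gregexpr() accepts `perl = TRUE`, which enables PCRE and therefore lookbehind. The sample string below is an assumption: it mirrors the question's text, with a made-up e-mail address standing in for the part that should not match.

```r
# Sample text; the e-mail address is a made-up stand-in for illustration
the_string <- "@foobar Foobar! and @foo (@bar) but not someone@example.com"

# The pattern from this answer: the @ must sit at the start of the string
# or must not follow a username-like character
pattern <- "(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"

# gregexpr() locates every match; regmatches() extracts the matched text
users <- regmatches(the_string, gregexpr(pattern, the_string, perl = TRUE))[[1]]
print(users)
# [1] "@foobar" "@foo"    "@bar"
```

Because the lookbehind is zero-width, the opening parenthesis before @bar is checked but never becomes part of the match.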

Answer 3

@[a-zA-Z0-9_]{0,15}

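Where:

@ matches the character @ literally (case sensitive).
[a-zA-Z0-9_] matches a single character from the list (letters, digits and underscore).
{0,15} is a quantifier: it matches between 0 and 15 times, as many times as possible, giving back as needed (greedy).

It works fine for picking Twitter usernames out of a mixed dataset.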
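Two things may be worth checking before relying on this pattern: `{0,15}` also accepts zero characters, so a bare @ matches, and nothing stops it from matching inside an e-mail address. The sketch below compares it with a stricter variant; the helper `extract_all`, the sample string and the stricter pattern are illustrative assumptions, not part of the answer.

```r
# Made-up sample text with an e-mail address and a bare @ sign
the_string <- "@foobar Foobar! and @foo (@bar) but not someone@example.com, or a bare @ sign"

# Hypothetical helper: return every match of `pattern` in `text`
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Pattern from the answer: also picks up "@example" and the bare "@"
loose <- extract_all("@[a-zA-Z0-9_]{0,15}", the_string)

# Stricter variant: at least one character, and no word character before the @
strict <- extract_all("(?<![a-zA-Z0-9_])@[a-zA-Z0-9_]{1,15}", the_string)

print(loose)
print(strict)
```

On this input the loose pattern returns five matches and the strict one returns only the three usernames.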
