Removing non-ASCII values and then lowercasing the text gives an error

Question

I have a large dataset that I am cleaning, and I found that one of the fields contains the value

"My son is turning into a monster \xf0\u009f\u0098\u0092"

I cannot even create this simple value by hand, because it gives the error shown below

a <- c('My son is turning into a monster \xf0\u009f\u0098\u0092')

Error: mixing Unicode and octal/hex escapes in a string is not allowed
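The parser rejects the literal because it combines a `\x` escape with `\u` escapes in one string. A minimal sketch of two ways to write such a value, assuming the field originally held the four UTF-8 bytes F0 9F 98 92 (which encode U+1F612); that assumption is not confirmed by the data itself:

```r
# Option 1: hex escapes only, then declare the bytes as UTF-8
a <- "My son is turning into a monster \xf0\x9f\x98\x92"
Encoding(a) <- "UTF-8"

# Option 2: a single Unicode escape for the same code point (assumed U+1F612)
a2 <- "My son is turning into a monster \U0001F612"

# Both strings should hold the same bytes
identical(charToRaw(a), charToRaw(a2))
```

Either form avoids mixing the two escape styles in a single literal.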

Now suppose my variable holds this value and I want to remove all non-ASCII characters

library(stringi)
b <- stri_trans_general(a, "latin-ascii")

Next I want to convert the text to lowercase

tolower(b)

I get the error shown below

Error in tolower(b) : invalid input 'My son is turning into a monster ðŸ~' in 'utf8towcs'
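The 'utf8towcs' part of the message suggests that tolower() received bytes it could not read as valid UTF-8. A minimal sketch for checking this, assuming R 3.3.0 or later (for `validUTF8()`); the example strings `bad` and `good` are hypothetical, not taken from the dataset:

```r
# A lone lead byte: the multibyte sequence is truncated
bad <- "monster \xf0"

# The complete four-byte sequence
good <- "monster \xf0\x9f\x98\x92"

validUTF8(bad)   # FALSE
validUTF8(good)  # TRUE
```

Strings that fail this check are candidates for cleaning before any case conversion.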

Can someone help me understand this problem?

Answer 1

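To remove all non-ASCII characters, you can use a regular expression. [\x00-\x7F] is the set of all ASCII characters, so its negation [^\x00-\x7F] matches every non-ASCII character, and we can replace each occurrence with the empty string. However, R does not like \x00, because it is the null character, so I had to change the range to [\x01-\x7F]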

a <- c('My son is turning into a monster \u009f\u0098\u0092')
#> [1] "My son is turning into a monster \u009f\u0098\u0092"
tolower(gsub('[^\x01-\x7F]+','',a))
#> [1] "my son is turning into a monster "

Or, with a hex escape

a <- c('My son is turning into a monster \xf0')
#> [1] "My son is turning into a monster ð"
tolower(gsub('[^\x01-\x7F]+','',a))
#> [1] "my son is turning into a monster "
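The pattern above can be wrapped in a small helper so it applies to a whole column at once. This is only a sketch: `to_ascii_lower` is a hypothetical name, and it assumes the input strings are valid UTF-8.

```r
# Hypothetical helper: drop every non-ASCII character, then lowercase
to_ascii_lower <- function(x) {
  tolower(gsub("[^\x01-\x7F]+", "", x))
}

# Example values (assumed, valid UTF-8)
x <- c("My son is turning into a monster \U0001F612", "Caf\u00e9 OK")
to_ascii_lower(x)
```

Because gsub() and tolower() are both vectorised, the helper works on a character vector of any length.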

Answer 2

You can use iconv to remove non-ASCII characters:

a <- c('My son is turning into a monster \xf0\x9f\x98\x92')
a
[1] "My son is turning into a monster ðŸ˜’"
iconv(a,to="ASCII",sub="")
[1] "My son is turning into a monster "
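Once the non-ASCII bytes are gone, lowercasing should no longer fail. A minimal sketch of the two steps together, assuming the input bytes are UTF-8 (the `from` argument is added here for explicitness and is not part of the call above):

```r
a <- "My son is turning into a monster \xf0\x9f\x98\x92"

# sub = "" drops every byte that cannot be represented in ASCII
b <- iconv(a, from = "UTF-8", to = "ASCII", sub = "")

# b is now plain ASCII, so tolower() has nothing invalid to trip over
tolower(b)
```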
