R: Remove all quoted values from a string

Question

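I am starting my first text analysis project in R using Twitter data, and in the preprocessing stage I am trying to remove all values that appear inside quotation marks. I have found code that removes the quotation marks themselves but not the values inside them (for example, “Hello World” becomes Hello World), but nothing that consistently removes both the values and the quotation marks (for example, This is a “quoted text” becomes This is a).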

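Here is an anonymized sample of the data frame I am working with (the exact formatting of these particular tweets is preserved; only the content has been changed):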


    df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                              "Text \"this is a quote.\" More text. https://t.co/"))

For this data frame, the goal is to end up with:

Example: https://t.co/ -  MORE TEXT - example: 

Text More text. https://t.co/

I have tried these:

df$text <- gsub('"[^"]+"', '', df$text)

df$text <- gsub('".*"', '', df$text)

df$text <- gsub("[\"'].*['\"]","", df$text)

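But I found that these only remove the quote from the second observation, not from the first. I suspect this might have something to do with how the second quotation was imported from Twitter and wrapped in \. I am not sure whether this hypothesis is correct, and if it is, I am not sure how to get around it. Any help would be greatly appreciated!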

Answer 1

If there are two levels of nested quotes, you can do this:

Base R

df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                          "Text \"this is a quote.\" More text. https://t.co/"))

df$text |>
  gsub('(“|")[^”"“]*(”|")', '', x = _) |>
  gsub('(“|")[^”"]*(”|")', '', x = _)
#> [1] "Example:  https://t.co/ -  MORE TEXT - example: "
#> [2] "Text  More text. https://t.co/"
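To make the two passes easier to follow, here is a minimal sketch that runs each `gsub()` call separately on a shortened version of the first tweet. It also shows why patterns built only around `"` never touch that tweet: its quotes are the curly characters U+201C and U+201D, not the ASCII `"`. The curly quotes are written as `\u201c` and `\u201d` escapes so the snippet does not depend on file encoding, and the names `x`, `pass1`, and `pass2` are illustrative only.

```r
# Shortened first tweet; \u201c / \u201d are the left / right curly double quotes.
x <- "example: \u201cmore text... \u201cquote inside a quote\u201d finished.\u201d"

# A pattern that only knows about the ASCII quote leaves the string unchanged.
ascii_only <- gsub('"[^"]+"', "", x)

# Pass 1: a quoted span containing no further quote character,
# which is the innermost quote here.
pass1 <- gsub('(\u201c|")[^\u201d"\u201c]*(\u201d|")', "", x)

# Pass 2: same idea, but an opening curly quote is allowed inside the span,
# so the outer quote left over from pass 1 is removed.
pass2 <- gsub('(\u201c|")[^\u201d"]*(\u201d|")', "", pass1)

print(ascii_only)  # identical to x
print(pass1)       # outer quote still present, inner quote gone
print(pass2)       # "example: "
```

With three or more nesting levels, this approach would presumably need one additional pass per level.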

Tidyverse

df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                          "Text \"this is a quote.\" More text. https://t.co/"))
df$text
#> [1] "Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”"
#> [2] "Text \"this is a quote.\" More text. https://t.co/"

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df %>% 
  mutate(text = str_remove_all(text, '(“|")[^”"“]*(”|")'),
         text = str_remove_all(text, '(“|")[^”"]*(”|")'))
#>                                               text
#> 1 Example:  https://t.co/ -  MORE TEXT - example: 
#> 2                   Text  More text. https://t.co/

Answer 2

Here is a solution that uses a single pattern:

library(tidyverse)
df %>%
  mutate(text = str_remove_all(text, '"[^"]+"|“[^“”]+”|“.+”'))
#>                                               text
#> 1 Example:  https://t.co/ -  MORE TEXT - example: 
#> 2                   Text  More text. https://t.co/

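The pattern uses three alternatives to handle the variability shown in `text`:

- `"[^"]+"`: first alternative: removes strings wrapped in `"`
- `“[^“”]+”`: second alternative: removes strings wrapped in `“` and `”` that contain no further curly quotes
- `“.+”`: third alternative: removes strings wrapped in `“` and `”`, matched greedily so that nested quotes are included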

If the actual data also contains nested `" "` quotes, this can be handled with another alternation.
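As a rough illustration of how the three alternatives divide the work, the sketch below uses base R with `perl = TRUE` instead of stringr to list the pieces that the pattern matches in the first tweet. It then shows one caveat of the greedy third alternative. The second string `y` is a made-up example, not taken from the question's data.

```r
# Same three alternatives, with the curly quotes written as \u escapes.
pat <- '"[^"]+"|\u201c[^\u201c\u201d]+\u201d|\u201c.+\u201d'

x <- "Example: \u201cThis is a quote!\u201d https://t.co/ -  MORE TEXT - example: \u201cmore text... \u201cquote inside a quote\u201d finished.\u201d"

# Extract every match: the simple quote is taken by the second alternative,
# the nested quote falls through to the greedy third alternative.
pieces <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
print(length(pieces))  # 2
print(pieces)

# Caveat (hypothetical input): with two separate nested quotes in one string,
# the greedy alternative runs to the last closing quote and also removes
# the unquoted word "keep" in between.
y <- "\u201cone \u201ctwo\u201d three\u201d keep \u201cseven \u201cnine\u201d ten\u201d"
swallowed <- gsub(pat, "", y, perl = TRUE)
print(swallowed)       # ""
```

If strings like `y` can occur in the real data, a pattern that tracks nesting explicitly is likely the safer choice.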

Answer 3

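You can use recursion with `?1` or `?R` to match balanced/nested structures of `“` and `”` (in base R this requires `perl = TRUE`).

`(“([^“”]|(?R))*”)` matches (nested) pairs of `“` and `”`, where `a(?R)?z` is a recursion that matches one or more letters `a` followed by exactly the same number of letters `z`.

For `"` it is hard to tell whether the quotes are nested or whether there are simply more quotes. `".*"` assumes they are nested but does not check that they come in pairs, `("([^"]|(?R))*")` matches nested quotes that come in pairs, and `"[^"]*"` assumes there is no nesting.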
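The recursion idea can be tried in isolation before applying it to quotes. The sketch below is a toy example: it uses `(?1)` (recurse into group 1) rather than `(?R)` so that the `^` and `$` anchors stay outside the recursed part. The test strings are made up for illustration.

```r
# Group 1 is "a, optionally group 1 again, then z": n letters a followed by n letters z.
balanced <- "^(a(?1)?z)$"
is_balanced <- grepl(balanced, c("az", "aazz", "aaazzz", "aaz", "azz"), perl = TRUE)
print(is_balanced)  # TRUE TRUE TRUE FALSE FALSE

# The same mechanism applied to curly quotes: (?R) re-enters the whole pattern
# whenever a new opening quote appears inside a quoted span.
nested <- "(\u201c([^\u201c\u201d]|(?R))*\u201d)"
s <- "keep \u201cone \u201ctwo\u201d three\u201d keep"
stripped <- gsub(nested, "", s, perl = TRUE)
print(stripped)     # "keep  keep"
```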

gsub('("([^"]|(?R))*")|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"

gsub('"[^"]*"|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"                  

gsub('".*"|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"
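All three calls leave a double space where a quote was removed, whereas the target output in the question has single spaces. One possible way to package the second pattern together with a whitespace cleanup is sketched below. The helper name `remove_quoted()` and its `squish` argument are hypothetical, and collapsing every run of whitespace (including the original double space before `MORE TEXT`) is an assumption that may not suit every dataset.

```r
# Hypothetical helper: remove quoted spans, then optionally tidy whitespace.
remove_quoted <- function(x, squish = TRUE) {
  # ASCII quotes are treated as not nested; curly quotes are matched recursively.
  pat <- '"[^"]*"|(\u201c([^\u201c\u201d]|(?R))*\u201d)'
  out <- gsub(pat, "", x, perl = TRUE)
  if (squish) {
    # Assumption: collapsing runs of whitespace and trimming the ends is acceptable.
    out <- trimws(gsub("[[:space:]]+", " ", out))
  }
  out
}

tweets <- c(
  "Example: \u201cThis is a quote!\u201d https://t.co/ -  MORE TEXT - example: \u201cmore text... \u201cquote inside a quote\u201d finished.\u201d",
  "Text \"this is a quote.\" More text. https://t.co/"
)

cleaned <- remove_quoted(tweets)
print(cleaned)
#[1] "Example: https://t.co/ - MORE TEXT - example:"
#[2] "Text More text. https://t.co/"
```

Calling `remove_quoted(tweets, squish = FALSE)` should reproduce the output of the `gsub()` calls above, double spaces included.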
