R: Alternative to read_html() + html_text() that also works for strings without HTML/XML tags

Question

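In this solution, to remove HTML tags from a string, the string is passed to rvest::read_html() to create an html_document object, which is then passed to rvest::html_text() to return the HTML-free text.

However, if the string contains no HTML tags, read_html() throws an error because the string is treated as a file/connection path, as shown below. This is a problem when stripping HTML from many strings, some of which may not contain any tags.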

library(rvest)

# Example data
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)


# Success: produces html_document object
rvest::read_html(dat[1])
#> {html_document}
#> <html>
#> [1] <body>\n<b>Positives:</b> Rangy, athletic build with room for additional  ...


# Error
rvest::read_html(dat[2])
#> Error in `path_to_connection()`:
#> ! 'Positives: Better football player than his measureables would
#>   indicate. ...' does not exist in current working directory
#>   ('C:/LONG_PATH_HERE').

Is there a quick way to make read_html() treat every string as XML even when it contains no tags, or another way to strip the HTML that gives the same result as read_html() |> html_text()?

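One idea is to simply append a dummy tag to the end of each string. However, I imagine there is a more efficient approach: either return the string unchanged, without any parsing, when it contains no HTML, or use a function argument that does this. Other alternatives include stripping the tags with a regular expression, although that goes against the "don't use regex on HTML" principle.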
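The "return the string unchanged when it contains no HTML" idea could be sketched as follows. This assumes that a literal < is a good enough signal that a string contains markup; strip_html is a hypothetical helper name, not an rvest function. Strings without < are returned untouched, so any entities in them (such as &amp;) are not decoded.

```r
library(rvest)

# Hypothetical helper: parse only the elements that contain "<".
strip_html <- function(x) {
  # grepl() returns FALSE for NA, so NA values pass through unchanged
  has_tag <- grepl("<", x, fixed = TRUE)
  x[has_tag] <- vapply(
    x[has_tag],
    function(s) html_text(read_html(s)),
    character(1),
    USE.NAMES = FALSE
  )
  x
}

dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)
strip_html(dat)
```

Only the tagged strings pay the cost of parsing, which may matter when most of the input is already plain text.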

Answer 1

You can try this:

### Packages
library(rvest)
library(purrr)

### Data
dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)

### Function that converts a string to raw, parses it with read_html(), then extracts the text
clean <- function(x) {
  read_html(charToRaw(x)) %>% html_text()
}

### Map the function over the character vector
### (the .progress argument requires purrr >= 1.0.0)
map_chr(dat, clean, .progress = TRUE)

Output:

[1] "Positives: Rangy, athletic build with room for additional growth. ..."      
[2] "Positives: Better football player than his measureables would indicate. ..."
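
The answer works because read_html() handles a raw vector as document content, whereas a character string with no < or > appears to be treated as a path or URL. A base R variant without purrr, as a sketch: clean_all is a hypothetical name, and the guard for NA and empty strings is an added assumption about what the caller wants, not part of the answer above.

```r
library(rvest)

# Hypothetical helper: strip HTML from every element of a character vector.
clean_all <- function(x) {
  vapply(x, function(s) {
    # Assumption: NA and empty strings should be returned unchanged
    if (is.na(s) || !nzchar(s)) return(s)
    # Raw input is parsed as content, never looked up as a file path
    html_text(read_html(charToRaw(enc2utf8(s))))
  }, character(1), USE.NAMES = FALSE)
}

dat <- c(
  "<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
  "Positives: Better football player than his measureables would indicate. ..."
)
clean_all(dat)
```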
