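In this solution, to strip HTML tags from a string, the string is passed to rvest::read_html() to create an html_document object, which is then passed to rvest::html_text() to return "HTML-free text". However, if the string contains no HTML tags, read_html() throws an error because the string is treated as a file/connection path, as shown below. This can be a problem when stripping HTML from many strings, some of which may not contain any tags.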
library(rvest)
# Example data
dat <- c(
"<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
"Positives: Better football player than his measureables would indicate. ..."
)
# Success: produces html_document object
rvest::read_html(dat[1])
#> {html_document}
#> <html>
#> [1] <body>\n<b>Positives:</b> Rangy, athletic build with room for additional ...
# Error
rvest::read_html(dat[2])
#> Error in `path_to_connection()`:
#> ! 'Positives: Better football player than his measureables would
#> indicate. ...' does not exist in current working directory
#> ('C:/LONG_PATH_HERE').
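Is there a quick way to make sure read_html() treats every string as XML even when it contains no tags, or another way to strip the HTML that gives the same result as read_html() |> html_text()?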
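One idea is simply to append a dummy tag to the end of each string. However, I imagine there is a more efficient approach: one that returns the string untouched, without doing any work, when it contains no HTML, or one that uses a function argument to achieve this. Other alternatives include stripping the tags with a regular expression, although that violates the "don't use regex on HTML" principle.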
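You could try the following. Converting each string to a raw vector makes read_html() parse it as literal content rather than treating it as a path: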
### Packages
library(rvest)
library(purrr)
### Data
dat <- c(
"<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
"Positives: Better football player than his measureables would indicate. ..."
)
### Write a function that converts each string to raw, parses it with read_html(), then extracts the text
clean <- function(x) {
  read_html(charToRaw(x)) %>% html_text()
}
### Map the function over the character vector
map_chr(dat, clean, .progress = TRUE)
Output:
[1] "Positives: Rangy, athletic build with room for additional growth. ..."
[2] "Positives: Better football player than his measureables would indicate. ..."
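If you would rather skip parsing altogether for strings that cannot contain a tag, as the question suggests, a rough sketch along the following lines might work. The helper name strip_html is hypothetical, and the check for a literal "<" is my own assumption about what counts as "might contain HTML"; it is not something rvest provides.

```r
library(rvest)

# Hypothetical helper (not part of rvest): parse only the strings that
# contain a "<" and return every other string unchanged.
strip_html <- function(x) {
  # Assumption: a string without "<" cannot contain an HTML tag
  has_tag <- grepl("<", x, fixed = TRUE)
  x[has_tag] <- vapply(
    x[has_tag],
    # charToRaw() makes read_html() parse the content instead of looking for a file
    function(s) html_text(read_html(charToRaw(s))),
    character(1),
    USE.NAMES = FALSE
  )
  x
}

# Shortened example strings for illustration
dat <- c(
  "<B>Positives:</B> Rangy, athletic build.",
  "Positives: Better football player."
)
strip_html(dat)
```

One caveat: strings that are skipped are never parsed, so HTML entities such as &amp;amp; in tag-free strings are left as they are. If that matters for your data, mapping the raw-vector version over every string is the safer choice.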