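I am trying to load text from a PDF into R for text analysis. The PDF is laid out so that the main text is flanked by columns containing extra information. See the screenshot below.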
pdf.file <- "file:///C:/Users/ayaxi/Downloads/program.pdf"
download.file(pdf.file, destfile = "sample.pdf", mode = "wb")
pdf.text <- pdftools::pdf_text("sample.pdf")
tannhauser <- pdf.text[657:799]
This is what gets read into R with the code above.
tannhauser |> stringr::str_squish()
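This removes the extra whitespace once the text is loaded into R, but I still can't figure out how to filter out the words and text that come from the side columns so that I can look at only the main body of the text. Is there any way to do this?
I tried str_remove_all("\S{3,}[^\s]+\S{3,}") to remove all text surrounded by three or more spaces, but this removes words from the main text as well.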
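One possible direction, instead of deleting matches with a regex, is to work before squishing: pdf_text() keeps the page layout as runs of spaces, so each line can be split on runs of three or more spaces and one chunk kept. This is only a rough sketch under that assumption. The helper name pick_column and the sample text are made up for illustration, and the approach is fragile when a line is missing one of its columns, because the chunk positions then shift.

```r
# Hypothetical helper: split each layout line on runs of 3+ spaces and keep
# the chunk at position `column` (NA when a line has fewer chunks).
pick_column <- function(page_text, column = 1) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  # trimws() drops leading/trailing spaces so the first chunk is never empty
  chunks <- strsplit(trimws(lines), " {3,}")
  vapply(chunks, function(parts) {
    if (length(parts) >= column) parts[column] else NA_character_
  }, character(1))
}

# Made-up page text with a note column on the right (not from the real PDF).
page_text <- "First line of the body     note A\nSecond line of the body    note B\nThird line only"

body <- pick_column(page_text, column = 1)
print(body)
```

With the real file, page_text would be a single element of the vector returned by pdftools::pdf_text(), taken before any call to str_squish().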
pdf.file <- "science.pdf"
pdf.text <- pdftools::pdf_text(pdf.file)
pdf.text <- pdf.text |> stringr::str_squish()
# Collect the squished text of each page into one character vector
body <- c()
for(i in pdf.text){
  body <- c(body, paste(i, collapse = ""))
}
500 Spring Books Supplement Nature Vol. 296 8 April 1982 Toulmin and others he views science as a Science in society, society in science spectrum with the value-laden social and C.A. Russell human sciences at one end, but with the largely value-free physical sciences at the Between Science and Values. By Loren R. "restrictionists". The former hold that other. His carefully documented case- Graham. Pp.449. ISBN 0-231-05192-1. science and values did, and do, have much studies offer some support for the view (Columbia University Press: 1981.) $25.90, in common and, while in one sense there recently expressed (by D. M. MacKay) that £14.40. was no separate world "between" them, the notion of universally value-laden they were capable of the most potent science is an illogical extrapolation from SciENCE is about what is - what exists interactions that could have consequences the social sciences (where it is undoubtedly "out there" in the given world of nature.
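If the squished text still interleaves the columns, as in the output above, an alternative worth trying is to work from word positions rather than whitespace. pdftools::pdf_data() returns one data frame per page with one row per word and columns that include x, y and text. Below is a minimal sketch, assuming the main body sits inside a known horizontal band. The helper name body_text, the band limits and the toy data frame are all invented for illustration, not taken from the actual PDF.

```r
# Hypothetical helper: keep only words whose left edge falls inside the
# horizontal band [x_min, x_max], then rebuild the text line by line.
body_text <- function(words, x_min, x_max) {
  keep <- words[words$x >= x_min & words$x <= x_max, ]
  # Sort top to bottom, then left to right within a line
  keep <- keep[order(keep$y, keep$x), ]
  # Words sharing the same y coordinate are treated as one line
  lines <- tapply(keep$text, keep$y, paste, collapse = " ")
  paste(lines, collapse = "\n")
}

# Toy stand-in for one page of pdf_data() output (made-up coordinates).
page <- data.frame(
  x = c(40, 200, 250, 40, 200, 250),
  y = c(100, 100, 100, 112, 112, 112),
  text = c("note", "main", "text", "aside", "body", "here"),
  stringsAsFactors = FALSE
)

result <- body_text(page, x_min = 150, x_max = 400)
cat(result, "\n")
```

For a real document, page would be something like pdftools::pdf_data(pdf.file)[[1]], and the band limits would have to be read off the x values of that page. Words on the same printed line can have slightly different y values, so the grouping may need rounding before it works reliably.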