Reformatting PDF text into a data frame to remove extra information


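I'm trying to load the text of a PDF into R for text analysis. The PDF is laid out so that side columns of extra information sit next to the main text. See the screenshot below.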

[Screenshot: the PDF page layout]

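I want to load only the main body of the text so that I can build a corpus and analyze collocations, keyness, and so on. I want to compare this program note with one on the same piece written a century ago by the Chicago Symphony Orchestra.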

Here is the code I have so far:

pdf.file <- "file:///C:/Users/ayaxi/Downloads/program.pdf"
download.file(pdf.file, destfile = "sample.pdf", mode = "wb")
pdf.text <- pdftools::pdf_text("sample.pdf") |>
  readr::read_lines()

tannhauser <- pdf.text[657:799]

This is what the code above reads into R:

[Screenshot: the text as read into R]

I then added

tannhauser |> str_squish()

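This removes the extra whitespace once the text is loaded into R, but I still can't work out how to filter out the words and text that come from the side columns so that I am left with only the main body. Is there a way to do this?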

I tried str_remove_all("\S{3,}[^\s]+\S{3,}") to remove all text surrounded by three or more spaces, but this removes words from the main text as well.
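One possible reason the pattern above hits the main text: in a regular expression `\S` matches non-whitespace and `\s` matches whitespace, so `\S{3,}` looks for runs of three or more visible characters rather than three or more spaces. In an R string the backslash also has to be doubled (`"\\s"`). Separately, `str_squish()` collapses the wide gaps that mark the column boundaries, so any gap-based split has to happen before squishing. A minimal sketch of that idea follows; the sample lines are invented placeholders, and it assumes the main text is always the last piece on a line, which may not hold for the real PDF.

```r
library(stringr)

# Invented lines mimicking pdf_text() output: a narrow side column on the
# left, a wide gap, then the main text. Not taken from the real PDF.
raw_lines <- c(
  "Side note one       Main text, first line.",
  "Side note two       Main text, second line.",
  "                    Main text, third line."
)

# Split each line wherever three or more whitespace characters occur.
# "\\s" in an R string is the regex \s (whitespace).
pieces <- str_split(raw_lines, "\\s{3,}")

# Assumption: the main text is the last piece on every line.
main_text <- vapply(pieces, function(p) p[length(p)], character(1))

# Only now collapse any leftover whitespace.
main_text <- str_squish(main_text)
main_text
```

If the side column sits to the right of the main text instead, taking the first non-empty piece would be the analogous choice.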

r pdf filter text data-wrangling
1 Answer

This seems to be what you are looking for:

pdf.file <- "science.pdf"

pdf.text <- pdftools::pdf_text(pdf.file) |>
  readr::read_lines()

pdf.text <- pdf.text |> stringr::str_squish()

# Collect the squished lines into one character vector
# (same result as appending each line to an empty vector in a loop)
body <- pdf.text

cat(body)

500 Spring Books Supplement Nature Vol. 296 8 April 1982 Toulmin and others he views science as a Science in society, society in science spectrum with the value-laden social and C.A. Russell human sciences at one end, but with the largely value-free physical sciences at the Between Science and Values. By Loren R. "restrictionists". The former hold that other. His carefully documented case- Graham. Pp.449. ISBN 0-231-05192-1. science and values did, and do, have much studies offer some support for the view (Columbia University Press: 1981.) $25.90, in common and, while in one sense there recently expressed (by D. M. MacKay) that £14.40. was no separate world "between" them, the notion of universally value-laden they were capable of the most potent science is an illogical extrapolation from SciENCE is about what is - what exists interactions that could have consequences the social sciences (where it is undoubtedly "out there" in the given world of nature.
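In the output above, text from neighbouring columns appears to run together on each line, because `pdf_text()` returns whole page lines. An alternative worth trying is to work from word positions: `pdftools::pdf_data()` returns one data frame per page with one row per word, including `x`, `y` and `text` columns. The sketch below illustrates the filtering step on an invented toy data frame rather than a real PDF. The helper name `extract_body` and the `x_min`/`x_max` thresholds are hypothetical and would need to be read off the actual page. Grouping on exact `y` values is also a simplification, since words on one visual line may differ slightly in `y`.

```r
# Toy stand-in for one page of pdftools::pdf_data() output: one row per word,
# with x/y giving the word's position on the page. All values are invented.
words <- data.frame(
  x    = c(40, 80, 200, 250, 40, 200, 260),
  y    = c(100, 100, 100, 100, 112, 112, 112),
  text = c("Side", "note", "Main", "text", "Aside", "continues", "here"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep words whose left edge lies within [x_min, x_max],
# then rebuild lines from top to bottom, left to right.
extract_body <- function(words, x_min, x_max) {
  body <- words[words$x >= x_min & words$x <= x_max, ]
  body <- body[order(body$y, body$x), ]
  lines <- tapply(body$text, body$y, paste, collapse = " ")
  unname(as.vector(lines))
}

extract_body(words, x_min = 150, x_max = 400)
```

With a real file, the same helper could be applied to each element of the list that `pdf_data()` returns, once suitable thresholds for the body column have been chosen.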