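I have a lot of PDF files, and I need to search each one for specific keywords/phrases. For each PDF, I want to know how many of these keywords/phrases appear (not how many times each one appears). Each keyword/phrase that appears gets one point, no matter how many times it occurs; each one that is absent gets zero. I would like a script that scans the PDFs and scores the keywords/phrases this way. I have looked at the readPDF function and PDF Data Extractor (PDE), but I am not sure they do what I want. Any guidance?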
You can use the pdftools package to extract the text from each PDF and then search that text for your keywords/phrases.
This should do the trick. If you have any questions about it, just ask.
# Set the working directory to this script's location (works only when run inside RStudio)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Install the packages once if needed:
# install.packages(c("pdftools", "stringr", "dplyr", "rstudioapi"))
library(pdftools)
library(stringr)
library(dplyr)
# Define the list of keywords/phrases
keywords <- c("information", "hairs", "example phrase")
# Function to score keywords in a single PDF (1 if present, 0 if absent)
count_keywords <- function(pdf_path, keywords) {
  # Extract text from the PDF (one string per page)
  page_text <- pdf_text(pdf_path)
  # Combine all pages into a single string
  full_text <- paste(page_text, collapse = " ")
  # Case-insensitive literal match: 1 if the keyword/phrase appears, 0 otherwise
  keyword_found <- sapply(keywords, function(keyword) {
    as.integer(str_detect(full_text, fixed(keyword, ignore_case = TRUE)))
  })
  # Total points for this PDF = number of distinct keywords found
  total_points <- sum(keyword_found)
  # Return one row per keyword, with the PDF's total repeated on each row
  data.frame(
    PDF = basename(pdf_path),
    Keyword = keywords,
    Found = keyword_found,
    Total_Points = total_points
  )
}
# Directory containing the PDF files: create a folder named "pdfs"
# next to this script and store your PDFs there
pdf_dir <- "pdfs"
# Get list of PDF files
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)
# Process each PDF and count keywords
results <- lapply(pdf_files, count_keywords, keywords = keywords)
# Combine results into a single data frame
final_results <- bind_rows(results)
# Save to a CSV file
write.csv(final_results, "keyword_search_results.csv", row.names = FALSE)
# Print the results
print(final_results)
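One caveat with the script above: fixed() matches substrings, so "hairs" also scores a point for "chairs", and a phrase that is wrapped across a line break in the extracted text will be missed. If that matters for your keywords, here is a minimal sketch of a stricter check. The helper name found_whole_word and the sample text are made up for illustration, and the sketch assumes your keywords contain no regex metacharacters.

```r
library(stringr)

# Hypothetical page text, standing in for the output of pdf_text()
page_text <- c("Put the chairs away.",
               "This is an example\nphrase wrapped over two lines.")

# Collapse the pages and reduce every whitespace run (including line breaks)
# to a single space, so multi-word phrases match across line wraps
full_text <- str_squish(paste(page_text, collapse = " "))

# Hypothetical helper: 1 if the keyword occurs as a whole word/phrase, else 0.
# Assumes the keyword contains no regex metacharacters.
found_whole_word <- function(text, keyword) {
  pattern <- regex(paste0("\\b", keyword, "\\b"), ignore_case = TRUE)
  as.integer(str_detect(text, pattern))
}

keywords <- c("hairs", "example phrase")
scores <- sapply(keywords, found_whole_word, text = full_text)
print(scores)
```

With this sample text, "hairs" gets 0 because it only occurs inside "chairs", while "example phrase" gets 1 even though the original text had a line break in the middle of it.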
The output of the script above is:
> print(final_results)
                             PDF        Keyword Found Total_Points
information    somatosensory.pdf    information     1            2
hairs          somatosensory.pdf          hairs     1            2
example phrase somatosensory.pdf example phrase     0            2
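The result is in long format (one row per PDF and keyword). If you only want one score per PDF, you could collapse it as in the sketch below. The data frame is rebuilt by hand from the printed output so the example runs on its own, and the names score_per_pdf and Score are just ones I picked here.

```r
library(dplyr)

# Rebuild the long-format result (values copied from the printed output)
final_results <- data.frame(
  PDF = "somatosensory.pdf",
  Keyword = c("information", "hairs", "example phrase"),
  Found = c(1, 1, 0),
  Total_Points = 2
)

# One row per PDF: the number of distinct keywords found in it
score_per_pdf <- final_results %>%
  group_by(PDF) %>%
  summarise(Score = sum(Found), .groups = "drop")

print(score_per_pdf)
```

With several PDFs in the folder this gives one row per file, which is usually easier to read than the repeated Total_Points column.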