Is there an R script for counting how many keywords from a set appear in a PDF?


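I have a lot of PDF files, and I need to search each one for specific keywords/phrases. For each PDF, I want to know how many of these keywords/phrases appear (not how many times each one appears). Each keyword/phrase that appears gets one point, no matter how often it occurs; each one that does not appear gets zero. I would like a script that scans the PDFs and scores the keywords/phrases this way. I have looked at the readPDF function and PDF Data Extractor (PDE), but I am not sure they do what I want. Any guidance?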

r pdf
1 Answer

You can use the pdftools package to extract the text from each PDF and then search that text for the keywords/phrases.

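  1. You need to create a folder called "pdfs" in the same location as the R script and store your PDFs there:
    pdf_dir <- "pdfs"
  2. The script writes the final results to a CSV file:
    write.csv(final_results, "keyword_search_results.csv", row.names = FALSE)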

That should do the trick. If you have any questions about it, just ask.
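
Before running this on real PDFs, the scoring rule (1 if a keyword is present, 0 otherwise) can be tried on a plain string. The sketch below uses only base R and a made-up sentence in place of extracted PDF text. It also shows a side effect worth knowing about: plain substring matching, which is what `fixed()` does in the full script, counts "hairs" as found inside "chairs". The whole-word variant is an optional alternative, not part of the original script, and assumes the keywords contain only letters, digits and spaces, so no regex escaping is needed.

```r
# Made-up sample text standing in for the text extracted from one PDF
full_text <- "The chairs were arranged to present information clearly."
keywords <- c("information", "hairs", "example phrase")

# Substring scoring: 1 if the keyword occurs anywhere, ignoring case
substring_hit <- sapply(keywords, function(k) {
  as.integer(grepl(tolower(k), tolower(full_text), fixed = TRUE))
})

# Whole-word scoring: \\b requires a word boundary on both sides
# (assumption: keywords contain no regex metacharacters)
whole_word_hit <- sapply(keywords, function(k) {
  as.integer(grepl(paste0("\\b", k, "\\b"), full_text,
                   ignore.case = TRUE, perl = TRUE))
})

# Compare the two scoring rules side by side
print(rbind(substring_hit, whole_word_hit))
```

With this sample text, "hairs" scores 1 under substring matching and 0 under whole-word matching. Which behaviour is right depends on whether partial matches should count for your keywords.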

# Set the current script's location as the working directory
# (this call only works when the script is run inside RStudio)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# install.packages(c("pdftools", "stringr", "dplyr", "rstudioapi"))


library(pdftools)
library(stringr)
library(dplyr)

# Define the list of keywords/phrases
keywords <- c("information", "hairs", "example phrase")

# Function to count keywords in a PDF
count_keywords <- function(pdf_path, keywords) {
  # Extract text from PDF
  pages <- pdf_text(pdf_path)  # one string per page; named "pages" to avoid shadowing pdf_text()
  
  # Combine all pages into a single text
  full_text <- paste(pages, collapse = " ")
  
  # Check if each keyword/phrase appears in the text
  keyword_found <- sapply(keywords, function(keyword) {
    if (str_detect(full_text, fixed(keyword, ignore_case = TRUE))) {
      return(1)  # Assign 1 if keyword is found
    } else {
      return(0)  # Assign 0 if keyword is not found
    }
  })
  
  # Calculate total points for this PDF
  total_points <- sum(keyword_found)
  
  # Return a summary
  return(data.frame(
    PDF = basename(pdf_path),
    Keyword = keywords,
    Found = keyword_found,
    Total_Points = total_points
  ))
}

# Directory containing PDF files
# (create a folder called "pdfs" on the same level as this script and store your PDFs there)
pdf_dir <- "pdfs"

# Get list of PDF files
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

# Process each PDF and count keywords
results <- lapply(pdf_files, count_keywords, keywords = keywords)

# Combine results into a single data frame
final_results <- bind_rows(results)

# Save to a CSV file
write.csv(final_results, "keyword_search_results.csv", row.names = FALSE)

# Print the results
print(final_results)

The result is:

> print(final_results)
                             PDF        Keyword Found Total_Points
information    somatosensory.pdf    information     1            2
hairs          somatosensory.pdf          hairs     1            2
example phrase somatosensory.pdf example phrase     0            2
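
The output above has one row per PDF and keyword. If one row per PDF is easier to read, the long table can be pivoted. The following is a minimal sketch in base R, not part of the original answer. Here `final_results` is rebuilt by hand from the printed output so the snippet runs on its own; in practice you would use the data frame produced by the script.

```r
# Rebuilt by hand from the printed output above, for illustration only
final_results <- data.frame(
  PDF = "somatosensory.pdf",
  Keyword = c("information", "hairs", "example phrase"),
  Found = c(1, 1, 0),
  Total_Points = 2
)

# One row per PDF, one column per keyword (1 = present, 0 = absent)
presence <- tapply(
  final_results$Found,
  list(PDF = final_results$PDF, Keyword = final_results$Keyword),
  max
)

# Append the per-PDF score as the row sum of the presence flags
wide <- cbind(presence, Total_Points = rowSums(presence))
print(wide)
```

The keyword columns come out in alphabetical order because `tapply` treats the grouping values as factors.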