How to search a directory of PDF files (about 5000 PDFs) for multiple keywords occurring in the PDFs

Question

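I am fairly new to Python, but I decided to try building a tool for work that looks for user-entered keywords in a small subset of PDF documents.

So far it handles one keyword at a time well, and I have managed to upgrade the GUI. However, I cannot search the documents for more than one keyword or phrase. For example, if I search for a name, I can type it in and it works fine: it returns all the PDFs in which that name appears (which is great). But when I try to add another search term or phrase, I cannot get it to work.

For example, suppose I want to find every Mr Smith in the system and then also search for a drug name (I work in toxicology). I would enter: Smith, Paracetamol. But this does not work.

It only works for Mr Smith on his own. I think this is because it matches the exact text, but I do not know how to add more keywords.

Any help is appreciated. I have posted my code so far, which includes all the GUI stuff.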

import requests,webbrowser
from bs4 import BeautifulSoup
from tkinter import *
import os
import fitz  # PyMuPDF
import customtkinter

path= r'O:\Sent Questions'
files = os.listdir(path)

customtkinter.set_appearance_mode("dark")
root = customtkinter.CTk()
root.geometry("700x350") 
root.title("Questions Keyword Search")

label=customtkinter.CTkLabel(root,text="Questions Keyword Search Engine",font=("Inter",30))
label.pack(side=TOP) 
text=StringVar()
def search():
    global entry
    Search = entry.get()
    print(Search)
    for file in files:
        doc=fitz.open(path+'\\'+file)
        for page in doc:
            text = page.get_text()
            # print(text)
            result = text.find(Search)
            if result != -1:
                print(file)
                pass

            # need to add another loop over the keywords, so that each page in doc is searched for every keyword
            
label_1=customtkinter.CTkLabel(root,text="Enter Keywords Below",font=("Inter",15))
label_1.place(x=275,y=100)

label_2=customtkinter.CTkLabel(root,text="You can input as many as you'd like but they must be in double quotation marks and split by commas",font=("Inter",12))
label_2.place(x=97,y=130)

label_3=customtkinter.CTkLabel(root,text='Example: cocaine, SoHT, LC-MS',font=("Inter",12))
label_3.place(x=250,y=150)

entry=customtkinter.CTkEntry(master=root,width=200)
entry.place(x=252,y=190)
button=customtkinter.CTkButton(master=root,text="Search",command=search)
button.place(x=285,y=230)
root.mainloop()

input("prompt: ")

Answer 1

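It seems you want to select the pages that mention all of the keywords entered by the user.

I would ask the user to separate the keywords with commas and not insist on double quotes as delimiters. Then split the user input on commas to get the list of keywords. To avoid case issues, convert everything to lowercase, both the keywords and the extracted text. It would look like this: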


def search():
    # split on commas, strip surrounding whitespace, drop empty entries
    kw_list = [kw.strip() for kw in entry.get().lower().split(",") if kw.strip()]
    if not kw_list:
        return  # nothing to search for

    for file in files:
        doc = fitz.open(os.path.join(path, file))
        for page in doc:
            text = page.get_text().lower()
            # report the page only if every keyword occurs on it
            if all(kw in text for kw in kw_list):
                print(f"found on page {page.number + 1} of file {doc.name}")

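I am not sure whether this selection logic matches your requirement. Maybe you want to be less strict and accept a page as soon as at least n-1 of the keywords are present, or something similar. In that case you need to change the selection so that it counts the matches before deciding to skip a page.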
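The less strict variant mentioned above could be sketched as follows. This is only a minimal illustration that works on plain strings: `page_matches` and `max_missing` are hypothetical names that do not appear in the original code, and the text passed in is assumed to come from `page.get_text()`.

```python
def page_matches(text, keywords, max_missing=1):
    """Return True if at most `max_missing` keywords are absent from `text`.

    `page_matches` and `max_missing` are hypothetical names used for
    illustration only; they are not part of PyMuPDF.
    """
    text = text.lower()
    # count the keywords that do not occur anywhere in the page text
    missing = sum(1 for kw in keywords if kw.lower() not in text)
    return missing <= max_missing


keywords = ["smith", "paracetamol", "lc-ms"]
# one keyword ("lc-ms") is missing, which is still tolerated
print(page_matches("Mr Smith was given paracetamol.", keywords))
# two keywords are missing, so the page is rejected
print(page_matches("Mr Smith attended.", keywords))
```

Setting `max_missing=0` gives back the strict "all keywords" behaviour, so the same helper could replace the `all(...)` style check inside the page loop.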

Answer 2

def search():
    # make a set of the desired keywords (whitespace stripped, empty entries dropped)
    keywords = {kw.strip() for kw in entry.get().lower().split(",") if kw.strip()}

    for file in files:
        doc = fitz.open(os.path.join(path, file))
        file_keywords = set()  # keywords seen anywhere in this file
        for page in doc:
            text = page.get_text().lower()
            for kw in keywords:
                if kw in text:
                    file_keywords.add(kw)  # take note this keyword occurred
        # check whether all desired keywords are present in this file
        if keywords <= file_keywords:
            print(f"All keywords in file '{doc.name}'.")
