Trying to extract information from PDF files in Google Colab. It just repeats most of the information from the first file for all the other files


Here is the code:

for file in files.get('files', []):
    # ... (Get file content as before)

    # Extract data from the PDF
    pdf_reader = PyPDF2.PdfReader(BytesIO(file_content))
    page = pdf_reader.pages[0]  # Assuming you want to extract from the first page

    # 1. File Name
    file_name = file['name']
    print(f"File: {file_name}")

    # 2. Process Number
    process_number = None
    process_number_match = None
    process_number_match = re.search(r"(\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4})", page.extract_text())
    if process_number_match:
        process_number = process_number_match.group(1)
        print(f"Process Number: {process_number}")
    else:
        print("erro, número do processo não encontrado")

    # 3. Name
    name = None  # Reset the name variable
    name_match = None
    name_match = re.search(r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z\s]+)", page.extract_text())
    if name_match:
        name = name_match.group(2)
        print(f"Name: {name}")
    else:
        print("error, nome não encontrado")

    # 4. Keywords
    found_keywords = []  # Reset the found_keywords list
    keywords = ["audiência", "subsídios", "cumprimento"]
    for keyword in keywords:
        if keyword in page.extract_text():
            found_keywords.append(keyword)
    if found_keywords:
        print(f"Keywords Found: {', '.join(found_keywords)}")
    else:
        print("erro, pedido não encontrado")

It keeps printing:

Keywords Found: cumprimento
File: 33-00737.015338.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 32-00737.012571.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 31-00737.012592.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 30-00737.010470.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 29-00737.007060.pdf
Process Number: (number1)
Name:(name1)

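The file name updates on every iteration, so the correct file seems to be read. But the other strings keep repeating. I tried resetting the variables with = None, but that did not help.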

I tried using:

# 3. Name
    name = None  # Reset the name variable
    name_match = None
    name_match = re.search(r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z\s]+)", page.extract_text())
    if name_match:
        name = name_match.group(2)
        print(f"Name: {name}")
    else:
        print("error, nome não encontrado")

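I expected the name from each document to be printed. Instead, I get the correct name for the first document, and that same name is repeated for all the other documents.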

python google-colaboratory pypdf
1 Answer

Your code is terrible. Have you gone through a course?
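For something more concrete: the posted loop elides the download step ("Get file content as before"), so the cause cannot be confirmed from the code shown. Two things seem worth checking. First, file['name'] comes from the listing metadata, so a correct file name does not prove that file_content was refreshed on that iteration; if the download runs once outside the loop, or a buffer is reused, every PdfReader would parse the same bytes. Second, the stray "S" printed after each name is consistent with \s inside the name character class matching newlines. The sketch below is one way to test both ideas. The helper names (content_fingerprint, find_repeated_content, extract_name) and the sample data are hypothetical, not part of the original code.

```python
import hashlib
import re

# Same labels as in the question, but the name class uses a literal space
# instead of \s, so the match cannot run across line breaks.
NAME_RE = re.compile(r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z ]+)")


def content_fingerprint(file_content: bytes) -> str:
    """Return a short SHA-256 digest identifying the downloaded bytes."""
    return hashlib.sha256(file_content).hexdigest()[:12]


def find_repeated_content(downloads):
    """Given (file_name, file_content) pairs, return (name, earlier_name)
    pairs for every file whose bytes equal those of an earlier file."""
    seen = {}
    repeated = []
    for file_name, file_content in downloads:
        digest = content_fingerprint(file_content)
        if digest in seen:
            repeated.append((file_name, seen[digest]))
        else:
            seen[digest] = file_name
    return repeated


def extract_name(text: str):
    """Return the name following one of the labels, limited to one line."""
    match = NAME_RE.search(text)
    return match.group(2).strip() if match else None


# Hypothetical stand-ins for what the download step produced per file.
downloads = [
    ("a.pdf", b"%PDF first document"),
    ("b.pdf", b"%PDF first document"),  # same bytes as a.pdf: stale download
    ("c.pdf", b"%PDF third document"),
]
print(find_repeated_content(downloads))
print(extract_name("AUTOR: MARIA DA SILVA\n \nS"))
```

If the fingerprints printed inside the real loop are identical for every file, the problem is in the download step rather than in the PyPDF2 or regex code, and resetting variables to None would not be expected to change anything. Note that [A-Z ] also stops at accented capitals, so names containing them would need a wider character class.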
