Extracting text from a PDF with Python without including annotations


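I have been trying to extract text from PDF files with Python to automate an important but tedious part of my job. With ChatGPT's help I have written quite a lot of code. However, I have run into a problem that neither ChatGPT nor I can solve. Some of the PDFs contain annotations in the form of text boxes, comment notes, highlights, and overlines; basically every type of annotation.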

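The program I wrote works well: it extracts everything I need and exports it to an Excel file. However, it also extracts the text inside these annotations and puts it in the same Excel file, which makes the output harder to read.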

My question is: is there a way to extract text from a PDF without also extracting the text of its annotations?

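Below is what ChatGPT has suggested several times. When implemented, the program does skip the annotations, but it also skips any text lying underneath them. As a result, it misses key information during extraction.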

import fitz  # PyMuPDF

# Get annotation rectangles
annotations = []
if page.annots():
    for annot in page.annots():
        annotations.append(annot.rect)

def extract_words_from_box(page, rect, annotations):
    words = page.get_text("words")  # Extract words from the page
    words_in_box = [word for word in words if rect.intersects(fitz.Rect(word[:4]))]
    # Exclude words that fall within annotation rectangles
    words_in_box = [word for word in words_in_box if not any(fitz.Rect(word[:4]).intersects(annot_rect) for annot_rect in annotations)]
    return words_in_box

def print_text_in_boxes(pdf_path):
    material_box_def = (40, 85, 70, 705)  # Adjusted the dimensions
    po_number_box_def = (330, 40, 570, 115)  # PO number box dimensions
    destination_box_def = (40, 200, 330, 300)  # Destination box dimensions
    other_boxes_definitions = {
        'product_code': (72, -2, 131, 15),  # Adjusted to avoid concatenation
        'due_date': (217, -2, 270, 15),
        'qty': (286, -2, 314, 15),
        'net_price': (350, -2, 399, 15),
        'material_revision_box': (80, 15, 192, 80)  # Added box for Material Revision
    }

    po_number = None  # Initialize PO number as None
    destination = "Unknown"  # Default destination as "Unknown"
    extracted_data = []

    try:
        doc = fitz.open(pdf_path)

        for page_num in range(len(doc)):
            page = doc[page_num]

            # Get annotation rectangles
            annotations = []
            if page.annots():
                for annot in page.annots():
                    annotations.append(annot.rect)

            # Extract PO number from the first page only
            if page_num == 0:
                po_number_rect = fitz.Rect(*po_number_box_def)
                words_in_po_number_box = extract_words_from_box(page, po_number_rect, annotations)
                po_numbers = [word[4] for word in words_in_po_number_box if word[4].isdigit() and len(word[4]) == 10]
                if po_numbers:
                    po_number = po_numbers[0]

                destination_rect = fitz.Rect(*destination_box_def)
                words_in_destination_box = extract_words_from_box(page, destination_rect, annotations)
                destination_text = ' '.join(word[4] for word in words_in_destination_box)
                destination = determine_destination(destination_text)

            material_rect = fitz.Rect(*material_box_def)
            words_in_material_box = extract_words_from_box(page, material_rect, annotations)
            
            item_numbers = [word for word in words_in_material_box if word[4].isdigit() and len(word[4]) == 5]
            for item_number in item_numbers:
                item_number_rect = fitz.Rect(item_number[:4])
                item_info = extract_item_info(page, item_number_rect, other_boxes_definitions, po_number, destination, annotations)
                
                # Separate product code and name
                if 'product_code' in item_info:
                    description_parts = item_info['product_code'].split(' ', 1)
                    if len(description_parts) == 2:
                        item_info['product_code'] = description_parts[0]
                        item_info['product_name'] = description_parts[1]
                    else:
                        item_info['product_code'] = description_parts[0]
                        item_info['product_name'] = 'No text found'

                # Convert due date to Finnish format
                if 'due_date' in item_info:
                    item_info['due_date'] = convert_date_format(item_info['due_date'])

                extracted_data.append(item_info)  # Add extracted item info to the data list

    except Exception as e:
        print(f"Error processing {pdf_path}: {str(e)}")

    return extracted_data

python text-extraction pymupdf
1 answer

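The idea is to record the annotation rectangles and then delete all annotations. As long as text extraction cannot be told to ignore annotation text directly, this serves as a workaround.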

import pymupdf

# retrieve all annotation rectangles
annotations = [annot.rect for annot in page.annots()]

# delete all annotations
doc.xref_set_key(page.xref, "Annots", "null")  # remove all annotations

# update the page
page = doc.reload_page(page)

# now extract text using annotation rectangles
....

Note that the steps above do not change the PDF file. You can restore the original state by:
  1. doc.close()
  2. doc = pymupdf.open(doc.name)
  3. page = doc[pno]