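I have been trying to extract text from PDF files with Python to automate an important but tedious part of my job. With ChatGPT's help I have written a fair amount of code, but I have run into a problem that neither I nor ChatGPT can solve. Some of the PDFs contain comments in the form of text boxes, comment annotations, highlights and overlines: basically every kind of annotation.
The program I wrote works well: it extracts everything I need and exports it to an Excel file. However, it also extracts the text inside those annotations and writes it to the same Excel file, which makes the result harder to read.
My question is: is there a way to extract text from a PDF without extracting the text of its annotations?
Below is what ChatGPT has repeatedly suggested. When I implement it, the program does skip the annotations, but it also skips any text that lies underneath them, so key information is lost during extraction.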
# Get annotation rectangles
annotations = []
if page.annots():
    for annot in page.annots():
        annotations.append(annot.rect)
import fitz  # PyMuPDF (newer releases also allow `import pymupdf`)

def extract_words_from_box(page, rect, annotations):
    words = page.get_text("words")  # Extract words from the page
    words_in_box = [word for word in words if rect.intersects(fitz.Rect(word[:4]))]
    # Exclude words that fall within annotation rectangles
    words_in_box = [word for word in words_in_box if not any(fitz.Rect(word[:4]).intersects(annot_rect) for annot_rect in annotations)]
    return words_in_box
def print_text_in_boxes(pdf_path):
    material_box_def = (40, 85, 70, 705)  # Adjusted the dimensions
    po_number_box_def = (330, 40, 570, 115)  # PO number box dimensions
    destination_box_def = (40, 200, 330, 300)  # Destination box dimensions
    other_boxes_definitions = {
        'product_code': (72, -2, 131, 15),  # Adjusted to avoid concatenation
        'due_date': (217, -2, 270, 15),
        'qty': (286, -2, 314, 15),
        'net_price': (350, -2, 399, 15),
        'material_revision_box': (80, 15, 192, 80)  # Added box for Material Revision
    }
    po_number = None  # Initialize PO number as None
    destination = "Unknown"  # Default destination as "Unknown"
    extracted_data = []
    try:
        doc = fitz.open(pdf_path)
        for page_num in range(len(doc)):
            page = doc[page_num]
            # Get annotation rectangles
            annotations = []
            if page.annots():
                for annot in page.annots():
                    annotations.append(annot.rect)
            # Extract PO number from the first page only
            if page_num == 0:
                po_number_rect = fitz.Rect(*po_number_box_def)
                words_in_po_number_box = extract_words_from_box(page, po_number_rect, annotations)
                po_numbers = [word[4] for word in words_in_po_number_box if word[4].isdigit() and len(word[4]) == 10]
                if po_numbers:
                    po_number = po_numbers[0]
                destination_rect = fitz.Rect(*destination_box_def)
                words_in_destination_box = extract_words_from_box(page, destination_rect, annotations)
                destination_text = ' '.join(word[4] for word in words_in_destination_box)
                destination = determine_destination(destination_text)
            material_rect = fitz.Rect(*material_box_def)
            words_in_material_box = extract_words_from_box(page, material_rect, annotations)
            item_numbers = [word for word in words_in_material_box if word[4].isdigit() and len(word[4]) == 5]
            for item_number in item_numbers:
                item_number_rect = fitz.Rect(item_number[:4])
                item_info = extract_item_info(page, item_number_rect, other_boxes_definitions, po_number, destination, annotations)
                # Separate product code and name
                if 'product_code' in item_info:
                    description_parts = item_info['product_code'].split(' ', 1)
                    if len(description_parts) == 2:
                        item_info['product_code'] = description_parts[0]
                        item_info['product_name'] = description_parts[1]
                    else:
                        item_info['product_code'] = description_parts[0]
                        item_info['product_name'] = 'No text found'
                # Convert due date to Finnish format
                if 'due_date' in item_info:
                    item_info['due_date'] = convert_date_format(item_info['due_date'])
                extracted_data.append(item_info)  # Add extracted item info to the data list
    except Exception as e:
        print(f"Error processing {pdf_path}: {str(e)}")
    return extracted_data
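To make the failure concrete, here is a small sketch of the rectangle filter in isolation. It does not use PyMuPDF: plain `(x0, y0, x1, y1)` tuples stand in for `fitz.Rect`, the `intersects` helper is a simplified stand-in I wrote for illustration, and every coordinate and word is invented. It only shows why a purely geometric test cannot tell annotation text apart from page text that happens to sit under an annotation.

```python
def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

# Words shaped like the first five fields of page.get_text("words").
# All values here are made up for the example.
words = [
    (72, 100, 110, 112, "12345"),    # real page text, covered by a highlight
    (120, 100, 180, 112, "Widget"),  # real page text, no annotation nearby
    (300, 98, 360, 114, "checked"),  # text that belongs to a text-box annotation
]
annotation_rects = [
    (70, 98, 112, 114),   # highlight drawn over "12345"
    (298, 96, 362, 116),  # text box containing "checked"
]

# Same idea as the filter in extract_words_from_box: drop anything that
# touches an annotation rectangle.
kept = [w[4] for w in words
        if not any(intersects(w[:4], r) for r in annotation_rects)]
print(kept)  # ['Widget']
```

The text-box word is removed as intended, but so is the highlighted item number, which is exactly the information I need to keep.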
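The idea is to remove all annotations once their rectangles are known. As long as there is no direct way to make text extraction ignore annotation text, this serves as a workaround.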
import pymupdf
# retrieve all annotation rectangles
annotations = [annot.rect for annot in page.annots()]
# delete all annotations
doc.xref_set_key(page.xref, "Annots", "null") # remove all annotations
# update the page
page = doc.reload_page(page)
# now extract text using annotation rectangles
....
Note that the steps above do not change the PDF file itself. You can restore the original state with:
doc.close()
doc = pymupdf.open(doc.name)
page = doc[pno]
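Put together, the steps above could be wrapped in a small helper along these lines. This is only a sketch: the function name `words_without_annotations` and its return shape are my own choices, and it assumes `doc` is a document already opened with PyMuPDF. It uses only the calls shown above plus `page.get_text("words")` from the question.

```python
def words_without_annotations(doc, pno):
    """Return (words, annotation_rects) for page `pno`, leaving annotation text out.

    `doc` is assumed to be an open PyMuPDF document. The change is made in
    memory only; nothing is saved back to the file.
    """
    page = doc[pno]
    # Remember where the annotations were, in case the caller needs the rectangles.
    annotation_rects = [annot.rect for annot in page.annots()]
    # Drop the page's /Annots entry, then reload so the page no longer sees them.
    doc.xref_set_key(page.xref, "Annots", "null")
    page = doc.reload_page(page)
    # Words now come from the page content only.
    return page.get_text("words"), annotation_rects
```

In the question's loop this would be called once per page, for example `words, rects = words_without_annotations(doc, page_num)`, with the box filtering then applied to `words` and the annotation exclusion step dropped.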