How can I use Python to extract highlighted text together with the comments attached to those highlights?

For example, suppose I have a PDF file in which the highlighted text reads "Sale price of $100", and a comment attached to that highlight reads "Replace with $99".

Ideally, I would like to link each piece of highlighted text to the actual comment added to the PDF for it, whether in a DataFrame, a table, or JSON.

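| Highlighted Text   | Comment          |
|--------------------|------------------|
| Sale price of $100 | Replace with $99 |
| Cell 3             | Cell 4           |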
    import pymupdf

    doc = pymupdf.open("demo.pdf")
    print("Number of pages:", doc.page_count)  # Check the number of pages

    for i in range(doc.page_count):
        page = doc[i]
        annotations = list(page.annots())  # Convert the generator to a list
        if annotations:  # Check if there are any annotations
            print(f"Page {i+1} has {len(annotations)} annotations:")
            for annot in annotations:
                print(annot.info)  # Print detailed annotation info
        else:
            print(f"Page {i+1} has no annotations.")

    doc.close()  # Good practice to close the document
Result:

    Number of pages: 12
    Page 1 has 2 annotations:
    {'content': 'replace with excitment', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
    {'content': 'Remove "rapid progress" from this sentence', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
    Page 2 has no annotations.
    Page 3 has no annotations.
    Page 4 has 1 annotations:
    {'content': 'replace with initial idea generation', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
    Page 5 has no annotations.
    Page 6 has no annotations.
    Page 7 has no annotations.
    Page 8 has no annotations.
    Page 9 has no annotations.
    Page 10 has no annotations.
    Page 11 has no annotations.
    Page 12 has no annotations.
However, this only shows the comments that were added, not the highlighted text that each comment refers to.

python openapi pymupdf
1 Answer
This is what I ended up using:

    import pymupdf
    from tabulate import tabulate

    doc = pymupdf.open("demo.pdf")
    print("Number of pages:", doc.page_count)  # Check the number of pages

    table_data = []  # Rows of [page number, highlighted text, comment]

    for i in range(doc.page_count):
        page = doc[i]
        annotations = list(page.annots())  # Convert the generator to a list
        if annotations:  # Check if there are any annotations
            print(f"Page {i+1} has {len(annotations)} annotations:")
            for annot in annotations:
                if annot.type[1] == "Highlight":  # Only handle highlight annotations
                    # Bounding rectangle of the highlighted area
                    rect = annot.rect
                    # Extract the plain text that falls inside that rectangle
                    highlighted_text = page.get_text("text", clip=rect).strip()
                    # annot.info is a dict; .get() covers a missing 'content' key
                    content = annot.info.get("content", "")
                    # Append the data as a row to the table_data list
                    table_data.append([i + 1, highlighted_text, content])
        else:
            print(f"Page {i+1} has no annotations.")

    doc.close()  # Good practice to close the document

    # Create a table using tabulate
    table = tabulate(table_data, headers=["Page", "Highlighted Text", "Content"], tablefmt="grid")
    print(table)
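One caveat with clipping to `annot.rect`: when a highlight spans several lines, that rectangle is the bounding box of the whole highlight, so it can pick up neighbouring words that were not highlighted. A possible refinement, sketched below, reads the per-line quads from `annot.vertices` and clips to each one separately, then returns the pairs as JSON-ready dicts. The helper names (`quads_to_rects`, `highlight_record`, `extract_highlights`, `to_json`) are made up for this illustration, and it assumes the highlights store their quad points four per line, as PyMuPDF normally reports them. Treat it as a starting point rather than a tested solution for every PDF.

```python
import json


def quads_to_rects(vertices):
    """Group a flat list of (x, y) points into one bounding box per quad."""
    rects = []
    for i in range(0, len(vertices) - 3, 4):
        xs = [p[0] for p in vertices[i:i + 4]]
        ys = [p[1] for p in vertices[i:i + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def highlight_record(page, annot, page_number):
    """Return one dict pairing a highlight's text with its comment."""
    parts = []
    # annot.vertices may be None for annotation types without quad points
    for rect in quads_to_rects(annot.vertices or []):
        text = page.get_text("text", clip=rect).strip()
        if text:
            parts.append(text)
    return {
        "page": page_number,
        "highlighted_text": " ".join(parts),
        "comment": annot.info.get("content", ""),
    }


def extract_highlights(path):
    """Collect a record for every highlight annotation in the PDF at `path`."""
    import pymupdf  # imported here so the helpers above need no third-party package

    records = []
    with pymupdf.open(path) as doc:
        for page in doc:
            for annot in page.annots():
                if annot.type[1] == "Highlight":
                    records.append(highlight_record(page, annot, page.number + 1))
    return records


def to_json(records):
    """Serialize the records; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Calling `print(to_json(extract_highlights("demo.pdf")))` would print the records as JSON. The same list of dicts can be passed to `pandas.DataFrame(records)` or to `tabulate(records, headers="keys")` if a table is preferred. Clipping can still include characters that only partly overlap a quad, so the extracted text may need light cleanup.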
    