How can I use Python to extract highlighted text and the comments attached to that highlighted text?

Question

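For example, suppose I have a PDF file in which the highlighted text reads "On sale for $100", and a comment attached to that highlight reads "Replace with $99".

Ideally, I would like to link each comment to the highlighted text it refers to, and get the result as a DataFrame, a table, or JSON.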

| Highlighted text | Comment          |
|------------------|------------------|
| On sale for $100 | Replace with $99 |
| Cell 3           | Cell 4           |

import pymupdf

doc = pymupdf.open("demo.pdf")
print("Number of pages:", doc.page_count)  # Check the number of pages

for i in range(doc.page_count):
    page = doc[i]
    annotations = list(page.annots())  # Convert the generator to a list
    if annotations:  # Check if there are any annotations
        print(f"Page {i+1} has {len(annotations)} annotations:")
        for annot in annotations:
            print(annot.info)  # Print detailed annotation info
    else:
        print(f"Page {i+1} has no annotations.")

doc.close()  # Good practice to close the document

Result:

Number of pages: 12
Page 1 has 2 annotations:
{'content': 'replace with excitment', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
{'content': 'Remove "rapid progress" from this sentence', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
Page 2 has no annotations.
Page 3 has no annotations.
Page 4 has 1 annotations:
{'content': 'replace with initial idea generation', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
Page 5 has no annotations.
Page 6 has no annotations.
Page 7 has no annotations.
Page 8 has no annotations.
Page 9 has no annotations.
Page 10 has no annotations.
Page 11 has no annotations.
Page 12 has no annotations.

But this only shows the comments that were added, not the actual highlighted text that each comment refers to.

Answer 1

Here is what I ended up using:

import pymupdf
from tabulate import tabulate

doc = pymupdf.open("demo.pdf")
print("Number of pages:", doc.page_count)  # Check the number of pages

table_data = []  # Initialize a list to store the table data

for i in range(doc.page_count):
    page = doc[i]
    annotations = list(page.annots())  # Convert the generator to a list
    if annotations:  # Check if there are any annotations
        print(f"Page {i+1} has {len(annotations)} annotations:")
        for annot in annotations:
            if annot.type[1] == "Highlight":  # Check if it's a highlight annotation
                # Get the coordinates of the highlighted area
                rect = annot.rect
                # Extract the text inside the annotation's bounding rectangle
                highlighted_text = page.get_text("text", clip=rect).strip()
                # annot.info is a dict; .get() covers a missing 'content' key
                content = annot.info.get("content", "")
                # Append the data as a row to the table_data list
                table_data.append([i+1, highlighted_text, content])

    else:
        print(f"Page {i+1} has no annotations.")

doc.close()  # Good practice to close the document

# Create a table using tabulate
table = tabulate(table_data, headers=["Page", "Highlighted Text", "Content"], tablefmt="grid")
print(table)
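
One caveat with the code above: annot.rect is a single bounding box, so for a highlight that spans several lines it can also pick up unhighlighted text at the start of the first line or the end of the last one. A possible refinement, sketched below, is to clip once per highlighted line using annot.vertices, which PyMuPDF reports as a flat list of points, four per highlighted quad. The helper names (quad_rects, text_under_highlight, rows_to_records) and the record keys are my own and not part of PyMuPDF, and I am assuming get_text() accepts a plain 4-tuple as its rect-like clip argument. The last helper turns the rows into JSON-ready records, since the question also asked for JSON.

```python
import json


def quad_rects(vertices):
    """Group a flat list of (x, y) points into one bounding box per quad.

    A highlight stores four points per highlighted line, so a highlight
    spanning two lines yields eight points and two boxes.
    """
    rects = []
    for k in range(0, len(vertices) - 3, 4):
        xs = [p[0] for p in vertices[k:k + 4]]
        ys = [p[1] for p in vertices[k:k + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def text_under_highlight(page, annot):
    """Join the text found under each quad of a highlight annotation.

    Assumes `page` has get_text(option, clip=...) and `annot` has a
    `vertices` attribute, as PyMuPDF's Page and Annot objects do.
    """
    vertices = annot.vertices or []
    parts = [page.get_text("text", clip=r).strip() for r in quad_rects(vertices)]
    return " ".join(p for p in parts if p)


def rows_to_records(rows):
    """Turn [page, highlighted_text, comment] rows into JSON-ready dicts."""
    keys = ("page", "highlighted_text", "comment")  # hypothetical key names
    return [dict(zip(keys, row)) for row in rows]
```

To use it, replace the get_text(clip=rect, ...) call in the loop with text_under_highlight(page, annot), and after the loop call json.dumps(rows_to_records(table_data), indent=2). Clipping can still catch characters from neighbouring lines that overlap a quad, so treat the result as approximate.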
