In an ideal world, what I'd like to do is link the text that each annotation highlights to the actual comment added to the PDF, whether in a dataframe, a table, or JSON.
import pymupdf
doc = pymupdf.open("demo.pdf")
print("Number of pages:", doc.page_count)  # Check the number of pages
for i in range(doc.page_count):
    page = doc[i]
    annotations = list(page.annots())  # Convert the generator to a list
    if annotations:  # Check if there are any annotations
        print(f"Page {i+1} has {len(annotations)} annotations:")
        for annot in annotations:
            print(annot.info)  # Print detailed annotation info
    else:
        print(f"Page {i+1} has no annotations.")
doc.close()  # Good practice to close the document
Result:
Number of pages: 12
Page 1 has 2 annotations:
{'content': 'replace with excitment', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
{'content': 'Remove "rapid progress" from this sentence', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
Page 2 has no annotations.
Page 3 has no annotations.
Page 4 has 1 annotations:
{'content': 'replace with initial idea generation', 'name': '', 'title': '', 'creationDate': '', 'modDate': '', 'subject': '', 'id': ''}
Page 5 has no annotations.
Page 6 has no annotations.
Page 7 has no annotations.
Page 8 has no annotations.
Page 9 has no annotations.
Page 10 has no annotations.
Page 11 has no annotations.
Page 12 has no annotations.
But this only shows the comments that were added, not the actual highlighted text that each comment refers to.
import pymupdf
from tabulate import tabulate
doc = pymupdf.open("demo.pdf")
print("Number of pages:", doc.page_count)  # Check the number of pages
table_data = []  # Initialize a list to store the table data
for i in range(doc.page_count):
    page = doc[i]
    annotations = list(page.annots())  # Convert the generator to a list
    if annotations:  # Check if there are any annotations
        print(f"Page {i+1} has {len(annotations)} annotations:")
        for annot in annotations:
            if annot.type[1] == "Highlight":  # Check if it's a highlight annotation
                # Get the bounding rectangle of the highlighted area
                rect = annot.rect
                # Extract the text inside that rectangle as plain text
                highlighted_text = page.get_text(option="text", clip=rect)
                # annot.info is a dict; .get() handles a missing 'content' key
                content = annot.info.get("content", "")
                # Append the data as a row to the table_data list
                table_data.append([i+1, highlighted_text, content])
    else:
        print(f"Page {i+1} has no annotations.")
doc.close()  # Good practice to close the document
# Create a table using tabulate
table = tabulate(table_data, headers=["Page", "Highlighted Text", "Content"], tablefmt="grid")
print(table)
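One caveat with clipping to annot.rect: for a highlight that spans several lines, the bounding rectangle can also cover neighbouring text that was not highlighted. A possible refinement, sketched below under assumptions, is to use annot.vertices (the corner points of the highlight, four per highlighted strip) and keep only the words whose centre falls inside one of those strips. The helper names quad_rects, words_in_rects and extract_highlights are made up for this illustration, and the centre-point test is just one reasonable heuristic, not something PyMuPDF prescribes.

```python
def quad_rects(vertices):
    """Group a flat list of (x, y) points into one bounding box per quad (4 points each)."""
    rects = []
    for k in range(0, len(vertices) - 3, 4):
        xs = [p[0] for p in vertices[k:k + 4]]
        ys = [p[1] for p in vertices[k:k + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def words_in_rects(words, rects):
    """Join the words whose centre point lies inside any of the rectangles.

    `words` uses the tuple layout of page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    picked = []
    for x0, y0, x1, y1, text, *_ in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if any(rx0 <= cx <= rx1 and ry0 <= cy <= ry1 for rx0, ry0, rx1, ry1 in rects):
            picked.append(text)
    return " ".join(picked)


def extract_highlights(path):
    """Collect page number, highlighted text and comment for each highlight annotation."""
    import pymupdf  # imported lazily so the two helpers above work without it

    rows = []
    with pymupdf.open(path) as doc:
        for page in doc:
            words = page.get_text("words")  # one tuple per word on the page
            for annot in page.annots():
                if annot.type[1] != "Highlight":
                    continue
                # vertices can be None for some annotations, hence the fallback
                rects = quad_rects(annot.vertices or [])
                rows.append({
                    "page": page.number + 1,
                    "highlighted_text": words_in_rects(words, rects),
                    "comment": annot.info.get("content", ""),
                })
    return rows
```

The returned list of dicts can be passed to json.dumps(rows, indent=2) for JSON, to pandas.DataFrame(rows) for a dataframe, or to tabulate(rows, headers="keys") for a table. Words come back in the order the page stores them, which is usually but not always reading order.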