I'm using pypdf and, so far, this is my attempt at extracting the internal links:
import pypdf

pdf_path = "<pdf_path>"  # replace with the path to your PDF
internal_links = []
with open(pdf_path, 'rb') as pdf_file:
    reader = pypdf.PdfReader(pdf_file)
    for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]
        if "/Annots" in page:
            for annot in page["/Annots"]:
                annot_obj = annot.get_object()
                if annot_obj["/Subtype"] == "/Link":
                    if "/Dest" in annot_obj:
                        dest = annot_obj["/Dest"]
                        print(f"page: {page_num} dest: {dest}")
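But I don't know how to get the page number from dest. I've looked through the pypdf codebase and at the objects it returns, but I'm still stuck. Any help is appreciated :)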
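One possible way to get from dest to a page number, offered as a sketch rather than a definitive answer: an explicit destination is an array whose first element is a reference to the target page, so it can be matched against the object numbers of the pages in reader.pages. A named destination has to be looked up first, here through reader.named_destinations. The function names build_page_index, dest_to_page_number and internal_links are made up for this example, and the code assumes a pypdf version that exposes page.indirect_reference (3.x or later). It also covers links stored as a /GoTo action under /A instead of a direct /Dest.

```python
def build_page_index(pages):
    """Map each page's PDF object number to its zero-based page index."""
    return {page.indirect_reference.idnum: i for i, page in enumerate(pages)}


def dest_to_page_number(dest, page_index, named_destinations=None):
    """Return the zero-based target page of a destination, or None if unresolved."""
    if isinstance(dest, str):
        # Named destination: resolve the name to a destination object first.
        named = (named_destinations or {}).get(dest)
        target = getattr(named, "page", None)
    elif isinstance(dest, (list, tuple)) and dest:
        # Explicit destination, e.g. [page_ref, /XYZ, left, top, zoom].
        target = dest[0]
    else:
        return None
    # The target is expected to be an indirect reference carrying an object number.
    return page_index.get(getattr(target, "idnum", None))


def internal_links(pdf_path):
    """Yield (source_page, target_page) pairs for links that stay inside the PDF."""
    import pypdf  # imported here so the two helpers above have no dependencies

    reader = pypdf.PdfReader(pdf_path)
    page_index = build_page_index(reader.pages)
    named = reader.named_destinations
    for page_num, page in enumerate(reader.pages):
        if "/Annots" not in page:
            continue
        for annot in page["/Annots"]:
            annot_obj = annot.get_object()
            if annot_obj["/Subtype"] != "/Link":
                continue
            dest = None
            if "/Dest" in annot_obj:
                dest = annot_obj["/Dest"]
            elif "/A" in annot_obj:
                action = annot_obj["/A"]
                # A /GoTo action stores its destination under /D.
                if "/S" in action and action["/S"] == "/GoTo" and "/D" in action:
                    dest = action["/D"]
            target = dest_to_page_number(dest, page_index, named)
            if target is not None:
                yield page_num, target
```

Both page numbers are zero-based, like the index into reader.pages. The lookup ignores generation numbers, which should be fine for ordinary files but is an assumption.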
Here is what I found.
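for i in range(len(reader.pages)):
    page = reader.pages[i]
    # Extract the text from the page
    pdf_text = page.extract_text()
    # Print every URL linked from the page
    if "/Annots" in page:
        for annot in page["/Annots"]:
            annot_obj = annot.get_object()
            # Only links with an /A action holding a /URI entry carry a URL; skip the rest
            if annot_obj["/Subtype"] == "/Link" and "/A" in annot_obj and "/URI" in annot_obj["/A"]:
                dest = annot_obj["/A"]["/URI"]
                print(f"page: {i} dest: {dest}")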
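Note that the /A and /URI lookup only finds external URLs. As a rough sketch of how the two kinds of link could be told apart, the helper below (classify_link is a made-up name, not part of pypdf) inspects the annotation dictionary and reports whether it holds a URL or an internal destination. It relies only on dictionary lookups, so it works on the objects returned by annot.get_object().

```python
def classify_link(annot_obj):
    """Return ("uri", url), ("internal", dest), or None for a /Link annotation."""
    # A direct /Dest entry always points inside the document.
    if "/Dest" in annot_obj:
        return ("internal", annot_obj["/Dest"])
    if "/A" not in annot_obj:
        return None
    action = annot_obj["/A"]
    if "/S" not in action:
        return None
    kind = action["/S"]
    # /URI actions hold an external URL under the /URI key.
    if kind == "/URI" and "/URI" in action:
        return ("uri", action["/URI"])
    # /GoTo actions hold an internal destination under the /D key.
    if kind == "/GoTo" and "/D" in action:
        return ("internal", action["/D"])
    return None
```

Anything it returns as "internal" still needs to be resolved to a page number separately.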