How to extract internal links using PyPDF

Question (votes: 0, answers: 1)

I am using PyPDF to try to extract internal links. This is what I have so far:

    import pypdf

    internal_links = []
    with open(pdf_path, 'rb') as pdf_file:  # pdf_path: path to the PDF file
        reader = pypdf.PdfReader(pdf_file)
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            if "/Annots" in page:
                for annot in page["/Annots"]:
                    annot_obj = annot.get_object()
                    if annot_obj["/Subtype"] == "/Link":
                        if "/Dest" in annot_obj:
                            dest = annot_obj["/Dest"]
                            print(f"page: {page_num} dest: {dest}")

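But I don't know how to get the page number from dest. I have looked through the pypdf codebase and at the returned objects, but I am still stuck. Any help is appreciated :)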

python pypdf
1 answer
0 votes

This is what I found.

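    # Assumes `reader` is the pypdf.PdfReader created in the question.
    for i, page in enumerate(reader.pages):
        # Extract text from the page
        pdf_text = page.extract_text()
        # Print every URI found in the page's link annotations
        if "/Annots" in page:
            for annot in page["/Annots"]:
                annot_obj = annot.get_object()
                if annot_obj["/Subtype"] == "/Link":
                    # Only links with an /A action holding /URI have a URL;
                    # other links are skipped instead of raising KeyError.
                    if "/A" in annot_obj and "/URI" in annot_obj["/A"]:
                        dest = annot_obj["/A"]["/URI"]
                        print(f"page: {i} dest: {dest}")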
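
The loop above only reads the /URI entry of each link's /A action, which covers links to external addresses. For links that jump inside the document, the target is stored in /Dest or in the /D entry of a /GoTo action, either as an explicit array whose first element refers to the target page or as a named destination. The following is a minimal sketch, not a definitive implementation: build_page_index, explicit_dest_page and internal_links are hypothetical helper names, and it assumes a recent pypdf in which pages expose indirect_reference and the reader provides named_destinations and get_destination_page_number. Page numbers are zero-based.

```python
def build_page_index(pages):
    """Map each page's PDF object number to its zero-based page index."""
    return {page.indirect_reference.idnum: i for i, page in enumerate(pages)}


def explicit_dest_page(dest, page_index):
    """Return the page index for an explicit destination array, else None.

    An explicit destination looks like [page_ref, /XYZ, left, top, zoom];
    its first element refers to the target page.
    """
    if not isinstance(dest, (list, tuple)) or not dest:
        return None
    target = dest[0]
    # The element may be an indirect reference or an already resolved page.
    ref = getattr(target, "indirect_reference", None)
    if ref is None:
        ref = target
    return page_index.get(getattr(ref, "idnum", None))


def internal_links(pdf_path):
    """Yield (source_page, target_page) pairs for links inside the PDF."""
    import pypdf  # imported here so the helpers above stay library-free

    reader = pypdf.PdfReader(pdf_path)
    page_index = build_page_index(reader.pages)
    named = reader.named_destinations  # assumed: name -> Destination
    for page_num, page in enumerate(reader.pages):
        if "/Annots" not in page:
            continue
        for annot in page["/Annots"]:
            annot_obj = annot.get_object()
            if annot_obj.get("/Subtype") != "/Link":
                continue
            if "/Dest" in annot_obj:
                dest = annot_obj["/Dest"]
            elif "/A" in annot_obj and annot_obj["/A"].get("/S") == "/GoTo":
                dest = annot_obj["/A"]["/D"]
            else:
                continue  # for example a /URI link to an external address
            if isinstance(dest, (str, bytes)):
                # Named destination: look it up, then ask pypdf for the page.
                # Decoding bytes as UTF-8 is an assumption of this sketch.
                if isinstance(dest, bytes):
                    key = dest.decode("utf-8", "replace")
                else:
                    key = str(dest)
                target = named.get(key)
                if target is None:
                    target_page = None
                else:
                    target_page = reader.get_destination_page_number(target)
            else:
                target_page = explicit_dest_page(dest, page_index)
            yield page_num, target_page
```

Iterating with something like `for src, dst in internal_links(pdf_path): print(src, dst)` would then list each link's source and target page; a target of None means the destination could not be resolved by this sketch.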