使用Langchain的PyPDFLoader时的图像参考

Question

我想使用 langchain 的 PyPDFLoader 从 PDF 导入文本，并将图像替换为文本参考。我想将图像导出为文件并使用参考检索它们。

有没有办法为摘录中包含的图像添加占位符？

Answer 1

Langchain 的 PyPDFLoader 主要专注于从 PDF 文档中提取文本，可能不具备直接处理这些文件中的图像的功能。但是，您可以利用 PyMuPDF 库（通常称为 Fitz）来管理 PDF 中的图像，该库经常与 PyPDFLoader 集成。以下是直接使用 PyMuPDF 库查询的潜在解决方案：

使用 PyMuPDF 将 PDF 图像导出为文件：

def extract_images_from_pdf(pdf_path, output_folder):
    document = fitz.open(pdf_path)
   
for page_num in range(len(document)):
    page = document[page_num]
    image_list = page.get_images(full=True)
    
    for image_index, img in enumerate(image_list, start=1):
        xref = img[0]
        base_image = document.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]  # could be 'png' or 'jpeg'
        image_filename = f"{output_folder}/page_{page_num+1}_img_{image_index}.{image_ext}"
        with open(image_filename, "wb") as img_file:
            img_file.write(image_bytes)
        print(f"Exported: {image_filename}")

使用 PyMuPDF 的参考检索图像：

import fitz

def retrieve_image_by_reference(pdf_path, xref, output_file):
    document = fitz.open(pdf_path)
    base_image = document.extract_image(xref)
    image_bytes = base_image["image"]
    image_ext = base_image["ext"]  # could be 'png' or 'jpeg'

    with open(f"{output_file}.{image_ext}", "wb") as img_file:
        img_file.write(image_bytes)
    print(f"Image saved as: {output_file}.{image_ext}")

使用Langchain的PyPDFLoader时的图像参考

问题描述投票：0回答：1

1个回答

使用 PyMuPDF 将 PDF 图像导出为文件：

使用 PyMuPDF 的参考检索图像：

最新问题

使用Langchain的PyPDFLoader时的图像参考

问题描述 投票：0回答：1

1个回答

使用 PyMuPDF 将 PDF 图像导出为文件：

使用 PyMuPDF 的参考检索图像：

最新问题

问题描述投票：0回答：1