Error getting an image's xref with PyMuPDF via page.get_text("dict")["blocks"]

Question (votes: 0, answers: 1)

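Using the following Python function, I am trying to extract text and images from a PDF document. In addition, I want to insert a label such as f"<<<image_{image_counter}>>>" into the extracted text at the exact position of the corresponding image. This is my Python function: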

import fitz  # PyMuPDF

def extract_text_and_save_images_not_working(pdf_path):

    doc = fitz.open(pdf_path)
    full_text = ""
    image_counter = 1  # Initialize the image counter before iterating through pages
    
    for page_num in range(len(doc)): # Iterate through each page of the pdf document

        page = doc.load_page(page_num) # Load the pdf page
        blocks = page.get_text("dict")["blocks"]  # The list of block dictionaries 
        
        for block in blocks:  # Iterate through each block

            if block['type'] == 0:  # If the block is a text block
                for line in block["lines"]:  # Iterate through lines in the block
                    for span in line["spans"]:  # Iterate through spans in the line
                        full_text += span["text"] + " "  # Append text to full_text
                full_text += "\n"  # Add newline after each block

            elif block['type'] == 1:  # If the block is an image block
                image_label = f"<<<image_{image_counter}>>>"  # Label to insert in the extracted text in place of the corresponding image 
                full_text += f"{image_label}\n"  # Insert image label at the image location
                img = block['image']
                xref = img[0]
                print()
                print(xref)
                print()
                base_image = doc.extract_image(xref)  # Attempt to extract image
                image_bytes = base_image["image"]  # Get the image bytes
                image_filename = f"image_{image_counter}.png"

                with open(image_filename, "wb") as img_file:  # Save the image
                    img_file.write(image_bytes)
                
                image_counter += 1  # Increment counter for next image regardless of extraction success

    doc.close() # Close the pdf document
    return full_text

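Basically, the function retrieves the block dictionaries of each page with blocks = page.get_text("dict")["blocks"] and checks, for each block, whether it is a text block (block['type'] == 0) or an image block (block['type'] == 1). If the block is an image, the function saves the image in the directory from which the script is run, under the name f"image_{image_counter}.png", and adds a label (f"<<<image_{image_counter}>>>") to the extracted text on the line that marks the image's position in the PDF. Now, when I run this function, I get the following error: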

Traceback (most recent call last):
  File "c:\Users\xxxx\Desktop\X_Project\extract_images_from_pdf\extract_text_and_images_from_pdf.py", line 93, in <module>
    extracted_text = extract_text_and_save_images_not_working(pdf_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\xxxx\Desktop\X_Project\extract_images_from_pdf\extract_text_and_images_from_pdf.py", line 76, in extract_text_and_save_images_not_working
    base_image = doc.extract_image(xref)  # Attempt to extract image
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxxx\Desktop\X_Project\extract_images_from_pdf\venv\Lib\site-packages\fitz\__init__.py", line 3894, in extract_image
    raise ValueError( MSG_BAD_XREF)
ValueError: bad xref

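The error makes sense: the variable xref should hold an integer that is the image's cross-reference number, but I get a different integer that is not the correct cross-reference number. Specifically, for the particular PDF document I am working with, I expect xref = 52 but I get xref = 137.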
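A possible explanation for the value 137: in the "dict" output of PyMuPDF, the "image" key of an image block holds the image content as bytes, not an xref, so img[0] is simply the first byte of the image data. The sketch below uses a made-up block (sample_block) rather than a real PDF to show that the first byte of PNG data is 137, which would be consistent with the number reported here. It also shows a hypothetical helper, save_image_block, that writes the block's bytes directly and takes the file extension from the block's "ext" key, assuming the block carries the keys "image" and "ext" as the PyMuPDF documentation describes.

```python
import os

# Every PNG file starts with this 8-byte signature; its first byte is 0x89.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Made-up stand-in for an image block from page.get_text("dict")["blocks"].
sample_block = {
    "type": 1,
    "ext": "png",
    "image": PNG_SIGNATURE + b"placeholder image data",
}

# Indexing a bytes object returns an int, so [0] is the first byte's value.
first_byte = sample_block["image"][0]
print(first_byte)  # 137


def save_image_block(block, counter, out_dir="."):
    """Write an image block's bytes to out_dir and return the file path."""
    filename = f"image_{counter}.{block['ext']}"
    path = os.path.join(out_dir, filename)
    with open(path, "wb") as img_file:
        img_file.write(block["image"])
    return path
```

If only the image files are needed, saving block["image"] this way avoids the xref lookup and the call to doc.extract_image altogether.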

python pdf pymupdf
1 Answer
0 votes

Putting this into a function and calling it should work. The rounding there may not be needed.

import numpy as np  # needed for np.round below

# Get position of current image (round, but not sure this is ever needed)
image_position_in = np.round(block['bbox'], decimals = 3)

# Get all images on page
ims_curr_page = page.get_images()

# Filter to only keep images of matching size
ims_curr_page = [image_curr for image_curr in ims_curr_page if (image_curr[2] == block['width'] and image_curr[3] == block['height'])]

for image_curr in ims_curr_page:
    # Image position
    image_position_curr_all = page.get_image_rects(image_curr[7])

    # As the same image can be reused, loop over set of coordinates
    # In theory the same image can be used across different pages, which is not handled in this code
    for image_position_curr in image_position_curr_all:
        # Round position - not sure it is needed, but must be done as for the target position above
        image_position_curr = np.round(image_position_curr, decimals = 3)

        if all(image_position_curr == image_position_in):
            xref = image_curr[0] 
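
The snippet above can be wrapped into a function along the following lines. This is only a sketch: the name find_image_xref and the tol parameter are made up for illustration, and a small tolerance comparison stands in for the np.round calls so that NumPy is not required. It assumes page behaves like a PyMuPDF page (get_images() returning tuples whose items 0, 2 and 3 are xref, width and height, and get_image_rects() accepting an xref) and that block is an image block from page.get_text("dict").

```python
def find_image_xref(page, block, tol=1e-3):
    """Return the xref of the image displayed in `block`, or None.

    `page` must offer get_images() and get_image_rects(), as a PyMuPDF
    page does; `block` is an image block with bbox, width and height.
    """
    target = tuple(block["bbox"])

    for item in page.get_images():
        xref, width, height = item[0], item[2], item[3]

        # Skip images whose pixel size differs from the block's.
        if (width, height) != (block["width"], block["height"]):
            continue

        # The same image can be placed several times on one page,
        # so compare every rectangle where it appears.
        for rect in page.get_image_rects(xref):
            if all(abs(a - b) <= tol for a, b in zip(tuple(rect), target)):
                return xref

    return None  # no image on this page matches the block
```

In the question's loop, xref = img[0] would then become xref = find_image_xref(page, block), with a check for None before calling doc.extract_image(xref). As in the original snippet, only images listed for the current page are considered.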