Using PyMuPDF to remove a PDF watermark or reassemble multiple images

I have been using PyMuPDF to remove watermarks from some PDF documents.


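In most cases, the watermark is just PDF text overlaid on top of an image (the actual PDF content). If I extract that image and place it on a new PDF page, I get the original page without the watermark.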

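In some cases, however, although the watermark is still just text overlaid on top of the image, the actual PDF page is split into multiple images. I can extract these images, but I cannot reassemble them into the original page.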

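I am looking for an alternative way to remove the watermark, or a way to put the images back together so that they look like the original PDF page.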


    import base64
    import json
    import os

    import fitz  # PyMuPDF

    # Open the original pdf file (input_folder and single_filename are set elsewhere)
    doc = fitz.open(os.path.join(input_folder, single_filename))

    # Initialize a new PDF to hold the images
    pdf_output = fitz.open()

    # Iterate through pages in the document
    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        output = json.loads(page.get_text("json"))
        # Only the first block is examined, so this handles single-image pages only
        if "blocks" in output and len(output["blocks"]) > 0 and "image" in output["blocks"][0]:
            base64_string = output["blocks"][0]["image"]

            # Decode the Base64 string
            image_data = base64.b64decode(base64_string)

            # Build a pixmap from the decoded image bytes
            img_pix = fitz.Pixmap(image_data)

            # Create a new page with dimensions of the image
            pdf_page = pdf_output.new_page(width=img_pix.width, height=img_pix.height)

            # Insert the image into the new page
            pdf_page.insert_image(pdf_page.rect, pixmap=img_pix)

    # Save once, after all pages have been processed
    pdf_output.save("without_watermark/" + single_filename)
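
For the pages that are split into several images, one possible approach is to keep each image at its original position instead of taking only the first block. The following is a minimal sketch, not a tested solution: it assumes the image blocks returned by `page.get_text("dict")` carry a usable `bbox`, that the pages are not rotated, and that the order of the blocks matches the drawing order. The names `image_blocks` and `rebuild_without_text` are hypothetical helpers introduced here for illustration.

```python
def image_blocks(page_dict):
    """Return (bbox, image_bytes) for every image block in a page dict."""
    return [
        (tuple(block["bbox"]), block["image"])
        for block in page_dict.get("blocks", [])
        if block.get("type") == 1  # type 1 marks an image block, type 0 is text
    ]


def rebuild_without_text(src_path, dst_path):
    """Copy only the images of each page to a new PDF, keeping their positions."""
    # Imported here so that image_blocks() can be used without PyMuPDF installed.
    import pymupdf

    src = pymupdf.open(src_path)
    out = pymupdf.open()
    for page in src:
        # The new page gets the size of the source page, not of a single image.
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        for bbox, data in image_blocks(page.get_text("dict")):
            # Place each image at the rectangle it occupied on the source page.
            new_page.insert_image(pymupdf.Rect(bbox), stream=data, keep_proportion=False)
    out.save(dst_path)
```

Because the text blocks are simply never copied, the watermark text is left behind. Transparency masks and overlapping images may not survive this round trip, so the result should be compared against the source page.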

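Based on a GitHub discussion about removing background text that overlaps other text, I tried the following code. It works fine with the current PyMuPDF version.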

pip install PyMuPDF

import pymupdf

def process_page(page : pymupdf.Page):
    """Process one page."""
    doc = page.parent  # the page's owning document (needed for update_stream below)
    page.clean_contents()  # clean page painting syntax, merging /Contents into one stream
    xref = page.get_contents()[0]  # get xref of resulting /Contents
    changed = 0  # this will be returned
    # read sanitized contents, split by line breaks
    cont_lines = page.read_contents().splitlines()
    # print(cont_lines)
    for i in range(len(cont_lines)):  # iterate over the lines
        line = cont_lines[i]
        # print(line)
        if not (line.startswith(b"/Artifact") and b"/Watermark" in line):
            continue  # this was not for us
        # line number i starts the definition, j ends it:
        j = cont_lines.index(b"EMC", i)
        for k in range(i, j):
            # look for image / xobject invocations in this line range
            do_line = cont_lines[k]
            if do_line.endswith(b"Do"):  # this invokes an image / xobject
                cont_lines[k] = b""  # remove / empty this line
                changed += 1
    if changed > 0:  # if we did anything, write back modified /Contents
        doc.update_stream(xref, b"\n".join(cont_lines))
    return changed

fpath = 'your_pdf_file_path/file_name.pdf'
doc = pymupdf.open(fpath)
changed = 0  # indicates successful removals
for page in doc:
    changed += process_page(page)  # increase number of changes
if changed > 0:
    x = "s" if doc.page_count > 1 else ""
    print(f"{changed} watermarks have been removed on {doc.page_count} page{x}.")
    doc.ez_save(doc.name.replace(".pdf", "-nowm.pdf"))
else:
    print("Nothing to change")
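
Note that `process_page` only blanks the `Do` lines, which invoke images or XObjects. If the watermark is drawn as text inside the artifact block, a variant that drops the whole marked-content span might work better. The sketch below is an assumption-laden illustration: `drop_watermark_artifacts` is a hypothetical helper, it expects the cleaned content stream with one operator per line, and it assumes that removing the span does not unbalance any `q`/`Q` graphics-state pairs.

```python
def drop_watermark_artifacts(lines):
    """Return (kept_lines, removed_spans) with watermark artifact spans removed."""
    kept = []
    removed = 0
    depth = 0  # greater than 0 while inside a watermark artifact span
    for line in lines:
        stripped = line.strip()
        if depth == 0:
            # Same test as in process_page: an /Artifact span tagged /Watermark.
            if stripped.startswith(b"/Artifact") and b"/Watermark" in stripped:
                depth = 1
                removed += 1
                continue
            kept.append(line)
        elif stripped.endswith((b"BDC", b"BMC")):
            depth += 1  # nested marked content opens inside the span
        elif stripped == b"EMC":
            depth -= 1  # closes either a nested span or the artifact itself
        # every other line inside the span is dropped
    return kept, removed
```

The returned lines could then be joined with `b"\n"` and written back with `doc.update_stream(xref, ...)`, in the same way `process_page` does.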

