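I have been using PyMuPDF to remove watermarks from some PDF documents.
However, some files have proven more difficult than others.
In most cases, the watermark is just PDF text overlaid on top of an image (the actual PDF content). If I take that image and place it on a new PDF page, I get the original page without the watermark.
In some cases, however, although the watermark is still just text overlaid on top of an image, the actual PDF page is broken up into multiple images. I can extract these images, but I cannot reassemble them into the original page.
I am looking for an alternative way to remove the watermark, or a way to put the images back together so that they look like the original PDF page.
My code currently looks like this: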
import base64
import json
import os

import fitz  # PyMuPDF

# input_folder and single_filename are defined elsewhere
# Open the original PDF file
doc = fitz.open(os.path.join(input_folder, single_filename))
# Initialize a new PDF to hold the images
pdf_output = fitz.open()
# Iterate through the pages in the document
for page_num in range(doc.page_count):
    page = doc.load_page(page_num)
    output = json.loads(page.get_text("json"))
    # Only handle pages whose first block is an image
    if "blocks" in output and len(output["blocks"]) > 0 and "image" in output["blocks"][0]:
        base64_string = output["blocks"][0]["image"]
        # Decode the Base64 string
        image_data = base64.b64decode(base64_string)
        # Build a pixmap from the image bytes
        img_pix = fitz.Pixmap(image_data)
        # Create a new page with the dimensions of the image
        pdf_page = pdf_output.new_page(width=img_pix.width, height=img_pix.height)
        # Insert the image into the new page
        pdf_page.insert_image(pdf_page.rect, pixmap=img_pix)
# Save once after the loop; a document without pages cannot be saved
if pdf_output.page_count > 0:
    pdf_output.save("without_watermark/" + single_filename)
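For the pages that are split into several images, one possible way to put them back together is to place every image on a new page at the rectangle it occupies on the original page. The following is only a sketch under some assumptions: the pages are not rotated, the images are drawn without rotation or masks, and the helper names `image_placements` and `rebuild_page_from_images` are made up for illustration. It relies on the PyMuPDF calls `Page.get_image_info(xrefs=True)`, `Document.extract_image()` and `Page.insert_image()`.

```python
def image_placements(image_infos):
    """Return (xref, bbox) pairs for the images that can be re-inserted.

    image_infos is the list returned by page.get_image_info(xrefs=True).
    Inline images report xref 0 and cannot be extracted by xref, and
    images with an empty rectangle cannot be placed, so both are skipped.
    """
    placements = []
    for info in image_infos:
        xref = info.get("xref", 0)
        x0, y0, x1, y1 = info["bbox"]
        if xref > 0 and x1 > x0 and y1 > y0:
            placements.append((xref, (x0, y0, x1, y1)))
    return placements


def rebuild_page_from_images(src_page, out_doc):
    """Append a page to out_doc that holds only the images of src_page.

    Returns the number of images that were placed. Text drawn on top of
    the images (such as a text watermark) is not copied.
    """
    src_doc = src_page.parent  # the document that owns the page
    new_page = out_doc.new_page(
        width=src_page.rect.width, height=src_page.rect.height
    )
    placements = image_placements(src_page.get_image_info(xrefs=True))
    for xref, bbox in placements:
        image_bytes = src_doc.extract_image(xref)["image"]
        # keep_proportion=False lets the image fill its original rectangle
        new_page.insert_image(bbox, stream=image_bytes, keep_proportion=False)
    return len(placements)


# Usage sketch ("input.pdf" and "output.pdf" are placeholder names):
#   import fitz
#   src = fitz.open("input.pdf")
#   out = fitz.open()
#   for page in src:
#       rebuild_page_from_images(page, out)
#   out.save("output.pdf")
```

Because the new page keeps the size of the original page and each image goes back to its own rectangle, the tiles should line up as they did before. Any real (non-watermark) text on the page would be lost with this approach.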
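I tried the code below, which is based on a GitHub discussion about removing background text that overlaps other text. It works fine with the current PyMuPDF version.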
pip install PyMuPDF
import pymupdf


def process_page(page: pymupdf.Page) -> int:
    """Process one page and return the number of removed invocations."""
    doc = page.parent  # the page's owning document
    # page.clean_contents()  # clean page painting syntax
    # writing back to the first xref assumes a single /Contents stream
    xref = page.get_contents()[0]  # get xref of the /Contents stream
    changed = 0  # this will be returned
    # read the contents, split at line breaks
    cont_lines = page.read_contents().splitlines()
    print(len(cont_lines))
    for i in range(len(cont_lines)):  # iterate over the lines
        line = cont_lines[i]
        if not (line.startswith(b"/Artifact") and b"/Watermark" in line):
            continue  # this was not for us
        # line number i starts the definition, j ends it:
        print(line)
        j = cont_lines.index(b"EMC", i)
        for k in range(i, j):
            # look for image / xobject invocations in this line range
            do_line = cont_lines[k]
            if do_line.endswith(b"Do"):  # this invokes an image / xobject
                cont_lines[k] = b""  # remove / empty this line
                changed += 1
    if changed > 0:  # if we did anything, write back modified /Contents
        doc.update_stream(xref, b"\n".join(cont_lines))
    return changed


fpath = 'your_pdf_file_path/file_name.pdf'
doc = pymupdf.open(fpath)
changed = 0  # indicates successful removals
for page in doc:
    changed += process_page(page)  # increase number of changes
if changed > 0:
    x = "s" if doc.page_count > 1 else ""
    print(f"{changed} watermarks have been removed on {doc.page_count} page{x}.")
    doc.ez_save(doc.name.replace(".pdf", "-nowm.pdf"))
else:
    print("Nothing to change")
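If you want to check the line filtering without a PDF at hand, the same logic can be pulled out into a function that works on the raw content bytes only. This is a sketch, not part of PyMuPDF: the name `strip_watermark_invocations` is made up, and it assumes the content stream has one operator per line (which is what `page.clean_contents()` is meant to produce). Unlike the loop above, it leaves the rest of the stream untouched when a watermark block has no closing `EMC`, instead of raising `ValueError`.

```python
from __future__ import annotations


def strip_watermark_invocations(content: bytes) -> tuple[bytes, int]:
    """Blank out 'Do' lines inside /Artifact ... /Watermark blocks.

    Returns the modified content and the number of blanked lines.
    """
    lines = content.splitlines()
    removed = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(b"/Artifact") and b"/Watermark" in line:
            try:
                end = lines.index(b"EMC", i)  # end of the marked content
            except ValueError:
                break  # no closing EMC: leave the remaining lines as they are
            for k in range(i, end):
                if lines[k].endswith(b"Do"):  # image / xobject invocation
                    lines[k] = b""
                    removed += 1
            i = end  # continue after the block
        i += 1
    return b"\n".join(lines), removed


# Usage sketch with PyMuPDF (page and doc as in the code above):
#   cleaned, n = strip_watermark_invocations(page.read_contents())
#   if n > 0:
#       doc.update_stream(page.get_contents()[0], cleaned)
```

Note that this only removes watermarks that are drawn through `Do` (an image or form XObject) inside a tagged watermark artifact. A watermark written directly as text operators would not be touched.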