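I'm trying to create a Python script that iterates over every page of a PDF and removes the watermark. Some of the PDFs have more than 500 pages, and at the moment the watermark has to be removed by hand from every page before the file goes to our customers. One problem I've run into is that on some pages the watermark is a text box object, while on others it is an image object. There's no way around it; that's how the system prints these preview files.
I tried writing a script with PyMuPDF that takes the watermark's pixel coordinates and deletes any item with exactly those dimensions. It only partly works: the watermarks aren't all the same (image vs. text), so their dimensions differ. Also, I only want to remove the watermark, not anything underneath it. If anyone knows how I can move forward, I'd really appreciate it!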
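I tried the following code, which is lightly modified from the code in the pymupdf GitHub discussion on finding and removing watermarks in PDF files. It works fine with the current pymupdf version.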
pip install PyMuPDF
import pymupdf


def process_page(page: pymupdf.Page):
    """Process one page."""
    doc = page.parent  # the page's owning document
    page.clean_contents()  # merge and sanitize the page's content streams
    xref = page.get_contents()[0]  # get xref of resulting /Contents
    changed = 0  # this will be returned
    # read sanitized contents, split by line breaks
    cont_lines = page.read_contents().splitlines()
    print(len(cont_lines))  # debug: number of content lines on this page
    for i in range(len(cont_lines)):  # iterate over the lines
        line = cont_lines[i]
        if not (line.startswith(b"/Artifact") and b"/Watermark" in line):
            continue  # this was not for us
        # line number i starts the definition, j ends it:
        print(line)  # debug: the watermark marker that matched
        j = cont_lines.index(b"EMC", i)
        for k in range(i, j):
            # look for image / xobject invocations in this line range
            do_line = cont_lines[k]
            if do_line.endswith(b"Do"):  # this invokes an image / xobject
                cont_lines[k] = b""  # remove / empty this line
                changed += 1
    if changed > 0:  # if we did anything, write back modified /Contents
        doc.update_stream(xref, b"\n".join(cont_lines))
    return changed


fpath = 'your_pdf_file_path/file_name.pdf'
doc = pymupdf.open(fpath)
changed = 0  # indicates successful removals
for page in doc:
    changed += process_page(page)  # increase number of changes
if changed > 0:
    x = "s" if doc.page_count > 1 else ""
    print(f"{changed} watermarks have been removed on {doc.page_count} page{x}.")
    doc.ez_save(doc.name.replace(".pdf", "-nowm.pdf"))
else:
    print("Nothing to change")
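
Note that the script above only blanks lines ending in `Do`, so it catches watermarks drawn as images / XObjects. For pages where the watermark is a text box, one possible variation is to drop the whole marked-content block instead of just the `Do` lines. The sketch below is not from the pymupdf discussion; it is a plain-Python illustration that works on a list of content-stream lines, and the helper names `find_block_end` and `strip_watermark_blocks` are made up for this example. It assumes the watermark is tagged as `/Artifact ... /Watermark ... BDC` and closed by `EMC`, that each operator sits on its own line, and that the block contains nothing but the watermark (with any `q` / `Q` pairs balanced inside it). If your files do not tag the watermark this way, it will not find anything.

```python
def find_block_end(lines, start):
    """Return the index of the EMC line that closes the marked-content
    block opening at or after `start`, or None if there is none.
    Nested BDC/BMC ... EMC pairs are tracked with a depth counter."""
    depth = 0
    for k in range(start, len(lines)):
        tokens = lines[k].split()
        op = tokens[-1] if tokens else b""  # operator is the last token
        if op in (b"BDC", b"BMC"):
            depth += 1
        elif op == b"EMC":
            depth -= 1
            if depth <= 0:
                return k
    return None


def strip_watermark_blocks(lines):
    """Drop every /Artifact ... /Watermark marked-content block.

    `lines` is a list of bytes objects, one content-stream line each.
    Returns (new_lines, number_of_blocks_removed)."""
    out = []
    removed = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(b"/Artifact") and b"/Watermark" in line:
            end = find_block_end(lines, i)
            if end is not None:
                removed += 1
                i = end + 1  # skip the whole block, including EMC
                continue
        out.append(line)  # keep everything that is not a watermark block
        i += 1
    return out, removed


# Hypothetical content lines: page text, a text watermark, an image watermark.
sample = [
    b"q", b"BT", b"(Invoice) Tj", b"ET", b"Q",
    b"/Artifact <</Subtype /Watermark /Type /Pagination>> BDC",
    b"q", b"BT", b"(PREVIEW) Tj", b"ET", b"Q",
    b"EMC",
    b"/Artifact <</Subtype /Watermark /Type /Pagination>> BDC",
    b"q", b"/Im0 Do", b"Q",
    b"EMC",
]
cleaned, removed = strip_watermark_blocks(sample)
print(removed, cleaned)
```

To try it on a real page, the list could come from `page.read_contents().splitlines()` after `page.clean_contents()`, and the result could be written back with `doc.update_stream(xref, b"\n".join(cleaned))`, using the same calls as in the script above. Test it on a copy first, since removing a whole block is more aggressive than blanking individual `Do` lines.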