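Every day I receive two PDFs whose format and data are largely the same, but they sometimes contain inconsistencies. At the moment I have to check them side by side by hand to record any errors. How can I automate this process with Python by highlighting the differences?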
import fitz  # PyMuPDF
from textblob import TextBlob

# Function to extract text from a PDF
def extract_text(pdf_path):
    text = ""
    doc = fitz.open(pdf_path)
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Function to compare PDFs and highlight differing words in red
def compare_and_highlight(pdf_path1, pdf_path2, output_path):
    text1 = extract_text(pdf_path1)
    text2 = extract_text(pdf_path2)
    # Create TextBlob objects for both texts
    blob1 = TextBlob(text1)
    blob2 = TextBlob(text2)
    # Find the words that appear in the second text but not in the first
    differing_words = set(blob2.words) - set(blob1.words)
    doc = fitz.open(pdf_path2)
    # Track highlighted words to avoid duplicates
    highlighted_words = set()
    for page in doc:
        for word in differing_words:
            if word not in highlighted_words:
                # search_for returns a rectangle for every match on the page
                for inst in page.search_for(word):
                    # Highlight the differing word with a red background
                    highlight = page.add_highlight_annot(inst)
                    highlight.set_colors(stroke=(1, 0, 0))  # default is yellow
                    highlight.update()
                    highlighted_words.add(word)  # Mark the word as highlighted
    # Save the modified PDF with highlighted differing words
    doc.save(output_path)
    doc.close()

# Input PDF file paths
pdf_path1 = 's1.pdf'
pdf_path2 = 's2.pdf'
# Output PDF file path with differing words highlighted in red
output_path = 'out.pdf'
# Compare PDFs and highlight differing words in red
compare_and_highlight(pdf_path1, pdf_path2, output_path)
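This code works, but it also highlights text that is correct, such as repeated words. Suppose s1.pdf contains the word "moment", which is correct, and s2.pdf contains "moment" and another "moment": it highlights both, when only the one that differs should be highlighted.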
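Part of the difficulty seems to be that a set difference carries neither counts nor positions, and `page.search_for` returns every match on a page. A small sketch with made-up sentences (not taken from the PDFs in question) shows what gets lost:

```python
from collections import Counter

# Hypothetical word lists standing in for the text of s1.pdf and s2.pdf.
words1 = "the moment has come".split()
words2 = "the moment has come at this moment".split()

# A set keeps one copy of each word, so the repeated "moment" disappears.
only_in_second = set(words2) - set(words1)
print(sorted(only_in_second))  # ['at', 'this']

# A Counter keeps counts, so the surplus copy shows up, but not where it is.
surplus = Counter(words2) - Counter(words1)
print(surplus["moment"])  # 1
```

Knowing that one "moment" is surplus is still not enough to highlight it, because nothing here says which occurrence on which page is the extra one.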
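For that, we need to track the positions of the words. As you said, the document contains many occurrences of "moment", but you only want to highlight the extra one.
Do you have a way to solve this?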
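One possible way to track positions, offered as a sketch rather than a definitive answer: read each word together with its rectangle via `page.get_text("words")`, diff the two word sequences in order with `difflib.SequenceMatcher`, and highlight only the rectangles of words that the diff marks as inserted or replaced in the second PDF. The helper names `extra_word_indices`, `page_words`, and `highlight_extra_words` are hypothetical, and the sketch assumes both PDFs extract their words in a comparable reading order.

```python
import difflib


def extra_word_indices(words1, words2):
    """Indices into words2 of words inserted or replaced relative to words1."""
    # autojunk=False stops frequent words from being ignored in long documents.
    matcher = difflib.SequenceMatcher(a=words1, b=words2, autojunk=False)
    indices = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            indices.extend(range(j1, j2))
    return indices


def page_words(pdf_path):
    """Return (page_number, (x0, y0, x1, y1), word) for every word in the PDF."""
    import fitz  # PyMuPDF, imported here so the diff helper works without it

    entries = []
    doc = fitz.open(pdf_path)
    for page in doc:
        # Each item is (x0, y0, x1, y1, word, block_no, line_no, word_no).
        for x0, y0, x1, y1, word, *_ in page.get_text("words"):
            entries.append((page.number, (x0, y0, x1, y1), word))
    doc.close()
    return entries


def highlight_extra_words(pdf_path1, pdf_path2, output_path):
    """Highlight in red only those words of pdf_path2 that the diff flags."""
    import fitz  # PyMuPDF

    entries1 = page_words(pdf_path1)
    entries2 = page_words(pdf_path2)
    changed = extra_word_indices(
        [entry[2] for entry in entries1],
        [entry[2] for entry in entries2],
    )
    doc = fitz.open(pdf_path2)
    for index in changed:
        page_number, rect, _word = entries2[index]
        annot = doc[page_number].add_highlight_annot(fitz.Rect(rect))
        annot.set_colors(stroke=(1, 0, 0))  # red instead of the default yellow
        annot.update()
    doc.save(output_path)
    doc.close()


if __name__ == "__main__":
    first = "the moment has come".split()
    second = "the moment has come at this moment".split()
    print(extra_word_indices(first, second))  # [4, 5, 6]
```

With the file names from the question, the call would be `highlight_extra_words('s1.pdf', 's2.pdf', 'out.pdf')`. When a word is simply duplicated next to itself, the diff has to pick one of the two copies as the extra one, so the highlighted copy may not be the one a human would choose. Words split differently by the two PDFs (for example through hyphenation) would also show up as differences.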