Compare two PDFs and highlight the differences using Python

Question

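Every day I receive two PDFs whose format and data are mostly the same, but sometimes they contain inconsistencies. At the moment I have to check them side by side by hand to record any errors. How can I automate this with Python so that the differences are highlighted?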

import fitz  # PyMuPDF
from textblob import TextBlob

# Function to extract text from a PDF

def extract_text(pdf_path):
    text = ""
    doc = fitz.open(pdf_path)
    for page in doc:
        text += page.get_text()
    return text

# Function to compare PDFs and highlight differing words in red

def compare_and_highlight(pdf_path1, pdf_path2, output_path):
    text1 = extract_text(pdf_path1)
    text2 = extract_text(pdf_path2)

    # Create TextBlob objects for both texts
    blob1 = TextBlob(text1)
    blob2 = TextBlob(text2)
    
    # Find the words that are different between the two TextBlob objects
    differing_words = set(blob2.words) - set(blob1.words)
    
    doc = fitz.open(pdf_path2)
    
    # Track highlighted words to avoid duplicates
    highlighted_words = set()
    
    for page in doc:
        for word in differing_words:
            if word not in highlighted_words:
                for inst in page.search_for(word):
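                    # Highlight the differing word; PyMuPDF highlights are
                    # yellow by default, so set the stroke color to red
                    highlight = page.add_highlight_annot(inst)
                    highlight.set_colors(stroke=(1, 0, 0))
                    highlight.update()
                    highlighted_words.add(word)  # Mark the word as highlighted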
    
    # Save the modified PDF with highlighted differing words
    doc.save(output_path)
    doc.close()

# Input PDF file paths

pdf_path1 = 's1.pdf'
pdf_path2 = 's2.pdf'

# Output PDF file path with differing words highlighted in red

output_path = 'out.pdf'

# Compare PDFs and highlight differing words in red

compare_and_highlight(pdf_path1, pdf_path2, output_path)

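This code works, but it also highlights text that is correct, such as repeated words. Suppose s1.pdf contains the word "moment", which is correct, while s2.pdf contains "moment" and a second "moment". The code highlights both, but I need it to highlight only the one that actually differs.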

Answer 1

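For this, we need to keep track of the positions of the words. As you said, the word "moment" occurs many times in the document, but you only want to highlight the extra one.

Do you have a way to solve this problem?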
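
One possible way to track positions, offered as a rough sketch rather than a definitive solution: read each PDF as an ordered list of words with their rectangles via `page.get_text("words")`, align the two word sequences with the standard library's `difflib.SequenceMatcher`, and highlight only the words in the second PDF that the alignment reports as inserted or replaced. The helper names `extra_word_indices` and `highlight_extra_words` are made up for this illustration, and the sketch assumes both PDFs yield their words in a comparable reading order.

```python
from difflib import SequenceMatcher


def extra_word_indices(words1, words2):
    """Return indices into words2 of words that are inserted or
    changed relative to words1 (hypothetical helper)."""
    matcher = SequenceMatcher(a=words1, b=words2, autojunk=False)
    indices = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            indices.extend(range(j1, j2))
    return indices


def words_with_positions(doc):
    """Flatten a document into [(page_number, rect_tuple, word), ...]."""
    entries = []
    for page in doc:
        # Each item is (x0, y0, x1, y1, word, block_no, line_no, word_no)
        for w in page.get_text("words"):
            entries.append((page.number, w[:4], w[4]))
    return entries


def highlight_extra_words(pdf_path1, pdf_path2, output_path):
    """Highlight, in a copy of the second PDF, only the word occurrences
    that have no counterpart in the first PDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so the diff helper runs without it

    doc1 = fitz.open(pdf_path1)
    doc2 = fitz.open(pdf_path2)
    entries1 = words_with_positions(doc1)
    entries2 = words_with_positions(doc2)

    extra = extra_word_indices(
        [word for _, _, word in entries1],
        [word for _, _, word in entries2],
    )
    for idx in extra:
        page_no, rect, _word = entries2[idx]
        page = doc2[page_no]
        # Highlight this specific occurrence by its rectangle,
        # not every match of the word on the page
        page.add_highlight_annot(fitz.Rect(rect))

    doc2.save(output_path)
    doc1.close()
    doc2.close()
```

Calling `highlight_extra_words('s1.pdf', 's2.pdf', 'out.pdf')` would then mark a single occurrence of a duplicated word instead of all of them, because the highlight is tied to the word's position in the sequence rather than to its text. When a word is repeated back to back, the alignment cannot tell which copy is the "extra" one and simply picks one of them.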
