The Palos Publishing Company


Build a Python script to compare PDFs

Here’s a Python script that compares two PDFs by extracting their text and showing differences. It uses PyMuPDF (fitz) for PDF text extraction and Python’s built-in difflib for comparison.

python
import difflib
import sys

import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, concatenated."""
    # The context manager closes the document once extraction is done.
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


def compare_pdfs(pdf1_path, pdf2_path):
    """Print a unified diff of the text extracted from two PDFs."""
    text1 = extract_text_from_pdf(pdf1_path).splitlines()
    text2 = extract_text_from_pdf(pdf2_path).splitlines()

    # lineterm='' because splitlines() has already dropped the newlines.
    diff = difflib.unified_diff(
        text1, text2, fromfile=pdf1_path, tofile=pdf2_path, lineterm=''
    )
    for line in diff:
        print(line)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python compare_pdfs.py file1.pdf file2.pdf")
        sys.exit(1)
    compare_pdfs(sys.argv[1], sys.argv[2])

How to use:

  1. Install PyMuPDF:

    pip install pymupdf
  2. Save this script as compare_pdfs.py.

  3. Run from the command line:

    python compare_pdfs.py first.pdf second.pdf

The script outputs the line-by-line differences between the text extracted from both PDFs in a unified diff format.
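To make the unified diff format concrete, here is a minimal sketch that runs difflib.unified_diff on two short lists of lines. The sample lines and file names are made up for illustration; they stand in for text extracted from two PDFs.

```python
import difflib

# Hypothetical lines standing in for text extracted from two PDFs.
old_lines = ["Invoice 1001", "Total: 40.00", "Thank you"]
new_lines = ["Invoice 1001", "Total: 45.00", "Thank you"]

diff = list(difflib.unified_diff(
    old_lines, new_lines, fromfile="first.pdf", tofile="second.pdf", lineterm=""
))
for line in diff:
    print(line)

# Prints:
# --- first.pdf
# +++ second.pdf
# @@ -1,3 +1,3 @@
#  Invoice 1001
# -Total: 40.00
# +Total: 45.00
#  Thank you
```

Lines starting with "-" appear only in the first file, lines starting with "+" appear only in the second, and lines starting with a space are unchanged context.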

This script compares extracted text only; it does not detect differences in layout or images.
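One modest step up, still within text comparison, is to check the documents page by page so you can see where they diverge before reading the full diff. The sketch below is an assumption, not part of the original script: the function name changed_pages and its threshold parameter are hypothetical, and it takes plain lists of per-page strings so it runs without any PDF files. With PyMuPDF, such a list could be built with [page.get_text() for page in doc].

```python
import difflib


def changed_pages(pages_a, pages_b, threshold=1.0):
    """Return (page_number, similarity) for pages whose text differs.

    pages_a and pages_b are lists of per-page text strings.
    Hypothetical helper for illustration only.
    """
    results = []
    # zip stops at the shorter document; extra pages are not reported here.
    for number, (a, b) in enumerate(zip(pages_a, pages_b), start=1):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio < threshold:
            results.append((number, round(ratio, 3)))
    return results


if __name__ == "__main__":
    first = ["Intro text", "Total: 40.00"]
    second = ["Intro text", "Total: 45.00"]
    for page_number, similarity in changed_pages(first, second):
        print(f"Page {page_number} differs (similarity {similarity})")
```

A similarity of 1.0 means the page text is identical; lowering threshold would ignore pages with only small differences. If the two documents have different page counts, compare len(pages_a) and len(pages_b) separately.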
