Translate PDFs with Python

Translating PDFs with Python involves several steps: extracting text from the PDF, translating the text using a translation API or model, and optionally saving the translated content back into a new PDF. Here’s a complete overview of how to do it, with a focus on using popular Python libraries.

Required Libraries

You’ll need the following Python libraries:

PyMuPDF (fitz) or pdfplumber: for extracting text from PDFs.
googletrans or deep-translator: for translating the extracted text.
fpdf or reportlab: to write the translated text into a new PDF.

Install them with:

bash
pip install PyMuPDF googletrans==4.0.0-rc1 fpdf

Step-by-Step Process

Step 1: Extract Text from PDF

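Using PyMuPDF, which is fast and usually returns text in reading order (the default get_text() call yields plain text and does not retain the page layout):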

python
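import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page, concatenated in page order."""
    # The context manager closes the file even if extraction fails.
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)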

Step 2: Translate Text

Using googletrans:

python
from googletrans import Translator

def translate_text(text, dest_language='es'):
    translator = Translator()
    translation = translator.translate(text, dest=dest_language)
    return translation.text

Note: googletrans is an unofficial client for the Google Translate web endpoint, so it can hit rate limits or break without notice when that endpoint changes. For production use, consider DeepL, AWS Translate, or Google Cloud Translate via their official APIs.

Step 3: Write Translated Text to New PDF

Using fpdf:

python
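from fpdf import FPDF

def save_to_pdf(text, output_path):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_auto_page_break(auto=True, margin=15)
    # Core fonts such as Arial only cover Latin-1; other scripts need a
    # Unicode TTF font registered with pdf.add_font().
    pdf.set_font("Arial", size=12)
    # Split on real newlines so each source line becomes its own cell.
    for line in text.split('\n'):
        pdf.multi_cell(0, 10, line)
    pdf.output(output_path)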
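
Because fpdf's built-in fonts only encode Latin-1, translated text that contains typographic quotes, dashes, or non-Latin scripts can fail at write time. The sketch below is one possible workaround, not part of the original recipe: a hypothetical helper, to_latin1_safe, that maps a few common typographic characters to ASCII and replaces anything else outside Latin-1 with "?". For languages that are not covered by Latin-1, registering a Unicode font is the better route.

```python
# Hypothetical helper: make text safe for fpdf's Latin-1 core fonts.
# The replacement table is an illustrative assumption, not an exhaustive list.
_TYPOGRAPHIC_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

def to_latin1_safe(text):
    """Return text that can be encoded as Latin-1.

    Common typographic characters are mapped to ASCII equivalents;
    any remaining unsupported character becomes '?'.
    """
    normalized = text.translate(_TYPOGRAPHIC_MAP)
    return normalized.encode("latin-1", errors="replace").decode("latin-1")
```

It could be applied just before writing, for example save_to_pdf(to_latin1_safe(translated), output_path).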

Full Example: Translate an English PDF to Spanish

python
def translate_pdf(input_path, output_path, target_lang='es'):
    text = extract_text_from_pdf(input_path)
    translated = translate_text(text, target_lang)
    save_to_pdf(translated, output_path)

# Usage
translate_pdf("input.pdf", "translated_output.pdf", target_lang='es')

Handling Multi-page and Large PDFs

For long documents, you may want to:

Translate page by page to avoid API limits.
Use batching with textwrap or nltk to split long texts.
Add error handling to skip or retry failed translations.
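
The points above can be sketched without any extra dependency. This is a minimal illustration, assuming a per-request limit of about 4000 characters (an assumed figure, check your translation service for the real one). The names split_into_chunks, translate_with_retry, and translate_long_text are hypothetical, and translate_fn stands for any function that translates a single string, such as translate_text from Step 2.

```python
import time

def split_into_chunks(text, max_chars=4000):
    """Split text into chunks of at most max_chars, breaking on line
    boundaries where possible. max_chars is an assumed limit."""
    chunks, current = [], ""
    for paragraph in text.split("\n"):
        # A single line longer than the limit is cut into fixed-size pieces.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = paragraph if not current else current + "\n" + paragraph
        if len(candidate) > max_chars:
            # Adding this line would overflow, so close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def translate_with_retry(chunk, translate_fn, retries=3, delay=1.0):
    """Call translate_fn(chunk), retrying with a growing pause on failure."""
    for attempt in range(1, retries + 1):
        try:
            return translate_fn(chunk)
        except Exception:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)

def translate_long_text(text, translate_fn, max_chars=4000):
    """Translate text chunk by chunk and join the results with newlines."""
    chunks = split_into_chunks(text, max_chars)
    return "\n".join(translate_with_retry(c, translate_fn) for c in chunks)
```

With the functions from the earlier steps, usage might look like translate_long_text(text, lambda chunk: translate_text(chunk, 'es')). Passing the translator in as a function keeps the chunking logic independent of whichever translation library is used.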

Alternatives and Enhancements

OCR for Scanned PDFs: Use pytesseract with pdf2image for image-based PDFs.
Official Translation APIs:

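Example using DeepL through deep-translator (install it with pip install deep-translator). In recent deep-translator releases the class is named DeeplTranslator and takes the language pair in its constructor:

python
from deep_translator import DeeplTranslator

def translate_text_deepl(text, source='en', target='es'):
    # use_free_api=True targets the free-tier endpoint; set it to False for a paid key.
    translator = DeeplTranslator(api_key="your_api_key", source=source,
                                 target=target, use_free_api=True)
    return translator.translate(text)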
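
For the scanned-PDF case listed under Alternatives and Enhancements, a rough sketch with pytesseract and pdf2image might look like the following. It assumes both packages are installed along with the Tesseract and Poppler binaries they depend on. The helper names needs_ocr and ocr_pdf, the 20-character threshold, and the 300 dpi setting are illustrative assumptions rather than recommended values.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat the PDF as scanned if almost no text was extracted.
    The min_chars threshold is an arbitrary assumption."""
    return len(extracted_text.strip()) < min_chars

def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """Render each page to an image and run Tesseract OCR on it."""
    # Imported lazily so this module loads even without the OCR stack.
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract                        # needs the Tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)
```

One way to combine it with Step 1 is to call extract_text_from_pdf first and fall back to ocr_pdf only when needs_ocr returns True, since OCR is slower and less exact than reading embedded text.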

Conclusion

Translating PDFs with Python comes down to three pieces: PyMuPDF for extraction, a translation library or API for the text, and fpdf to generate the output. For large or scanned documents, prefer an official translation API, translate in chunks or page by page, and handle rate limits and failed requests explicitly.
