The Palos Publishing Company

Translate PDFs with Python

Translating a PDF with Python takes three steps: extracting the text from the PDF, translating it with a translation API or model, and optionally writing the translated content to a new PDF. This overview walks through each step using popular Python libraries.


Required Libraries

You’ll need the following Python libraries:

  • PyMuPDF (fitz) or pdfplumber: for extracting text from PDFs.

  • googletrans or deep-translator: for translating the extracted text.

  • fpdf or reportlab: to write the translated text into a new PDF.

Install them with:

bash
pip install PyMuPDF googletrans==4.0.0-rc1 fpdf

Step-by-Step Process

Step 1: Extract Text from PDF

Using PyMuPDF (fast and generally accurate; note that plain-text extraction returns the text only and does not preserve the page layout):

python
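import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF, concatenated."""
    text = ""
    # The context manager closes the file once extraction is done.
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text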
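Text extracted from a PDF usually carries a line break at the end of every visual line, often in the middle of a sentence, and translators tend to do better with whole sentences. The sketch below is one possible cleanup pass to run between extraction and translation. The function name normalize_extracted_text is hypothetical, and the rules are deliberately simple: blank lines are treated as paragraph breaks, and a hyphen at the end of a line is assumed to be a word split, which will occasionally merge a genuine compound such as "well-known".

```python
import re


def normalize_extracted_text(text: str) -> str:
    """Join hard-wrapped lines into paragraphs (hypothetical helper)."""
    # Rejoin words split across a line break: "transla-\ntion" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines mark paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse line breaks and repeated spaces.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

It could be applied to the result of extract_text_from_pdf before the text is passed to the translation step.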

Step 2: Translate Text

Using googletrans:

python
from googletrans import Translator

def translate_text(text, dest_language='es'):
    translator = Translator()
    translation = translator.translate(text, dest=dest_language)
    return translation.text

Note: googletrans is an unofficial client for the Google Translate web service, so it can hit rate limits or break when that service changes. For production use, consider the official APIs of DeepL, Amazon Translate, or Google Cloud Translation.

Step 3: Write Translated Text to New PDF

Using fpdf:

python
from fpdf import FPDF

def save_to_pdf(text, output_path):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_auto_page_break(auto=True, margin=15)
    pdf.set_font("Arial", size=12)
    # Write each source line separately; multi_cell wraps long lines.
    for line in text.split('\n'):
        pdf.multi_cell(0, 10, line)
    pdf.output(output_path)
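One caveat with this step: the built-in fpdf fonts such as Arial only cover Latin-1 characters. Spanish accents and ñ are fine, but typographic quotes, long dashes, the euro sign, or any non-Latin script will raise an encoding error when the PDF is written. The sketch below is one way to make the text safe before calling save_to_pdf. The helper name to_latin1_safe and the replacement table are assumptions made for illustration; for languages outside Latin-1 a Unicode TrueType font would be needed instead.

```python
def to_latin1_safe(text: str) -> str:
    """Replace characters the core fpdf fonts cannot encode (hypothetical helper)."""
    # Map common typographic characters to plain equivalents first.
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en and em dashes
        "\u2026": "...",                # ellipsis
    }
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    # Anything still outside Latin-1 becomes "?" instead of raising an error.
    return text.encode("latin-1", errors="replace").decode("latin-1")
```

A call such as save_to_pdf(to_latin1_safe(translated), output_path) would then avoid the encoding error, at the cost of losing the characters that were replaced.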

Full Example: Translate an English PDF to Spanish

python
def translate_pdf(input_path, output_path, target_lang='es'):
    text = extract_text_from_pdf(input_path)
    translated = translate_text(text, target_lang)
    save_to_pdf(translated, output_path)

# Usage
translate_pdf("input.pdf", "translated_output.pdf", target_lang='es')

Handling Multi-page and Large PDFs

For long documents, you may want to:

  • Translate page by page to avoid API limits.

  • Use batching with textwrap or nltk to split long texts.

  • Add error handling to skip or retry failed translations.
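The three points above can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the function names (split_into_chunks, translate_with_retry, translate_pages), the 4000-character limit, and the retry settings are all assumptions, and the translator is passed in as a plain function so any backend can be used. Chunks are built from whole lines rather than with textwrap so that line breaks survive the round trip.

```python
import time
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 4000) -> List[str]:
    """Group whole lines into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size pieces.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when this line would overflow the current one.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def translate_with_retry(chunk: str, translate_fn: Callable[[str], str],
                         retries: int = 3, delay: float = 1.0) -> str:
    """Try translate_fn up to `retries` times, then fall back to the original."""
    for attempt in range(1, retries + 1):
        try:
            return translate_fn(chunk)
        except Exception:
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    # Skip: keep the untranslated text so the document stays complete.
    return chunk


def translate_pages(pages: List[str], translate_fn: Callable[[str], str],
                    max_chars: int = 4000) -> List[str]:
    """Translate each page separately, chunk by chunk."""
    translated = []
    for page_text in pages:
        parts = [translate_with_retry(chunk, translate_fn)
                 for chunk in split_into_chunks(page_text, max_chars)]
        translated.append("".join(parts))
    return translated
```

With the functions from the earlier steps, the pages list could come from calling page.get_text() on each page of the opened document, and translate_fn could be a small wrapper such as lambda t: translate_text(t, 'es'). Returning the original chunk after the last failed attempt is a design choice; raising the error instead may be preferable when a partial translation is not acceptable.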


Alternatives and Enhancements

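Example using DeepL through deep-translator (install it with pip install deep-translator; a DeepL API key is required):

python
from deep_translator import DeeplTranslator

def translate_text_deepl(text, target='es'):
    # use_free_api=True targets the free DeepL API; set it to False for a paid plan.
    translator = DeeplTranslator(api_key="your_api_key", source='en',
                                 target=target, use_free_api=True)
    return translator.translate(text)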

Conclusion

Translating PDFs with Python is straightforward: PyMuPDF extracts the text, a translation library converts it, and fpdf writes the final output. For large documents, use an official translation API, translate page by page, and handle rate limits and failed requests carefully. Scanned documents are harder, because text extraction only works on pages that contain a text layer.
