Translating PDFs with Python involves several steps: extracting text from the PDF, translating the text using a translation API or model, and optionally saving the translated content back into a new PDF. Here’s a complete overview of how to do it, with a focus on using popular Python libraries.
Required Libraries
You’ll need the following Python libraries:
- PyMuPDF (fitz) or pdfplumber: for extracting text from PDFs.
- googletrans or deep-translator: for translating the extracted text.
- fpdf or reportlab: to write the translated text into a new PDF.
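Install them with pip, for example: pip install pymupdf googletrans deep-translator fpdf2 (fpdf2 is the maintained successor of the original fpdf package and is still imported as fpdf).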
Step-by-Step Process
Step 1: Extract Text from PDF
Using PyMuPDF (fast and generally reliable for text-based PDFs; note that plain-text extraction does not preserve the page layout):
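A minimal sketch of this step, assuming PyMuPDF is installed and importable as fitz. The helper names extract_pages and clean_text are illustrative and not part of PyMuPDF; only fitz.open and page.get_text are library calls.

```python
def clean_text(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines into one."""
    cleaned = []
    for line in text.splitlines():
        line = line.rstrip()
        # Skip a blank line if it is leading or follows another blank line.
        if not line and (not cleaned or not cleaned[-1]):
            continue
        cleaned.append(line)
    return "\n".join(cleaned).strip()


def extract_pages(pdf_path: str) -> list[str]:
    """Return the plain text of each page, one string per page."""
    import fitz  # PyMuPDF; imported here so clean_text works without it

    with fitz.open(pdf_path) as doc:
        return [clean_text(page.get_text()) for page in doc]
```

Returning one string per page, rather than a single large string, makes it easier to translate page by page later. Pages that contain only images will come back as empty strings.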
Step 2: Translate Text
Using googletrans:
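A hedged sketch of this step. It assumes the googletrans Translator class and its translate method, whose behaviour differs between releases: older versions return the result directly, while newer ones return a coroutine. The function name translate_text and its translator parameter are illustrative additions so the code can be exercised without network access.

```python
import asyncio
import inspect


def translate_text(text, src="en", dest="es", translator=None):
    """Translate text with googletrans and return the translated string."""
    if not text.strip():
        return text  # nothing to translate; avoid a pointless request
    if translator is None:
        from googletrans import Translator  # unofficial Google Translate client

        translator = Translator()
    result = translator.translate(text, src=src, dest=dest)
    # Newer googletrans releases return a coroutine instead of a result.
    if inspect.isawaitable(result):
        result = asyncio.run(result)
    return result.text
```

Calling translate_text("Good morning") sends a request to Google, so it needs network access. Inside an environment that already runs an event loop, such as a notebook, asyncio.run will raise an error and you would await the call yourself instead.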
Note: googletrans may face rate limits or instability due to API changes. For production use, consider DeepL, AWS Translate, or Google Cloud Translate via official APIs.
Step 3: Write Translated Text to New PDF
Using fpdf:
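A minimal sketch of this step, assuming the FPDF class from the fpdf (or fpdf2) package. The built-in PDF fonts only cover Latin-1, which is enough for Spanish but not for many other target languages, so the illustrative helper to_latin1 replaces anything else with a question mark. For full Unicode output you would register a TrueType font instead.

```python
def to_latin1(text: str) -> str:
    """Replace characters the built-in PDF fonts cannot encode with '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


def write_pdf(pages, output_path, font_size=12):
    """Write each string in pages to its own page of a new PDF."""
    from fpdf import FPDF  # provided by the fpdf / fpdf2 package

    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    for page_text in pages:
        pdf.add_page()
        pdf.set_font("Helvetica", size=font_size)
        # Width 0 means "extend to the right margin"; 8 is the line height.
        pdf.multi_cell(0, 8, to_latin1(page_text))
    pdf.output(output_path)
```

The output is plain flowing text: fonts, images, and positioning from the source PDF are not reproduced.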
Full Example: Translate an English PDF to Spanish
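The three steps can be combined roughly as follows. This sketch swaps in GoogleTranslator from deep-translator rather than googletrans, because its translate method is a plain synchronous call. The function names and the example file names are hypothetical.

```python
def translate_pages(pages, translate):
    """Apply translate to every non-empty page, keeping page order."""
    return [translate(p) if p.strip() else p for p in pages]


def translate_pdf(input_path, output_path, source="en", target="es"):
    """Extract, translate and re-write a text-based PDF, page by page."""
    import fitz  # PyMuPDF
    from deep_translator import GoogleTranslator
    from fpdf import FPDF

    # Step 1: extract the text of each page.
    with fitz.open(input_path) as doc:
        pages = [page.get_text() for page in doc]

    # Step 2: translate each page separately.
    translator = GoogleTranslator(source=source, target=target)
    translated = translate_pages(pages, translator.translate)

    # Step 3: write one output page per input page.
    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    for text in translated:
        pdf.add_page()
        pdf.set_font("Helvetica", size=12)
        # Core fonts are Latin-1 only; unsupported characters become '?'.
        safe = (text or "").encode("latin-1", errors="replace").decode("latin-1")
        pdf.multi_cell(0, 8, safe)
    pdf.output(output_path)
```

A call such as translate_pdf("input_en.pdf", "output_es.pdf") would then produce the Spanish file. Very dense pages can exceed the translator's per-request character limit, in which case the text needs to be split into smaller chunks first.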
Handling Multi-page and Large PDFs
For long documents, you may want to:
- Translate page by page to avoid API limits.
- Use batching with textwrap or nltk to split long texts.
- Add error handling to skip or retry failed translations.
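The batching and retry ideas above can be sketched with the standard library alone. The 4500-character default is an assumption chosen to stay under a typical 5000-character request limit, and translate stands for any function that takes a string and returns its translation; both should be adjusted to the service you use.

```python
import textwrap
import time


def split_into_chunks(text, max_chars=4500):
    """Split text on whitespace into chunks of at most max_chars characters."""
    # break_long_words=False keeps words (and URLs) intact, so a single
    # word longer than max_chars would still exceed the limit.
    return textwrap.wrap(text, width=max_chars, break_long_words=False)


def translate_with_retry(chunk, translate, retries=3, delay=1.0):
    """Call translate(chunk); retry on failure, return None if all fail."""
    for attempt in range(1, retries + 1):
        try:
            return translate(chunk)
        except Exception:  # broad on purpose: client libraries raise many types
            if attempt < retries:
                time.sleep(delay * attempt)  # wait longer after each failure
    return None


def translate_long_text(text, translate, max_chars=4500, retries=3, delay=1.0):
    """Translate chunk by chunk; a chunk that keeps failing stays untranslated."""
    out = []
    for chunk in split_into_chunks(text, max_chars):
        result = translate_with_retry(chunk, translate, retries, delay)
        out.append(result if result is not None else chunk)
    return " ".join(out)
```

Note that textwrap.wrap turns newlines into spaces, so paragraph breaks are lost. If they matter, split the text on blank lines first and chunk each paragraph separately.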
Alternatives and Enhancements
- OCR for Scanned PDFs: Use pytesseract with pdf2image for image-based PDFs.
- Official Translation APIs: Example using DeepL:
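A hedged sketch using the official deepl Python package (installed with pip install deepl). It relies on deepl.Translator and its translate_text method. The environment variable name DEEPL_AUTH_KEY and the wrapper function translate_with_deepl are assumptions made for illustration.

```python
import os


def translate_with_deepl(text, target_lang="ES", source_lang="EN", translator=None):
    """Translate text through the DeepL API and return the translated string."""
    if translator is None:
        import deepl  # official DeepL client library

        # DEEPL_AUTH_KEY is an assumed variable name; use whatever your
        # deployment provides for the API key.
        translator = deepl.Translator(os.environ["DEEPL_AUTH_KEY"])
    result = translator.translate_text(
        text, source_lang=source_lang, target_lang=target_lang
    )
    return result.text
```

Because the function takes and returns plain strings, it can be passed wherever the earlier steps expect a translate function. A DeepL account and API key are required.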
Conclusion
Translating PDFs with Python is straightforward: use PyMuPDF for extraction, a translation library for converting the text, and fpdf to generate the final output. For best results, especially with large or scanned documents, use an official translation API, process the document page by page, and handle rate limits carefully.