Parsing PDFs with PyPDF2

Parsing PDFs with PyPDF2 means extracting text, metadata, and other content from PDF files using Python. PyPDF2 is a popular, lightweight library for working with PDFs programmatically. This guide covers installation, basic usage, and practical examples.

Installing PyPDF2

Before parsing PDFs, install PyPDF2 with pip:

bash
pip install PyPDF2

Opening and Reading PDF Files

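To start parsing, import PyPDF2 and create a PdfReader object, which gives access to the document’s contents. PdfReader accepts either a file path or a file object opened in binary read mode ('rb'). If you pass a file object, keep it open for as long as you use the reader, because page content is read on demand.

python
import PyPDF2

# Passing a path lets PdfReader load the file itself, so the reader
# stays usable in the later examples.
reader = PyPDF2.PdfReader('sample.pdf')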
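
If the path comes from user input, it can help to fail early with a clear message. The following is a minimal sketch under that assumption; load_reader is a hypothetical helper name, not part of PyPDF2.

```python
from pathlib import Path


def load_reader(path):
    """Return a PdfReader for path, or raise FileNotFoundError."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")
    # Imported here so the path check works even where PyPDF2 is missing.
    import PyPDF2
    # Given a path, PdfReader reads the file itself, so there is no
    # file handle left for the caller to manage.
    return PyPDF2.PdfReader(str(pdf_path))
```

Calling load_reader('sample.pdf') then gives a reader that can be used in the examples that follow.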

Accessing PDF Metadata

PDF files often contain metadata such as title, author, subject, and creation date. You can access this metadata using the metadata attribute of the PdfReader object.

python
metadata = reader.metadata  # None if the PDF has no metadata
if metadata is not None:
    print(metadata.title)
    print(metadata.author)
    print(metadata.subject)
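
Many PDFs have no metadata at all, or leave individual fields empty. As a small sketch, the fields can be collected into a plain dictionary that tolerates both cases; metadata_summary is a hypothetical helper name, and it only relies on the title, author, and subject attributes shown above.

```python
def metadata_summary(reader):
    """Return title, author and subject as a dict; missing values are None."""
    info = reader.metadata  # may be None when the PDF carries no metadata
    fields = ("title", "author", "subject")
    # getattr with a default also covers the case where info itself is None.
    return {name: getattr(info, name, None) for name in fields}
```

The result is easy to log or serialize, for example with json.dumps(metadata_summary(reader)).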

Extracting Text from Pages

One of the most common tasks is extracting text from PDF pages. PyPDF2 provides a pages attribute, which is a list-like object containing each page of the PDF. You can iterate over these pages and extract text.

python
for page_num, page in enumerate(reader.pages, start=1):
    text = page.extract_text()
    print(f"Page {page_num} Text:")
    print(text)
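
When the text is needed for more than printing, a generator keeps the loop reusable. This is a sketch rather than a PyPDF2 feature: iter_page_text is a hypothetical name, and the `or ""` guard assumes extract_text() may return an empty value for pages without embedded text.

```python
def iter_page_text(reader):
    """Yield (page_number, text) pairs, with page numbers starting at 1."""
    for page_number, page in enumerate(reader.pages, start=1):
        # Normalise an empty or None result to a string so callers
        # can always treat the text as str.
        text = page.extract_text() or ""
        yield page_number, text.strip()
```

A caller can then write `for number, text in iter_page_text(reader): ...` and decide what to do with each page.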

Extracting Text from a Specific Page

If you want to extract text from a specific page only:

python
page = reader.pages[0]  # first page
text = page.extract_text()
print(text)
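
Indexing reader.pages directly raises an IndexError for a page that does not exist, and the index is zero-based, which is easy to get wrong. The sketch below wraps the lookup in a hypothetical helper, extract_page_text, that takes a 1-based page number and reports the valid range.

```python
def extract_page_text(reader, page_number):
    """Return the text of a 1-based page number, or raise ValueError."""
    total = len(reader.pages)
    if not 1 <= page_number <= total:
        raise ValueError(
            f"page_number must be between 1 and {total}, got {page_number}"
        )
    # Convert the 1-based page number to the zero-based index used by pages.
    return reader.pages[page_number - 1].extract_text() or ""
```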

Handling PDFs with Encrypted Content

Some PDFs are encrypted and require a password to access. PyPDF2 can handle decryption if you provide the correct password.

python
with open('encrypted.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    if reader.is_encrypted:
        reader.decrypt('password')
    page = reader.pages[0]
    text = page.extract_text()
    print(text)
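
The example above ignores the return value of decrypt(), so a wrong password only shows up later as an error when a page is read. Here is a sketch that checks the result right away. It assumes decrypt() returns a falsy value when the password does not match, which is how PyPDF2 releases generally behave; unlock is a hypothetical helper name.

```python
def unlock(reader, password):
    """Decrypt reader in place if needed; raise ValueError on a wrong password."""
    if reader.is_encrypted:
        # Assumption: a falsy result (0) means the password did not match,
        # and any truthy result means the document is now readable.
        if not reader.decrypt(password):
            raise ValueError("Incorrect password for encrypted PDF")
    return reader
```

Used inside the `with` block above, `unlock(reader, 'password')` replaces the `if reader.is_encrypted:` check.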

Extracting Other Data: Number of Pages

You can easily find out how many pages a PDF contains using:

python
num_pages = len(reader.pages)
print(f"Total pages: {num_pages}")

Combining Text Extraction with Data Processing

After extracting text, you can process it further, for example by cleaning it, searching for keywords, or saving it to a text file.

python
all_text = ""
for page in reader.pages:
    all_text += page.extract_text() + "\n"

# Example: Save extracted text to a file
with open('output.txt', 'w', encoding='utf-8') as output_file:
    output_file.write(all_text)
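
Cleaning and keyword search need nothing beyond the standard library once the text is a plain string. The sketch below shows one possible approach; clean_text and find_keyword are hypothetical helper names, and the sample string stands in for text extracted from a PDF.

```python
import re


def clean_text(text):
    """Collapse runs of spaces and tabs, and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def find_keyword(text, keyword):
    """Return (line_number, line) pairs for lines containing keyword, ignoring case."""
    needle = keyword.lower()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


# Stand-in for text extracted from a PDF.
sample = "Invoice   2024\n\n  Total:\t42 USD "
cleaned = clean_text(sample)
print(cleaned)                         # Invoice 2024 / Total: 42 USD (two lines)
print(find_keyword(cleaned, "total"))  # [(2, 'Total: 42 USD')]
```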

Limitations of PyPDF2

Text Extraction Accuracy: PyPDF2 extracts text as it is stored in the PDF, which can sometimes lead to garbled or misordered text, especially with complex layouts or scanned documents.
No OCR: PyPDF2 does not support Optical Character Recognition (OCR). It cannot extract text from scanned images or PDFs without embedded text.
Limited Image/Graphics Extraction: PyPDF2 primarily focuses on text and basic metadata. For advanced image or annotation extraction, consider libraries like pdfplumber or PyMuPDF.
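
A practical consequence of the OCR limitation is that scanned pages come back with little or no text. One rough way to spot them is to flag pages whose extracted text is very short. This is only a heuristic sketch: pages_without_text and the min_chars threshold are hypothetical, and a flagged page may simply be blank rather than scanned.

```python
def pages_without_text(reader, min_chars=1):
    """Return 1-based numbers of pages with fewer than min_chars of extracted text."""
    flagged = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if len(text) < min_chars:
            # Likely an image-only page; it would need OCR from another tool.
            flagged.append(page_number)
    return flagged
```

If the returned list covers most of the document, the PDF is probably a scan and PyPDF2 alone will not be enough.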

Advanced Parsing: Extracting Tables or Structured Data

While PyPDF2 can extract raw text, extracting structured data like tables often requires more specialized tools. You can combine PyPDF2 with other libraries (e.g., tabula-py, camelot) for enhanced PDF data extraction workflows.

Using PyPDF2 offers a simple way to automate PDF text extraction tasks in Python, especially for text-heavy PDFs. For more complex needs, integrating additional libraries or OCR tools might be necessary.
