Extract tables from PDFs with Python

Several Python libraries can parse PDFs and extract the tables they contain. This guide covers the most popular tools and methods, with a code example for each.

1. Using Tabula-py

Tabula-py is a Python wrapper for tabula-java, a Java library that extracts tables from PDFs. Tabula-py returns the extracted tables as pandas DataFrames.

Installation:

bash
pip install tabula-py

You also need a Java runtime installed on your system, because tabula-py runs tabula-java under the hood.
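Because a missing Java runtime tends to surface as a confusing error at extraction time, a small pre-flight check can help. The sketch below uses only the standard library; the function name `java_available` is an illustrative choice, not part of tabula-py.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and actually runs."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with status 0 on a working installation
        result = subprocess.run([java, "-version"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if not java_available():
    print("Java not found: install a Java runtime before using tabula-py.")
```

Running `java -version` rather than only looking for the executable is deliberate: some systems ship a placeholder `java` command that exists on PATH but fails when no runtime is installed.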

Basic usage:

python
import tabula

# Path to the PDF file
file_path = "sample.pdf"

# Extract all tables from the PDF as a list of DataFrames
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)

for i, table in enumerate(tables):
    print(f"Table {i+1}")
    print(table)

Save extracted tables as CSV:

python
tabula.convert_into(file_path, "output.csv", output_format="csv", pages='all')

2. Using Camelot

Camelot is a Python library for text-based (not scanned) PDFs. Its default mode works best on well-defined tables with clear borders.

Installation:

bash
pip install "camelot-py[cv]"

Note: Depending on the Camelot version, you may also need Ghostscript and Tkinter installed on your system. Check the installation guide for the release you are using.

Basic usage:

python
import camelot

# Read tables from PDF
tables = camelot.read_pdf('sample.pdf', pages='1')

# Number of tables found
print(f"Total tables extracted: {tables.n}")

# Export tables to CSV or JSON
tables[0].to_csv('table1.csv')
tables[0].to_json('table1.json')

# Print first table as DataFrame
print(tables[0].df)

Camelot supports two parsing methods, selected with the flavor argument:

lattice (the default; for tables with ruled borders)
stream (for tables without borders, where columns are separated by whitespace)

Example:

python
tables = camelot.read_pdf('sample.pdf', flavor='stream', pages='1')
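When you do not know in advance which flavor suits a document, one option is to try lattice first and fall back to stream if nothing is found. The following is a minimal sketch of that idea; the helper name `read_with_fallback` and its `read_pdf` parameter are hypothetical, added here so the logic can be exercised without a real PDF.

```python
def read_with_fallback(path, pages="1", read_pdf=None):
    """Try Camelot's lattice flavor first, then stream if no tables are found.

    `read_pdf` is an injectable reader (an assumption of this sketch);
    by default camelot.read_pdf is used.
    """
    if read_pdf is None:
        import camelot  # imported lazily so the helper loads without Camelot

        read_pdf = camelot.read_pdf

    tables = None
    for flavor in ("lattice", "stream"):
        tables = read_pdf(path, pages=pages, flavor=flavor)
        if tables.n > 0:
            # Report which flavor produced a result
            return flavor, tables
    # Neither flavor found a table; return the last (empty) result
    return None, tables
```

A call such as `flavor, tables = read_with_fallback("sample.pdf", pages="1")` returns the flavor that worked together with the table list. A table count above zero does not guarantee a good extraction, so it is still worth inspecting `tables[0].df` by eye.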

3. Using PyMuPDF (fitz) and Pandas for Custom Extraction

PyMuPDF is primarily a text and layout extractor rather than a dedicated table tool. Recent versions (1.23.0 and later) do include a Page.find_tables() method, but for simple tables you can also extract the text yourself and parse it with string processing.

Installation:

bash
pip install pymupdf

Example to extract text:

python
import fitz  # PyMuPDF

file_path = "sample.pdf"

# Open the document and print the plain text of each page
with fitz.open(file_path) as doc:
    for page in doc:
        print(page.get_text())

You can then rebuild the table from the extracted text with Python string methods or regular expressions.
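Plain text often loses the column structure, so one alternative is to work from word positions instead. PyMuPDF's `page.get_text("words")` returns tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sketch below groups such tuples into rows by vertical position; the helper names and the 3-point tolerance are assumptions for illustration, and each word becomes its own cell, so multi-word cells would still need merging.

```python
def words_to_rows(words, y_tolerance=3.0):
    """Group (x0, y0, x1, y1, text, ...) tuples into rows of words.

    Words whose top edge (y0) lies within `y_tolerance` points of the
    first word in the current row are treated as being on the same row.
    """
    rows = []  # each entry: [row_y, [(x0, text), ...]]
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        x0, y0, text = word[0], word[1], word[4]
        if rows and abs(y0 - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y0, [(x0, text)]])
    # Order each row left to right and keep only the text
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def page_rows(pdf_path, page_number=0):
    """Read one page with PyMuPDF and return its words grouped into rows."""
    import fitz  # PyMuPDF, imported lazily so words_to_rows works without it

    with fitz.open(pdf_path) as doc:
        return words_to_rows(doc[page_number].get_text("words"))
```

For example, `page_rows("sample.pdf")` would give a list of rows for the first page, which you can then filter down to the lines that belong to the table.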

4. Using pdfplumber

pdfplumber extracts tables from machine-generated PDFs and exposes settings for tuning how tables are detected, which helps with complex layouts.

Installation:

bash
pip install pdfplumber

Extract tables example:

python
import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()
    
    for table in tables:
        for row in table:
            print(row)

You can convert extracted tables into pandas DataFrames:

python
import pandas as pd
import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()
    
    for i, table in enumerate(tables):
        df = pd.DataFrame(table[1:], columns=table[0])
        print(f"Table {i+1}")
        print(df)
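Using the first row directly as column names can misbehave when a header cell is empty, duplicated, or wrapped over several lines. The following sketch normalizes a table (a list of rows, as returned by `extract_tables()`) before it is handed to pandas. The function name `clean_table` and the `col_N` naming scheme are illustrative assumptions, not pdfplumber features.

```python
def clean_table(table):
    """Split a list-of-rows table into unique header names and string rows."""
    if not table:
        return [], []

    header, seen = [], {}
    for i, cell in enumerate(table[0]):
        # Collapse line breaks and repeated spaces; name empty headers by position
        name = " ".join((cell or "").split()) or f"col_{i}"
        seen[name] = seen.get(name, 0) + 1
        # Suffix repeated names so every column label is unique
        header.append(name if seen[name] == 1 else f"{name}_{seen[name]}")

    # Replace missing cells (None) with empty strings and trim whitespace
    rows = [[(cell or "").strip() for cell in row] for row in table[1:]]
    return header, rows
```

With this in place, the DataFrame can be built as `header, rows = clean_table(table)` followed by `pd.DataFrame(rows, columns=header)`.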

5. Summary of Libraries and Use Cases

Library	Best for	Notes
Tabula-py	Simple tables across multiple pages	Needs a Java runtime
Camelot	Tables with clear borders (lattice) or whitespace-separated columns (stream)	May require Ghostscript
pdfplumber	Complex layouts, fine-grained control	Pure Python, easy to install
PyMuPDF	Text extraction with manual parsing	Not specialized for tables

Tips for Better Table Extraction

Check whether the PDF contains real text or only scanned images. The libraries above need a text layer, so for scanned PDFs run OCR first (for example, with Tesseract).
Choose the right parsing flavor (Camelot’s lattice vs stream).
Preprocess PDFs if needed (e.g., crop or convert pages).
Test extraction on a sample page before running on the entire document.
Handle merged cells and multi-line headers carefully when converting tables.
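To make the last tip concrete, here is a rough sketch of two common repairs: combining a two-line header into one, and filling down a value that belonged to a vertically merged cell. It assumes merged or spanned cells arrive as `None` or empty strings, which depends on the extractor, and both function names are hypothetical.

```python
def merge_header_rows(top, bottom):
    """Combine two header rows, carrying a spanning label to the right.

    Assumes a blank cell in `top` means "still under the previous label".
    """
    merged, current = [], ""
    for upper, lower in zip(top, bottom):
        current = (upper or "").strip() or current
        lower = (lower or "").strip()
        merged.append(f"{current} {lower}".strip())
    return merged


def forward_fill_column(rows, col):
    """Copy the last non-empty value down one column (vertically merged cells)."""
    filled, last = [], ""
    for row in rows:
        row = list(row)  # copy so the input rows are left unchanged
        if row[col] in (None, ""):
            row[col] = last
        else:
            last = row[col]
        filled.append(row)
    return filled
```

Both helpers are deliberately naive: a genuinely empty cell is indistinguishable from a merged one, so apply them only to columns and headers where you know merging occurs.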

With these methods and tools, extracting tables from PDFs using Python becomes manageable and customizable for various types of documents.
