The Palos Publishing Company


Extract tables from PDFs with Python

Several Python libraries can extract tables from PDFs. This guide covers the most popular ones, with installation notes and code examples for each.


1. Using Tabula-py

Tabula-py is a Python wrapper for tabula-java, the Java library behind Tabula. It extracts tables from PDFs into pandas DataFrames.

Installation:

bash
pip install tabula-py

You also need to have Java installed on your system because Tabula depends on it.

Basic usage:

python
import tabula

# Path to the PDF file
file_path = "sample.pdf"

# Extract all tables from the PDF as a list of DataFrames
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)

for i, table in enumerate(tables):
    print(f"Table {i+1}")
    print(table)

Save extracted tables as CSV:

python
tabula.convert_into(file_path, "output.csv", output_format="csv", pages='all')

2. Using Camelot

Camelot works best with PDFs that have well-defined tables with clear borders.

Installation:

bash
pip install "camelot-py[cv]"

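Note: Depending on the version, Camelot may need Ghostscript (and, on older releases, Tkinter) installed on your system. Check the installation guide for the release you use.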

Basic usage:

python
import camelot

# Read tables from page 1 of the PDF
tables = camelot.read_pdf('sample.pdf', pages='1')

# Number of tables found
print(f"Total tables extracted: {tables.n}")

# Export the first table to CSV or JSON
tables[0].to_csv('table1.csv')
tables[0].to_json('table1.json')

# Print first table as DataFrame
print(tables[0].df)

Camelot supports two parsing methods, selected with the flavor argument:

  • stream (for tables without borders, where columns are separated by whitespace)

  • lattice (the default; for tables with ruled borders)

Example:

python
tables = camelot.read_pdf('sample.pdf', flavor='stream', pages='1')
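When you do not know in advance which flavor suits a document, one option is to try lattice first and fall back to stream if nothing is found. The sketch below assumes that strategy. The function read_with_fallback and its reader parameter are hypothetical names introduced here for illustration; the only Camelot features it relies on are camelot.read_pdf with a flavor argument and the n attribute shown above.

```python
def read_with_fallback(path, pages="1", reader=None):
    """Try the lattice flavor first, then stream if no tables are found.

    `reader` is a hypothetical hook: any callable with the same call
    signature as camelot.read_pdf. It defaults to camelot.read_pdf.
    """
    if reader is None:
        import camelot  # imported lazily so the helper can be tested without it
        reader = camelot.read_pdf

    for flavor in ("lattice", "stream"):
        tables = reader(path, pages=pages, flavor=flavor)
        if tables.n > 0:
            return flavor, tables
    # Neither flavor found a table; return the last (empty) result.
    return None, tables

# Usage (assumes sample.pdf exists and Camelot is installed):
# flavor, tables = read_with_fallback("sample.pdf", pages="1")
# print(flavor, tables.n)
```

A non-empty lattice result is not always the better one, so it is still worth inspecting the output before trusting it.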

3. Using PyMuPDF (fitz) and Pandas for Custom Extraction

PyMuPDF is primarily a text and layout extraction library. Recent releases (1.23.0 and later) include a Page.find_tables() method, but on older versions, or when the tables are simple, you can extract the text and parse it with string processing.

Installation:

bash
pip install pymupdf

Example to extract text:

python
import fitz  # PyMuPDF

file_path = "sample.pdf"
doc = fitz.open(file_path)

# Print the plain text of every page
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    text = page.get_text()
    print(text)

From the extracted text, use Python string methods or regular expressions to rebuild the rows and columns.
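As a rough illustration of that parsing step, the sketch below splits each line into cells wherever two or more spaces occur. It assumes the extracted text keeps columns on one line, separated by runs of spaces. That is not guaranteed: depending on the PDF, each cell may come back on its own line, so the splitting rule usually has to be adapted. The sample text and the parse_text_table helper are made up for illustration.

```python
import re

def parse_text_table(text, min_gap=2):
    """Split each non-empty line into cells on runs of `min_gap` or more spaces."""
    splitter = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(splitter.split(line))
    return rows

# Hypothetical text, shaped like a simple table with space-aligned columns
sample_text = """
Item        Qty   Unit price
Widget A    2     9.50
Widget B    10    1.25
"""

rows = parse_text_table(sample_text)
header, data = rows[0], rows[1:]

# Pair each data row with the header to get one dict per record
records = [dict(zip(header, row)) for row in data]
print(records)
```

Single spaces inside a cell ("Unit price", "Widget A") survive because only gaps of at least min_gap spaces are treated as column breaks.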


4. Using pdfplumber

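pdfplumber exposes the characters, lines, and rectangles on each page and builds tables from them, which makes it a good fit for complex layouts where you need control over how tables are detected.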

Installation:

bash
pip install pdfplumber

Extract tables example:

python
import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:
    first_page = pdf.pages[0]

    # Each table is a list of rows; each row is a list of cell strings
    tables = first_page.extract_tables()

    for table in tables:
        for row in table:
            print(row)

You can convert extracted tables into pandas DataFrames:

python
import pandas as pd
import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:
    first_page = pdf.pages[0]
    tables = first_page.extract_tables()

    for i, table in enumerate(tables):
        # Use the first row as the header and the rest as data
        df = pd.DataFrame(table[1:], columns=table[0])
        print(f"Table {i+1}")
        print(df)
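The examples above read only the first page. A sketch of the same idea applied to every page follows. The collect_tables helper is hypothetical; it relies only on pdf.pages and page.extract_tables() as used above, and it replaces cells that come back as None with empty strings so later processing does not have to special-case them.

```python
def collect_tables(pages):
    """Return (page_number, table) pairs for every table on every page.

    `pages` is any iterable of objects with an extract_tables() method,
    such as pdf.pages from pdfplumber. Page numbers start at 1.
    """
    found = []
    for page_number, page in enumerate(pages, start=1):
        for table in page.extract_tables():
            # Normalize missing cells (None) to empty strings.
            cleaned = [[cell if cell is not None else "" for cell in row]
                       for row in table]
            found.append((page_number, cleaned))
    return found

# Usage with pdfplumber (assumes sample.pdf exists):
# import pdfplumber
# with pdfplumber.open("sample.pdf") as pdf:
#     for page_number, table in collect_tables(pdf.pages):
#         print(page_number, table[0])
```

Keeping the page number next to each table makes it easier to trace a bad extraction back to its source page.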

5. Summary of Libraries and Use Cases

Library | Best for | Notes
Tabula-py | Simple tables, multiple pages | Needs Java runtime
Camelot | Tables with clear borders | May require Ghostscript, depending on version
pdfplumber | Complex layouts, flexible | Pure Python, easy to install
PyMuPDF | Extracting text, manual parsing required | Not specialized for tables

Tips for Better Table Extraction

  • Check whether the PDF contains real text or only scanned images. For scanned PDFs, run an OCR engine such as Tesseract first.

  • Choose the right parsing flavor (Camelot’s lattice vs stream).

  • Preprocess PDFs if needed (e.g., crop or convert pages).

  • Test extraction on a sample page before running on the entire document.

  • Handle merged cells and multi-line headers carefully when converting tables.
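To make the last tip concrete, here is a minimal sketch of one way to handle both cases on a table given as a list of rows (the shape returned by pdfplumber's extract_tables()). It assumes that merged cells show up as None or empty strings and that the header spans exactly two rows; both are assumptions about the input. fill_merged_cells and merge_header_rows are hypothetical helpers, not library functions, and the sample table is made up.

```python
def fill_merged_cells(rows, column=0):
    """Forward-fill blanks in one column, as left by vertically merged cells."""
    filled, last = [], ""
    for row in rows:
        row = list(row)
        if row[column] in (None, ""):
            row[column] = last
        else:
            last = row[column]
        filled.append(row)
    return filled

def merge_header_rows(top, bottom):
    """Combine a two-row header into one row of 'Top Bottom' labels."""
    merged, group = [], ""
    for upper, lower in zip(top, bottom):
        # Assumption: a blank upper cell continues the group label to its left.
        group = upper or group
        merged.append(" ".join(part for part in (group, lower) if part))
    return merged

# Hypothetical extraction result with a two-row header and a merged first column
table = [
    ["Region", "Sales", None],
    [None, "Q1", "Q2"],
    ["North", "10", "12"],
    [None, "7", "9"],
]

header = merge_header_rows(table[0], table[1])
body = fill_merged_cells(table[2:])
print(header)
print(body)
```

The cleaned header and body can then be passed to pd.DataFrame(body, columns=header) as in the pdfplumber example.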


With these methods and tools, extracting tables from PDFs using Python becomes manageable and customizable for various types of documents.
