Extract figures and tables from PDFs

The right tool for pulling figures and tables out of a PDF depends on whether you want a manual or an automated workflow, how much output quality matters, and how comfortable you are with programming. The approaches below are grouped along those lines.

1. Manual Extraction

Adobe Acrobat Pro
Lets you export images and tables as separate files, or copy them into other applications.
- Use the Selection tool to highlight the figure or table
- Right-click and choose Save Image As… or copy the table to Excel/Word
- Export the PDF to Excel or Word to get tables in editable formats
Screenshot / Snipping Tools
For figures, the quickest route is often a high-resolution screenshot cropped to the figure.

2. Automated Extraction Using Software Tools

Tabula (tabula.technology)
Open-source tool focused on extracting tables from PDFs.
- Upload PDF
- Select table area
- Export as CSV or TSV (both open in Excel)

Camelot (Python library)
Automated table extraction from PDFs. Works best with machine-generated PDFs (not scanned).

python
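import camelot

# read_pdf parses only page 1 by default; pages='all' covers the whole file
tables = camelot.read_pdf('file.pdf', pages='all')
# Writes one CSV per table, with page and table numbers added to the file name
tables.export('tables.csv', f='csv')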
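
Camelot attaches a parsing report to every table it detects, which can help weed out false positives when a whole document is processed in one go. The following is a minimal sketch, assuming the report is a dict with `accuracy` and `whitespace` keys (as in Camelot's `parsing_report`); the helper names and threshold values are illustrative and not part of Camelot.

```python
def is_reliable(report, min_accuracy=90.0, max_whitespace=50.0):
    """Return True if a parsing report looks like a cleanly detected table.

    `report` is assumed to be a dict shaped like Camelot's
    `table.parsing_report`; the thresholds are illustrative guesses.
    """
    return (report.get("accuracy", 0.0) >= min_accuracy
            and report.get("whitespace", 100.0) <= max_whitespace)


def extract_reliable_tables(pdf_path, flavor="lattice"):
    """Read every page and keep only tables that pass `is_reliable`."""
    import camelot  # imported lazily so `is_reliable` works without Camelot

    tables = camelot.read_pdf(pdf_path, pages="all", flavor=flavor)
    # Each table exposes its cells as a pandas DataFrame via `.df`
    return [t.df for t in tables if is_reliable(t.parsing_report)]
```

The default `lattice` flavor expects tables with ruled lines; `flavor="stream"` is the one to try when columns are separated only by whitespace.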

pdfplumber (Python library)
Extract text, tables, and metadata, with more granular control.

python
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    first_page = pdf.pages[0]
    # Returns the largest table on the page as a list of rows,
    # or None if no table is found
    table = first_page.extract_table()
    print(table)
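
The extracted table is a plain list of rows, so it still needs to be written out in a usable format. A small sketch of that step follows, assuming the row shape pdfplumber returns (lists of cells that are strings or None); the function name `table_to_csv` and the sample data are made up for illustration.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a table (list of rows) to CSV text.

    `rows` is assumed to have the shape pdfplumber's `extract_table()`
    returns: a list of lists whose cells are strings or None.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows or []:  # extract_table() gives None when no table is found
        # Empty cells come back as None; line breaks inside a cell are flattened
        writer.writerow([(cell or "").replace("\n", " ").strip() for cell in row])
    return buffer.getvalue()


# Hypothetical table with one empty cell and one cell containing a comma
sample = [["Year", "Revenue"], ["2023", None], ["2024", "1,200"]]
print(table_to_csv(sample))
```

Using the `csv` module rather than joining on commas keeps cells such as `1,200` intact, because the writer quotes them.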

PyMuPDF / fitz
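Extract embedded raster images from PDFs programmatically. Figures drawn as vector graphics are not returned by get_images() and have to be rendered from the page instead.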

python
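import fitz  # PyMuPDF

doc = fitz.open("file.pdf")
for page_number in range(len(doc)):
    page = doc[page_number]
    # Each entry describes one embedded image; its first item is the xref
    for img in page.get_images():
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        # PNG cannot store CMYK, so convert such pixmaps to RGB before saving
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"image_{page_number}_{xref}.png")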
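
PDFs often embed logos, icons, and the same image on several pages, so saving every entry tends to produce a lot of noise. A rough sketch of a filter follows, assuming each entry has the layout PyMuPDF documents for `get_images()` (xref first, then smask, width, height); the function name and the 100-pixel minimum are arbitrary choices.

```python
def pick_figure_images(image_entries, min_width=100, min_height=100):
    """Return xrefs of images large enough to be figures, without repeats.

    Each entry is assumed to follow PyMuPDF's `page.get_images()` layout:
    (xref, smask, width, height, ...). Size limits are illustrative.
    """
    seen = set()
    selected = []
    for entry in image_entries:
        xref, width, height = entry[0], entry[2], entry[3]
        if xref in seen:
            continue  # same image object referenced again, e.g. a page logo
        seen.add(xref)
        if width >= min_width and height >= min_height:
            selected.append(xref)
    return selected


# Hypothetical entries: a small icon, a chart, and the chart repeated
entries = [(5, 0, 32, 32), (9, 0, 800, 600), (9, 0, 800, 600)]
print(pick_figure_images(entries))
```

Collecting the entries from all pages before filtering (for example by concatenating each page's `get_images()` list) lets the same check catch duplicates across pages.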

3. Using Online Tools (for occasional use)

Smallpdf
Extract images or convert PDF tables to Excel online.
PDFTables
Online service that converts PDF tables to Excel/CSV.
ExtractTable.com
Online tool to extract tables and export them.

4. OCR for Scanned PDFs

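A scanned PDF holds only page images, with no text layer, so the table extractors above find no text to work with until OCR has been run:

Tesseract OCR + Python libraries like pdfplumber or OpenCV
Render each page to an image, run OCR on it, and rebuild the table from the recognized text.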
ABBYY FineReader
Commercial OCR tool with good table extraction.
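
OCR engines return words with pixel coordinates rather than a table, so the rows have to be rebuilt from positions. Below is a simplified sketch, assuming each word is a dict with `text`, `left`, and `top` keys (the same field names pytesseract's `image_to_data` uses); the function name and the tolerance are hypothetical, and real scans usually need column detection as well.

```python
def group_words_into_rows(words, y_tolerance=10):
    """Group OCR words into rows of text, top-to-bottom and left-to-right.

    A word joins the current row when its `top` is within `y_tolerance`
    pixels of the first word placed in that row.
    """
    rows = []
    row_top = None
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if not word["text"].strip():
            continue  # skip empty boxes that OCR reports for layout blocks
        if row_top is None or abs(word["top"] - row_top) > y_tolerance:
            rows.append([])  # start a new row
            row_top = word["top"]
        rows[-1].append(word)
    # Within each row, order the words by horizontal position
    return [[w["text"] for w in sorted(row, key=lambda w: w["left"])]
            for row in rows]


# Hypothetical OCR output for a two-row table
words = [
    {"text": "Revenue", "left": 200, "top": 12},
    {"text": "Year", "left": 20, "top": 10},
    {"text": "2024", "left": 20, "top": 50},
    {"text": "1200", "left": 200, "top": 53},
]
print(group_words_into_rows(words))
```

The tolerance absorbs the small vertical jitter between words on the same line; a value around half the text height is a reasonable starting point to tune from.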

Summary by Task:

Task	Tool/Method	Notes
Extract tables	Tabula, Camelot, pdfplumber	Works well on digital PDFs
Extract images/figures	Adobe Acrobat, PyMuPDF (fitz)	Extract figures as images
Extract from scanned PDFs	ABBYY FineReader, Tesseract OCR	Requires OCR step
Quick online extraction	Smallpdf, PDFTables	Good for one-off quick jobs

