Parse and clean OCR’d receipts

Parsing and cleaning receipts that have been run through OCR (Optical Character Recognition) means extracting structured information (date, total, vendor, line items, and so on) from raw OCR output, which is often noisy and inconsistent. Here’s a streamlined approach:

Step 1: OCR Text Preprocessing

Common OCR issues:

Misspellings (e.g., “T0TAL” instead of “TOTAL”)
Incorrect characters (e.g., “$” read as “S” or “5”)
Misaligned or jumbled formats
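
Misspelled keywords such as the “T0TAL” example above can often be recovered by fuzzy matching against a small vocabulary. The sketch below uses the standard library's difflib; the keyword list, the helper name match_keyword, and the 0.75 cutoff are assumptions made for illustration and would need tuning on real receipts.

```python
import difflib

# Hypothetical vocabulary of receipt keywords; extend it for your own receipts.
KEYWORDS = ["TOTAL", "SUBTOTAL", "TAX", "CASH", "CHANGE"]

def match_keyword(token, cutoff=0.75):
    """Return the closest known keyword for a noisy token, or None."""
    matches = difflib.get_close_matches(token.upper(), KEYWORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_keyword("T0TAL"))
print(match_keyword("COFFEE"))
```

A stricter cutoff reduces false matches on product names but misses short keywords with a single wrong character, so the threshold is a trade-off rather than a fixed value.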

Cleaning techniques:

Normalize text:
- Convert to uppercase or lowercase
- Remove non-printable characters
- Replace commonly misread characters (0 ↔ O, 1 ↔ I)

python
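import re

# Characters that OCR commonly produces in place of letters
SUBSTITUTIONS = str.maketrans({'0': 'O', '1': 'I', '5': 'S', '$': 'S'})

def _fix_token(match):
    token = match.group(0)
    letters = sum(ch.isalpha() for ch in token)
    # Repair only tokens that are mostly letters, so prices and dates stay intact
    return token.translate(SUBSTITUTIONS) if letters > len(token) / 2 else token

def clean_ocr_text(text):
    text = text.upper()
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII
    text = re.sub(r'[\x00-\x08\x0B-\x1F\x7F]', '', text)  # Remove control characters, keep tab and newline
    return re.sub(r'\S+', _fix_token, text)  # Apply substitutions token by token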
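
The same confusions run in the other direction inside amounts, where a letter usually stands for a digit. As a rough sketch, the hypothetical helper repair_amount below maps look-alike letters to digits and keeps the result only if it then has the shape of a price. The substitution table is an assumption and is not exhaustive, and a leading “S” stays ambiguous because it may be a misread “$” rather than a “5”.

```python
import re

# Letters that OCR commonly produces in place of digits (assumed, non-exhaustive).
DIGIT_FIXES = str.maketrans({"O": "0", "I": "1", "L": "1", "S": "5", "B": "8"})

# Shape of a price once look-alike letters are mapped to digits, e.g. "12.50" or "$4.00".
AMOUNT_SHAPE = re.compile(r"\$?\d+\.\d{2}")

def repair_amount(token):
    """Fix digit look-alikes in a token, but only if the result looks like a price."""
    candidate = token.upper().translate(DIGIT_FIXES)
    return candidate if AMOUNT_SHAPE.fullmatch(candidate) else token

print(repair_amount("I2.S0"))
print(repair_amount("TOTAL"))
```

Because the original token is returned whenever the repaired form is not price-shaped, ordinary words pass through unchanged.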

Step 2: Segment Receipt Sections

Break the receipt into parts:

Header (store name, address)
Line items (products and prices)
Footer (total, payment method)

python
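def segment_receipt(text):
    lines = text.split('\n')
    header, items, footer = [], [], []
    section = 'header'  # Sections appear in order: header -> items -> footer

    for line in lines:
        if not line.strip():
            continue  # Skip blank lines
        if re.search(r'\b(TOTAL|SUBTOTAL|TAX)\b', line):
            section = 'footer'
            footer.append(line)
        elif section == 'footer':
            footer.append(line)  # e.g. payment method printed after the total
        elif section == 'items':
            items.append(line)
        elif re.search(r'\b(ITEM|DESCRIPTION)\b', line):
            section = 'items'  # Column heading marks the start of the line items
        else:
            header.append(line)
    return header, items, footer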
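
Many receipts have no ITEM or DESCRIPTION heading at all. One possible fallback, sketched here under the assumption that item lines end in a price with two decimals, is to treat the first price-bearing line as the start of the item list. The function name segment_by_price and both patterns are illustrative, not part of the pipeline above.

```python
import re

PRICE_AT_END = re.compile(r"\d+\.\d{2}\s*$")
FOOTER_WORD = re.compile(r"\b(TOTAL|SUBTOTAL|TAX)\b")

def segment_by_price(lines):
    """Split lines into (header, items, footer) without relying on an item heading."""
    header, items, footer = [], [], []
    section = "header"
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if FOOTER_WORD.search(line):
            section = "footer"
        elif section == "header" and PRICE_AT_END.search(line):
            section = "items"  # the first line ending in a price starts the items
        if section == "header":
            header.append(line)
        elif section == "items":
            items.append(line)
        else:
            footer.append(line)
    return header, items, footer

sample = ["CORNER CAFE", "12/03/2024", "COFFEE 3.50", "BAGEL 2.25",
          "SUBTOTAL 5.75", "TAX 0.46", "TOTAL 6.21", "VISA"]
header, items, footer = segment_by_price(sample)
print(items)
```

Once the footer starts, every remaining line stays in the footer, so trailing details such as the payment method are not pushed back into the header.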

Step 3: Extract Structured Data

Use regular expressions or keyword detection for important data.

Example: Total, Tax, Vendor Name, Date

python
def extract_totals_footer(footer_lines):
    result = {}
    for line in footer_lines:
        # \bTOTAL\b does not match inside "SUBTOTAL", so a subtotal is not mistaken for the total
        match = re.search(r'\bTOTAL\b.*?(\d+\.\d{2})', line)
        if match:
            result["total"] = float(match.group(1))
            continue
        match = re.search(r'\bTAX\b.*?(\d+\.\d{2})', line)
        if match:
            result["tax"] = float(match.group(1))
    return result

def extract_vendor_and_date(header_lines):
    vendor, date = None, None
    for line in header_lines:
        # Take the first line containing a run of letters as the vendor name
        if vendor is None and re.search(r'[A-Z]{2,}', line):
            vendor = line.strip()
        if date is None:
            match = re.search(r'(\d{2}/\d{2}/\d{4})', line)  # e.g. 12/03/2024
            if match:
                date = match.group(1)
    return {"vendor": vendor, "date": date}
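
The date comes back as the raw string found on the receipt. If downstream code needs a consistent format, one option is to try a few candidate layouts and normalize to ISO 8601. This is a minimal sketch: the helper name normalize_date and the format list are assumptions, and the order of the list decides how ambiguous dates such as 03/04/2024 are read (month first here).

```python
from datetime import datetime

# Candidate formats are an assumption; earlier entries win when a date is ambiguous.
DATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d"]

def normalize_date(raw):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None

print(normalize_date("12/03/2024"))
print(normalize_date("25/12/2024"))
```

If the receipts come from a known region, putting that region's format first avoids silently swapping day and month.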

Step 4: Extract Line Items

Each line item usually contains a name, an optional quantity, and a price. The regex below keeps the name and the trailing price; for more complex layouts, use heuristics or machine learning.

python
def parse_line_items(item_lines):
    parsed_items = []
    for line in item_lines:
        # Name, then an optional "$", then a price with two decimals at the end of the line
        match = re.search(r'(.*?)\s*\$?(\d+\.\d{2})\s*$', line)
        if match:
            name = match.group(1).strip()
            price = float(match.group(2))
            parsed_items.append({"item": name, "price": price})
    return parsed_items
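
When quantities matter, a leading count can be captured as well. The pattern below is a sketch that assumes three layouts, “2 X COFFEE 7.00”, “2 COFFEE 7.00”, and “COFFEE 3.50”, and defaults the quantity to 1. The name parse_item_with_quantity is hypothetical, and the price is simply the amount printed at the end of the line, which may be a unit price or a line total depending on the receipt.

```python
import re

# Optional leading quantity (with optional "X"), then the name, then a trailing price.
ITEM_PATTERN = re.compile(
    r"^\s*(?:(?P<qty>\d+)\s*X?\s+)?(?P<name>.+?)\s+\$?(?P<price>\d+\.\d{2})\s*$"
)

def parse_item_with_quantity(line):
    """Return a dict with item, quantity and price, or None if the line does not fit."""
    match = ITEM_PATTERN.match(line)
    if not match:
        return None
    return {
        "item": match.group("name").strip(),
        "quantity": int(match.group("qty") or 1),  # no leading count means 1
        "price": float(match.group("price")),
    }

print(parse_item_with_quantity("2 X COFFEE 7.00"))
print(parse_item_with_quantity("7UP 1.99"))
```

Requiring whitespace after the count keeps product names that start with a digit, such as “7UP”, from being split into a quantity and a name.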

Step 5: Assemble Final Output

python
def parse_receipt(text):
    cleaned = clean_ocr_text(text)
    header, items, footer = segment_receipt(cleaned)
    totals = extract_totals_footer(footer)
    metadata = extract_vendor_and_date(header)
    line_items = parse_line_items(items)

    return {
        "vendor": metadata.get("vendor"),
        "date": metadata.get("date"),
        "items": line_items,
        **totals
    }
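
A cheap sanity check on the assembled result is to see whether the item prices plus tax add up to the total; a mismatch often points to a misread digit or a missed line. The sketch below assumes the dictionary shape returned by parse_receipt and assumes tax is listed separately from the item prices. The function name totals_reconcile and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal

def totals_reconcile(receipt, tolerance=Decimal("0.01")):
    """Return True if item prices plus tax are within tolerance of the total.

    Returns None when the total is missing, since nothing can be checked.
    """
    if receipt.get("total") is None:
        return None
    # Convert through str so binary float artifacts do not affect the comparison.
    items_sum = sum(
        (Decimal(str(item["price"])) for item in receipt.get("items", [])),
        Decimal("0"),
    )
    tax = Decimal(str(receipt.get("tax", 0)))
    total = Decimal(str(receipt["total"]))
    return abs(items_sum + tax - total) <= tolerance

receipt = {
    "vendor": "CORNER CAFE",
    "date": "12/03/2024",
    "items": [{"item": "COFFEE", "price": 3.5}, {"item": "BAGEL", "price": 2.25}],
    "tax": 0.46,
    "total": 6.21,
}
print(totals_reconcile(receipt))
```

Receipts that fail the check can be routed to manual review instead of being accepted as parsed.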

Optional Enhancements

Use spaCy or GPT models for more accurate entity recognition.
Incorporate a language model to infer labels when OCR is unreliable.
Train a custom model on labeled receipts for best results.

