The Palos Publishing Company

Follow Us On The X Platform @PalosPublishing
Categories We Write About

Parse and clean OCR’d receipts

To parse and clean OCR’d (Optical Character Recognition) receipts, the process typically involves extracting structured information (like date, total, vendor, line items, etc.) from raw OCR output, which is often noisy and inconsistent. Here’s a streamlined approach:


Step 1: OCR Text Preprocessing

Common OCR issues:

  • Misspellings (e.g., “T0TAL” instead of “TOTAL”)

  • Incorrect characters (e.g., “$” read as “S” or “5”)

  • Misaligned or jumbled formats

Cleaning techniques:

  • Normalize text:

    • Convert to uppercase or lowercase

    • Remove non-printable characters

    • Replace commonly misread characters (0O, 1I)

python
import re def clean_ocr_text(text): text = text.upper() text = re.sub(r'[^x00-x7F]+', '', text) # Remove non-ASCII substitutions = { '0': 'O', '1': 'I', '5': 'S', '$': 'S' } for k, v in substitutions.items(): text = text.replace(k, v) return text

Step 2: Segment Receipt Sections

Break the receipt into parts:

  • Header (store name, address)

  • Line items (products and prices)

  • Footer (total, payment method)

python
def segment_receipt(text): lines = text.split('n') header, items, footer = [], [], [] found_items = False for line in lines: if re.search(r'bTOTALb|bSUBTOTALb|bTAXb', line): footer.append(line) found_items = False elif found_items: items.append(line) elif re.search(r'bITEMb|bDESCRIPTIONb', line): found_items = True else: if not found_items: header.append(line) return header, items, footer

Step 3: Extract Structured Data

Use regular expressions or keyword detection for important data.

Example: Total, Tax, Vendor Name, Date

python
def extract_totals_footer(footer_lines): result = {} for line in footer_lines: if "TOTAL" in line: match = re.search(r'TOTAL.*?(d+.d{2})', line) if match: result["total"] = float(match.group(1)) elif "TAX" in line: match = re.search(r'TAX.*?(d+.d{2})', line) if match: result["tax"] = float(match.group(1)) return result def extract_vendor_and_date(header_lines): vendor, date = None, None for line in header_lines: if not vendor and re.search(r'[A-Z]{2,}', line): vendor = line.strip() if not date: match = re.search(r'(d{2}/d{2}/d{4})', line) if match: date = match.group(1) return {"vendor": vendor, "date": date}

Step 4: Extract Line Items

Each line item usually contains a name, quantity, and price. Use heuristics or machine learning for complex patterns.

python
def parse_line_items(item_lines): parsed_items = [] for line in item_lines: match = re.search(r'(.*?)(d+.d{2})$', line) if match: name = match.group(1).strip() price = float(match.group(2)) parsed_items.append({"item": name, "price": price}) return parsed_items

Step 5: Assemble Final Output

python
def parse_receipt(text): cleaned = clean_ocr_text(text) header, items, footer = segment_receipt(cleaned) totals = extract_totals_footer(footer) metadata = extract_vendor_and_date(header) line_items = parse_line_items(items) return { "vendor": metadata.get("vendor"), "date": metadata.get("date"), "items": line_items, **totals }

Optional Enhancements

  • Use spaCy or GPT models for more accurate entity recognition.

  • Incorporate a language model to infer labels when OCR is unreliable.

  • Train a custom model on labeled receipts for best results.


Let me know if you want to run this on real OCR text or need it adapted to work with PDFs/images.

Share this Page your favorite way: Click any app below to share.

Enter your email below to join The Palos Publishing Company Email List

We respect your email privacy

Categories We Write About