Create a receipt scanner with OCR

A receipt scanner combines image processing with optical character recognition (OCR) to extract text from receipt images. This step-by-step guide builds a basic one in Python, using OpenCV for image processing and Tesseract for OCR.

Step 1: Install Required Libraries

bash
pip install opencv-python pytesseract Pillow numpy

Additionally, you need to install Tesseract-OCR software on your machine:

Windows: Download from Tesseract at UB Mannheim
macOS (Homebrew): brew install tesseract
Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr
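
Before running any OCR code, it can help to confirm that the Tesseract binary is actually reachable. The sketch below uses only the Python standard library; the function names `find_tesseract` and `tesseract_version` are illustrative and not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the full path to the binary if it is on PATH, otherwise None."""
    return shutil.which(binary_name)


def tesseract_version(binary_path):
    """Return the first line printed by `<binary> --version`, or '' if nothing is printed."""
    result = subprocess.run(
        [binary_path, "--version"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case a build prints the version there.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


path = find_tesseract()
if path is None:
    print("Tesseract not found on PATH; install it or set tesseract_cmd explicitly.")
else:
    print(f"Found {path}: {tesseract_version(path)}")
```

If the binary is installed but not on PATH (common on Windows), pointing `pytesseract.pytesseract.tesseract_cmd` at the executable is the usual workaround.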

Step 2: Basic Workflow Overview

Image Acquisition: Capture or load a receipt image.
Preprocessing: Improve image quality (grayscale, thresholding, noise removal).
Text Extraction: Use Tesseract OCR to convert image to text.
Postprocessing: Clean and structure the extracted data.

Step 3: Sample Code

python
import cv2
import pytesseract
from PIL import Image
import numpy as np

# Path to the Tesseract executable (update this if needed)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Windows example

def preprocess_image(image_path):
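    # Load image in grayscale; cv2.imread returns None if the file cannot be read
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")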

    # Resize to improve OCR accuracy (optional)
    scale_percent = 150
    width = int(image.shape[1] * scale_percent / 100)
    height = int(image.shape[0] * scale_percent / 100)
    dim = (width, height)
    # INTER_CUBIC suits enlarging; INTER_AREA is intended for shrinking
    resized = cv2.resize(image, dim, interpolation=cv2.INTER_CUBIC)

    # Apply Gaussian blur to reduce noise
    blurred = cv2.GaussianBlur(resized, (5,5), 0)

    # Adaptive thresholding to a binary image; THRESH_BINARY_INV makes text white on black
    thresh = cv2.adaptiveThreshold(blurred, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)

    # Morphological operations to remove small noise
    kernel = np.ones((2,2), np.uint8)
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)

    return opening

def extract_text_from_image(preprocessed_image):
    # Invert back so Tesseract sees dark text on a light background
    inverted = cv2.bitwise_not(preprocessed_image)

    # Convert back to PIL Image for pytesseract
    pil_img = Image.fromarray(inverted)

    # OCR extraction
    custom_config = r'--oem 3 --psm 6'  # Default engine mode; assume a single uniform block of text
    text = pytesseract.image_to_string(pil_img, config=custom_config)

    return text

def main():
    image_path = 'receipt.jpg'  # Path to your receipt image

    processed_image = preprocess_image(image_path)
    text = extract_text_from_image(processed_image)

    print("Extracted Text:")
    print(text)

if __name__ == "__main__":
    main()

Explanation:

Preprocessing improves the clarity and contrast of the receipt, helping Tesseract perform better.
Adaptive thresholding turns the image into a high-contrast black-and-white image.
Morphological opening removes small noise pixels.
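--oem 3 selects the default engine mode, which uses whatever engine is available (normally the LSTM engine in Tesseract 4 and later).
--psm 6 tells Tesseract to assume a single uniform block of text.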
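
To see why adaptive thresholding helps with unevenly lit receipts, here is a toy pure-Python version of the mean variant. It is a simplification of what OpenCV does: the sample code uses the Gaussian-weighted variant with inverted output, and OpenCV treats image borders differently from the clipping used here. The pixel values are made up for illustration.

```python
def global_threshold(pixels, threshold=128):
    """Compare every pixel against one fixed threshold."""
    return [[255 if value > threshold else 0 for value in row] for row in pixels]


def adaptive_mean_threshold(pixels, block_size=3, c=2):
    """Compare each pixel against the mean of its neighbourhood minus c."""
    height, width = len(pixels), len(pixels[0])
    radius = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood clipped at the image edges (an assumption of this sketch).
            neighbours = [
                pixels[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_threshold = sum(neighbours) / len(neighbours) - c
            row.append(255 if pixels[y][x] > local_threshold else 0)
        result.append(row)
    return result


# One row of a receipt that gets darker from left to right.
# Positions 2 and 6 are "ink"; everything else is paper.
row_image = [[200, 190, 120, 170, 160, 150, 80, 130, 120, 110]]

print(global_threshold(row_image))
# [[255, 255, 0, 255, 255, 255, 0, 255, 0, 0]]
print(adaptive_mean_threshold(row_image, block_size=3, c=10))
# [[255, 255, 0, 255, 255, 255, 0, 255, 255, 255]]
```

The fixed threshold marks the shadowed paper at the right edge as ink, while the local comparison keeps only the two ink pixels.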

Step 4: Improving Accuracy

Crop the receipt area from a larger photo using contour detection or manual cropping.
Use Tesseract’s --psm modes tailored to the receipt layout (e.g., --psm 4 for a single column of text of variable sizes).
Fine-tune Tesseract on custom training data if your receipts use unusual fonts.
Try alternatives such as EasyOCR or the Google Cloud Vision API, which may give better results on some receipts.
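
One way to explore the page segmentation modes mentioned above is to run several and keep the result Tesseract is most confident about. This is a rough heuristic, not an established method: mean word confidence does not guarantee correct text. The helper names (`mean_confidence`, `run_tesseract`, `pick_best_psm`) and the default mode list are assumptions for this sketch; `image_to_string`, `image_to_data`, and `Output.DICT` are the pytesseract calls it relies on.

```python
def mean_confidence(confidences):
    """Average word confidences, skipping the -1 entries reported for non-word boxes."""
    valid = [float(c) for c in confidences if float(c) >= 0]
    return sum(valid) / len(valid) if valid else 0.0


def run_tesseract(image, psm):
    """Run Tesseract with one page segmentation mode; return (text, word confidences)."""
    import pytesseract  # imported here so the other helpers work without it installed

    config = f"--oem 3 --psm {psm}"
    text = pytesseract.image_to_string(image, config=config)
    data = pytesseract.image_to_data(
        image, config=config, output_type=pytesseract.Output.DICT
    )
    return text, data["conf"]


def pick_best_psm(image, psm_modes=(4, 6, 11), run_ocr=run_tesseract):
    """Try each mode and return (psm, text, score) for the highest mean confidence."""
    best = None
    for psm in psm_modes:
        text, confidences = run_ocr(image, psm)
        score = mean_confidence(confidences)
        if best is None or score > best[2]:
            best = (psm, text, score)
    return best
```

With the sample code from Step 3, a call might look like `psm, text, score = pick_best_psm(cv2.bitwise_not(processed_image))`. The `run_ocr` parameter exists so the selection logic can be exercised with a stub instead of a real OCR run.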

Step 5: Structuring Extracted Data

After getting raw text, parse the text lines to extract key details such as:

Store name
Date and time
Items purchased and prices
Total amount

This can be done using regex and string matching.
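
A minimal sketch of such a parser follows. It assumes a simple layout (store name on the first line, one item per line ending in a price, a line starting with "Total"), which many real receipts will not match. The sample text, the patterns, and the `parse_receipt` name are all made up for illustration.

```python
import re
from decimal import Decimal

PRICE = r"(\d+\.\d{2})"
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")
TOTAL_RE = re.compile(r"^(?:grand\s+)?total\b[^\d]*" + PRICE + r"$", re.IGNORECASE)
SKIP_RE = re.compile(r"^(sub\s*total|tax|cash|change)\b", re.IGNORECASE)
ITEM_RE = re.compile(r"^(.+?)\s+\$?" + PRICE + r"$")


def parse_receipt(text):
    """Pull store name, date, items, and total out of raw OCR text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    receipt = {
        "store": lines[0] if lines else None,  # assumes the first line is the store name
        "date": None,
        "items": [],
        "total": None,
    }
    for line in lines[1:]:
        date_match = DATE_RE.search(line)
        if date_match and receipt["date"] is None:
            receipt["date"] = date_match.group(1)
            continue
        total_match = TOTAL_RE.match(line)
        if total_match:
            receipt["total"] = Decimal(total_match.group(1))
            continue
        if SKIP_RE.match(line):
            continue  # subtotal, tax, and payment lines are not purchased items
        item_match = ITEM_RE.match(line)
        if item_match:
            receipt["items"].append((item_match.group(1), Decimal(item_match.group(2))))
    return receipt


sample_text = """
CORNER MARKET
2024-03-15 14:32
Milk 2L        3.49
Bread          2.50
Subtotal       5.99
Tax            0.48
TOTAL         $6.47
"""

parsed = parse_receipt(sample_text)
print(parsed["store"], parsed["date"], parsed["total"])
for name, price in parsed["items"]:
    print(f"{name}: {price}")
```

Prices are kept as `Decimal` rather than `float` to avoid rounding surprises when summing amounts. OCR output often confuses characters such as `O` and `0` or `,` and `.`, so a real parser would likely need a cleanup pass before these patterns are applied.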

