The Palos Publishing Company


Convert recipe PDFs into structured JSON

This guide outlines an approach for converting recipe PDFs into structured JSON. It covers the fields typically extracted, an example of the resulting JSON, and the steps needed to implement the conversion.


Key Fields to Extract from Recipe PDFs:

  • title: Name of the recipe

  • description: Short summary or introduction

  • ingredients: List of ingredients with quantity and unit

  • instructions: Step-by-step cooking directions

  • prep_time: Preparation time

  • cook_time: Cooking time

  • total_time: Total time (prep + cook)

  • servings: Number of servings

  • notes: Additional tips or notes (optional)
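As a rough illustration, the fields listed above could be modeled with Python dataclasses. This is only a sketch: the class names `Ingredient` and `Recipe`, the `to_json` helper, and the choice of which fields are optional are assumptions made for this example, not part of any standard recipe schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Ingredient:
    name: str
    # None when the recipe gives no amount, e.g. "salt to taste".
    quantity: Optional[float] = None
    unit: Optional[str] = None


@dataclass
class Recipe:
    title: str
    description: str = ""
    ingredients: List[Ingredient] = field(default_factory=list)
    instructions: List[str] = field(default_factory=list)
    # Times are kept as strings here, e.g. "10 minutes".
    prep_time: Optional[str] = None
    cook_time: Optional[str] = None
    total_time: Optional[str] = None
    servings: Optional[int] = None
    notes: Optional[str] = None  # optional tips

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so each
        # Ingredient becomes a plain dict before serialization.
        return json.dumps(asdict(self), indent=2)
```

Calling `Recipe(title="Classic Pancakes", servings=4).to_json()` would produce a JSON object with every field present and the unset ones written as `null` or empty lists.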


Example JSON Structure for a Recipe

```json
{
  "title": "Classic Pancakes",
  "description": "Fluffy and light pancakes perfect for breakfast.",
  "ingredients": [
    { "name": "all-purpose flour", "quantity": 1.5, "unit": "cups" },
    { "name": "baking powder", "quantity": 3.5, "unit": "teaspoons" },
    { "name": "salt", "quantity": 1, "unit": "teaspoon" },
    { "name": "white sugar", "quantity": 1, "unit": "tablespoon" },
    { "name": "milk", "quantity": 1.25, "unit": "cups" },
    { "name": "egg", "quantity": 1, "unit": "large" },
    { "name": "butter, melted", "quantity": 3, "unit": "tablespoons" }
  ],
  "instructions": [
    "In a large bowl, sift together the flour, baking powder, salt, and sugar.",
    "Make a well in the center and pour in the milk, egg, and melted butter; mix until smooth.",
    "Heat a lightly oiled griddle or frying pan over medium high heat.",
    "Pour or scoop the batter onto the griddle, using approximately 1/4 cup for each pancake.",
    "Brown on both sides and serve hot."
  ],
  "prep_time": "10 minutes",
  "cook_time": "15 minutes",
  "total_time": "25 minutes",
  "servings": 4,
  "notes": "Add blueberries or chocolate chips for a twist."
}
```

How to Implement PDF to JSON Conversion

  1. Extract text from PDF
    Use libraries like PyMuPDF (fitz), pdfplumber, or pdfminer.six to extract raw text from the PDF.

  2. Parse the text
    Identify and isolate recipe sections:

    • Title usually at the top or bolded

    • Ingredients often in a bulleted or numbered list with quantities

    • Instructions typically numbered or stepwise paragraphs

    • Times and servings may be listed under metadata or a separate section

  3. Use Regex or NLP
    Apply regular expressions or NLP techniques to:

    • Extract quantities and units from ingredients

    • Split instructions into steps

    • Detect time formats (e.g., “10 minutes”, “1 hour”)

    • Extract servings (e.g., “Serves 4”)

  4. Structure the data
    Assemble extracted data into a JSON object as shown above.
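Steps 1 and 2 above might look roughly like the following. This is a minimal sketch that assumes pdfplumber is installed and that the PDF has a text layer (a scanned PDF would need OCR first). The `SECTION_HEADINGS` table and the helper names `extract_text` and `split_sections` are hypothetical; real recipe PDFs use many different headings and layouts, so the table will likely need adjusting.

```python
from typing import Dict, List

# Hypothetical heading names mapped to JSON fields; extend as needed.
SECTION_HEADINGS = {
    "ingredients": "ingredients",
    "instructions": "instructions",
    "directions": "instructions",
    "method": "instructions",
    "notes": "notes",
}


def extract_text(pdf_path: str) -> str:
    """Return the text of every page, joined with newlines."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page with no text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def split_sections(text: str) -> Dict[str, List[str]]:
    """Group non-empty lines under the most recent recognized heading.

    Lines before the first heading are collected under "header";
    by convention the first of them can be treated as the title.
    """
    sections: Dict[str, List[str]] = {"header": []}
    current = "header"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key = SECTION_HEADINGS.get(line.rstrip(":").lower())
        if key:
            current = key
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections
```

A call such as `split_sections(extract_text("pancakes.pdf"))` (the file name is illustrative) would return a dictionary whose `"ingredients"` and `"instructions"` entries hold the raw lines for further parsing.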

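For steps 3 and 4, a regex-based parser could be sketched as below. The unit list, the patterns, and the function names (`parse_ingredient`, `parse_metadata`, `parse_steps`, `build_recipe`) are all assumptions chosen for illustration. They handle simple lines such as "1 1/2 cups flour" or "Serves 4", and would need extending for ranges, Unicode fractions, or unusual layouts.

```python
import json
import re
from typing import Dict, List

# Small illustrative unit list; real recipes need a longer one.
UNITS = (r"cups?|tablespoons?|tbsp|teaspoons?|tsp|grams?|g|kg|ml|l|"
         r"oz|ounces?|pounds?|lbs?|large|medium|small")

INGREDIENT_RE = re.compile(
    rf"^\s*(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s+"
    rf"(?:(?P<unit>{UNITS})\b\.?\s+)?(?P<name>.+)$",
    re.IGNORECASE,
)
TIME_RE = re.compile(
    r"\b(?P<label>prep|cook|total)(?:aration|ing)?(?:\s+time)?\s*:?\s*"
    r"(?P<value>\d+\s*(?:hours?|hrs?|minutes?|mins?)"
    r"(?:\s+\d+\s*(?:minutes?|mins?))?)",
    re.IGNORECASE,
)
SERVINGS_RE = re.compile(r"(?:serves|servings|yield)\s*:?\s*(\d+)", re.IGNORECASE)
STEP_PREFIX_RE = re.compile(r"^\s*(?:step\s*)?\d+[.):]\s*", re.IGNORECASE)


def parse_quantity(text: str) -> float:
    """Convert "2", "1.5", "3/4", or "1 1/2" to a float."""
    total = 0.0
    for part in text.split():
        if "/" in part:
            num, den = part.split("/")
            total += int(num) / int(den)
        else:
            total += float(part)
    return total


def parse_ingredient(line: str) -> Dict[str, object]:
    line = re.sub(r"^\s*[•\-*]\s*", "", line)  # drop a leading bullet
    m = INGREDIENT_RE.match(line)
    if not m:
        # No leading quantity, e.g. "salt to taste".
        return {"name": line.strip(), "quantity": None, "unit": None}
    unit = m.group("unit")
    return {
        "name": m.group("name").strip(),
        "quantity": parse_quantity(m.group("qty")),
        "unit": unit.lower() if unit else None,
    }


def parse_metadata(text: str) -> Dict[str, object]:
    """Find prep/cook/total times and servings in free text."""
    meta: Dict[str, object] = {}
    for m in TIME_RE.finditer(text):
        meta[f"{m.group('label').lower()}_time"] = " ".join(m.group("value").split())
    s = SERVINGS_RE.search(text)
    if s:
        meta["servings"] = int(s.group(1))
    return meta


def parse_steps(lines: List[str]) -> List[str]:
    """Strip "1." or "Step 2:" prefixes from instruction lines."""
    return [STEP_PREFIX_RE.sub("", ln).strip() for ln in lines if ln.strip()]


def build_recipe(title: str, ingredient_lines: List[str],
                 step_lines: List[str], meta_text: str) -> str:
    """Assemble the parsed pieces into a JSON string."""
    recipe: Dict[str, object] = {
        "title": title,
        "ingredients": [parse_ingredient(ln) for ln in ingredient_lines],
        "instructions": parse_steps(step_lines),
    }
    recipe.update(parse_metadata(meta_text))
    return json.dumps(recipe, indent=2)
```

Regular expressions keep the sketch dependency-free, but they are brittle. For PDFs with inconsistent formatting, an NLP-based approach may cope better, at the cost of more setup.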

