Convert recipe PDFs into structured JSON

This post outlines an approach to converting recipe PDFs into structured JSON: the fields to extract, an example of the resulting JSON, and the steps to implement the conversion.

Key Fields to Extract from Recipe PDFs:

title: Name of the recipe
description: Short summary or introduction
ingredients: List of ingredients, each with a name, quantity, and unit
instructions: Step-by-step cooking directions
prep_time: Preparation time
cook_time: Cooking time
total_time: Total time (prep + cook)
servings: Number of servings
notes: Additional tips or notes (optional)
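
The field list above can be sketched as typed Python records. This is a minimal sketch, not a fixed schema: the class names `Ingredient` and `Recipe` are hypothetical, and treating everything except `title` as optional is an assumption (only `notes` is marked optional in the list).

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Ingredient:
    name: str
    quantity: Optional[float] = None  # None when no amount is given, e.g. "salt to taste"
    unit: Optional[str] = None


@dataclass
class Recipe:
    title: str
    description: str = ""
    ingredients: List[Ingredient] = field(default_factory=list)
    instructions: List[str] = field(default_factory=list)
    prep_time: Optional[str] = None   # kept as text, e.g. "10 minutes"
    cook_time: Optional[str] = None
    total_time: Optional[str] = None
    servings: Optional[int] = None
    notes: Optional[str] = None


recipe = Recipe(
    title="Classic Pancakes",
    ingredients=[Ingredient("all-purpose flour", 1.5, "cups")],
    servings=4,
)
record = asdict(recipe)  # nested dataclasses become plain dicts, ready for json.dumps
```

Keeping the types explicit makes it easier to spot a parser that returns, say, a string where a number is expected.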

Example JSON Structure for a Recipe

{
  "title": "Classic Pancakes",
  "description": "Fluffy and light pancakes perfect for breakfast.",
  "ingredients": [
    {
      "name": "all-purpose flour",
      "quantity": 1.5,
      "unit": "cups"
    },
    {
      "name": "baking powder",
      "quantity": 3.5,
      "unit": "teaspoons"
    },
    {
      "name": "salt",
      "quantity": 1,
      "unit": "teaspoon"
    },
    {
      "name": "white sugar",
      "quantity": 1,
      "unit": "tablespoon"
    },
    {
      "name": "milk",
      "quantity": 1.25,
      "unit": "cups"
    },
    {
      "name": "egg",
      "quantity": 1,
      "unit": "large"
    },
    {
      "name": "butter, melted",
      "quantity": 3,
      "unit": "tablespoons"
    }
  ],
  "instructions": [
    "In a large bowl, sift together the flour, baking powder, salt, and sugar.",
    "Make a well in the center and pour in the milk, egg, and melted butter; mix until smooth.",
    "Heat a lightly oiled griddle or frying pan over medium-high heat.",
    "Pour or scoop the batter onto the griddle, using approximately 1/4 cup for each pancake.",
    "Brown on both sides and serve hot."
  ],
  "prep_time": "10 minutes",
  "cook_time": "15 minutes",
  "total_time": "25 minutes",
  "servings": 4,
  "notes": "Add blueberries or chocolate chips for a twist."
}
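
A record in this shape can be sanity-checked before it is stored. The sketch below is one possible check, assuming that `title`, `ingredients`, and `instructions` are the required fields; the function name `validate_recipe` is hypothetical.

```python
import json

# Which fields are required is an assumption; adjust to your own rules.
REQUIRED_FIELDS = {
    "title": str,
    "ingredients": list,
    "instructions": list,
}


def validate_recipe(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    ingredients = record.get("ingredients")
    if isinstance(ingredients, list):
        for i, item in enumerate(ingredients):
            if not isinstance(item, dict) or not item.get("name"):
                problems.append(f"ingredient {i} has no name")
    return problems


sample = json.loads(
    '{"title": "Classic Pancakes",'
    ' "ingredients": [{"name": "milk", "quantity": 1.25, "unit": "cups"}],'
    ' "instructions": ["Mix until smooth."]}'
)
print(validate_recipe(sample))            # []
print(validate_recipe({"title": "x"}))    # two "missing field" messages
```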

How to Implement PDF to JSON Conversion

Extract text from PDF
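Use a library such as PyMuPDF (imported as fitz), pdfplumber, or pdfminer.six to extract the raw text from the PDF. These libraries read the PDF's text layer, so scanned or image-only PDFs need OCR first.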
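
As a rough illustration of this step, the sketch below uses pdfplumber (one of the libraries named above) through its `pdfplumber.open` and `page.extract_text` calls. The helper `normalize_text` and its cleanup rules are assumptions about what extracted recipe text tends to need, not part of any library.

```python
import re


def extract_text(pdf_path: str) -> str:
    """Return the text of every page, with pages joined by newlines."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for a page with no text layer
        pages = [page.extract_text() or "" for page in pdf.pages]
    return "\n".join(pages)


def normalize_text(raw: str) -> str:
    """Undo common extraction artifacts before parsing (hypothetical helper)."""
    text = raw.replace("\r\n", "\n")
    # Re-join words split by a hyphen at a line end. Caveat: this also merges
    # genuinely hyphenated words that happen to break exactly at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line in a row
    return text.strip()
```

Typical use would be `text = normalize_text(extract_text("recipe.pdf"))`, where the file name is a placeholder.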
Parse the text
Identify and isolate recipe sections:
- The title usually appears at the top of the first page, often in a larger or bold font
- Ingredients are often a bulleted or numbered list, with each line starting with a quantity
- Instructions are typically numbered steps or stepwise paragraphs
- Times and servings may appear in a metadata block or a separate section
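
One simple way to isolate these sections is to scan for heading lines and group the lines that follow each one. The sketch below assumes the PDF uses plain headings such as "Ingredients" or "Directions"; the heading keywords and the function name `split_sections` are hypothetical and will need adjusting per cookbook.

```python
import re

# Heading keywords are assumptions; extend them for the documents you process.
SECTION_HEADINGS = {
    "ingredients": re.compile(r"^\s*ingredients\s*:?\s*$", re.IGNORECASE),
    "instructions": re.compile(
        r"^\s*(instructions|directions|method|preparation)\s*:?\s*$", re.IGNORECASE
    ),
    "notes": re.compile(r"^\s*(notes?|tips?)\s*:?\s*$", re.IGNORECASE),
}


def split_sections(text: str) -> dict:
    """Group lines under the most recent heading.

    Lines that appear before any heading (title, times, servings) go to "header".
    """
    sections = {"header": []}
    current = "header"
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for name, pattern in SECTION_HEADINGS.items():
            if pattern.match(line):
                current = name
                sections.setdefault(name, [])
                break
        else:
            # no heading matched: the line belongs to the current section
            sections[current].append(line.strip())
    return sections


sample_text = (
    "Classic Pancakes\nServes 4\n\n"
    "Ingredients\n1 cup milk\n1 egg\n\n"
    "Directions:\n1. Mix.\n2. Cook."
)
sections = split_sections(sample_text)
title = sections["header"][0]  # first non-blank line is taken as the title
```

Taking the first header line as the title is a heuristic; PDFs with a running page header or a cookbook name above the recipe will need a different rule.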
Use regex or NLP
Apply regular expressions or NLP techniques to:
- Extract quantities and units from ingredients
- Split instructions into steps
- Detect time formats (e.g., “10 minutes”, “1 hour”)
- Extract servings (e.g., “Serves 4”)
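
The four extractions above can be sketched with regular expressions alone. Everything here is illustrative: the unit list, the patterns, and the function names are assumptions, and Unicode fractions such as ½ or ranges such as "2-3 cups" are not handled. Note that `parse_minutes` normalizes times to a number, whereas the example JSON keeps them as text; either choice works as long as it is consistent.

```python
import re
from fractions import Fraction
from typing import Optional

# A small, assumed unit vocabulary; real recipes will need more entries.
UNITS = r"cups?|tablespoons?|tbsp|teaspoons?|tsp|ounces?|oz|pounds?|lbs?|grams?|kg|g|ml|l"
INGREDIENT_RE = re.compile(
    rf"^\s*(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*"
    rf"(?:(?P<unit>{UNITS})\.?\s+)?(?P<name>.+)$",
    re.IGNORECASE,
)
TIME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(hours?|hrs?|h|minutes?|mins?|m)\b", re.IGNORECASE)
SERVINGS_RE = re.compile(r"(?:serves|servings|yield|makes)\s*:?\s*(\d+)", re.IGNORECASE)


def parse_quantity(text: str) -> float:
    """Convert '1.5', '3/4', or '1 1/2' to a float."""
    return float(sum(Fraction(part) for part in text.split()))


def parse_ingredient(line: str) -> dict:
    """Split an ingredient line into name, quantity, and unit."""
    line = re.sub(r"^\s*[-*\u2022]\s*", "", line)  # drop a leading bullet
    match = INGREDIENT_RE.match(line)
    if not match:
        # no leading number: keep the whole line as the name
        return {"name": line.strip(), "quantity": None, "unit": None}
    unit = match.group("unit")
    return {
        "name": match.group("name").strip(),
        "quantity": parse_quantity(match.group("qty")),
        "unit": unit.lower() if unit else None,
    }


def clean_step(line: str) -> str:
    """Remove a leading '1.', '2)' or 'Step 3:' marker from an instruction line."""
    return re.sub(r"^\s*(?:step\s*)?\d+\s*[.):]\s*", "", line, flags=re.IGNORECASE).strip()


def parse_minutes(text: str) -> Optional[int]:
    """Sum every hour/minute amount found in the text, in minutes."""
    amounts = TIME_RE.findall(text)
    if not amounts:
        return None
    return round(sum(float(n) * (60 if u.lower().startswith("h") else 1) for n, u in amounts))


def parse_servings(text: str) -> Optional[int]:
    """Find the number after 'Serves', 'Servings', 'Yield', or 'Makes'."""
    match = SERVINGS_RE.search(text)
    return int(match.group(1)) if match else None


print(parse_ingredient("1 1/2 cups all-purpose flour"))
print(parse_minutes("1 hour 15 minutes"))  # 75
print(parse_servings("Serves 4"))          # 4
```

Regex parsing like this tends to break on free-form lines ("a pinch of salt", "juice of one lemon"), which is where an NLP approach or manual review becomes worthwhile.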
Structure the data
Assemble extracted data into a JSON object as shown above.
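
The assembly step can be as small as building a dictionary and serializing it with the standard `json` module. In this sketch, `build_recipe` is a hypothetical helper, and leaving out optional fields that were not found (rather than writing `null`) is a design assumption.

```python
import json


def build_recipe(title, ingredients, instructions, **metadata) -> dict:
    """Assemble parsed pieces into the field layout of the example JSON."""
    recipe = {
        "title": title,
        "description": metadata.get("description", ""),
        "ingredients": ingredients,
        "instructions": instructions,
    }
    for key in ("prep_time", "cook_time", "total_time", "servings", "notes"):
        if metadata.get(key) is not None:
            recipe[key] = metadata[key]  # omit optional fields that were not found
    return recipe


recipe = build_recipe(
    "Classic Pancakes",
    [{"name": "milk", "quantity": 1.25, "unit": "cups"}],
    ["Mix the batter.", "Cook on a hot griddle."],
    prep_time="10 minutes",
    servings=4,
)
# ensure_ascii=False keeps characters such as "é" or "½" readable in the output
output = json.dumps(recipe, indent=2, ensure_ascii=False)
print(output)
```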
