Hosting Hugging Face Models with FastAPI involves creating a lightweight and efficient API that serves machine learning models for inference. This approach allows developers to deploy transformer-based models or other Hugging Face models in a scalable and customizable way, providing endpoints for predictions that can be integrated into various applications.
Setting Up the Environment
To start, you need Python installed along with FastAPI and the Hugging Face Transformers library. Install these packages using pip:
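For example (torch is assumed here as the model backend; TensorFlow works as well):

```bash
pip install fastapi uvicorn transformers torch
```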
uvicorn is an ASGI server that runs FastAPI applications efficiently.
Loading the Hugging Face Model
Hugging Face provides a wide variety of pre-trained models accessible via the transformers library. You can load a model and its tokenizer easily:
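A minimal sketch, assuming the commonly used distilbert-base-uncased-finetuned-sst-2-english checkpoint (any sequence-classification checkpoint would work):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # switch to inference mode
```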
This example loads a sentiment analysis model fine-tuned on the SST-2 dataset.
Creating the FastAPI App
Start by importing FastAPI and the necessary libraries:
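One possible set of imports, assuming the model-loading code above sits in the same file:

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
```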
Define a request model for input validation:
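For instance (TextRequest is an illustrative name):

```python
class TextRequest(BaseModel):
    text: str
```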
Writing the Prediction Endpoint
Implement an endpoint that takes a piece of text, processes it, and returns model predictions:
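A sketch of such an endpoint, assuming the model, tokenizer, and TextRequest defined above:

```python
@app.post("/predict")
def predict(request: TextRequest):
    # Tokenize the input and run a single forward pass without gradients
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely label
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    label_id = int(torch.argmax(probs))
    return {
        "label": model.config.id2label[label_id],
        "score": round(float(probs[label_id]), 4),
    }
```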
Running the API Server
Run the FastAPI app with uvicorn:
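```bash
uvicorn main:app --reload
```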
Here, main refers to the Python file name (main.py), and --reload enables auto-reloading during development.
Testing the Endpoint
You can test your API using curl or tools like Postman. Example using curl:
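Assuming the /predict endpoint sketched above is running locally on uvicorn's default port:

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
     -H "Content-Type: application/json" \
     -d '{"text": "FastAPI makes serving models easy!"}'
```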
Expected response:
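With the sentiment model above, the response would look something like this (the exact score will vary):

```json
{"label": "POSITIVE", "score": 0.9998}
```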
Extending to Other Model Types
FastAPI with Hugging Face isn’t limited to text classification. You can host models for:
- Text generation (e.g., GPT-2, GPT-Neo)
- Question answering (e.g., BERT, RoBERTa)
- Named entity recognition
- Translation
- Summarization
The process involves changing the model type and adapting the input/output processing accordingly.
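As a hedged illustration, a question-answering endpoint might use the pipeline API with a checkpoint such as deepset/roberta-base-squad2 (the endpoint name, request model, and checkpoint are all assumptions, not fixed choices):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Any extractive QA checkpoint works here; this one is just an example
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

class QARequest(BaseModel):
    question: str
    context: str

@app.post("/qa")
def answer(request: QARequest):
    # The pipeline returns the answer span plus a confidence score
    result = qa_pipeline(question=request.question, context=request.context)
    return {"answer": result["answer"], "score": round(result["score"], 4)}
```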
Handling Batch Requests and Performance Optimization
For production scenarios, consider:
- Batch processing multiple inputs to reduce overhead
- Using GPU acceleration (if available) by moving models and inputs to CUDA devices
- Adding caching mechanisms for repeated requests
- Deploying behind a load balancer for scaling
Example of Batch Prediction Endpoint
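A sketch that reuses the model and tokenizer from earlier; padding lets the tokenizer pack texts of different lengths into a single batched forward pass:

```python
from typing import List

class BatchRequest(BaseModel):
    texts: List[str]

@app.post("/predict_batch")
def predict_batch(request: BatchRequest):
    # Tokenize all texts together so the model runs one batched forward pass
    inputs = tokenizer(
        request.texts, return_tensors="pt", padding=True, truncation=True
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # One row of probabilities per input text
    probs = torch.softmax(outputs.logits, dim=-1)
    results = []
    for row in probs:
        label_id = int(torch.argmax(row))
        results.append(
            {
                "label": model.config.id2label[label_id],
                "score": round(float(row[label_id]), 4),
            }
        )
    return {"results": results}
```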
Deployment Considerations
- Containerize your FastAPI app using Docker for easy deployment.
- Use cloud platforms like AWS, GCP, or Azure for scalable serving.
- Integrate authentication and rate limiting if exposing the API publicly.
- Monitor performance and logs for reliability.
Hosting Hugging Face models with FastAPI provides a powerful, flexible way to build ML-powered APIs tailored to your specific needs. It combines the simplicity and speed of FastAPI with the extensive model ecosystem of Hugging Face to enable rapid deployment of state-of-the-art NLP applications.