
Hosting Hugging Face Models with FastAPI

Hosting Hugging Face models with FastAPI means building a lightweight, efficient API that serves machine learning models for inference. This approach lets developers deploy transformer-based models (or any other Hugging Face model) in a scalable, customizable way, exposing prediction endpoints that can be integrated into other applications.

Setting Up the Environment

To start, you need Python installed along with FastAPI and the Hugging Face Transformers library. Install these packages using pip:

bash
pip install fastapi uvicorn transformers

uvicorn is an ASGI server that runs FastAPI applications efficiently.

Loading the Hugging Face Model

Hugging Face provides a large variety of pre-trained models accessible via the transformers library. You can load a model and its tokenizer easily:

python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

This example loads a sentiment analysis model fine-tuned on the SST-2 dataset.
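Before wiring the model into an API, it helps to run a quick local sanity check. The following is a minimal sketch with an arbitrary sample sentence; it simply prints the predicted label:

python
import torch

# Quick local check that the model and tokenizer load and predict correctly
sample = "FastAPI makes model serving straightforward."
encoded = tokenizer(sample, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded).logits
predicted = logits.argmax(dim=1).item()
print(model.config.id2label[predicted])  # e.g. POSITIVE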

Creating the FastAPI App

Start by importing FastAPI and the necessary libraries:

python
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()

Define a request model for input validation:

python
class TextInput(BaseModel):
    text: str

Writing the Prediction Endpoint

Implement an endpoint that takes a piece of text, processes it, and returns model predictions:

python
@app.post("/predict") async def predict(input: TextInput): inputs = tokenizer(input.text, return_tensors="pt", truncation=True, padding=True) with torch.no_grad(): outputs = model(**inputs) logits = outputs.logits probabilities = torch.nn.functional.softmax(logits, dim=1) confidence, predicted_class = torch.max(probabilities, dim=1) label = model.config.id2label[predicted_class.item()] return { "label": label, "confidence": confidence.item() }

Running the API Server

Run the FastAPI app with uvicorn:

bash
uvicorn main:app --reload

main refers to the Python file name (main.py), and --reload enables auto-reloading during development.
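You can also start the server programmatically from the bottom of main.py; this optional sketch uses uvicorn's Python API with an assumed host and port:

python
import uvicorn

if __name__ == "__main__":
    # Equivalent to `uvicorn main:app --reload`, bound to port 8000
    uvicorn.run("main:app", host="127.0.0.1", port=8000, reload=True)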

Testing the Endpoint

You can test your API using curl or tools like Postman. Example using curl:

bash
curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"text":"I love using FastAPI with Hugging Face models!"}'

Expected response:

json
{ "label": "POSITIVE", "confidence": 0.9998 }

Extending to Other Model Types

FastAPI with Hugging Face isn’t limited to text classification. You can host models for:

  • Text generation (e.g., GPT-2, GPT-Neo)

  • Question answering (e.g., BERT, RoBERTa)

  • Named entity recognition

  • Translation

  • Summarization

The process involves changing the model type and adapting the input/output processing accordingly.
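As one illustration, a summarization endpoint can be built with the transformers pipeline helper. This is a sketch rather than part of the sentiment example above; the model name and endpoint path are illustrative choices you can adapt:

python
from transformers import pipeline

# Illustrative summarization endpoint; model choice and path are assumptions
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

class SummaryInput(BaseModel):
    text: str

@app.post("/summarize")
async def summarize(input: SummaryInput):
    summary = summarizer(input.text, max_length=120, min_length=20, do_sample=False)
    return {"summary": summary[0]["summary_text"]}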

Handling Batch Requests and Performance Optimization

For production scenarios, consider:

  • Batch processing multiple inputs to reduce overhead

  • Using GPU acceleration (if available) by moving models and inputs to CUDA devices (a sketch appears after the batch example below)

  • Adding caching mechanisms for repeated requests

  • Deploying behind a load balancer for scaling

Example of Batch Prediction Endpoint

python
from typing import List

class BatchInput(BaseModel):
    texts: List[str]

@app.post("/batch_predict")
async def batch_predict(inputs: BatchInput):
    encoded = tokenizer(inputs.texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**encoded)
    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    results = []
    for prob in probabilities:
        confidence, predicted_class = torch.max(prob, dim=0)
        label = model.config.id2label[predicted_class.item()]
        results.append({"label": label, "confidence": confidence.item()})
    return {"results": results}
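For the GPU point above, the adaptation is mainly a matter of moving the model and input tensors to the same device. The following is a minimal sketch reusing the model, tokenizer, and TextInput defined earlier; the /predict_gpu path is illustrative:

python
import torch

# Pick a device once at startup; falls back to CPU if CUDA is unavailable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

@app.post("/predict_gpu")
async def predict_gpu(input: TextInput):
    inputs = tokenizer(input.text, return_tensors="pt", truncation=True, padding=True)
    # Move every input tensor to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    confidence, predicted_class = torch.max(probabilities, dim=1)
    return {
        "label": model.config.id2label[predicted_class.item()],
        "confidence": confidence.item(),
    }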

Deployment Considerations

  • Containerize your FastAPI app using Docker for easy deployment.

  • Use cloud platforms like AWS, GCP, or Azure for scalable serving.

  • Integrate authentication and rate limiting if exposing the API publicly (a simple API-key sketch follows this list).

  • Monitor performance and logs for reliability.
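As a simple example of the authentication point, FastAPI dependencies can enforce an API-key header. The header name, key value, and /secure_predict path below are placeholders; in practice the key would come from configuration:

python
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

# Simple API-key check implemented as a FastAPI dependency
api_key_header = APIKeyHeader(name="X-API-Key")
EXPECTED_API_KEY = "change-me"  # placeholder; load from an environment variable in practice

def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key != EXPECTED_API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/secure_predict", dependencies=[Depends(verify_api_key)])
async def secure_predict(input: TextInput):
    # Same prediction logic as /predict, now behind the API-key check
    inputs = tokenizer(input.text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    confidence, predicted_class = torch.max(probabilities, dim=1)
    return {
        "label": model.config.id2label[predicted_class.item()],
        "confidence": confidence.item(),
    }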


Hosting Hugging Face models with FastAPI provides a powerful, flexible way to build ML-powered APIs tailored to your specific needs. It combines the simplicity and speed of FastAPI with the extensive model ecosystem of Hugging Face to enable rapid deployment of state-of-the-art NLP applications.
