The Palos Publishing Company

How to set performance thresholds for deployed ML models

Setting performance thresholds for deployed machine learning (ML) models is crucial for ensuring that the models continue to meet business requirements and provide accurate, reliable predictions in production. Here are steps you can follow to define appropriate performance thresholds for your deployed ML models:

1. Define Business Metrics

Before setting technical thresholds, it’s important to first link the model’s performance to business outcomes. The performance of an ML model should be evaluated based on how well it supports the business objectives. For example:

  • Conversion rate (if you’re working on a recommendation system)

  • Customer churn rate (for a classification model predicting churn)

  • Cost savings (for optimization models)

The key is to set performance goals that matter to the stakeholders and align with business needs.

2. Select Relevant Evaluation Metrics

Choose evaluation metrics based on the type of model and problem you’re solving. Common metrics include:

  • Accuracy (overall correctness of predictions)

  • Precision/Recall/F1 Score (for classification problems)

  • AUC-ROC (for binary classification)

  • Mean Squared Error (MSE), Root Mean Squared Error (RMSE) (for regression)

  • Area under the Precision-Recall Curve (PR AUC) (for imbalanced classification)

  • Latency (how long it takes to produce a prediction)

  • Throughput (the number of predictions made in a given time window)

These metrics should reflect the real-world value of the model’s predictions.
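
As a quick sanity check on the classification metrics above, the helper below computes precision, recall, and F1 directly from confusion-matrix counts; the counts themselves are made-up numbers for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.889 f1=0.842
```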

3. Establish Benchmarks from Historical Data

Using historical data (either past deployments, training sets, or benchmarks from similar models), set baseline performance values. This can include:

  • Performance on training and validation data

  • Historical model performance from previous versions or similar systems

Benchmarks should be representative of the data distribution the model will encounter in production. For example, if production data is noisy or imbalanced, baselines drawn from similarly noisy historical data are more realistic than clean training-set scores.
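
One simple way to turn historical results into a baseline is to treat past scores (from validation folds or earlier model versions) as a sample, and alert when production falls well below it. The F1 scores and the two-standard-deviation rule here are illustrative assumptions, not recommendations:

```python
import statistics

# Hypothetical F1 scores from five validation folds / past model versions.
historical_f1 = [0.87, 0.89, 0.86, 0.88, 0.90]

baseline = statistics.mean(historical_f1)
spread = statistics.stdev(historical_f1)

# Common heuristic: alert if production F1 falls more than two
# standard deviations below the historical baseline.
alert_threshold = baseline - 2 * spread
print(f"baseline={baseline:.3f}, alert below {alert_threshold:.3f}")
# baseline=0.880, alert below 0.848
```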

4. Establish a Performance SLO (Service Level Objective)

An SLO is a target level of model performance that the deployed system commits to staying above (or below, for error metrics). For instance:

  • Classification models: If the model’s F1-score drops below 0.85, trigger an alert.

  • Regression models: If MSE exceeds a certain value (say, 100), the model might need retraining or fine-tuning.

SLOs define the boundary of acceptable performance and provide a clear trigger for alerts when the model is degrading.
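
A minimal sketch of an SLO check, using the F1 floor from the example above (the metric name and floor are placeholders):

```python
def check_slo(metric_name, value, floor):
    """Return an alert message if a metric breaches its SLO floor, else None."""
    if value < floor:
        return f"ALERT: {metric_name}={value:.3f} is below SLO floor {floor}"
    return None

print(check_slo("f1_score", 0.82, 0.85))  # breach -> alert message
print(check_slo("f1_score", 0.91, 0.85))  # within SLO -> None
```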

5. Monitor Model Drift

As the model is deployed, real-world data may change, leading to model drift (changes in data distribution or relationships between input features and target variables). Common types of drift to monitor:

  • Data Drift: Changes in input feature distribution.

  • Concept Drift: Changes in the underlying relationship between input features and the output prediction.

Set thresholds for allowable drift:

  • Data drift: If a drift statistic over the input features (for example, a population stability index or a Kolmogorov-Smirnov distance) exceeds a set limit, retraining may be necessary.

  • Performance drift: If performance drops by more than a set amount over time (e.g., an F1-score decrease by 5%), trigger an investigation or retraining.
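
Data drift thresholds are usually stated in terms of a drift statistic rather than a raw percentage. One common choice is the population stability index (PSI), where values above roughly 0.2 are often treated as significant drift. A self-contained sketch with made-up samples (binning scheme and epsilon are implementation choices):

```python
import math

def psi(expected, actual, bins=5):
    """Population Stability Index between two samples of one feature.
    Bin edges are derived from the expected (training) distribution."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fracs(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        # Small epsilon avoids log(0) on empty buckets.
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = bucket_fracs(expected), bucket_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]       # roughly uniform on [0, 1)
prod = [0.5 + i / 200 for i in range(100)]  # shifted toward higher values
score = psi(train, prod)
print(f"PSI = {score:.2f}")                 # > 0.2 is a common drift alarm
```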

6. Define Latency and Throughput Requirements

For real-time applications, setting performance thresholds for latency (response time) and throughput (number of predictions per second) is critical.

  • For example, you might set a threshold that a prediction must be made in under 100ms to be considered performant.

  • Similarly, the system should handle at least 1000 requests per second (throughput) to meet production needs.

These thresholds ensure that the model remains usable and efficient in real-time environments.
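
Latency thresholds are usually enforced on a high percentile (e.g., p95 or p99) rather than the mean, since tail latency is what users notice. The latencies below are fabricated:

```python
import math

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [42, 55, 61, 48, 130, 52, 47, 95, 58, 44]

def percentile(values, pct):
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

p95 = percentile(latencies_ms, 95)
print(f"p95 latency = {p95}ms; under 100ms SLO: {p95 < 100}")
# p95 latency = 130ms; under 100ms SLO: False
```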

7. Set Alerts and Automation

Once you define the thresholds for performance metrics like accuracy, drift, latency, and throughput, implement monitoring systems with automated alerts. Useful tools include:

  • Prometheus (for system and performance metrics)

  • Grafana (for visualization)

  • TensorFlow Model Analysis (for drift detection)

  • Evidently AI (for model monitoring)

These tools will allow you to detect performance degradation early and take corrective actions like retraining, model rollback, or other updates.
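
Whichever monitoring stack you choose, the alerting logic reduces to a rule table evaluated against each metrics snapshot. A stack-agnostic sketch, where all metric names and cutoffs are illustrative:

```python
# Each rule maps a metric name to a predicate that returns True when healthy.
RULES = {
    "f1_score": lambda v: v >= 0.85,  # SLO floor
    "psi":      lambda v: v <= 0.2,   # drift ceiling
    "p95_ms":   lambda v: v <= 100,   # latency ceiling
}

def evaluate(metrics):
    """Return the names of all breached rules for one monitoring snapshot."""
    return [name for name, ok in RULES.items()
            if name in metrics and not ok(metrics[name])]

breaches = evaluate({"f1_score": 0.82, "psi": 0.1, "p95_ms": 140})
print(breaches)  # ['f1_score', 'p95_ms']
```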

8. Create a Feedback Loop for Retraining

A continuous feedback loop is essential for long-term model maintenance:

  • Retrain when performance thresholds are breached: If the model’s performance drops below predefined thresholds, retraining should be initiated, possibly with new data or model improvements.

  • Use model performance data: Collect data on how the model performs in production and use it to fine-tune the model periodically.
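
A retraining trigger can combine several of the thresholds above into a single decision; the cutoffs below are illustrative defaults, not fixed recommendations:

```python
def should_retrain(current_f1, baseline_f1, drift_psi,
                   f1_floor=0.85, psi_limit=0.2, max_drop=0.05):
    """Decide whether to trigger retraining; returns (decision, reason)."""
    if current_f1 < f1_floor:
        return True, "F1 below SLO floor"
    if baseline_f1 - current_f1 > max_drop:
        return True, "F1 dropped more than allowed from baseline"
    if drift_psi > psi_limit:
        return True, "input drift exceeds limit"
    return False, "within thresholds"

print(should_retrain(current_f1=0.86, baseline_f1=0.93, drift_psi=0.05))
# (True, 'F1 dropped more than allowed from baseline')
```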

9. Consider the Trade-off Between Precision and Recall

In certain use cases, it might be more acceptable to sacrifice precision for recall or vice versa. Define thresholds based on the risk tolerance of your business:

  • High precision: If false positives are costly or problematic (e.g., fraud detection).

  • High recall: If false negatives are more problematic (e.g., medical diagnosis).
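
The trade-off is typically controlled through the decision threshold applied to the model's scores: lowering it raises recall at the cost of precision. A sketch on a handful of made-up scored examples:

```python
# Hypothetical scored examples: (model probability, true label).
scored = [(0.95, 1), (0.80, 1), (0.70, 0), (0.60, 1),
          (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0)]

def metrics_at(threshold):
    """Precision and recall when predicting positive at `threshold`."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Lowering the decision threshold trades precision for recall.
for t in (0.75, 0.5, 0.25):
    p, r = metrics_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```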

10. Define Degradation Patterns

Predict what would cause a model’s performance to degrade. This could include:

  • Data shifts (e.g., seasonality in retail, changes in customer behavior)

  • Model performance dropping during peak load times

  • Environmental or infrastructural issues like hardware failure affecting throughput or latency

Setting expectations around how and when performance may degrade helps teams react quickly and plan mitigations in advance.

Conclusion

Setting performance thresholds for deployed ML models is an iterative process, requiring continuous monitoring and adjustments. By aligning technical metrics with business goals, monitoring for drift, defining latency and throughput requirements, and setting robust retraining and alerting strategies, you can ensure that your ML models continue to perform optimally in real-world environments.
