How to build system health indicators for data, models, and infra

Building system health indicators for data, models, and infrastructure involves tracking a range of metrics that help you assess the performance, reliability, and efficiency of the entire machine learning ecosystem. These indicators can be broken down into three key categories:

1. Data Health Indicators

Data is the foundation of any machine learning system, so it’s critical to monitor its quality and availability. Here are key indicators for data health:

Data Completeness
- Metric: Percentage of missing values in datasets.
- Threshold: Alerts when missing data exceeds a predefined threshold (e.g., 10% missing values).
- Tools: Data validation frameworks (e.g., Great Expectations).
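As a rough illustration of the completeness check, the sketch below computes the missing fraction per column with the standard library only. The function names, the row format (a list of dicts), and the 10% default are assumptions made for this example; a framework such as Great Expectations would normally replace this hand-rolled logic.

```python
import math
from typing import Any, Dict, Mapping, Sequence


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing; everything else counts as present."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(rows: Sequence[Mapping[str, Any]], column: str) -> float:
    """Fraction of rows in which `column` is absent, None, or NaN."""
    if not rows:
        # An empty batch has no missing values; emptiness is a separate check.
        return 0.0
    missing = sum(1 for row in rows if is_missing(row.get(column)))
    return missing / len(rows)


def completeness_alerts(
    rows: Sequence[Mapping[str, Any]],
    columns: Sequence[str],
    max_missing: float = 0.10,  # hypothetical threshold, mirrors the 10% example
) -> Dict[str, float]:
    """Return {column: fraction} for columns whose missing fraction exceeds max_missing."""
    fractions = {col: missing_fraction(rows, col) for col in columns}
    return {col: frac for col, frac in fractions.items() if frac > max_missing}
```

Calling `completeness_alerts(batch, ["age", "income"])` on each incoming batch yields only the columns that need attention, which keeps the resulting alert short.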
Data Consistency
- Metric: Stability of the data distribution across batches or timeframes (e.g., how much summary statistics such as the mean or category frequencies change from one batch to the next).
- Threshold: A shift in distribution beyond an acceptable range triggers an alert.
- Tools: Data validation scripts or custom monitoring solutions.
Data Freshness
- Metric: Time since the last data update.
- Threshold: Alerts if data hasn’t been updated within a specific time window (e.g., 24 hours).
- Tools: Automated checks based on timestamps in the data source.
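A timestamp-based freshness check can be sketched in a few lines. This assumes the data source exposes a timezone-aware "last updated" timestamp; the function names and the 24-hour default are illustrative, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def data_age(last_updated: datetime, now: Optional[datetime] = None) -> timedelta:
    """Time elapsed since the last update. Both datetimes must be timezone-aware."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated


def is_stale(
    last_updated: datetime,
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=24),  # hypothetical freshness window
) -> bool:
    """True when the data is older than the allowed freshness window."""
    return data_age(last_updated, now) > max_age
```

Passing `now` explicitly makes the check deterministic in tests; in production it can be left out so the current UTC time is used.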
Data Integrity
- Metric: Validity of values in the data (e.g., the share of records with out-of-range values or duplicate entries).
- Threshold: Alerts when integrity issues exceed a defined percentage of the total dataset.
- Tools: Custom scripts for data validation.
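One possible shape for such a custom script is shown below. It counts records that either repeat an earlier key or carry a value outside an allowed range, then reports them as a fraction of the dataset. The record layout and the parameter names (`key`, `field`, `low`, `high`) are assumptions for illustration.

```python
from typing import Any, Mapping, Sequence


def integrity_violation_rate(
    records: Sequence[Mapping[str, Any]],
    key: str,
    field: str,
    low: float,
    high: float,
) -> float:
    """Fraction of records that duplicate an earlier key or have `field` outside [low, high]."""
    if not records:
        return 0.0
    seen_keys = set()
    violations = 0
    for record in records:
        is_duplicate = record[key] in seen_keys
        seen_keys.add(record[key])
        out_of_range = not (low <= record[field] <= high)
        # A record with both problems is still counted once.
        if is_duplicate or out_of_range:
            violations += 1
    return violations / len(records)
```

The returned rate can be compared directly against the "defined percentage" threshold, for example alerting when it exceeds 0.01.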
Data Drift
- Metric: Statistical divergence between training data and current/real-time data.
- Threshold: Alerts when a drift statistic (e.g., KL divergence between the training and current distributions) exceeds a predefined value.
- Tools: Drift detection libraries (e.g., Evidently, Alibi Detect).
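To make the divergence idea concrete, here is a minimal sketch of KL divergence between a training sample and a current sample of one numeric feature, computed over shared histogram bins. The bin edges, the smoothing constant, and the direction KL(current || training) are assumptions chosen for this example; libraries such as Evidently or Alibi Detect offer more robust tests.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per bin [edges[i], edges[i+1]); out-of-range values go to the end bins."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        idx = 0
        while idx < len(counts) - 1 and value >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return counts


def kl_divergence(p_counts: Sequence[int], q_counts: Sequence[int], eps: float = 1e-6) -> float:
    """KL(P || Q) in nats from two count vectors, with additive smoothing to avoid log(0)."""
    if len(p_counts) != len(q_counts):
        raise ValueError("count vectors must use the same bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    total = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p = (p_count + eps) / p_total
        q = (q_count + eps) / q_total
        total += p * math.log(p / q)
    return total
```

A typical use is `kl_divergence(histogram(current, edges), histogram(training, edges))`, with the alert threshold tuned on historical batches rather than fixed in advance.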

2. Model Health Indicators

Your models are the core of the system’s functionality, and it’s essential to track their performance throughout their lifecycle.

Model Accuracy / Performance
- Metric: Common evaluation metrics such as accuracy, precision, recall, F1 score, or custom metrics depending on the task.
- Threshold: Alerts if performance drops below an acceptable threshold.
- Tools: Experiment tracking and performance monitoring tools (e.g., MLflow, TensorBoard).
Model Drift
- Metric: Statistical change in the model’s predictions over time (e.g., changes in prediction distributions).
- Threshold: Alerts if prediction accuracy or the output distribution drifts beyond a given tolerance.
- Tools: Drift detection libraries or monitoring dashboards.
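One common way to quantify a change in the prediction distribution is the population stability index (PSI), computed over binned prediction scores. The sketch below is an illustration under assumptions: the counts are per-bin tallies of model outputs for a reference window and a current window, and the tolerance you alert on is left to you.

```python
import math
from typing import List, Sequence


def proportions(counts: Sequence[int], eps: float = 1e-6) -> List[float]:
    """Convert bin counts to proportions, flooring at eps so empty bins do not break the log."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one prediction is required")
    return [max(count / total, eps) for count in counts]


def psi(reference_counts: Sequence[int], current_counts: Sequence[int]) -> float:
    """Population stability index between two binned prediction distributions."""
    if len(reference_counts) != len(current_counts):
        raise ValueError("both windows must use the same bins")
    ref = proportions(reference_counts)
    cur = proportions(current_counts)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

PSI is zero when the two windows match and grows as the distributions move apart, so it maps naturally onto the tolerance described above.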
Inference Latency
- Metric: Time taken by the model to make predictions.
- Threshold: Alerts if latency exceeds a predefined threshold (e.g., >100 ms).
- Tools: Application performance monitoring tools (e.g., Prometheus, Grafana).
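Latency thresholds are usually applied to a percentile rather than the mean, since a few slow requests can hide behind a healthy average. The following sketch uses the nearest-rank method on a list of latency samples in milliseconds; the p95 choice and the 100 ms limit are assumptions that mirror the example threshold.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_breached(samples_ms: Sequence[float], pct: int = 95, limit_ms: float = 100.0) -> bool:
    """True when the chosen latency percentile exceeds the limit."""
    return percentile(samples_ms, pct) > limit_ms
```

In a real deployment the samples would come from a metrics system such as Prometheus instead of an in-memory list.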
Model Availability
- Metric: Uptime of the model endpoints or services.
- Threshold: Alerts when the model service is unavailable for a certain period.
- Tools: Cloud-native monitoring (e.g., AWS CloudWatch, Google Cloud Monitoring, formerly Stackdriver).
Model Resource Usage
- Metric: CPU, memory, and GPU usage during inference.
- Threshold: Alerts if usage exceeds certain limits, which could indicate inefficient or problematic inference.
- Tools: Cloud monitoring tools or custom resource tracking.

3. Infrastructure Health Indicators

The infrastructure that supports data processing and model serving is essential to the system’s performance. These indicators cover the physical and virtual resources it runs on.

Service Availability
- Metric: Uptime and availability of critical services (databases, storage systems, compute nodes).
- Threshold: Alerts if availability falls below a defined threshold (e.g., 99.9% uptime).
- Tools: Cloud monitoring services (e.g., AWS CloudWatch, Azure Monitor).
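An uptime target is easier to reason about once it is translated into a downtime budget. The sketch below assumes availability is measured as the share of successful health checks and that the budget is computed over a 30-day period; both are illustrative choices.

```python
def availability(successful_checks: int, total_checks: int) -> float:
    """Share of health checks that succeeded."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return successful_checks / total_checks


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Downtime budget in minutes implied by an availability target over the period."""
    return (1.0 - target) * period_days * 24 * 60


def meets_target(measured: float, target: float = 0.999) -> bool:
    """True when measured availability is at or above the target."""
    return measured >= target
```

For a 99.9% target over 30 days, `allowed_downtime_minutes(0.999)` works out to roughly 43 minutes, which gives the alert threshold a concrete meaning.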
Infrastructure Load
- Metric: CPU, memory, disk I/O, and network load across infrastructure components.
- Threshold: Alerts when resource usage exceeds predefined limits (e.g., CPU > 90% utilization).
- Tools: Infrastructure monitoring tools (e.g., Prometheus, Datadog).
Throughput & Latency
- Metric: Request throughput (e.g., number of requests per minute) and response latency across the system.
- Threshold: Alerts when throughput decreases or latency increases beyond expected ranges.
- Tools: Load balancing and request tracking (e.g., NGINX, Kong, Istio).
Error Rates
- Metric: Error rates for requests to the model or data storage systems (e.g., 5xx HTTP errors, database connection errors).
- Threshold: Alerts when error rates exceed an acceptable threshold.
- Tools: Application logging and monitoring systems (e.g., Sentry, ELK Stack).
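The error-rate metric can be tracked over a sliding window of recent requests. This is a minimal in-process sketch; the class name and window size are hypothetical, and only 5xx HTTP status codes are counted as errors here.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the share of 5xx responses among the most recent `size` requests."""

    def __init__(self, size: int = 1000) -> None:
        # deque with maxlen drops the oldest outcome automatically.
        self._outcomes = deque(maxlen=size)

    def record(self, status_code: int) -> None:
        """Store whether this response was a server error."""
        self._outcomes.append(500 <= status_code <= 599)

    def rate(self) -> float:
        """Current error rate in [0, 1]; 0.0 when nothing has been recorded yet."""
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)
```

An alert would then compare `window.rate()` against the acceptable threshold after each request or on a timer.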
Infrastructure Scaling
- Metric: Frequency of auto-scaling events, time taken to provision resources, and system performance while scaling up or down.
- Threshold: Alerts if scaling doesn’t happen as expected or there are delays in provisioning resources.
- Tools: Cloud auto-scaling tools (e.g., AWS Auto Scaling, Kubernetes HPA).

Key Practices for Building and Monitoring These Indicators:

Alerting: Set up automated alerts for any health indicators that cross a threshold. Alerts should be actionable and easy to understand.
Dashboards: Use visualization tools like Grafana or Kibana, backed by a metrics store such as Prometheus, to display key metrics in real time. Make the dashboards accessible to the teams responsible for monitoring.
Automated Testing: Automate end-to-end tests that verify the health of data pipelines, model outputs, and infrastructure components.
Historical Data: Track long-term trends, not just real-time data. This helps you spot potential issues before they become critical.
Continuous Improvement: Use feedback from health monitoring to continuously improve the system.
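To tie the indicators and the alerting practice together, the sketch below evaluates a small set of threshold rules against current readings and produces human-readable messages. The `Rule` structure, indicator names, and hint texts are all hypothetical; a production setup would more likely define these rules in the alerting system itself.

```python
import operator
from dataclasses import dataclass
from typing import List, Mapping, Sequence

_COMPARISONS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class Rule:
    indicator: str    # name of the health indicator, e.g. "missing_fraction"
    comparison: str   # ">" alerts above the threshold, "<" alerts below it
    threshold: float
    hint: str         # what the on-call person should look at first


def evaluate(rules: Sequence[Rule], readings: Mapping[str, float]) -> List[str]:
    """Return one message per breached rule, plus one per indicator with no reading."""
    messages = []
    for rule in rules:
        value = readings.get(rule.indicator)
        if value is None:
            messages.append(f"{rule.indicator}: no reading available")
        elif _COMPARISONS[rule.comparison](value, rule.threshold):
            messages.append(
                f"{rule.indicator}={value} breached {rule.comparison} {rule.threshold}: {rule.hint}"
            )
    return messages
```

Including a hint with every rule is one way to keep alerts actionable: the message says what crossed the line and where to start looking.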

By integrating these indicators into your monitoring system, you’ll be able to detect potential issues early and ensure the reliability of your machine learning system end to end.
