
How to Detect Multivariate Outliers Using Mahalanobis Distance

Detecting multivariate outliers is a crucial step in data analysis, especially when dealing with datasets involving multiple variables that interact with each other. Unlike univariate outliers, which are identified by analyzing each variable separately, multivariate outliers occur when a combination of variables deviates significantly from the expected pattern. One powerful and widely used method for identifying such outliers is the Mahalanobis distance.

Understanding Multivariate Outliers

Multivariate outliers are observations that do not fit well into the overall distribution of a multivariate dataset. These outliers might not be apparent when looking at individual variables but become noticeable when considering the relationships among multiple variables. Detecting these outliers is essential because they can distort statistical analyses, affect model performance, and lead to misleading conclusions.

What is Mahalanobis Distance?

Mahalanobis distance is a measure of distance between a point and a distribution. It accounts for the correlations between variables, providing a scale-invariant distance metric. In simpler terms, it calculates how many standard deviations away an observation is from the mean of the multivariate data, considering the shape of the data distribution.

Mathematically, for a data point $\mathbf{x}$, the Mahalanobis distance $D_M$ is defined as:

$$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{S}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$

where:

  • $\mathbf{x}$ is the vector of observed values,

  • $\boldsymbol{\mu}$ is the vector of means for each variable,

  • $\mathbf{S}$ is the covariance matrix of the data,

  • $\mathbf{S}^{-1}$ is the inverse of the covariance matrix,

  • $T$ denotes the transpose.

This formula incorporates the covariance matrix, which reflects the correlation and variance among variables, making Mahalanobis distance especially suitable for multivariate data.
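
To make the definition concrete, here is a minimal sketch that evaluates the formula for a single point with NumPy and cross-checks the result against SciPy's scipy.spatial.distance.mahalanobis; the two-variable data values are invented purely for illustration.

python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative two-variable dataset (rows = observations)
X = np.array([[2.0, 3.0],
              [4.0, 5.0],
              [6.0, 5.0],
              [8.0, 7.0],
              [10.0, 8.0]])

mu = X.mean(axis=0)              # mean vector
S = np.cov(X, rowvar=False)      # covariance matrix
S_inv = np.linalg.inv(S)         # inverse covariance matrix

x = np.array([9.0, 3.0])         # point to evaluate
diff = x - mu
d_manual = np.sqrt(diff @ S_inv @ diff)  # the formula above
d_scipy = mahalanobis(x, mu, S_inv)      # SciPy's built-in

print(d_manual, d_scipy)         # the two values agree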

Why Use Mahalanobis Distance for Outlier Detection?

  • Accounts for Correlation: Unlike Euclidean distance, Mahalanobis distance considers correlations between variables, making it more accurate in multivariate contexts.

  • Scale-Invariant: It adjusts for different scales of variables, preventing variables with larger scales from dominating the distance measure.

  • Theoretical Foundation: It is based on the multivariate normal distribution and relates to the chi-square distribution, which helps in setting thresholds for outlier detection.

Step-by-Step Process to Detect Multivariate Outliers Using Mahalanobis Distance

1. Data Preparation

  • Standardize or Normalize Variables: Mahalanobis distance inherently accounts for scale through the covariance matrix, but putting variables on comparable scales is still recommended for numerical stability.

  • Check for Missing Values: Handle or remove missing data to prevent errors in the covariance calculation (see the brief snippet below).
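
A minimal preparation sketch, assuming the raw variables sit in a pandas DataFrame (the None entry stands in for a missing value):

python
import pandas as pd

raw = pd.DataFrame({
    'height': [170, 165, None, 190, 175],  # None marks a missing value
    'weight': [65, 70, 80, 90, 75],
    'age': [25, 30, 22, 35, 28]
})

clean = raw.dropna()                                  # drop incomplete rows
standardized = (clean - clean.mean()) / clean.std()   # optional z-scoring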

2. Calculate the Mean Vector and Covariance Matrix

Compute the mean vector $\boldsymbol{\mu}$, the average of each variable, and the covariance matrix $\mathbf{S}$, which captures the variance within each variable and the covariance between variables.
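
In pandas both are one-liners; this short sketch reuses the illustrative height/weight/age values from the full example below:

python
import pandas as pd

data = pd.DataFrame({
    'height': [170, 165, 180, 190, 175],
    'weight': [65, 70, 80, 90, 75],
    'age': [25, 30, 22, 35, 28]
})

mean_vector = data.mean()  # mean of each variable
cov_matrix = data.cov()    # p x p sample covariance matrix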

3. Compute the Mahalanobis Distance for Each Observation

For each data point $\mathbf{x}_i$, calculate the Mahalanobis distance using the formula above.
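
A vectorized sketch that computes all the distances at once with NumPy (same illustrative values):

python
import numpy as np

X = np.array([[170, 65, 25],
              [165, 70, 30],
              [180, 80, 22],
              [190, 90, 35],
              [175, 75, 28]], dtype=float)

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

diff = X - mu  # (n, p) matrix of deviations from the mean
# einsum evaluates diff_i^T S^{-1} diff_i for every row i in one pass
distances = np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))
print(distances)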

4. Determine the Threshold for Outliers

Under multivariate normality, the squared Mahalanobis distance approximately follows a chi-square distribution with degrees of freedom equal to the number of variables $p$, so you can use chi-square critical values to set thresholds (a short snippet follows the list below).

  • For example, with $p = 3$ variables and a significance level $\alpha = 0.01$, find the critical value $\chi^2_{0.99,\,3}$.

  • Observations with squared Mahalanobis distance greater than this critical value are considered outliers.
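
A minimal sketch of the threshold computation with scipy.stats.chi2:

python
from scipy.stats import chi2

p = 3         # number of variables
alpha = 0.01  # significance level

# Critical value: squared distances above this are flagged as outliers
cutoff_squared = chi2.ppf(1 - alpha, df=p)
print(cutoff_squared)  # approximately 11.34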

5. Identify and Analyze Outliers

  • Flag observations exceeding the threshold (a short snippet follows this list).

  • Investigate these data points for data entry errors, unusual but valid cases, or influential points affecting the analysis.
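
With the distances and cutoff in hand, flagging reduces to a boolean comparison; the values below are invented for illustration:

python
import numpy as np

distances = np.array([1.2, 0.8, 3.9, 1.5, 0.9])  # illustrative Mahalanobis distances
cutoff_squared = 11.34                            # chi2.ppf(0.99, df=3)

flagged = np.where(distances**2 > cutoff_squared)[0]
print("Observations to investigate:", flagged)    # index 2 exceeds the cutoff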

Practical Example

Imagine you have a dataset with three variables: height, weight, and age. You want to identify individuals whose combined profile significantly differs from the typical population pattern.

  • Calculate the mean and covariance matrix for these variables.

  • Compute the Mahalanobis distance for each individual.

  • Using the chi-square distribution with 3 degrees of freedom, set a cutoff (e.g., 99th percentile).

  • Individuals with distances above the cutoff are flagged as potential outliers.

Implementing Mahalanobis Distance in Python

python
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Sample data: height, weight, age
data = pd.DataFrame({
    'height': [170, 165, 180, 190, 175],
    'weight': [65, 70, 80, 90, 75],
    'age': [25, 30, 22, 35, 28]
})

# Calculate mean vector and covariance matrix
mean_vector = data.mean()
cov_matrix = data.cov()

# Calculate inverse covariance matrix
inv_cov_matrix = np.linalg.inv(cov_matrix)

# Function to compute Mahalanobis distance for each observation
def mahalanobis_distance(x, mean, inv_cov):
    diff = x - mean
    return np.sqrt(diff.T @ inv_cov @ diff)

# Calculate distances
distances = data.apply(lambda row: mahalanobis_distance(row, mean_vector, inv_cov_matrix), axis=1)

# Set threshold based on chi-square distribution
p = data.shape[1]  # number of variables
alpha = 0.01
threshold = np.sqrt(chi2.ppf(1 - alpha, p))  # square root, since distances are not squared

# Identify outliers
outliers = distances > threshold
print("Mahalanobis distances:\n", distances)
print("Outliers:\n", data[outliers])

Considerations and Limitations

  • Assumption of Multivariate Normality: Mahalanobis distance works best when the data roughly follow a multivariate normal distribution. For heavily skewed or non-normal data, transformations or alternative methods might be required.

  • Influence of Outliers on Covariance Matrix: Since covariance is sensitive to outliers, extreme values can distort the covariance matrix and hence the Mahalanobis distances themselves. Robust covariance estimators can mitigate this (see the sketch after this list).

  • High-Dimensional Data: With many variables, the covariance matrix might become singular or ill-conditioned, complicating inversion. Dimensionality reduction or regularization methods may be necessary.
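
As a concrete option for the robust-estimator point above, scikit-learn's MinCovDet (Minimum Covariance Determinant) fits a robust location and covariance and exposes squared robust Mahalanobis distances directly; a minimal sketch on synthetic data:

python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic 3-variable data
X[:5] += 8                     # inject a few gross outliers

mcd = MinCovDet(random_state=0).fit(X)
d_squared = mcd.mahalanobis(X)  # squared robust Mahalanobis distances

cutoff = chi2.ppf(0.99, df=X.shape[1])
print(np.where(d_squared > cutoff)[0])  # indices of flagged observations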

Alternative Approaches for Multivariate Outlier Detection

While Mahalanobis distance is a classic method, there are other approaches, including the following (two are sketched in code after the list):

  • Robust Mahalanobis Distance: Uses robust estimators for mean and covariance to reduce sensitivity to outliers.

  • Local Outlier Factor (LOF): Measures local density deviation to find anomalies.

  • Isolation Forest: A tree-based model designed to isolate outliers.

  • Principal Component Analysis (PCA): Can help reduce dimensionality and identify outliers based on principal components.
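
For comparison, here is a minimal sketch of two of these alternatives, Local Outlier Factor and Isolation Forest, via their scikit-learn implementations on synthetic data:

python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))  # synthetic 3-variable data
X[:5] += 8                     # inject a few gross outliers

# Both estimators label inliers as 1 and outliers as -1
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
iso_labels = IsolationForest(random_state=0).fit_predict(X)

print("LOF flags:", np.where(lof_labels == -1)[0])
print("Isolation Forest flags:", np.where(iso_labels == -1)[0])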

Summary

Mahalanobis distance is a fundamental tool for detecting multivariate outliers: it measures the distance of each observation from the center of a multivariate distribution while accounting for correlations between variables. It enables analysts to flag unusual data points that might distort results, ensuring cleaner data and more reliable insights. Proper application involves computing the distances, setting a threshold from the chi-square distribution, and carefully handling data characteristics such as non-normality or high dimensionality.
