The Palos Publishing Company


How to Detect Multivariate Outliers Using EDA for Better Analysis

Detecting multivariate outliers is a critical step in Exploratory Data Analysis (EDA) to ensure the accuracy and reliability of statistical analyses and machine learning models. Outliers can distort the findings of an analysis, lead to biased predictions, and reduce the overall performance of algorithms. Therefore, detecting and handling outliers is essential for better analysis.

In this article, we’ll explore the techniques to detect multivariate outliers using EDA, which combines visualization, statistical methods, and domain knowledge to identify and understand the presence of outliers in your data. By leveraging these methods, you can improve your model’s performance and the quality of your insights.

Understanding Multivariate Outliers

A multivariate outlier is an observation that deviates significantly from the general distribution or pattern of the data in multiple dimensions. Unlike univariate outliers, which deal with a single feature, multivariate outliers are outliers in the context of multiple variables considered together.

For instance, a data point that seems normal when examined individually might be an outlier when considering its relationship with other variables. This makes multivariate outliers more challenging to detect, as it involves understanding the interdependencies between features.
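A small synthetic sketch makes this concrete (the data and the specific point are invented for illustration). Each coordinate of the point looks unremarkable on its own, but the point badly violates the relationship between the two features:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = x + rng.normal(0, 0.1, 500)   # y tracks x closely

point_x, point_y = 1.5, -1.5      # hypothetical observation

# Univariate view: each coordinate is within ~1.5 standard deviations
print(abs(point_x - x.mean()) / x.std())
print(abs(point_y - y.mean()) / y.std())

# Multivariate view: the residual from the x-y relationship is enormous
residuals = y - x
print(abs((point_y - point_x) - residuals.mean()) / residuals.std())
```

The first two values are modest, while the last is dozens of standard deviations out: a multivariate outlier that no single-feature check would catch.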

Steps to Detect Multivariate Outliers in EDA

1. Visualizing the Data: Scatter Plots and Pair Plots

Visualization is often the first step in identifying outliers. When dealing with multivariate data, scatter plots and pair plots can be extremely useful. They provide a way to visualize the relationship between multiple variables simultaneously.

Scatter Plot: A scatter plot is a basic way to plot two variables against each other. While this works well for bivariate relationships, for multivariate data, you can plot multiple pairs of variables.

Pair Plot: With three or more variables, a pair plot (also known as a scatterplot matrix) can be used. It helps you understand the pairwise relationships between variables and can quickly reveal unusual data points that deviate from the general trend.

For example, using Python’s Seaborn library, you can create a pair plot like this:

```python
import seaborn as sns

sns.pairplot(data)
```

This will generate scatter plots for each pair of features in your dataset, making it easier to spot any anomalies.

2. Using Box Plots in Multiple Dimensions

A box plot can be helpful for univariate outlier detection, but it can also be extended to multivariate data by plotting box plots for each feature individually. While this won't directly detect multivariate outliers, it can show you which variables might be contributing to outliers in a given dataset.

However, box plots can still serve as a useful entry point for multivariate EDA. Once you spot outliers in univariate box plots, you can drill down into those instances across other variables and check whether they exhibit outlier characteristics in more dimensions.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(data=data)
plt.show()
```

This will create box plots for each feature, allowing you to visually inspect any points that lie outside the typical range of values.
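The drill-down step above can be sketched programmatically. This is a minimal example on invented data: it flags rows falling outside the standard 1.5×IQR box-plot whiskers of any single feature, then pulls up those full rows so you can inspect their values across every other variable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 1, size=(200, 3)), columns=["a", "b", "c"])
df.loc[199, "a"] = 9.0   # planted univariate outlier in feature "a"

# Flag rows outside the 1.5*IQR whiskers of any single feature
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
mask = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).any(axis=1)

# Drill down: inspect the flagged rows across all features
print(df[mask])
```

A handful of ordinary rows will also be flagged (the whisker rule trips on a small fraction of normal data), which is exactly why the cross-feature inspection matters.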

3. Statistical Methods for Outlier Detection

Several statistical techniques can be applied to detect multivariate outliers, and they work by measuring how far data points are from the central tendency (such as mean or median) of the data.

Mahalanobis Distance: This distance measure calculates the "distance" of a point from the mean of the data, taking into account the covariance of the variables. Points with a large Mahalanobis distance can be considered multivariate outliers. You can compute the Mahalanobis distance using the following steps:

  1. Calculate the mean and covariance matrix of the data.

  2. For each point, compute its Mahalanobis distance.

```python
import numpy as np
from scipy.stats import chi2

mean = np.mean(data, axis=0)
cov = np.cov(data.T)
inv_cov = np.linalg.inv(cov)
diff = data - mean

# Squared Mahalanobis distance of each row (the diagonal of the
# quadratic form, computed without building the full n x n matrix)
sq_mahalanobis = np.sum(diff @ inv_cov * diff, axis=1)
mahalanobis_dist = np.sqrt(sq_mahalanobis)

# Under multivariate normality, the squared distances follow a
# chi-squared distribution with df = number of features
p_values = chi2.sf(sq_mahalanobis, df=data.shape[1])
```

If the p-value for a given point is below a threshold (commonly 0.01 or 0.05), the point can be considered an outlier.
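Putting the steps together, here is a self-contained sketch on synthetic data with one planted outlier that is unremarkable in each coordinate but breaks the correlation structure (the data, seed, and 0.01 threshold are illustrative choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
# 200 points from a strongly correlated 2-D Gaussian, plus one planted outlier
data = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=200)
data = np.vstack([data, [2.5, -2.5]])   # violates the positive correlation

mean = data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data.T))
diff = data - mean
sq_dist = np.sum(diff @ inv_cov * diff, axis=1)   # squared Mahalanobis distances

p_values = chi2.sf(sq_dist, df=data.shape[1])
outliers = np.where(p_values < 0.01)[0]
print(outliers)   # includes index 200, the planted point
```

A few genuine inliers may also cross the 0.01 threshold by chance; the flagged set is a shortlist for inspection, not a verdict.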

Z-Score: Z-scores can also be extended to multivariate data. The Z-score indicates how many standard deviations away a point is from the mean. A Z-score greater than 3 (in absolute value) often indicates an outlier in univariate analysis. For multivariate data, you can apply a similar method by standardizing each feature and then calculating the Z-score per feature. Note that per-feature Z-scores only flag points that are extreme in at least one individual feature; they can miss points that are unusual only in combination, which is where Mahalanobis distance is stronger.

```python
import numpy as np
from scipy.stats import zscore

z_scores = np.abs(zscore(data))
outliers = np.where(z_scores > 3)
```

Any data point with a Z-score higher than a predefined threshold can be considered an outlier.

4. Isolation Forest

Isolation Forest is an algorithm specifically designed for anomaly detection. It works by recursively partitioning the data using random splits and isolating observations that deviate significantly from others. It’s a robust and efficient technique for detecting outliers in high-dimensional data. It can be used in an unsupervised manner and works well for large datasets.

```python
from sklearn.ensemble import IsolationForest

# Adjust contamination to the expected proportion of outliers
iso_forest = IsolationForest(contamination=0.05)
outliers = iso_forest.fit_predict(data)
```

The algorithm will return 1 for normal points and -1 for outliers.
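A quick end-to-end sketch on invented data shows the 1/-1 labeling in action; the planted anomalies, seed, and contamination value are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
data = rng.normal(0, 1, size=(300, 4))          # inliers from N(0, 1)
anomalies = np.array([[8, 8, 8, 8],
                      [-8, 8, -8, 8],
                      [7, -7, 7, -7]])          # three obvious anomalies
data = np.vstack([data, anomalies])

iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(data)                  # 1 = normal, -1 = outlier
flagged = np.where(labels == -1)[0]
print(flagged)   # includes indices 300, 301, 302
```

Because `contamination` fixes the fraction flagged, a couple of borderline inliers may be flagged alongside the planted points; tune it to how much contamination you actually expect.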

5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a clustering method that groups points that are closely packed together. Points that are far away from these dense regions are considered outliers. This can be a great option when you expect your data to have some natural clusters, and any point that does not fit into a cluster should be flagged as an outlier.

```python
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)   # Adjust parameters to your data
labels = db.fit_predict(data)
outliers = np.where(labels == -1)
```

Here, points labeled as -1 are considered outliers.
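As a concrete sketch (the two clusters, the noise points, and the `eps`/`min_samples` values are invented for illustration), points far from both dense regions come back labeled -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster_a = rng.normal(0, 0.3, size=(100, 2))   # dense cluster near (0, 0)
cluster_b = rng.normal(5, 0.3, size=(100, 2))   # dense cluster near (5, 5)
noise = np.array([[2.5, 2.5], [10.0, 10.0]])    # far from both clusters
data = np.vstack([cluster_a, cluster_b, noise])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(data)
outliers = np.where(labels == -1)[0]
print(outliers)   # includes indices 200 and 201, the noise points
```

Unlike Isolation Forest, DBSCAN needs no expected outlier fraction, but it is sensitive to `eps`: too small and cluster edges get flagged, too large and real outliers get absorbed.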

6. Visualizing with PCA or t-SNE

Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are dimensionality reduction techniques that can help you visualize high-dimensional data in 2D or 3D. Once the data is reduced, you can use scatter plots to visualize the points that appear far from the rest of the data, which could be potential outliers.

PCA:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
plt.show()
```

t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
reduced_data = tsne.fit_transform(data)
plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
plt.show()
```

Points that appear isolated from others in the reduced 2D or 3D space can be considered potential outliers.
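"Isolated" can also be quantified rather than eyeballed. One simple sketch (the synthetic data and the centroid-distance criterion are illustrative assumptions, not the only choice) scores each projected point by its distance from the centroid of the PCA projection:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
data = rng.multivariate_normal(np.zeros(5), np.eye(5), size=200)
data = np.vstack([data, np.full(5, 6.0)])   # one planted outlier

reduced = PCA(n_components=2).fit_transform(data)

# Distance of each projected point from the centroid of the projection
dist = np.linalg.norm(reduced - reduced.mean(axis=0), axis=1)
print(dist.argmax())   # index 200, the planted point
```

One caveat: PCA directions are themselves influenced by extreme points, so for heavily contaminated data a robust method such as Mahalanobis distance with a robust covariance estimate is safer.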

Conclusion

Detecting multivariate outliers is an essential part of Exploratory Data Analysis (EDA) because outliers can have a significant impact on the outcomes of your analysis. Using a combination of visualization techniques like pair plots and statistical methods like Mahalanobis distance, Z-scores, Isolation Forest, and DBSCAN, you can identify these outliers effectively. Reducing the influence of these outliers can lead to more accurate models and insights.

The approach to outlier detection should be chosen based on the nature of the data and the problem you’re working on. By incorporating outlier detection into your EDA process, you’ll ensure that your analysis is robust and that your machine learning models perform better.
