How to Perform Outlier Detection Using Boxplots and Z-Scores

Outlier detection is a critical step in data preprocessing and analysis. Outliers can distort statistical analyses and machine learning models, leading to misleading results. Two widely used techniques for detecting outliers are boxplots and z-scores. Each method offers distinct advantages depending on the distribution and nature of the data. This article explores how to effectively perform outlier detection using both boxplots and z-scores, including implementation steps, examples, and best practices.

Understanding Outliers

An outlier is an observation that lies an abnormal distance from other values in a dataset. Outliers can occur due to measurement errors, data entry errors, or natural variability. Identifying them is essential because:

They can bias the results of analyses.
They may indicate variability in measurement.
They could point to interesting phenomena worthy of further investigation.

Boxplot-Based Outlier Detection

A boxplot (or box-and-whisker plot) is a graphical representation of the distribution of a dataset that highlights its central tendency and variability. It also visualizes outliers based on the Interquartile Range (IQR).

Key Components of a Boxplot:

Median (Q2): The central value of the dataset.
First Quartile (Q1): 25th percentile.
Third Quartile (Q3): 75th percentile.
IQR: Q3 − Q1.
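Whiskers: Typically extend to the most extreme data points that still lie within 1.5 * IQR of the quartiles.
Outliers: Any data point below Q1 − 1.5 * IQR or above Q3 + 1.5 * IQR, drawn as an individual marker beyond the whiskers.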

Steps to Detect Outliers Using Boxplots:

Calculate Q1 and Q3:
- Q1 is the value at the 25th percentile.
- Q3 is the value at the 75th percentile.
Compute the IQR:
- IQR = Q3 − Q1
Determine the outlier bounds:
- Lower Bound = Q1 − 1.5 * IQR
- Upper Bound = Q3 + 1.5 * IQR
Identify outliers:
- Any value < Lower Bound or > Upper Bound is an outlier.
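
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a definitive implementation: the function name iqr_outliers, its parameter k, and the sample values are assumptions made for this example. Note that np.percentile interpolates linearly between data points by default, so its quartiles can differ slightly from a hand calculation.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    values = np.asarray(values, dtype=float)
    # Step 1: quartiles (linear interpolation between data points by default)
    q1, q3 = np.percentile(values, [25, 75])
    # Step 2: interquartile range
    iqr = q3 - q1
    # Step 3: outlier bounds
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Step 4: values outside the bounds
    outliers = values[(values < lower) | (values > upper)]
    return lower, upper, outliers

# Hypothetical sample, used only to exercise the function
lower, upper, outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(lower, upper, outliers)
```

Keeping k as a parameter makes it easy to try a stricter multiplier (for example 3.0) when only very extreme values should be flagged.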

Example:

Suppose we have a dataset:
data = [10, 12, 12, 13, 12, 14, 15, 16, 18, 110]

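Q1 = 12 (median of the lower half of the sorted data)
Q3 = 16 (median of the upper half of the sorted data)
IQR = 16 − 12 = 4
Lower Bound = 12 − (1.5 × 4) = 6
Upper Bound = 16 + (1.5 × 4) = 22

The value 110 is greater than 22 and is considered an outlier. Quartile conventions differ slightly between tools: NumPy's default linear interpolation gives Q3 = 15.75 for this data, which moves the upper bound to about 21.4 but does not change the conclusion.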
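
The hand calculation can be reproduced with the standard library. The sketch below assumes the "median of each half" quartile convention, which is what yields Q1 = 12 and Q3 = 16 for this dataset; the helper name median_of_halves_quartiles is made up for this example.

```python
from statistics import median

def median_of_halves_quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    Assumes at least two values. For an odd number of points, the middle
    value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

data = [10, 12, 12, 13, 12, 14, 15, 16, 18, 110]
q1, q3 = median_of_halves_quartiles(data)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(q1, q3, iqr, lower_bound, upper_bound, outliers)
# prints: 12 16 4 6.0 22.0 [110]
```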

Visualization in Python:

python
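import matplotlib.pyplot as plt
import seaborn as sns

data = [10, 12, 12, 13, 12, 14, 15, 16, 18, 110]

# Points beyond the whiskers (1.5 * IQR by default) are drawn as individual markers
sns.boxplot(data=data)
plt.title("Boxplot of the example data")
plt.show()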
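
The numbers behind the plot can also be read without drawing anything. The sketch below uses Matplotlib's cbook.boxplot_stats helper, on the assumption that the plot uses the default 1.5 * IQR whiskers. Matplotlib computes quartiles with NumPy's percentile function, so Q3 may differ slightly from a hand calculation.

```python
from matplotlib import cbook

data = [10, 12, 12, 13, 12, 14, 15, 16, 18, 110]

# boxplot_stats returns one dict of statistics per dataset; we have a single dataset
stats = cbook.boxplot_stats(data, whis=1.5)[0]

print("Q1:", stats["q1"], "Q3:", stats["q3"])
print("Whisker ends:", stats["whislo"], stats["whishi"])
print("Points drawn as outliers:", stats["fliers"])
```

The "fliers" entry holds exactly the points that a boxplot would draw beyond the whiskers, which makes it a convenient way to list them programmatically.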

Z-Score-Based Outlier Detection

The z-score method is a statistical approach that identifies how many standard deviations a data point is from the mean.

Z-Score Formula:

$Z = \frac{X - \mu}{\sigma}$

Where:

$X$ = data point
$\mu$ = mean of the data
$\sigma$ = standard deviation

Outlier Threshold:

Common practice considers a data point an outlier if:

$|Z| > 3$

This means the data point is more than 3 standard deviations away from the mean.

Steps to Detect Outliers Using Z-Scores:

Calculate the mean and standard deviation of the dataset.
Compute the z-score for each data point.
Set a threshold, typically ±3.
Flag outliers with z-scores beyond the threshold.

Example:

Given a dataset:
data = [10, 12, 12, 13, 12, 14, 15, 16, 18, 110]

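Mean (μ) = 23.2
Standard deviation (σ) ≈ 29.0 (population formula, which is what np.std computes by default)

The z-score for 110:

$Z = \frac{110 - 23.2}{29.0} \approx 2.99$

Because |Z| ≈ 2.99 falls just below 3, the z-score rule does not flag 110, even though the boxplot rule does. The extreme value inflates both the mean and the standard deviation, which shrinks its own z-score. This masking effect is strongest in small samples: with only 10 points, no value can exceed |Z| = 3 under the population formula. Using the sample standard deviation (≈ 30.6) gives Z ≈ 2.84, which is also below the threshold.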
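
The figures for this example can be checked numerically. The sketch below assumes NumPy and compares the population standard deviation (ddof=0) with the sample standard deviation (ddof=1); the results dictionary is only there to hold the two outcomes.

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 14, 15, 16, 18, 110], dtype=float)
mean = data.mean()

results = {}
for ddof, label in [(0, "population"), (1, "sample")]:
    std = data.std(ddof=ddof)
    z_of_max = (data.max() - mean) / std  # z-score of the largest value, 110
    results[label] = (std, z_of_max)
    print(f"{label}: std = {std:.2f}, z-score of 110 = {z_of_max:.2f}")

# prints:
# population: std = 29.02, z-score of 110 = 2.99
# sample: std = 30.59, z-score of 110 = 2.84
```

Under either convention the z-score of 110 stays below 3, so a strict |Z| > 3 rule does not flag it in a sample this small.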

Python Implementation:

python
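import numpy as np

data = np.array([10, 12, 12, 13, 12, 14, 15, 16, 18, 110])
mean = np.mean(data)
std = np.std(data)  # population standard deviation (ddof=0); pass ddof=1 for the sample version

# Standardize each value, then keep those more than 3 standard deviations from the mean
z_scores = (data - mean) / std
outliers = data[np.abs(z_scores) > 3]
print("Outliers:", outliers)  # empty for this data: the z-score of 110 is about 2.99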

Comparing Boxplot and Z-Score Methods

Feature	Boxplot Method	Z-Score Method
Assumption	Non-parametric	Assumes normal distribution
Best for	Skewed distributions	Symmetric, bell-shaped distributions
Outlier threshold	1.5 * IQR	±3 standard deviations
Robust to outliers	Yes	No
Visualization	Easy with boxplot	Requires additional calculations

When to Use Each:

Use boxplot method when dealing with non-normal data, especially if the distribution is skewed or contains several extreme values.
Use z-score method for normally distributed data or datasets where a statistical basis for deviation from the mean is preferred.

Best Practices

Visualize first: Always explore the data visually using histograms and boxplots to understand distribution and spread.
Combine methods: Use both techniques to cross-validate outliers, especially in critical applications.
Investigate further: Do not blindly remove outliers. Understand their source—could they represent rare but valid phenomena?
Use domain knowledge: Statistical outliers are not always data entry errors. Use contextual understanding before deciding to retain or remove.
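Work per feature: A z-score is itself a standardization, so when multiple features are involved, compute z-scores for each feature separately from that feature's own mean and standard deviation rather than pooling values measured on different scales.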
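
As a sketch of the "combine methods" practice, the snippet below applies both rules to the same values and reports where they agree and disagree. The function name flag_outliers and its parameters are assumptions for this example, and it uses NumPy's default percentile interpolation and population standard deviation.

```python
import numpy as np

def flag_outliers(values, iqr_k=1.5, z_threshold=3.0):
    """Return two boolean masks: flagged by the IQR rule, flagged by the z-score rule."""
    values = np.asarray(values, dtype=float)

    # IQR rule
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    by_iqr = (values < q1 - iqr_k * iqr) | (values > q3 + iqr_k * iqr)

    # Z-score rule (a constant dataset has zero spread, so nothing is flagged)
    std = values.std()
    z = (values - values.mean()) / std if std > 0 else np.zeros_like(values)
    by_z = np.abs(z) > z_threshold

    return by_iqr, by_z

data = [10, 12, 12, 13, 12, 14, 15, 16, 18, 110]
values = np.asarray(data)
by_iqr, by_z = flag_outliers(data)

print("Flagged by both:", values[by_iqr & by_z])
print("Flagged by IQR only:", values[by_iqr & ~by_z])
print("Flagged by z-score only:", values[~by_iqr & by_z])
```

For the example dataset, only the IQR rule flags 110. Disagreements like this are a prompt to look at the data more closely rather than a verdict in themselves.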

Handling Detected Outliers

After identifying outliers, possible actions include:

Removing them: If they are errors or not relevant.
Imputing values: Replacing them with mean, median, or mode.
Transforming data: Using log or square root transformations to reduce impact.
Using robust models: Tree-based algorithms such as Random Forests are generally less sensitive to outliers in the input features than models that rely on means or squared errors.
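
The first three options can be sketched with NumPy as follows. This is an illustration under stated assumptions: outliers are identified with the 1.5 * IQR rule, imputation uses the median of the remaining values, and the log transform requires strictly positive data. Which option is appropriate depends on why the outlier occurred.

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 14, 15, 16, 18, 110], dtype=float)

# Identify outliers with the 1.5 * IQR rule
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
is_outlier = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

# Option 1: remove the flagged values
removed = data[~is_outlier]

# Option 2: impute flagged values with the median of the remaining values
imputed = data.copy()
imputed[is_outlier] = np.median(data[~is_outlier])

# Option 3: log-transform to compress large values (valid only for positive data)
logged = np.log(data)

print("Removed:", removed)
print("Imputed:", imputed)
print("Log-transformed:", np.round(logged, 2))
```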

Conclusion

Detecting outliers using boxplots and z-scores is a fundamental aspect of data analysis that significantly influences the quality of insights drawn from a dataset. Boxplots offer a robust, visual way to detect outliers in skewed data, while z-scores provide a statistically sound approach for normally distributed datasets. The two methods can disagree, particularly in small samples, so choosing the right method, and often using both in tandem, supports cleaner data and more accurate analyses, leading to better decision-making and model performance.
