How to Perform Outlier Detection Using IQR and Z-Scores in EDA

Outlier detection is a critical component of Exploratory Data Analysis (EDA), as outliers can significantly distort statistical analyses and model performance. Two of the most common techniques for identifying outliers are the Interquartile Range (IQR) method and Z-score analysis. Each approach has its strengths and is suitable for different data distributions. Understanding how to effectively use IQR and Z-scores can provide deeper insights into data and ensure the accuracy of predictive models.

Understanding Outliers

Outliers are data points that deviate markedly from the rest of the dataset. They may arise due to variability in the data, measurement errors, or experimental anomalies. Outliers can affect the mean, variance, and other statistical summaries, potentially leading to skewed interpretations and poor model performance.

Why Detect Outliers?

Improved Model Accuracy: Many algorithms, particularly those based on distance metrics, are sensitive to outliers.
Better Data Understanding: Detecting outliers can reveal hidden trends, unusual behaviors, or data entry errors.
Robust Statistical Analysis: Outlier handling leads to more reliable statistical conclusions.

Outlier Detection Using IQR

The Interquartile Range (IQR) method is a non-parametric technique that does not assume a normal distribution. It uses the spread of the middle 50% of the data to determine the presence of outliers.

Steps to Use the IQR Method

Calculate the Quartiles:
- Q1 (First Quartile): 25th percentile
- Q3 (Third Quartile): 75th percentile
Compute the IQR:
$text{IQR} = Q3 – Q1$
Determine the Outlier Thresholds:
- Lower Bound: $Q1 – 1.5 times text{IQR}$
- Upper Bound: $Q3 + 1.5 times text{IQR}$
Identify Outliers:
- Any data point outside the bounds is considered an outlier.

Example in Python

python
import pandas as pd

# Example dataset
data = {'Values': [12, 15, 14, 10, 100, 13, 12, 11, 14, 13]}
df = pd.DataFrame(data)

Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['Values'] < lower_bound) | (df['Values'] > upper_bound)]

When to Use IQR

When the dataset is not normally distributed.
When a robust method is needed that is not influenced by extreme values.
Particularly useful in boxplot-based visualizations.

Outlier Detection Using Z-Scores

The Z-score method assumes the data follows a normal distribution and detects outliers based on how far a data point deviates from the mean.

Steps to Use the Z-Score Method

Calculate the Mean and Standard Deviation of the dataset.
Compute the Z-Score for each data point:
$Z = frac{(X – mu)}{sigma}$
Where:
- $X$ = data point
- $mu$ = mean of the dataset
- $sigma$ = standard deviation of the dataset
Set a Threshold:
- Common thresholds: Z > 3 or Z < -3
Identify Outliers:
- Points with Z-scores beyond the threshold are considered outliers.

Example in Python

python
from scipy import stats
import numpy as np

# Example dataset
data = np.array([12, 15, 14, 10, 100, 13, 12, 11, 14, 13])
z_scores = np.abs(stats.zscore(data))

outliers = data[z_scores > 3]

When to Use Z-Score

When the data is normally distributed.
When working with standardized data.
Effective for large datasets with relatively consistent distributions.

Visualizing Outliers

Visualization plays a key role in EDA. Both IQR and Z-score methods benefit from graphical techniques:

Boxplots: Clearly show the IQR, median, and outliers.
Histograms: Show distribution and possible data anomalies.
Scatterplots: Help detect multivariate outliers.

Python Example: Boxplot

python
import matplotlib.pyplot as plt

plt.boxplot(df['Values'])
plt.title('Boxplot for Outlier Detection')
plt.show()

Python Example: Z-Score Histogram

python
import seaborn as sns

sns.histplot(z_scores, bins=10, kde=True)
plt.title('Z-score Distribution')
plt.axvline(3, color='red', linestyle='--')
plt.axvline(-3, color='red', linestyle='--')
plt.show()

Comparison: IQR vs Z-Score

Feature	IQR Method	Z-Score Method
Assumption	No distributional assumption	Assumes normal distribution
Sensitivity	Robust to outliers	Sensitive to extreme values
Usability	Small to medium datasets	Large, normally distributed data
Visualization Tool	Boxplots	Histograms, Standardized plots

Handling Outliers After Detection

Once outliers are identified, possible actions include:

Removal: If they are data entry errors or irrelevant.
Transformation: Apply log or square root transformations to reduce skew.
Capping: Winsorizing replaces extreme values with a percentile cap.
Segregation: Analyze separately if outliers represent a meaningful subgroup.

Python Example: Capping Outliers

python
# Cap using percentiles
df['Capped'] = np.where(df['Values'] > upper_bound, upper_bound,
                        np.where(df['Values'] < lower_bound, lower_bound, df['Values']))

Best Practices

Always visualize before and after removing outliers to assess impact.
Combine multiple methods when appropriate, especially in high-dimensional data.
Understand the domain context; not all outliers are bad—some may hold key insights.
Automate detection pipelines for larger datasets with real-time updates.

Conclusion

Outlier detection using IQR and Z-scores is essential for thorough Exploratory Data Analysis. The IQR method excels in non-normal data and offers a robust approach, while Z-scores are ideal for normally distributed datasets. By combining statistical rigor with visualization, data scientists can ensure cleaner data, build more reliable models, and uncover hidden patterns that might otherwise go unnoticed.

Share This Page:

How to Perform Outlier Detection Using IQR and Z-Scores in EDA

Understanding Outliers

Why Detect Outliers?

Outlier Detection Using IQR

Steps to Use the IQR Method

Example in Python

When to Use IQR

Outlier Detection Using Z-Scores

Steps to Use the Z-Score Method

Example in Python

When to Use Z-Score

Visualizing Outliers

Python Example: Boxplot

Python Example: Z-Score Histogram

Comparison: IQR vs Z-Score

Handling Outliers After Detection

Python Example: Capping Outliers

Best Practices

Conclusion

Check Out Our Newest Posts we wrote about

Writing Thread-Safe Memory Management in C++

Writing Tests for Animation Systems

Writing Secure C++ Code with Proper Memory Management

Writing Secure C++ Code with Proper Memory Management (1)