How to Detect Outliers Using Tukey’s Fences in EDA

Detecting outliers is an essential part of Exploratory Data Analysis (EDA), because unusual values can distort summary statistics and the models fitted afterwards. A widely used technique is Tukey’s Fences, which uses the interquartile range (IQR) to identify data points that deviate significantly from the rest of the dataset. Here’s a breakdown of how to use Tukey’s Fences to detect outliers in EDA.

1. Understanding Tukey’s Fences

Tukey’s Fences are a set of boundaries or “fences” used to classify data points as mild or extreme outliers. These fences are calculated based on the interquartile range (IQR), which measures the statistical spread between the first quartile (Q1) and the third quartile (Q3).

Q1: 25th percentile of the data
Q3: 75th percentile of the data
IQR: The difference between Q3 and Q1 (IQR = Q3 – Q1)

Tukey’s method defines two types of outliers:

Mild outliers: Data points that lie outside the inner fence but within the outer fence.
Extreme outliers: Data points that lie outside the outer fence.

2. Formula for Tukey’s Fences

Tukey’s Fences are calculated as follows:

Inner fence:
$\text{Lower inner fence} = Q1 - (1.5 \times IQR)$
$\text{Upper inner fence} = Q3 + (1.5 \times IQR)$
Outer fence:
$\text{Lower outer fence} = Q1 - (3.0 \times IQR)$
$\text{Upper outer fence} = Q3 + (3.0 \times IQR)$
Mild outliers are those that fall outside the inner fence but inside the outer fence.
Extreme outliers are those that fall outside the outer fence.

3. Steps to Detect Outliers Using Tukey’s Fences

Step 1: Calculate Q1 and Q3

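Start by sorting your data in ascending order. The first quartile (Q1) is the median of the lower half of the dataset, and the third quartile (Q3) is the median of the upper half. When the number of data points is odd, conventions differ on whether the overall median belongs to both halves or to neither, and many software packages interpolate between neighboring values instead, so Q1 and Q3 can vary slightly between tools on small datasets.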
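As a rough illustration of Step 1, the sketch below computes Q1 and Q3 as the medians of the two halves. The function name `quartiles_by_halves`, the `include_median` flag, and the sample values are hypothetical, chosen only to show how the two conventions for an odd-sized dataset can disagree.

```python
from statistics import median


def quartiles_by_halves(values, include_median=True):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of points, include_median decides whether the
    overall median is counted in both halves or left out of both.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = data[:half + 1], data[half:]
    else:
        lower, upper = data[:half], data[n - half:]
    return median(lower), median(upper)


sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]  # hypothetical data, 9 values

inclusive = quartiles_by_halves(sample, include_median=True)
exclusive = quartiles_by_halves(sample, include_median=False)
print(inclusive)  # (4, 10)
print(exclusive)  # (4.0, 11.0)
```

Whichever convention you pick, it helps to state it and use the same one throughout an analysis.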

Step 2: Compute the IQR

Once you have Q1 and Q3, calculate the IQR:

IQR = Q3 – Q1

Step 3: Calculate the Fences

Now calculate the inner and outer fences using the formulas:

Lower inner fence = Q1 – 1.5 * IQR
Upper inner fence = Q3 + 1.5 * IQR
Lower outer fence = Q1 – 3.0 * IQR
Upper outer fence = Q3 + 3.0 * IQR
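Step 3 can be sketched as a small helper that takes the two quartiles and returns all four fences. The name `tukey_fences`, the dictionary keys, and the input quartiles are illustrative assumptions, not part of any library.

```python
def tukey_fences(q1, q3, inner_k=1.5, outer_k=3.0):
    """Return the inner and outer Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return {
        "lower_inner": q1 - inner_k * iqr,
        "upper_inner": q3 + inner_k * iqr,
        "lower_outer": q1 - outer_k * iqr,
        "upper_outer": q3 + outer_k * iqr,
    }


# Hypothetical quartiles: Q1 = 4 and Q3 = 10, so IQR = 6
fences = tukey_fences(4, 10)
print(fences)
```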

Step 4: Classify the Data Points

Any data points below the lower inner fence or above the upper inner fence, but still within the outer fences, are classified as potential mild outliers.
Any data points below the lower outer fence or above the upper outer fence are classified as extreme outliers.
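The classification rule can be written as a short function. This is a minimal sketch: the name `classify`, the label strings, and the quartiles used in the example are hypothetical, and it assumes a value lying exactly on a fence is not counted as an outlier.

```python
def classify(value, q1, q3):
    """Label a value as 'extreme', 'mild', or 'inlier' using Tukey's Fences."""
    iqr = q3 - q1
    # Check the outer fences first, since every extreme outlier
    # also lies outside the inner fences.
    if value < q1 - 3.0 * iqr or value > q3 + 3.0 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "inlier"


# Hypothetical quartiles Q1 = 4, Q3 = 10 give inner fences (-5, 19)
# and outer fences (-14, 28).
labels = {v: classify(v, 4, 10) for v in [-15, -6, 10, 19, 20, 30]}
print(labels)
```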

Step 5: Visualize the Results (Optional)

A boxplot is a convenient way to see outliers at a glance. In the standard Tukey-style boxplot, each whisker extends to the most extreme data point that still lies within the inner fences, and any points beyond the whiskers are drawn individually as outliers.
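To make the link between the fences and a Tukey-style boxplot concrete, the sketch below works out where the whiskers would end and which points would be drawn individually, without producing a plot. It assumes the common convention that whiskers stop at the most extreme data points inside the inner fences; the function name `whisker_ends` and the sample data are hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with the `inclusive` method.

```python
from statistics import quantiles


def whisker_ends(values, k=1.5):
    """Return (low whisker, high whisker, points drawn individually)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if lower <= v <= upper]
    outside = [v for v in values if v < lower or v > upper]
    # Whiskers end at actual data points, not at the fence values.
    return min(inside), max(inside), outside


sample = [2, 4, 4, 5, 7, 9, 10, 12, 15, 25]  # hypothetical data
low, high, points = whisker_ends(sample)
print(low, high, points)  # 2 15 [25]
```

Note that the upper whisker ends at 15, the largest value inside the fence, rather than at the fence itself.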

4. Example of Tukey’s Fences in Action

Consider the following dataset:

{ 5, 7, 8, 12, 13, 15, 18, 19, 20, 25, 28, 30, 33, 35, 38, 50, 60 }

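Calculate Q1 and Q3:
- Sorted dataset: { 5, 7, 8, 12, 13, 15, 18, 19, 20, 25, 28, 30, 33, 35, 38, 50, 60 }
- There are 17 values, so the median is the 9th value, 20. Counting the median in both halves, each half contains 9 values.
- Q1 (25th percentile) = 13, the median of the lower half
- Q3 (75th percentile) = 33, the median of the upper half
Compute the IQR:
$IQR = 33 - 13 = 20$
Calculate the fences:
- Lower inner fence = 13 – 1.5 * 20 = –17
- Upper inner fence = 33 + 1.5 * 20 = 63
- Lower outer fence = 13 – 3.0 * 20 = –47
- Upper outer fence = 33 + 3.0 * 20 = 93
Classify the data:
- Mild outliers: Every value lies between –17 and 63, so there are no mild outliers. The largest values, 50 and 60, stand apart from the rest but remain inside the upper inner fence.
- Extreme outliers: No value lies below –47 or above 93, so there are no extreme outliers either. For comparison, a value of 70 would be a mild outlier and a value of 100 would be an extreme outlier.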
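The worked example can be checked with a few lines of code. This sketch uses `statistics.quantiles` from the Python standard library with the `inclusive` method, which is an assumption about the quartile convention; other conventions give slightly different quartiles and therefore slightly different fences.

```python
from statistics import quantiles

data = [5, 7, 8, 12, 13, 15, 18, 19, 20, 25, 28, 30, 33, 35, 38, 50, 60]

# Assumption: "inclusive" quartiles; quantiles() sorts the data internally.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3.0 * iqr, q3 + 3.0 * iqr

extreme = [x for x in data if x < lower_outer or x > upper_outer]
mild = [
    x for x in data
    if (x < lower_inner or x > upper_inner) and x not in extreme
]

print("Q1, Q3, IQR:", q1, q3, iqr)
print("Inner fences:", lower_inner, upper_inner)
print("Outer fences:", lower_outer, upper_outer)
print("Mild outliers:", mild)
print("Extreme outliers:", extreme)
```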

5. Advantages of Tukey’s Fences

Simple to compute: Tukey’s Fences are based on simple statistics (Q1, Q3, and IQR), making them easy to implement and understand.
Non-parametric: This method does not assume any specific distribution of the data (e.g., normal distribution), so it works well for a wide variety of datasets.
Robustness: Since the method is based on percentiles rather than the mean and standard deviation, the fences themselves are barely shifted by the extreme values they are meant to detect.
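The robustness point can be illustrated by comparing Tukey’s Fences with a rule built on the mean and standard deviation. The data below are hypothetical, and the |z| > 3 cutoff is a common rule of thumb assumed here for comparison only. In a small sample, a single gross error inflates the standard deviation so much that its own z-score stays below 3, while the quartiles hardly move.

```python
from statistics import mean, stdev, quantiles

data = [10, 11, 11, 12, 12, 13, 1000]  # hypothetical values with one gross error

# Z-score of the largest value, using the sample mean and standard deviation.
z_of_max = (max(data) - mean(data)) / stdev(data)
flagged_by_z = abs(z_of_max) > 3

# Upper outer Tukey fence, using "inclusive" quartiles (an assumed convention).
q1, _, q3 = quantiles(data, n=4, method="inclusive")
upper_outer = q3 + 3.0 * (q3 - q1)
flagged_by_tukey = max(data) > upper_outer

print("z-score of 1000:", round(z_of_max, 2), "flagged:", flagged_by_z)
print("upper outer fence:", upper_outer, "flagged:", flagged_by_tukey)
```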

6. Limitations of Tukey’s Fences

Subjectivity in Outlier Definition: The choice of 1.5 and 3 as constants in the inner and outer fences is somewhat arbitrary, though it is widely accepted. Depending on the nature of the data, these values might need adjustment.
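Not Always Effective for All Data Types: Tukey’s Fences may not be as effective for highly skewed datasets or those with heavy tails, because the fences are placed symmetrically around the quartiles and can flag many legitimate values in the long tail. In those cases, transforming the data first (e.g., a log transform) or using a skewness-adjusted method may be more appropriate.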
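On the first limitation, the effect of the multiplier is easy to explore by making it a parameter. The sketch below is illustrative only: the function name `flag_outliers` and the alternative value `k=1.0` are assumptions, and the quartiles use the `inclusive` method of `statistics.quantiles`. It is applied to the dataset from the example in section 4.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return the values lying outside Q1 - k*IQR or Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [5, 7, 8, 12, 13, 15, 18, 19, 20, 25, 28, 30, 33, 35, 38, 50, 60]

default_flags = flag_outliers(data)          # conventional k = 1.5
strict_flags = flag_outliers(data, k=1.0)    # tighter, hypothetical choice
print(default_flags)  # []
print(strict_flags)   # [60]
```

A smaller multiplier flags more points, so any change from 1.5 is worth justifying with domain knowledge.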

Conclusion

Tukey’s Fences provide a straightforward and efficient way to detect outliers in datasets, especially when working with data that is not normally distributed. By using the interquartile range (IQR), the method highlights potential outliers based on their distance from the quartiles, making it an essential tool for any exploratory data analysis (EDA) process. However, like all statistical techniques, it is important to combine it with domain knowledge and other methods to ensure that the outlier detection process is appropriate for your specific dataset.
