Detecting outliers is an essential part of Exploratory Data Analysis (EDA), because unusual values can distort summary statistics and the models fitted afterwards. A widely used technique is Tukey’s Fences, which uses the interquartile range (IQR) to flag data points that lie unusually far from the middle half of the dataset. Here’s a breakdown of how to use Tukey’s Fences to detect outliers in EDA.
1. Understanding Tukey’s Fences
Tukey’s Fences are a set of boundaries or “fences” used to classify data points as mild or extreme outliers. These fences are calculated based on the interquartile range (IQR), which measures the statistical spread between the first quartile (Q1) and the third quartile (Q3).
- Q1: 25th percentile of the data
- Q3: 75th percentile of the data
- IQR: The difference between Q3 and Q1 (IQR = Q3 – Q1)
Tukey’s method defines two types of outliers:
- Mild outliers: Data points that lie outside the inner fence but within the outer fence.
- Extreme outliers: Data points that lie outside the outer fence.
2. Formula for Tukey’s Fences
Tukey’s Fences are calculated as follows:
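- Inner fence: from Q1 – 1.5 * IQR (lower) to Q3 + 1.5 * IQR (upper)
- Outer fence: from Q1 – 3.0 * IQR (lower) to Q3 + 3.0 * IQR (upper)
- Mild outliers are those that fall outside the inner fence but inside the outer fence.
- Extreme outliers are those that fall outside the outer fence.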
3. Steps to Detect Outliers Using Tukey’s Fences
Step 1: Calculate Q1 and Q3
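Start by sorting your data in ascending order. The first quartile (Q1) is the median of the lower half of the dataset, and the third quartile (Q3) is the median of the upper half. When the number of data points is odd, conventions differ on whether the overall median belongs to both halves (Tukey’s original approach) or to neither, and statistical software often interpolates between neighboring values instead, so quartiles can differ slightly from one tool to another.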
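Because quartile conventions vary, it is worth checking which one a tool applies. The sketch below uses `statistics.quantiles` from the Python standard library on a small made-up list (the data is an assumption for illustration, not from this article) and compares its two methods. For this nine-value list, `inclusive` agrees with including the median in both halves, and `exclusive` agrees with leaving it out.

```python
from statistics import quantiles

# Illustrative data, invented for this example.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns three cut points: Q1, the median, and Q3.
q1_inc, _, q3_inc = quantiles(data, n=4, method="inclusive")
q1_exc, _, q3_exc = quantiles(data, n=4, method="exclusive")

print(q1_inc, q3_inc)  # 3.0 7.0
print(q1_exc, q3_exc)  # 2.5 7.5
```

The two conventions differ by half a unit here, which is usually harmless but can move a borderline point across a fence.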
Step 2: Compute the IQR
Once you have Q1 and Q3, calculate the IQR: IQR = Q3 – Q1
Step 3: Calculate the Fences
Now calculate the inner and outer fences using the formulas:
- Lower inner fence = Q1 – 1.5 * IQR
- Upper inner fence = Q3 + 1.5 * IQR
- Lower outer fence = Q1 – 3.0 * IQR
- Upper outer fence = Q3 + 3.0 * IQR
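The four formulas translate directly into a small helper. This is a minimal sketch; the names `Fences` and `tukey_fences` are hypothetical, and the quartiles passed in are arbitrary example values.

```python
from typing import NamedTuple


class Fences(NamedTuple):
    lower_outer: float
    lower_inner: float
    upper_inner: float
    upper_outer: float


def tukey_fences(q1: float, q3: float) -> Fences:
    """Return the inner and outer Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return Fences(
        lower_outer=q1 - 3.0 * iqr,
        lower_inner=q1 - 1.5 * iqr,
        upper_inner=q3 + 1.5 * iqr,
        upper_outer=q3 + 3.0 * iqr,
    )


# Example quartiles chosen only for illustration: Q1 = 10, Q3 = 20, so IQR = 10.
fences = tukey_fences(10.0, 20.0)
print(fences.lower_inner, fences.upper_inner)  # -5.0 35.0
print(fences.lower_outer, fences.upper_outer)  # -20.0 50.0
```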
Step 4: Classify the Data Points
- Data points that fall outside the inner fences but still within the outer fences are classified as potential mild outliers.
- Data points below the lower outer fence or above the upper outer fence are classified as extreme outliers.
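The classification rule can be sketched as a single function that checks the outer fences first, so a point is never labeled both mild and extreme. The function name `classify` and the label strings are assumptions for illustration, and a value lying exactly on a fence is treated here as not outside it.

```python
def classify(x: float, q1: float, q3: float) -> str:
    """Label x as 'extreme', 'mild', or 'inlier' using Tukey's fences."""
    iqr = q3 - q1
    # Check the outer fences first: anything beyond them is extreme.
    if x < q1 - 3.0 * iqr or x > q3 + 3.0 * iqr:
        return "extreme"
    # Beyond the inner fences, but not the outer ones, is mild.
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"
    return "inlier"


# With Q1 = 10 and Q3 = 20 the inner fences are -5 and 35, the outer -20 and 50.
print(classify(36.0, 10.0, 20.0))  # mild
print(classify(51.0, 10.0, 20.0))  # extreme
print(classify(35.0, 10.0, 20.0))  # inlier
```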
Step 5: Visualize the Results (Optional)
A boxplot is a convenient way to see outliers at a glance. In the common Tukey-style boxplot, each whisker extends to the most extreme data point that still lies within the inner fences, and any points beyond the whiskers are drawn individually as outliers.
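In common Tukey-style boxplots the whiskers end at actual data points, namely the most extreme values still inside the inner fences. The sketch below computes those whisker ends without any plotting library. The function name `whisker_ends` and the sample data are assumptions for illustration, and the quartiles come from `statistics.quantiles` with the `inclusive` method, so a plotting library that uses a different quartile convention may place them slightly differently.

```python
from statistics import quantiles


def whisker_ends(data):
    """Return (low, high) whisker ends for a Tukey-style boxplot."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)


# Illustrative data: Q1 = 10, Q3 = 20, so the inner fences are -5 and 35.
sample = [2, 9, 10, 12, 15, 18, 20, 21, 40]
print(whisker_ends(sample))  # (2, 21)
```

Here the upper whisker stops at 21 rather than at the fence value 35, and 40 would be drawn as a separate outlier point.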
4. Example of Tukey’s Fences in Action
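Consider the following dataset of 17 values, already sorted in ascending order:
{ 5, 7, 8, 12, 13, 15, 18, 19, 20, 25, 28, 30, 33, 35, 38, 50, 60 }
- Calculate Q1 and Q3: The median is the 9th value, 20. Including the median in both halves, the lower half runs from 5 to 20 and the upper half from 20 to 60.
  - Q1 (25th percentile) = 13
  - Q3 (75th percentile) = 33
- Compute the IQR: IQR = 33 – 13 = 20
- Calculate the fences:
  - Lower inner fence = 13 – 1.5 * 20 = –17
  - Upper inner fence = 33 + 1.5 * 20 = 63
  - Lower outer fence = 13 – 3.0 * 20 = –47
  - Upper outer fence = 33 + 3.0 * 20 = 93
- Classify the data:
  - Mild outliers: Every value lies between –17 and 63, so there are no mild outliers. The largest values, 50 and 60, stand apart from the rest but remain below the upper inner fence.
  - Extreme outliers: No value lies below –47 or above 93, so there are no extreme outliers.
Leaving the median out of both halves gives Q1 = 12.5 and Q3 = 34 instead, which widens the inner fences to –19.75 and 66.25 and leads to the same conclusion.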
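A hand calculation like this is easy to check in code. The sketch below recomputes the example with Python's standard library, assuming the `inclusive` method of `statistics.quantiles`, which for these 17 values gives Q1 = 13 and Q3 = 33. Other quartile conventions would shift the numbers slightly.

```python
from statistics import quantiles

data = [5, 7, 8, 12, 13, 15, 18, 19, 20, 25, 28, 30, 33, 35, 38, 50, 60]

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)

# Extreme outliers lie beyond the outer fences.
extreme = [x for x in data if x < outer[0] or x > outer[1]]
# Mild outliers lie beyond the inner fences but are not extreme.
mild = [x for x in data if (x < inner[0] or x > inner[1]) and x not in extreme]

print(q1, q3, iqr)    # 13.0 33.0 20.0
print(inner)          # (-17.0, 63.0)
print(outer)          # (-47.0, 93.0)
print(mild, extreme)  # [] []
```

Under this convention the largest value, 60, sits just below the upper inner fence of 63, so nothing is flagged.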
5. Advantages of Tukey’s Fences
- Simple to compute: Tukey’s Fences are based on simple statistics (Q1, Q3, and IQR), making them easy to implement and understand.
- Non-parametric: This method does not assume any specific distribution of the data (e.g., normal distribution), so it can be applied to a wide variety of datasets.
- Robustness: Since the method is based on quartiles rather than the mean and standard deviation, the fences themselves are not pulled toward the extreme values they are meant to detect.
6. Limitations of Tukey’s Fences
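- Subjectivity in Outlier Definition: The choice of 1.5 and 3 as constants in the inner and outer fences is a convention rather than a derived result, though it is widely accepted. Depending on the nature of the data, these values might need adjustment.
- Not Always Effective for All Data Types: The fences are symmetric around the quartiles, so for highly skewed or heavy-tailed datasets they tend to flag many legitimate points in the long tail. In those cases, transforming the data first (e.g., with a logarithm) or using a skewness-adjusted variant of the boxplot may be more appropriate. Z-scores are not a good substitute here, because the mean and standard deviation are themselves distorted by skew and extreme values.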
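Since the multiplier is a convention, it can help to expose it as a parameter and see how sensitive the result is. The sketch below is one way to do that. The function name `flag_outliers` and its parameter `k` are hypothetical, and quartiles are again computed with the `inclusive` method of `statistics.quantiles`.

```python
from statistics import quantiles


def flag_outliers(data, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


data = [5, 7, 8, 12, 13, 15, 18, 19, 20, 25, 28, 30, 33, 35, 38, 50, 60]

print(flag_outliers(data, k=1.5))  # []
print(flag_outliers(data, k=1.0))  # [60]
```

Tightening the multiplier from 1.5 to 1.0 is enough to flag 60 in this dataset, which shows how much the outcome depends on the chosen constant.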
Conclusion
Tukey’s Fences provide a straightforward and efficient way to detect outliers in datasets, especially when working with data that is not normally distributed. By using the interquartile range (IQR), the method highlights potential outliers based on their distance from the quartiles, making it an essential tool for any exploratory data analysis (EDA) process. However, like all statistical techniques, it is important to combine it with domain knowledge and other methods to ensure that the outlier detection process is appropriate for your specific dataset.