Skewness refers to the asymmetry of a variable's distribution, that is, how far it departs from a symmetric (often normal) shape. In the context of Exploratory Data Analysis (EDA), detecting and interpreting skewness is a critical step for understanding the distribution of variables and making informed decisions about data preprocessing, transformation, and modeling. Skewness affects statistical analysis and machine learning models, especially those sensitive to assumptions of normality.
Understanding Skewness
Skewness measures the asymmetry of a probability distribution. It can be categorized as:
- Positive Skew (Right Skew): Tail is stretched to the right; most data points are concentrated on the left.
- Negative Skew (Left Skew): Tail is stretched to the left; most data points are concentrated on the right.
- Zero Skew: Distribution is symmetrical, often (but not necessarily) indicating a normal distribution.
The skewness coefficient is calculated using the formula:
Skewness = (n/((n-1)(n-2))) * Σ((xᵢ – x̄) / s)³
Where:
- n is the sample size
- xᵢ is each value
- x̄ is the mean
- s is the standard deviation
A perfectly symmetrical dataset has a skewness of 0. Generally:
- Skewness < -1 or > 1 indicates highly skewed data.
- Skewness between -1 and -0.5 or between 0.5 and 1 indicates moderate skew.
- Skewness between -0.5 and 0.5 suggests near symmetry.
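As a quick sanity check of the formula and cut-offs above, here is a minimal sketch that computes the adjusted skewness by hand and compares it with pandas (the lognormal sample and variable names are purely illustrative):

```python
import numpy as np
import pandas as pd

def sample_skewness(x):
    """Adjusted Fisher-Pearson skewness, following the formula above."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)  # sample standard deviation
    return (n / ((n - 1) * (n - 2))) * np.sum(((x - mean) / s) ** 3)

data = np.random.lognormal(mean=0.0, sigma=0.75, size=1_000)  # right-skewed toy sample
print(sample_skewness(data))   # manual computation
print(pd.Series(data).skew())  # pandas applies the same bias-corrected estimator
```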
Detecting Skewness in EDA
1. Visual Techniques
Visualizations are often the first tools used in EDA to detect skewness:
a. Histogram
A histogram provides a clear visualization of the distribution of a variable.
- Right skew: Long tail on the right
- Left skew: Long tail on the left
- Symmetric: Bell-shaped curve
b. Box Plot
Box plots highlight outliers and the spread of the data. A longer tail on one side signals skewness.
c. Density Plot (KDE Plot)
Kernel Density Estimation plots smooth the histogram, making it easier to see skew.
d. QQ Plot
Quantile-Quantile plots compare the quantiles of the data against a normal distribution. Deviations from the 45-degree line suggest skewness.
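For illustration, here is a hedged sketch of these four plots using seaborn, matplotlib, and scipy; the lognormal sample is only a stand-in for a real skewed variable:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

data = np.random.lognormal(mean=0.0, sigma=0.75, size=1_000)  # right-skewed toy sample

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(data, ax=axes[0, 0])                    # a. histogram
axes[0, 0].set_title("Histogram")
sns.boxplot(x=data, ax=axes[0, 1])                   # b. box plot
axes[0, 1].set_title("Box plot")
sns.kdeplot(data, ax=axes[1, 0])                     # c. density (KDE) plot
axes[1, 0].set_title("KDE plot")
stats.probplot(data, dist="norm", plot=axes[1, 1])   # d. QQ plot against a normal
axes[1, 1].set_title("QQ plot")
plt.tight_layout()
plt.show()
```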
2. Statistical Methods
a. Skewness Value
Use statistical libraries to compute the skewness:
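For example, with pandas or SciPy (the series below is a placeholder for your own variable):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

values = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # placeholder data

print(values.skew())             # pandas: bias-corrected skewness
print(skew(values, bias=False))  # scipy: same estimator when bias=False
```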
Interpret the result according to the guidelines mentioned earlier.
b. Normality Tests
These can supplement skewness detection:
- Shapiro-Wilk Test
- Kolmogorov–Smirnov Test
- Anderson-Darling Test
Example using Shapiro-Wilk:
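A minimal sketch with scipy.stats.shapiro; the sample and the 5% significance threshold are illustrative choices:

```python
import numpy as np
from scipy.stats import shapiro

data = np.random.lognormal(mean=0.0, sigma=0.75, size=500)  # illustrative skewed sample

stat, p_value = shapiro(data)
print(f"W = {stat:.4f}, p = {p_value:.4g}")
if p_value < 0.05:
    print("Reject normality at the 5% level (consistent with skewed data)")
else:
    print("No evidence against normality at the 5% level")
```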
Interpreting Skewness in Context
Positive Skewness
- Long tail to the right
- Mean > Median > Mode
- Common in income, wealth, and outlier-prone datasets
- May affect algorithms that assume normality (e.g., linear regression)
Negative Skewness
- Long tail to the left
- Mean < Median < Mode
- Occurs in test scores (e.g., when most students score high)
- May require transformation for certain models
Implications on Data Analysis
Skewed data can distort:
- Measures of central tendency
- Hypothesis testing
- Confidence intervals
- Model assumptions
Machine learning models such as linear regression and logistic regression rely on assumptions that heavily skewed data can violate (linear regression, for instance, assumes normally distributed residuals), and distance-based methods like KNN are sensitive to skewed feature scales. Strong skewness can therefore impair model performance and produce misleading or unstable fits.
Handling Skewed Data
1. Data Transformation
a. Log Transformation
Effective for reducing right skew.
b. Square Root or Cube Root
Works for moderately skewed data.
c. Box-Cox Transformation
Optimizes a power transform for normality (only positive values).
d. Yeo-Johnson Transformation
Suitable for data with zero or negative values.
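A minimal sketch of these four transformations with NumPy and SciPy, assuming a strictly positive, right-skewed series (the data and names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # positive, right-skewed

log_x = np.log(x)                        # a. log transform (strictly positive values)
sqrt_x = np.sqrt(x)                      # b. square-root transform (non-negative values)
boxcox_x, bc_lambda = stats.boxcox(x)    # c. Box-Cox fits a power parameter (positive values only)
yj_x, yj_lambda = stats.yeojohnson(x)    # d. Yeo-Johnson also handles zeros and negatives

for name, arr in [("original", x), ("log", log_x), ("sqrt", sqrt_x),
                  ("box-cox", boxcox_x), ("yeo-johnson", yj_x)]:
    print(f"{name:12s} skew = {pd.Series(arr).skew():+.3f}")
```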
2. Outlier Removal
Outliers can cause or exaggerate skewness. Removing or capping extreme values can help.
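One common approach is IQR-based capping (winsorizing) rather than dropping rows; the 1.5 × IQR fences below are a conventional but adjustable choice:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # placeholder data

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences

capped = x.clip(lower=lower, upper=upper)        # cap extreme values instead of removing them
print(f"skew before: {x.skew():.3f}, after capping: {capped.skew():.3f}")
```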
3. Binning or Bucketing
Discretizing a continuous variable into bins can reduce the effect of skewness on modeling.
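As a sketch, equal-frequency (quantile) bins with pandas spread a skewed variable across ordered categories; the number of bins and the labels are arbitrary choices:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # placeholder data

bins = pd.qcut(x, q=5, labels=["very low", "low", "medium", "high", "very high"])
print(bins.value_counts())  # roughly equal counts per bin despite the skew
```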
Practical Use Case: EDA Workflow with Skewness Detection
- Load Data
- Initial Visual Inspection: Use histplot, boxplot, and kdeplot to inspect the distribution.
- Calculate Skewness
- Normality Test: Use shapiro or qqplot.
- Decide on Action:
  - If data is moderately or highly skewed, apply transformations.
  - Compare model performance before and after transformation.
- Re-check Skewness: After transformation, re-plot and re-compute skewness. (An end-to-end sketch follows this list.)
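A compact end-to-end sketch of this workflow; the file name data.csv and the column name income are hypothetical stand-ins for your own dataset:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro

df = pd.read_csv("data.csv")              # 1. load data (path is hypothetical)
col = "income"                            # hypothetical right-skewed column

sns.histplot(df[col], kde=True)           # 2. initial visual inspection
plt.show()

skew_before = df[col].skew()              # 3. calculate skewness
_, p_value = shapiro(df[col].dropna())    # 4. normality test

if abs(skew_before) > 0.5:                # 5. moderately or highly skewed -> transform
    df[col + "_log"] = np.log1p(df[col])  # log1p tolerates zeros in non-negative data
    print(f"skew before: {skew_before:.3f}, after: {df[col + '_log'].skew():.3f}")  # 6. re-check
```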
Conclusion
Detecting and interpreting skewness through EDA is essential for robust statistical analysis and machine learning modeling. Visual tools like histograms, box plots, and QQ plots combined with statistical methods provide a comprehensive picture of data symmetry. Interpreting skewness in context ensures better feature engineering, more accurate models, and reliable insights. By handling skewness proactively, data scientists can significantly improve data quality and downstream analytical performance.