Skewness refers to the asymmetry of a variable's distribution, that is, how far it departs from a symmetric (often normal) shape. In the context of Exploratory Data Analysis (EDA), detecting and interpreting skewness is a critical step for understanding the distribution of variables and making informed decisions about data preprocessing, transformation, and modeling. Skewness affects statistical analysis and machine learning models, especially those sensitive to assumptions of normality.
Understanding Skewness
Skewness measures the asymmetry of a probability distribution. It can be categorized as:
- Positive Skew (Right Skew): Tail is stretched to the right; most data points are concentrated on the left.
- Negative Skew (Left Skew): Tail is stretched to the left; most data points are concentrated on the right.
- Zero Skew: Distribution is symmetrical, often (but not necessarily) indicating a normal distribution.
The skewness coefficient is calculated using the formula:
Skewness = (n/((n-1)(n-2))) * Σ((xᵢ – x̄) / s)³
Where:
- n is the sample size
- xᵢ is each value
- x̄ is the mean
- s is the standard deviation
A perfectly symmetrical dataset has a skewness of 0. Generally:
- Skewness < -1 or > 1 indicates highly skewed data.
- Skewness between -1 and -0.5 or between 0.5 and 1 indicates moderate skew.
- Skewness between -0.5 and 0.5 suggests near symmetry.
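As a quick sanity check of the formula and cut-offs above, here is a minimal sketch that computes the adjusted skewness by hand and compares it with pandas (the lognormal sample and variable names are purely illustrative):

```python
import numpy as np
import pandas as pd

def sample_skewness(x):
    """Adjusted Fisher-Pearson skewness, following the formula above."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)  # sample standard deviation
    return (n / ((n - 1) * (n - 2))) * np.sum(((x - mean) / s) ** 3)

data = np.random.lognormal(mean=0.0, sigma=0.75, size=1_000)  # right-skewed toy sample
print(sample_skewness(data))   # manual computation
print(pd.Series(data).skew())  # pandas applies the same bias-corrected estimator
```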
Detecting Skewness in EDA
1. Visual Techniques
Visualizations are often the first tools used in EDA to detect skewness:
a. Histogram
A histogram provides a clear visualization of the distribution of a variable.
- Right skew: Long tail on the right
- Left skew: Long tail on the left
- Symmetric: Bell-shaped curve
b. Box Plot
Box plots highlight outliers and the spread of the data. A longer tail on one side signals skewness.
c. Density Plot (KDE Plot)
Kernel Density Estimation plots smooth the histogram, making it easier to see skew.
d. QQ Plot
Quantile-Quantile plots compare the quantiles of the data against a normal distribution. Deviations from the 45-degree line suggest skewness.
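For illustration, here is a hedged sketch of these four plots using seaborn, matplotlib, and scipy; the lognormal sample is only a stand-in for a real skewed variable:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

data = np.random.lognormal(mean=0.0, sigma=0.75, size=1_000)  # right-skewed toy sample

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(data, ax=axes[0, 0])                    # a. histogram
axes[0, 0].set_title("Histogram")
sns.boxplot(x=data, ax=axes[0, 1])                   # b. box plot
axes[0, 1].set_title("Box plot")
sns.kdeplot(data, ax=axes[1, 0])                     # c. density (KDE) plot
axes[1, 0].set_title("KDE plot")
stats.probplot(data, dist="norm", plot=axes[1, 1])   # d. QQ plot against a normal
axes[1, 1].set_title("QQ plot")
plt.tight_layout()
plt.show()
```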
2. Statistical Methods
a. Skewness Value
Use statistical libraries to compute the skewness:
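For example, with pandas or SciPy (the series below is a placeholder for your own variable):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

values = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # placeholder data

print(values.skew())             # pandas: bias-corrected skewness
print(skew(values, bias=False))  # scipy: same estimator when bias=False
```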
Interpret the result according to the guidelines mentioned earlier.
b. Normality Tests
These can supplement skewness detection:
- Shapiro-Wilk Test
- Kolmogorov–Smirnov Test
- Anderson-Darling Test
Example using Shapiro-Wilk:
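A minimal sketch with scipy.stats.shapiro; the sample and the 5% significance threshold are illustrative choices:

```python
import numpy as np
from scipy.stats import shapiro

data = np.random.lognormal(mean=0.0, sigma=0.75, size=500)  # illustrative skewed sample

stat, p_value = shapiro(data)
print(f"W = {stat:.4f}, p = {p_value:.4g}")
if p_value < 0.05:
    print("Reject normality at the 5% level (consistent with skewed data)")
else:
    print("No evidence against normality at the 5% level")
```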
Interpreting Skewness in Context
Positive Skewness
- Long tail to the right
- Mean > Median > Mode
- Common in income, wealth, and outlier-prone datasets
- May affect algorithms that assume normality (e.g., linear regression)
Negative Skewness
- Long tail to the left
- Mean < Median < Mode
- Occurs in test scores (e.g., when most students score high)
- May require transformation for certain models
Implications on Data Analysis
Skewed data can distort:
- Measures of central tendency
- Hypothesis testing
- Confidence intervals
- Model assumptions
Machine learning models such as linear regression and logistic regression rely on assumptions that heavily skewed data can violate (linear regression, for instance, assumes normally distributed residuals), and distance-based methods like KNN are sensitive to skewed feature scales. Strong skewness can therefore impair model performance and produce misleading or unstable fits.
Handling Skewed Data
1. Data Transformation
a. Log Transformation
Effective for reducing right skew.
b. Square Root or Cube Root
Works for moderately skewed data.
c. Box-Cox Transformation
Optimizes a power transform for normality (only positive values).
d. Yeo-Johnson Transformation
Suitable for data with zero or negative values.
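A minimal sketch of these four transformations with NumPy and SciPy, assuming a strictly positive, right-skewed series (the data and names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # positive, right-skewed

log_x = np.log(x)                        # a. log transform (strictly positive values)
sqrt_x = np.sqrt(x)                      # b. square-root transform (non-negative values)
boxcox_x, bc_lambda = stats.boxcox(x)    # c. Box-Cox fits a power parameter (positive values only)
yj_x, yj_lambda = stats.yeojohnson(x)    # d. Yeo-Johnson also handles zeros and negatives

for name, arr in [("original", x), ("log", log_x), ("sqrt", sqrt_x),
                  ("box-cox", boxcox_x), ("yeo-johnson", yj_x)]:
    print(f"{name:12s} skew = {pd.Series(arr).skew():+.3f}")
```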
2. Outlier Removal
Outliers can cause or exaggerate skewness. Removing or capping extreme values can help.
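One common approach is IQR-based capping (winsorizing) rather than dropping rows; the 1.5 × IQR fences below are a conventional but adjustable choice:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # placeholder data

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences

capped = x.clip(lower=lower, upper=upper)        # cap extreme values instead of removing them
print(f"skew before: {x.skew():.3f}, after capping: {capped.skew():.3f}")
```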
3. Binning or Bucketing
Discretizing a continuous variable into bins can reduce the effect of skewness on modeling.
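As a sketch, equal-frequency (quantile) bins with pandas spread a skewed variable across ordered categories; the number of bins and the labels are arbitrary choices:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.lognormal(mean=0.0, sigma=0.75, size=1_000))  # placeholder data

bins = pd.qcut(x, q=5, labels=["very low", "low", "medium", "high", "very high"])
print(bins.value_counts())  # roughly equal counts per bin despite the skew
```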
Practical Use Case: EDA Workflow with Skewness Detection
- Load Data
- Initial Visual Inspection: Use histplot, boxplot, and kdeplot to inspect the distribution.
- Calculate Skewness
- Normality Test: Use shapiro or qqplot.
- Decide on Action:
  - If data is moderately or highly skewed, apply transformations.
  - Compare model performance before and after transformation.
- Re-check Skewness: After transformation, re-plot and re-compute skewness. (An end-to-end sketch follows this list.)
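A compact end-to-end sketch of this workflow; the file name data.csv and the column name income are hypothetical stand-ins for your own dataset:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro

df = pd.read_csv("data.csv")              # 1. load data (path is hypothetical)
col = "income"                            # hypothetical right-skewed column

sns.histplot(df[col], kde=True)           # 2. initial visual inspection
plt.show()

skew_before = df[col].skew()              # 3. calculate skewness
_, p_value = shapiro(df[col].dropna())    # 4. normality test

if abs(skew_before) > 0.5:                # 5. moderately or highly skewed -> transform
    df[col + "_log"] = np.log1p(df[col])  # log1p tolerates zeros in non-negative data
    print(f"skew before: {skew_before:.3f}, after: {df[col + '_log'].skew():.3f}")  # 6. re-check
```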
Conclusion
Detecting and interpreting skewness through EDA is essential for robust statistical analysis and machine learning modeling. Visual tools like histograms, box plots, and QQ plots combined with statistical methods provide a comprehensive picture of data symmetry. Interpreting skewness in context ensures better feature engineering, more accurate models, and reliable insights. By handling skewness proactively, data scientists can significantly improve data quality and downstream analytical performance.