Data skewness refers to the asymmetry in the distribution of data. In the context of Exploratory Data Analysis (EDA), recognizing and addressing skewness is crucial because many statistical models assume that the data is normally distributed. Skewed data can lead to biased model predictions, inefficient parameter estimates, and misleading data interpretations. This article explores how to visualize and handle data skewness using EDA techniques.
Understanding Skewness
Skewness measures the degree and direction of asymmetry. It is generally categorized into three types:
- Symmetrical (zero skewness): Mean = Median = Mode
- Positively skewed (right-skewed): Mean > Median > Mode
- Negatively skewed (left-skewed): Mean < Median < Mode
Skewness can be numerically calculated using libraries such as pandas or scipy. A skewness value:
- Between -0.5 and 0.5 indicates fairly symmetrical data
- Between -1 and -0.5, or between 0.5 and 1, indicates moderate skewness
- Less than -1 or greater than 1 indicates high skewness
Visualizing Skewness in Data
Visualization is a core component of EDA that helps identify skewness intuitively.
1. Histogram
A histogram shows the frequency distribution of data. Skewed data will have a longer tail on one side:
In a right-skewed histogram, the tail is longer on the right; for left-skewed, it’s longer on the left.
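As a minimal sketch, the snippet below plots a histogram of a hypothetical right-skewed sample (synthetic exponential data, chosen here for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical right-skewed sample: exponential data has a long right tail
rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=1000)

plt.hist(data, bins=30, edgecolor="black")
plt.title("Histogram of a right-skewed sample")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.savefig("histogram.png")
```

For this sample the mean sits to the right of the median, the numerical signature of positive skew.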
2. Box Plot
Box plots highlight the median, quartiles, and outliers. Skewness is evident when the median is off-center and whiskers are uneven:
A longer upper whisker indicates positive skew, while a longer lower whisker indicates negative skew.
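A quick sketch, again on synthetic right-skewed data (a lognormal sample, assumed here for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Hypothetical right-skewed sample
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.6, size=500)

plt.boxplot(data)  # median off-center in the box, longer upper whisker
plt.title("Box plot of a right-skewed sample")
plt.savefig("boxplot.png")
```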
3. Q-Q Plot (Quantile-Quantile Plot)
Q-Q plots compare the quantiles of the data to a normal distribution. If the data follows the 45-degree line, it is normally distributed. Deviations indicate skewness:
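A minimal example using `scipy.stats.probplot`, with a synthetic skewed sample assumed for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical right-skewed sample
rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=500)

# Points bending away from the reference line indicate skewness
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q-Q plot against a normal distribution")
plt.savefig("qqplot.png")
```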
4. Density Plot (KDE)
Kernel Density Estimate plots provide a smoothed curve of the data distribution. Skewness is visible as an asymmetrical curve:
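A short sketch using pandas' built-in KDE plot (which relies on scipy), on a hypothetical gamma-distributed sample:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Hypothetical right-skewed sample
rng = np.random.default_rng(2)
s = pd.Series(rng.gamma(shape=2.0, scale=1.0, size=1000))

s.plot(kind="kde")  # smoothed density curve; asymmetry reveals skew
plt.title("KDE of a right-skewed sample")
plt.savefig("kde.png")
```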
5. Skewness Value (Numerical)
Use Python’s libraries to compute the skewness value:
This value complements visual insights with a quantifiable metric.
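For example, both pandas and scipy expose a skewness function (shown here on a synthetic right-skewed sample; note the two libraries use slightly different bias corrections by default):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical right-skewed sample
rng = np.random.default_rng(3)
values = rng.exponential(scale=1.0, size=1000)

s = pd.Series(values)
print(s.skew())      # pandas: bias-corrected sample skewness
print(skew(values))  # scipy: Fisher-Pearson coefficient (biased by default)
```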
Handling Skewness in Data
Once skewness is identified, especially high skewness, transforming the data may be necessary. The choice of transformation depends on the type of skewness.
1. Log Transformation
Useful for right-skewed data. It compresses the long tail and brings the distribution closer to normal:
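A minimal sketch on synthetic lognormal data; `np.log1p` (log(1 + x)) is used so that zero values are handled safely:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical, strongly right-skewed sample
rng = np.random.default_rng(4)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_data = np.log1p(data)  # compresses the long right tail

print(skew(data), skew(log_data))
```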
2. Square Root Transformation
Also used for moderate right skewness. It reduces the impact of larger values:
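A brief sketch on a moderately right-skewed synthetic sample (the square root requires non-negative values):

```python
import numpy as np
from scipy.stats import skew

# Hypothetical, moderately right-skewed sample
rng = np.random.default_rng(5)
data = rng.chisquare(df=3, size=1000)

sqrt_data = np.sqrt(data)  # dampens large values less aggressively than log
print(skew(data), skew(sqrt_data))
```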
3. Box-Cox Transformation
Applies a power transformation to make the data as normal as possible. It requires positive data:
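Using `scipy.stats.boxcox`, which fits the power parameter lambda by maximum likelihood (shown on a synthetic, strictly positive sample):

```python
import numpy as np
from scipy import stats

# Hypothetical, strictly positive, right-skewed sample
rng = np.random.default_rng(6)
data = rng.exponential(scale=2.0, size=1000)

transformed, fitted_lambda = stats.boxcox(data)  # lambda chosen automatically
print(fitted_lambda, stats.skew(transformed))
```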
4. Yeo-Johnson Transformation
An extension of Box-Cox that works with zero or negative values:
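A sketch using `scipy.stats.yeojohnson` on a synthetic sample that includes negative values (a Gumbel sample, assumed here for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample containing negative values
rng = np.random.default_rng(7)
data = rng.gumbel(loc=0.0, scale=2.0, size=1000)

transformed, fitted_lambda = stats.yeojohnson(data)  # works where Box-Cox cannot
print(fitted_lambda, stats.skew(transformed))
```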
5. Reciprocal Transformation
The reciprocal (1/x) can reduce strong right skewness, but it is undefined at zero and unstable when values lie close to zero, so it suits data bounded well away from zero:
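A minimal sketch on a heavy-tailed synthetic sample whose values are all at least 1 (a shifted Pareto sample, assumed for illustration):

```python
import numpy as np
from scipy.stats import skew

# Hypothetical heavy-tailed sample with all values >= 1
rng = np.random.default_rng(8)
data = rng.pareto(3.0, size=1000) + 1.0

recip = 1.0 / data  # compresses the long right tail (and reverses order)
print(skew(data), skew(recip))
```

Note that the reciprocal reverses the ordering of values, which can matter for interpretation.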
6. Winsorization
Instead of transforming, extreme values can be capped (Winsorized) to reduce skewness:
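A short sketch using `scipy.stats.mstats.winsorize`, which caps the chosen fraction of values at each tail (here the top and bottom 5%, on a synthetic sample):

```python
import numpy as np
from scipy.stats import skew
from scipy.stats.mstats import winsorize

# Hypothetical right-skewed sample
rng = np.random.default_rng(9)
data = rng.exponential(scale=1.0, size=1000)

# Cap the lowest and highest 5% of values at the 5th/95th percentiles
capped = np.asarray(winsorize(data, limits=[0.05, 0.05]))
print(skew(data), skew(capped))
```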
Impact of Skewness on Modeling
Ignoring skewness can affect:
- Linear Models: These assume normally distributed errors. Skewness can bias coefficient estimates.
- Tree-Based Models: Less sensitive to skewness, but feature scaling may still help.
- Clustering and PCA: Sensitive to feature distributions; normalization and skew correction improve performance.
- Outlier Detection: Skewed distributions can inflate false positives.
Always evaluate model performance before and after transformation to determine effectiveness.
Skewness in Target Variable
Skewness in the target variable, especially for regression tasks, can degrade model predictions. Transforming the target (e.g., with a log transformation) can improve model performance, but predictions must be inverse-transformed back to the original scale before interpretation.
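The round trip can be sketched as follows, on a hypothetical skewed target `y` (the model-fitting step is elided; `y_pred_log` stands in for the model's predictions on the log scale):

```python
import numpy as np

# Hypothetical right-skewed regression target
rng = np.random.default_rng(10)
y = rng.lognormal(mean=2.0, sigma=0.8, size=1000)

y_log = np.log1p(y)            # train the model on the transformed target
# ... fit a model on y_log, then predict ...
y_pred_log = y_log             # stand-in for predictions on the log scale
y_pred = np.expm1(y_pred_log)  # invert back to the original scale

print(np.allclose(y_pred, y))
```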
When Not to Transform
- Skewness is negligible or moderate: Some models, like Random Forests or XGBoost, handle skewed data well.
- Skewness carries meaning: In certain domains (e.g., income distribution), skewness reflects real-world phenomena and should be preserved.
- Robust algorithms: Algorithms robust to outliers and skewness may not require transformation.
Conclusion
Visualizing and handling data skewness is an essential step in EDA to ensure better model performance and data interpretation. Histograms, box plots, and Q-Q plots provide intuitive insights, while statistical metrics like skewness values offer quantifiable support. Transformation techniques such as log, square root, and power transformations are effective in reducing skewness and improving data symmetry. Properly addressing skewness leads to more reliable insights and more accurate predictive models.