How to Visualize Class Imbalances in Your Data Using EDA

Class imbalance is a common issue in classification problems where the number of observations in one class significantly outweighs those in the others. This imbalance can distort model evaluation: metrics like accuracy become misleading when one class dominates. To catch it early, Exploratory Data Analysis (EDA) offers several effective techniques for visualizing and understanding class distributions. With these visual tools, data scientists can better grasp the nature of the imbalance and take steps to mitigate its impact during modeling. Here’s how to visualize class imbalances effectively using EDA.

Understand the Nature of Class Imbalance

Before jumping into visualizations, it’s essential to comprehend what class imbalance entails. In binary classification, if 90% of data belongs to class 0 and only 10% to class 1, this 9:1 ratio can lead to a model that simply predicts the majority class to achieve high accuracy while performing poorly on the minority class.

In multiclass classification, the imbalance can span multiple classes with varying degrees of representation. Thus, recognizing the skewness early in the data analysis process is key to selecting proper modeling strategies later on.

1. Use Value Counts for a Quick Overview

One of the first steps in EDA for class imbalance is using value counts to assess how many samples belong to each class.

python
import pandas as pd

# Example
df = pd.read_csv('your_dataset.csv')
print(df['target'].value_counts())

This command gives you a raw count of samples per class. While informative, it lacks visual clarity, which is crucial for presentation and deeper insights.
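
If you want proportions rather than raw counts, value_counts also accepts normalize=True. The short sketch below reuses the same df and 'target' column as above and also prints a simple imbalance ratio.

python
# Proportions of each class (values sum to 1.0)
print(df['target'].value_counts(normalize=True))

# A simple imbalance ratio: majority count divided by minority count
counts = df['target'].value_counts()
print(f"Imbalance ratio: {counts.max() / counts.min():.1f} : 1")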

2. Visualize Class Distribution with Bar Plots

Bar plots offer a straightforward way to visualize class distributions. They can highlight imbalances clearly and are useful in both binary and multiclass settings.

python
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='target', data=df)
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

This plot helps detect imbalance immediately and provides a baseline for future resampling strategies like SMOTE, undersampling, or class weighting.
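
As a bridge to those strategies, here is a minimal sketch of inverse-frequency class weights, assuming scikit-learn is available; the resulting dictionary can be passed to the class_weight parameter of many scikit-learn classifiers.

python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are inversely proportional to class frequencies
classes = np.unique(df['target'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=df['target'])
print(dict(zip(classes, weights)))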

3. Pie Charts for Proportion View

While not always preferred in statistical analysis due to interpretability issues, pie charts can still serve a purpose in visually communicating the proportion of classes to non-technical stakeholders.

python
class_counts = df['target'].value_counts()

plt.figure(figsize=(6, 6))
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Class Proportion')
plt.axis('equal')
plt.show()

Pie charts work well when the number of classes is small, typically in binary or low-cardinality multiclass problems.

4. Analyze Class Distribution Over Time

In time-series or transactional data, class imbalance may vary across different time intervals. Plotting the class distribution over time can provide insights into evolving patterns or data collection biases.

python
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year'] = df['timestamp'].dt.year

class_by_year = df.groupby(['year', 'target']).size().unstack()
class_by_year.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Class Distribution Over Years')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

This method is helpful for understanding whether the imbalance is systemic or stems from recent data collection issues.
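
To separate genuine shifts in the class mix from changes in overall volume, the stacked counts above can also be normalized per year. This small extension reuses the class_by_year table from the previous snippet.

python
# Share of each class per year; each row sums to 1, so shifts in the class mix stand out
class_share = class_by_year.div(class_by_year.sum(axis=1), axis=0)
class_share.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Class Proportion Over Years')
plt.xlabel('Year')
plt.ylabel('Proportion')
plt.show()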

5. Use a Confusion Matrix Heatmap After Preliminary Modeling

Although not strictly part of EDA, early experimentation with a simple model can help visualize class imbalance through a confusion matrix. This helps demonstrate how a classifier might be biased towards the majority class.

python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

A skewed confusion matrix where most predictions fall into one class can further confirm the presence and severity of imbalance.
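
To complement the visual check, a per-class report makes the bias explicit in numbers; this is a quick follow-up to the model fitted above, not a replacement for proper evaluation.

python
from sklearn.metrics import classification_report

# Near-zero recall on the minority class is a strong sign the imbalance is hurting the model
print(classification_report(y_test, y_pred, digits=3))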

6. Visualize Class Distribution in Feature Space

Sometimes class imbalance interacts with feature distributions. Visualizing how classes are spread in feature space can offer valuable context.

Pair Plots or Scatter Plots:

python
sns.pairplot(df, hue='target', vars=['feature1', 'feature2'])
plt.suptitle('Feature Distribution by Class', y=1.02)
plt.show()

Or a simple scatter plot:

python
sns.scatterplot(data=df, x='feature1', y='feature2', hue='target')
plt.title('Feature Scatter Plot by Class')
plt.show()

This helps reveal if the minority class overlaps heavily with the majority class, hinting at the need for advanced sampling or modeling techniques.
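
When there are more than two features, a single overview scatter can be obtained by projecting the data onto two principal components. The sketch below is one way to do this, assuming the non-target columns are numeric (it keeps only the numeric subset).

python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Project standardized numeric features onto two principal components for one overview plot
numeric_features = df.drop('target', axis=1).select_dtypes('number')
components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(numeric_features))

sns.scatterplot(x=components[:, 0], y=components[:, 1], hue=df['target'], alpha=0.6)
plt.title('PCA Projection by Class')
plt.show()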

7. KDE Plots for Density Comparison

Kernel Density Estimation (KDE) plots allow comparing the distribution of a continuous feature across different classes.

python
for label in df['target'].unique():
    sns.kdeplot(df[df['target'] == label]['feature1'], label=f'Class {label}', fill=True)

plt.title('Feature Density by Class')
plt.xlabel('Feature1')
plt.legend()
plt.show()

This visualization helps you judge how well individual features separate the classes and how the imbalance might affect learning.
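
With a recent seaborn version (0.11 or later), the same comparison can be written as a single call; setting common_norm=False scales each class's density separately so the minority class is not dwarfed by the majority class.

python
# Each class is normalized to its own area rather than to the pooled sample size
sns.kdeplot(data=df, x='feature1', hue='target', fill=True, common_norm=False)
plt.title('Feature Density by Class (per-class normalization)')
plt.show()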

8. Use Log-Scale Plots for Extreme Imbalance

When class imbalance is very severe (e.g., 99:1), standard linear scale plots may hide the minority class completely. Switching to a logarithmic scale helps visualize all classes.

python
sns.countplot(x='target', data=df)
plt.yscale('log')
plt.title('Class Distribution (Log Scale)')
plt.xlabel('Class')
plt.ylabel('Log Count')
plt.show()

This makes even a tiny minority class visible and is particularly helpful in fraud or rare-event detection.
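
A small refinement, assuming matplotlib 3.4 or later, is to annotate each bar with its raw count so the exact size of the minority class stays readable even on a log scale.

python
# bar_label prints the raw count above each bar
ax = sns.countplot(x='target', data=df)
ax.set_yscale('log')
for container in ax.containers:
    ax.bar_label(container)
plt.title('Class Distribution (Log Scale, Annotated)')
plt.show()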

9. Interactive Dashboards for Exploratory Visualization

Tools like Plotly, or interactive dashboards built with Streamlit, can enhance EDA, especially for complex datasets that call for dynamic filtering.

python
import plotly.express as px

fig = px.histogram(df, x='target', title='Interactive Class Distribution')
fig.show()

Dashboards allow users to explore how class distributions change with different filters, segments, or time periods.
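
As a minimal sketch of that idea, the Streamlit app below filters the data by a hypothetical 'segment' column (replace it with any categorical column in your dataset) and redraws the class distribution for the selected subset.

python
# app.py -- run with: streamlit run app.py
import pandas as pd
import plotly.express as px
import streamlit as st

df = pd.read_csv('your_dataset.csv')

# 'segment' is a hypothetical categorical column used only to illustrate filtering
segment = st.selectbox('Segment', ['All'] + sorted(df['segment'].unique().tolist()))
filtered = df if segment == 'All' else df[df['segment'] == segment]

st.plotly_chart(px.histogram(filtered, x='target', title='Class Distribution'))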

Conclusion

Effective visualization of class imbalance during EDA is critical for developing robust classification models. It helps detect skewness early, guides preprocessing strategies like resampling or weighting, and informs modeling decisions. By combining static plots (bar, pie, KDE) with dynamic techniques (heatmaps, pair plots, log-scale adjustments, interactive dashboards), data scientists can develop a comprehensive understanding of class distributions in their datasets. This leads to better modeling strategies, improved evaluation metrics, and ultimately, more accurate and fair machine learning systems.
