How to Visualize Missing Data and Handle It Using EDA

Visualizing and handling missing data is a crucial step in Exploratory Data Analysis (EDA). Ignoring missing data can lead to inaccurate models, biased results, and misleading insights. A systematic EDA process ensures that missing data is both understood and treated appropriately before further analysis or modeling. This guide explains how to detect, visualize, and handle missing data using various EDA techniques.

Understanding the Nature of Missing Data

Before visualizing or handling missing values, it’s essential to understand the types of missing data:

Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any data, observed or unobserved.
Missing at Random (MAR): The missingness is related to other observed data but, once that data is accounted for, not to the missing value itself.
Missing Not at Random (MNAR): The missingness is related to the unobserved (missing) data itself.

Identifying the nature of the missing data informs the most suitable imputation or handling technique.
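
To make the three mechanisms concrete, here is a minimal simulation sketch. The toy `age` and `income` columns, the missingness probabilities, and the variable names are all assumptions made for illustration; they do not come from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: income loosely increases with age.
toy = pd.DataFrame({"age": rng.integers(18, 80, size=n)})
toy["income"] = 20_000 + 800 * toy["age"] + rng.normal(0, 10_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends only on the observed 'age' column.
mar_mask = rng.random(n) < np.where(toy["age"] > 50, 0.4, 0.1)

# MNAR: the chance of a missing income depends on the income value itself.
high_income = toy["income"] > toy["income"].median()
mnar_mask = rng.random(n) < np.where(high_income, 0.4, 0.1)

# Mean income computed from the values that remain observed under each mechanism.
observed_means = pd.Series({
    "full": toy["income"].mean(),
    "MCAR": toy.loc[~mcar_mask, "income"].mean(),
    "MAR": toy.loc[~mar_mask, "income"].mean(),
    "MNAR": toy.loc[~mnar_mask, "income"].mean(),
})
print(observed_means.round(0))
```

In this sketch the MCAR mean stays close to the full-data mean, while the MAR and MNAR means drift away from it. The practical difference is that the MAR drift can be corrected using `age`, whereas the MNAR drift cannot be detected from the observed data alone.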

Detecting Missing Data

To begin, a simple statistical overview provides quick insight:

python
import pandas as pd

# Load data
df = pd.read_csv('your_dataset.csv')

# Summary of missing values
missing_summary = df.isnull().sum()
missing_percentage = df.isnull().mean() * 100

# Combine for easier review
missing_data = pd.DataFrame({'Missing Count': missing_summary, 'Missing %': missing_percentage})
missing_data = missing_data[missing_data['Missing Count'] > 0].sort_values(by='Missing %', ascending=False)

This table immediately highlights which features have missing data and to what extent.
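
The column summary can be complemented with a row-wise view, which shows how much data a plain `dropna()` would discard. The following is a small sketch on a made-up frame; `toy` and its columns are hypothetical stand-ins for the loaded dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the dataset loaded above.
toy = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["Oslo", None, "Lima", "Pune"],
    "score": [0.5, np.nan, 0.9, 0.7],
})

# Number of missing cells in each row.
missing_per_row = toy.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing cells.
row_profile = missing_per_row.value_counts().sort_index()

# Share of rows with no missing values at all (the rows dropna() would keep).
complete_share = toy.notnull().all(axis=1).mean()

print(row_profile)
print(f"Complete rows: {complete_share:.0%}")
```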

Visualizing Missing Data

Visual tools offer deeper insight into the structure and patterns of missingness.

1. Heatmap Using Seaborn

python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()

A heatmap shows missing values across the dataset, making it easy to spot blocks of missing entries.

2. Missing Data Matrix with Missingno

The missingno library provides intuitive visualizations.

python
import missingno as msno

msno.matrix(df)
plt.show()

The matrix shows where data is missing in each column, and the sparkline on its right edge summarizes how complete each row is, so stretches of rows with many gaps stand out.

3. Bar Chart of Missing Values

python
msno.bar(df)
plt.show()

This plot highlights which features have missing values and the extent of the missingness in each.

4. Dendrogram for Correlation of Missing Values

python
msno.dendrogram(df)
plt.show()

The dendrogram groups features by how similar their missingness patterns are: features that join at a low height tend to be missing in the same rows. This is useful when deciding which features to impute together.
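
A numeric counterpart to the dendrogram is the correlation between the missingness flags of each pair of columns. The sketch below uses plain pandas on a hypothetical frame (`toy`, columns `a` to `d`); it is one way to compute such a table, not the routine missingno uses internally.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'a' and 'b' are missing in the same rows, 'c' in different ones.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    "c": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
    "d": [1, 2, 3, 4, 5, 6],  # fully observed
})

null_flags = toy.isnull()

# Columns that are never (or always) missing have constant flags and no
# defined correlation, so keep only columns whose flags vary.
varying = null_flags.loc[:, null_flags.nunique() > 1]

# +1: always missing together, -1: one is missing exactly when the other is present.
nullity_corr = varying.astype(int).corr()
print(nullity_corr.round(2))
```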

Exploring Patterns in Missing Data

Once missing data is visualized, the next step is exploring patterns:

Check correlation between missing data and other features.
Create missingness indicator variables (isnull flags) to analyze associations.
Segment data by a category (e.g., gender, region) to inspect if missing data is biased by a specific group.

python
df['FeatureA_missing'] = df['FeatureA'].isnull().astype(int)
df.groupby('Category')['FeatureA_missing'].mean()

This reveals whether certain categories have systematically more missing values.
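
The first check in the list, correlation between missingness and other features, can be sketched with `DataFrame.corrwith`. The data below is simulated, and the `tenure` and `noise` columns are hypothetical; `FeatureA` is deliberately made more likely to be missing when `tenure` is low.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000
toy = pd.DataFrame({
    "tenure": rng.normal(5, 2, size=n),
    "noise": rng.normal(0, 1, size=n),
    "FeatureA": rng.normal(0, 1, size=n),
})

# Simulated MAR-style pattern: low tenure makes FeatureA more likely to be missing.
drop = rng.random(n) < np.where(toy["tenure"] < 4, 0.5, 0.05)
toy.loc[drop, "FeatureA"] = np.nan

# Missingness indicator, then its correlation with each candidate feature.
flag = toy["FeatureA"].isnull().astype(int)
assoc = toy[["tenure", "noise"]].corrwith(flag).sort_values()
print(assoc.round(3))
```

A clearly non-zero correlation (here for `tenure`) suggests the data is not MCAR and that the related feature should be used when imputing.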

Handling Missing Data

After detecting and understanding missing data patterns, apply appropriate handling strategies.

1. Removing Missing Data

Drop rows: Suitable when only a small share of rows is affected and the values are plausibly MCAR; otherwise the remaining sample can be biased.

python
df_cleaned = df.dropna()

Drop columns: If a feature has a high percentage of missing data (a common rule of thumb is more than 50%, though the right cut-off depends on how important the feature is).

python
df_cleaned = df.drop(columns=['HighMissingFeature'])
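
Rather than naming columns by hand, the drop can be driven by a threshold. This is a sketch on a hypothetical frame; `max_missing_share` is an assumed cut-off and the column names are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

max_missing_share = 0.5  # assumed cut-off; tune per project

# Drop every column whose share of missing values exceeds the cut-off.
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share > max_missing_share].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

# Drop rows only where a column that matters for the analysis is missing,
# instead of dropping any row with a gap anywhere.
rows_kept = reduced.dropna(subset=["some_missing"])
print(cols_to_drop, len(rows_kept))
```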

2. Imputation Techniques

Imputation fills in missing values based on other available data.

a. Mean/Median/Mode Imputation

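Use the mean for roughly symmetric numerical features, the median for skewed ones or those with outliers, and the mode for categorical features. These methods are simple but shrink the variance of the imputed feature, so they work best when only a small share of values is missing.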

python
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])
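
When the data is later split for modeling, the fill value should be learned from the training part only and then reused. A minimal sketch with scikit-learn's `SimpleImputer` follows; the `train` and `test` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical split; note the outlier (90) in the training data.
train = pd.DataFrame({"Age": [22.0, 25.0, np.nan, 90.0]})
test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Learn the median from the training data only.
imputer = SimpleImputer(strategy="median")
imputer.fit(train[["Age"]])

# Apply the same stored statistic to both splits.
train_filled = imputer.transform(train[["Age"]])
test_filled = imputer.transform(test[["Age"]])

print("median used:", imputer.statistics_[0])
print("mean would be:", round(train["Age"].mean(), 2))
```

The median is much less affected by the single outlier than the mean would be, which is why it is often preferred for skewed features.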

b. Forward or Backward Fill

Used for time series or ordered data.

python
df['Value'] = df['Value'].ffill()  # forward fill; use .bfill() for backward fill
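
Long gaps are a risk with forward fill, because a stale value can be carried over many steps. The sketch below, on a hypothetical daily series, shows two common variations: capping the fill with `limit` and using time-based interpolation instead.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a three-day gap.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
series = pd.Series([10.0, np.nan, np.nan, np.nan, 18.0, 20.0], index=idx)

# Carry the last observation forward by at most one step; longer gaps stay missing.
filled_limited = series.ffill(limit=1)

# Alternative: interpolate linearly in time between the surrounding observations.
interpolated = series.interpolate(method="time")

print(pd.DataFrame({"ffill(limit=1)": filled_limited, "interpolated": interpolated}))
```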

c. K-Nearest Neighbors (KNN) Imputation

Imputes values based on similarity with other rows.

python
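from sklearn.impute import KNNImputer

# KNNImputer works on numeric data only, so restrict it to numeric columns
numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])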

KNN is effective when data has local relationships but may be computationally expensive.
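
Because KNN relies on distances, features on very different scales can dominate the neighbor search. One common workaround, sketched here on hypothetical data, is to standardize before imputing and transform back afterwards. It relies on `StandardScaler` ignoring NaN when fitting and leaving it in place when transforming.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the last row has an unknown income.
toy = pd.DataFrame({
    "income": [30_000.0, 32_000.0, 90_000.0, 95_000.0, np.nan],
    "age": [25.0, 27.0, 50.0, 52.0, 26.0],
})

# Standardize each column; NaN entries are skipped in fit and kept in transform.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the scaled space using the two most similar rows.
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map the result back to the original units.
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled), columns=toy.columns, index=toy.index
)
print(imputed.round(0))
```

In this toy case the two nearest rows by age are the first two, so the missing income becomes the average of their incomes.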

d. Multivariate Imputation by Chained Equations (MICE)

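Models each feature that has missing values as a function of the other features and cycles through the features repeatedly, refining the imputed values on each pass. scikit-learn's IterativeImputer is inspired by MICE but returns a single imputed dataset by default.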

python
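from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

# IterativeImputer expects numeric input, so restrict it to numeric columns
numeric_cols = df.select_dtypes(include='number').columns
imputer = IterativeImputer(random_state=0)
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])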

Ideal for datasets where features have complex interdependencies.
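
MICE in its original form produces several imputed datasets and pools the results. A rough way to approximate this with scikit-learn is to run `IterativeImputer` with `sample_posterior=True` under different seeds. The sketch below uses simulated data; the number of draws (5) and the column names are assumptions, and proper pooling of variances is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Simulated data: y depends on x, and 20% of y is removed at random.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.5, size=200)})
toy.loc[toy.sample(frac=0.2, random_state=0).index, "y"] = np.nan

# Each seed draws imputed values from the predictive distribution,
# giving a slightly different completed dataset.
draws = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
    draws.append(completed["y"].mean())

pooled_mean = float(np.mean(draws))
between_draw_spread = float(np.std(draws, ddof=1))
print(f"pooled mean of y: {pooled_mean:.3f} (spread across draws: {between_draw_spread:.3f})")
```

The spread across draws gives a sense of how much uncertainty the imputation itself adds to the estimate.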

3. Predictive Modeling

Use machine learning to predict missing values based on known features.

python
from sklearn.linear_model import LinearRegression

# Example: predict 'Age' from other features (assumed complete, since LinearRegression rejects NaN)
features = ['Feature1', 'Feature2']
age_missing = df['Age'].isnull()
train = df[~age_missing]
test = df[age_missing]

model = LinearRegression()
model.fit(train[features], train['Age'])

# Fill only the rows where 'Age' was missing
if not test.empty:
    df.loc[age_missing, 'Age'] = model.predict(test[features])

This is a powerful method but requires careful feature selection and validation.
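
One way to validate such a model is to hide some known values, impute them, and compare against the truth. The sketch below does this on simulated data; the coefficients, the 20% holdout share, and the comparison against a plain mean fill are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated data in which 'Age' really does depend on the two features.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({"Feature1": rng.normal(size=n), "Feature2": rng.normal(size=n)})
toy["Age"] = 40 + 5 * toy["Feature1"] - 3 * toy["Feature2"] + rng.normal(scale=2, size=n)

# Hide 20% of the known Age values so the imputation can be scored.
holdout = rng.random(n) < 0.2
features = ["Feature1", "Feature2"]

# Fit on the rows that stay visible, predict the hidden ones.
model = LinearRegression().fit(toy.loc[~holdout, features], toy.loc[~holdout, "Age"])
predicted = model.predict(toy.loc[holdout, features])
truth = toy.loc[holdout, "Age"].to_numpy()

# Mean absolute error of the model versus a plain mean fill.
mae_model = np.mean(np.abs(predicted - truth))
mae_mean = np.mean(np.abs(toy.loc[~holdout, "Age"].mean() - truth))
print(f"MAE model: {mae_model:.2f}, MAE mean fill: {mae_mean:.2f}")
```

If the model does not clearly beat the mean fill on held-out values, the extra complexity is probably not worth it.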

Validating Imputation

Once imputation is complete:

Compare distributions before and after imputation to ensure consistency.
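Use cross-validation if the imputed values feed a model, fitting the imputer on each training fold only so that no information leaks from the validation data.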
Store imputation parameters for reproducibility.

python
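import matplotlib.pyplot as plt

# 'df' must still hold the original values (with NaN); 'df_imputed' is the imputed copy
plt.hist(df['Age'].dropna(), bins=30, alpha=0.5, label='Original (observed only)')
plt.hist(df_imputed['Age'], bins=30, alpha=0.5, label='After imputation')
plt.xlabel('Age')
plt.ylabel('Count')
plt.legend()
plt.show()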
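
For the cross-validation and reproducibility points, one option is to put the imputer inside a scikit-learn pipeline, so it is refit on each training fold and its learned parameters stay attached to the model. The sketch below uses simulated data; the feature names, the median strategy, and the 5-fold setup are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Simulated features and target; 15% of 'f1' is then removed at random.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y = 3 * X["f1"] - X["f2"] + rng.normal(scale=0.5, size=n)
X.loc[rng.random(n) < 0.15, "f1"] = np.nan

# The imputer is part of the pipeline, so each fold learns its own medians
# from the training portion only.
pipeline = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")

# After a final fit, the stored imputation parameters can be inspected or saved.
pipeline.fit(X, y)
stored_medians = pipeline.named_steps["simpleimputer"].statistics_
print(scores.round(3), stored_medians.round(3))
```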

Documenting the Process

Keep track of:

Which features had missing data.
What percentage was missing.
What method was used to handle missingness.
Why a particular method was chosen.

This ensures transparency and reproducibility, especially in collaborative environments or production systems.
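
The items above can be captured in a small log table that is saved next to the analysis. This is only a sketch: the `toy` frame, the entries in `decisions`, and the output file name are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0],
    "City": ["Oslo", "Lima", None, "Pune"],
    "Score": [0.1, 0.4, 0.3, 0.9],
})

# Hypothetical decisions; in practice these come from the analysis itself.
decisions = {
    "Age": ("median imputation", "skewed distribution, low missing share"),
    "City": ("mode imputation", "categorical, single gap"),
}

# One log row per handled feature: share missing, method used, and the reason.
missing_pct = toy.isnull().mean() * 100
log = pd.DataFrame(
    [
        {"feature": col, "missing_pct": missing_pct[col], "method": method, "reason": reason}
        for col, (method, reason) in decisions.items()
    ]
)
print(log)
# log.to_csv("missing_data_log.csv", index=False)  # hypothetical output path
```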

Conclusion

EDA is not complete without thoroughly understanding and addressing missing data. Using visualization tools like missingno, seaborn, and matplotlib, you can identify patterns and correlations in missingness. Appropriate handling—whether through removal or imputation—depends on the context, nature, and distribution of missing data. Systematic treatment of missing values ensures robust insights and builds a solid foundation for any data analysis or machine learning project.
