How to Perform Exploratory Data Analysis on Image Data

Exploratory Data Analysis (EDA) on image data is a critical step in understanding the structure, patterns, and quality of image datasets before applying any machine learning or deep learning models. Unlike structured tabular data, image data presents unique challenges due to its high dimensionality, format complexity, and visual nature. Performing effective EDA on image data requires a combination of statistical analysis, visualization, and metadata inspection. This article outlines a comprehensive approach to conducting EDA on image data, covering essential steps and tools used by data scientists and computer vision engineers.

Understanding the Dataset Structure

Start by examining the structure of the dataset. Image datasets typically include:

Image files (JPEG, PNG, etc.)
Associated labels (for classification tasks)
Metadata (dimensions, color channels, source info)

Inspect the file directory to understand how the data is organized. Common structures include:

A folder for each class (e.g., cats/dogs/others)
A CSV or JSON file mapping file names to labels
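
When labels live in a CSV file rather than in folder names, a first pass can tally them directly. The sketch below is a minimal example that assumes hypothetical column names `filename` and `label`; adjust them to match the actual file. It reads from an in-memory string so it runs without any dataset on disk.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="label"):
    """Count rows per label in an open CSV file.

    The column name is an assumption; change it to match your file.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Small in-memory example standing in for a real labels file
example_csv = io.StringIO(
    "filename,label\n"
    "img_001.jpg,cat\n"
    "img_002.jpg,dog\n"
    "img_003.jpg,cat\n"
)
label_counts = count_labels(example_csv)
print(label_counts)
```

For a real file, pass an open handle instead, for example `open('labels.csv', newline='')`, where `labels.csv` is a placeholder name.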

Check Image Count and Distribution

Use Python and libraries such as os, pandas, and matplotlib to:

Count the total number of images
Analyze class distribution for imbalances
Identify missing data, such as empty or unexpectedly small classes

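python
import os
from collections import Counter

data_dir = 'path/to/images'
# Keep only subdirectories so stray files (e.g. a README) are not treated as classes
classes = sorted(
    cls for cls in os.listdir(data_dir)
    if os.path.isdir(os.path.join(data_dir, cls))
)
# Number of files in each class folder
image_counts = {cls: len(os.listdir(os.path.join(data_dir, cls))) for cls in classes}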

Visualize the distribution using a bar chart:

python
import matplotlib.pyplot as plt

plt.bar(image_counts.keys(), image_counts.values())
plt.title("Image Count per Class")
plt.xlabel("Class")
plt.ylabel("Count")
plt.show()
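
A bar chart shows imbalance at a glance, but a number is easier to track over time. The following sketch computes each class's share of the dataset and the ratio between the largest and smallest class. The function name and the example counts are hypothetical; in practice you would pass in the `image_counts` dictionary.

```python
def imbalance_summary(counts):
    """Return each class's share of the dataset and the largest-to-smallest ratio."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no images counted")
    shares = {cls: n / total for cls, n in counts.items()}
    smallest = min(counts.values())
    largest = max(counts.values())
    # An empty class makes the ratio undefined, so report infinity
    ratio = largest / smallest if smallest > 0 else float("inf")
    return shares, ratio


# Hypothetical counts for illustration
example_counts = {"cats": 800, "dogs": 150, "others": 50}
shares, ratio = imbalance_summary(example_counts)
print(shares)
print(f"Imbalance ratio: {ratio:.1f}")
```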

Image Dimension and Format Analysis

Images may vary in size, aspect ratio, and format. Inconsistent dimensions can affect model training. To analyze this:

Load a sample of images using PIL or OpenCV
Check dimensions and color modes

python
from PIL import Image

dimensions = []
for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        with Image.open(img_path) as img:
            dimensions.append(img.size)

# Plot histogram of widths and heights
widths, heights = zip(*dimensions)
plt.hist(widths, bins=20, alpha=0.5, label='Width')
plt.hist(heights, bins=20, alpha=0.5, label='Height')
plt.legend()
plt.title("Image Dimensions Distribution")
plt.show()
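
Width and height histograms do not show aspect ratio directly. As a rough sketch, the helper below summarizes a list of (width, height) pairs like the `dimensions` list collected above. The function name and the example values are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def summarize_dimensions(dimensions):
    """Summarize (width, height) pairs: aspect-ratio range and most common sizes."""
    # Aspect ratio is width divided by height; skip degenerate zero-height entries
    ratios = [w / h for w, h in dimensions if h > 0]
    return {
        "min_ratio": min(ratios),
        "median_ratio": median(ratios),
        "max_ratio": max(ratios),
        "common_sizes": Counter(dimensions).most_common(3),
    }


# Hypothetical sizes for illustration
example_dimensions = [(640, 480), (640, 480), (1024, 768), (500, 500), (300, 600)]
summary = summarize_dimensions(example_dimensions)
print(summary)
```

A wide gap between the minimum and maximum ratio suggests that a plain resize to a fixed square would distort some images, so cropping or padding may be worth considering.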

Check image formats and channels:

python
formats = []
modes = []

for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        with Image.open(img_path) as img:
            formats.append(img.format)
            modes.append(img.mode)

print(Counter(formats))
print(Counter(modes))
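
In Pillow, mode "L" is 8-bit grayscale, "RGB" is three-channel color, and "RGBA" adds an alpha channel. One possible way to act on the mode counts is to treat the most common mode as the expected one and list everything else for review. The helper name and example data below are hypothetical.

```python
from collections import Counter


def find_minority_modes(modes):
    """Return the most common color mode and the counts of all other modes."""
    counts = Counter(modes)
    majority_mode, _ = counts.most_common(1)[0]
    minority = {mode: n for mode, n in counts.items() if mode != majority_mode}
    return majority_mode, minority


# Hypothetical mode list for illustration
example_modes = ["RGB"] * 95 + ["L"] * 3 + ["RGBA"] * 2
majority, minority = find_minority_modes(example_modes)
print(majority, minority)
```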

Visual Inspection of Images

Randomly display images from different classes to:

Spot anomalies
Detect poor quality images
Ensure correct labeling

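python
import random

for cls in classes:
    folder = os.path.join(data_dir, cls)
    files = os.listdir(folder)
    # random.sample raises ValueError if asked for more items than exist
    samples = random.sample(files, min(5, len(files)))
    for img_name in samples:
        img_path = os.path.join(folder, img_name)
        # The context manager closes the file once the image is drawn
        with Image.open(img_path) as img:
            plt.imshow(img)
        plt.title(cls)
        plt.axis('off')
        plt.show()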

This step helps catch issues like:

Corrupt or unreadable images
Incorrect class assignments
Unwanted artifacts in images
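
When an issue turns up during inspection, it helps to be able to pull up the same images again. A minimal sketch, assuming file names have already been grouped by class in a dictionary (the name `files_by_class` is hypothetical), is to draw the sample with a seeded random generator:

```python
import random


def sample_per_class(files_by_class, k=5, seed=0):
    """Pick up to k files per class, reproducibly for a given seed."""
    rng = random.Random(seed)
    samples = {}
    for cls, files in files_by_class.items():
        # Sort first so the result does not depend on directory listing order
        ordered = sorted(files)
        # Classes with fewer than k files contribute everything they have
        samples[cls] = rng.sample(ordered, min(k, len(ordered)))
    return samples


# Hypothetical file names for illustration
example_files = {
    "cats": ["c1.jpg", "c2.jpg", "c3.jpg", "c4.jpg"],
    "dogs": ["d1.jpg"],
}
picked = sample_per_class(example_files, k=2, seed=42)
print(picked)
```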

Statistical Pixel Analysis

Perform pixel intensity analysis to gain insights into the brightness and contrast of images. Convert images to grayscale for simplified intensity analysis.

python
import numpy as np

gray_means = []
gray_stds = []

for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        # Open inside a context manager so the file handle is closed promptly
        with Image.open(img_path) as img:
            img_array = np.array(img.convert('L'))  # Convert to grayscale
        gray_means.append(np.mean(img_array))
        gray_stds.append(np.std(img_array))

plt.hist(gray_means, bins=30, alpha=0.7)
plt.title("Mean Pixel Intensity Distribution")
plt.xlabel("Mean Intensity")
plt.ylabel("Frequency")
plt.show()

Understanding pixel distribution helps in:

Detecting overly bright/dark images
Choosing appropriate normalization techniques
Identifying contrast issues
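
The histogram shows the overall shape, but it does not say which files are the problem. The sketch below assumes the per-image mean and standard deviation were stored alongside each path (the code above keeps only the values) and flags images against cutoffs. The threshold values are illustrative assumptions for 8-bit images, not recommendations.

```python
def flag_exposure(stats_by_path, dark_below=40.0, bright_above=215.0, flat_below=10.0):
    """Group image paths by exposure problem using grayscale mean and std (0-255)."""
    flags = {"too_dark": [], "too_bright": [], "low_contrast": []}
    for path, (mean, std) in sorted(stats_by_path.items()):
        if mean < dark_below:
            flags["too_dark"].append(path)
        elif mean > bright_above:
            flags["too_bright"].append(path)
        # A small standard deviation means most pixels share a similar intensity
        if std < flat_below:
            flags["low_contrast"].append(path)
    return flags


# Hypothetical (mean, std) values for illustration
example_stats = {
    "cats/a.jpg": (128.0, 55.0),
    "cats/b.jpg": (12.0, 6.0),
    "dogs/c.jpg": (240.0, 20.0),
    "dogs/d.jpg": (130.0, 4.0),
}
flags = flag_exposure(example_stats)
print(flags)
```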

Color Analysis

For RGB images, analyze each channel separately to identify color biases or unusual dominance.

python
r_means, g_means, b_means = [], [], []

for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        # Open inside a context manager so the file handle is closed promptly
        with Image.open(img_path) as img:
            img_array = np.array(img.convert('RGB'))
        r_means.append(np.mean(img_array[:,:,0]))
        g_means.append(np.mean(img_array[:,:,1]))
        b_means.append(np.mean(img_array[:,:,2]))

plt.hist(r_means, bins=30, alpha=0.5, label='Red')
plt.hist(g_means, bins=30, alpha=0.5, label='Green')
plt.hist(b_means, bins=30, alpha=0.5, label='Blue')
plt.legend()
plt.title("Color Channel Mean Intensity")
plt.show()

This is useful for natural-scene tasks and for detecting dataset-specific color patterns, such as a color cast shared by every image from one source.
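
Overlapping histograms can hide per-image dominance. One way to make it explicit is to count, for each image, which channel mean leads the others by a clear margin. The sketch below takes the three lists of channel means; the function name and the `margin` value are assumptions chosen for illustration.

```python
from collections import Counter


def dominant_channels(r_means, g_means, b_means, margin=10.0):
    """Count images whose strongest channel leads the runner-up by at least `margin`."""
    tally = Counter()
    for r, g, b in zip(r_means, g_means, b_means):
        channels = {"red": r, "green": g, "blue": b}
        ranked = sorted(channels.items(), key=lambda item: item[1], reverse=True)
        (top_name, top_value), (_, second_value) = ranked[0], ranked[1]
        if top_value - second_value >= margin:
            tally[top_name] += 1
        else:
            tally["balanced"] += 1
    return tally


# Hypothetical channel means for three images
tally = dominant_channels([120, 200, 90], [118, 100, 95], [115, 90, 160])
print(tally)
```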

Duplicate and Corrupt Image Detection

Remove or flag:

Identical duplicates (same pixel data)
Corrupt files (unreadable or improperly saved images)

Use hashing for duplicate detection:

python
import hashlib

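def get_image_hash(img_path):
    with Image.open(img_path) as img:
        # Include mode and size so images with the same raw bytes but
        # different shapes do not collide
        header = f"{img.mode}:{img.size}".encode()
        return hashlib.md5(header + img.tobytes()).hexdigest()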

hashes = {}
duplicates = []

for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        try:
            img_hash = get_image_hash(img_path)
            if img_hash in hashes:
                duplicates.append((img_path, hashes[img_hash]))
            else:
                hashes[img_hash] = img_path
        except Exception as exc:
            # Decoding failed, so treat the file as corrupt or unreadable
            print(f"Corrupt image found: {img_path} ({exc})")

Removing duplicates keeps repeated images from being over-weighted during training and from leaking between training and evaluation splits.
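
Duplicates that sit in different class folders deserve a closer look, since the same image cannot correctly carry two labels. As a rough sketch, the helper below filters a list of (path, path) pairs like the `duplicates` list above, assuming the class is the name of each file's parent folder. The function name and example paths are hypothetical.

```python
import os


def cross_class_duplicates(duplicates):
    """Keep only duplicate pairs whose files sit in different class folders."""
    conflicts = []
    for path_a, path_b in duplicates:
        # The class is taken to be the name of the file's parent folder
        class_a = os.path.basename(os.path.dirname(path_a))
        class_b = os.path.basename(os.path.dirname(path_b))
        if class_a != class_b:
            conflicts.append((path_a, class_a, path_b, class_b))
    return conflicts


# Hypothetical duplicate pairs for illustration
example_duplicates = [
    ("data/cats/a.jpg", "data/cats/b.jpg"),
    ("data/dogs/c.jpg", "data/cats/d.jpg"),
]
conflicts = cross_class_duplicates(example_duplicates)
print(conflicts)
```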

Dimensionality Reduction for Visualization

Use techniques like PCA or t-SNE on image embeddings to visualize clusters or detect anomalies. Start by converting images into feature vectors using pre-trained models like VGG16 or ResNet.

python
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img
from sklearn.decomposition import PCA
import seaborn as sns

model = VGG16(include_top=False, input_shape=(224,224,3), pooling='avg')

features = []
labels = []

for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder)[:50]:  # First 50 files only, for speed
        img_path = os.path.join(folder, img_name)
        img = load_img(img_path, target_size=(224,224))
        x = img_to_array(img)
        x = preprocess_input(x)
        x = np.expand_dims(x, axis=0)
        feat = model.predict(x)
        features.append(feat.flatten())
        labels.append(cls)

pca = PCA(n_components=2)
pca_result = pca.fit_transform(features)

sns.scatterplot(x=pca_result[:,0], y=pca_result[:,1], hue=labels)
plt.title("PCA of Image Features")
plt.show()

This step helps visually identify class overlap, outliers, or noise.
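
Outliers in the scatter plot can also be listed programmatically. The sketch below is one simple heuristic, not a standard method: it flags points whose distance from their class centroid is unusually large relative to the other points in that class. It works on plain (x, y) pairs such as the rows of `pca_result`; the function name and the `threshold` value are assumptions.

```python
import math
from collections import defaultdict


def centroid_outliers(points, labels, threshold=2.0):
    """Return indices of points far from their class centroid in a 2-D projection.

    A point is flagged when its distance to the centroid exceeds the class's
    mean distance by more than `threshold` standard deviations.
    """
    by_class = defaultdict(list)
    for index, (point, label) in enumerate(zip(points, labels)):
        by_class[label].append((index, point))

    flagged = []
    for members in by_class.values():
        cx = sum(p[0] for _, p in members) / len(members)
        cy = sum(p[1] for _, p in members) / len(members)
        distances = [math.hypot(p[0] - cx, p[1] - cy) for _, p in members]
        mean_d = sum(distances) / len(distances)
        std_d = math.sqrt(sum((d - mean_d) ** 2 for d in distances) / len(distances))
        if std_d == 0:
            continue  # All points equally far from the centroid; nothing stands out
        for (index, _), d in zip(members, distances):
            if (d - mean_d) / std_d > threshold:
                flagged.append(index)
    return sorted(flagged)


# Hypothetical 2-D points: class "a" has one point far from the rest
example_points = [
    (0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
    (0, 0.5), (1, 0.5), (0.5, 0), (0.5, 1), (10, 10),
    (20, 20), (21, 20), (20, 21), (22, 22),
]
example_labels = ["a"] * 10 + ["b"] * 4
flagged = centroid_outliers(example_points, example_labels)
print(flagged)
```

A class with only a handful of points cannot exceed the threshold, so this heuristic is only meaningful when each class contributes a reasonable number of samples.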

Summary of EDA Best Practices for Image Data

Start with dataset structure: Understand class folders, labels, and metadata.
Check image counts and distribution: Spot imbalances and missing data.
Inspect dimensions and formats: Ensure consistency in shape and channels.
Visualize sample images: Catch labeling or quality issues early.
Analyze pixel and color stats: Support normalization and preprocessing choices.
Detect duplicates and corrupt files: Clean the dataset before modeling.
Use feature extraction and PCA: Explore latent structures and relationships.

Proper EDA on image data sets the foundation for robust model performance, ensuring data quality, balance, and relevance. By leveraging the right tools and visualizations, you can uncover insights that lead to more informed preprocessing, augmentation, and training strategies.
