How to Perform Exploratory Data Analysis on Image Data

Exploratory Data Analysis (EDA) on image data is a critical step in understanding the structure, patterns, and quality of image datasets before applying any machine learning or deep learning models. Unlike structured tabular data, image data presents unique challenges due to its high dimensionality, format complexity, and visual nature. Performing effective EDA on image data requires a combination of statistical analysis, visualization, and metadata inspection. This article outlines a comprehensive approach to conducting EDA on image data, covering essential steps and tools used by data scientists and computer vision engineers.

Understanding the Dataset Structure

Start by examining the structure of the dataset. Image datasets typically include:

  • Image files (JPEG, PNG, etc.)

  • Associated labels (for classification tasks)

  • Metadata (dimensions, color channels, source info)

Inspect the file directory to understand how the data is organized. Common structures include:

  • A folder for each class (e.g., cats/dogs/others)

  • A CSV or JSON file mapping file names to labels
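When labels live in a CSV file rather than in folder names, the same structural checks can be run with the standard library. The sketch below is only illustrative: the column names `filename` and `label`, and both helper functions, are assumptions rather than part of any particular dataset, and a JSON mapping would need a different loader.

```python
import csv
import os
from collections import Counter

def load_label_counts(csv_path, label_column="label"):
    """Count how many rows (images) each label has in a labels CSV.

    The column name is a hypothetical default; adjust it to your file.
    """
    with open(csv_path, newline="") as f:
        return Counter(row[label_column] for row in csv.DictReader(f))

def find_missing_files(csv_path, image_dir, file_column="filename"):
    """Return file names listed in the CSV that do not exist in image_dir."""
    with open(csv_path, newline="") as f:
        listed = [row[file_column] for row in csv.DictReader(f)]
    return [name for name in listed
            if not os.path.isfile(os.path.join(image_dir, name))]
```

Calling `load_label_counts("labels.csv")` gives the per-class counts, and `find_missing_files("labels.csv", "path/to/images")` lists labeled files that are absent on disk.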

Check Image Count and Distribution

Use Python and libraries such as os, pandas, and matplotlib to:

  • Count the total number of images

  • Analyze class distribution for imbalances

  • Identify empty or unexpectedly small classes, which often point to missing or misfiled data

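```python
import os
from collections import Counter

data_dir = 'path/to/images'

# Treat each subdirectory of data_dir as one class; skip stray files
classes = sorted(
    d for d in os.listdir(data_dir)
    if os.path.isdir(os.path.join(data_dir, d))
)

# Number of entries in each class folder
image_counts = {cls: len(os.listdir(os.path.join(data_dir, cls))) for cls in classes}
```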

Visualize the distribution using a bar chart:

```python
import matplotlib.pyplot as plt

# image_counts is the per-class dictionary built in the previous snippet
plt.bar(list(image_counts.keys()), list(image_counts.values()))
plt.title("Image Count per Class")
plt.xlabel("Class")
plt.ylabel("Count")
plt.show()
```
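A bar chart shows imbalance visually, but a couple of numbers make it easier to compare datasets or track cleanup progress. The following is a minimal sketch, assuming a dictionary shaped like `image_counts`; the function name and the example counts are made up for illustration.

```python
def imbalance_summary(image_counts):
    """Return each class's share of the dataset and the largest/smallest ratio."""
    total = sum(image_counts.values())
    if total == 0:
        raise ValueError("no images counted")
    shares = {cls: n / total for cls, n in image_counts.items()}
    largest = max(image_counts.values())
    smallest = min(image_counts.values())
    # An empty class makes the ratio unbounded
    ratio = largest / smallest if smallest else float("inf")
    return shares, ratio

# Illustrative counts only, not taken from a real dataset
shares, ratio = imbalance_summary({"cats": 900, "dogs": 300, "others": 300})
print(shares, ratio)
```

A ratio close to 1 indicates balanced classes; what counts as "too imbalanced" depends on the task and is a judgment call.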

Image Dimension and Format Analysis

Images may vary in size, aspect ratio, and format. Inconsistent dimensions can affect model training. To analyze this:

  • Load a sample of images using PIL or OpenCV

  • Check dimensions and color modes

```python
from PIL import Image

dimensions = []
for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        with Image.open(img_path) as img:
            dimensions.append(img.size)  # (width, height)

# Plot histogram of widths and heights
widths, heights = zip(*dimensions)
plt.hist(widths, bins=20, alpha=0.5, label='Width')
plt.hist(heights, bins=20, alpha=0.5, label='Height')
plt.legend()
plt.title("Image Dimensions Distribution")
plt.show()
```
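Width and height histograms do not show aspect ratio directly. One way to summarize it, sketched here under the assumption that `dimensions` is a list of `(width, height)` tuples as collected above, is to compute width/height ratios and group them by orientation. The function name and the 5% tolerance for "square" are arbitrary choices.

```python
from collections import Counter

def aspect_ratio_summary(dimensions, square_tol=0.05):
    """Summarize width/height ratios for a list of (width, height) tuples."""
    ratios = [w / h for w, h in dimensions]

    def orientation(ratio):
        # Ratios within square_tol of 1.0 are treated as square
        if abs(ratio - 1.0) <= square_tol:
            return "square"
        return "landscape" if ratio > 1.0 else "portrait"

    return {
        "min_ratio": min(ratios),
        "max_ratio": max(ratios),
        "orientations": Counter(orientation(r) for r in ratios),
        "distinct_sizes": len(set(dimensions)),
    }
```

A wide spread between `min_ratio` and `max_ratio` suggests that naive resizing to a fixed square input would distort some images noticeably.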

Check image formats and channels:

```python
formats = []
modes = []
for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        with Image.open(img_path) as img:
            formats.append(img.format)  # e.g. 'JPEG', 'PNG'
            modes.append(img.mode)      # e.g. 'RGB', 'L', 'RGBA'

print(Counter(formats))
print(Counter(modes))
```

Visual Inspection of Images

Randomly display images from different classes to:

  • Spot anomalies

  • Detect poor quality images

  • Ensure correct labeling

```python
import random

for cls in classes:
    folder = os.path.join(data_dir, cls)
    files = os.listdir(folder)
    # random.sample raises ValueError if a class has fewer than 5 images
    samples = random.sample(files, min(5, len(files)))
    for img_name in samples:
        img_path = os.path.join(folder, img_name)
        with Image.open(img_path) as img:
            plt.imshow(img)
        plt.title(cls)
        plt.axis('off')
        plt.show()
```

This step helps catch issues like:

  • Corrupt or unreadable images

  • Incorrect class assignments

  • Unwanted artifacts in images
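Corrupt or unreadable files can also be flagged programmatically, which scales better than eyeballing samples. The sketch below uses Pillow's `Image.verify()`, which checks file integrity without decoding the full image; the helper name is hypothetical, and `verify()` may not catch every kind of damage, so treat the result as a first filter.

```python
from PIL import Image

def find_unreadable_images(paths):
    """Return (path, error) pairs for files Pillow cannot open or verify."""
    bad = []
    for path in paths:
        try:
            with Image.open(path) as img:
                img.verify()  # integrity check; does not decode pixel data
        except Exception as exc:  # Pillow raises several exception types here
            bad.append((path, repr(exc)))
    return bad
```

Files that pass this check can still fail later during a full decode, so a complete load of each image remains the more thorough (and slower) test.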

Statistical Pixel Analysis

Perform pixel intensity analysis to understand the brightness and contrast of the images. Converting each image to grayscale simplifies this: the mean pixel value approximates brightness, and the standard deviation approximates contrast.

```python
import numpy as np

gray_means = []
gray_stds = []
for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        with Image.open(img_path) as img:
            img_array = np.array(img.convert('L'))  # Convert to grayscale
        gray_means.append(np.mean(img_array))  # brightness
        gray_stds.append(np.std(img_array))    # contrast

plt.hist(gray_means, bins=30, alpha=0.7)
plt.title("Mean Pixel Intensity Distribution")
plt.xlabel("Mean Intensity")
plt.ylabel("Frequency")
plt.show()
```

Understanding pixel distribution helps in:

  • Detecting overly bright/dark images

  • Choosing appropriate normalization techniques

  • Identifying contrast issues
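The per-image statistics can be turned into a simple list of suspect images. This is a rough sketch: it assumes a list of image names was collected in the same order as the means and standard deviations, and the thresholds (on a 0 to 255 scale) are arbitrary starting points that should be tuned per dataset.

```python
def flag_exposure_issues(names, means, stds, dark=40, bright=215, low_contrast=15):
    """Map image name -> list of issues based on grayscale mean and std.

    Thresholds are illustrative defaults, not established standards.
    """
    flags = {}
    for name, mean, std in zip(names, means, stds):
        issues = []
        if mean < dark:
            issues.append("dark")
        elif mean > bright:
            issues.append("bright")
        if std < low_contrast:
            issues.append("low contrast")
        if issues:
            flags[name] = issues
    return flags
```

Flagged images are candidates for manual review rather than automatic removal, since a dark image can be perfectly valid for, say, night-time scenes.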

Color Analysis

For RGB images, analyze each channel separately to identify color biases or unusual dominance.

```python
r_means, g_means, b_means = [], [], []
for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        with Image.open(img_path) as img:
            img_array = np.array(img.convert('RGB'))  # shape (H, W, 3)
        r_means.append(np.mean(img_array[:, :, 0]))
        g_means.append(np.mean(img_array[:, :, 1]))
        b_means.append(np.mean(img_array[:, :, 2]))

plt.hist(r_means, bins=30, alpha=0.5, label='Red')
plt.hist(g_means, bins=30, alpha=0.5, label='Green')
plt.hist(b_means, bins=30, alpha=0.5, label='Blue')
plt.legend()
plt.title("Color Channel Mean Intensity")
plt.show()
```

This is useful for tasks involving natural scenes or detecting dataset-specific color patterns.
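Channel statistics also feed directly into normalization. As a sketch, the function below accumulates a per-channel mean and standard deviation over every pixel in the dataset, with values scaled to [0, 1]. It assumes each input is an RGB array or PIL image of shape (H, W, 3); the function name is hypothetical.

```python
import numpy as np

def channel_mean_std(images):
    """Per-channel mean and std over all pixels of all images, on a [0, 1] scale."""
    total = np.zeros(3)
    total_sq = np.zeros(3)
    n_pixels = 0
    for img in images:
        arr = np.asarray(img, dtype=np.float64) / 255.0
        flat = arr.reshape(-1, 3)  # one row per pixel
        total += flat.sum(axis=0)
        total_sq += (flat ** 2).sum(axis=0)
        n_pixels += flat.shape[0]
    mean = total / n_pixels
    # Var = E[x^2] - E[x]^2; clamp tiny negatives caused by rounding
    var = np.maximum(total_sq / n_pixels - mean ** 2, 0.0)
    return mean, np.sqrt(var)
```

Running sums are used so that images of different sizes contribute in proportion to their pixel counts and the whole dataset never has to sit in memory at once.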

Duplicate and Corrupt Image Detection

Remove or flag:

  • Identical duplicates (same pixel data)

  • Corrupt files (unreadable or improperly saved images)

Use hashing for duplicate detection:

```python
import hashlib

def get_image_hash(img_path):
    # Hash the decoded pixel data, so file metadata does not affect the result
    with Image.open(img_path) as img:
        return hashlib.md5(img.tobytes()).hexdigest()

hashes = {}
duplicates = []
for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder):
        img_path = os.path.join(folder, img_name)
        try:
            img_hash = get_image_hash(img_path)
        except Exception as exc:  # unreadable or truncated file
            print(f"Corrupt image found: {img_path} ({exc})")
            continue
        if img_hash in hashes:
            duplicates.append((img_path, hashes[img_hash]))
        else:
            hashes[img_hash] = img_path
```

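Removing duplicates keeps repeated images from being over-weighted during training and prevents the same image from landing in both the training and evaluation splits, which would inflate evaluation metrics.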
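An exact hash only catches pixel-identical copies; resized or re-compressed versions of the same picture slip through. One common workaround, which goes beyond the MD5 approach above, is a perceptual "average hash". The sketch below is a simplified version written for illustration, with hypothetical function names; the distance threshold for calling two images near-duplicates has to be chosen empirically.

```python
import numpy as np
from PIL import Image

def average_hash(img, hash_size=8):
    """Boolean fingerprint: which pixels of a tiny grayscale copy exceed its mean."""
    small = img.convert("L").resize((hash_size, hash_size))
    pixels = np.asarray(small, dtype=np.float64)
    return (pixels > pixels.mean()).flatten()

def hamming_distance(hash_a, hash_b):
    """Number of positions where two fingerprints differ."""
    return int(np.count_nonzero(hash_a != hash_b))
```

Two images whose fingerprints differ in only a few of the 64 positions are likely near-duplicates and worth a manual look.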

Dimensionality Reduction for Visualization

Use techniques like PCA or t-SNE on image embeddings to visualize clusters or detect anomalies. Start by converting images into feature vectors using pre-trained models like VGG16 or ResNet.

```python
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Pre-trained VGG16 without the classifier head; global average pooling
# yields one 512-dimensional feature vector per image
model = VGG16(include_top=False, input_shape=(224, 224, 3), pooling='avg')

features = []
labels = []
for cls in classes:
    folder = os.path.join(data_dir, cls)
    for img_name in os.listdir(folder)[:50]:  # Sample for speed
        img_path = os.path.join(folder, img_name)
        img = load_img(img_path, target_size=(224, 224))
        x = img_to_array(img)
        x = preprocess_input(x)
        x = np.expand_dims(x, axis=0)  # add batch dimension
        feat = model.predict(x, verbose=0)
        features.append(feat.flatten())
        labels.append(cls)

# Project the feature vectors onto their first two principal components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(np.array(features))

sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=labels)
plt.title("PCA of Image Features")
plt.show()
```

This step helps visually identify class overlap, outliers, or noise.
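Outliers spotted in the scatter plot can also be listed numerically. The following is a rough heuristic rather than a standard method: it assumes a 2-D (or higher) array of projected points with one label per row, and flags points whose distance to their own class centroid is unusually large. The function name and the z-score threshold of 3 are assumptions.

```python
import numpy as np

def centroid_outliers(points, labels, z_thresh=3.0):
    """Indices of points unusually far from their own class centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    flagged = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        # Distance of every point in the class to the class centroid
        dists = np.linalg.norm(points[idx] - points[idx].mean(axis=0), axis=1)
        spread = dists.std()
        if spread == 0:
            continue  # all points coincide; nothing to flag
        z_scores = (dists - dists.mean()) / spread
        flagged.extend(idx[z_scores > z_thresh].tolist())
    return sorted(flagged)
```

The returned indices can be mapped back to file paths for inspection; flagged points may be mislabeled images, unusual but valid samples, or noise.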

Summary of EDA Best Practices for Image Data

  1. Start with dataset structure: Understand class folders, labels, and metadata.

  2. Check image counts and distribution: Spot imbalances and mislabeling.

  3. Inspect dimensions and formats: Ensure consistency in shape and channels.

  4. Visualize sample images: Catch labeling or quality issues early.

  5. Analyze pixel and color stats: Support normalization and preprocessing choices.

  6. Detect duplicates and corrupt files: Clean the dataset before modeling.

  7. Use feature extraction and PCA: Explore latent structures and relationships.

Proper EDA on image data sets the foundation for robust model performance, ensuring data quality, balance, and relevance. By leveraging the right tools and visualizations, you can uncover insights that lead to more informed preprocessing, augmentation, and training strategies.
