Detect image duplicates with hashing

Detecting image duplicates using hashing is an efficient and widely used method to identify identical or near-identical images. Here’s a comprehensive article explaining the concept, techniques, and practical implementations for detecting duplicate images using hashing.

In the digital era, the volume of images generated and shared every day is enormous. Managing these images often requires identifying duplicates or near-duplicates, which helps in saving storage, improving search efficiency, and organizing image collections. Image hashing is one of the most effective methods to detect duplicates by converting images into compact, comparable hash codes.

What is Image Hashing?

Image hashing transforms an image into a fixed-size string or number (hash) that represents its content. Unlike cryptographic hashes (e.g., MD5, SHA), which change drastically with small modifications, image hashes are designed to be robust against minor variations like resizing, compression, or color adjustments. This means similar images produce similar hashes, enabling detection of duplicates or near-duplicates.

Types of Image Hashing Techniques

Average Hash (aHash)
- Converts the image to grayscale and resizes it to a small fixed dimension (e.g., 8×8).
- Computes the average pixel value.
- Each pixel is compared to the average; pixels above average become 1, below become 0.
- The resulting bits form the hash.
- Simple and fast, but not very resistant to complex distortions.
Perceptual Hash (pHash)
- Uses Discrete Cosine Transform (DCT) to capture frequency components of the image.
- Focuses on low-frequency parts that represent basic image structure.
- More robust to color changes and minor edits.
- Popular for near-duplicate detection.
Difference Hash (dHash)
- Works by comparing adjacent pixels to capture relative gradients.
- Faster than pHash and less sensitive to lighting changes.
- Produces a binary string representing differences between pixels.
Wavelet Hash (wHash)
- Uses discrete wavelet transform instead of DCT.
- Captures more detailed frequency information.
- Good for complex distortions and image edits.

How Hashing Helps Detect Duplicates

Once hashes are computed for images, comparison becomes efficient. Instead of comparing pixel data, which is costly, hashes are compared using a Hamming distance — the number of differing bits between two hashes.

Exact duplicates: Hashes are identical.
Near duplicates: Hashes have a small Hamming distance (thresholds depend on hash size and technique).

By setting an appropriate threshold, images with small variations can be detected as duplicates.

Implementing Image Duplicate Detection Using Hashing (Python Example)

Here’s how to implement a simple duplicate image detection system using imagehash and PIL libraries in Python.

python
from PIL import Image
import imagehash
import os

def find_duplicates(image_folder, hash_func=imagehash.phash, hash_size=8, threshold=5):
    """
    Detects duplicate images in a folder using perceptual hashing.

    Parameters:
    - image_folder: directory containing images
    - hash_func: hashing function to use (default: pHash)
    - hash_size: hash size (default: 8)
    - threshold: max Hamming distance to consider as duplicates

    Returns:
    - duplicates: dict mapping image paths to list of duplicates
    """
    hashes = {}
    duplicates = {}

    for filename in os.listdir(image_folder):
        if not filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):
            continue
        path = os.path.join(image_folder, filename)
        image = Image.open(path)
        image_hash = hash_func(image, hash_size=hash_size)

        # Compare hash with existing hashes
        found_duplicate = False
        for existing_hash, files in hashes.items():
            if image_hash - existing_hash <= threshold:
                duplicates.setdefault(files[0], []).append(path)
                found_duplicate = True
                break

        if not found_duplicate:
            hashes.setdefault(image_hash, []).append(path)

    return duplicates

# Usage example:
folder = 'path_to_your_images'
dupes = find_duplicates(folder)
for original, copies in dupes.items():
    print(f"Original: {original}")
    for copy in copies:
        print(f"   Duplicate: {copy}")

Advantages of Hashing for Duplicate Detection

Speed: Hash comparison is fast and scalable for large datasets.
Storage: Hashes are small (e.g., 64 bits) and efficient to store.
Robustness: Image hashes tolerate minor changes like compression, scaling, or slight color edits.
Simplicity: Implementation is straightforward with available libraries.

Limitations and Considerations

Threshold tuning: Setting the right Hamming distance threshold is critical. Too low may miss near duplicates; too high can cause false positives.
Non-robustness to extreme changes: Large rotations, cropping, or heavy edits may break similarity detection.
False positives: Different images with similar patterns may produce close hashes.
Format support: Ensure supported image formats are handled.

Use Cases for Image Duplicate Detection

Photo management: Remove duplicates to save space.
Content moderation: Detect reposted or copied images.
Web crawling: Avoid indexing duplicate images.
E-commerce: Identify duplicate product images.
Digital forensics: Find altered or reused images.

Conclusion

Hashing is a powerful technique to detect image duplicates efficiently. By converting images into compact hashes and comparing them via Hamming distance, it becomes possible to find identical or near-identical images with high speed and accuracy. Choosing the right hashing algorithm and thresholds is key to optimizing results for specific applications.

If you want, I can also provide code examples using other languages or more advanced methods such as deep learning for image similarity!

Share This Page:

Detect image duplicates with hashing