
Build a duplicate file finder

Creating a duplicate file finder involves scanning directories, hashing each file's contents, and then reporting or handling any files whose contents match. Below is an example Python script that implements a simple duplicate file finder.

This script:

  • Walks through a directory and its subdirectories.

  • Calculates a hash (SHA-256) of each file’s contents.

  • Groups files by hash.

  • Reports files with identical hashes (duplicates).

```python
import os
import hashlib
from collections import defaultdict


def hash_file(filepath, chunk_size=8192):
    """Calculate the SHA-256 hash of a file's contents."""
    sha256 = hashlib.sha256()
    try:
        with open(filepath, 'rb') as f:
            while chunk := f.read(chunk_size):
                sha256.update(chunk)
        return sha256.hexdigest()
    except (OSError, IOError):
        return None  # Could not read file


def find_duplicate_files(root_dir):
    """Find duplicate files under the given directory."""
    files_by_size = defaultdict(list)
    duplicates = defaultdict(list)

    # Group files by size first (quick filter: a file with a unique
    # size cannot have a duplicate, so it is never hashed)
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            try:
                filesize = os.path.getsize(filepath)
                files_by_size[filesize].append(filepath)
            except OSError:
                continue

    # For files with the same size, compare hashes to confirm duplicates
    for files in files_by_size.values():
        if len(files) < 2:
            continue  # Unique by size, skip
        hashes = defaultdict(list)
        for file in files:
            file_hash = hash_file(file)
            if file_hash:
                hashes[file_hash].append(file)
        for file_hash, files_list in hashes.items():
            if len(files_list) > 1:
                duplicates[file_hash].extend(files_list)

    return duplicates


def print_duplicates(duplicates):
    if not duplicates:
        print("No duplicate files found.")
        return
    print("Duplicate files found:")
    for file_hash, files in duplicates.items():
        print(f"\nHash: {file_hash}")
        for file in files:
            print(f"  {file}")


if __name__ == "__main__":
    root_directory = input("Enter directory path to scan for duplicates: ").strip()
    if not os.path.isdir(root_directory):
        print("Invalid directory path.")
    else:
        dups = find_duplicate_files(root_directory)
        print_duplicates(dups)
```
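The functions can also be imported and driven from other code rather than run interactively. A minimal sketch, assuming the script above is saved as `dupfinder.py` (a hypothetical filename) and that the placeholder path is replaced with a real directory:

```python
# Minimal usage sketch: assumes the script above is saved as
# dupfinder.py (hypothetical filename) on the import path.
from dupfinder import find_duplicate_files, print_duplicates

# Replace the placeholder with a real directory before running.
dups = find_duplicate_files("/path/to/scan")
print_duplicates(dups)
```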

This script is a basic but effective tool to find duplicate files by content, regardless of their filenames. It’s easily extendable — for example, you could add options to delete duplicates, move them, or generate reports.
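For instance, a deletion option could keep one file from each duplicate group and remove the rest. Here is a minimal sketch that consumes the `duplicates` mapping returned by `find_duplicate_files` above; the `dry_run` flag and the keep-the-first-file policy are illustrative choices, not part of the original script:

```python
import os

def delete_duplicates(duplicates, dry_run=True):
    """Keep the first file in each duplicate group and delete the rest.

    Illustrative sketch: the dry_run flag and the keep-the-first-file
    policy are assumptions layered on top of the original script.
    """
    for files in duplicates.values():
        keep, *extras = files
        print(f"Keeping: {keep}")
        for extra in extras:
            if dry_run:
                print(f"Would delete: {extra}")
            else:
                os.remove(extra)
                print(f"Deleted: {extra}")
```

Running it with `dry_run=True` first shows what would be removed without touching anything on disk.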

Natural enhancements include adding a graphical interface, integrating with OS-specific features, and optimizing the script for very large files.
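On the last point: for very large files, computing full-content hashes of every size-matched file can dominate the runtime. A common optimization (not in the script above) is to hash only the first chunk of each file as a cheap second filter, and compute full hashes only for files whose partial hashes collide. A minimal sketch of such a pre-filter, reusing the same chunk size as `hash_file`:

```python
import hashlib

def partial_hash(filepath, chunk_size=8192):
    """Hash only the first chunk of a file as a cheap pre-filter.

    Sketch only: two files with the same partial hash still need a
    full-content hash (as in hash_file above) to confirm a match.
    """
    sha256 = hashlib.sha256()
    try:
        with open(filepath, 'rb') as f:
            sha256.update(f.read(chunk_size))
        return sha256.hexdigest()
    except OSError:
        return None
```

In `find_duplicate_files`, this would slot in between the size grouping and the full-hash pass, so most large non-duplicates are rejected after reading only their first few kilobytes.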
