
Build a duplicate file finder

Creating a duplicate file finder involves scanning directories, hashing each file's contents, and then reporting or handling any files whose contents match. Below is an example Python script that implements a simple duplicate file finder.

This script:

  • Walks through a directory and its subdirectories.

  • Calculates a hash (SHA-256) of each file’s contents.

  • Groups files by hash.

  • Reports files with identical hashes (duplicates).

```python
import os
import hashlib
from collections import defaultdict


def hash_file(filepath, chunk_size=8192):
    """Calculate the SHA-256 hash of a file's contents."""
    sha256 = hashlib.sha256()
    try:
        with open(filepath, 'rb') as f:
            while chunk := f.read(chunk_size):
                sha256.update(chunk)
        return sha256.hexdigest()
    except (OSError, IOError):
        return None  # Could not read file


def find_duplicate_files(root_dir):
    """Find duplicate files under the given directory."""
    files_by_size = defaultdict(list)
    duplicates = defaultdict(list)

    # Group files by size first (quick filter: a file with a unique
    # size cannot have a duplicate, so it is never hashed)
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            try:
                filesize = os.path.getsize(filepath)
                files_by_size[filesize].append(filepath)
            except OSError:
                continue

    # For files with the same size, compare hashes to confirm duplicates
    for files in files_by_size.values():
        if len(files) < 2:
            continue  # Unique by size, skip
        hashes = defaultdict(list)
        for file in files:
            file_hash = hash_file(file)
            if file_hash:
                hashes[file_hash].append(file)
        for file_hash, files_list in hashes.items():
            if len(files_list) > 1:
                duplicates[file_hash].extend(files_list)

    return duplicates


def print_duplicates(duplicates):
    if not duplicates:
        print("No duplicate files found.")
        return
    print("Duplicate files found:")
    for file_hash, files in duplicates.items():
        print(f"\nHash: {file_hash}")
        for file in files:
            print(f"  {file}")


if __name__ == "__main__":
    root_directory = input("Enter directory path to scan for duplicates: ").strip()
    if not os.path.isdir(root_directory):
        print("Invalid directory path.")
    else:
        dups = find_duplicate_files(root_directory)
        print_duplicates(dups)
```
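The functions can also be imported and driven from other code rather than run interactively. A minimal sketch, assuming the script above is saved as `dupfinder.py` (a hypothetical filename) and that the placeholder path is replaced with a real directory:

```python
# Minimal usage sketch: assumes the script above is saved as
# dupfinder.py (hypothetical filename) on the import path.
from dupfinder import find_duplicate_files, print_duplicates

# Replace the placeholder with a real directory before running.
dups = find_duplicate_files("/path/to/scan")
print_duplicates(dups)
```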

This script is a basic but effective tool to find duplicate files by content, regardless of their filenames. It’s easily extendable — for example, you could add options to delete duplicates, move them, or generate reports.
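For instance, a deletion option could keep one file from each duplicate group and remove the rest. Here is a minimal sketch that consumes the `duplicates` mapping returned by `find_duplicate_files` above; the `dry_run` flag and the keep-the-first-file policy are illustrative choices, not part of the original script:

```python
import os

def delete_duplicates(duplicates, dry_run=True):
    """Keep the first file in each duplicate group and delete the rest.

    Illustrative sketch: the dry_run flag and the keep-the-first-file
    policy are assumptions layered on top of the original script.
    """
    for files in duplicates.values():
        keep, *extras = files
        print(f"Keeping: {keep}")
        for extra in extras:
            if dry_run:
                print(f"Would delete: {extra}")
            else:
                os.remove(extra)
                print(f"Deleted: {extra}")
```

Running it with `dry_run=True` first shows what would be removed without touching anything on disk.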

Natural enhancements include adding a graphical interface, integrating with OS-specific features, and optimizing the script for very large files.
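On the last point: for very large files, computing full-content hashes of every size-matched file can dominate the runtime. A common optimization (not in the script above) is to hash only the first chunk of each file as a cheap second filter, and compute full hashes only for files whose partial hashes collide. A minimal sketch of such a pre-filter, reusing the same chunk size as `hash_file`:

```python
import hashlib

def partial_hash(filepath, chunk_size=8192):
    """Hash only the first chunk of a file as a cheap pre-filter.

    Sketch only: two files with the same partial hash still need a
    full-content hash (as in hash_file above) to confirm a match.
    """
    sha256 = hashlib.sha256()
    try:
        with open(filepath, 'rb') as f:
            sha256.update(f.read(chunk_size))
        return sha256.hexdigest()
    except OSError:
        return None
```

In `find_duplicate_files`, this would slot in between the size grouping and the full-hash pass, so most large non-duplicates are rejected after reading only their first few kilobytes.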
