
Create a file deduplication tool

Here is a simple file deduplication tool written in Python. The script scans a directory (and, optionally, its subdirectories), computes a SHA-256 hash for each file, and reports duplicates, keeping the first copy it encounters. Deletion is opt-in and asks for confirmation first. Because the script uses the := (walrus) operator, it requires Python 3.8 or newer.

python
import os
import hashlib


def hash_file(path, block_size=65536):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large files are never loaded into memory at once
        while chunk := f.read(block_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(directory, recursive=True):
    """Walk the directory and collect paths whose content hash was already seen."""
    seen_hashes = {}
    duplicates = []
    for root, _, files in os.walk(directory):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                file_hash = hash_file(file_path)
                if file_hash in seen_hashes:
                    duplicates.append(file_path)
                else:
                    seen_hashes[file_hash] = file_path
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
        if not recursive:
            # Stop after the top-level directory when not recursing
            break
    return duplicates


def remove_duplicates(duplicates):
    """Delete each duplicate file, reporting successes and failures."""
    for file_path in duplicates:
        try:
            os.remove(file_path)
            print(f"Removed duplicate: {file_path}")
        except Exception as e:
            print(f"Error removing {file_path}: {e}")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="File Deduplication Tool")
    parser.add_argument("directory", help="Directory to scan for duplicates")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="Recursively scan subdirectories")
    parser.add_argument("-d", "--delete", action="store_true",
                        help="Delete duplicates found")
    args = parser.parse_args()

    print(f"Scanning directory: {args.directory}")
    duplicates = find_duplicates(args.directory, args.recursive)

    if duplicates:
        print(f"Found {len(duplicates)} duplicate(s):")
        for dup in duplicates:
            print(dup)
        if args.delete:
            confirm = input("Are you sure you want to delete these files? [y/N]: ").lower()
            if confirm == 'y':
                remove_duplicates(duplicates)
            else:
                print("Deletion aborted.")
    else:
        print("No duplicates found.")

Features:

  • Uses SHA-256 for hashing (an optional size-based pre-filter is sketched after this list).

  • Handles large files with efficient block reading.

  • Optional recursive search.

  • Safe deletion with confirmation.
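
Hashing every file can be slow on large directory trees. One common refinement is to group files by size first and hash only groups with more than one member, since files of different sizes can never be identical. Below is a minimal sketch of that idea; it is not part of the script above, and the group_by_size helper is a hypothetical name.

python
import os
from collections import defaultdict

def group_by_size(directory):
    """Hypothetical pre-filter: bucket file paths by size before hashing."""
    sizes = defaultdict(list)
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                sizes[os.path.getsize(path)].append(path)
            except OSError as e:
                print(f"Error reading size of {path}: {e}")
    # Files with a unique size cannot have duplicates, so skip hashing them
    return [paths for paths in sizes.values() if len(paths) > 1]

You would then run hash_file only over the paths in each returned group.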

Usage:

bash
python deduplicate.py /path/to/directory -r -d

  • -r to include subdirectories.

  • -d to delete duplicates.
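
If you would rather call the functions from your own code than through the command line, a minimal sketch (assuming the script above is saved as deduplicate.py on your import path) could look like this:

python
from deduplicate import find_duplicates, remove_duplicates

# Dry run: list duplicates without deleting anything
dups = find_duplicates("/path/to/directory", recursive=True)
for path in dups:
    print(path)

# Uncomment to actually delete the extra copies:
# remove_duplicates(dups)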

Let me know if you need a GUI version or integration with specific file systems or cloud storage.
