Categories We Write About

Build a script to detect duplicate code

Detecting duplicate code in a project is a common task for improving code quality and maintainability. One way to achieve this is through static analysis tools or using a script that searches for code similarities. Here’s a basic script in Python that uses the difflib module to find duplicate code blocks by comparing functions or code snippets across files.

This is a simple approach, which you can expand based on your needs.

Python Script to Detect Duplicate Code

python
import os import difflib def read_file(file_path): """Reads the content of a file.""" with open(file_path, 'r') as file: return file.readlines() def compare_code_snippets(code1, code2): """Compares two code snippets and returns similarity score.""" d = difflib.Differ() diff = list(d.compare(code1, code2)) # Count the number of matching lines matches = sum(1 for line in diff if line.startswith(' ')) # Calculate similarity as a ratio of matching lines return matches / max(len(code1), len(code2)) def find_duplicate_code(directory, threshold=0.8): """Scans a directory for files and compares code snippets to find duplicates.""" # Get all Python files in the directory python_files = [f for f in os.listdir(directory) if f.endswith('.py')] duplicates = [] for i, file1 in enumerate(python_files): file1_path = os.path.join(directory, file1) code1 = read_file(file1_path) for file2 in python_files[i+1:]: file2_path = os.path.join(directory, file2) code2 = read_file(file2_path) # Compare code snippets from both files similarity = compare_code_snippets(code1, code2) if similarity >= threshold: duplicates.append((file1, file2, similarity)) return duplicates def main(): # Set the directory you want to scan for duplicate code directory = './your_project_directory' threshold = 0.8 # You can adjust the similarity threshold duplicates = find_duplicate_code(directory, threshold) if duplicates: print("Duplicate Code Found:") for dup in duplicates: print(f"Files: {dup[0]} and {dup[1]} - Similarity: {dup[2]:.2f}") else: print("No duplicate code detected.") if __name__ == '__main__': main()

How the Script Works:

  1. Reading Files: The script reads all the Python files in the specified directory.

  2. Comparing Code: It compares the contents of two files using the difflib module, which calculates how similar the two code snippets are.

  3. Similarity Threshold: It uses a similarity threshold to flag code snippets as duplicates. You can adjust this threshold to suit your needs. A threshold of 0.8 means 80% or more similarity.

  4. Duplicate Detection: If the similarity is greater than or equal to the threshold, the two files are considered as having duplicate code.

Steps to Use:

  1. Place this script in the root of your project or adjust the directory variable to point to your project directory.

  2. Run the script.

  3. The script will print out pairs of files with their similarity score if they contain duplicate code.

Customization:

  • Threshold: You can adjust the threshold to be more or less strict. A higher threshold means stricter duplicate detection.

  • Language Support: This script is currently designed for Python files (.py). You can adapt it to other languages by changing the file extension filter in the python_files list comprehension.

  • Function Granularity: This basic script compares whole files. If you want to compare individual functions, you’ll need to split the files into function-level snippets.

This approach provides a foundation for detecting duplicate code but can be improved with more advanced techniques such as abstract syntax tree (AST) analysis, tokenization, or integrating with dedicated tools like SonarQube or PMD for larger projects.

Share This Page:

Enter your email below to join The Palos Publishing Company Email List

We respect your email privacy

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Categories We Write About