Categories We Write About

Excel File Comparison with Python

Comparing Excel files is a common task in data analysis, quality control, and reporting. Python offers powerful libraries that make this process efficient and customizable, especially when dealing with large datasets or multiple sheets. Here’s a detailed guide on how to perform Excel file comparison using Python.

Key Libraries for Excel Comparison in Python

  • pandas: For reading Excel files into DataFrames and performing data manipulation.

  • openpyxl: Useful for reading/writing Excel files, especially for formatting and accessing cell-level information.

  • xlrd/xlwt: Older libraries, now largely replaced by pandas and openpyxl for Excel files.

  • difflib: To compare text strings if needed.

Steps for Excel File Comparison

  1. Read the Excel files into pandas DataFrames.

  2. Normalize the data (handle missing values, data types, trimming spaces).

  3. Compare the DataFrames row-wise and column-wise.

  4. Highlight or extract the differences.

  5. Output the results in a user-friendly format (e.g., Excel report, CSV, or console output).


Example: Comparing Two Excel Files Using pandas

Assuming you have two Excel files, file1.xlsx and file2.xlsx, with similar structure, the goal is to identify differences between them.

python
import pandas as pd # Load Excel files df1 = pd.read_excel('file1.xlsx') df2 = pd.read_excel('file2.xlsx') # Optional: Sort dataframes if order doesn't matter df1_sorted = df1.sort_values(by=df1.columns.tolist()).reset_index(drop=True) df2_sorted = df2.sort_values(by=df2.columns.tolist()).reset_index(drop=True) # Compare shapes first if df1_sorted.shape != df2_sorted.shape: print("Files have different shapes.") else: # Check if dataframes are exactly the same if df1_sorted.equals(df2_sorted): print("Files are identical.") else: # Find rows that are different comparison_df = df1_sorted.compare(df2_sorted) print("Differences found:") print(comparison_df)

Advanced Comparison with Highlighting Differences

To highlight differences in the Excel output, use the openpyxl engine combined with pandas’ Styler.

python
import pandas as pd def highlight_differences(df1, df2): def highlight(cell): return 'background-color: yellow' if cell else '' # Create a boolean DataFrame showing where the differences are diff_mask = df1.ne(df2) # Apply the style to highlight differences return diff_mask.applymap(lambda x: 'background-color: yellow' if x else '') # Load Excel files df1 = pd.read_excel('file1.xlsx') df2 = pd.read_excel('file2.xlsx') # Align dataframes by index and columns df1, df2 = df1.align(df2, join='outer', axis=1) # Fill NaNs for comparison consistency df1 = df1.fillna('') df2 = df2.fillna('') # Get styling styled = df1.style.apply(highlight_differences, df2=df2, axis=None) # Export to Excel with styles styled.to_excel('comparison_output.xlsx', engine='openpyxl')

Handling Multiple Sheets in Excel

If your Excel files contain multiple sheets, you can loop through each sheet and compare them individually:

python
import pandas as pd xls1 = pd.ExcelFile('file1.xlsx') xls2 = pd.ExcelFile('file2.xlsx') sheets1 = set(xls1.sheet_names) sheets2 = set(xls2.sheet_names) common_sheets = sheets1.intersection(sheets2) for sheet in common_sheets: df1 = pd.read_excel(xls1, sheet_name=sheet) df2 = pd.read_excel(xls2, sheet_name=sheet) if df1.equals(df2): print(f"Sheet '{sheet}' is identical in both files.") else: print(f"Differences found in sheet '{sheet}'.") diff = df1.compare(df2) print(diff)

Tips for Effective Excel File Comparison

  • Data Cleaning: Remove or standardize whitespace, handle missing values, and unify data types before comparison.

  • Key Columns: If the dataset is large, consider comparing based on key columns (e.g., IDs) to find mismatches.

  • Performance: For very large files, consider chunking or using database solutions.

  • Output: Generate summary reports that highlight only changed rows or cells to improve readability.


Summary

Python makes Excel file comparison flexible and scalable through pandas and openpyxl. Whether you need a simple diff or a styled Excel report highlighting differences, you can customize the process for your exact needs. This approach helps in auditing data, verifying updates, or tracking changes between reports with ease.

Share This Page:

Enter your email below to join The Palos Publishing Company Email List

We respect your email privacy

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Categories We Write About