Find and remove duplicate rows

How you find and remove duplicate rows depends on the tool you are working in. The methods below cover three common ones: Excel, Python (using pandas), and SQL.

1. Using Excel

Find and Remove Duplicates:
1. Select the range of cells or the entire dataset where you suspect duplicates.
2. Go to the Data tab on the ribbon.
3. Click on Remove Duplicates in the Data Tools group.
4. In the pop-up window, you can choose which columns to check for duplicates.
5. Click OK. Excel reports how many duplicate values were removed and how many unique values remain.

2. Using Python (pandas)

Find and Remove Duplicates:

python
import pandas as pd

# Load your dataset (replace 'your_dataset.csv' with your file name)
df = pd.read_csv('your_dataset.csv')

# Find duplicates: show each row that repeats an earlier row
print(df[df.duplicated()])

# Remove duplicate rows
df_cleaned = df.drop_duplicates()

# Optionally, save the cleaned dataset back to a file
df_cleaned.to_csv('cleaned_dataset.csv', index=False)

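The duplicated() method returns a boolean Series that marks each row repeating an earlier one (by default the first occurrence stays False), and drop_duplicates() returns a new DataFrame with those repeated rows removed.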
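Both methods also accept `subset` and `keep` arguments, which help when only some columns define a duplicate. The following is a minimal sketch on a small in-memory DataFrame; the column names and values are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "name":  ["Ann", "Bob", "Ann", "Ann", "Cy"],
    "email": ["ann@x.com", "bob@x.com", "ann@x.com", "ann@y.com", "cy@x.com"],
})

# keep=False marks every member of a duplicate group,
# so all copies can be inspected side by side.
all_copies = df[df.duplicated(keep=False)]

# subset limits the comparison to the listed columns;
# keep="last" retains the final occurrence in each group.
by_name = df.drop_duplicates(subset=["name"], keep="last")

# Full-row de-duplication, followed by a fresh 0..n-1 index.
deduped = df.drop_duplicates().reset_index(drop=True)
removed = len(df) - len(deduped)

print(all_copies)
print(by_name)
print(f"Removed {removed} duplicate row(s)")
```

Inspecting `all_copies` before dropping anything is a cheap way to confirm that the rows flagged as duplicates really are ones you want to lose.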

3. Using SQL

Find and Remove Duplicates:
To find duplicates:

sql
SELECT column1, column2, COUNT(*)
FROM your_table
GROUP BY column1, column2
HAVING COUNT(*) > 1;

To remove duplicates, you can use a CTE (Common Table Expression) with ROW_NUMBER():

sql
WITH CTE AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS row_num
    FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM CTE WHERE row_num > 1);

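This deletes the duplicate rows and keeps the row with the lowest id in each (column1, column2) group. It assumes id uniquely identifies each row; if it does not, the IN filter would also delete rows you meant to keep.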
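To try both queries without touching a real database, they can be run against an in-memory SQLite table from Python. This is only a sketch under stated assumptions: the table layout and sample rows are invented, and it relies on the SQLite library bundled with Python being version 3.25 or newer, since that is when window functions such as ROW_NUMBER() were added. Other database engines may need small syntax changes.

```python
import sqlite3

# In-memory database with a hypothetical table matching the placeholder names above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT)"
)
conn.executemany(
    "INSERT INTO your_table (column1, column2) VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x"), ("b", "z")],
)

# Find: groups of (column1, column2) that occur more than once.
dupes = conn.execute(
    """
    SELECT column1, column2, COUNT(*)
    FROM your_table
    GROUP BY column1, column2
    HAVING COUNT(*) > 1
    """
).fetchall()

# Remove: number the rows inside each group and delete everything after the first.
conn.execute(
    """
    WITH CTE AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS row_num
        FROM your_table
    )
    DELETE FROM your_table
    WHERE id IN (SELECT id FROM CTE WHERE row_num > 1)
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT id, column1, column2 FROM your_table ORDER BY id"
).fetchall()

print(dupes)
print(remaining)
```

Running the find query first, and again after the delete, gives a quick check that no duplicate groups are left.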

Each of these methods identifies and removes duplicate rows; which one to use depends on the environment you are working in.
