Rajendra Gupta
Remove duplicates by keeping maximum and minimum value

Removing duplicates in an Excel sheet using Python scripts

March 24, 2020 by

In the article, Python scripts to format data in Microsoft Excel, we used Python scripts for creating an excel and do various data formatting. Python is an interesting high-level programming language. You can go through various use cases of Python on SQLShack.

In this article, we will look at removing duplicate data from excel using the Python.

A quick recap of removing duplicate rows in Microsoft Excel

Suppose we have the following data in an excel sheet. We want to get rid of duplicate values in this sheet.

duplicate rows in Microsoft Excel

In Microsoft Excel, we use the Remove Duplicates button from the Data menu. This option checks duplicate values and retains the FIRST unique value and removes other values.

Let’s click on Remove Duplicates and select all columns.

Remove duplicates

Click ok, and it removes the duplicate values 3 duplicate values and retains 5 unique values.

Duplicate removal confirmation

We have the following data after removing duplicates from this.

Excel data after removing duplicates

Suppose you are working in excel using Python language. If that excel contains duplicate values, we might not want to use Excel functionalities for it. Our script should be capable of handling such duplicate data and remove per our requirements such as remove all duplicates, remove all but the last duplicate, remove all but first duplicate.

Let’s look at the Python way of handling duplicate data in excel.

Python scripts for removing duplicates in an excel

Before we start with Python, make sure you run through the pre-requisites specified in the article, Python scripts to format data in Microsoft Excel.

Launch SQL Notebook in Azure Data Studio and verify pandas, NumPy packages existence. You can click on Manage Extensions in Azure Data Studio for it.

Azure Data Studio SQL Notebooks

Once you click on Manage Packages, it gives you a list of installed packages. Here, we can see both pandas and NumPy package along with pip utility.

Manage packages

We use the pandas read_excel() function to import an excel file. Create a new code block in SQL Notebook and execute the code. Here, the print statement prints the data frame that consists of excel sheet data.

First, we import the pandas library to read and write the excel sheets.

In this data, few columns contain NaN in the remarks column. Python display NaN for the cells that do not have any value/text.

In the output, we also see index values for individual rows. The first row starts with index id 0 and increments by 1 with each new row.

Read data from excel

We use drop_duplicates() function to remove duplicate records from a data frame in Python scripts.

Syntax of drop_duplicates() in Python scripts

DataFrame.drop_duplicates(subset=None, keep=’first’, inplace=False)

  • Subset: In this argument, we define the column list to consider for identifying duplicate rows. If it considers all columns in case, we do not specify any values
  • Keep: Here, we can specify the following values:
    • First: Remove all duplicate rows except the first one
    • Last: Remove all duplicate rows except the last one
    • False: Remove all duplicate rows
  • Inplace: By default, Python does not change the source data frame. We can specify this argument to change this behavior

Example 1: Use drop_duplicates() without any arguments

In the following query, it calls drop.duplicates() function for [data] dataframe.

In the output, we can see that it removes rows with index id 1,5 and 7. It is the default behavior of the drop_duplicate() function. It retains the first value and removes further duplicates.

Use drop_duplicates() without any arguments

Example 2: Use drop_duplicates() along with column names

By default, Pandas creates a data frame for all available columns and checks for duplicate data. Suppose, we want to exclude Remarks columns for checking duplicates. It means if the row contains similar values in the rest of the columns, it should be a duplicate row. We have few records in our excel sheet that contains more duplicate values if we do not consider the remarks column.

In the following Python scripts, we specify column names in the subset argument. Pandas check for these columns and remove the duplicate values. It excludes the remarks column in this case.

Example 2a: Keep=” first” argument

We also specify another argument keep=first to instruct Python to keep the first value and remove further duplicates. It is the default behaviors so that we can exclude this parameter here as well.

Look at the output, and we have only three records available. It removed all duplicate rows for the specified columns. We have rows with index id 0,2 and 4 in the output by using the first value in keep argument.

Use drop_duplicates() along with column names

Example 2b: Keep=” last” argument

We can change the argument keep=last. It keeps the last row from the duplicates and removes previous duplicate rows. Let’s change the argument and view the output.

In this case, the output changes and we have rows with index id 1,2,7 in the output.

Keep=”last” argument

Example 2c: Keep=” false” argument

Previously, we kept either first or last rows and removed other duplicate rows. Suppose we want to remove all duplicate values in the excel sheet. We can specify value False in the keep parameter for it.

If we execute the above Python Script, we get the following error message.

ValueError: keep must be either “first”, “last” or False

Keep=”false” argument

For the first and last value, we use double-quotes, but we need to specify False value without any quotes. Let’s remove the quote and execute the code.

In the output, we do not get any row because we do not have any unique rows in the excel sheet.

Keep=”false” argument output

To test the above code, let’s add a new row in the excel and execute the above code. We should get the row in the output. The script works, and we get unique rows in the output.

Insert a new row

Example 3: Remove duplicates by keeping the maximum and minimum value

Now, suppose we have a new column Age in the excel sheet. Someone has entered the wrong age for the employees. We want to remove the duplicate values but keep the row that has maximum age value for an employee. For example, Rohan has two entries in this sheet. Both rows look similar; however, one row shows age 22 while another row has age value 23. We want to remove the row with a minimum age. In this case, the row with age 22 for Rohan should be removed.

Remove duplicates by keeping maximum and minimum value

For this requirement, we use an additional Python script function sort_values(). In the following code, we sort the age values in ascending order using data.sort_values() function. In the ascending order, data is sorted from minimum to maximum age so we can keep the last value and remove other data rows.

In the output, we can see it has rows with maximum age for each employee. For example, Rohan shows age 23 that is the maximum age available from both records.

Remove duplicates by keeping maximum value

Similarly, we can change the data sort in descending order and remove the duplicates with minimum age values.

Remove duplicates by keeping minimum value

Example 4: drop_duplicate() function using inplace argument

By default, Pandas returns a new data frame and does not changes the source data frame. We can specify the argument inplace=True, and it changes the source data frame also.

Execute the following query and call the data frame in the end; it returns the content of the source data frame.

inplace argument

Let’s change the default value of inplace argument and view the change in the output.

inplace=True argument

Highlight duplicate values with custom color codes

In many cases, we just want to check the duplicate data instead of removing it. Instead, we require to highlight the duplicate values and send them to the appropriate team for correction. It might be feasible in case we are receiving data from a third party.

We can use conditional formatting and give a visual style ( color coding ) to duplicate rows. In the following code, we define a Python script function to highlight duplicate values in the orange background color. We will cover more about conditional formatting in upcoming articles.

It gives us the following output, and we can easily interpret which rows contain duplicate values. It is useful, especially when we have a large number of rows. We cannot go through each row in this case, and color-coding helps us to identify the duplicate values.

Highlight duplicate values

Alternatively, we can use further functions such as GROUPBY and count the duplicate rows.

If any row count is greater than 1, it is a duplicate row. In the following output, we can note than Dolly appeared 4 times in the excel sheet which means it is a duplicate row. Rajendra does not contain any duplicate row, so its count is 1 in the output.

Count of duplicates rows

Conclusion

In this article, we explored the process to remove duplicate rows in an Excel sheet using Python scripts. I liked the way to deal with Excel files using Python. We will cover more useful scripts in the upcoming articles. Stay tuned!

Rajendra Gupta
Latest posts by Rajendra Gupta (see all)
168 Views