Finding Duplicates in PowerPivot – Within Current Table (HasDupeInThisTable)
or Across Tables (HasDupeInTable2)
I got a question on Twitter today while changing planes in the Memphis airport (on my flight from Vegas to Memphis, I sat next to a software engineer who knows Dan English, and we talked the whole way. Great conversation, and although it disrupted my plans to blog on that flight, it was well worth it).
The questions was: how do I find duplicate rows in PowerPivot? Actually, the question was phrased as “in Access there’s a feature called Find Unmatched – how do I do that in PowerPivot?” Being an Excel guy, I translated that to “Find Duplicates,” and hopefully that wasn’t a mistake
Variant One: Finding Dupes Within a Single Table
OK, this is similar to the Excel feature (Remove Duplicates), which only works in the same table.
I made a three-column sample data set, where each column has a random character from “a” to “e” – this yields 125 possible unique combinations. I extended that data set to be 161 rows in the table, so I’m guaranteed to have some duplicates (Pigeonhole Principle for the win!) but probably not every row will be a dupe.
I added a single calc column, named [HasDupeInThisTable], with the following formula:
So it returns 1 if more than 1 row has that set of values for all three columns, and 0 if it’s just one row. The CALCULATE() will never return 0 since each row of the table matches itself
What does EARLIER() do here? As the FILTER() function examines each row in the table, if I didn’t use EARLIER(), it would compare each of those rows to itself, which always matches. Instead I want it to compare each row to the current row (the one whose value for the calc column is currently be determined) to see if there’s a match. EARLIER() is kinda like a “THISROW()” function in this case.
Of course, this doesn’t remove duplicate rows like the Excel feature does – it just flags rows that have dupes (and it flags all rows in each duplicate “family” – if three rows match each other, all three rows gets flagged, and not just two of the three).
Variant Two: Finding Dupes Across Tables
This was, I think, the question I was asked. I have two tables and want to see if any rows in Table2 “match” rows in Table1.
So I created a Table2 using the same randomization technique as Table1, and then added the following calc column to Table1:
It’s a very similar formula, but it lacks the EARLIER function. Why? Well since I’m comparing a Table1 value to a Table2 value rather than Table1 to itself, and FILTER() is stepping through the rows of Table2 to calculate a single calc column value (that’s going to be returned in Table1), I have no need for a “this row” function.
Maybe that seems complicated or maybe that seems simple. I suspect that over-explaining this one would be counterproductive actually. I need “this row” for comparisons within this table, and I don’t when I go across tables, and that’s probably good enough.
Works with Different Numbers of Columns
Another obvious point: this clearly works in cases other than 3 columns
And you don’t have to match on all columns in a table either, you can choose to match based on a subset of columns. That second argument to FILTER(), where I use && to combine tests, is the only thing you need to change.
Download the Workbook
The simple workbook I used for this post is available: