skip to Main Content

 
Finding Duplicates in PowerPivot – Within Current Table or Across Tables

Finding Duplicates in PowerPivot – Within Current Table (HasDupeInThisTable)
or Across Tables (HasDupeInTable2)

I got a question on Twitter today while changing planes in the Memphis airport (on my flight from Vegas to Memphis, I sat next to a software engineer who knows Dan English, and we talked the whole way.  Great conversation, and although it disrupted my plans to blog on that flight, it was well worth it).

The questions was:  how do I find duplicate rows in PowerPivot?  Actually, the question was phrased as “in Access there’s a feature called Find Unmatched – how do I do that in PowerPivot?”  Being an Excel guy, I translated that to “Find Duplicates,” and hopefully that wasn’t a mistake Smile

Variant One:  Finding Dupes Within a Single Table

OK, this is similar to the Excel feature (Remove Duplicates), which only works in the same table.

I made a three-column sample data set, where each column has a random character from “a” to “e” – this yields 125 possible unique combinations.  I extended that data set to be 161 rows in the table, so I’m guaranteed to have some duplicates (Pigeonhole Principle for the win!) but probably not every row will be a dupe.

I added a single calc column, named [HasDupeInThisTable], with the following formula:

=IF(
    CALCULATE(COUNTROWS(Table1),
              FILTER(Table1,
                  Table1[Column1]=EARLIER(Table1[Column1]) &&
                  Table1[Column2]=EARLIER(Table1[Column2]) &&
                  Table1[Column3]=EARLIER(Table1[Column3])
              )
    )>1,1,0
)

So it returns 1 if more than 1 row has that set of values for all three columns, and 0 if it’s just one row.  The CALCULATE() will never return 0 since each row of the table matches itself Smile

What does EARLIER() do here?  As the FILTER() function examines each row in the table, if I didn’t use EARLIER(), it would compare each of those rows to itself, which always matches.  Instead I want it to compare each row to the current row (the one whose value for the calc column is currently be determined) to see if there’s a match.  EARLIER() is kinda like a “THISROW()” function in this case.

Of course, this doesn’t remove duplicate rows like the Excel feature does – it just flags rows that have dupes (and it flags all rows in each duplicate “family” – if three rows match each other, all three rows gets flagged, and not just two of the three).

Variant Two:  Finding Dupes Across Tables

This was, I think, the question I was asked.  I have two tables and want to see if any rows in Table2 “match” rows in Table1.

So I created a Table2 using the same randomization technique as Table1, and then added the following calc column to Table1:

=IF(
    CALCULATE(COUNTROWS(Table2),
              FILTER(Table2,
                     Table2[Column1]=Table1[Column1] &&
                     Table2[Column2]=Table1[Column2] &&
                     Table2[Column3]=Table1[Column3]              )
    )=0,0,1
)

It’s a very similar formula, but it lacks the EARLIER function.  Why?  Well since I’m comparing a Table1 value to a Table2 value rather than Table1 to itself, and FILTER() is stepping through the rows of Table2 to calculate a single calc column value (that’s going to be returned in Table1), I have no need for a “this row” function.

Maybe that seems complicated or maybe that seems simple.  I suspect that over-explaining this one would be counterproductive actually.  I need “this row” for comparisons within this table, and I don’t when I go across tables, and that’s probably good enough.

Works with Different Numbers of Columns

Another obvious point:  this clearly works in cases other than 3 columns Smile

And you don’t have to match on all columns in a table either, you can choose to match based on a subset of columns.  That second argument to FILTER(), where I use && to combine tests, is the only thing you need to change.

Download the Workbook

The simple workbook  I used for this post is available:

Download the Workbook Here

Rob Collie

One of the original engineering leaders behind Power BI and Power Pivot during his 14-year career at Microsoft, Rob Collie founded a consulting company in 2013 that is 100% devoted to “the new way forward” made possible by Power BI and its related technologies. Since 2013, PowerPivotPro has rapidly grown to become the leading firm in the industry, pioneering an agile, results-first methodology never before seen in the Business Intelligence space. A sought-after public speaker and author of the #1-selling Power BI book, Rob and his team would like to help you revolutionize your business and your career.

This Post Has 10 Comments
  1. You translation of “Find unmatched” is not accurate :-). This sort of query in Access actually finds the rows in Table1 that are not present in Table2 (NOT IN). This is not an issue since adapting your formulas just requires changing the output of the IF expression.

    Also, let me note that since CALCULATE in a calculated column transforms the row context in a filter context, CALCULATE(COUNTROWS(Table1)) should do the same as CALCULATE(COUNTROWS(Table1), FILTER(…))

    When Table1 and Table2 are linked to the same lookup tables, then CALCULATE(COUNTROWS(Table2)) should also return the same as CALCULATE(COUNTROWS(Table2), FILTER(…)).

    Please let me know if I overlooked anything.

  2. Thanks for working this out, Rob!

    Laurent is correct, what I am looking for is a “Not In” join:
    SELECT * FROM TableA a Left Join TableB b ON a.key = b.key WHERE b.key is NULL

    Sort of the opposite of duplicates, as in non-existent.

    Since Laurent said it could be done I worked on my data until I proved it would work and I did get it. Here is my solution:
    As a calculated column in TableA, where the relationship on the two tables is TableA[ColumnA] = TableB[ColumnA]:
    =IF(
    CALCULATE(COUNTROWS(TableB),
    FILTER(TableB, TableB[ColumnA]=TableA[ColumnA]
    )
    ),1,0
    )

    I may not have found the most elegant way to solve this, and maybe someone else will see what worked for me and improve on it, but here’s basically what it says.

    Go to the other table and look to see if there is a record there that matches this one. If so, count 1, if not count 0.

    All of my zeroes are unmatched. All of my ones are matched.

    To test, I took one of the 0s and searched in TableB and it was not there (as expected).

    Thanks for helping me solve this! Now I can easily display unmatched records between two sets, and perhaps never have to open Access again!

    Geoff

  3. In your first variant/example (one table) is there any way to identify one of the rows that are duplicated? I want to include one of the rows and ignore the others in my processing along with the non-duplicated rows.

    1. Hi, I have the same issue, how to mark only one row from the duplicates – I need this so I could create a unique ID and have a 1 to 1 relationship with my other table.

  4. Here’s a quick formula to identify duplicates values (in column C), and then mark one of the duplicated rows (with a “1”) and the others (with “0”). Values that don’t have duplicates get marked with “1”

    =IF(COUNTIF(C:C,C2)>1,IF(COUNTIF($C$2:C4,C2)>1,0,1),1)

  5. I want to remove the duplicates using the your technique. I have spotted the duplicate and I want to keep one of them. Is that Possible in Power Pivot?

Leave a Comment or Question