Post by Rob Collie
In 2004, There Were 2,561 Customers Who Bought Something in the Southwest.
But How Many of Those Were Repeat Customers? Depending on How We
Define “Repeat,” We Can Get at Least Three Different Answers.
A Right Turn at Albuquerque…
I sat down today to write about “Disconnected Cube Formulas” – yes, you heard that right. A brand new technique that I think has some pretty exciting (yet admittedly narrow) applications.
But along the way, like Bugs Bunny, I ended up doing something at least as interesting. So let’s do that one first.
Setting Up the Problem
I have four relevant tables: Territories, Customers, Calendar, and Sales:
The first three are Lookup (aka Dimension) tables, and Sales is a Data (aka Fact) table.
Active Customers is a pretty easy formula:
We use DISTINCTCOUNT against the CustomerKey column in the Sales table, rather than COUNTROWS of Customers, since the latter would count all customers, whether they bought something or not. And a simple COUNTROWS of the Sales table would treat every single line of the Sales table as if it were a different customer, which is obviously not true, because some customers DO buy more than one thing. DISCTINCTCOUNT of the CustomerKey column in the Sales table gives us precisely what we want.
And when we filter the pivot to 2004, we only get the count of customers who bought something in 2004. OK, so far so good.
Repeat Customers – Version 1
If we define “Repeat Customer” as “A Customer Who Has More than One Line in the Sales Table,” we end up with a formula like this:
[Repeat Cust v1]:=
So, we use the FILTER function to temporarily create a copy of the Customers table (that exists only during the calculation of this measure, behind the scenes!) and that temporary version of Customers will only contain rows for which CALCULATE(COUNTROWS(Sales))>1.
Whoa! Why CALCULATE of COUNTROWS instead of just COUNTROWS??? It’s very strange, I agree, to see CALCULATE without any “filter” arguments.
The short version here is that, without the CALCULATE, COUNTROWS(Sales) would not “respect” each row of Customers, and count up rows in Sales regardless of which customer they are from. Which would result in basically every row of the Customers table passing the test, even if they haven’t bought anything.
By applying CALCULATE, we allow the Sales table to respect the filters coming from the Customers table. The fancy phrase for this is “promoting row context to also become a filter context,” and.
This is NOT something that you normally have to know about the CALCULATE function. I’d file it firmly under Things You Might Learn in Year Two. I didn’t “get” this either, for a long time, and that didn’t stop me from using CALCULATE to death. So for now, you can just follow the pattern happily
(More on this topic is located in Matt’s excellent post.)
Version 1 Sucks Though
Sometimes, a single customer buys multiple products at the same time. If they buy multiple products in a single order, and we never see them again, we probably don’t want to count them as a Repeat Customer, do we?
From the Sales Table: Customer 18759 Bought 3 Different Products as Part of One Order.
And We Don’t Want to Count Multi-Line Orders as a Repeat Customer.
To fix this, we want to count customers who “appear” in multiple different orders! So, something like this…
[Repeat Cust v2]:=
It still uses the “CALCULATE to promote row context into filter context” trick, but now, rather than enforce “more than one row,” we want to enforce “more than one different value for SalesOrderNumber,” which is equivalent to “more than one order.”
Version 2 Only Counts Repeats Within the Filtered Timeframe
The only “problem” with version 2 is that, when we filter to 2004 for instance, a customer must have had multiple orders in 2004 alone. What if they placed an order in 2003, and then a single order in 2004? If you want them to be counted, we need to do something like…
[Repeat Cust v3]:=
ALL(Calendar)) > 1 &&
CALCULATE(COUNTROWS(Sales)) > 0
The yellow highlighted part says “a customer must have appeared in multiple orders, ignoring any filters on the Calendar table.” In other words, even if the pivot is filtered to 2004, a customer will “pass” the yellow test even if all of their prior orders occurred in other years (thanks to the ALL).
The green highlighted part just says “a customer must also appear at least once in the Sales data within the current time frame.” Since we are NOT using an ALL inside that CALCULATE, the Year=2004 filter from the pivot WILL be respected (as always).
The “&&” in orange is just the AND operator, which says “a row in the Customers table must pass BOTH the yellow test and the green test.”
***Update to V3 Formula
In the comments section, Ruve1k correctly pointed out that the formula above would return incorrect results if I set the Year to 2003. A customer who bought once in 2003 (and that was her first purchase ever), and then bought again in 2004, should NOT count as a repeat customer in 2003.
Totally true, and the reason why I didn’t contemplate that possibility? It has everything to do with this post NOT being the post I originally set out to write. In the context of that other post (which was forecasting related), filtering the pivot to the “max” year was the only thing you would EVER do. Not so with Repeat Customers of course.
So here’s the new and improved version, that now prevents the “all time” portion of the formula (the part originally in yellow above) to never look past the current Calendar context the pivot is set to.
Version 3 Rocks
Note how much more “restrictive” Version 2 turned out to be than v3! In the Northwest region alone, there are 7x as many Repeat Customers according to version 2 as according to version 3.