Spotting Outliers in DAX: A Power Pivot & Power BI Technique

Out of 407 Total Combinations of Subcategory and Region, These 8 Stand Out This Month
(These 8 Combinations Differed GREATLY from their Respective 12-month Averages)

A Post Long in the Thinking: Formulas that “Scan”

(What’s this, you say?  An actual “how-to” post from Rob?  Well YEAH, it was about time!)

DAX is fantastic at calculating numbers for us, and producing metrics that drop jaws all over the world.  It is, quite simply, The Best Thing Ever.  You want an all-up metric that properly accounts for seasonality and price changes simultaneously across all divisions of your organization?  That thing you’ve always talked about like it was The Holy Grail?  Well, it’s really no problem.

But I’ve now had a long obsession with making DAX do more than that for us.  I want it to not just produce the metrics, but also to then robo-scan said metrics on our behalf and report back on interesting things that it found.  I don’t necessarily get too many chances to indulge this desire, and that’s a shame, so today I’m gonna use the old reliable AdventureWorks fake data.

Not Just “Off the Average,” But “Well Outside Normal Variation”

The basic idea here is “alert me if something has changed dramatically.”  If there’s a corner of my business that has spiked or crashed in a big way, I want to know.  If something has dramatically improved in a particular region, I may want to dive into that and see if it’s something we can replicate elsewhere.  And if something has fallen off a cliff, well, I need to know that for obvious reasons too.  And both kinds of dramatic change, positive and negative, can easily be obscured by overall aggregate values (so in some sense this is a similar theme to “Sara Problem?”)

So the first inclination is to evaluate distance from average performance.  And maybe that would be fine with high-volume situations, but when we’re subdividing our business into hundreds or perhaps thousands of micro-segments, we end up looking at smaller and smaller sample sizes, and “normal” variation can be legitimately more random than we expect.

Consider these two series of numbers:

12, 2, 12, 2, 12, 2, 15

7, 7, 7, 7, 7, 7, 15

Both series have an average of 7 over the first six data points, and both end with 15 as the seventh data point.  If we just go by the average, that final value (15) is just as “much” of an outlier in the first series as the second: in both cases, our most recent number is 8 units higher than our average.

But our eyes can tell that 15 is far more “outlandish” in the second series than the first, right?  We were just cruising along, hyper-consistent at 7, and suddenly, BAM, huge spike to 15.  I can definitely imagine wanting to be notified of the second series’ spike and NOT caring about the first series’ spike.

So we bring Standard Deviation into the picture – not to replace “distance from average,” but to complement it.  Series one has a standard deviation of 5 and series two has a standard deviation of 0.  In some sense our eyes are “seeing” the standard deviation difference between these two series of numbers – series one can’t make up its mind, is super hot and cold, whereas series two is like a metronome.  Standard deviation is, in other words, a measure of how “choppy” a data set is, or what is considered “normal variation.”
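(Quick check on those numbers: the first six points of series one each sit exactly 5 above or 5 below their average of 7, so the population standard deviation is the square root of the average squared deviation, and it works out neatly:

SQRT( (25 + 25 + 25 + 25 + 25 + 25) / 6 )  =  SQRT(25)  =  5

Series two never budges from 7 over its first six points, so its standard deviation is 0.)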

Now, bringing Standard Deviation and “Distance from Average” together…  if we compare distance from average to standard deviation, we get a sense of whether this most recent change is bigger than normal variation!

In series one, 15 differs from the average by 3 units more than its standard deviation, or a 60% spike outside of normal variation:  (15 – 7 – 5) / 5 = 60%

In series two, 15 differs from the average by 8 units more than its standard deviation.  (Strictly speaking, series two’s own standard deviation is zero, so the spike is infinitely far outside normal variation – I’m borrowing series one’s denominator of 5 here just to keep the two numbers comparable.)  That’s a 160% spike outside of normal variation:  (15 – 7 – 0) / 5 = 160%

The sensitivity thresholds for each biz are going to be different.  Hey, 60% may be plenty to catch your eye (probably not though).  But 160% is getting closer, and 300% is definitely attention-getting.

Let’s Write Some (Measure) Formulas!

[Most recent 12 months sales avg] :=

CALCULATE([Total Sales],
FILTER(ALL(Calendar),
Calendar[MonthID]<=MAX(Calendar[MonthID]) &&
Calendar[MonthID]>=MAX(Calendar[MonthID])-11
)
) / 12

I’m choosing to use the “greatest formula in the world” flavor of calendar navigation here rather than the built-in functions like DATEADD, because that approach can generally be made to work in ALL biz scenarios, but also because in this case I’m just more comfortable being “hands-on” with the logic rather than trusting the magic functions to do the right thing.

This relies on a MonthID column in my Calendar table:

MonthID Column in DAX for Calendar Navigation

Subtract the First Year in Your Data Set (2001 in my case) from CalendarYear, Then Multiply by 12 and add MonthNumberOfYear – This Yields a MonthID that You Can Navigate with Arithmetic
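In calculated-column form, that’s something like the following (this assumes your Calendar table has CalendarYear and MonthNumberOfYear columns, like the AdventureWorks date table does – adjust those names, and the 2001, to match your own data):

MonthID =
( Calendar[CalendarYear] - 2001 ) * 12 + Calendar[MonthNumberOfYear]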

Moving on to more formulas…

[Most recent month sales] :=

CALCULATE([Total Sales],
FILTER(ALL(Calendar), Calendar[MonthID]=MAX(Calendar[MonthID]))
)

[Absolute Delta from Average] :=

ABS([Most recent month sales]-[Most recent 12 months sales avg])

I’m taking the Absolute Value here because I want to treat “up” and “down” spikes equivalently, AND because STDEV never returns negative numbers.

[STDEV Sales over Last 12 Months] :=

STDEVX.P(FILTER(ALL(Calendar),
Calendar[MonthID]<=MAX(Calendar[MonthID]) &&
Calendar[MonthID]>=MAX(Calendar[MonthID])-11
),
[Total Sales] )

And finally, the “detector” measure:

[Am I Big Change] :=

IF([Absolute Delta from Average] >
(20*[STDEV Sales over Last 12 Months]),
1,
BLANK()
)

I’ve highlighted that 20 because that’s nuts – YOU WON’T USE 20 THERE!  That 20 means I’m saying “let me know when the latest number is TWO THOUSAND PERCENT of the normal standard deviation,” whereas earlier in this post, I mentioned that 300% is a lot.

Seriously, you need to tune your formula to a sensitivity that fits your situation.  And you do that by changing that 20 to other values.  Start with 3, because that theoretically means you’ll catch only the most-outlying 0.3% of cases.

Well remember, I’m working with fake data here, and I had to crank the factor up to 20 before I got “small” results (like the 8 combinations of Subcategory and Region displayed in the pivot at the top of the post).  Hopefully, you can use something more reasonable, like 3, but play around with it and find out.  (You could also use a disconnected slicer and make it a variable that the user can control of course.)
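If you go the disconnected-slicer route, here’s a rough sketch of how that might look.  (SensitivityFactor and its Factor column are made-up names for a standalone table with no relationships to anything – the idea is just to harvest the user’s selection and fall back to 3 when nothing is selected.)

[Selected Sensitivity] :=

IF(HASONEVALUE(SensitivityFactor[Factor]),
VALUES(SensitivityFactor[Factor]),
3
)

[Am I Big Change (Slicer Version)] :=

IF([Absolute Delta from Average] >
([Selected Sensitivity]*[STDEV Sales over Last 12 Months]),
1,
BLANK()
)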

Furthermore, my fake data was loaded with lots of silly cases where Subcat/Region pairs exhibited ZERO standard deviation, which is just not going to happen in real life.  So I filtered out any cases where that happened, too – just to make the results look more realistic.

A Further Modification to Compensate for Fake Data Issues - You Won't Need This

Because My Fake Data Was Loaded with LOTS of Zero-Deviation Examples, I Added the Highlighted Test
(You Probably/Hopefully Don’t Need That Part)
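The screenshot isn’t reproduced here, but the extra test is just a check that the standard deviation isn’t zero before flagging anything – a sketch of the modified detector measure (the exact formula in the image may differ slightly):

[Am I Big Change] :=

IF([Absolute Delta from Average] >
(20*[STDEV Sales over Last 12 Months]) &&
[STDEV Sales over Last 12 Months] <> 0,
1,
BLANK()
)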

And Now…  the Pivot Itself

First I created a Flattened Pivot…

Flattened Pivot

Then added my Subcategory and Territory fields, plus the measures:

On Our Way...

But I don’t want to see Subtotals here, so I turn those off…

Turn Off Subtotals

And then I filter the “inner” field (Region in this case) to only show me [Am I Big Change] = 1…

Filter to Just Show [Am I Big Change] = 1

And then, because my Calendar table “overruns” my Sales table (it contains dates that “happen” after my last Sales row of data), I add a HasData report filter:

Making Sure I Don't Use "Future" Months

Which itself is a calculated column in my Calendar table:

Making Sure I Don't Use "Future" Months
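The column formula itself lived in that screenshot; a minimal sketch of one way to write it (assuming the Calendar table’s date column is Calendar[Date] and the Sales table’s date column is Sales[OrderDate] – swap in your own column names) would be:

HasData =
IF( Calendar[Date] <= MAX( Sales[OrderDate] ), 1, 0 )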

That prevents me from EVER using a “bad” month (one that contains no data) as my most recent month.

Portable!

First off, we can add a Month-Year Slicer and go look at outliers in past months as well:

Spotting Outliers in DAX: A Power Pivot & Power BI Technique, Sliceable to Other Months!

We are NOT Locked into the Strictly Most-Recent Month (July 2004) –
We Can Slice Back and Look at Other Months
(And we see there are more outlier combinations in May than in April)

Furthermore this is NOT “locked in” to Subcategories and Regions specifically.  We can try out other combinations on Rows of the pivot, as long as we remember to re-set that [Am I Big Change]=1 filter on the innermost field…

Spotting Outliers in DAX: A Power Pivot & Power BI Technique, Portable to Other Combinations on Rows!

Customer Education Level and Subcategory Work Too – No Formula Changes Required
(Although you might need to tune the sensitivity differently)

Let Us Know If You Use This (or Not)!

We’d be very interested in real-world applications of this technique, because such things simply make us happy when we hear about them.  So please circle back here if you have success… but also if you don’t.

Also, any ideas how this can be improved?  Flaws in my thinking?  Throw those at us too.


Rob Collie

One of the founding engineers behind Power Pivot during his 14-year career at Microsoft, and creator of the world’s first cloud Power Pivot service, Rob is one of the foremost authorities on self-service business intelligence and next-generation spreadsheet technology. 

This Post Has 29 Comments

  1. great to see some statistical analysis in DAX.

    BTW in the quotes just before you document the 1st measure the 2nd quote reads series 1 it should read series 2.

  2. Agree with David – really nice to see stats in DAX.

    As DAX is so powerful I’d think that you could also create some very insightful analysis on how your data structure changes over time (like development of standard-deviation or correlations over time, Pareto-Analysis on outliers over time). Graphical display of the results would deliver interesting insights to Forensics and Marketing for example.

    I was pretty amazed to see how fast Monte-Carlos-simulations ran in DAX for example: https://blog.oraylis.de/2016/05/monte-carlo-simulation-with-dax-and-power-query/

  3. Very interesting, thanks a lot.

    Can you comment also on the speed of the calculations? Is this method scalable over large databases? E.g. what if we were trying to find out outlier customers over millions of customers?

    1. Well it’s gonna break down somewhere. This example only “scans” 407 possible combinations, so as you scale up, it can quickly explode (imagine 1 million customers times 5,000 products).

      Furthermore, this technique relies on FILTER and STDEVX.P, both of which are manual iterators that “crawl through your data on their hands and knees.” (These functions are not able to utilize compression as a speed advantage, unlike simple CALCULATE filters, and furthermore *also* fail to utilize more than one core of your CPU, for a double-whammy of slower.) Our iterations over the Calendar table could probably be sped up quite a bit, and forced to iterate over 40-ish rows (of Months) rather than the < 1,000 rows (of dates) it currently is scanning. So if we hit a speed snag that would be the first place we'd look.

      That said, most cases in which I'd want to leverage this would revolve around segments of the "population" rather than individuals. Even the largest of businesses can often be segmented in useful ways, and variations at the individual level (whether that be specific customers, specific products, whatever) tend to naturally be noisier anyway - it takes some aggregate-level change in your business for a segment (or combination of segments) to spike or drop, but an individual can flip much easier.

      Anyway, give it a shot and see if it works for your needs. If too slow, like I said, I think the FILTER can probably be sped up by a factor of 100 or so.

  4. While you show the ABS Delta and 12-Month Standard Deviation values, I’d probably enhance that to show what Pct Std Dev = DIVIDE([Abs Delta from Average], [STDEV Sales over Last 12 Months] ) that equates to. It just adds another level at which to focus one’s attention to “the Most Outside Normal” on that percentage basis, since each line item has its own “normal”.

    1. Ah yes, that WOULD be a good addition for sure. It won’t change the list of combinations being displayed of course, but it WILL give the consumer a better sense of WHY it got displayed. (Which is what you were saying Chris, I’m just making sure that others who read this exchange understand as well).

    2. I did the same – adding a Pct Std Dev measure – but then went a step further and calculated an “Outlier Factor Measure” which I think is a bit simpler for business use…

      “Outlier Factor Measure” calc is Delta from last 11 months divided by Standard Dev for last 11 months. This calc puts out “Outlier Factor Measure” as a tangible number which you can rank the Pivot Table with. Setting it up as a whole number makes it easily readable, but keeps the decimal in the actual cell of the pivot itself. Handy for a dashboard!

      Also, you don’t need to have the “Am I a big change” measure as you can just filter the pivot table on Values for “Outlier Factor Measure” > 0.

      But, if you want to filter out certain Outlier Factors using the disconnected slicer method (as mentioned in post), you do need to have the “Am I a Big Change” Measure, else you can’t filter out outliers < 5 (for example) without faffing around within the Pivottable.

  5. Holy crap! I was JUST trying to figure out how to do a very minimal version of this and it was driving me crazy. I don’t entirely know how this happened, but in the five years I’ve been using Power Pivot I have NEVER run across the “Value Filters” option! Great timing, and awesome how-to post!

  6. I don’t know about everyone else, but I’m just impressed the post managed to exclude the terms coefficient of variation and/or relative standard deviation. and I’m not saying I wrote this comment here so that google will find this post when somebody queries for those terms, but… 🙂

  7. Rob,

    Thanks for pushing the envelope on DAX and taking the time to make the formulas available. The measures are a great solution to a real puzzle!

  8. Awesome post! Very excited to see some more how-to info that includes statistics and can easily be applied to real world situations. Nice!

  9. Hi Rob, I was wondering, how would you set this up, for example, to see a change between a period of two weeks rather than just by month? I usually have to interpret data based on results every two weeks.

    Best,

    Matthew

  10. Hi Rob,
    Shouldn’t we omit the max(month) in the calculation of the average and standard deviation, because it influences the last 12-month period calculation? Like in the example above – to calculate a deviation of 0 from 7777777 you don’t take the 15 into account.

  11. I have this working at the macro level, but when I select a filter (in the column header) these measures are not filtered. Any thoughts?

  12. Awesome post, by the way! But I have some problems with new products that haven’t been sold for 12 months, so the calculated measure “most recent 12 months sales average” doesn’t quite do the trick. Any suggestions?

  13. In the beginning of the post where we read “In series one, 15 differs from the average by 2 units more” shouldn’t it be 3 units?

  14. Rob, the very first measure you created was:

    [Most recent 12 months sales avg] :=

    CALCULATE([Total Sales],
    FILTER(ALL(Calendar),
    Calendar[MonthID]<=MAX(Calendar[MonthID]) &&
    Calendar[MonthID]>=MAX(Calendar[MonthID])-11
    )
    ) / 12

    The measure assumes there are 12 full months of data to average. This might not be the case, say, in a retail environment with a new store, a wholesale application with a new customer, or a wholesale application with a customer that purchases sporadically.

    Do you have any suggestions in the above measure to replace “/ 12” with a measure that determines the actual number of months with data within the past 12 months?

    Cheers!

    Ron

  15. “In series one, 15 differs from the average by 2 units more than its standard deviation, or a 60% spike outside of normal variation: (15 – 7 – 5) / 5 = 60%

    In series two, 15 differs from the average by 8 units more than its standard deviation, or a 160% spike outside of normal variation: (15 – 7 – 0) / 5 = 160%”

    Quick question: Why is each series divided by 5?

  16. In my experience, when using absolute delta from average some extreme outliers can make regular values get flagged as well (if you have a value so large that it shifts the average away from most data points, making most data points seem “way below average”).

    Using absolute delta from median might come in handy in those situations.

    1. Ooh, that’s an interesting angle I had not considered. Next time I do this with real data…

      Curious: how long have you been doing this kind of “delta from expected” analysis, Franz?

  17. Hey Rob,

    Been reading your website for the few last weeks and loving it.

    I’d like to know how I could adapt this so I can have a factor for each SKU (presuming that I have a table with two columns: SKU and factor).

    Thanks a lot, and sorry for my English (I’m from Brazil).
