Break your DAX Problem into Manageable Parts

November 26, 2015 at 5:06 pm

Great approach to developing solutions to complex questions Matt, thanks for the insight. Think the key is being able to validate the results along the way, so issues like ties in your example are discovered. When so much can happen “under the covers” with DAX, this is particularly important.

November 26, 2015 at 5:13 pm

As I mentioned in this post, I was pretty sure there would be a better solution. Jes Hansen emailed me this solution:
Avg of top 8 from last 20 rounds :=
AVERAGEX (
    TOPN ( 8, TOPN ( 20, Golfrounds, Golfrounds[Date], 0 ), Golfrounds[Score], 0, RAND (), 0 ),
    Golfrounds[Score]
)

Jes added an extra option RAND() and says this fixes the tie problem (although I am not sure why – can someone enlighten me?).

However, even writing this simplified formula from scratch would be difficult, particularly given the issue with the ties. It is the process that is important here.

November 27, 2015 at 3:26 am

RAND() inserts a random number as an extra sort criterion in the sorting table for the TOPN. That makes every row unique (no ties), so it’s easy for the DAX engine to select the top 8. Since all the tied scores are the same (19 in your example), it doesn’t really matter which ones get randomly selected (an older date could be selected); what matters is that exactly 8 are selected.

See my small table:

item  ID          Date        Score  RAND
6     2012000118  11-02-2015  19     0,951842
7     2012000118  14-03-2015  19     0,652729
8     2012000118  18-02-2015  19     0,621396
9     2012000118  25-03-2015  19     0,355582
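
One way to check the tie effect yourself is with two throwaway test measures that count the rows each TOPN returns. This is only a sketch: the measure names are made up, and it assumes the Golfrounds table from the formula above, a single player in the filter context, and at most one round per date (otherwise the inner TOPN can itself return more than 20 rows).

```dax
Rows without tie-breaker :=
COUNTROWS (
    -- Ranking on Score alone: if several rounds tie for 8th place,
    -- TOPN keeps all of them, so this count can exceed 8.
    TOPN ( 8, TOPN ( 20, Golfrounds, Golfrounds[Date], 0 ), Golfrounds[Score], 0 )
)

Rows with RAND tie-breaker :=
COUNTROWS (
    -- RAND () as a second sort expression makes the tied rows orderable,
    -- so (for practical purposes) at most 8 rows come back.
    TOPN ( 8, TOPN ( 20, Golfrounds, Golfrounds[Date], 0 ), Golfrounds[Score], 0, RAND (), 0 )
)
```

If the first measure shows more than 8 for a player and the second shows 8, the difference is the number of extra tied rows that were distorting the average.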

As you mentioned yourself the real subject of your blog post is to use a methodical approach to solve a DAX problem and that’s really really important.

November 27, 2015 at 11:00 am

Great article. I enjoyed following along.

To explain the RAND():

TOPN allows you to order by multiple columns. A golfing example where this might make sense: if each game had a date and a time associated with it (in different columns) and you wanted the most recent 20 games, you’d sort by DATE DESC and then TIME DESC. The 2nd column acts as a tie-breaker for the 1st column.

What’s genius about this formula (and I would never have thought of it) is that TOPN doesn’t require you to order by a column at all, but allows for expressions. So in this case, rather than just taking the top 8 scores and getting messed up if 2 scores are the same, the formula says that if two scores are the same, assign a random number to each row and then pick the top one. Since you only care about score, you don’t care which row you receive, and RAND() (for practical purposes) guarantees that each row gets a different random number, so the tied rows are now orderable.
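
To make the multi-column ordering concrete, here is a rough sketch of the date-and-time case. Note that Golfrounds[Time] is a hypothetical column that does not exist in the sample data, and the measure name is invented for illustration.

```dax
Avg score of last 20 rounds :=
AVERAGEX (
    TOPN (
        20,
        Golfrounds,
        Golfrounds[Date], 0,    -- first sort key: most recent date first (0 = descending)
        Golfrounds[Time], 0     -- hypothetical second key: breaks ties on the same date
    ),
    Golfrounds[Score]
)
```

Swapping the second key for an expression such as RAND () follows the same pattern: TOPN evaluates it for each row and only consults it when the first key is tied.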

November 27, 2015 at 3:43 pm

I was literally lying in bed last night thinking about this, and the penny dropped about using the date column as a second sort order as you have described here. And I agree the RAND() is genius. Jes mentioned to me that he used the approach that I talked about (putting the problem on paper) and then the RAND() idea jumped out at him.

February 19, 2016 at 10:08 am

Hi Matt,

While reproducing your example as well as your approach, I realized that the MINX function returns wrong results in my Excel 2013.

Using the same file in my Excel 2016, the results are correct.
Even after restarting my systems and reproducing the file, this issue persisted.

Here is a link to the file: https://1drv.ms/1oPlZud

I would appreciate any comment.

February 21, 2016 at 8:54 pm

Hi Frank

Your formula in the sample workbook is not the same as I have posted here. If you use the same formula, it works.

=
MINX (
    TOPN (
        8,
        TOPN ( 20, Golf, Golf[Date], 0 ),
        Golf[Score], 0,
        Golf[Date], 0
    ),
    Golf[Score]
)

February 22, 2016 at 5:40 am

Hi Matt,

first of all, thanks for your reply.

The formula you have posted here refers to AVERAGEX.
It uses “Golf[Date], 0” as a tie breaker. That’s perfect.

But for MINX, ties shouldn’t matter!?
Obviously, you get the same wrong results in Excel 2013 without using this tie breaker.
Thus, there seems to be a bug.

As I mentioned, results in Excel 2016 will be correct with this formula:

=
MINX (
    TOPN (
        8,
        TOPN ( 20, Golf, Golf[Date], 0 ),
        Golf[Score], 0
    ),
    Golf[Score]
)

February 22, 2016 at 1:07 pm

Sorry, yes it seems to be a bug.

November 30, 2015 at 12:12 pm

Rather than RAND() I’d use SAMPLE(). SAMPLE() is TOPN()’s tie-breaking brother.

I’ve had this problem before, with some really hairy dashboards I developed. SAMPLE() will give you exactly as many rows as you ask for, and will arbitrarily break ties at the end of the table. In some cases, this is bad, because it does so non-deterministically. In this case, though, we don’t need determinism in our tie-breaking.

So, I think with SAMPLE() we get a win for cognitive load. We don’t have to understand an extra expression in TOPN(), and we have a function that tells us with certainty that the author doesn’t care about non-deterministic tie-breaking. When I see TOPN() I assume that the result table size is variable, or that the data guarantees no ties based on the sort order. When I see SAMPLE() I assume that ties are expected and we don’t care how they’re broken. It’s a strong signal of author’s intent.

Beyond that, if we’re dealing with particularly large data sets, that RAND() will have to be evaluated for every row in the table. This isn’t too big a deal, it’s a fast function, but we are still spending cycles on it. More than that, though, to evaluate the expression, the entire table referenced in TOPN() will have to be materialized with the RAND() evaluation, before it can be sorted and truncated to the top N. In this particular example, we need very few fields in the table. We’ve got two fields we care about, and in the materialized table, we’ll expect a 16-byte row width. Adding a RAND() field demands that we materialize another 8-byte row width field into that table. In this example problem, we’d see a 50% increase in the memory footprint of evaluating this measure. In this example, we’re probably not very concerned about this performance characteristic, because I’d assume we’re dealing at an order of magnitude of only millions of rows.

Finally, since this is a static requirement of the model, why not just create a flag field in the fact table, either at the data source or as a calculated column if need be, that marks the best 8 of the last 20 rounds? There’s no reason to identify the top 8 at run-time: it’s known at the time of model refresh and cannot change between model refreshes. This is the characteristic of a static attribute. If it’s a static attribute, it should be physically persisted in our model. Our measure becomes trivial, and much easier to reason about, with a vastly better big O complexity.

CALCULATE (
    AVERAGE ( 'GolfRounds'[Score] ),
    'GolfRounds'[Top8Flag] = 1
)

We could obviously wrap this in an IF() guard to make sure it’s only being evaluated when one player is in context.
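
A minimal sketch of such a guard, assuming the player is identified by a column called GolfRounds[Player] (that column name and the measure name are my assumptions, not something from the sample workbook), and that the Top8Flag column already exists:

```dax
Avg of top 8 (guarded) :=
IF (
    -- Only evaluate when exactly one player is in the filter context;
    -- otherwise IF returns BLANK because no else-branch is given.
    HASONEVALUE ( 'GolfRounds'[Player] ),
    CALCULATE (
        AVERAGE ( 'GolfRounds'[Score] ),
        'GolfRounds'[Top8Flag] = 1
    )
)
```

If players live in a separate lookup table, the HASONEVALUE test would point at the key column of that table instead.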

All that being said, I think this is a great article, and very clearly illustrates the process I use as well when developing new measures and queries in DAX.

December 3, 2015 at 10:23 am

Hi Greg,

Your suggestion regarding SAMPLE() in this case certainly makes sense. I didn’t consider that possibility at all.
Hats off.

Regarding your remarks about a static field (a Boolean field) in the data model, I’m still thinking about that a bit. A calculated column is never bit-compressed or otherwise compressed, execution takes place in RAM, and a byte-aligned boundary will take a hit on 32-bit and 64-bit processors. That might consume more RAM than expected.

December 4, 2015 at 12:42 pm

Hi Jes,

Thanks for your thoughtful reply. You definitely raise some good points about the compression of calculated columns. I almost never use calculated columns in my models, except in prototyping phases.

I should probably have mentioned that – I’ll explain my reasoning below, as well as address the point of compression on a bit field.

My expectation on the distribution of data for this problem is that the bit flag suggested is going to be heavily skewed, with many more 0 values than 1 (where 1 indicates a “top 8 from last 20” game). With sparse values like this, RLE compression will do a great job of compressing this bit field, as all that needs to be stored is 0 and the number of rows that it’s repeated. You could optimize this pretty well by making sure the data coming into the model is sorted by date, guaranteeing a very long string of 0s for old games.

What I’d probably do in a Tabular model with this dataset if I were very concerned is partition on the bit flag – this would allow the storage engine to ignore a huge portion of the records entirely. That might be a bit overkill to optimize for a single measure/field, but even still, a date partition would do a great deal to help compression on the field.

My rule of thumb is that anything that doesn’t change between model refreshes is something that will be physically stored in the source, or calculated in the query bringing the data into the model (whether via view, stored procedure, or Power Query).

Even if this is implemented as a calculated column, we would simplify our measure very much, in a way that enhances performance and readability, which makes it a strong argument to me for doing so.

If I can ever move complexity out of run-time, I will do so. It’s good for measure-writing, measure-reading, and measure-reasoning.

February 18, 2016 at 2:49 pm

Hi Greg,

Unfortunately, you did not show how you want to use the SAMPLE function here.

What do you mean by ‘SAMPLE() is TOPN()’s tie-breaking brother’?

SAMPLE() will give you exactly as many rows as you ask for, but the best scores may not be included.
Thus, I cannot use SAMPLE instead of TOPN here.

How did you want to use it?

Regards

Frank
