I was relaxing during my vacation, thinking some more about column compression in Power Pivot.  One of the main things to know about compression is that high cardinality is your main enemy (i.e. a large number of unique values will mean poor compression).  I started to think about how I could reduce the cardinality of one or more columns in a large data table I use for a client.  This blog covers the process I went through to test a couple of concepts, warts and all, including a simple error I made during the early testing for this post.  I think you can sometimes learn more when you make and then find a mistake, so I have kept the error in this post for others to see, along with the fix.  What I find interesting is the process of discovery needed to track down the problem.

The data table I used for my test is a typical sales table, and it includes line-level extended cost, line-level extended sell and quantity of cases sold, among other things.  With this data structure in mind, two opportunities immediately occurred to me.

Swap out Extended Price for Margin %.

The first idea I had was to swap out one of the value columns (either extended cost or extended sell) with a margin % column.  Given it is very common for resale businesses to work with a fixed range of margin %, this should reduce the number of unique values in one column while still allowing the recalculation of the missing data from the other column. 
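
As a quick sketch of the idea (using the margin-on-sell definition that the measures later in this post rely on): a line with a cost of $0.80 and a margin of 20% has a sell value of 0.80 / (1 - 0.20) = $1.00.  At the row level, the reconstruction is simply:

DIVIDE(Sales[CostValue], 1 - Sales[Margin %])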

Swap out Extended Values with Price per Case

The second idea was to recalculate the 2 extended columns as prices per case.  For example, think about a typical reseller situation where products are sold for (say) $1.00 per unit.  A customer can buy any number of units, and each variation in quantity produces a different extended sales value, increasing the cardinality of the column.  Swapping the extended column for the price per case should reduce the cardinality and hence could improve the column compression.
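
To illustrate (the SellPerCase column itself is created in Test 2 below): a row where 12 cases were sold at $1.00 per case only needs to store the value 1.00 in the price column, and the extended sell value can always be rebuilt at the row level as:

Sales[SellPerCase] * Sales[CasesSold]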

But what about the DAX?

The big potential issue with both of these alternate approaches is that you can no longer use simple aggregators like SUM() in your DAX formulas.  Once you change the data structure to one of those above, you will need to use SUMX() instead of SUM().  SUMX is an iterator, and this can hurt performance on a big data table – read on to see what happened in my testing.

It’s Important to use Real Data for this Test

In my last blog post I used random numbers for my testing.  That was fine for that test scenario; however, in this case I really needed to use real data.  Everyone’s data is different, and a test like this is pointless unless you use real data.  So for this test I loaded up 2 years of real data totalling 27 million rows in the sales table.  The file size on disk was 264 MB – perfect for testing and a genuine candidate for file size reduction.

The first thing I did was load up the data into a new Power Pivot workbook exactly as it is currently used, including an extended cost, extended sell and a qty column (CasesSold).  I inspected the size of the columns using Scott’s enhanced VBA code (originally written by Kasper de Jonge).

[Image: column memory usage breakdown from the What’s Eating My Memory tool]

As you can see above, the SalesValueExTax column takes up 22.5% of the total memory, and the cost value column uses slightly less at 21.7%.

My regular DAX formulas using this data structure look like this – very simple indeed.

Total Cost := SUM(Sales[CostValue])

Total Sales := SUM(Sales[SalesValueExTax])

Total Margin % := DIVIDE([Total Sales] - [Total Cost], [Total Sales])

Test 1 – Replace the SalesValueExTax with Margin %.

I could swap out either the extended cost or the extended sell column for this exercise, but I decided to remove the sell column because it is larger (see how useful the “What’s Eating My Memory” tool is).  I wrote a SQL query to modify the data on import as follows.  Note the rounding of the margin to three decimal places to keep the cardinality low (three decimal places on the ratio equals one decimal place for a percentage).  This level of accuracy is exactly what is used in real life, so there should be no issues with the rounding:

SELECT  s.[CalYearWeek] ,s.[BranchNumber] ,s.[CUSTOMER_CD] ,s.[PRODUCT_CD] ,s.[Price_type] ,s.[PROMO_PRICE_IND] ,round((s.[SalesValueExTax] - s.[CostValue]) / s.[SalesValueExTax], 3) as [Margin %]
,s.[CostValue]
,s.[CasesSold]

FROM [dbo].[tblSales] as s  WHERE ([CalYearWeek] >= 1401)

This single change reduced the file size on disk from 264 MB to 245 MB – a reduction of around 7%.  Not too bad.  Of course, once this change was made, I needed to change the DAX.  [Total Cost] stays the same, but [Total Sales] needs to change to use an iterator, as follows:

Total Cost := SUM(Sales[CostValue])

Total Sales := SUMX(Sales, DIVIDE([Total Cost], 1 - Sales[Margin %]))

The big problem with this approach is the need to use SUMX instead of SUM to do the calculation.  Iterating over a table with 27 million rows is going to be relatively slow.  To test this objectively, I loaded up DAX Studio and used the Server Timings feature to see how long it took to calculate [Total Sales] in both workbooks.
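
A query as simple as this is enough to evaluate the measure and capture the Server Timings in DAX Studio:

EVALUATE ROW("Total Sales", [Total Sales])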

With the original data model, it took a total of 56 milliseconds (ms) as shown in 1 below.  Also note how the Storage Engine leveraged multiple cores for its part of the process, using 156 ms of CPU time in total, spread effectively across 2.9 processors (shown in 2 below).

[Image: DAX Studio Server Timings for the original workbook, 56 ms total]

Then I tested the second workbook, which requires the use of SUMX.  I was not expecting the same level of performance, but unfortunately the response time was really bad: 253 seconds, as shown below.  I was expecting it to be slower, but not that much slower!  Something had to be wrong.

[Image: DAX Studio Server Timings for Test 1, 253 seconds]

On the face of it, this looks like an unworkable solution.  However, read on to see my journey of discovery, as all is not lost just yet.  I made a simple mistake here, which will become clear shortly (if you haven’t already spotted it).

Test 2 – Swap out the Extended Columns for “Per Unit” Values

Just like the first test, I used SQL to change the structure of the data I loaded into Power Pivot.  Here is what the new query looks like (note the rounding to two decimal places, which keeps the cardinality low).

SELECT  s.[CalYearWeek] ,s.[BranchNumber] ,s.[CUSTOMER_CD] ,s.[PRODUCT_CD] ,s.[Price_type] ,s.[PROMO_PRICE_IND] ,round(s.[SalesValueExTax] / s.[CasesSold], 2) as SellPerCase
,round(s.[CostValue] / s.[CasesSold], 2) as CostPerCase
,s.[CasesSold]

FROM [dbo].[tblSummaryCWDConsolidated] as s  WHERE ([CalYearWeek] >= 1401)

The size of the file stored on disk went down from 264 MB to 247 MB, a reduction of 6.4%.

The new DAX formula is as follows:

Total Sales := SUMX(Sales, Sales[SellPerCase] * Sales[CasesSold])

This time the performance, as measured by DAX Studio, is actually pretty reasonable, as shown below.

[Image: DAX Studio Server Timings for Test 2, 82 ms]

It is still slower than the original 56 ms – almost 50% slower at 82 ms – but this is still more than satisfactory for production use.

What is going on here?

So now I was feeling really uncomfortable.  Why was the second test so much better than the first?  The anatomy of the two SUMX formulas is not too dissimilar (i.e. they both iterate over the sales table and do a calculation), so what is going on here?  Why is the first one so bad?!  When I looked at the actual queries in DAX Studio, I could see that Power Pivot was treating the queries from Test 1 and Test 2 completely differently, as shown below.

Test 1 – Very complex query with joins to the lookup tables

Note the complex highlighted query from DAX Studio below, and also note that there are two separate queries; only the first one is shown here.

[Image: complex storage engine query for Test 1, with joins to the lookup tables]

Test 2 – Very simple query with no joins to the lookup tables.

Note how simple this second query is compared to the first.

[Image: simple storage engine query for Test 2]

So then I started to get curious: what was it about Test 1 (the Margin % example) that made the query so complex?  The two things I noticed were:

1.  I was using the DIVIDE function in the first test.

2.  I was using a measure inside the SUMX in the first test.

Edit (thanks to the comments from Greg below): a measure referenced inside SUMX has an implicit CALCULATE wrapped around it.  This forces a context transition for every row of the iteration, and that is what makes the formula perform the way it does.  I rewrote the formula, replacing the measure with a direct column reference.  The code then became:

Total Sales := SUMX(Sales, DIVIDE(Sales[CostValue], 1 - Sales[Margin %]))

Once I made this change and re-ran the test in DAX Studio, I got a much better result.

[Image: DAX Studio Server Timings for the rewritten Test 1 formula, 638 ms]

This time it took only 638 ms in total.  That is a lot slower than the SUM version, but still workable in many instances.  The report itself was snappy and produced sub-second refreshes on typical pivot tables – so all good.

Test 3 – Combine both concepts together

Finally I combined both concepts, converting the extended cost to a cost per case and also swapping out the extended sell for a margin column.  The formulas needed to change slightly, and they were slightly less performant than the previous test – but still sub-second.  The real spreadsheet continued to perform sub-second on typical reports.
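
The combined measures end up being a blend of the two earlier patterns, roughly along these lines (shown as a sketch only):

Total Cost := SUMX(Sales, Sales[CostPerCase] * Sales[CasesSold])

Total Sales := SUMX(Sales, DIVIDE(Sales[CostPerCase] * Sales[CasesSold], 1 - Sales[Margin %]))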

[Image: DAX Studio Server Timings for Test 3]

With both of these concepts combined, the file size was reduced from the original 264 MB to 238 MB, a reduction of almost 10%.  You can see where the space savings have come from by comparing the before and after column sizes in the 2 tables below.  The SalesValueExTax column (65MB) was replaced with the Margin column (44MB) and the CostValue column (63MB) was replaced with the CostPerCase column (50MB).

[Image: column sizes before]  [Image: column sizes after]

And here are the corresponding distinct counts of values for each of the columns:

[Image: distinct value counts for each column, before and after]

Lessons from the Experience

  1. You can get reasonable file size reductions if you think about your data structure and try to find ways to reduce cardinality.
  2. The “What’s eating my Memory” tool is invaluable in helping you understand where to invest your effort.
  3. DAX Studio is a very powerful tool that will help you understand where bottlenecks are created and can help you make your DAX formulas more efficient.
  4. SUMX is an iterator, but it can be fast enough for production use as long as you write your formulas correctly.

Of course this test was done on my data, and the results on your data will almost certainly be different.  But hopefully this article will give you some ideas about what to look for and how to go about testing your ideas.


Matt Allington

Matt Allington is a Microsoft MVP specialising in Power Pivot, Power BI and Power Query consulting and training, based in Sydney, Australia. Visit Matt's blog here.

This Post Has 9 Comments

  1. Great post, Matt, I really appreciate the level of detail presented here.

    That being said (and this is just going to sound mean no matter how I say it, but it’s really not) this is a pretty clear example of “Is It Worth The Time?” ( https://xkcd.com/1205/ ) / YAGNI.

    Memory is cheap, spending a few hours to shave 25 MB off a file seems like a waste compared to, say, producing a compelling dashboard or introducing new datasets to enhance your model.

    1. Kyle,

      Like you, I appreciated the deep dive into the trade-offs of writing better DAX.

      My takeaway is a deeper insight into the skill of writing better DAX. I also walk away with a link to some cool VBA code ( http://tinylizard.com/script-update-what-is-eating-up-my-memory-in-power-pivot/ ) and a tip on using a feature of DAX Studio (the Server Timings button) that I had overlooked.

      I appreciated your candid and genuine response, but in this instance I disagree. I do not find this a case of YAGNI (you ain’t/aren’t gonna need it) which refers to writing more code than what is needed.

      I see this post as an example of thoughtful DAX with the first benefit of time savings and the second of file size; your mileage or mine may vary.

    2. I think that is a valid comment, Kyle – particularly with the benefit of hindsight, i.e. after the testing. I didn’t really make a comment about the total 10% savings at the end of my post, but I actually agree and I probably wouldn’t do this to save 10%. But the point is that I had no idea what savings could be made before I started – now I know. Other people with other data sets will get different results – and some will definitely get better results.

  2. Hi Matt,

    Great post on ways to reduce memory size and some performance tips. I noticed another way to improve the performance of your query, namely getting rid of the DIVIDE() function. You’ll notice in your DAX Studio output that the Storage Engine (SE) query has a CallbackDataID call – this means that the SE is asking the Formula Engine (FE) for help in evaluating the sum of DIVIDE(Sales[CostValue], 1 - Sales[Margin %]). DIVIDE goes to the FE, which is single-threaded, and while still pretty dang fast, is not as fast as a pure SE query. Essentially the SE cranks away, hits a point where it asks the FE for help, sits around for a spell, then once the FE returns values, keeps going. This can be absolutely essential in some measures, depending on how complicated the business logic is, but in this case I suspect it isn’t needed.

    For all of the records in Sales, if a row has a [CostValue], does it have a non-zero [Margin %]? If so, a simple divide with ‘/’ would be evaluated solely in the SE, and the run-time performance should improve. If some [Margin %] values are zero (or negative, possibly for Loss-Leader items), I’d look at this SQLBI.com article for other options to improve performance: http://www.sqlbi.com/articles/divide-performance/. My biggest take-away from that article is that DIVIDE() often makes sense when it’s used to divide the results of two measures, but may not make sense in the direct evaluation of measure inputs.
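
    As a concrete sketch of that swap (assuming the denominator can never be zero for any row):

    Total Sales := SUMX( Sales, Sales[CostValue] / ( 1 - Sales[Margin %] ) )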

    Thanks again for the great post!

    -Alex

    1. Thanks for sharing, Alex. In my case the data is not perfect and there are some divide-by-zero errors. I actually suspected that DIVIDE may be sub-optimal but needed it to help with these errors.

      I have actually enrolled in Marco’s “Optimising DAX” training in Sydney this year and expect to extend my skills and knowledge there. 🙂

      1. The Optimizing DAX training is fantastic, I attended in Toronto last year. The training clarified and formalized several DAX patterns and optimization tricks I’d observed or was using without the best comprehension of why they worked, and the problem exploration sessions made for great practice at breaking down and analyzing complex and multi-faceted performance issues.

        On a total side note regarding DIVIDE(), I just posted an issue on the Power BI Ideas page regarding what appears to be a bug or a misspecification in the alternate result section, here: https://ideas.powerbi.com/forums/265200-power-bi/suggestions/11318913-fix-change-divide-alternate-result-to-allow-for

        In short, the optional alternate result is supposed to be a “Constant Numeric Value”, but the accepted values include a few function calls, and not -1, which strikes me as a rather constant numeric value. Unless -1 is passed as (-1) * 1 to DIVIDE, the message is wrong, which led to some simple but less-than-ideal workarounds for a KPI project I was involved with last year. All told, this isn’t a terribly impactful bug/error, but now it’s out there for voting to be worked on.

  3. Alex, you’re right, avoiding the CallbackDataID is very important for storage engine performance. DIVIDE is executed by the formula engine, so if you use a division operator, just make sure you filter out zero values in the denominator first.

  4. Great post, Matt. I also work with similar-sized datasets in Power Pivot, and in my experience it may be worth testing different ways to sort the source data.

    As we know, Power Pivot applies compression segment by segment (1 million rows per segment). When you read data from the source system without an explicit order, you get it in some “natural” order. If this “natural” order is by date, every 1-million-row segment may contain almost all possible products, customers, sales values, etc., but only a few dates (CalYearWeek). But if you sort the data by product, every segment contains all possible dates but only a subset of products, and because the distinct sales values, margins, etc. are correlated to products, only a subset of all possible distinct values. So you can get better compression. But be aware of the cost of the sorting operation (CPU, memory, TempDB if your source system is SQL Server).

  5. Somewhat related to Alex and Marco’s comments:

    An easy rule of thumb for SUMX() performance is whether the items referenced in the expression in SUMX()’s second argument can be evaluated in row context of the table in its first argument. Think of it like creating a calculated column in the sales table. If you were to create a calculated column for extended sales, you’d multiply unit price * quantity on each row. When using SUMX() on a fact table, you should lift the logical calculated column definition into the SUMX(). This is what you ended up with.

    SUMX( FactSale, FactSale[Unit Price] * FactSale[Quantity] )

    Like Alex and Marco noted, this can be evaluated entirely by the storage engine (leveraging efficient primitive operations and parallelism).

    The problem is not explicitly that you have a SUM() in the first SUMX(), but that you are forcing a context transition with the implicit CALCULATE() around the measure. Each field on the current row of SUMX()’s iteration over FactSale must be translated into filter context for the evaluation of the SUM() in the measure. The end result is logically identical – for each row in FactSale you get the value of a field on that row, but the steps between are nearly nonexistent when evaluated in row context, and clearly very time consuming when evaluated as a filter context consisting of the values of every field in that fact table row.
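
    Side by side, the two forms look something like this (using a hypothetical [Total Extended Sales] measure defined as a simple SUM of the extended sales column):

    -- slow: the measure forces a context transition on every row of FactSale
    SUMX( FactSale, [Total Extended Sales] )

    -- fast: evaluated purely in row context by the storage engine
    SUMX( FactSale, FactSale[Unit Price] * FactSale[Quantity] )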
