
My last two blog posts have been about data compression, and that theme continues today.  In December I blogged about the basics of compression and the impact that columns have on file size, and did some testing with randomised data.  In January I went a bit deeper and talked about restructuring your data to get better compression.  In the comments on that January post, a seemingly innocuous remark from Mati Selg caught my eye, and that is the topic for today – the compression benefits of sorting your data on load.

SSAS uses segments

Now I am no SQL Server ninja – I am competent and can work my way around SSMS and write queries as needed, but I don’t come close to a DBA.  I have heard of the concept of segments (groups of records) before and I have a basic understanding, but I never really took much notice until now.  My layman’s explanation is that SSAS loads data in groups of records, similar to the way you would pack your groceries into separate bags to make your shopping easier to carry.  In Excel the segment size is 1 million rows.  So Power Pivot sorts, loads and compresses data in segments – 1 million records at a time.  Each segment is discrete when it comes to data compression.  Of course the segments are magically recombined at run time in a way that is seamless to the user.

OK, so why is this an opportunity?

Imagine you have 50,000 products in your data table and you have 50,000,000 rows of data.  Power Pivot will take the first 1 million rows it comes to (one segment’s worth), work out how to sort and compress the columns, and then compress the data into a single segment before moving to the next 1 million rows (in the order they are loaded).  When it does this, it is highly likely that every product number will appear in every single segment – all 50 segments.  If we assume an equal number of records for each product (unlikely, but OK for this discussion), there would be 1,000 records for each product spread throughout the entire data table, and each and every segment is likely to contain all 50,000 product IDs.  This is not good for compression.

What if you sort the product column first?

Now imagine you sort the data table by the product column before it gets to Power Pivot.  Each segment will still contain 1 million records, but the first 1,000 records will be product 1, the second 1,000 will be product 2, and so on.  With this “sorted load”, each segment contains only about 1,000 distinct products instead of all 50,000 – roughly a 50x reduction in cardinality for this one Product column (note that other columns may be negatively impacted as a result).  In real life (with real data) some products will occur in the data table more frequently than others, but the benefits of sorting are still likely to exist.
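A quick way to see this effect is to simulate it: count how many distinct values each segment’s dictionary has to hold, unsorted versus sorted.  The sketch below is deliberately scaled down (1,000-row segments, 50 products, 50,000 rows) so it runs instantly; the ratios mirror the 1M-row/50,000-product scenario above, and all the numbers are illustrative.

```python
import random

SEGMENT_SIZE = 1_000   # scaled-down stand-in for Power Pivot's 1M-row segments
NUM_PRODUCTS = 50      # scaled-down stand-in for 50,000 products
NUM_ROWS = 50_000      # scaled-down stand-in for 50M rows -> 50 segments

def distinct_per_segment(rows):
    """Average number of distinct values each segment must encode."""
    segments = [rows[i:i + SEGMENT_SIZE] for i in range(0, len(rows), SEGMENT_SIZE)]
    return sum(len(set(seg)) for seg in segments) / len(segments)

random.seed(42)
unsorted_rows = [random.randrange(NUM_PRODUCTS) for _ in range(NUM_ROWS)]

# Unsorted: virtually every product appears in every segment (average close to 50 of 50).
# Sorted: each segment covers a narrow, contiguous slice of products (average around 2).
print(distinct_per_segment(unsorted_rows))
print(distinct_per_segment(sorted(unsorted_rows)))
```

Fewer distinct values per segment means smaller per-segment dictionaries and longer runs of repeated values, which is where the compression gain comes from.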

Let’s see it in action

Picking up from the last blog in January, I started originally with a workbook that was 264MB on disk (27.6 million rows in the data table).  After completing the compression work discussed last time, the file on disk was reduced to 238MB, a reduction of about 10%.

So now, starting with the newly compressed 238MB workbook from last month, I changed my SQL Server query and simply added an ORDER BY Product_CD clause at the end.  This is easy to do as long as you are loading data from SQL Server (or an equivalent database).  Just go to your table in Power Pivot, click DESIGN\TABLE PROPERTIES, and then change the wizard type from Table Preview to Query Editor.  You can then simply add the ORDER BY clause at the end of the query in the dialog box.


And the result on my test data (which is real data by the way)?  A staggering additional 24% reduction in file size from 238MB to 175MB, a reduction of 63MB on top of the previous 10% reduction.

Actually the memory test tool reports 236MB, but the file size on disk is definitely 175MB as indicated below.  Note how the memory usage for a lot of the columns has changed between the before and after scenarios.  The sorting has flow-on effects on almost every column – it is the combined impact of these changes across all columns that affects the file size.

Compressed file from January: Size on Disk = 238MB
Compressed file with Product Code sorting on load: Size on Disk = 175MB

Now I don’t get excited much, but when I can cut 24% off my file size with this simple change, then that makes me excited.

Edit 10 Feb ’15.  I just added a single order by clause to another large (339MB) workbook I use in production.  The result – a doubly staggering reduction in file size of 177MB (53% reduction).

Which Column(s) Should I Sort By?

Well to loosely quote The Italians, “It depends.  You need to test it on your data as every situation is different.”

What I did next was to change my ORDER BY clause and sort by Customer instead.  As you can see in the memory usage table below, the reported memory usage improved further over the Product sorting; however, note that the space used on disk is higher.

File from January (10% compressed): Size on Disk = 238MB
Product Code sorting: Size on Disk = 175MB
Customer Code sorting: Size on Disk = 193MB

It is not clear to me whether the memory usage tool is not reporting memory usage accurately, or whether the standard Excel file compression is having an impact when the file is stored on disk.  Regardless, in my case I am looking for the smallest file size, so Product Code sorting is better.  If any experts reading this can confirm what is happening here, that would be great.

Sorting by 2 Columns?

The last thing I tested was to sort by Product Code and then by Customer Code.  This gave me a marginal additional 4MB saving on disk but the memory usage was reported as 248MB, up from 228MB.

Final Comments

Keep in mind that the reason this works is because Power Pivot for Excel loads data in segments of 1 million rows.  It therefore follows that you can only get the benefits of sorting your tables if the table in question has more than 1 million rows of data.  There is no value in sorting most lookup tables or smaller data tables.

Sydney Training 25/26 Feb 2016

For those readers living in Sydney Australia, I will be teaching Rob’s Power Pivot University Training Course in Sydney on 25th and 26th Feb 2016.  Click here to find out more and register to attend.

 

Matt Allington

Matt Allington is a Microsoft MVP specialising in Power Pivot, Power BI and Power Query consulting and training, based in Sydney, Australia. Visit Matt's blog here.

This Post Has 26 Comments
  1. @Matt – “I changed my SQL Server Query and simply added an ORDER BY Product_CD clause”

    ORDER BY is not supported in views in SQL Server 2012 and above (unless accompanied by TOP (99.9999) PERCENT in the SELECT statement, which could eat up a few rows on large data) – which means you will end up doing the sorting on the client side (inside PQ).
    Not sure if sorting inside PQ is as efficient.

    1. Thanks for posting, Sam. What I suggest you do in this case is one of the following:
      1. Go to IT and get them to add the ORDER BY clause to the view.
      2. If they can’t/won’t do this, then get them to send you a copy of the SQL in the view. You then paste this code into the Query Editor and add the ORDER BY clause yourself. I do this ALL the time.

    2. Why would you get the top 99.9999?? TOP 100 PERCENT works fine.

      But as Matt explains below you don’t even need that since you can SELECT col1, col2 FROM YourView ORDER BY col2; in PowerPivot or in Power Query.

      Even when you apply the sorting in a second step in Power Query, the Query Folding feature pushes as much as possible to the SQL Server. I guess in most cases a sort in Power Query will be translated to an ORDER BY clause in the statement that will be sent to the server.
      To be sure you should use Extended Events (or Profiler,…) and capture the query on your SQL instance when you execute the Power Query.

      I just ran some tests today and what really puzzles me is that the same query with the ORDER BY actually led to a faster data refresh than without.
      When executed in SSMS it was the other way around, as we would expect. The execution plan said the query with ORDER BY took 95% of the batch and the one without 5%. But strangely the ORDER BY version was executed in Batch mode and used parallelism, while the one without used Row mode, and the duration of both versions was just a little different: 16 min vs 17 min for a 51 million row fact table.
      In Powerpivot data refresh without ORDER BY completed in 10:50, with ORDER BY in 07:35.
      The file size shrunk from 287 MB to 163MB.

      1. That is really interesting. I don’t know anywhere near enough about query folding, but would like to know more. I also don’t know a lot (just some) about profiler. Is the profiler the best tool to investigate which query is being sent from PQ? Can you actually see the SQL code or the query plan, or just the events?

        1. I just did some testing (Excel 2016 with SQL Server 2014).

          I connected to my DimDate table, got the M-statement, enabled Extended events (the Profiler successor) and captured the query on the SQL Server when I refreshed. Then I added a sort by DayInYear in a second step, looked again at the M and SQL statements, and finally replaced the sort with a filter on DayInWeek in (5,6).

          let
          Source = Sql.Database("sqlserver.domain.be", "DataHub"),
          dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
          in
          dbo_DimDate

          becomes:

          execute sp_executesql N'select [$Ordered].[DateKey],
          [$Ordered].[Date],
          [$Ordered].[ISODate],
          [$Ordered].[DayInWeek],
          [$Ordered].[DayInMonth],
          [$Ordered].[DayInYear],
          [$Ordered].[HolidayBE],
          [$Ordered].[MonthNumber],
          [$Ordered].[Year],
          [$Ordered].[ISOYear],
          [$Ordered].[ISOWeek],
          [$Ordered].[JaarWeek],
          [$Ordered].[DayOfWeek],
          [$Ordered].[DayOfWeekNL],
          [$Ordered].[DwNL],
          [$Ordered].[MonthName],
          [$Ordered].[MaandNaam],
          [$Ordered].[Quarter],
          [$Ordered].[Year_Month],
          [$Ordered].[Year_Quarter],
          [$Ordered].[JaarMaand],
          [$Ordered].[IsWorkDay]
          from [dbo].[DimDate] as [$Ordered]
          order by [$Ordered].[DateKey]'

          With the sort step added, the PowerQuery advanced editor shows:

          let
          Source = Sql.Database("sqlserver.domain.be", "DataHub"),
          dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
          #"Sorted Rows" = Table.Sort(dbo_DimDate,{{"DayInYear", Order.Ascending}})
          in
          #"Sorted Rows"

          and the SQL Query is:

          execute sp_executesql N'select [_].[DateKey],
          [_].[Date],
          [_].[ISODate],
          [_].[DayInWeek],
          [_].[DayInMonth],
          [_].[DayInYear],
          [_].[HolidayBE],
          [_].[MonthNumber],
          [_].[Year],
          [_].[ISOYear],
          [_].[ISOWeek],
          [_].[JaarWeek],
          [_].[DayOfWeek],
          [_].[DayOfWeekNL],
          [_].[DwNL],
          [_].[MonthName],
          [_].[MaandNaam],
          [_].[Quarter],
          [_].[Year_Month],
          [_].[Year_Quarter],
          [_].[JaarMaand],
          [_].[IsWorkDay]
          from [dbo].[DimDate] as [_]
          order by [_].[DayInYear]'

          The filter step becomes:

          let
          Source = Sql.Database("sqlserver.domain.be", "DataHub"),
          dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
          #"Filtered Rows" = Table.SelectRows(dbo_DimDate, each ([DayInWeek] = 5 or [DayInWeek] = 6))
          in
          #"Filtered Rows"

          which is, in SQL:

          execute sp_executesql N'select [$Ordered].[DateKey],
          [$Ordered].[Date],
          [$Ordered].[ISODate],
          [$Ordered].[DayInWeek],
          [$Ordered].[DayInMonth],
          [$Ordered].[DayInYear],
          [$Ordered].[HolidayBE],
          [$Ordered].[MonthNumber],
          [$Ordered].[Year],
          [$Ordered].[ISOYear],
          [$Ordered].[ISOWeek],
          [$Ordered].[JaarWeek],
          [$Ordered].[DayOfWeek],
          [$Ordered].[DayOfWeekNL],
          [$Ordered].[DwNL],
          [$Ordered].[MonthName],
          [$Ordered].[MaandNaam],
          [$Ordered].[Quarter],
          [$Ordered].[Year_Month],
          [$Ordered].[Year_Quarter],
          [$Ordered].[JaarMaand],
          [$Ordered].[IsWorkDay]
          from
          (
          select [_].[DateKey],
          [_].[Date],
          [_].[ISODate],
          [_].[DayInWeek],
          [_].[DayInMonth],
          [_].[DayInYear],
          [_].[HolidayBE],
          [_].[MonthNumber],
          [_].[Year],
          [_].[ISOYear],
          [_].[ISOWeek],
          [_].[JaarWeek],
          [_].[DayOfWeek],
          [_].[DayOfWeekNL],
          [_].[DwNL],
          [_].[MonthName],
          [_].[MaandNaam],
          [_].[Quarter],
          [_].[Year_Month],
          [_].[Year_Quarter],
          [_].[JaarMaand],
          [_].[IsWorkDay]
          from [dbo].[DimDate] as [_]
          where [_].[DayInWeek] = 5 or [_].[DayInWeek] = 6
          ) as [$Ordered]
          order by [$Ordered].[DateKey]'

    1. Edit: OK, so you can’t add Order By inside the view. But you definitely can add Order By as part of your select statement that points to your view.

      Select *
      from myView
      order by myView.ProductCode

      Just change the Table Properties wizard from Table Preview to Query Editor and add the ORDER BY clause.

      Given what I understand about query folding, I believe the task of doing the sorting is pushed to SQL Server and not Power Pivot.

  2. Thanks Matt, great article.
    I used it on my small workbook (8192MB) which holds around 2M records. The data source is an Access DB; I sorted the queries there and got a reduced size after I refreshed my workbook (new file size: 8081MB, about a 1.4% reduction). I guess this technique works on all file sizes, but the improvement is less as the file size shrinks.

    1. Are your numbers correct? 8.1 gigabytes for 2M records? If so, something would seem to be wrong. I normally get file sizes around 40MB for 2M rows. If these numbers are correct, my best guess is you either have lots of materialised uncompressed data in worksheets, or a very wide data table with lots and lots of columns. If the latter, this is a very inefficient shape for data.

      Can you confirm?

      Matt

  3. My bad! It is KB not MB, but I always read the first number in my head – sorry about that. So yeah, the real number is 8,192KB not MB.
    I have a total of six tables including the date table; total records in one table is over a million, and the other tables might sum up to another million.

  4. Two questions:
    1) Does this only apply to importing into Excel, or does it apply to SSAS and/or PowerBI Desktop?
    2) Would you say that, in general, what you are looking to do is order by columns with low cardinality, or the columns with the least cardinality in the database?

    Thanks!
    Eric

    1. It applies to both; however, the segment size for Excel is 1 million rows, while the default (configurable) for SSAS is 8 million. So sorting in SSAS with fewer than 8 million rows in a single table will have less impact (it may still have some impact if you can improve on the “order” in which columns are sorted – you need to test it).

      In general, you want to sort high cardinality columns. A column with 2 unique values over 500 million rows won’t take up much space regardless of whether it is sorted or not. If your column has 50,000 unique values (at random) over 500 million rows, you will likely have 50,000 unique values per segment unsorted, but only around 100 unique values per segment once sorted. That is the principle anyway – you need to test it on your data.
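      That back-of-the-envelope arithmetic can be checked in a couple of lines (the figures are the hypothetical ones from the comment above, assuming the values are spread evenly):

```python
rows = 500_000_000        # hypothetical fact table size from the comment above
segment_size = 1_000_000  # Excel's segment size
unique_values = 50_000    # hypothetical column cardinality

# Once sorted, each value occupies one contiguous run of rows,
# so a segment only sees the values whose runs overlap it.
rows_per_value = rows // unique_values                        # 10,000 rows per value
distinct_per_sorted_segment = segment_size // rows_per_value  # ~100 distinct values per segment

print(rows_per_value, distinct_per_sorted_segment)  # 10000 100
```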

      1. Thanks Matt… I meant high cardinality there… I always get that backwards for some reason, not sure why as it is just a matter of thinking “how many birds do you have in that bag, a high number of cardinals, or a low number?”

  5. Hi Matt,
    When applying sort in Power Query, it seems to not save the table in a sorted way into the data model due to lazy evaluation. I’m using Power Query to import the data from a folder of .csv files and then unpivot them. Do you have any idea how to make Power Query actually save the query to the data model in the specified order? If not, do you know of a suitable work-around?

    Related thread: https://forum.powerpivotpro.com/forums/topic/query-refresh-or-using-slicers-uses-up-64-gb-ram-on-3gb-input-excel-crashes/#post-7521

    Thanks and best regards,
    Richard

    1. Mmmm, Interesting. I didn’t realise that, but I can understand how it happens. I have been sitting here thinking about how you could do a couple of steps after the sort – something that can’t be done before the sort, then undo it. But I think the process of undoing it will be caught by the lazy evaluation.

      In Excel 2010, there is no way to import directly from power query to power pivot. Instead you have to import in Power Pivot from an existing connection\power query. In the Excel 2010 wizard you can write SQL over the power query, so you could do it there.

      You could add an ID column after sorting, but that would be expensive if you loaded it. What if you added an ID column, then added a custom column that evaluates to 1 for the first row and to null for all other rows? It can’t do this without the ID column. Then delete the ID column. You would still load the extra column, but it would be insignificant and you can just hide it. I think if you delete it before load, the lazy evaluation would kick in again.

  6. Hi Matt,
    Thank you very much for your reply. I have tried your suggestion and I think that does not cause Power Query to rearrange the data internally.

    Here is why: When you sort a table in Power Query and load it to Excel, the table is properly sorted in Excel. If, however, you additionally add the query to the data model, the table in Excel gets into a seemingly arbitrary sort order. This is what I presume the order in memory to be.

    Do you have any other idea on how to solve this sort issue?

    I also posted about this issue here: http://www.excelguru.ca/forums/showthread.php?7361-How-to-save-a-sorted-query-to-the-data-model-(for-performance-reasons-in-Power-Pivot)&p=30166&posted=1#post30166

    Best regards, Richard

  7. Hi Matt,

    I did try your suggestion.

    This is the resulting table in Excel when loading to Excel and NOT loading to the data model. It shows the correct sort order.
    Date ISOWeek WeekInCalendar Weekday Index isFirstRow
    29.10.2016 00:00 43 2016-CW43 Saturday 0 first row
    30.10.2016 00:00 43 2016-CW43 Sunday 1
    31.10.2016 00:00 44 2016-CW44 Monday 2

    And this is the resulting table in Excel when loading to Excel and loading to the data model. It is sorted somewhat arbitrarily
    Date ISOWeek WeekInCalendar Weekday Index isFirstRow
    21.11.2016 00:00 47 2016-CW47 Monday 23
    04.02.2017 00:00 5 2017-CW05 Saturday 98
    03.12.2016 00:00 48 2016-CW48 Saturday 35

    Again, I assume the second output corresponds to how the table is saved. I could send you the sample workbook, but I couldn’t find your email address.

    Best regards,
    Richard

  8. Very interesting indeed. I think this is a question for Chris Webb, or maybe Ken Puls. It could be that lazy evaluation and pre-sorting data for compression are simply two concepts that don’t play well together – I don’t know.

    1. Hi Matt,
      Yes, it seems like these two concepts don’t play well together. Thanks for considering Chris Webb and Ken Puls. If needed I can send you guys a sample workbook.

      The only work-arounds I can think of to implement the optimization of this article seem pretty nasty:
      1) Csv input files -> Power Query (load from folder, unpivot, merge some columns) -> Load to Excel data model -> somehow export to Csv -> Import to Access -> Import to Excel data model via SQL query with an ORDER BY clause.
      2) Csv input files -> implement the ETL in Access -> Import to Excel data model via SQL query with an ORDER BY clause.

      Best regards,
      Richard

      1. Good news. I met Matt Masson (Principal PM on Power Query at Microsoft) at Difinity.co.nz yesterday. I had a chat with him about your problem and his opinion is it should work (i.e. it might be a bug). Are you able to send me your workbook to share with MS to take a look? You can contact me via my website.

  9. Hi Matt,
    Thank you very much for your deep interest!
    I sent you an email through the contact form on your website regarding sending the sample workbook.

    Thank you very much and best regards,
    Richard

  10. I can confirm now that sorting in Power Query does affect the data model and that improved or worsened compression does take place as a result. This can be deduced from the file size, and it only happened for me when I imported more than a few million rows.

    I did a test with 36 million rows using Excel 2016 x64 and my file as outlined here: https://forum.powerpivotpro.com/forums/topic/query-refresh-or-using-slicers-uses-up-64-gb-ram-on-3gb-input-excel-crashes/#post-7521
    When not sorting in Power Query, the resulting file size is 54,369 KB.
    When sorting in Power Query by KPI long name, Date&time, Cell ID, the resulting file size is 98,113 KB. To elaborate, I mean the following sort command:
    Table.Sort(#"Reordered Columns",{{"KPI long name", Order.Ascending}, {"Date&time", Order.Ascending}, {"Cell ID", Order.Ascending}})

    Regarding the different file sizes, I think I encountered a similar case as Matt outlined in the article:
    “As you can see in the memory usage table below, I got further improvements over the Product sorting, however note that the space used on disk is higher.”

    Regarding the issue which I previously stated in the comments, when loading a query to Excel and to the data model, the table in Excel is not sorted properly. This may be a bug but the seemingly arbitrary sort order in the Excel table apparently does not correspond to the sort order in the data model, at least not when only loading to the data model.

    Thanks a lot to Matt for his insights.

    Best regards,
    Richard
