
My last two blog posts have been about data compression, and that theme continues today.  In December I blogged about the basics of compression and the impact that columns have on file size, and did some testing with randomised data.  In January I went a bit deeper and talked about restructuring your data to get better compression.  In the comments on that January post, a seemingly innocuous remark from Mati Selg caught my eye, and that is the topic for today – the compression benefits of sorting your data on load.

SSAS uses segments

Now I am no SQL Server ninja – I am competent and can work my way around SSMS and write queries as needed, but I don’t come close to a DBA.  I have heard of the concept of segments (groups of records) before and I have a basic understanding, but I never really took much notice until now.  My layman’s explanation is that SSAS loads data in groups of records, similar to the way you would pack your groceries into separate bags to make your shopping easier to carry.  In Excel the segment size is 1 million rows.  So Power Pivot sorts, loads and compresses data in segments – 1 million records at a time.  Each segment is discrete when it comes to data compression.  Of course the segments are magically recombined at run time in a way that is seamless to the user.

OK, so why is this an opportunity?

Imagine you have 50,000 products in your data table and you have 50,000,000 rows of data.  Power Pivot will take the first 1 million rows it comes to (one segment’s worth), work out how to sort and compress the columns, and then compress the data into a single segment before moving to the next 1 million rows (in the order they are loaded).  When it does this, it is highly likely that every product number will appear in every single segment – all 50 segments.  If we assume an equal number of records for each product (unlikely, but OK for this discussion), there would be 1,000 records for each product spread throughout the entire data table, and each and every segment is likely to contain all 50,000 product IDs.  This is not good for compression.

What if you sort the product column first?

Now imagine you sort the data table by the product column before it gets to Power Pivot.  Each segment will still contain 1 million records, but the first 1,000 records will be product 1, the second 1,000 will be product 2, and so on.  With this “sorted load”, each segment contains only about 1,000 distinct products instead of all 50,000 – roughly a 50x reduction in cardinality for this one Product column (note that other columns may be negatively impacted as a result).  In real life (with real data) some products will occur in the data table more frequently than others, but the benefits of sorting are still likely to exist.
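A quick way to see this effect is to simulate it: count how many distinct values each segment’s dictionary has to hold, unsorted versus sorted.  The sketch below is deliberately scaled down (1,000-row segments, 50 products, 50,000 rows) so it runs instantly; the ratios mirror the 1M-row/50,000-product scenario above, and all the numbers are illustrative.

```python
import random

SEGMENT_SIZE = 1_000   # scaled-down stand-in for Power Pivot's 1M-row segments
NUM_PRODUCTS = 50      # scaled-down stand-in for 50,000 products
NUM_ROWS = 50_000      # scaled-down stand-in for 50M rows -> 50 segments

def distinct_per_segment(rows):
    """Average number of distinct values each segment must encode."""
    segments = [rows[i:i + SEGMENT_SIZE] for i in range(0, len(rows), SEGMENT_SIZE)]
    return sum(len(set(seg)) for seg in segments) / len(segments)

random.seed(42)
unsorted_rows = [random.randrange(NUM_PRODUCTS) for _ in range(NUM_ROWS)]

# Unsorted: virtually every product appears in every segment (average close to 50 of 50).
# Sorted: each segment covers a narrow, contiguous slice of products (average around 2).
print(distinct_per_segment(unsorted_rows))
print(distinct_per_segment(sorted(unsorted_rows)))
```

Fewer distinct values per segment means smaller per-segment dictionaries and longer runs of repeated values, which is where the compression gain comes from.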

Let’s see it in action

Picking up from the last blog in January, I started originally with a workbook that was 264MB on disk (27.6 million rows in the data table).  After completing the compression work discussed last time, the file on disk was reduced to 238MB, a reduction of about 10%.

So now, starting with the newly compressed 238MB workbook from last month, I changed my SQL Server query and simply added an ORDER BY Product_CD clause at the end.  This is easy to do as long as you are loading data from SQL Server (or an equivalent database).  Just go to your table in Power Pivot, click DESIGN\TABLE PROPERTIES, and then change the wizard type from Table Preview to Query Editor.  You can then simply add the ORDER BY clause at the end of the query in the dialog box.


And the result on my test data (which is real data by the way)?  A staggering additional 24% reduction in file size from 238MB to 175MB, a reduction of 63MB on top of the previous 10% reduction.

Actually the memory test tool reports 236MB, but the file size on disk is definitely 175MB as indicated below.  Note how the memory usage for a lot of the columns has changed between the before and after scenarios.  The sorting has flow-on effects on almost every column – it is the combined impact of these changes across all columns that affects the file size.

Compressed file from January: Size on Disk = 238MB
Compressed file with Product Code sorting on load: Size on Disk = 175MB

Now I don’t get excited much, but when I can cut 24% off my file size with this simple change, then that makes me excited.

Edit 10 Feb ’15.  I just added a single order by clause to another large (339MB) workbook I use in production.  The result – a doubly staggering reduction in file size of 177MB (53% reduction).

Which Column(s) Should I Sort By?

Well to loosely quote The Italians, “It depends.  You need to test it on your data as every situation is different.”

What I did next was to change my ORDER BY clause and sort by Customer instead.  As you can see in the memory usage table below, the reported memory usage improved further over the Product sorting; however, note that the space used on disk is higher.

File from January (10% compressed): Size on Disk = 238MB
Product Code sorting: Size on Disk = 175MB
Customer Code sorting: Size on Disk = 193MB

It is not clear to me whether the memory usage tool is not reporting memory usage accurately, or whether the standard Excel file compression is having an impact when the file is stored on disk.  Regardless, in my case I am looking for the smallest file size, so Product Code sorting is better.  If any experts reading this can confirm what is happening here, that would be great.

Sorting by 2 Columns?

The last thing I tested was to sort by Product Code and then by Customer Code.  This gave me a marginal additional 4MB saving on disk but the memory usage was reported as 248MB, up from 228MB.

Final Comments

Keep in mind that the reason this works is because Power Pivot for Excel loads data in segments of 1 million rows.  It therefore follows that you can only get the benefits of sorting your tables if the table in question has more than 1 million rows of data.  There is no value in sorting most lookup tables or smaller data tables.

Sydney Training 25/26 Feb 2016

For those readers living in Sydney Australia, I will be teaching Rob’s Power Pivot University Training Course in Sydney on 25th and 26th Feb 2016.  Click here to find out more and register to attend.

 

Matt Allington

Matt Allington is a Microsoft MVP specialising in Power Pivot, Power BI and Power Query consulting and training, based in Sydney, Australia. Visit Matt's blog here.

This Post Has 26 Comments
  1. @Matt – “I changed my SQL Server Query and simply added an ORDER BY Product_CD clause”

    ORDER BY is not supported in views in SQL Server 2012 and above (unless accompanied by TOP (99.9999) PERCENT in the SELECT statement, which could eat up a few rows on large data) – which means you will end up doing the sorting on the client side (inside PQ).
    Not sure if sorting inside PQ is as efficient.

    1. Thanks for posting, Sam. What I suggest you do in this case is one of the following:
      1. Go to IT and get them to add the ORDER BY clause to the view.
      2. If they can’t/won’t do this, then get them to send you a copy of the SQL in the view. You then paste this code into the Query Editor and add the ORDER BY clause yourself. I do this ALL the time.

    2. Why would you get the top 99.9999?? TOP 100 PERCENT works fine.

      But as Matt explains below you don’t even need that since you can SELECT col1, col2 FROM YourView ORDER BY col2; in PowerPivot or in Power Query.

      Even when you apply the sorting in a second step in Power Query, the Query Folding feature pushes as much as possible to the SQL Server. I guess in most cases a sort in Power Query will be translated to an ORDER BY clause in the statement that will be sent to the server.
      To be sure you should use Extended Events (or Profiler,…) and capture the query on your SQL instance when you execute the Power Query.

      I just ran some tests today and what really puzzles me is that the same query with the ORDER BY actually led to a faster data refresh than without.
      When executed in SSMS it was the other way around, as we would expect. The execution plan said the query with ORDER BY took 95% of the batch and the one without 5%. But strangely the ORDER BY version was executed in Batch mode and used parallelism, while the one without used Row mode, and the duration of both versions was just a little different: 16 min vs 17 min for a 51 million row fact table.
      In Powerpivot data refresh without ORDER BY completed in 10:50, with ORDER BY in 07:35.
      The file size shrunk from 287 MB to 163MB.

      1. That is really interesting. I don’t know anywhere near enough about query folding, but would like to know more. I also don’t know a lot (just some) about profiler. Is the profiler the best tool to investigate which query is being sent from PQ? Can you actually see the SQL code or the query plan, or just the events?

        1. I just did some testing (Excel 2016 with SQL Server 2014).

          I connected to my DimDate table, got the M-statement, enabled Extended events (the Profiler successor) and captured the query on the SQL Server when I refreshed. Then I added a sort by DayInYear in a second step, looked again at the M and SQL statements, and finally replaced the sort with a filter on DayInWeek in (5,6).

          let
          Source = Sql.Database("sqlserver.domain.be", "DataHub"),
          dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
          in
          dbo_DimDate

          becomes:

          execute sp_executesql N'select [$Ordered].[DateKey],
          [$Ordered].[Date],
          [$Ordered].[ISODate],
          [$Ordered].[DayInWeek],
          [$Ordered].[DayInMonth],
          [$Ordered].[DayInYear],
          [$Ordered].[HolidayBE],
          [$Ordered].[MonthNumber],
          [$Ordered].[Year],
          [$Ordered].[ISOYear],
          [$Ordered].[ISOWeek],
          [$Ordered].[JaarWeek],
          [$Ordered].[DayOfWeek],
          [$Ordered].[DayOfWeekNL],
          [$Ordered].[DwNL],
          [$Ordered].[MonthName],
          [$Ordered].[MaandNaam],
          [$Ordered].[Quarter],
          [$Ordered].[Year_Month],
          [$Ordered].[Year_Quarter],
          [$Ordered].[JaarMaand],
          [$Ordered].[IsWorkDay]
          from [dbo].[DimDate] as [$Ordered]
          order by [$Ordered].[DateKey]'

          With the sort step added, the PowerQuery advanced editor shows:

          let
          Source = Sql.Database("sqlserver.domain.be", "DataHub"),
          dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
          #"Sorted Rows" = Table.Sort(dbo_DimDate,{{"DayInYear", Order.Ascending}})
          in
          #"Sorted Rows"

          and the SQL Query is:

          execute sp_executesql N'select [_].[DateKey],
          [_].[Date],
          [_].[ISODate],
          [_].[DayInWeek],
          [_].[DayInMonth],
          [_].[DayInYear],
          [_].[HolidayBE],
          [_].[MonthNumber],
          [_].[Year],
          [_].[ISOYear],
          [_].[ISOWeek],
          [_].[JaarWeek],
          [_].[DayOfWeek],
          [_].[DayOfWeekNL],
          [_].[DwNL],
          [_].[MonthName],
          [_].[MaandNaam],
          [_].[Quarter],
          [_].[Year_Month],
          [_].[Year_Quarter],
          [_].[JaarMaand],
          [_].[IsWorkDay]
          from [dbo].[DimDate] as [_]
          order by [_].[DayInYear]'

          The filter step becomes:

          let
          Source = Sql.Database("sqlserver.domain.be", "DataHub"),
          dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
          #"Filtered Rows" = Table.SelectRows(dbo_DimDate, each ([DayInWeek] = 5 or [DayInWeek] = 6))
          in
          #"Filtered Rows"

          which is, in SQL:

          execute sp_executesql N'select [$Ordered].[DateKey],
          [$Ordered].[Date],
          [$Ordered].[ISODate],
          [$Ordered].[DayInWeek],
          [$Ordered].[DayInMonth],
          [$Ordered].[DayInYear],
          [$Ordered].[HolidayBE],
          [$Ordered].[MonthNumber],
          [$Ordered].[Year],
          [$Ordered].[ISOYear],
          [$Ordered].[ISOWeek],
          [$Ordered].[JaarWeek],
          [$Ordered].[DayOfWeek],
          [$Ordered].[DayOfWeekNL],
          [$Ordered].[DwNL],
          [$Ordered].[MonthName],
          [$Ordered].[MaandNaam],
          [$Ordered].[Quarter],
          [$Ordered].[Year_Month],
          [$Ordered].[Year_Quarter],
          [$Ordered].[JaarMaand],
          [$Ordered].[IsWorkDay]
          from
          (
          select [_].[DateKey],
          [_].[Date],
          [_].[ISODate],
          [_].[DayInWeek],
          [_].[DayInMonth],
          [_].[DayInYear],
          [_].[HolidayBE],
          [_].[MonthNumber],
          [_].[Year],
          [_].[ISOYear],
          [_].[ISOWeek],
          [_].[JaarWeek],
          [_].[DayOfWeek],
          [_].[DayOfWeekNL],
          [_].[DwNL],
          [_].[MonthName],
          [_].[MaandNaam],
          [_].[Quarter],
          [_].[Year_Month],
          [_].[Year_Quarter],
          [_].[JaarMaand],
          [_].[IsWorkDay]
          from [dbo].[DimDate] as [_]
          where [_].[DayInWeek] = 5 or [_].[DayInWeek] = 6
          ) as [$Ordered]
          order by [$Ordered].[DateKey]'

    1. Edit: OK, so you can’t add Order By inside the view. But you definitely can add Order By as part of your select statement that points to your view.

      Select *
      from myView
      order by myView.ProductCode

      Just change the Table Properties wizard from Table Preview to Query Editor and add the ORDER BY clause.

      Given what I understand about query folding, I believe the task of doing the sorting is pushed to SQL Server and not Power Pivot.

  2. Thanks Matt, great article.
    I used it on my small workbook (8192MB) which holds around 2M records. The data source is an Access DB; I sorted the queries there and got a reduced size after I refreshed my workbook (new file size: 8081MB, about a 1.4% reduction). I guess this technique works on all file sizes, but the improvement is less as the file size shrinks.

    1. Are your numbers correct? 8.1 gigabytes for 2M records? If so, something would seem to be wrong. I normally get file sizes around 40MB for 2M rows. If these numbers are correct, my best guess is you either have lots of materialised uncompressed data in worksheets, or a very wide data table with lots and lots of columns. If the latter, this is a very inefficient shape for data.

      Can you confirm?

      Matt

  3. My bad! It is KB not MB, but I always read the first number in my head – sorry about that. So yeah, the real number is 8,192KB not MB.
    I have a total of six tables including the date table; total records in one table is over a million, and the other tables might sum up to another million.

  4. Two questions:
    1) Does this only apply to importing into Excel, or does it apply to SSAS and/or PowerBI Desktop?
    2) Would you say that, in general, what you are looking to do is order by columns with low cardinality, or the columns with the least cardinality in the database?

    Thanks!
    Eric

    1. It applies to both; however, the segment size for Excel is 1 million rows, while the default (configurable) for SSAS is 8 million. So sorting in SSAS with fewer than 8 million rows in a single table will have less impact (it may still have some impact if you can improve on the “order” in which columns are sorted – you need to test it).

      In general, you want to sort high cardinality columns. A column with 2 unique values over 500 million rows won’t take up much space regardless of whether it is sorted or not. If your column has 50,000 unique values (at random) over 500 million rows, you will likely have 50,000 unique values per segment unsorted, but only around 100 unique values per segment once sorted. That is the principle anyway – you need to test it on your data.
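      That back-of-the-envelope arithmetic can be checked in a couple of lines (the figures are the hypothetical ones from the comment above, assuming the values are spread evenly):

```python
rows = 500_000_000        # hypothetical fact table size from the comment above
segment_size = 1_000_000  # Excel's segment size
unique_values = 50_000    # hypothetical column cardinality

# Once sorted, each value occupies one contiguous run of rows,
# so a segment only sees the values whose runs overlap it.
rows_per_value = rows // unique_values                        # 10,000 rows per value
distinct_per_sorted_segment = segment_size // rows_per_value  # ~100 distinct values per segment

print(rows_per_value, distinct_per_sorted_segment)  # 10000 100
```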

      1. Thanks Matt… I meant high cardinality there… I always get that backwards for some reason, not sure why as it is just a matter of thinking “how many birds do you have in that bag, a high number of cardinals, or a low number?”

  5. Hi Matt,
    When applying sort in Power Query, it seems to not save the table in a sorted way into the data model due to lazy evaluation. I’m using Power Query to import the data from a folder of .csv files and then unpivot them. Do you have any idea how to make Power Query actually save the query to the data model in the specified order? If not, do you know of a suitable work-around?

    Related thread: https://forum.powerpivotpro.com/forums/topic/query-refresh-or-using-slicers-uses-up-64-gb-ram-on-3gb-input-excel-crashes/#post-7521

    Thanks and best regards,
    Richard

    1. Mmmm, Interesting. I didn’t realise that, but I can understand how it happens. I have been sitting here thinking about how you could do a couple of steps after the sort – something that can’t be done before the sort, then undo it. But I think the process of undoing it will be caught by the lazy evaluation.

      In Excel 2010, there is no way to import directly from power query to power pivot. Instead you have to import in Power Pivot from an existing connection\power query. In the Excel 2010 wizard you can write SQL over the power query, so you could do it there.

      You could add an ID column after sorting, but that would be expensive if you loaded it. What if you added an ID column, then added a custom column that evaluates to 1 for the first row and to null for all other rows? It can’t do this without the ID column. Then delete the ID column. You would still load the extra column, but it would be insignificant and you can just hide it. I think if you delete it before load, the lazy evaluation would kick in again.

  6. Hi Matt,
    Thank you very much for your reply. I have tried your suggestion and I think that does not cause Power Query to rearrange the data internally.

    Here is why: When you sort a table in Power Query and load it to Excel, the table is properly sorted in Excel. If, however, you additionally add the query to the data model, the table in Excel gets into a seemingly arbitrary sort order. This is what I presume the order in memory to be.

    Do you have any other idea on how to solve this sort issue?

    I also posted about this issue here: http://www.excelguru.ca/forums/showthread.php?7361-How-to-save-a-sorted-query-to-the-data-model-(for-performance-reasons-in-Power-Pivot)&p=30166&posted=1#post30166

    Best regards, Richard

  7. Hi Matt,

    I did try your suggestion.

    This is the resulting table in Excel when loading to Excel and NOT loading to the data model. It shows the correct sort order.
    Date ISOWeek WeekInCalendar Weekday Index isFirstRow
    29.10.2016 00:00 43 2016-CW43 Saturday 0 first row
    30.10.2016 00:00 43 2016-CW43 Sunday 1
    31.10.2016 00:00 44 2016-CW44 Monday 2

    And this is the resulting table in Excel when loading to Excel and loading to the data model. It is sorted somewhat arbitrarily
    Date ISOWeek WeekInCalendar Weekday Index isFirstRow
    21.11.2016 00:00 47 2016-CW47 Monday 23
    04.02.2017 00:00 5 2017-CW05 Saturday 98
    03.12.2016 00:00 48 2016-CW48 Saturday 35

    Again, I assume the second output corresponds to how the table is saved. I could send you the sample workbook, but I couldn’t find your email address.

    Best regards,
    Richard

  8. Very interesting indeed. I think this is a question for Chris Webb, or maybe Ken Puls. It could be that lazy evaluation and pre-sorting data for compression are simply two concepts that don’t play well together – I don’t know.

    1. Hi Matt,
      Yes, it seems like these two concepts don’t play well together. Thanks for considering Chris Webb and Ken Puls. If needed I can send you guys a sample workbook.

      The only work-arounds I can think of to implement the optimization of this article seem pretty nasty:
      1) Csv input files -> Power Query (load from folder, unpivot, merge some columns) -> Load to Excel data model -> somehow export to Csv -> Import to Access -> Import to Excel data model via SQL query with an ORDER BY clause.
      2) Csv input files -> implement the ETL in Access -> Import to Excel data model via SQL query with an ORDER BY clause.

      Best regards,
      Richard

      1. Good news. I met Matt Masson (Principal PM on Power Query at Microsoft) at Difinity.co.nz yesterday. I had a chat with him about your problem and his opinion is it should work (i.e. it might be a bug). Are you able to send me your workbook to share with MS to take a look? You can contact me via my website.

  9. Hi Matt,
    Thank you very much for your deep interest!
    I sent you an email through the contact form on your website regarding sending the sample workbook.

    Thank you very much and best regards,
    Richard

  10. I can confirm now that sorting in Power Query does affect the data model and that improved or worsened compression does take place as a result. This can be deduced from the file size, and it only happened for me when I imported more than a few million rows.

    I did a test with 36 million rows using Excel 2016 x64 and my file as outlined here: https://forum.powerpivotpro.com/forums/topic/query-refresh-or-using-slicers-uses-up-64-gb-ram-on-3gb-input-excel-crashes/#post-7521
    When not sorting in Power Query, the resulting file size is 54,369 KB.
    When sorting in Power Query by KPI long name, Date&time, Cell ID, the resulting file size is 98,113 KB. To elaborate, I mean the following sort command:
    Table.Sort(#"Reordered Columns",{{"KPI long name", Order.Ascending}, {"Date&time", Order.Ascending}, {"Cell ID", Order.Ascending}})

    Regarding the different file sizes, I think I encountered a similar case as Matt outlined in the article:
    “As you can see in the memory usage table below, I got further improvements over the Product sorting, however note that the space used on disk is higher.”

    Regarding the issue which I previously stated in the comments, when loading a query to Excel and to the data model, the table in Excel is not sorted properly. This may be a bug but the seemingly arbitrary sort order in the Excel table apparently does not correspond to the sort order in the data model, at least not when only loading to the data model.

    Thanks a lot to Matt for his insights.

    Best regards,
    Richard
