I have read a lot about how good Power Pivot compression is over the last couple of years, and I have a pretty good understanding of how it all works.  But there were some doubts in my mind after some recent reading, so I thought I would simply do some tests to firm up my understanding – I am sharing that testing and learning in this blog today.

There are a number of compression algorithms that Power Pivot uses.  Today I am going to talk about just one of them – Run Length Encoding – and do some testing to see what happens under a couple of different scenarios.

Run Length Encoding – First The Basics

Take a look at the sample data below.  The column on the left (the raw data) contains a list of 3 possible values (A, B, C). Power Pivot uses a compression technique called Run Length Encoding (RLE) to compress columns of data like this.  When the data is loaded into Power Pivot, the VertiPaq engine will first sort the column so that it looks like the orange column in the middle.  Once the data is sorted like this, Power Pivot then analyses the data to see if it is a good candidate for RLE.  When it is beneficial to do so, Power Pivot will create a new table like the blue one on the right (and store that blue table – or something similar) rather than physically storing the original column of data.  Columns of data with a low cardinality (a small number of unique values) are great candidates for RLE and compress very well using this method.

[Image: the raw data column, the sorted column (orange), the RLE table (blue) and the minimal stored version (black)]

Power Pivot would end up storing a table that looks more like the black table above (rather than the blue one), keeping just the minimum amount of information it needs to rebuild the real table of data on the fly when and if required.  If the black RLE table ended up taking more space than the original column of data, then there would be no benefit from RLE and the original column of data would be stored.  Power Pivot may use one or more of its other compression techniques as well as, or instead of, RLE – it all depends on the specifics of the actual data.
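
If you prefer to see the idea in code, here is a minimal Python sketch of the same process – purely illustrative, and my own simplification rather than anything resembling the real VertiPaq implementation:

```python
from itertools import groupby

# Toy version of Run Length Encoding - illustrative only, not the real VertiPaq engine.
raw = ["B", "A", "C", "A", "B", "B", "C", "A", "A", "B", "C", "C"]

# Step 1: sort the column (like the orange column above)
sorted_column = sorted(raw)

# Step 2: store each distinct value once, together with the length of its run
rle = [(value, len(list(run))) for value, run in groupby(sorted_column)]

print(rle)  # [('A', 4), ('B', 4), ('C', 4)] - 12 cells collapse to 3 (value, run length) pairs
```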

How I Conducted My Compression Tests

I created a table in Excel with 10 columns of integers and 100,000 rows (1 million data points in all).  Each integer was generated as a random number between 1 and 50 and then converted to a value (ie it was not saved as a RANDBETWEEN formula).  A sample of the data is shown below.

[Image: a sample of the 10 columns of random integer test data]

I then saved this file as an XLSX file and a CSV file.  Then I created a new blank Excel workbook and imported all 10 columns from the Excel file directly into Power Pivot in the blank workbook.
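
For anyone who wants to generate similar test data without Excel, here is a rough Python sketch of the equivalent process (the file name is hypothetical, and I actually built my data with RANDBETWEEN in Excel rather than with a script):

```python
import csv
import random

ROWS, COLS = 100_000, 10  # 1 million data points in total

# Write 10 columns of random integers between 1 and 50, mirroring the
# RANDBETWEEN(1,50) values that were pasted as static values in the workbook.
with open("test_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([f"Col{i}" for i in range(1, COLS + 1)])
    for _ in range(ROWS):
        writer.writerow([random.randint(1, 50) for _ in range(COLS)])
```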

Just a Few Caveats Up Front

Of course this is not real life data – it is just test data.  Real life data will compress more or less than my test data – it totally depends on the data itself.  Also, the data in my test columns is all of the same type and the same number range (ie integers from 1 to 50) – this is not likely in real life.  Nonetheless this is still an interesting exercise, and the similarity of the number ranges across the columns gives me a chance to compare the compression impact across the different columns when there are more or fewer columns in the data model – one of my main reasons for doing this test.

Preliminary Results

As you can see in the table below, the original Excel workbook is the largest. It was actually quite surprising to me that the XLSX was larger than the CSV.  Excel has inbuilt ZIP compression in all XLSX files, so I expected it to be smaller than the CSV.  (As an aside, you can actually rename an XLSX workbook with a ZIP extension and browse the XML files stored inside.)  My assumption is that the compressed XML storage of this data simply takes up more space than the uncompressed CSV storage of the same data (in this case anyway).  I would not expect this to be the case with normal production data and/or data that has some text values. In real life I would generally expect the XLSX to be smaller than the CSV.

The CSV file is about half the size of the original Excel file, but the Power Pivot workbook is just 15.6% of the size of the original Excel workbook (ie it is compressed 6.4 times).

[Image: table comparing the file sizes of the XLSX, CSV and Power Pivot workbooks]

Individual Column Impact on Compression

The next thing (the main thing) I wanted to test was the impact of each incremental column (or decremental column in this case, given I was going to remove columns) on the size of the workbook.  I set about creating another 9 workbooks, each with 1 additional column removed.  Using Power Query, I imported the file details from Windows Explorer into Excel to analyse the results (shown below).

[Image: table and chart of workbook file size vs number of columns]

I was actually quite surprised that there was a linear relationship between the number of columns in each workbook and the size of the file.  Now of course this is artificial data and the cardinality of each column is identical – this won't happen in real life.  But this test does seem to confirm that each column is compressed on its own merits, without regard to any other column in the workbook and without regard to how many other columns there are in the workbook.
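
As a sanity check on that result, here is a small Python simulation using the same toy version of RLE (again my own simplification, not the real engine). Because every column contains identically distributed random numbers, sorting the table by any one column leaves the other columns just as random as before, so each extra column adds roughly the same amount of data – which is consistent with the linear growth I measured:

```python
import random
from itertools import groupby

random.seed(42)
ROWS, DISTINCT_VALUES, COLS = 100_000, 50, 3

def run_count(column):
    """Number of (value, run length) pairs RLE would need for the column as laid out."""
    return sum(1 for _ in groupby(column))

# Three identically distributed random columns, like the test data
table = [[random.randint(1, DISTINCT_VALUES) for _ in range(ROWS)] for _ in range(COLS)]

# Sort the whole table by the first column (rows stay aligned across all columns)
order = sorted(range(ROWS), key=lambda r: table[0][r])
sorted_table = [[column[r] for r in order] for column in table]

for i, column in enumerate(sorted_table, start=1):
    print(f"Column {i}: {run_count(column):,} runs for {ROWS:,} rows")
# Column 1 collapses to ~50 runs; the other random columns stay close to 100,000 runs,
# so each additional column costs about the same - hence the near-linear file growth.
```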

Columns with Extremes of Cardinality

The next thing I did was take a copy of the Power Pivot Workbook that contained the full 10 columns and 100,000 rows of data.  I then created 3 new workbooks – each workbook had 1 additional column added to the data model.

  1. In one of the workbooks I added an ID column that uniquely identifies each row (high cardinality – integers from 1 to 100,000). 
  2. In the second workbook I added a column with random binary digits (low cardinality – either 0 or 1) and imported it from the source.
  3. In the last workbook I created the random binary column using a calculated column instead of importing the data from the source.

[Image: the calculated column used to create the random binary values]

Note how the workbook with the extra column of 0s and 1s only (low cardinality) increased the file size by just 12KB (1.3% bigger).  Yet the file that also has just 1 extra column – this time an ID column (high cardinality) – increased the file size by 2.5 times. Also note that the binary calculated column made the workbook only marginally larger than when the same data was imported from the source.  Given the random numbers are not identical, this is not a 1-for-1 test, so I would call the difference in file size negligible.
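
The toy RLE sketch from earlier makes it clear why cardinality matters so much (again, this is just an illustration of the principle, not the actual engine):

```python
import random
from itertools import groupby

ROWS = 100_000

def rle_pairs_after_sort(column):
    # Number of (value, run length) pairs needed once the column is sorted
    return sum(1 for _ in groupby(sorted(column)))

id_column = list(range(1, ROWS + 1))                          # high cardinality - every value unique
binary_column = [random.randint(0, 1) for _ in range(ROWS)]   # low cardinality - just 0 or 1

print(f"ID column:     {rle_pairs_after_sort(id_column):,} pairs")      # 100,000 - RLE saves nothing
print(f"Binary column: {rle_pairs_after_sort(binary_column):,} pairs")  # 2 - almost nothing to store
```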

Cardinality is your enemy when seeking smaller workbooks and faster performance in Power Pivot, and ID columns are amongst the worst culprits (eg Invoice Number).  Of course if you need the high cardinality column in your data set, then you should import it.  But if you don't need it, then do yourself a favour and leave it out of your data model.

Row Impact on Compression

The next thing I did was take a look at the impact of increasing or decreasing the number of rows on the size of the files.

Half Size File: I took the original 100,000 row file loaded into Power Pivot, and deleted half the rows, leaving 50,000 rows of data.

10 Fold Increase File:  I took the original 100,000 row file and increased the data set to 1 million rows (using the same random number between 1 and 50 technique across all 10,000,000 data points).

[Image: table comparing the 50,000 row, 100,000 row and 1,000,000 row workbook sizes]

Using the original 100,000 row file as the base case, you can see in the table above that the more rows the file has, the more efficient the compression gets*.  The file that has 10 times more rows is only 7.2 times larger.  And the file that has half the number of rows is actually 61% of the size.  *Note: this is the case for this sample data.  It may or may not be the case with your production data – it will depend on your data.  If your data tends to have similar values over time, then compression will generally behave like this.  If the values tend to change over time, then you may not get this result.
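
I can only guess at the reason, but one plausible contributor (my assumption – I have not verified how the engine allocates its overheads) is that some per-column costs, such as the dictionary of distinct values, are roughly fixed no matter how many rows there are, so they get spread over more rows in the bigger files. The little sketch below simply shows that the dictionary for this test data stays the same size as rows are added:

```python
import random

# Assumption for illustration: a per-column dictionary of distinct values is a roughly
# fixed cost, so its share of the total file shrinks as the row count grows.
for rows in (50_000, 100_000, 1_000_000):
    column = [random.randint(1, 50) for _ in range(rows)]
    distinct = len(set(column))
    print(f"{rows:>9,} rows -> {distinct} distinct values to store in the dictionary")
```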

Key Things to Take Away From the Testing

The key thing I learnt from this testing was that each column seems to compress based on its own merits completely independently of all other columns.  It also confirmed what I already knew – that low cardinality columns (such as binary columns) have an almost negligible impact on the size of your workbook.   The implication is that it can be highly beneficial to use calculated columns in your data tables for complex DAX formulas providing the calculated column returns a low cardinality set of data.  This is exactly the point that Greg Baldini made in the comments section of my last blog post.

Further Reading

There is excellent coverage of compression in the new book from The Italians, The Definitive Guide to DAX.  A word of warning though – this book is not targeted at Excel beginners and it contains concepts that will take a deep review to absorb.  However, when you want to go deep, this is a great read (and reference guide).

Also, you may like to check out the “What’s eating up my memory” blog post from Scott at Tiny Lizard.  Scott made some improvements on the VBA code (first produced by Kasper de Jonge) that analyses your workbooks and shows which columns are using the most memory (Excel 2013+ only).


Matt Allington

Matt Allington is a Microsoft MVP specialising in Power Pivot, Power BI and Power Query consulting and training, based in Sydney, Australia. Visit Matt's blog here.

This Post Has 15 Comments

  1. Interesting article! What should be noted is that in very large tables (millions of rows) it pays big time to sort the data before importing (either in the source or during import with a SQL ORDER BY). This is because PowerPivot computes these RLE tables per million rows. In the case of 2 million rows with only 2 distinct values “A” or “B”, it would pay to sort before importing, so you get one page with only A’s and one page with only B’s, as opposed to two pages with both A’s and B’s. I have tested this on real-life production data (ERP stock moves) and sorting reduced the workbook from ~70 MB to about 30 MB.

    1. Very interesting Bas, thanks for sharing. Do you know if this holds for Power Pivot in Excel too? I assume it does given the underlying engine is essentially the same, but I am not 100% sure if Power Pivot has the same paging technology or not.

      1. I tested this in Power Pivot for Excel, so yes! However I do believe I’ve read somewhere that Tabular has larger pages than Excel, so maybe the difference won’t be as big regarding sort vs non-sort.
        I discovered this trick on one of the sites of “the Italians”. I should look it up for you.

  2. Thanks very much for the shout-out. I would like to emphasize the point I made in the linked comments. I did spend the majority of my words talking about memory, both at rest and at run-time. The important kernel that led to that discussion, though, is more theoretical than technical.

    It’s common to see people solving complex problems with measures. This is great, sometimes business logic simply isn’t logical, and it’s awesome that we have tools that allow us to cope with the trauma, I mean complexity.

    That being said, it’s also unfortunately common to see very complex measures solving problems that don’t need to be solved in measures. I think we often forget that dimensional modelling, and data modelling in general, is a discipline much more nuanced than slapping our tables around until they are in a star schema. Our facts are numeric metrics that can be aggregated in various ways. Our dimensions are logical groupings of attributes which can impact the evaluation of these aggregations, through filtering or perhaps special handling of specific types of data.

    Sometimes new attributes can be derived from logical groupings of other attributes. In the other post with the golf problem, the key point I was concerned with was why we’re deriving the 8 games to consider at measure evaluation time when this is an attribute that can be derived at model load time. It’s simply a combination of Player, Date, and Score. There is no way for this attribute to change based on filter selections. In this case, why jump through hoops in our measures? We can simply store a flag that indicates which games to consider.

    I moved into a discussion of memory impact because sometimes there is an argument against storing “extra” data in the fact table. I was hoping to fend off those arguments by speaking to the low impact such a field might have. I am happy that the comment spun off this thorough and well-thought-out post.

    I would like to take this chance to emphasize a different point, though. Solve your problems in your model, and you won’t have to solve them with complex and slow measures.

    If something is known at the time of model refresh, it’s likely good to persist that into the model.

  3. Good stuff Matt, and thanks for the shout out. Also see http://tinylizard.com/how-does-power-pivot-store-and-compress-data/ for some similar discussion to what you have here, though somehow mine is less thorough… yet includes some words on hash tables.

    I have to say… your purely linear relationship between filesize and # of columns is kinda freaking me out. Even with this “random” data, I would expect “worse than linear”. I think. 🙂 Each column will be sorted worse than the prior ones… so you get the additional data AND worse compression. In theory. 🙂

    Completely agree that adding calc columns for heavy calcs that map to low cardinality results is super super useful… and certainly not to be avoided!

  4. If what I’ve already read from tiny lizard and sqlbi is true, there shouldn’t be a linear relationship between number of columns and file size; the file should get significantly larger with every column that’s added. Most of the Power Pivot pros seem to agree that the more columns you add, the worse your sorting gets, and thus the worse the RLE in xVelocity gets. This testing seems to indicate that that isn’t true though.

    1. I have exactly the same background as you, which is why I did this testing. Something wasn’t adding up, so I did this analysis. I am happy to be proved wrong, but these were my tests and these were the findings.

      1. I’m not convinced you’re wrong; you did after all do the testing. It just means that the xvelocity engine is doing something weird that maybe even the Italians didn’t know about!

        1. It was after reading “The Definitive Guide to DAX” that I was pushed over the edge and decided to test it. The sentence was “when stored in a columnar database, each column is stored in its own data structure, and it is physically separated from the others”. This sentence was the final straw that prompted me to test my understanding. I will ask Marco when I attend his training in March.

    1. I am attending the Mastering DAX and Optimising DAX training at the end of this month in Sydney. I have a list of questions for Marco as long as my arm (this one included). Marco will think the registration fee he charged me was not enough by the time I have finished a week with him 🙂

      1. @Matt, @Scottsen: Here is my take on an explanation:

        The phenomenon you have observed in the testing, i.e. the linearity when you remove columns piecemeal, is exactly what I would expect. The reason is that every column in the test database has the same distribution (50 values distributed randomly across 100,000 rows). Since this is the case, the columns’ values will statistically have the same potential for compression, hence they will compress in the same way, hence they will take up the same amount of space, hence the linearity.

        @Matt: you wrote earlier in the remark section:

        “when stored in a columnar database, each column is stored in its own data structure, and it is physically separated from the others”

        While it’s correct that every column is stored in its own data structure and physically separated from any other column in RAM, there is still an interdependency between the columns in RAM. That interdependency is the sort order. If the engine decides that the best choice is to sort by 2 of the columns during processing (by presorting the data you are merely giving the engine a hint to work with), the rest of the columns will follow this sort order (and remember, it is the entire table that is being sorted, not an individual column). This is so you can easily reassemble individual rows quickly. The columns are not sorted independently; the sort order of the entire table dictates the order of the rows in every column, including those that are not sorted on. So for example, if you wanted to reassemble row no. 17 you would scan down to the 17th row in every column, and this is pretty easy given the way the data in a column is laid out. This is also the reason why, when you have real data, you lose compression possibilities. If you have a column with only 2 values and that column is not one of the ones being sorted on by the engine, the values could in the worst case be distributed in such a way that a 0 is followed by a 1, followed by a 0, and so on.

        Well this is at least my understanding of the matter.
