I have read a lot about how good Power Pivot compression is over the last couple of years, and I have a pretty good understanding of how it all works. But there were some doubts in my mind after some recent reading, so I thought I would simply do some tests to firm up my understanding – I am sharing that testing and learning in this blog today.
There are a number of compression algorithms that Power Pivot uses. Today I am going to talk about just one of them – Run Length Encoding – and do some testing to see what happens under a couple of different scenarios.
Run Length Encoding – First The Basics
Take a look at the sample data below. The column on the left (the raw data) contains a list of 3 possible values (A, B, C). Power Pivot uses a compression technique called Run Length Encoding (RLE) to compress columns of data like this. When the data is loaded into Power Pivot, the Vertipaq engine will first sort the column so that it looks like the orange column in the middle. Once the data is sorted like this, Power Pivot then analyses the data to see if it is a good candidate for RLE. When it is beneficial to do so, Power Pivot will create a new a table like the blue one on the right (and store that blue table – or something similar) rather than physically storing the original column of data. Columns of data with a low cardinality (a small number of unique values) are great candidates for RLE and compress very well using this method.
Power Pivot would end up storing a table that looks more like the black table above (rather than the blue one), keeping just the minimum amount of information it needs to rebuild the real table of data on the fly when and if required. If the black RLE table ended up taking more space than the original column of data, then there would be no benefit of RLE and the original column of data would be stored. Power Pivot may use one or more of the other compression techniques used as well as, or instead of RLE – it all depends on the specifics of the actual data.
How I Conducted My Compression Tests
I created a table in Excel with 10 columns of integers and 100,000 rows. (1 million data points in all). Each integer was generated as a random number between 1 and 50 and then converted to a value (ie it was not saved as a RANDBETWEEN formula). A sample of the data is shown below.
I then saved this file as an XLSX file and a CSV file. Then I created a new blank Excel workbook and imported all 10 columns from the Excel file directly into Power Pivot in the blank workbook.
Just a Few Caveats Up Front
Of course this is not real life data – it is just test data. Real life data will compress more or less than my test data and it totally depends on the data itself. Also the data in my columns of test data is all of the same type and the same number range (ie integers from 1 to 50) – this is not likely in real life. None the less this is still an interesting exercise, and the similarity of the number ranges across the columns gives me a chance to compare the compression impact across the different columns when there are more or less columns in the data model – one of my main reasons for doing this test.
As you can see in the table below, the original Excel workbook is the largest. It was actually quite surprising to me that the XLSX was larger than the CSV. Excel has inbuilt ZIP compression in all XLSX files so I expected it to be smaller than the CSV. (As an aside, you can actually rename an XLSX workbook and give it a ZIP extension and look at the uncompressed XML data inside the zip file). My assumption is that the compressed XML storage of this data simply takes up more space than the uncompressed CSV storage of the same data (in this case anyway). I would not expect this to be the case with normal production data and or data that has some text values. In real life I would generally expect the XLSX to be smaller than the CSV.
Individual Column Impact on Compression
The next thing (main thing) I wanted to test was the impact of each incremental column (or decremental column in this case given I am going to remove columns) on the size of the workbook. I set about creating another 9 workbooks. Each workbook had 1 additional column removed. Using Power Query, I imported the data from Windows Explorer into Excel to analyse the results (shown below).
I was actually quite surprised that there was a linear relationship between the number of columns in each workbook and the size of the file. Now of course this is artificial data and the cardinality of each column is identical – this wont happen in real life. But this test does seem to confirm that each column is compressed on its own merits without regard to any other column in the workbook, and without regard to how many other columns there are in the workbook.
Columns with Extremes of Cardinality
The next thing I did was take a copy of the Power Pivot Workbook that contained the full 10 columns and 100,000 rows of data. I then created 3 new workbooks – each workbook had 1 additional column added to the data model.
- In one of the workbooks I added an ID column that uniquely identifies each row (high cardinality – integers from 1 to 100,000).
- In the second workbook I added a column with random binary digits (low cardinality – either 0 or 1) and imported it from the source.
- In the last workbook I created the random binary column using a calculated column instead of importing the data from the source.
Note how the workbook with a column of 0 or 1 only (low cardinality) increased the file size by just 12KB (1.3% bigger). Yet the file also with 1 extra column – an ID column (high cardinality) increased the file size by 2.5 times. Also note that the workbook that contains the binary calculated column only made the workbook marginally larger than the case when the data was imported from the source. Given the random numbers are not identical, this is not a 1 for 1 test, so I would call the difference in file size negligible.
Cardinality is your enemy when seeking smaller workbooks and faster performance in Power Pivot, and ID columns are amongst the worst culprits (eg Invoice Number). Of course if you need the high cardinality column in your data set, then you should import it. But if you don’t needed it, then do yourself a favour and leave it out of your data model.
Row Impact on Compression
Then next thing I did was take a look at the impact of increasing or decreasing the number of rows on the size of the files.
Half Size File: I took the original 100,000 row file loaded into Power Pivot, and deleted half the rows, leaving 50,000 rows of data.
10 Fold Increase File: I took the original 100,000 row file and increased the data set to 1 million rows (using the same random number between 1 and 50 technique across all 10,000,000 data points).
Using the original 100,000 row file as the base case, you can see in the table above that the more rows the file has, the more efficient the compression gets*. The file that has 10 times more rows is only 7.2 times larger. And the file that has half the number of rows is actually 61% of the size. *Note: this is the case for this sample data. It may or may not be the case with your production data – it will depend on your data. If you data tends to have similar values over time, then compression will generally behave like above. If the values tend to change over time, then you may not get this result.
Key Things to Take Away From the Testing
The key thing I learnt from this testing was that each column seems to compress based on its own merits completely independently of all other columns. It also confirmed what I already knew – that low cardinality columns (such as binary columns) have an almost negligible impact on the size of your workbook. The implication is that it can be highly beneficial to use calculated columns in your data tables for complex DAX formulas providing the calculated column returns a low cardinality set of data. This is exactly the point that Greg Baldini made in the comments section of my last blog post.
There is excellent coverage about compression in the new book from The Italians The Definitive Guide to DAX. A word of warning though – this book is not targeted to Excel beginners and it contains concepts that will take a deep review to absorb. However when you want to go in deep, this is a great read (and reference guide).
Also, you may like to check out the “What’s eating up my memory” blog post from Scott at Tiny Lizard. Scott made some improvements on the VBA code (first produced by Kasper de Jonge) that analyses your workbooks and shows which columns are using the most memory (Excel 2013+ only).