
By Avi Singh

As I have started working with clients on training/consulting, I am surprised by how often I find tables that have been flattened, i.e. the data tables have been combined (joined) with the lookup tables to produce a really wide table with everything and the kitchen sink in it. Maybe that’s to be expected; after all, that is the modus operandi when working in Excel. That’s the first habit that I try to break. I would nominate this for the top 5 mistakes, but that list already has 6 on it 🙂

Usually it is just a matter of changing how data is pulled from the source to address this issue, going from a flat table to separate Data and Lookup tables (a Star Schema). However, at times your data source itself may only have the flattened version of the data. If you have little control or influence over the data source, you may be stuck with a flat table. Or not!

Flat to Star: Using DAX Query

The first thing that should come to mind to deal with this is Power Query, and we will cover that. But this time let’s try to use DAX as a query language for this purpose. Our sample data is a flattened file which has sales data as well as product attributes. We would like to separate these into distinct Power Pivot tables.

Go from a flat wide table to separate Data and Lookup tables

The hardest part is getting started; for that, watch the video. I will skip to the DAX query used, with a tip of the hat to Marco and Alberto, from whom I have learned so much (and continue to do so).

 

EVALUATE
SUMMARIZE (
    FlatFact,
    FlatFact[ProductKey],
    FlatFact[EnglishProductName],
    FlatFact[Color],
    FlatFact[Size],
    FlatFact[EnglishDescription],
    FlatFact[EnglishProductCategoryName],
    FlatFact[EnglishProductSubcategoryName]
)

EVALUATE: EVALUATE is to a DAX query what SELECT is to a SQL query. Start with EVALUATE, follow it with a DAX expression that returns a table, and that’s your DAX query. If that sounds scary, realize that you are probably already aware of many DAX functions that return a table – FILTER, ALL, VALUES to name a few.
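
For instance, either of the following is already a complete DAX query against the same FlatFact table (the “Red” value is just for illustration):

EVALUATE
VALUES ( FlatFact[Color] )  -- distinct list of colors

EVALUATE
FILTER ( FlatFact, FlatFact[Color] = "Red" )  -- only the rows for Red products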

SUMMARIZE: SUMMARIZE is like the SQL GROUP BY, and I’m using it here to get a distinct set of rows for my Product table.

Once you have your DAX query results in an Excel table, you can even click “Add to Data Model” to send the query results back to Power Pivot as a table.  Isn’t that cool!

Power Query
You can indeed accomplish the same using Power Query. The main steps are removing the unneeded columns and then removing duplicates. Watch the video for more.


Data Gained, Data Lost

While feasible, you should be aware of the downsides of this approach, i.e. building your lookup (dimension) tables based on transaction records. The primary downside is that you would only get data pertaining to the transactions that actually took place. For example, if there are products, customers, employees… that do not appear in any transaction in the period being considered, they would not show up in your lookup tables either. For one, this means you can never look for things that didn’t happen.

Files:
Flat to Star – using DAX.xlsx
Flat to Star – using Power Query.xlsx

Avi Singh

Avi Singh has personally experienced the transformation and empowerment that Power BI can bring - going from an Excel user to building large scale Power BI solutions. His mission now is to share the knowledge about Power Pivot and Power BI.

This Post Has 20 Comments
  1. Avi,
    when you want to retrieve all the distinct values from columns of the same table, you can also write ALL with the column list:

    EVALUATE
    ALL (
        FlatFact[ProductKey],
        FlatFact[EnglishProductName],
        FlatFact[Color],
        FlatFact[Size],
        FlatFact[EnglishDescription],
        FlatFact[EnglishProductCategoryName],
        FlatFact[EnglishProductSubcategoryName]
    )

    However, performance is identical, and SUMMARIZE is more flexible because it can be used in a CALCULATETABLE to filter data. That said, for the transformation described in this article this could be an alternative to save typing one argument 🙂
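
    For example, a quick sketch of that CALCULATETABLE pattern (the “Red” filter value is purely illustrative):

    EVALUATE
    CALCULATETABLE (
        SUMMARIZE (
            FlatFact,
            FlatFact[ProductKey],
            FlatFact[EnglishProductName]
        ),
        FlatFact[Color] = "Red"  -- keep only the products that appear on Red rows
    )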

    1. Thanks Marco. I would say that by now I feel comfortable (at home) when writing DAX Measures (as long as it doesn’t get too high tech). It feels like I know the different options available to me – be it a general approach (Calculated Column vs Measure) or specific Functions (CALCULATE or do I need to employ a FILTER). And I can intelligently choose between them, knowing the tradeoffs in general.

      However, when it comes to querying, I feel out of place. Your pointers here really help. I am hoping that as I use it more and more I will reach the same comfort zone (or close enough 🙂)

      One thing I am struggling with is transitioning my SQL Query mindset. For example, I quite favored Common Table Expressions when writing SQL Queries, and in DAX Query, with what I have tried so far, I can’t get it to work that way.

      1. CTEs can be recursive; DAX is not recursive. SQL has aliases, DAX does not. SQL has functions, DAX does not. There are many language features I hope will appear in the next DAX version 🙂

  2. Avi, just an FYI, in Power Query you can change the default loading options for your queries if you want, simply go to the Machine Settings in the Power Query ribbon and click on Options. You will see the Default Query Load Settings where you can make the change.

    Great stuff and thanks for sharing!

    1. Dan, you are my savior. For the rest of the folks, here is the gem. It saves me from having to gripe each time I load data using Power Query:
      Power Query > Options (Machine Settings )

      p.s. I still wish things were more user-friendly with anything Microsoft creates. For example, it would have been lovely (more discoverable) to have a checkbox in the Load box to say “set this as my default”.

      1. Hi Avi,

        Good post.

        Just because it took me a good 10 minutes to figure it out, here are the steps to find this option in Excel 2016:

        Data > On the Get & Transform section click New Query > Query Options

        Or in File Menu of the Query Editor > Options and settings > Query Options

        Regards,

        Fernando

  3. Great post – it will help me a lot as I work in an environment where flat files are (for now) the only option to access big corporate data.

    I have tested the approach described above (using DAX) and it works fine, but one thing that would help me a lot is how to get rid of blank/zero values in the summarized table. In the flat file I use as the source, the main column that I will use as the key in the final Power Pivot table has occasional blanks; I am obviously not interested in those rows in the final table of distinct rows. Is there any way to “remove” those in the same step where EVALUATE + SUMMARIZE happens?

    Thanks!

    1. To avoid blanks, you can use FILTER on the table passed as the first parameter to SUMMARIZE, as below:
      EVALUATE
      SUMMARIZE (
          FILTER ( FlatFact, NOT ( ISBLANK ( FlatFact[ProductKey] ) ) ),
          FlatFact[ProductKey],
          FlatFact[EnglishProductName]
      )

  4. Hi all,

    Those posts are really great for someone starting with PowerPivot (such as myself!). In this instance though, I couldn’t quite figure it out.

    It seems obvious to everyone that it’s better to have one fact table and several lookup tables rather than one flat table. Unfortunately, the ‘why’ part eludes me. I assume it’s performance related (why? by how much?)? Because if not, I’m not sure I want to do that. I enjoy visualizing everything on the same row when I can.

    For instance, if I have 5 columns: date, time, shop, sales, ‘overall daily sales’. Now I would be tempted to make 2 tables: the first one with date, time, shop, and sales; the second one with date and ‘overall daily sales’ as a lookup table with a relationship.

    But if there’s no big performance gain, why would I? Especially since it’s less flexible for other aspects, like for instance exporting it as a flat CSV table once you have added calculated columns, for someone else to use in other tools.

    Antoine

    1. Antoine,

      A few points

      The lookup table, in your case the second table (“with date and ‘overall daily sales’”), would not contain any data, only text attributes.
      Actually, in your example, I would look to create 3 or perhaps 4 tables:
      Data Table: Data (Sales) + Keys (Date, Time, Shop ID)
      Lookup Tables:
      Date
      Time (Optional: only if reports need to slice by time, e.g. Which hour of the day registers greatest sales revenue?)
      Shop

      I enjoy visualizing everything on the same row when I can…: Visualization is separate from Modeling. You can have a Flat or Star Schema model and still visualize data in a flat row.

      Why?: At times having a flat table would be preferred. It comes down to tradeoffs, and there are a few involved here: Performance, Usability, and Scalability of the model. For the simple models users start with, flat may be the way to go. But as soon as you step into production models of even slight complexity, flat tables would hurt you on all three counts. I have a future post planned to delve deeper into the “Why?” and explain the tradeoffs. So stay tuned.

  5. We just did the exact opposite to get the best performance. Flattening everything appears to fly faster than any modeling this product provides. Any reason we wouldn’t model everything in the dwh?

  6. Hi Robi, Avi, I have the following problem: when I want to import data to a table by creating a pivot and indicating the source of the data, I don’t have the option to pick the table output, and in the end a pivot comes up without the opportunity to write DAX code. How is this possible?
    thnx

  7. How do you add multiple filter layers? Filter by date, then by non-blank, then by a grouping, etc.? Or are there resource materials that I could access to better learn the DAX Query language outside PowerPivot? I am having a hard time getting my mind to shift from the PowerPivot logic.

    1. Nathan, multiple filters could be applied using CALCULATETABLE or using && in FILTER. However, I’ve mostly moved away from this approach and towards Power Query. I bring the DAX table into Power Query using a simple “EVALUATE tablename” statement and then use Power Query for all other transformations. I find that much easier for myself and also for those to whom I transition my work.
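
      For example, a sketch of stacking conditions with && inside FILTER (the “Red” value here is just illustrative):

      EVALUATE
      SUMMARIZE (
          FILTER (
              FlatFact,
              NOT ( ISBLANK ( FlatFact[ProductKey] ) )
                  && FlatFact[Color] = "Red"  -- add more conditions with &&
          ),
          FlatFact[ProductKey],
          FlatFact[EnglishProductName]
      )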

  8. I think I am suffering from what this article is discussing LOL!! But I can’t quite work out how to fix the issue. I’ve read through the Power Pivot and Power BI book and still don’t see how to get the task done.
    Problem Statement
    I need to unflatten the data. Consider the following simple example (real data is way more complex).
    Column A has a unique key; let’s call it Customer Request ID.
    Columns B, C, D and E are for Task 1 and are the User ID, Task Status, Start Time and End Time for the task.
    Columns F, G, H and I are the same but for Task 2.
    Columns J, K, L and M are the same but for Task 3.
    Columns N, O, P and Q are the same but for Task 4.

    I wish to build a new table with the following columns:
    Customer Request ID
    Task Number
    User ID
    Task Status
    Start Time
    End Time

    How do I do this? I need to read the source table multiple times and somehow append records to the new table. I also need to only create records where the User ID is not blank, and perhaps where the status is set to particular values.

    I feel like I’m missing something obvious.

    Help!!!

    1. If I understood what you’re looking for…
      a) Create a PQ query which connects to your data as it exists (do not load this into Excel or Power Pivot; leave it as a Connection)
      b) Right-click and Reference this query to create a new query based on a). In this query, only keep the Task 1 columns
      c) Repeat step b) to create queries for Task 2, Task 3 and Task 4 (only keeping those columns in each query)
      d) Append the queries for Task 1, Task 2, Task 3 and Task 4

      Hope that helps.

  9. I have a question/problem with the “EVALUATE” statement (not tied to the “from-flat-to-star” topic discussed here): it is easy (since Todd showed me!) to query a certain Measure within the EVALUATE statement, but if I query a Measure which uses a TOTALYTD or DATESYTD expression, those calculations do not arrive in the query result set of the EVALUATE statement… Does anybody know about that, and how to solve it?
    Thanks a lot for your help!
