
The scenario – source data sheets that are structured differently!

Does Power Query sometimes seem too rigid? It’s great for defining a sequence of transformations to clean your data but requires the incoming data to be in a consistent format. Today, I want to talk about a technique for dynamically handling sheets with different structures.

You can download a sample .pbix and the .xlsx workbook that feeds into it to follow along with me here and here.

Be sure to update the query (the Source step path) to point to your own local machine.

Let’s look at my sample data. Suppose my sales manager sends me a monthly sales report structured similarly to that below:

Sample Data

This table looks pretty clean, right? Using Power Query, you can load this sheet, remove the first two rows and columns, unpivot the data, and you have the numbers you’re looking for in tabular format.
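Before the layout changed, a static query was all you needed. Here's a minimal sketch of what that version might look like in M (the file path, sheet name, and step names are illustrative, not taken from the download):

    let
        Source = Excel.Workbook(File.Contents("C:\Reports\Unpivot Reports.xlsx"), null, true),
        #"Jan 2017" = Source{[Item="Jan 2017", Kind="Sheet"]}[Data],
        // Hard-coded assumptions that break the moment the layout shifts:
        #"Removed Top Rows" = Table.Skip(#"Jan 2017", 2),
        #"Removed Leading Columns" = Table.RemoveColumns(#"Removed Top Rows", {"Column1", "Column2"}),
        #"Promoted Headers" = Table.PromoteHeaders(#"Removed Leading Columns", [PromoteAllScalars=true]),
        #"Unpivoted" = Table.UnpivotOtherColumns(#"Promoted Headers", {"Region"}, "Salesperson", "Sales")
    in
        #"Unpivoted"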

Patting yourself on the back, you file the report away until February rolls around. The sales manager sends you the new numbers, you hit refresh, and… an error! Peeking at the data, you see that he changed the layout of the file and also added new salespeople to the matrix!

No leading columns!

Extra Rows!

As enjoyable as it would be just to tell the manager to stick to a consistent format, sometimes this isn’t realistic or feasible. So, let’s look at how we can use Power Query to handle this chaos:

  1.  Identify the header name for the column that demarcates the beginning of the matrix.
  2.  Automatically detect rows up to that header row.
  3.  Automatically detect columns up to that header column.
  4.  Remove the extra rows and columns, leaving you with a squeaky-clean dataset.
  5.  Unpivot the matrix for each sheet and append into a single dataset.

STEP 1: IDENTIFY THE TARGET HEADER ROW AND COLUMN

Looking back at our sample data, you can see that there is a common cell in all our sheets that lies at the upper left of the data we want to extract. In this case, the target cell is the “Region” header. We want to remove all rows ABOVE that cell and all columns LEFT of that cell without hard-coding the location of that cell.

STEP 2: LOCATE THE HEADER ROW DYNAMICALLY

We need some way to identify where the header row starts so that we can remove rows up to that point. This functionality is something I would have thought was built in by default, but surprisingly it is not! Luckily, our friend Ken Puls over at ExcelGuru.ca came up with a solution for this, which I've adapted slightly for our purposes.

Load your first worksheet into Power Query, add an Index column, and filter the table to the target value from step 1:

  1. Add Column > Index column > From 0.
  2. Add a custom step (click the fx button in the formula bar).
  3. Replace the formula with some custom M code: Table.FindText(#"Starting Table w/Index", "Region"). This assumes the step where you added the index is named "Starting Table w/Index"; if yours kept the default name, reference #"Added Index" instead. Name this step "Filtered to Header Row."
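Pieced together, the M for this step looks something like the fragment below (step names follow the prose; adjust them to your own query):

    // ...earlier steps load the worksheet into #"Starting Table"...
    #"Starting Table w/Index" = Table.AddIndexColumn(#"Starting Table", "Index", 0, 1),
    // Table.FindText returns every row that contains the target text anywhere in the table
    #"Filtered to Header Row" = Table.FindText(#"Starting Table w/Index", "Region")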

Load First Worksheet

Note that you'll want to replace the highlighted text with the Target Header name from step 1. Table.FindText() scans the entire table for the target text and returns any row containing that value. Be careful, though: it matches substrings, so make sure the target text doesn't appear anywhere else in your dataset (a stray "Region1" elsewhere would also match "Region").

Now we have our header row isolated, along with the Index value for that row. We'll come back to this "Filtered to Header Row" step shortly.

Filtered to Header Row

STEP 3: DETECT THE COLUMNS TO DELETE

Let’s move on to the more difficult part: dynamically removing leading columns. We have a bunch of columns, and we want to eliminate the first X, up to our target column. We’ll leverage the technique above and add some P3 secret sauce.

First, transpose the #"Filtered to Header Row" step and add an index. That turns our single-row table into a single-column table that we'll use to identify the columns to remove.

  1. Transform > Transpose.
  2. Add an index column: Add Column > Index column > From 1.
  3. To handle blank cells in the header row (always a possibility in dirty data), add a custom column, named "ColumnName", that swaps in a generated name wherever the header is null: Add Column > Custom Column > if [Column1] = null then "Column" & Number.ToText([Index]) else [Column1].
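In M, those three actions come out roughly like this (step names are illustrative):

    #"Transposed Table" = Table.Transpose(#"Filtered to Header Row"),
    #"Added Index1" = Table.AddIndexColumn(#"Transposed Table", "Index", 1, 1),
    // Substitute a generated name for any blank header cell
    #"Added ColumnName" = Table.AddColumn(#"Added Index1", "ColumnName",
        each if [Column1] = null then "Column" & Number.ToText([Index]) else [Column1])

The "Column" & Number.ToText([Index]) pattern is no accident: it mirrors the default Column1, Column2, … names Power Query assigns, so the generated names will still line up with the real table once we promote headers later on.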

Our goal is to delete all columns left of the “Region” column (or above “Region” in the transposed table) so let’s find the index of that row:

  1. Right Click Column1 > Text Filters > Equals > “Region”.

Text Filters > Equals > “Region”

We’re building upon Ken’s technique of finding the Index that corresponds to a target cell but this time with a transposed table. Since we’ll reference this number a couple of times, let’s Drill Down to just the Index number so that we have an easily referenceable step:

  1. Right-Click on the Index value > Drill Down.
  2. Rename that step to “TargetColumnIndex”.
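Those two clicks generate M along these lines (your filter step will carry whatever name Power Query assigned it):

    #"Filtered to Region" = Table.SelectRows(#"Added ColumnName", each [Column1] = "Region"),
    #"TargetColumnIndex" = #"Filtered to Region"{0}[Index]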

Now, jump back and reference the original column list, filtering it down to include ONLY the rows that have an index number less than the target column's index.

  1. Click the fx button to insert a New Step.
  2. Revise the M code to point to the full column list: = Table.SelectRows(#"Added ColumnName", each [Index] < #"TargetColumnIndex").

Let's break down what we're doing here: Table.SelectRows filters the #"Added ColumnName" table down to all rows that have an Index less than the "TargetColumnIndex" value we isolated a couple of steps ago.

Filter Rows

Finally, remove the helper columns keeping only “ColumnName,” and you have a nice list of columns to exclude!
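Putting the last two steps together, the column list comes out something like this (step names illustrative):

    #"Rows Before Target" = Table.SelectRows(#"Added ColumnName", each [Index] < #"TargetColumnIndex"),
    // Keep only ColumnName: a one-column table of the headers we want to delete
    #"Columns to Remove" = Table.SelectColumns(#"Rows Before Target", {"ColumnName"})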

STEP 4: REMOVE THE ORIGINAL EXTRA COLUMNS AND ROWS

We now have all the pieces we need to eliminate our junk rows and columns! Let’s jump back to our original query and clean it up.

Create a new step and change its code to reference the Starting Table:

  1. Click fx > rename the step to "Back to Starting Table" > change the code to = #"Starting Table".
  2. Home > "Remove Top Rows." Enter any value for the parameter.
  3. Edit the M code directly, changing the 2nd parameter of Table.Skip to reference the header row number we isolated earlier: Table.Skip(#"Back to Starting Table", #"Filtered to Header Row"[Index]{0}).
  4. Home > Use First Row as Headers.
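In M, that whole sequence is just three steps (names illustrative):

    #"Back to Starting Table" = #"Starting Table",
    // Skip exactly as many rows as sit above the header row, however many that turns out to be
    #"Removed Top Rows" = Table.Skip(#"Back to Starting Table", #"Filtered to Header Row"[Index]{0}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Top Rows", [PromoteAllScalars=true])

One caveat: if Power Query automatically inserts a Changed Type step after promoting headers, delete it. It hard-codes the column names, which would defeat the dynamic behavior we're building.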

Create new step

Boom! We’ve dynamically eliminated the top rows!

Now for the final act, we'll tweak Table.RemoveColumns (when you do "Remove Columns," Power Query uses this function) to take a dynamic parameter! Remember that list of columns we generated earlier, the ones we want to remove? That's what we'll feed into Table.RemoveColumns.

First, select any one of the junk columns and right-click > “Remove Columns.” Take a look at the formula that Power Query generated.

  1.  Table.RemoveColumns(#"Promoted Headers",{"Column1"}).

We know that Table.RemoveColumns requires a list for its second argument, so we can reference the list of columns to remove from an earlier step:

  1. Table.ToList(#"Columns to Remove").
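The finished formula then looks something like this:

    #"Removed Junk Columns" = Table.RemoveColumns(#"Promoted Headers", Table.ToList(#"Columns to Remove"))

Table.ToList collapses each row of the one-column #"Columns to Remove" table into a text value, giving Table.RemoveColumns exactly the list it expects. (Plain field access, #"Columns to Remove"[ColumnName], would return the same list.)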

Now we’re left with a clean matrix which we can easily unpivot without any problems.

    1. Right-click the Region column > Unpivot Other Columns.
    2. Rename the columns to something useful.
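In M, the finish line looks roughly like this ("Salesperson" and "Sales" are just example names; use whatever suits your data):

    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Removed Junk Columns", {"Region"}, "Attribute", "Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Other Columns",
        {{"Attribute", "Salesperson"}, {"Value", "Sales"}})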

Rename

STEP 5: CONVERT THE MAGIC QUERY INTO A FUNCTION

The final step is to convert your magic query into a function so that you can apply this dynamic transformation to each file/worksheet that needs to be unpivoted. A condensed sketch of what that might look like is below.
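The parameter and step names here are mine, and the downloadable file may differ in the details:

    (SheetData as table) as table =>
    let
        // Step 2: find the header row by locating the target text
        #"Starting Table w/Index" = Table.AddIndexColumn(SheetData, "Index", 0, 1),
        #"Filtered to Header Row" = Table.FindText(#"Starting Table w/Index", "Region"),
        // Step 3: transpose the header row and build the list of columns to remove
        #"Transposed Table" = Table.Transpose(#"Filtered to Header Row"),
        #"Added Index1" = Table.AddIndexColumn(#"Transposed Table", "Index", 1, 1),
        #"Added ColumnName" = Table.AddColumn(#"Added Index1", "ColumnName",
            each if [Column1] = null then "Column" & Number.ToText([Index]) else [Column1]),
        #"TargetColumnIndex" = Table.SelectRows(#"Added ColumnName", each [Column1] = "Region"){0}[Index],
        #"Columns to Remove" = Table.SelectColumns(
            Table.SelectRows(#"Added ColumnName", each [Index] < #"TargetColumnIndex"), {"ColumnName"}),
        // Step 4: drop the junk rows and columns from the original sheet, then unpivot
        #"Removed Top Rows" = Table.Skip(SheetData, #"Filtered to Header Row"[Index]{0}),
        #"Promoted Headers" = Table.PromoteHeaders(#"Removed Top Rows", [PromoteAllScalars=true]),
        #"Removed Junk Columns" = Table.RemoveColumns(#"Promoted Headers", Table.ToList(#"Columns to Remove")),
        #"Unpivoted" = Table.UnpivotOtherColumns(#"Removed Junk Columns", {"Region"}, "Salesperson", "Sales")
    in
        #"Unpivoted"

Invoke it once per worksheet and append the results; for example, = UnpivotSheet(Source{[Item="Feb 2017", Kind="Sheet"]}[Data]) if you name the function UnpivotSheet.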

To see how the final combined query looks with the function, please download this file here.

Conclusion:

Using the technique of identifying a target row and column, you can create extremely powerful and dynamic queries that can handle input files that aren't always perfect. Let's face it: you can't always rely on training or data validation to prevent end users from modifying your perfect templates. Our job as data modelers is to make the end-user experience as friendly as possible by foreseeing and handling exceptions.

If you have any ideas on how you might use this in your reports, please feel free to share in the comments section!

Come for the Techniques, Stay for the Business Value!

We get it:  you probably arrived here via Google, and now that you’ve got what you needed, you’re leaving. And we’re TOTALLY cool with that – we love what we do more than enough to keep writing free content.  And besides, what’s good for the community (and the soul) is good for the brand.

But before you leave, we DO want you to know that instead of struggling your way through a project on your own, you can hire us to accelerate you into the future. Like, 88 mph in a flux-capacitor-equipped DeLorean – THAT kind of acceleration. C’mon McFly. Where you’re going, you won’t need roads.

Narayana Windenberger

Narayana (pronounced “Na-ryan-a”) has a broad analytical background with 8+ years of experience in analyst roles requiring both business acumen and IT know-how. After experiencing first-hand how communication costs can cripple turnaround time in projects, he forayed into self-service BI by learning SQL and deploying SharePoint-based Excel dashboards. Skipping the traditional IT middle man, he became the reporting expert as Business Intelligence manager at FoodChain ID, a leader in food certification. More recently he has dived into the world of DAX, PowerPivot and Power BI and enjoys the “problem-solving thrill” that comes from implementing fully automated dashboards.

Narayana has a high standard of excellence, brings a positive attitude to the task at hand, and is deeply curious to understand how things work. In addition to holding an MBA, he is a Microsoft Certified Solutions Associate in BI Reporting and is a Microsoft Certified Excel Expert.

This Post Has 27 Comments
  1. Wow, this is an excellent post.

    Your explanation of using ‘M’ to solve a common issue (changing file structure) is practical and powerful.

    Thank you for sharing.

  2. Hi Narayana, this looks really useful. Any chance you can post the Unpivot Reports.xlsx file that is in the source? We can then follow with the pbix you posted. Thanks.

  3. Narayana:

    Great post. Going to be very useful.

    Really appreciated both files so we could “play at home”.

    Couple of suggestions:
    1. Query is labeled 2015 vs. the xlsx data, which reflects 2017. May want to rename to Jan 2017
    2. Source in query points to a file location on your PC. Not a big deal to change the pointer to our respective download location, but you might want to mention the need to modify the source location in the query.
    3. Part 2?
    a. update query to capture the Month & Year (cell A1) from the worksheet
    b. create function to extract data from other sheets in the workbook and append; excludes sheets that do not include data (i.e. notes/instruction tabs)

    Thanks for posting

    1. Hi Ham,

      Thanks for your thoughtful suggestions! I have updated the blog post and files as you recommended, and added a completed PBIX file in Step 5 that includes a function that combines all sheets into a single query.

      Enjoy!

  4. it won’t work for me. when i’m trying to put a formula, i got an error: ‘Expression.Error: ‘The name ‘Starting Table w/Index’ wasn’t recognized. Make sure it’s spelled correctly.’’. i tried to copy your formula also, but no use.

    1. You may need to either rename the step in which you added the index to “Starting Table w/Index” or change the “Starting Table w/Index” reference to whatever your step is after you add the index, likely #”Added Index”

  5. This is a post that is sorely needed, but unfortunately there seem to be some errors that are preventing dummies like me from getting the result. Can you update the post to be precise? If you don’t have time, no worries

  6. Narayana; very brainy!! Excellent;

    as suggested by Ham Barnes; could you please add a query showing how to capture the Month & Year (cell A1) from the worksheet

    Cheers!! Soni

      1. Present Condition of the Query is as below:

        Date , Column2, Column 3, Column 4

        28/3/18, Product No, Product, Catalog
        28/3/18, 10AXX10, Chairs, Soni1
        27/3/18, 10AXX10, Tables , Riser1

        Output

        Date , Product No, Product, Catalog
        28/3/18, 10AXX10, Chairs, Soni1
        27/3/18, 10AXX10, Tables , Riser1

        Kindly advise

          1. Hi Narayana; Promote headers: when promoting headers, I want to retain the name of the first column as “Date”

  7. I have another question: during the Promote Headers step, I don’t want to rename one or more of the existing column names. Please advise

    1. Thanks Narayana; your reply on promoting headers with date is very useful. I have a case as below; please advise.

      Present Condition of the Query is as below:

      Reference, Date , Column2, Column 3, Column 4

      R001, 28/3/18, Product No, Product, Catalog
      R001, 28/3/18, 10AXX10, Chairs, Soni1
      R001, 27/3/18, 10AXX10, Tables , Riser1

      Output

      Reference, Date , Product No, Product, Catalog
      R001, 28/3/18, 10AXX10, Chairs, Soni1
      R001, 27/3/18, 10AXX10, Tables , Riser1

      I am able to achieve this for Date based on your advice using PromoteAllScalars=false; how do I do the same for the “Reference” column header?

      Thanks

    2. That’s a tough one and I don’t immediately see an obvious solution since the Reference column stores text values. I think this would require thorough analysis which is beyond the scope of this comment section.

  8. How did you learn M? I am confused half the time when looking at it (and certainly cannot just figure it out on my own).

    1. I’d say I learned mostly by doing — using the out of the box transformations in PowerQuery (i.e. the buttons in the ribbon) and then seeing the code that is generated. In addition, I have experience with some other programming languages like Javascript, PHP, VBA, and SQL, which helps me understand the syntax and structure of the M formulas.

      If you’re starting from scratch you might consider reading “M is for (Data) Monkey: A Guide to the M Language in Excel Power Query.”

      Good luck!

  9. Hi Narayana,

    Thanks for this post which was very thought provoking as I’m trying to make my PowerQueries more robust to changes in the source files generally!

    I followed it through but didn’t find it intuitive so I challenged myself to streamline the process and apply the same steps to both rows and columns.

    This is what I came up with and I’d be interested in your thoughts in case there is any downside to doing it this way….

    Record.FieldValues(_) returns a list of the values in each field on the current row
    List.Contains(list, "Region") will then return true or false to indicate which rows contain it

    So I used 2 steps to do the bulk of the work :

    CheckWanted = Table.AddColumn(#"Previous Step Name", "Wanted", each List.Contains(Record.FieldValues(_), "Region"))

    SkipNotWanted = Table.Skip(CheckWanted, each [Wanted] <> true)

    Then transpose the table and repeat the same process to deal with the columns (which are now rows)

    Then transpose back again.

    It seems to work in fewer steps, and I prefer it because the process for handling extra columns is identical to the process for handling extra rows.

    Also the List.Contains is looking for an EXACT match on “Region” so it avoids the problems if another record contains the text Region1 for instance which would be an issue for Table.FindText

    Are there any downsides to doing it this way because so far it seems to be working just fine.

    Thanks

    1. I think posting has stripped out some characters it thought were HTML !

      The SkipNotWanted checks each [Wanted] IS NOT EQUAL TO true

      but its done with “less than” and “greater than” which have been stripped out of the posted comment
