Retailer Overlap Analysis Using PowerPivot

March 16, 2010

 
Clint Pondering Where to Buy More Cigars

 

“The Rite Aids on one side of town, the Walgreens on the other, and me right in the middle.”

-The Shopper with No Name (and a Fistful of Dollars)

 

One of the coolest things about my new job at PivotStream is all of the cool new data sources I get to play with.  I thought I’d take a brief break from the Rank member posts and share a little bit from one of those sources.

Retailer Overlap Report

I just finished cooking up this report – click for full-size version.

All Retailer Intersections with PowerPivot

What does that show?

It shows the top 15 US retail chains by number of stores, and their competition against one another by ZIP code population.

For example:  take the first row, Albertsons.  Of all the people who live near an Albertsons, 65.9% of them also live close to a CVS, but only 4.7% live near a Rite-Aid.

Cool, huh?

Marketing versus competitors

Let’s say you are in charge of advertising and promotions for a large grocery or drugstore chain.  One of your primary jobs is to lure consumers into your stores as opposed to your competitors’.  Every day you receive a collection of all the specials, advertisements, etc. that your competitors are running in various regions of the country.

Naturally, you want to respond to their efforts.  But you can’t focus on them all equally, or you’d fall hopelessly behind.  So, which ones should you pay the most attention to?  This is the kind of thing that would help you.

Sliceable too!

OK, let’s say you work for Safeway, and you only run advertising for them in the South region.  How useful is the nationwide report above?  Hard to say.

For instance, nationwide, Walgreens competes with you for 47.3% of your customers, and Rite-Aid is second at 36.1%.  Does that hold up in the South?

One click and you have a brand new report:

South Region Only Retailer Intersections with PowerPivot

And look, Walgreens falls to 14.1%!  Rite Aid climbs to 52.5%…  but CVS now checks in at a whopping 73.5%!

So no, the national report would mislead you.  Same thing would be true if you ran all newspaper advertising nationally for Safeway, but you were ignoring an ad that CVS was running in the South.

Also Sliceable by Income Range!

Lastly, let’s say you work for ShopRite, and you control advertising in the South region.  Walgreens is running a monthlong special on Product X, and you happen to have the ability to run a better special on Product X…  but should you?

Well, you consult the report above, and it shows you that Walgreens only competes with you for 34.9% of your customers.  So you decide not to run a special on Product X, since that costs you a lot of money for not enough benefit.

But suppose you knew that Product X was primarily targeted at households making between $50K and $75K.  Slice by that income and you get:

South Region Only Specific Income Range Retailer Intersections with PowerPivot

Hey look, Walgreens competes with you for 61.6% of your customers in that income range. 

So that ad campaign might be targeted specifically at stealing business from ShopRite, for all you know :)

How did I do this?

It’s complicated, and once again pushed the bounds of what I know about DAX measures.  It starts out a lot like a Market Basket Analysis, which takes some work to explain, but then it takes another twist.

I’ll explain in subsequent posts :)

Flexibility!

But the crucial thing to note here for now is that this amount of flexibility is impossible in Excel by itself.

I mean, if I wanted to build a report that showed overlap between any two retail stores, with particular variables like region and income level held constant, I can do it.  Takes awhile but I can build one.

But as soon as someone says something like “I want to see this report filtered down to a different region,” well, you’re doing surgery on your original report.  Sometimes that takes almost as long as it did the first time around.

And when you’re done, well, now you have TWO reports to maintain.  Fun fun!

That’s the coolest thing about this report for me:  it remains flexible.  Its consumers don’t have to ask me for new versions of the report.  They can just click to filter.

And if they want new variables added, like racial demographics, that’s just a matter of adding new calculated columns to the PowerPivot window, not a new report.  I might even do that for next time.


Rank continued – dealing with unknown members

March 11, 2010

 
Unknown Player Right Next to Edgerrin James

 

TUCO:  “There’s no name on it!”

BLONDIE:  “Well, there’s no name here, either.”

 

 

Ode to YouTube HD

When I used to work on Bing and would look at search logs, I was always puzzled by some of the search strings that went by.  Well today I gave them one of my own –“Clint Eastwood holding rock,” among others.  I hope that confuses someone.

YouTube in HD, though – you are my new best friend for finding obscure screenshots from movies.

The problem from last time

Remember my Rank report that started with #2?

Number Two Tops the List

Filtering out Unknown PlayersWell, the reason for that is that I have some unknown players.  I’ve been filtering them out using the Row filter feature, shown here at right.  Note the last checkbox in the list, with the blank name unchecked.

(Also note the appearance of Zeron Flemister, two checkboxes from the bottom.  My friends and I had a tradition of always picking him in the last round of fantasy drafts just because we thought his name was so funny.)

 

 

 

 

 

Anyway, if I clear that filter and try to include that blank player in the pivot, I get the lovely error below.

image

Actually, that’s a pretty good error string, since it tells me pretty much exactly why the measure failed to evaluate.

Question 1: Where do the blank players come from?

My first hunch here was that my CleanPlayers[Full Name] column contained some blank values.

But then I remembered that the whole point of the CleanPlayers table was to make sure all unnamed players were removed.

So if that column does not contain any blank player names, why on earth do I get a blank player name in the list when I put that column in my pivot?

Answer: The Plays table sneaks them back in

alien-ceiling-large

 

“They must have found another way in, something we missed!”

-Ripley

(Best picture I could find.  Seriously.)

Here’s the thing:  when I removed all of the “blank name” players from my Clean Players table, I didn’t go back and remove all rows from the Plays table that involved those players.

So those rows from the Plays table now point to *nothing* in the Clean Players table.

To demonstrate, I sorted both the Plays and CleanPlayers table by [PlayerID]:

Plays Table First Player ID is 4 CleanPlayers Table First Player ID is 70

The first player in the Plays table is ID=4.  The first player in CleanPlayers is ID=70.

So when I add a measure from the Plays table (like Rushing Yards, or even my Rank measure), that forces a blank/unknown item to appear in my Clean Players list, even though there aren’t any such blank rows in the CleanPlayers table.

Step one of the fix:  find the error

OK, I spent a LONG time trying to figure out WHERE the error occurred in my measure.  Along the way I sent mail to my buddies in Redmond and begged them for a “step into” debugger feature.  But I know that it’s an incredibly difficult feature to build.

Let’s make a long story short.  Remember that second filter clause from the last post that I wasn’t sure was needed?  The one that made sure we weren’t comparing a player to himself?  Well, it’s actually the place where the error occurs, too:

   Players[FullName]<> Values(Players[FullName])

I remove that from the formula and now just have this for my rank measure:

   COUNTROWS(
     FILTER(
       ALL(Players[FullName]),
       [RushYards](Values(Players[FullName])) < [RushYards]
     )
   ) + 1

And that no longer errors with the unknown player in the pivot:

DAX Rank Measure that Does Not Error

Good news: not only does the measure no longer error, I verified that it returns the exact same results for all 4,000 players as the measure did before I removed the second clause.  It was like my measure had appendicitis – a critical problem in a needless organ :)

Bad news:  That unknown player is still grabbing the top rank, demoting Edge to 2.

Good news:  I figured out how to fix that.

Bad news:  Man is the fix ugly.  I’m going to see if there’s a better way before I blog about my Rube Goldberg solution.

I really did intend to finish this off for you guys today but don’t want to spread bad methodology.  Stay tuned.


Writing a RANK measure (and living to tell the tale)

March 8, 2010

 
Beavis_and_Butthead

“Number One, I order you to take…  a number two.”

-Captain Jean-Luc Butthead

 

 

Before I get started, do you have any idea how hard it is to find cartoon images from the pre-Internet era?  I just lost ninety minutes, first looking for the Far Side cartoon of “Custer’s Last Group Photo” (where everyone is holding up the “we’re number 1!” finger – get it?) and then struggling to find a version of the Beavis and Butthead image above at a better size.  No dice.  OK, enough of that…

DAX v1 has no RANK() function

That’s right.  If you want to write a pivot measure that tells you where a particular store, or product, or customer ranks according to another measure, and according to current filter context, well, a RANK() function would be nice.  But there wasn’t time to implement it in PowerPivot v1.

Never fear, there’s a workaround…

OK, yes, there’s a workaround – the FAQ says so right here

But “never fear” might be an overstatement.  In fact, I should pause right here and say…

PowerPivot is simple, this technique is not

I’ve been doing this blog since October.  I’ve covered a lot of ground.  If you read this post and fear that you have to understand it in order to get great results with PowerPivot, well, then I’ve done you a great disservice.

Because in many ways, PowerPivot is actually easier than Excel itself.  If you are new here, I strongly recommend going back to the Great Football Project and starting from the beginning.

OK, back to the power technique

I don’t know about you, but personally, my first impression of the formula referenced in the FAQ sounded a lot like “double yoo tee eff.”  Check it out:

  COUNTROWS(
      FILTER( 
          ALL(DimCustomer[CustomerName]), 
          DimCustomer[CustomerSales]
              (Values(DimCustomer[CustomerName]))
                  
< DimCustomer[CustomerSales]
          && DimCustomer[CustomerName] <>  
                   Values(DimCustomer[CustomerName]) 
      )
  )
+1

Explaining how it works

That formula has been bothering me for a long time.  I knew it wasn’t going to be easy to break it down, but on our little DAX adventure together, we must leave no stone unturned, even the really heavy ones.

So today, I dove in.  And whenever I get uncomfortable, I run to familiar ground.  Which means, of course, oh yes…

The Great Football Project Returns!

Quick!  Someone get me some early-2000’s running backs!

  94801734-marshall-faulk   Edgerrin-James_1 Shaun_Alexander

Phew, I feel better already.  OK, let’s dig in.  A long time ago I defined a [RushingYards] measure using the CALCULATE function.  Let’s park that in a pivot:

PowerPivot Rushing Yards 
I want to write a measure that gives me the rank of each player, by Rushing Yards, no matter how I have sliced or filtered my pivot.

First step:  convert the sample formula over to my table/column names

When I substitute my football workbook’s names into that formula, I get:

   COUNTROWS(FILTER(ALL(Players[FullName]),
   [Rushing Yards](Values(Players[FullName]))
   <[Rushing Yards] && Players[FullName]<>
   Values(Players[FullName])))+1

I won’t bother pretty-printing that monster just yet, because I’m going to dissect it step by step.

Outside functions:  COUNTROWS + 1

The formula is basically:

   COUNTROWS(all kinds of crazy stuff) + 1

Which starts to make a little bit of sense, if you’re feeling sleuthy.  Really, what this is doing is:

   COUNTROWS(# of Player Rows where Player has more yards) + 1

Make sense?  Find how many players outrank the current player, add one, and that’s your rank.  OK, so how does the middle stuff determine the number of players that outrank the current player?

Inside functions – FILTER and ALL

   FILTER(ALL(Players[FullName]),
   [Rushing Yards](Values(Players[FullName]))<[Rushing Yards]
    && Players[FullName]<>Values(Players[FullName])))

Recall that the syntax for FILTER is:

   FILTER(Table Expression, Filter Clause)

In this case, our table expression is:

   ALL(Players[FullName])

Which makes sense.  Remember, on a particular row of our pivot above, the Players table is going to only contain the row from that player, for purposes of measure calculation. 

Ranking Edgerrin James against himself, for instance, isn’t very interesting, unless you want to compare the Colts version to the Cards version, or heaven forbid the Seahawks version.  (The wear and tear on RB’s really is astounding).

So in order to have any players to rank the current player against, we need a table where the filter has been cleared.  That’s why we use ALL.

That leaves us with the Filter Clause, and it deserves its own section.

Inside Functions – the “Filter Clause” parameter to FILTER

OK, so we have a table that consists of all player names.  And we want to filter that down to just those players that have more Rushing Yards than the current player.

The FILTER function goes through that table of player names, one by one, and for each row, evaluates the filter clause to determine whether that row is “in” or “out.”

Our filter clause is really two clauses connected by the logical “and” operator – &&.  So a row must meet BOTH criteria to be included.

First clause:

   [Rushing Yards](Values(Players[FullName]))<[Rushing Yards]

First Row of PivotThis gets tricky, so let’s anchor it in an example.  Let’s say we are evaluating this measure for the very first row of our pivot, Edgerrin James:

Player Table

And the FILTER function is going to step through  our table of all player names, at right.  (There are about 4,000 players in my players table).  So, FILTER starts with Rabih Abdullah.

The right side of the clause above, [Rushing Yards], is the [Rushing Yards] measure, evaluated in the context of this source row, Rabih Abdullah.  Well, he totaled 172 yards in his rushing career.

But what about the left side of the clause:

  [Rushing Yards](Values(Players[FullName]))

That uses the context from the *pivot*, not the source table.  the VALUES function returns the list of valid values of the specified column, according to current pivot context.  Since the specified column is on the rows axis of our pivot, it only returns one value in this case, which is Edgerrin James.

Edge has 9,842 rushing yards.  So the clause becomes:

   9842 < 172

Which is of course FALSE.  Rabih’s row gets excluded by the FILTER function.

Wash rinse, repeat.  And then repeat.

FILTER then moves on to the next source row:  Khalid Abdullah.  The pivot context remains the same (Edge James), so the left side of the comparison is still 9842.  Khalid has less than that, so again, excluded.

Well, in my data set, no one has more yards than Edge, he’s the top dog in my data set.  So when FILTER finishes its pass, there are no rows left included.  COUNTROWS then returns 0, the + 1 from the end of the formula gets added, and Edge gets his proper rank of 1.

Second Row of PivotThe measure then moves on to the next pivot context, the second row, which is Curtis Martin:

And then FILTER starts over again, running through the entire Players table again, row by row, starting with our friend Rabih.

Two loops tucked inside each other.  Wow.

The other half of the filter clause

Remember this?

    && Players[FullName]<>Values(Players[FullName])))

That just makes sure we don’t count the player against himself for purposes of rank.  Off the top of my head, I’m not sure that’s 100% necessary.  I’ll have to noodle on that a bit.

OK, let’s see it in action

I add my new measure to the pivot, and I get:

Number Two Tops the List

What the heck?  Is my measure broken?

No, it’s not.  I’ll explain what’s going on in my next post, since this one has run so long already.


ALL() Revisited

February 23, 2010

 
No Filter

 

“He has no filter. He says whatever comes through his mind. He never edits himself.”

 

 


Revisiting ALL()

Yes, I’ve used ALL() a number of times already.  But I’ve never covered it directly – it’s always been a side topic to something else.  So I thought it was time I went back and made sure I had a specific post on it.

The short version is: ALL() removes filters from specific fields during a measure calculation.  It “De Niros” your measure… but in a controlled manner.

Syntax

    ALL( table_or_column, [column1], [column2], etc.)

So the first parameter is required, and can either be a single column or an entire table.

After that, you can specify as many other columns as you’d like…  as long as the first parameter didn’t specify a table. If your first  param is a table, you can’t specify additional columns.

Also note that even when specifying columns, the ALL() function requires you to always include the table name.

Lastly, all columns listed must be from the same table.

Legal Examples:

  1. ALL(Table1)
  2. ALL(Table1[Column1], Table1[Column2])

Illegal Examples:

  1. ALL([Column1])
  2. ALL(Table1, Table1[Column3])
  3. ALL(Sales[Column1], Customer[Column2])

When to use ALL() – Example #1

You never use ALL() by itself.  You always use it in the context of another function.  It’s all about setting context for a calculation.  It is not a calculation unto itself.

The simplest place to use ALL is in a measure that always shows you a grand total of something, regardless of what filters are in play.  Revisiting the Temperature Mashup demo for a moment, here’s a pivot sliced by Temperature, showing a normal Sum of Order Quantity:

Sum of Qty Only 
Now I add a measure that sets the Temperature table to ALL:

    [AllTempsQty] = CALCULATE(
                      SUM(Sales[OrderQuantity]), 
   
                    ALL(Temperature)
                    )
 

That ends up looking like:

PowerPivot Measure using ALL

(If you need a refresher on CALCULATE, which is probably the most useful function in all of DAX, check out this post.)

So, on Hot days, we sold a total of 2,245 Accessories.  And overall, across all temperatures, we sold 36,092 Accessories.

But note that the numbers are different across Category for both measures.  That’s because [AllTempsQty] did not use ALL against the Category columns (or table).

Cool.

Example #2 – % of Total

OK, but in many cases, you don’t actually want to see [AllTempsQty] in the pivot.  You just want to compare the filtered version of a measure to the unfiltered version.

So we can define something like:

  [PctOfAllTempsQty] = SUM(Sales[OrderQuantity])  / 
                       CALCULATE(
                          SUM(Sales[OrderQuantity]),
                          ALL(Temperature)
                       )

Which gives us:

Percentage of Total PowerPivot Measure using ALL function

Note that I switched Excel’s number formatting for this measure to be Percentage, keeping me from having to include * 100 in my measure.

Of course, if I wanted to keep my [AllTempsQty] measure, then I could have defined my new percent total measure as the following instead:

  [PctOfAllTempsQty] = SUM(Sales[OrderQuantity]) / 
                       [AllTempsQty]

                    

And that would have produced the same results.

Example #3 – Using ALL() in places other than CALCULATE()

If we revisit the post on derived measures, ALL() is an excellent modifier to use against an existing measure.

Let’s take a measure I defined in the Temperature Mashup, the [Qty per Day] measure:

Quantity per Day

And then we define a new measure:

  [QtyPerDayAllTemps] = [QtyPerDay](ALL(Temperature))
                 

That yields:

Using ALL function to derive another measure

You like?  I like :)


The Superheroes of DAX, and derived measures

February 19, 2010

DAX Heroes

 
 

“Did someone say DAX?”

 

 

 

I wanted to start today’s post with a quick thank-you to the guys who tirelessly answer my DAX questions.

Marius Dumitru, Howie Dickerman, and Karen Aleksanyan.  You guys rock.  Thanks again for your patience educating me.  I’ll do my best to pay it back by sharing knowledge here.

(There’s no resemblance in real life to these three pictured above.  Well, except perhaps for Mr. Furious.  But it’s just an awesome movie and had three superheroes, which is rare – superheroes tend to be packaged in even-numbered groups.)

Today’s Tip:  Deriving measures from others

Today’s tip is a short lesson but boy is it helpful. 

Let’s say you have a fairly sophisticated measure like the “Number of unique products sold at number of stores” measure that I demonstrated the other day using SUMX().  And now you want a new version of that measure that only counts a specific type of product, as opposed to all products.

Copy/Paste the original measure formula?

Do you copy/paste the formula for the original measure into a new measure to make your edits?  I recommend against it, for two reasons:

  1. You are now nesting an additional level of logic inside an already sophisticated formula – this makes things less readable and harder to debug (especially since v1 PowerPivot does not let you “trace into” a formula step by step.
  2. Maintenance-wise this is a hassle going forward.  If you ever change the base measure, you need to change all copies of it as well.  Tedious.  And it’s even worse if you forget to change one of the copies.

Better answer:  use FILTER() against the original measure

You can avoid both of those problems with the following syntax:

    [New Measure] = [Original Measure] (<Table Expression>)

The simplest example of a table expression is to use the FILTER() function.  Let’s say we want our original measure, but only to count transaction of product type = fruit:

    [New Measure] = [Original Measure] (
                      FILTER(DataTable[ProductType]=”Fruit”)
                    )

Try it out!

Trust me, it works :)  And you will love it :)

Other flavors

I’m pretty sure that any expression which returns a table is valid as the <Table Expression>. 

EX:  I used CALCULATETABLE and it also works great.  CALCULATETABLE is like FILTER but allows for any number of filter params, whereas FILTER only gives you one.

    [New Measure] = [Original Measure] (
                     CALCULATETABLE(
                      DataTable[ProductType]=”Fruit”,
                      DataTable[Weekend]=”True”
                      )
                     )

Have fun everybody.  Catch you next week.


SUMX() – The 5-point palm, exploding fxn technique

February 16, 2010

 
Pai_Mei

 

 

“AGAIN!”

-Pai Mei

 

 

 

SUMX() – the great iterator

Have you ever written an array formula in Excel?  (Don’t worry, most people haven’t).  Have you ever written a FOR loop in a programming language?  (Again, don’t worry, there’s another question coming).  Have you every repeated something over and over again, slowly building up to a final result?

That’s what SUMX() does.  It loops through a list, performs a calc at each step, and then adds up the results of each step.  That’s a pretty simple explanation, but it produces some results in pivots that are nothing short of spectacular.

Anatomy of the function

     SUMX(<Table>, <Expression>)

THE BRIDE: “What praytell, is a five-point palm, exploding function technique?”
BILL: “Quite simply, the deadliest blow in all of the analytical martial arts.”
THE BRIDE: “Did he teach you that?”
BILL: “No. He teaches no one the five-point palm, exploding function technique.”

That’s kinda how I feel about the description of SUMX in the Beta release:  “Returns the sum of an expression evaluated for each row in a table.”  It merely hints at the power within.

Oddly, the best way to show you what I mean is to start with some useless examples and then build up to useful ones.  For all examples, I will use the following simple table, Table1:

Sample Table for SUMX Table1 

Useless Example #1:  By the whole table

     SUMX(Table1, Table1[Qty])

Returns:  35, which is the total of the Qty column.  Might as well just use SUM([Qty]).

Why:  Well, it iterates over every row in Table1, and adds up [Qty] at each step, just like the description says it would.

Useless Example #2:  By a single column

     SUMX(Table1[Product], Table1[Qty])

Returns:  An Error

Why:  Table1[Product] is not a Table, it’s a Column.  And SUMX demands a Table as the first param.

Useless Example #3:  By distinct values of a column, sum another

OK, I’ll wrap the [Product] column in DISCTINT(), since that returns a single-column table:

     SUMX(DISTINCT(Table1[Product]), Table1[Qty])

Returns:  An Error

Why:  [Qty] is not a column in the single-column table DISTINCT([Product]).  Only [Product] is.  Why did I even try this?

That’s where I gave up awhile back.  Until I learned…

Almost-Useful Example:  The Second Param Can Be a Measure!

And even better, that measure CAN access other columns even if you use DISTINCT.  First let’s define a [Sum of Qty] measure:

     [Sum of Qty] = SUM(Table1[Qty])

And then re-try the previous example with the measure, not the column:

     SUMX(DISTINCT(Table1[Product]), [Sum of Qty])

Returns:  35 Yes, the total, again.   But this time, the “Why” is worth paying attention to.

Why:  Let’s step through it.  Remember, for each value of the first param, SUMX evaluates the expression in the second param, and then adds that to its running total.
Distinct Products

Step One:  SUMX evaluates DISTINCT([Table1[Product]) which yields a single-column table of the unique values in [Product]:

 

Step Two:  SUMX then filters the Table1 (not just the [Product] column!) to the first value in its single-column list, [Product] = Apples.

Table1 Filtered to Apples by SUMX
Then it evaluates the  [Sum of Qty] measure against that table, which returns 17.

Steps Three and Four:  The process repeats for Oranges and Pears, which return 13 and 5:

Table1 Filtered to Oranges by SUMX

Table1 Filtered to Pears by SUMX
Last Step:  SUMX then adds the three results it obtained:  17, 13, and 5, which yields 35.

A lot of work to get the same result that the [Sum of Qty] measure can get on its own, but now that you know how it operates, let’s do something else.

And now, the Useful Example!

Let’s define another measure, which is the count of unique stores:

     [Count of Stores] = COUNTROWS(DISTINCT(Table1[Store]))

For the overall Table1, that returns 2, because there are only 2 unique stores.

Let’s then use that measure as the second param:

     SUMX(DISTINCT(Table1[Product]), [Count of Stores])

Distinct Products

Step One:  same as previous example, get the one-column result from DISTINCT:

 

Step Two:  filter to Apples, as above:

Table1 Filtered to Apples by SUMX

…and the [Count of Stores] measure evaluates to 2 – 2 unique stores have sold Apples.

Step Three:  Oranges

Table1 Filtered to Oranges by SUMX

…again, the measure evaluates to 2.  2 unique stores sold Oranges.

Step Four:  Pears

Table1 Filtered to Pears by SUMX

…hey look, only one unique store sold Pears.  So the measure evaluates to 1 here.

Last Step:  Add them all up.  2 + 2 + 1 = 5.  SUMX returns 5.  This basically means that there are 5 unique combinations of stores and products that they sell.

Why is that useful?

Well, I can’t share the precise case I was working on, because it belongs to a reader’s business.  But trust me, you are going to find yourself wanting this sooner or later.

Things to keep in mind

  1. SUMX responds to pivot context just like anything else.  So if you slice down to just a particular year, your results will reflect only what Stores sold in that year.
  2. AVERAGEX, MINX, MAXX, and COUNTAX all work the same way.  So if you want to iterate through just like SUMX but apply a different aggregation across all of the steps, you can.  Those would return (5/3), 1, 2, and 3, respectively in our example.
  3. The fields referenced in SUMX do NOT have to be present in your pivot view.  In my case, SUMX was working against [Store] and [Product].  But my pivot could just be broken out by [Region] on rows and sliced by [Year], and the measure still works.  (I like to think of it as a stack of invisible cells underneath each pivot cell that you can see, and SUMX is rolling up a lot of logic across those invisible cells to return a simple number to the top cell you can see.)

More to come!

Yeah, I am not even done with SUMX.  Like Jules told you, it’s some serious gourmet DAX :)


DAX: I’ve been shortchanging you

February 15, 2010

You Had Me at Calculate

Mmmm! Daaaamn, Jimmie! This is some serious gourmet DAX! Usually, me and Vince would be happy with some freeze-dried Taster’s Choice right, but he springs this serious GOURMET DAX on us!

-Jules Winfield

 
 
 
What I’ve seen so far is great, and yet…

The things I have done with DAX so far are a HUGE boost to what I can do in Excel formulas alone.  Here are some of the functions I have explored to date:

  1. RELATED – a faster, simpler VLOOKUP.
  2. CALCULATE – the big Kahuna.  A version of SUMIFS that I can use in a pivot measure to remove, alter, or override pivot filters and just aggregate the source rows that I want.
  3. ALL – the ability to strip pivot filters from certain fields or tables during a measure calc, enabling very flexible custom ratios when combined with CALCULATE.
  4. DISTINCT – returns the set of unique values of a particular column within the current pivot filtering context, very useful (so far) when combined with COUNTROWS.
  5. COUNTROWS – no more need for finding the unique-valued column in a current context, you can just count the source rows that meet current filter criteria.

But I’ve seen some things over the past week that made me realize:  I’ve been lax…  about DAX.

I just found a cupboard full of “gourmet DAX”

Funny thing folks:  in many ways, I am learning right alongside of you when it comes to DAX.  When I left Redmond in August, DAX was just starting to appear in the internal builds.  So while I was getting a little bit of an advanced look, in other ways I’ve been watching the DAX movie in progress just like everyone else.

Last week, a reader asked me a series of questions basically saying “can you do this in DAX?”  These were crazy things they wanted to do – VERY useful stuff for them, and totally understandable why they wanted it.  And my initial answer was “um, no, DAX can’t do that.”

I was wrong.  DAX very much CAN do those things.  And many, many other things I had not considered.  I’m a little embarrassed by holding out on you guys, but I’ve been holding out on myself, too.

I mean, I knew I had not yet dived into the time intelligence stuff.  Other people are doing a decent job of that, and my intent was to build up to it, slow and steady.  But I was missing other things, too.

New resolution:  at least two DAX posts a week

This is a long time coming really.  It’s time to devote a much bigger chunk of my efforts toward educating everyone – myself included – on the capabilities and finer points of DAX. 

Here’s a quick preview of what’s in store for us:

  1. SUMX() – the FOR loop of DAX, kinda like Array Formulas in Excel
  2. CALCULATETABLE() – generate a custom filtered table, on the fly, and then use that as the table argument to another function
  3. Using one measure as an input to defining another – self explanatory in some ways, but you won’t believe how often this gets you out of a jam.
  4. Using a different formula at one level of a pivot vs. another – just one example of something you can do by combining some of these techniques.

Seriously, I am excited.  I’m like a kid who’s discovered a secret room in the castle.


Jewelry Mashup for Alpha Geek Challenge

February 11, 2010

PowerPivotBookHello everyone. Rob invited me to do a guest post to talk about my Wives vs. Girlfriends entry in the Alpha Geek Challenge. That is a politically incorrect and provocative title for a real data analysis mash-up that I did with PowerPivot.

I spent the last 33 days locked in a room finishing five manuscripts for books about Excel 2010. One of those books will be PowerPivot For the Data Analyst from QUE. In the course of writing that book, I had to generate a lot of sample data sets to use as examples.

Some interesting things along the way:
1) I had some daily sales data for 2 ladies boutique stores. One is in a mall. The other is in an airport located 10 miles away. 3 years of mall history, only 2+ years of airport history because that store opened after the mall location.

2) Beginning theory: when there are rain delays, people have more time in the airport, so they buy more.

3) Beginning theory: when it gets hot, people go to the mall.

WeatherUnderground4) I knew that Rob had found weather data from somewhere, but as I started searching around, I could not find any good free weather databases. I did find a page a WeatherUnderground.com where you could retrieve weather statistics for one day for one city. Luckily, both of these stores were in the same city, so I only had to gather data for one city.

Unfortunately, getting three years of data meant visiting this web site 1095 times! Excel VBA lets you use the brute force approach, so while I was typing a chapter on one laptop, I set up a VBA macro on the other laptop to run an Excel Web Query over and over and over, changing the date in the URL each time. As each web page would come back to Excel, I would use VLOOKUP to pluck out the rainfall and the high temperature for the day.

5) Getting the data into PowerPivot was easy. Two linked tables. One relationship by date.

6) In the PowerPivot grid, I added some calculated columns with DAX:

  • To group the temperature data into buckets, I used a column called “Highs in the” and a DAX calculation of =INT(Weather[High Temp]/10)*10.
  • To group the daily dates into weekdays, I used =FORMAT(Sales[Date],”ddd”). This returns Mon, Tue, Wed, etc. You have to use Colin’s trick from the FAQ to actually sort those into custom date sequence. (Thanks Colin!)
  • Back in Excel, I had used the range version of VLOOKUP to assign classifications based on rain. The rainfall in inches was classified into “0-None”, “1-Trace”, on up to “4-Hurricane”.
  • I did not realize it when I started, but as I was doing DAX Measures, I found the need to have some concatenated fields so that I could do a count distinct. =CONCATENATE(Sales[Location],Sales[Date]) and =concatenate(Sales[Location],Sales[WeekdayName]).

7) I used the “four chart” layout in PowerPivot. I ended up using only three of those charts in the contest entry.

8) The important Measure that was enabled by DAX was “Sales by store by day that falls into the current filter context”.  There were some days in 2007 where only one store was open, and many days where 2 stores were
open. DAX Measures let me calculate total sales / (# of stores open on this day). That DAX Measure is: =sum(Sales[Net Sales])/Countrows(DISTINCT(Sales[LocationDays]))

9) During discovery, I used slicers, but for the contest entry, I needed different charts to have different filters, so I changed the slicers back to report filters so that one chart could be for the airport and another chart could be for the mall.

Surprising results:
As rainfall increases, sales plummet the airport. I guess that when people are freaking about because of flight delays, they are not busy shopping.

Chart1

Higher temperatures mean lower sales at the mall. When I was a kid, we would flock to the mall because they had air conditioning. Now, the people buying these high-end handbags probably already have air conditioning and no longer need to head to the mall.

Chart2

As I was looking for other trends, I ran one by weekday at the airport. Friday was the peak sales day during the week, probably indicating that business travelers are picking up a piece of jewelry on their way back home to their wife. But; I was surprised to see that the one day which beat Friday was Sunday. Sunday? If the business travelers are buying gifts for their wives on Friday, WHO the heck are they buying gifts for on Sunday? One theory is noted in the chart. In reality, that Sunday spike is probably from business women who are accessorizing up before heading out for their business trip. It would probably be interesting to break that Sunday data by hour to see if the sales are early in the day when people are heading home from weekend trips to the beach or on Sunday afternoon/evening when the business commuters are heading back out.

Once I had reached some interesting conclusions about the data, I spent some time formatting the workbook to not look like Excel. I got rid of a lot of the extra stuff that is standard on Excel charts, but then added some graphics (the rain is a pattern fill on the chart area in the left graph and the Delay Delay Delay graphic is a picture fill applied to the plot area. To use pictures in individual columns of the other charts, you have to click on a column twice. Once to select the series, and then a second time to select that one point. Then, use Format, Fill, from Picture.

I suppose the surprising PowerPivot lesson here was that the slicers were really helpful to me when analyzing the data, looking for interesting trends. Once I had decided on the message that I wanted to convey, though, the slicers had to go.

Thanks to Rob for allowing this guest post!


Touchdown Problem Solved, and COUNTROWS

January 8, 2010

In the last post I was struggling with horribly-designed source tables.  To be honest, I never found a way to relate those two tables to each other, even through other intermediate tables that initially seemed promising.

But I DID find two columns in my Plays table itself that saved my bacon – [OffensiveTeamScoreAtPlayStart] and [OffensiveTeamScoreAtPlayEnd].

PowerPivot PtsScoredOnPlay Calc ColumnA simple calc column subtracting the two and now I have [PtsScoredOnPlay]:

  =[OffensiveTeamScoreAtPlayEnd]-[OffensiveTeamScoreAtPlayStart]

And that appears to give me what I want – see the picture at right.

(I was a little suspicious at first that the Offensive team could score 2 pts, since only defenses can score safeties, but then I realized these were 2-pt conversions by the offense – see how subtle “business rules” can be?)

OK, so with that calc column in hand, and my knowledge that all Touchdowns score 6 pts, I can go back to my PivotTable and add a measure:

 

   =CALCULATE(
         COUNTROWS(Plays),
        
Plays[PlayTypeName]="RUN-Run",
         Plays[PrimaryRoleName]="Runner",
         Plays[PtsScoredOnPlay]=6
    )

Dissecting the Formula

I’ve broken the formula into separate lines above.  Let’s go line by line.

  1. The first line is just the CALCULATE function.  I have previously described CALCULATE as a supercharged version of SUMIF.  That bears repeating.  Think “improved SUMIF.”
  2. The second line is the expression to be aggregated.  You are not limited to SUM.  In fact you can place many custom expressions here.  In this case I am using the COUNTROWS function.  I will explain that below.
  3. The remaining 3 lines are just conditional filters, like the “if” part of SUMIF, or more accurately, SUMIFS, since there are more than one.  Notice that the last one filters down to just plays that scored 6 pts, which are touchdowns.  The other two filters insure that this was a Rushing play, and the current player was indeed the one carrying the ball.

COUNTROWS is another good (new) friend

Longtime PivotTable users are familiar with a common trick.  When you want to count the number of rows from your source data that meet various criteria (in a PivotTable), you find a column in your data that is guaranteed to be unique, and then you add that field to your Values area, and change the “Summarize By” to count.

And if you didn’t have a column that contained unique values, well, you created one.

With COUNTROWS, you no longer have to do that.  COUNTROWS(<TableName>) works every time.

Tying it all together

OK, then, the formula above says:  “In my Plays table, count the number of rows that are rushing plays, where the current player is the one carrying the ball, that resulted in 6 pts.”

Does it work?  You bet.  PivotTable on left, ESPN on right.  (Priest Holmes, 2004).

PowerPivot RushTD Measure Priest Holmes PowerPivot RushTD ESPN Priest Holmes

OK, back to Sexy!

In the next post, I can now return to what I was originally trying to do, which was something semi-mind blowing…  if you’re an Excel geek anyway :)


DAX CALCULATE, and Rapid Iteration

December 16, 2009

Ok, in the last football post, I had written the following measure:

          PowerPivot DAX Calculate Zoomed

Which basically means, “sum up the [Yards] column but only those rows with [PlayTypeName] = “RUN-run” “

Now we get to find out if it works :)

I need some real-world data that I can validate against.  Out on NFL.com I can find historical player stats.  Let’s look at one of my favorite players from recent years, Priest Holmes.  He amassed a lot of Rushing Yards and should be a good sample.  Here are his statistics by Game for 2004:

          Priest Holmes 2004 Rush Yards

OK, so let’s make a PivotTable that just shows Priest’s 2004 Rush Yards by Game, and compare that to the Rush Yards column from above:

         PowerPivot Says Same Numbers as NFL          Priest Holmes 2004 Rush Yards Zoomed

                           My Pivot Table                                      From NFL.com

Bingo!

“You know sometimes I even impress myself?”

-Han Solo

An exact match!  That might seem a bit bland to you folks out there, but to me, it is VERY exciting.  I’ve got 40+ tables, a Plays table that is loaded with crazy complexity, no database training whatsoever, and yet…  I now have a measure that agrees, DOWN TO THE YARD, with NFL.com!

Wahoo!  In all honesty, when I started this project, I had no idea how far I could go.  I intentionally chose something that *might* defeat me.

I’m feeling pretty darn optimistic now, though :)

I’m suspicious, however…

I recall that my Plays table contains multiple rows per play, so that, for instance, it can capture data about the *defensive* players involved in a play:

Single Rushing Play

Right there, a single play, three rows.  One for the runner himself, one for the player who tackled him, and another for the player who assisted the tackle.  That will come in handy when I get started looking at defensive stats.

But for now, I suspect that means defensive players are getting credited with rushing yards, too.  When I filter to Antoine Winfield, the Assister from above, my PivotTable confirms my suspicions:

         Antoine Winfield Rushing Yards

What we are seeing there is the total net yards of running plays in which Antoine Walker was involved in tackling the runner.  I don’t want that.

But all I have to do, then, is add another clause to my CALCULATE function.

Calculate Function Fixed

And now the pivot table shows:

Antoine Winfield After Fix    Priest Holmes After Fix

Priest Holmes is unchanged.  Antoine Winfield now has no rushing yards.  Perfect!

CALCULATE is good.  CALCULATE is your friend.

“OK Rob, a whole post just so you can make ONE change??”

Yes.  I spent more time than usual on this.  Here’s why:  Everything above took me less than 10 minutes in real time.  It took far longer to capture the screen shots than simply to blaze through it.

But when I worked with a BI consultant, a few years back, the same exact iteration took about a week.

PowerPivot Compared to Traditional BI Development

How do we account for the difference?  Is it because the consultant produced better results?  Was the Rushing Yards measure more accurate or robust than mine?  Was it somehow more formalized, more robust?  Or was the consultant not very good?

The answers are no, no, no, and an emphatic “no.”  (The consultant was fabulous, akin to godlike.  There’s that word again – “akin.”  Why do I keep saying that?)

The real difference, as I’ve said before, is that with PowerPivot, the “modeler” and the “business user” are the same person.  I’m the one writing the expressions, and the one who knows the most about what I want, because I know the “business” (football, in this case) inside and out.  Iteration in one person’s head is blazingly fast.

  1. Finding the comparison stats on NFL.com – Only the business user knows where the best validation data sets are.
  2. Creating the pivot for comparison purposes – even just choosing my player for comparison reflected business knowledge. 
  3. Realizing that I likely was incorrectly counting defensive players in my measure – because I had access to the source tables and the “business rules” in my head, I spotted this problem before it ever made it into a report.

Seriously, this was like a trip down memory lane, on hyper fast forward.  I’d get a cube from the consultant, build some pivots, see that things were not accurately reflecting football rules, point out the problem, wait for the next version, repeat.

Not anymore :)

“Are you saying we don’t need BI Pros anymore?”

No, I am NOT saying that.  Even in this football example, I am cheating.  The original data from STATS arrived as a jumble of text files.  The schema was terrible.  Many of the required attributes (like the score of the game on a particular play) were completely missing.

There was a TON of work, done by the BI professional, to get from that horrible mess to the 40+ tables that I am working with now in PowerPivot.  And there is no way, no way at all, that I could do that myself – not back then and not today.

In terms that an MS BI Pro understands, the Integration Services work remains.  In fact, it becomes even more important, since you need to make the resulting schema not just work for people like you, but also for people like me :)

But the Analysis Services work – you can start sharing that with the Excel business users.