Post by Rob Collie
From Last Week’s Client Work
Last week a client asked us to solve a somewhat unusual problem: given any two lists of Twitter followers, tell us how many followers “overlap” between the two lists.
How Many of Han Solo’s Followers Also Follow Leia Organa, and Vice Versa?
(Randomly-generated Twitter handles are funny. I particularly like “@Gommo” and “@Xxfok”)
Loading the Data: Using Power Query
Let’s use Power Query to perform the import this time, both because we’re using PQ a lot more around here now that we have Power Update, and because we’re gonna need PQ for the more complex steps later.
Note that all of the steps below are performed using Excel 2013. (I find Power Query to be a bit too clumsy in Excel 2010.)
Importing from a Table Using Power Query: Step 1
(Unchecked “has headers” because of the “Han Solo’s Followers” Row)
Now we trim that top row (“Han Solo’s Followers”) off:
Trimming the Top Row Off of our Data in Power Query
And then, we use the “new” top row as the header row:
Which gives us this in the Power Query editor:
Data is Looking Good. Note that We’ve Renamed the Query from “Table3” to “Han”
(Also note the list of steps that Power Query has saved up for us, to be replayed automatically on refresh!)
Set Your Data Types in Power Query!
Heed the following advice now, or suffer the wrath of a painful bug later. The bug is the result of, ahem, a Power Struggle between the Power Tools (Power Query and Power Pivot), where each wants to be “in charge” of things like table name, column names, and data types.
In short, if you are EVER going to change ANY of these, perform the change in Power Query, NOT Power Pivot:
- Table Name – change this in PQ’s Query Settings, as pictured above. (Don’t right click and rename the resulting sheet tab/table in Power Pivot).
- Column names – right click and rename them in Power Query (don’t right click and rename in Power Pivot).
- Deleting Columns – again, delete them from the Power Query side, using the Delete key or the Remove Columns button on the ribbon. Don’t delete on the PP side.
- Data types – use the Data Type dropdown on PQ’s Query Editor ribbon (pictured below). (Don’t use the similar dropdown’s on Power Pivot’s ribbon).
If you use Power Pivot to change any of those, even once, chances are good that PP will now think it’s “in charge” of ALL changes to that table from that point forward, and then if you EVER change ANYTHING in Power Query (even things OTHER than table/column names or data types, like filtering out bad rows), you are suddenly confronted with this problem, for which in Excel 2013 the only simple “fix” is to delete your entire table from Power Pivot and start over again. Blech.
So let’s set those data types now, just to be safe in case we need them properly set later, OK?
Set the Date Data Type for “Follow Date”, But the Other Column is Already OK
Now, load the query directly into the Data Model (aka Power Pivot):
Duplicate and Modify the Query to Import Leia’s Followers
We need to run all of the same steps to import Leia’s followers, so rather than starting from scratch, let’s Duplicate the Han query:
And then we edit the duplicate query:
Again, we load the Leia table straight into the Data Model (Power Pivot), yielding:
Simple Solution Using Lookupvalue
Let’s do this the easy way first. In the Han table, add this calculated column:
That results in mostly empty values, since most of Han’s followers are NOT also in Leia’s follower list, but there definitely ARE some matches:
Even better, we can modify the calc column formula so that it returns 1 for matches and 0 for non-matches:
So now we can use this column as the basis for a measure that counts overlap:
[Leia Followers Who Also Follow Han] :=
Then we can, say, add a Month column to the Han table:
And now we can pivot/summarize:
Of course, we could obviously now do the reverse steps for Leia – add the LOOKUPVALUE column, the measure, etc., but we won’t show that here.
(Note that once we DO duplicate the formulas, we still end up with 211 shared followers, so the extra work wouldn’t yield much of an interesting result.)
Interested in Learning How to Do this Kind of Thing?
Our New Online Power Query Course just launched this year. You’ll learn Power Query from the best in the business, two Excel MVPs Ken Puls and Miguel Escobar. We’ve included 7 Hours of Self-Paced Video over 31 Courses, with lots of example files, and a Digital Badge powered by Acclaim.
CLICK HERE to claim your spot.
The Hard (aka Dynamic/Flexible) Way
Let’s say we don’t want to take the easy way out. We want to be hard on ourselves, for reasons that may become clear later.
We can alternatively approach this as if it were a classic “many to many” problem:
Han Followers, Leia Followers, and a “Bridge” Table, MergedFollowers
(Which is the Union of All Han Followers Plus Leia Followers)
So, How Do We Create the MergedFollowers table? Power Query of course!
Using Append from the Power Query Ribbon
(Creates One Long Table – Han and Leia tables “stacked” vertically)
Then we Remove Duplicates from the Follower Column
(Removes 211 duplicate rows in our case)
Lastly We Rename the Query
Then we load into the Data Model, create the relationships, and we get:
It is VERY important that the relationships point in the right direction. Han and Leia should be your Lookup tables. Don’t get that backwards or nothing subsequent will work.
Note that we ALSO should have removed the [Follow Date] column from the MergedFollowers table – we don’t need it, and it’s misleading: for the duplicates, we’ve arbitrarily been “left” with the Han or the Leia follow date, not some sort of merge or average or whatever. (Yes, I noticed this right before posting, that I’d left the column in by accident).
No Calculated Columns in the Hard Way!
Rather than adding calc columns and then summing them up, in this approach our measures can follow the Many to Many pattern, or at least a close cousin thereof.
First, let’s add some basic measures (calculated fields) for Han Followers and Leia Followers, as well as the overall merged population:
Han Followers :=
Leia Followers :=
Total Follower Population :=
Yielding this simple pivot:
Now we add the “magic” measure…
Han and Leia Shared Followers :=
CALCULATE([Total Follower Population], Han, Leia)
Which, loosely translated to the extent I loosely understand it, is “give me the [Total Follower Population] but enforce matches from both the Han and Leia tables.”
And that gives us…
WooHoo! The Same Answer We Had From the Easy Way!
(Aren’t you glad we burned all that extra effort?)
Advantages of the Hard Way?
Simply put, the hard way is more Dynamic.
First off, as we add more tables, like the Luke table, we don’t want to have to add pair-wise calc columns for Luke-Han overlap and Luke-Leia overlap, and then later Lando-Han, Lando-Leia, and Lando-Luke (when we add the charismatic Lando and his millions of followers to the mix).
Of course, even in the Hard way, we DO still have to add pair-wise measures, so, um… yeah. That’s not a tremendous savings.
Here’s where it starts to get interesting:
If we have properties of followers, and those properties can be DIFFERENT, for a single follower, depending on their relationship with Han or Leia, the Hard way takes care of us.
For this example I just invented a column in each of the Han and Leia tables:
Han and Leia Tables Now Each Have a Column for ActivityLevel
(Active = Engages Frequently, Dormant = Used to Engage, Went Quiet and Lurker = Never Has Engaged)
The formulas for those columns are not important – they are just randomly-generated.
The important thing is that a SINGLE USER can be a Lurker for Han but be Active for Leia, or vice versa, etc.
And our Hard Way measure lets us build a pivot like the one above, where a simple LOOKUPVALUE, being statically determined by “is this user in both lists”, will not help us.
There are a MILLION things we could continue on and explore at this point, like percentage overlap (which is just a simple DIVIDE measure), displaying the lists of users (also pretty simple), but wow, this is one LONG post already. So let’s call it quits for now.