The Formula for the Slope of a Linear Regression Line. It’s Greek to Me.
(Get it? Greek? Sigh. Anyway, click the image to view the article on StatisticsHowTo.com)
In Case of Emergency, Call JT Statmaster!
I struggle mightily to understand formulas expressed as Greek symbols. I don’t know why really. Probably because it seems so abstract – that notation sacrifices “humanity” in order to achieve precision and uniformity. I get that, but it doesn’t make it easy for me.
Fortunately, we have people like JT Joyner. JT was a student in one of my classes late last year. And I dare say he was one of those “star pupils.” He took to Power Pivot like a natural. So natural, in fact, that a couple days after the class, he emailed me a Linear Regression workbook example, implemented in Power Pivot measures. He translated Greek into, you know, formulas.
Six months later, yeah, I am just getting around to doing something with it. But this is exciting stuff for sure. I’m pumped. Let’s dig in.
Diagramming and Explaining the Greek
First, let’s explain what N, X, and Y mean:
N = Number of Doctors. X = Number of Visits. Y = Prescriptions Written.
And then let’s clarify what those Greek Sigmas mean:
What Does the Sigma Say?
(Kow kow kow kow kow k-kow!)
Reminder of the Problem We are Solving
Simple question: If pharma sales rep Billy visits a doctor more frequently, will that translate into that doctor prescribing more of Billy’s drugs, I mean, medications?
And if so, what is the “value” of a single visit? Will that result, in average, on 10 more prescriptions per year or 500?
And Now, a Word on the “Power-Pivot-ness” of This Example
I deliberately chose/constructed this problem in a way that would “shine” in Power Pivot and, um, not shine in regular Excel.
Here is what a “regular” example would look like:
Rob’s Weight Over Time – Height Held Constant at 6-foot 1 inch.
(This is a “regular” example – one row “captures” one sample/trial)
If we treat Age as X (the input) and Weight as Y (output), we can perform a linear regression and determine that I will weigh more than 450 pounds before it’s all over.
But that’s NOT how our example is constructed here. No way. Ours is more reflective of a noisy, dynamic, and changing world.
But Our Samples/Trials are NOT Captured by Single Rows!
As illustrated by the two tables above, our “trials” are spread across MANY rows each! A given visit, or a given month of prescriptions, is not terribly significant. It’s the aggregate impact of aggregate behavior that is the focus of this model, not whether a single visit impacted a single month of prescriptions. (Although a future modification to this model COULD take that approach.)
This distinction is important – single rows defining trials versus trials defined by aggregates. So keep that in mind as we proceed.
All Right, Let’s Get to the Measure Formulas!
First, let’s bring back that “diagram” of the Greek formula:
Now let’s translate those into formulas:
[Number of Doctors] //which is N in the formula above =
[Sigma X Visits CHG] //Term #2 in the diagram above =
CALCULATE([CHG Visits vs PY], ALLSELECTED(Doctors))
[Sigma Y Prescrip CHG] //Term #3 Above =
CALCULATE([CHG Prescriptions vs PY],ALLSELECTED(Doctors))
[SUMX of X Squared Visits CHG] //Term #5 Above =
[Sigma X Visits CHG]*[Sigma X Visits CHG] )
[SUMX of X Times Y] //Term #1 Above =
[Sigma X Visits CHG]*[Sigma Y Prescrip CHG] )
Phew, that looks hard!
Actually, by far the hardest thing is keeping all the names straight. X, Y, and SUMX – my eyes start to blur after a bit.
The math itself isn’t so bad, the only wrinkles are the SUMX and the ALLSELECTED. So let’s explain those, but later. For now let’s just cut to the chase.
[Slope] = # of Additional Prescriptions “Yielded” by One Extra Visit!
Apparently, One Extra Visit to an Independent Doctor
Will Yield 1,350 Additional Prescriptions
In part 1 I had made the dubious decision to use % Change as my X and Y values. That led to some SUPER confusing stuff, in that I had no idea what my output Slope value meant.
After consulting with JT, I switched to Absolute Change instead of % Change. So that way, my output slope is easy to interpret – a single incremental visit is expected to yield the Slope amount of additional prescriptions written.
So, all of the formulas in this post, including the ones above, use totals, and changes in totals, rather than % change.
So, here’s the formula for Slope:
([Number of Doctors] * [SUMX of X Times Y] ) – ([Sigma X Visits CHG] * [Sigma Y Prescrip CHG])
([Number of Doctors] * [SUMX of X Squared Visits CHG]) –
([Sigma X Visits CHG] * [Sigma X Visits CHG])
And remember, that’s just all of the measures finally coming together in the shape of that Greek formula diagram. Nothing more.
Wait a Second…
In part 1, my analysis of the fake data told us that Independent doctors hated being visited (negative slope), whereas HMO doctors liked it (positive slope).
But today, when I switch from % change to absolute change, I get a very different result:
OK… Today, BOTH Org Types Respond Positively, and Independents respond 10x Better!
Why is that happening? I… have… no… idea. Someone please explain how switching from % change to absolute change can turn a negative slope into a largely positive one.
This is driving me nuts, I am starting to regret choosing it as a blog topic
Maybe THESE are the times when you need a Steven Levitt?
Anyway, back to things I understand…
Well, in terms #1 and #5, we need to evaluate things on a PER-DOCTOR basis, and THEN sum up the answers we got for each doctor.
Why? Because that’s how linear regressions must be calculated. So we use SUMX in those cases, iterating over the rows in the Doctors table. Good thing we have SUMX eh?
This wrinkle is a bit more optional. The original example sent to me by JT used ALL instead, but even that isn’t strictly necessary.
You don’t need ALL(), or ALLSELECTED(), unless you plan to put fields from the Doctors table on your pivot. The grand total cell is implicitly equivalent to ALL() – it has the same filter context – so no reason to bother.
If Your Pivot Looks Like this, ALL and ALLSELECTED are NOT Needed
But if you wanted to display individual doctors on the pivot, and still have the [Slope] measure be correct in the context of each row, well, now you DO need an ALL or an ALLSELECTED:
So, for instance…
If you didn’t have an ALL or ALLSELECTED, the Slope would be
different for each doctor, which is NOT what we want
Now, it’s a fair question to ask, why the heck would I put individual doctors back on my pivot, if all I was trying to do was judge the overall impact of “more visits” on “how much of my drugs the doctors prescribe.”
So I concocted one example above, where we take the [Slope] value, multiply it by 10, and then add that to each doctor’s [Qty Prescribed] amount to get a projection for 2014 – assuming we bumped visits by 10 to each doctor.
[Prescriptions in 2014 if we added 10 Visits] =
[Prescribed Qty] + ([Slope] * 10)
I don’t know how often we’d want to do that, but hey, just in case.
So wait, why ALLSELECTED and not just ALL?
If I wanted to have doctors on the pivot (necessitating an ALL) but still slice by doctor properties (like OrgType), and have the [Slope] just consider the selected type of doctors, well, THEN I need ALLSELECTED.
Because ALL would still be looking at ALL of the doctors, and not just the selected type.
If I had used ALL instead of ALLSELECTED, I’d Still Be Getting 277.3 and not 107.9