The Formula for the Slope of a Linear Regression Line.  It’s Greek to Me.
(Get it?  Greek?  Sigh.  Anyway, click the image to view the article on StatisticsHowTo.com)

### In Case of Emergency, Call JT Statmaster!

I struggle mightily to understand formulas expressed as Greek symbols.  I don’t know why really.  Probably because it seems so abstract – that notation sacrifices “humanity” in order to achieve precision and uniformity.  I get that, but it doesn’t make it easy for me.

Fortunately, we have people like JT Joyner.  JT was a student in one of my classes late last year.  And I dare say he was one of those “star pupils.”  He took to Power Pivot like a natural.  So natural, in fact, that a couple days after the class, he emailed me a Linear Regression workbook example, implemented in Power Pivot measures.  He translated Greek into, you know, formulas.

Six months later, yeah, I am just getting around to doing something with it.  But this is exciting stuff for sure.  I’m pumped.  Let’s dig in.

### Diagramming and Explaining the Greek

First, let’s explain what N, X, and Y mean:

N = Number of Doctors.  X = Number of Visits.  Y = Prescriptions Written.

And then let’s clarify what those Greek Sigmas mean:

What Does the Sigma Say?
(Kow kow kow kow kow k-kow!)

### Reminder of the Problem We are Solving

Simple question:  If pharma sales rep Billy visits a doctor more frequently, will that translate into that doctor prescribing more of Billy’s drugs, I mean, medications?

And if so, what is the “value” of a single visit?  Will that result, in average, on 10 more prescriptions per year or 500?

### And Now, a Word on the “Power-Pivot-ness” of This Example

I deliberately chose/constructed this problem in a way that would “shine” in Power Pivot and, um, not shine in regular Excel.

Here is what a “regular” example would look like:

Rob’s Weight Over Time – Height Held Constant at 6-foot 1 inch.
(This is a “regular” example – one row “captures” one sample/trial)

If we treat Age as X (the input) and Weight as Y (output), we can perform a linear regression and determine that I will weigh more than 450 pounds before it’s all over.

But that’s NOT how our example is constructed here.  No way.  Ours is more reflective of a noisy, dynamic, and changing world.

### But Our Samples/Trials are NOT Captured by Single Rows!

As illustrated by the two tables above, our “trials” are spread across MANY rows each!  A given visit, or a given month of prescriptions, is not terribly significant.  It’s the aggregate impact of aggregate behavior that is the focus of this model, not whether a single visit impacted a single month of prescriptions.  (Although a future modification to this model COULD take that approach.)

This distinction is important – single rows defining trials versus trials defined by aggregates.  So keep that in mind as we proceed.

### All Right, Let’s Get to the Measure Formulas!

First, let’s bring back that “diagram” of the Greek formula:

Now let’s translate those into formulas:

[Number of Doctors]  //which is N in the formula above  =

CALCULATE(COUNTROWS(Doctors), ALLSELECTED(Doctors))

[Sigma X Visits CHG]  //Term #2 in the diagram above =

CALCULATE([CHG Visits vs PY], ALLSELECTED(Doctors))

[Sigma Y Prescrip CHG]  //Term #3 Above =

CALCULATE([CHG Prescriptions vs PY],ALLSELECTED(Doctors))

[SUMX of X Squared Visits CHG]  //Term #5 Above =

SUMX(ALLSELECTED(Doctors),
[Sigma X Visits CHG]*[Sigma X Visits CHG]                   )

[SUMX of X Times Y]  //Term #1 Above =

SUMX(ALLSELECTED(Doctors),
[Sigma X Visits CHG]*[Sigma Y Prescrip CHG]       )

### Phew, that looks hard!

Actually, by far the hardest thing is keeping all the names straight.  X, Y, and SUMX – my eyes start to blur after a bit.

The math itself isn’t so bad, the only wrinkles are the SUMX and the ALLSELECTED.  So let’s explain those, but later.  For now let’s just cut to the chase.

### [Slope] = # of Additional Prescriptions “Yielded” by One Extra Visit!

Apparently, One Extra Visit to an Independent Doctor

In part 1 I had made the dubious decision to use % Change as my X and Y values.  That led to some SUPER confusing stuff, in that I had no idea what my output Slope value meant.

After consulting with JT, I switched to Absolute Change instead of % Change.  So that way, my output slope is easy to interpret – a single incremental visit is expected to yield the Slope amount of additional prescriptions written.

So, all of the formulas in this post, including the ones above, use totals, and changes in totals, rather than % change.

So, here’s the formula for Slope:

[Slope] =

CALCULATE(
DIVIDE(
([Number of Doctors] * [SUMX of X Times Y] ) –                 ([Sigma X Visits CHG] * [Sigma Y Prescrip CHG])
,
([Number of Doctors] * [SUMX of X Squared Visits CHG]) –
([Sigma X Visits CHG] * [Sigma X Visits CHG])
)
,
ALLSELECTED(Doctors)
)

And remember, that’s just all of the measures finally coming together in the shape of that Greek formula diagram.  Nothing more.

### Wait a Second…

In part 1, my analysis of the fake data told us that Independent doctors hated being visited (negative slope), whereas HMO doctors liked it (positive slope).

But today, when I switch from % change to absolute change, I get a very different result:

OK…  Today, BOTH Org Types Respond Positively, and Independents respond 10x Better!

Why is that happening?  I… have… no…  idea.  Someone please explain how switching from % change to absolute change can turn a negative slope into a largely positive one.

This is driving me nuts, I am starting to regret choosing it as a blog topic

Maybe THESE are the times when you need a Steven Levitt?

Anyway, back to things I understand…

### Why SUMX?

Well, in terms #1 and #5, we need to evaluate things on a PER-DOCTOR basis, and THEN sum up the answers we got for each doctor.

Why?  Because that’s how linear regressions must be calculated.  So we use SUMX in those cases, iterating over the rows in the Doctors table.  Good thing we have SUMX eh?

### Why ALLSELECTED?

This wrinkle is a bit more optional.  The original example sent to me by JT used ALL instead, but even that isn’t strictly necessary.

You don’t need ALL(), or ALLSELECTED(), unless you plan to put fields from the Doctors table on your pivot.  The grand total cell is implicitly equivalent to ALL() – it has the same filter context – so no reason to bother.

If Your Pivot Looks Like this, ALL and ALLSELECTED are NOT Needed

But if you wanted to display individual doctors on the pivot, and still have the [Slope] measure be correct in the context of each row, well, now you DO need an ALL or an ALLSELECTED:

So, for instance…

If you didn’t have an ALL or ALLSELECTED, the Slope would be
different for each doctor, which is NOT what we want

Now, it’s a fair question to ask, why the heck would I put individual doctors back on my pivot, if all I was trying to do was judge the overall impact of “more visits” on “how much of my drugs the doctors prescribe.”

So I concocted one example above, where we take the [Slope] value, multiply it by 10, and then add that to each doctor’s [Qty Prescribed] amount to get a projection for 2014 – assuming we bumped visits by 10 to each doctor.

[Prescriptions in 2014 if we added 10 Visits] =

[Prescribed Qty] + ([Slope] * 10)

I don’t know how often we’d want to do that, but hey, just in case.

### So wait, why ALLSELECTED and not just ALL?

If I wanted to have doctors on the pivot (necessitating an ALL) but still slice by doctor properties (like OrgType), and have the [Slope] just consider the selected type of doctors, well, THEN I need ALLSELECTED.

Because ALL would still be looking at ALL of the doctors, and not just the selected type.

If I had used ALL instead of ALLSELECTED, I’d Still Be Getting 277.3 and not 107.9

#### Rob Collie

One of the original engineering leaders behind Power BI and Power Pivot during his 14-year career at Microsoft, Rob Collie founded a consulting company in 2013 that is 100% devoted to “the new way forward” made possible by Power BI and its related technologies. Since 2013, PowerPivotPro has rapidly grown to become the leading firm in the industry, pioneering an agile, results-first methodology never before seen in the Business Intelligence space. A sought-after public speaker and author of the #1-selling Power BI book, Rob and his team would like to help you revolutionize your business and your career.

1. I suspect the % vs absolute issue has to do with … say, 1% gain on 10000 prescriptions is still 100 prescription gain, where a big 10% gain on 100 prescriptions is only 10 gain. I can imagine the direction flipping if things lined up right… ?

2. tede says:

My head hurts trying to understand this…. I think I need more coffee. Good stuff but i’m going to have re-visit this a few times.

3. I wished that I had this post last weekend. I ended up working out something similar to this for the Zillow Rental per square foot data. Took a good part of an afternoon to work out. However, now I have a pattern that will be useful for other projects and some predictive information about rental prices all around the country.

BTW it’s a very nice learning data set that imports very nicely with Power Query. http://files.zillowstatic.com/research/public/Zip/Zip_ZriPerSqft_AllHomes.csv.

For other reasons I used Power Query to un-pivot the data first. About 600,000 data samples once un-pivoted…

Regarding the Linerier Regression function, yes I agree it took me a while to parse through all of the Greek symbols to get this done.

When I have about an hour I’ll have to read on this full post again, and see if I can learn some new tricks.

Also for folks trying this. If you are checking your work. Use either the Slop function or better the Linest Arrray function. (with a 2×5 cell selection) I know it’s an array function, however it does provide a good confirmation that you got everything in this correct.

Finally, it looks like you have just worked out the Slop, Have you worked out the intercept so you can do predictions, and the R^2 value to understand how good the corrolation will be.

My goal was to try to work out what rent prices might be a year or so in the future, if linier regression looks to be predictive.

Very cool artical thanks… 🙂

4. blotonthelandscape says:

Here’s where I’m a bit confused: In an ordinary Regression model with categorical variables (e.g. HMO vs Independent), you’d code one as your base variable, and your slope would be the difference in impact made by the other category. What you seem to be doing instead is running separate regression models for each category.

This probably accounts for the weird percentages you’re seeing; a regression model is essentially y= a + bx (anyone with high-school algebra will recognise this formula for a straight line), where a is your base level of sales (normally it’s meaningless, but in your example it would be the number of prescriptions sold if you made zero visits), and b is your slope; y in your example would be prescriptions sold, and x would be number of visits. Both a and b are estimated by the model, so if you change your reference dataset, you’ll change your base prescriptions.

I would recommend, at a minimum, adding the base calculation to your pivot table. See http://www.statisticshowto.com/how-to-find-a-linear-regression-equation/ under step 2, the calculation for (a), which conveniently reuses a lot of the intermediary calculations you’d already created for calculating the slope!

In reality, what you’re wanting to do is Multiple Regression. Instead of having a slicer for each kind of doctor, you’d want an additional variable in the model, which is 0 for HMO, and 1 for Independent. This would change the meaning of (a) to “number of prescriptions for an HMO doctor who received zero visits”; you’d get a new slope variable as well (I’ll call it (c)), which is the difference that Independent doctors make (other things being equal).

All that said, Multiple Regression may be beyond the reach of Powerpivot; I’ll have to dig out the ol’ “first principles” textbooks and have a try, but Matrix multiplication and DAX may be a little bit out of my comfort zone.

Just to add to the complexity, Vector Autoregression (“VAR”) would be better suited to the “slicer” approach, but that’s ridiculously complicated and most likely beyond the reach of Powerpivot (famous last words!). Any other stats nerds willing to have a go?

A final note (this comment got out of hand); when using stats, tread carefully! You’re moving out of the “X=Y” world you’re comfortable with, and into the world of “X is somewhere in the region of Y”. Case in point: in part 1, although you did caveat, you said that “if we had an X value of 6, the linear regression “predicts” that Y would be 20.” This is a simplification that’s technically incorrect; a better simplification would be that it predicts “where X=6, the average of all Y’s is 20”. In the work I do (health insurance), it will almost never be the case that your simplification is true (in other words, if I took all the sixes in my data, almost none of them would actually have a Y of 20).

5. J.T. says:

To TGB417…The measure for the intercept is (sum(Y)-Slope*sum(X))/N. N being the count of recorders.

6. J.T. says:

I’d like to give my take on “Someone please explain how switching from % change to absolute change can turn a negative slope into a largely positive one.”

First off, the model in part 1 and the model in part 2 both are asking the same big picture question: “How do visits impact # of prescriptions written?” However the dependent variables used are different. Part 1’s dependent variable is YOY Prescription % change and, in Part 2, it is # of Prescriptions. Saying that, I would expect the results to be different because of the actual question each one is addressing is related but not the same. I have to admit…the approach in part 1 is interesting where as the part 2 model is the typical approach. Here’s my take which might help other’s insight.

The confusion goes all the way back to Part I, and the interpretation on the slope in that example. Stating that the slope value implies “The Slope of the Line is -0.4%, Indicating that Every 1% Increase in Visits Results in a 0.4% LOSS of Prescriptions.” is incorrect.

The correct interpretation is “The Slope of the Line is -0.4%, Indicating that Every 1% Increase in YOY Visits % change, Results in a 0.4% decrease in YOY Prescriptions % change.”

Another way to say this is the rate change of y (YOY Prescription percent change) is negative as x (YOY visits percent change) increases. This does not mean if you increase x, y is negative…to know that, you need the intercept of the line and not just the slope. What it is saying is that there is diminishing return as you increase YOY visits percent change.

In part 2, the x and y variables change from YOY % changes to x (number of visits) and y (number prescriptions). The results in part 2 are mathematically related to the model built in part 1 but the results on Part 1 are easier to grasp. Trying to draw comparisons between the two is impractical and difficult.

I hope this clarifies things and was helpful. I think the main point of the post is…regressions are possible within Power Pivot…please explore!

7. I think I managed to turn this method into a polynomial regression by performing it on logs of the x/y variables… That being said, doing this a few times really makes me wish PowerPivot allowed creating UDFs — kinda silly to make several measures/columns each time you wanna do what’s essentially the same thing.

In retrospect, like blotonthelandscape I’d kinda want to upgrade my approach to multivariate regression as well, but am feeling similarly overwhelmed. If only!

8. volfied says:

This is great stuff. I’ve been trying to apply it to my own work, but I’m having trouble making it work for time series analysis. I need to determine the linear regression slope each quarter for attendance at classes held over the past two years. My data is aggregated by event (class instance).

I’m pretty sure time is X and attendance is Y, but I’m not sure how to express time. Would number of classes be N? I’ve also tried to recreate the method described here (http://www.real-statistics.com/regression/least-squares-method/), which is Covariance/Variance and does not require an N, but I can’t figure out how to replicate COVAR() in PowerPivot.

Any chance you can point me in the right direction?

9. J.T. says:

In your case X is Time and Y is attendance. Here’s a small example. Say you have two data points for a science class: 1/1/2015 the attendance was 20 and 1/2/2015 the attendance was 23. In (x,y) nomenclature that would be (1/1/2015,20) and (1/2/2015,23). Instead represent them as (0,20) and (1,23) and run the regression. In this case N is 2, the number of data points. See if that helps with how to use dates.

Go ahead and compute the slope and intercept along with the points on regression line. Then the formulas for R^2 might make more sense once you have those results.

1. Great commentary to help make this more accessible to us non-stat folks. One more question ref. the time/attendance scenario. To work correctly, does a class on 1/4/2015 get coded as 3 (third class) or 4 (fourth date, with no 3)?

1. J.T. says:

It would be coded as the fourth date if there was school on 1/3/2015. However if there is no school on 1/3/2015…you might want to consider coding it as third day.

10. volfied says:

Got it working. Thank you!

11. Jorge Morales says:

I build a slightly different implementation but it works…thank you!

12. Jorge Olea says:

Wow it is a kind of old entry but I haven’t found another reference about this topic. Is it possible to do this same approach with dates, I mean, instead of visits, the month as the independent variable and forecast the next 12 months? What I’m trying to do is to get the same result via power pivot than Excel function forecast.ets.