Wednesday, May 31, 2017

Sanity Check: Equivalence of the t-test and F-test in linear regression, and how you get one from the other

Sanity Checks are missives on a specific math point in need of clarification. I try to do so using the fewest words possible. Usually, this is still quite a few words.

Here is what I would like.

I would like to type "equivalence of the t-test and F-test in linear regression, and how you get one from the other" into Google and arrive at a website page that explains the equivalence of t-test and F-test in linear regression in a comprehensible fashion. Bonus points for explaining how you get one from the other.

One ought not think this is an impossible order, current Google results speaking to the contrary.

Cursing darkness, lighting candle, etc.

Read on.



Begin at the Beginning

I'm going to back up a few steps in order to take a run at the thing. In doing so, I risk making this seem more complicated than it is, the "this" being the equivalence of the t-test and F-test in linear regression, and how you get one from the other. You must resist such interpretation. I could offer an explanation that is brief and difficult to understand, and then I would be no better than the obfuscated gruel we call "the rest of the Internet." Instead, I will include lots of steps and fill in the gaps with many explanatory words and make the train of thought easy to stay on top of, even if the journey does take a more circuitous route. The main result I will derive in full, although for one supporting result I will provide only a conceptual description rather than a page of symbol pushing, and one or two results I will ask you to take as a starting point rather than deriving them de novo. I said I was going to back up a few steps, not crawl back into the womb.

So, then. Linear regression. We have a cloud of n data points (xi, yi) through which we seek the best fitting line, "best fitting" being in the least squares sense, meaning that the sum of the squares of the differences between the data yi and the values predicted by our line yi-hat = α-hat + β-hat ⋅ xi is the smallest it can be out of all possible lines. In symbols, we minimize the sum of squares Σ (yi - yi-hat)², where the sum goes from 1 to n. Sums of squares are the lifeblood of regression, and many statistical things besides, and so I should introduce the ones we'll need up front. Let's begin with these three:

    Σ xi²
    Σ yi²
    Σ xi ⋅ yi

In words: the sum of x-squared, the sum of y-squared, and the sum of cross products. Alas, I have already lied to you, because these raw quantities are useless; what we require are "corrected" sums of squares, which are these quantities computed about their respective means:

    SSxx = Σ (xi - x-bar)²
    SSyy = Σ (yi - y-bar)²
    SSxy = Σ (xi - x-bar)(yi - y-bar)

Here, x-bar and y-bar are the x and y means of the data. Because these quantities appear many times in what follows, I have assigned them names (SSxx, SSyy, and SSxy). Get comfortable with them.

Footnote: If you're wondering where the terminology "corrected" comes from, it comes from writing these expressions in their "computationally efficient" form:

    SSxx = Σ xi² − (n ⋅ x-bar²)
    SSyy = Σ yi² − (n ⋅ y-bar²)
    SSxy = Σ xi ⋅ yi − (n ⋅ x-bar ⋅ y-bar)

The trailing term involving the mean can be thought of as "correcting" the raw sum of squares. Ergo, "corrected" sum of squares.

Footnote: Because we will never again use the raw sum of squares, I'm going to drop the "corrected" qualifier. When I write "sum of squares" I mean SSxx, SSyy, and/or SSxy.
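
Footnote: If you keep Python within reach, here is a minimal numerical sketch of the corrected sums of squares, computed both ways. The data are made up purely for illustration and only numpy is assumed:

    import numpy as np

    # Made-up example data (any small cloud of points will do)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()

    # Corrected sums of squares, definition form (about the means)
    SS_xx = np.sum((x - x_bar) ** 2)
    SS_yy = np.sum((y - y_bar) ** 2)
    SS_xy = np.sum((x - x_bar) * (y - y_bar))

    # "Computationally efficient" form: raw sums minus the correction term
    SS_xx_alt = np.sum(x ** 2) - n * x_bar ** 2
    SS_yy_alt = np.sum(y ** 2) - n * y_bar ** 2
    SS_xy_alt = np.sum(x * y) - n * x_bar * y_bar

    print(np.allclose([SS_xx, SS_yy, SS_xy], [SS_xx_alt, SS_yy_alt, SS_xy_alt]))  # True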

The line through our data has the form yi-hat = α-hat + β-hat ⋅ xi, because that is the equation of a line you've been using since Gymboree. The slope of the line is β-hat and the y-intercept is α-hat. We obtain the slope first, as the ratio of two sums of squares:

    β-hat = SSxy / SSxx

As you may recall, this expression comes from solving the normal equations. This is one of the results I'm going to ask you to accept as given. The normal equations aren't terribly difficult to derive, but they involve a splash of calculus that might confuse a student who has not encountered it before. I refer you to any competent statistics textbook for more info (simply Googling "normal equations" might be asking for trouble, which is sort of the running theme of today's post).

We obtain α-hat by remembering a regression line always passes through the mean of the data (i.e., y-bar = α-hat + β-hat ⋅ x-bar). Solve for the quantity of interest:

    α-hat = y-bar - β-hat ⋅ x-bar

And there is everything we need for the equation of our regression line. Easy peasy.
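
Footnote: The same sketch, continued (made-up data again, numpy assumed): with the sums of squares in hand, the slope and intercept are two lines of Python, and numpy's own least-squares fit (np.polyfit) should agree.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
    x_bar, y_bar = x.mean(), y.mean()

    SS_xx = np.sum((x - x_bar) ** 2)
    SS_xy = np.sum((x - x_bar) * (y - y_bar))

    beta_hat = SS_xy / SS_xx               # slope: ratio of sums of squares
    alpha_hat = y_bar - beta_hat * x_bar   # intercept: line passes through (x-bar, y-bar)

    slope_np, intercept_np = np.polyfit(x, y, 1)   # numpy's least-squares fit
    print(np.allclose([beta_hat, alpha_hat], [slope_np, intercept_np]))  # True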

Intermission: Explaining -hat notation

It is time to explain my "-hat" notation.

The slope and y-intercept of our line are not the slope and y-intercept of the line; rather, they are just estimates of the slope and y-intercept of the line. You may be thinking that is redonkulous, for there they are -- we just calculated them and could apply them to draw a real actual line that goes through the data should we so desire.

What you must always always always remember about any result in statistics is that we are working with a sample. We don't care about what we can say about our sample; we care about what we can say about the population the sample was drawn from.

Imagine if you will some cosmic population of data, one that contains every (x,y) data point we could possibly sample even were we immortal and indefatigable. That cosmic data has a cosmic line that goes through it -- the One True Line, let us call it -- and, obviously, there is a cosmic slope and cosmic y-intercept associated with that line. However, we never see any of this and never will. The cosmic data is forever hidden from us albeit we know it must exist, like the alluring undergarments of your first crush forever hidden beneath a delicate summer sundress. All we can do is take samples of the cosmic data, a myopic and coarse approximation of its true glory, and fit our small and silly regression line through it.

The slope and y-intercept of our regression line are not the slope and y-intercept of the One True Line through the population the sample was drawn from. Yet, the former must bear some relation to the latter, otherwise the enterprise has little point. This relationship, or the lack thereof, motivates statistical testing of our line, a topic which has brought you into my Internet clutches this very day. That story continues momentarily. However, it is tradition to indicate somehow that our calculated values are just approximations of the Real McCoy or, to use proper statistics lingo, that they are statistics that estimate a population parameter.

A population parameter is usually indicated with a Greek letter. So the equation of the One True Line would typically be written y = α + βx. A different notation is used when describing the impostor we calculate from our sample. Two notations are in widespread use: 1) we can use Latin letters instead of Greek: y = a + bx. Or, 2) we can festoon our Greek letters with little hats -- what the English majors call a circumflex (^).

I prefer the hat trick to the Latin, so I should explain why I am not using it: because I am stupid. I was unable to concoct HTML that produces the desired effect -- a circumflex perched atop the Greek letters α and β -- and does so reliably in every browser variant I tried, half of which render my attempt like the offspring of some unholy genetic coupling. Lest you think Internet assistance on this topic might be readily at hand, I would note typing "html hat" at Google returns a website offering bulletproof baseball caps (not kidding). Suddenly, the whirlwind of tragedy obtained upon searching "equivalence of the t-test and F-test in linear regression and how you get one from the other" begins to make a lot more sense.

Which leaves me in a quandary, whether I should leave you with the admittedly clumsy "β-hat" and the like, or go for broke, lesser browsers be damned, and leave many of you squinting at wrongly formatted expressions (you will note this issue also extends to mean values, which I have been writing as "x-bar" and "y-bar" rather than attempting to place a bar atop the variable). I have decided to keep with the bastard suffixed expressions. So y-hat, β-hat, α-hat (and, x-bar and y-bar) it is.

If I err, I err on the side of clarity, which marks me at best as a maverick and at worst heretic amongst the mathozoids, those craven creatures with their LaTeX and their CSS who jib and jeer from the sidelines, thumping their superior chests proudly as we humble pilgrims make for Compostela trudging through their fetid ranks.


Now that we have a line fit to the data, we can introduce three more sums of squares (actually, two new ones and a rebranding of one already introduced). A picture is worth a thousand words, which will not prevent me from adding a thousand more after:

[Figure: regression example -- the data points, the fitted regression line, and the quantities yi, yi-hat, and y-bar]

The new player here is yi-hat -- the value of y predicted by the line given xi (note I have included proper hatness in the figure, for Photoshop does not suffer the constraints imposed by HTML). Our three new sums of squares are: 1) the sum of the squares of the difference of the (true) yi and yi-hat, which I will call the error sum of squares (denoted SSE), 2) the sum of the squares of the yi-hat about the mean of y, which I will call the regression sum of squares (denoted SSR), and 3) the sum of squares of y, which we were calling SSyy, but we will now call the total sum of squares (denoted SST). Summing up, we have:

    SSE = Σ (yi - yi-hat)²
    SSR = Σ (yi-hat - y-bar)²
    SST = Σ (yi - y-bar)²

Footnote: Note my use of slightly different notation to describe the new sums of squares. Those previously appearing as general quantities not necessarily related to regression possess double subscripts (SSxx, SSyy, SSxy). Those appearing after we have a regression line in hand are subscriptless (SSE, SSR, SST). This is what's known in the education biz as a mnemonic, a signpost to help the student stay on the chosen path.

Footnote: I actually prefer the term residual sum of squares in place of error sum of squares, but we had no choice but to call the regression sum of squares SSR, and so we require a name that does not also begin with the letter r. Hence "error" and SSE.

Before continuing, I should note our new friends have shadows, a trailing vortex of perfume that arrives just behind as they enter the room. These are their degrees of freedom, a topic of mystery and inscrutability that no doubt my stilted prose and poor spelling are doing nothing to dispel. So know this: The total degrees of freedom (DFT) equals n-1, the regression degrees of freedom (DFR) equals 1, and the error degrees of freedom (DFE) equals n-2. We then construct the mean square regression (MSR) as SSR/DFR and the mean square error (MSE) as SSE/DFE.

Footnote: The way I remember the regression degrees of freedom equals one is that a line is defined by two things (slope and intercept), so subtract one. Voila! DFR = 1.
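
Footnote: In Python, with the same made-up data as before, the new sums of squares and their mean squares look like this (a sketch, assuming numpy). The identity SST = SSR + SSE makes a pleasant sanity check while we're here.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()

    beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    alpha_hat = y_bar - beta_hat * x_bar
    y_hat = alpha_hat + beta_hat * x       # values predicted by the line

    SSE = np.sum((y - y_hat) ** 2)         # error (residual) sum of squares
    SSR = np.sum((y_hat - y_bar) ** 2)     # regression sum of squares
    SST = np.sum((y - y_bar) ** 2)         # total sum of squares (née SSyy)
    print(np.isclose(SST, SSR + SSE))      # True

    DFR, DFE = 1, n - 2                    # regression and error degrees of freedom
    MSR, MSE = SSR / DFR, SSE / DFE        # mean squares
    print(MSR, MSE)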

We Just Say Bingo

All the pieces are now on the board. It's time to see if we can make a bingo.

We test if the slope of our regression line is zero. That is, our null hypothesis is H0: β = 0. We need to know if β is zero, for if β is zero then there is no linear relationship between x and y -- every x is just as good at predicting a given y as any other x, and that is wholly uninteresting. There may be a nonlinear relationship, but linear regression is mute on such matters.

Footnote: Note we're testing β not β-hat. Always always remember we are interested in what we can say about the population, not the sample.

As in any t-test, the quantity of interest is the ratio: statistic / standard error of the statistic. Ergo we test:

    t = β-hat / se(β-hat)

where se(blerg) is my notation for "standard error of blerg."

To give you a glimpse of the brass ring, what we seek to show is that instead of the t-test above, we can alternatively test the slope of the regression line using an F-test. Instead of testing the ratio β-hat / se(β-hat) using a t-test, we test the ratio MSR / MSE using an F-test.

I now show the equivalence of the t-test and F-test in linear regression, and how you get one from the other. We begin by noting two useful facts:

    Useful fact #1: se(β-hat) = √ ( MSE / SSxx )

    Useful fact #2: SSR = β-hat ⋅ SSxy

A derivation of these facts here would derail my train of thought, approaching the station as it were, so I derive them elsewhere (see Appendix A and Appendix B).
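
Footnote: If you would like numerical reassurance before trusting the appendices, here is a sketch (made-up data; numpy and scipy assumed) that checks Useful Fact #1 against the slope standard error reported by scipy's linregress, and checks Useful Fact #2 directly:

    import numpy as np
    from scipy.stats import linregress

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()

    SS_xx = np.sum((x - x_bar) ** 2)
    SS_xy = np.sum((x - x_bar) * (y - y_bar))
    beta_hat = SS_xy / SS_xx
    y_hat = (y_bar - beta_hat * x_bar) + beta_hat * x

    SSE = np.sum((y - y_hat) ** 2)
    SSR = np.sum((y_hat - y_bar) ** 2)
    MSE = SSE / (n - 2)

    se_beta = np.sqrt(MSE / SS_xx)                        # Useful Fact #1
    print(np.isclose(se_beta, linregress(x, y).stderr))   # True
    print(np.isclose(SSR, beta_hat * SS_xy))              # Useful Fact #2: True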

Now we climb in earnest. Insert Useful Fact #1 into the expression for the t-test:

    t = β-hat / √( MSE / SSxx )

Square both sides:

    t² = β-hat² / [ MSE / SSxx ]

The SSxx in the denominator of the denominator flips up to the numerator like Olga Korbut on the uneven parallel bars:

    t² = β-hat ⋅ β-hat ⋅ SSxx / MSE

Note I have expanded β-hat² as β-hat ⋅ β-hat. But the normal equations tell us β-hat = SSxy / SSxx. Substitute this for one of the β-hats:

    t² = β-hat ⋅ [ SSxy / SSxx ] ⋅ SSxx / MSE

The SSxx cancel. We're left with:

    t² = β-hat ⋅ SSxy / MSE

Apply Useful Fact #2:

    t² = SSR / MSE

Divide the numerator by 1. This may seem pointless, but humor me:

    t² = [ SSR / 1 ] / MSE

We know that MSR = SSR / DFR and DFR = 1. So the pointless division by one allows a final crafty substitution:

    t² = MSR / MSE

The quantity MSR / MSE is a variance ratio, and a variance ratio is described by an F-distribution. So we can write:

    t² ~ F

and we are done.

Postscript

What does t² ~ F mean? The quantity β-hat / se(β-hat) has a t-distribution, and we can test the value we obtain using a t-test. However, we can also square the value we obtain and test that value using an F-test. Alternatively, we can compute MSR / MSE, for it will give us the same number as the square of β-hat / se(β-hat). By "test the value" I mean compare the value to a number we look up, either in a table at the back of a book or using fancy statistics software. To do so you will need the degrees of freedom. For the t-test, the degrees of freedom is n-2. The F-test requires two: the "numerator" dof = 1 and the "denominator" dof = n-2.
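
Footnote: Here is the whole punchline in numbers, a sketch assuming numpy and scipy and the usual made-up data: compute t and F, confirm t² equals F, and confirm the two tests hand back the same p-value.

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()

    SS_xx = np.sum((x - x_bar) ** 2)
    SS_xy = np.sum((x - x_bar) * (y - y_bar))
    beta_hat = SS_xy / SS_xx
    y_hat = (y_bar - beta_hat * x_bar) + beta_hat * x

    SSE = np.sum((y - y_hat) ** 2)
    SSR = np.sum((y_hat - y_bar) ** 2)
    MSE = SSE / (n - 2)
    MSR = SSR / 1                                  # DFR = 1

    t = beta_hat / np.sqrt(MSE / SS_xx)            # the t-test statistic
    F = MSR / MSE                                  # the F-test statistic
    print(np.isclose(t ** 2, F))                   # True

    p_t = 2 * stats.t.sf(abs(t), df=n - 2)         # two-sided t-test, n-2 dof
    p_F = stats.f.sf(F, dfn=1, dfd=n - 2)          # F-test, (1, n-2) dof
    print(np.isclose(p_t, p_F))                    # True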

You may have noticed I didn't have much to say about α-hat. It turns out the y-intercept is just not all that interesting. Sometimes it is, and there is a test for it, but that is a story for another day.

Equivalence of the t-test and F-test in linear regression, and how you get one from the other.



Appendix A -- Deriving Useful Fact #1

We wish to demonstrate:

    se(β-hat) = √ ( MSE / SSxx )

In words: the standard error of the estimate of the regression slope is equal to the square root of the mean square error divided by the x sum of squares.

It's possible to derive this result by symbol pushing. I suspect some of you would enjoy that. Instead, I'm going to derive it with my brain. I don't claim this is a proper derivation, but you will likely find it easier to internalize and remember. What is more important in mathematics? Symbol pushing or understanding? You would not have come to LabKitty if you did not already know the answer to that question.

"Standard errror" is the standard deviation of a statistic. Standard deviation is the square root of variance, so let us traffic in variance and take the square root at the end.

An ordinary variance is the spread around the mean. Presently, this is the spread above and below y-bar in the vertical direction. We might write this as σ²y to so indicate. However, we need to work a regression line into the story.

The regression line soaks up a bunch of variance, so what we really need is the idea of the variance of y after taking into account the dependence of y on x. Some authors invented a nifty symbol to indicate this: σ²y.x (note the dot separating y and x so the notation doesn't look like some mutant cross-correlation something).

Remember we calculate a variance σ²y as SSyy / n or, to be pedantic, SSyy / (n-1) or, to be irritatingly pedantic, SSyy / DFyy. I pray it does not take much convincing that we calculate σ²y.x as SSE / DFE. (Remember, SSE is the squared error of y above/below the regression line, analogous to SSyy being above/below the mean.)

We already have a name for SSE / DFE -- that is precisely MSE. So you might think se(β-hat) = √ MSE. But something is missing from this idea: we need to account for the x spread of the data. Sure, σ²y.x is the variance of y after taking into account the dependence of y on x, but x needs a say in the matter. It would be meaningless to compare a value of σ²y.x taken from data spread over a millimeter in x to one taken from data spread over a light year.

The x data spread is quantified as SSxx. Ergo, we normalize MSE by SSxx. Ergo, se(β-hat) = √( MSE / SSxx ).

QED, more or less.
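
Footnote: If the brain derivation leaves you wanting corroboration, here is a small simulation sketch (everything invented for illustration; numpy assumed). We sample y from a known cosmic line many times, fit the slope each time, and compare the observed spread of β-hat to √( MSE / SSxx ):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 30)
    SS_xx = np.sum((x - x.mean()) ** 2)
    alpha, beta, sigma = 1.0, 2.0, 1.5        # the cosmic intercept, slope, and noise

    slopes, se_estimates = [], []
    for _ in range(5000):
        y = alpha + beta * x + rng.normal(0, sigma, size=x.size)
        b = np.sum((x - x.mean()) * (y - y.mean())) / SS_xx
        y_hat = (y.mean() - b * x.mean()) + b * x
        MSE = np.sum((y - y_hat) ** 2) / (x.size - 2)
        slopes.append(b)
        se_estimates.append(np.sqrt(MSE / SS_xx))

    # The observed spread of beta-hat should match the formula
    # (both land near sigma / sqrt(SS_xx))
    print(np.std(slopes), np.mean(se_estimates))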

Appendix B -- Deriving Useful Fact #2

We wish to demonstrate:

    SSR = β-hat ⋅ SSxy

I have already done so elsewhere, so I shan't reproduce those results here. I'm told making linky in your website blog posting increases its Google page ranking. However, I was told including relevant search terms in the post text (e.g., equivalence of the t-test and F-test in linear regression, and how you get one from the other) does also, although it presently appears to be doing blerg for mine. WHEN IS IT LABKITTY'S TURN, GOOGLE? WHEN?? LabKitty runs upstairs crying, slams bedroom door.

Equivalence of the t-test and F-test in linear regression, and how you get one from the other.
