Wednesday, April 12, 2017

A Primer on ANOVA

A LabKitty Primer is an introduction to some technical topic I am smitten with. They differ from most primers as to the degree of mathematical blasphemy on display. I gloss over details aplenty and mostly work only the simplest cases. I'll show mistakes and dead ends and things any normal person would try that don't work. Also, there's probably a swear word or two. Mostly I want to get you, too, excited about the topic, perhaps enough to get you motivated to learn it properly. And, just maybe, my little irreverent introduction might help you understand a more sober treatment when the time comes.

Recently, I've been reading various and sundry explanations of ANOVA because reasons. These are bugging me and I can't quite put my finger on why. The people on the other end of the word processor all seem nice. They're competent enough, I guess. Good intentioned. The material is accurate. Nicely formatted. Pleasing fonts. Still, it's all so, I dunno, labor intensive. Too many equations. Too much effort.

The logic of ANOVA should just fall out of it. Effortlessly. Like a toddler out of Michelle Duggar. You shouldn't have to work for it, is my point. Instead, all I encounter are rat's nests of variance partitions and degree-of-freedom Sudoku and something called R. Sure, you can memorize the formulas. That'll get you through your stats course. Probably. But wouldn't it be nice to Understand with a capital U? The logic of ANOVA should just fall out of it. Effortlessly. Like a toddler out of Michelle Duggar (that joke is bound to offend somebody, so I'm going to go ahead and double down on it up front. NO FEAR).

Let's see if we can't derive ANOVA using a minimal amount of thinking. Clearly, we are well on our way.



The ANOVA of no ANOVA

Suppose I give you a sample, and ask you to tell me the variance of the population the sample was drawn from. No problemo, you say. You just compute the sample variance. Boom! You now have an estimate of the population variance.

However, consider two other ways of attacking the problem. These will seem kind of dumb, but bear with me.

First, you could divide up the sample into a bunch of "subsamples," find the variance of each subsample, then average the results. That is, for each subsample we compute its mean, sum the squared deviations of its values about that mean, and divide by the number of data points in the subsample (or, if you want points for style, divide by the number of data points minus one, because doing so gives you an "unbiased" estimate, which statistics wonks get excited about). Then take the average of the result obtained for each subsample. Voila! Another method for estimating the population variance.
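
If you prefer code to prose, here's Method #1 as a minimal sketch. I'm using Python and NumPy here rather than anything official, and the function name is mine:

    import numpy as np

    def within_groups_estimate(groups):
        # Method #1: average the variance of each subsample.
        # ddof=1 is the divide-by-(n-1) "unbiased" flavor mentioned above.
        return np.mean([np.var(g, ddof=1) for g in groups])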

Footnote: This game of making samples out of samples turns up in other applications. One variation is resampling randomly with replacement, which allows us to make an infinite number of subsamples. This is the idea behind the bootstrap method (sometimes called Markov chain bootstrap when it involves time series data). Bootstrap doesn't appear in anything that follows -- I think it's cool so I told you about it. That's pretty much the way LabKitty's brain works.

An even simpler way to estimate the population variance is to work with the subsample means. Recall the standard error of the mean: Sx̄ = σ/√n. You're probably familiar with using this when dealing with a single mean, which you divide by its standard error to get a Z-score in preparation for inference testing. Presently, however, we have a collection of means (one from each subsample) which we treat like data. We compute the standard deviation of the means about the grand mean (that is precisely what the "standard error" of a statistic is -- the standard deviation of the statistic). Instead of using the data to estimate σ and dividing by √n to get Sx̄, we're computing Sx̄ directly from a collection of means, which we then multiply by √n to estimate σ. Squaring then gives us an estimate of σ², the population variance. Easy peasy.
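
Here's Method #2 in the same hypothetical sketch (continuing the snippet above). It assumes all the subsamples are the same size, which is the only case we're worrying about today:

    def between_groups_estimate(groups):
        # Method #2: treat the subsample means as data. Their variance about
        # the grand mean estimates (standard error)^2 = sigma^2 / n, so
        # multiplying by the subsample size n recovers an estimate of sigma^2.
        n = len(groups[0])                     # assumes equal-sized subsamples
        means = [np.mean(g) for g in groups]
        return n * np.var(means, ddof=1)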

Here's a little graphic showing an example of our two dumb methods in action.

[Figure: ANOVA example 1]

I've divided the data into four subsamples. In the top plot the dots are the data, the horizontal lines are the mean of each group, and the vertical lines show the variation of the data about each mean. (I spread out the data horizontally so you can more easily see what each group contains -- you're probably used to seeing data like this plotted vertically for each group, or perhaps summarized using a box plot. Whatever.) In the lower plot, the big dots are the group means and the horizontal line is the grand mean. The population variance estimated by each method is shown in the title. The data were drawn from a Gaussian-distributed population having a mean of ten and a variance of one (I used Matlab's randn function). As you can see, both methods come pretty close to nailing the answer.
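
If you want to play along at home, here's roughly the same setup in NumPy, continuing the sketch above. The figure was made with Matlab's randn; the group size of ten is my own choice, and your numbers will come out a little different, because random:

    rng = np.random.default_rng()
    data = 10 + rng.standard_normal(40)       # Gaussian, mean 10, variance 1
    groups = np.split(data, 4)                # four arbitrary subsamples of 10

    print(within_groups_estimate(groups))     # Method #1: somewhere near 1
    print(between_groups_estimate(groups))    # Method #2: also near 1 (noisier,
                                              # since it only has 4 means to chew on)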

Both methods work, is the take-home. But why would anyone do this?

Believe it or not, we're one small step away from ANOVA.

The payoff comes when the subsamples (as I have called them) aren't arbitrary partitions of our original data but instead result from some kind of meaningful grouping. Perhaps we defined our subsamples according to some property of the data. For example, we categorized our subjects into primates, ungulates, rodents, and snowglobes then measured their land speed. Alternatively, perhaps the subjects in each group are the same but we manipulate each group differently. For example, we fed each subsample group a different diet then measured the plumpness of each subject after six weeks.

Suppose we strike gold with our subsampling. Suppose the data look like this:

[Figure: ANOVA example 2]

The mean of one of the groups (the first one) is wonky. (This is a Good Thing. In science, we like when a treatment or a clever way of looking at the data demonstrates some effect, because that's how you get grant money.)

Now comes the critical question: What happens when we apply our two variance calculation methods to this data?

Nothing changes in Method #1 because it computes variance using the respective subsample mean. It doesn't matter what that mean is -- 10, 5, -π, eleventy million. Method #1 is blind to the subsample mean.

Method #2 is a different story. It treats the subsample means as data, and estimates the population variance by examining the spread of these values around the grand mean. Method #2 is not blind to the subsample mean. Method #2 is going to notice something is up.

The numbers bear this out (see plot titles). Using the data shown in the first figure, the two methods gave population variance estimates of 1.05 and 0.99. Using the new data, the two methods give values of 1.05 and 38.5 (numerically, the data in the second figure is the same as the data in the first figure except I've simply offset the first group by subtracting five from each point). Now, Method #2 gives a much larger estimate of the population variance than Method #1. Hopefully you can convince yourself a similar thing would happen if group two or three or four were the wonky group instead of one, or if some combination were.
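
In terms of the little NumPy sketch from earlier, the wonkiness plays out like this (again, just a sketch -- the offset of five mirrors what I did to make the second figure):

    shifted = [groups[0] - 5.0] + list(groups[1:])   # offset the first group by 5
    print(within_groups_estimate(shifted))           # identical to before -- shifting
                                                     # a group doesn't change its spread
    print(between_groups_estimate(shifted))          # blows up -- the means now disagree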

Hmm. How odd. When all the group means are the same, both methods tend to give the same estimate of the population variance. But when at least one of the group means differs, the two variance estimates also differ. We have a tool for identifying group differences.

Well, we almost have a tool for identifying group differences. We still need some way of dealing with the randomness that infects our data (and so our calculations), like we deal with randomness in any statistical method. If we were to rerun the analysis with a different sample, the numbers would be different but the overall effect should (hopefully) still be apparent. That's just the nature of sampling. What to do?

The crux of the problem is deciding if the two variance estimates are the same. In math, there's two ways to show this is the same as that. We can show the difference between this and that is zero. Or, we can show the ratio of this to that is one. We use the latter. It turns out the ratio of two variances has a statistical distribution called the F distribution. We can play the same game with variance ratios using the F distribution as we play with, say, means using the t-distribution: We might observe a ratio larger than one by chance, so we define how often we're willing to be wrong (i.e., willing to make a Type-I error). This defines a critical value of F, which we compare to our computed ratio, either using software or table-lookup at the back of the book. If our computed value exceeds the critical value we reject the null hypothesis. We conclude the means of the groups are not all the same.
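
Here's that bookkeeping as code, continuing the sketch. SciPy supplies the F distribution; the degrees-of-freedom recipe I'm using here gets explained a bit later:

    from scipy import stats

    alpha = 0.05                  # how often we're willing to be wrong (Type-I)
    k, n = 4, 10                  # number of groups and points per group, as above
    df_between = k - 1
    df_within = k * n - k

    F = between_groups_estimate(shifted) / within_groups_estimate(shifted)
    F_crit = stats.f.ppf(1 - alpha, df_between, df_within)
    print(F > F_crit)             # True for the shifted data: reject the null hypothesis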

Congratulations, you have just invented ANOVA.

Bring the Noise

Technically, you have just invented one-factor ANOVA (there's one factor that's different between our groups -- species, diet, whatever. Ergo, one-factor ANOVA.) There's lots of other ANOVAs out there (e.g., two factor, n-way, nested, balanced, unbalanced, randomized blocks, Latin square, the Model I, the Model II, the Model III, equal replication, proportional replication, disproportional replication, no replication, Kruskal-Wallis, Tukey (yes, the FFT guy), Newman-Keuls, Scheffé, Dunnett, Bonferroni, Nemenyi). Luckily, I don't have to get into any of these because Primer. However, it would be irresponsible if I didn't at least connect you to the way ANOVA is presented in textbooks, even though that will besmirch our elegant and simple derivation with crass and confusing minutiae. Such is life.

First, we need to introduce proper terminology. There will be no more of this "subsample" talk. The feature used to segregate the data (e.g., species) is called a factor and each value of the feature (e.g., primate, ungulate, rodent, snowglobe) is called a level of the factor. Because Method #1 uses numbers from within a single group (albeit repeating these calculations for each one), we call it the Within Groups variance or simply Within. Because Method #2 compares means across groups, we call it (wait for it) the Between Groups variance or simply Between (yes, it should really be "Across" or "Among," but statisticos aren't much for proper grammar).

Unfortunately, you'll come across authors who use completely different terminology. Some use the term treatment instead of factor. Also, instead of Between, some use Treatment or Model. Finally, instead of Within, some use Error. This requires a little explaining: Let xgi be the i-th data point in the g-th treatment group. It's just a number. But think about what went into it. The number "wants" to be equal to the treatment mean, but dark forces conspire to yank it about. So we can write: xgi = μg + εgi, where μg is the mean of the g-th treatment group and εgi is a random error that does the yanking. Recall the Within Groups measurement (Method #1) is blind to the group mean. Therefore, it only measures εgi -- the error. Ergo, some authors call it Error instead of Within.

Footnote: Some authors use Residual instead of Error.

Footnote: Okay, since I let the subscript genie out of the bottle, I better mention some authors don't write xgi = μg + εgi, they write xgi = μ + τg + εgi, where μ is the grand mean over all the data and τg is the effect of the g-th treatment. The first approach is called a cell means model and the second a factor effects model. Although there's subtle differences between the two approaches, for all intensive porpoises today you may consider them to be equivalent.

Footnote: Sigh, more terminology. You are bound to run into fixed effects and random effects models. Long story short: In a random effects model, you pick the factors but not the levels of them. A fixed effects model determines, say, whether people are fatter in New York versus London versus Circle Pines. A random effects model examines whether "city" has an effect on weight, and you happened to pick New York, London, and Circle Pines at random.

The number of thingies that go into each calculation also has a name: degrees of freedom (dof). Recall you have to knock the dof down by one for each previously-calculated statistic you reuse in a calculation. For example, if ten data points go into computing a variance, the calculation only has nine dof because we reused the mean in the calculation. I assume you've run into this before.

Finally, we need a formal statement of the null hypothesis. Simple enough: Our null hypothesis is that the mean of all the groups is the same. In symbols: H0: μ1 = μ2 = ... = μk, where k is the number of treatments (yes, even though we measure variances, we are testing means). Method #1 and Method #2 both estimate the same thing -- the population variance. Both should give the same estimate as long as there are no treatment effects. That is, both should give the same estimate assuming the null hypothesis is true. If they don't, we start suspecting something is wonky. And if they really really don't (i.e., F > critical value), we conclude that we should reject H0.

Footnote: Testing the null hypothesis is aka the omnibus test. If we find something is askew, we naturally ask which treatment group(s) were the culprit. This question opens a spectacular can of ANOVA worms called post-hoc testing. Perhaps someday I will tell you about post-hoc testing. That day is not today.

God. Damn. It.

So far all we've suffered is a bunch of rebranding, a capitulation to an all-powerful shadowy statistics cabal and their preferred terminology. Unfortunately, our woes don't end there. Textbook ANOVA presentations invariably arrange the calculations to make it appear like you're working with a ratio of two variance averages rather than a ratio of two population variance estimates, even though the numbers are the same (it comes down to grouping terms differently). Textbook presentations usually also rearrange the calculations into a form that minimizes the number of calculator button presses required to get to the answer. These equations have even less connection to population variance estimates. I suppose the speed-up was worth the obfuscation back when we were racing the Russkies to an A-bomb or the moon and all the calculations were done by hand. But for whatever reason, this approach persists in textbooks to this day.

Yes, it makes no sense. No, it's not my fault. But we have to deal with it. Sausage for the Sausage God.

The best way I can show you this weirdness is to show you an example. We may as well drag Wikipedia into things. Here's example data from their page on one factor ANOVA. There are three treatment groups:

   treatment 1: 6 8 4 5 3 4
   treatment 2: 8 12 9 11 6 8
   treatment 3: 13 9 11 8 7 12

Here's our two methods working over this data:

[Figure: ANOVA example from Wikipedia]

As shown in the titles, our two population variance estimates are 4.53 and 42. Let's examine the calculations that generated these numbers.

For Method #1, we compute three population variance estimates, one from each group, then average them. The means of the three groups are μ1 = 5, μ2 = 9, and μ3 = 10. The population variance estimate for a group is given by σ² = Σi (xgi − μg)² / (n − 1), where n = 6 (note we're estimating a population parameter and so divide by n − 1 rather than n to obtain an unbiased estimate). Applying this formula, we obtain 16/5, 24/5, and 28/5. The average is (16/5 + 24/5 + 28/5) / 3 ≈ 4.53. This number is what the Wikipedia page calls "mean square within" (MSW).

For Method #2, we obtain a second estimate of the population variance using the standard error of the three group means, Sx̄ = σ/√n. Rearrange this and square to obtain σ² = n ⋅ S²x̄, where S²x̄ is the variance of the group means and n = 6 is the number of data points in each group. The grand mean is (5 + 9 + 10) / 3 = 8. We compute S²x̄ as the sum of squared deviations of the three group means about the grand mean, divided by the number of means minus one: S²x̄ = [(5 − 8)² + (9 − 8)² + (10 − 8)²] / (3 − 1) = 14/2 = 7. So σ² = n ⋅ S²x̄ = 6 ⋅ 7 = 42. This number is what the Wikipedia page calls "mean square between" (MSB).

Dividing the second by the first, we obtain a variance estimate ratio of 42/4.53 ≈ 9.3. Using the F-distribution table in the back of my trusty copy of Zar, I find a critical value of 3.68 for an α of 0.05 (the degrees of freedom are 2 and 15 -- see Footnote). Because 9.3 > 3.68 we reject H0. At least one of these treatments ain't like the others (my money is on treatment 1).
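
Here's the whole Wikipedia example as a self-contained sketch, with SciPy's canned one-way ANOVA routine tacked on as a sanity check (it reports the same F, plus a p-value):

    import numpy as np
    from scipy import stats

    t1 = np.array([6, 8, 4, 5, 3, 4])
    t2 = np.array([8, 12, 9, 11, 6, 8])
    t3 = np.array([13, 9, 11, 8, 7, 12])
    groups = [t1, t2, t3]

    msw = np.mean([np.var(g, ddof=1) for g in groups])            # Method #1: ~4.53
    msb = len(t1) * np.var([g.mean() for g in groups], ddof=1)    # Method #2: 42
    F = msb / msw                                                 # ~9.3

    F_crit = stats.f.ppf(0.95, 2, 15)                             # ~3.68, same as the table
    print(F > F_crit)                                             # True: reject H0

    print(stats.f_oneway(t1, t2, t3))                             # same F, plus a p-value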

Footnote: The degree of freedom for the numerator of F is always one less than the number of treatments. In this example, that number is 3 − 1 = 2. The degree of freedom of the denominator is the total number of degrees of freedom minus the degree of freedom of the numerator. In this example, that number is (6⋅3 − 1) − 2 = 15.

Footnote: You may have noticed our variance estimates have the form of a sum divided by the number of thingies in the sum. In other words, an average. Or, in yet other words, a mean -- specifically, a mean sum of squares. This is where the term "mean square" that statistics jocks bedizen all over their ANOVA calculations comes from, even though it's really a variance estimate.

Here's a link to the Wikipedia page, which shows the standard textbook ANOVA calculations in detail. The two tables therein are particularly illuminating, by which I mean not illuminating, since that is the point I'm trying to make about standard textbook ANOVA calculations.

And there you have it. The sensible way and the dopey way to do ANOVA. Decide for yourself which is which.

Recommended Reading

I usually conclude a LabKitty Primer with a few leads into the serious literature. I'm at a loss here because I don't much care for any of it (cf. Introduction). I started to feel bad about this (i.e., being mean to statistics wonks) but then I remembered something. I remembered the dirty little secret of ANOVA.

I glossed over an issue during our time together today and hoped you wouldn't notice. But it's time to come clean: Why bother with this mess? We already had a test for testing differences in means: the t-test. If we have more than two groups, we'll just t-test them two at a time. What's the big deal? Wherefore art thou, ANOVA?

The textbook response is to point out t-tests start to break down if you have more than two groups because of Type-I error. Each time you conduct a t-test, there is a chance you will reject a true null hypothesis by mistake. Not because you did anything wrong -- it's simply the nature of the beast.

So if you have three groups A, B, and C, you have a total of three pairings (A-B, A-C, and B-C) and three t-tests. If you have four groups A, B, C, and D there are a total of six pairings and six tests. And so forth. The chance of making at least one Type-I error goes up like 1 − (1 − α) raised to the blah-blah power, where blah-blah is the number of combinations of so-many groups taken two at a time. Dogs and cats living together. Mass hysteria.
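
For the record, here's the blah-blah as a throwaway sketch (it treats the pairwise tests as independent, which they aren't quite, but it gets the flavor across):

    from math import comb

    def familywise_error(k, alpha=0.05):
        # Chance of at least one Type-I error across all pairwise t-tests
        # of k groups, treating the tests as independent.
        m = comb(k, 2)                   # number of pairings
        return 1 - (1 - alpha) ** m

    print(familywise_error(3))           # ~0.14
    print(familywise_error(4))           # ~0.26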

This is one of the Great Bamboozles of statistics. The Type-I error thing is small potatoes. There are entire scientific fields that violate the multiple comparisons dealio every day -- and I mean burns the village, sows the fields with salt, listens to the lamentations of the women kind of violation. Take fMRI, for example. Those cool color pictures you've seen of brains doing stuff? They're based on t-tests of the individual pixels in the image (technically "voxels" not pixels, but the distinction isn't important for my ranting). Functional MRI doesn't deal with three or four or a dozen comparisons, it deals with tens of thousands. Guess which statistical test it uses. That's right: t-test. The t-test is modified, but it remains a t-test. There is not one published fMRI study that uses ANOVA. Go and look for one. Go on. I dare you.

Footnote: Here statistics wonks will begin to vibrate about all the wondrous things ANOVA can do that a t-test can't (e.g., two factor, n-way, nested, balanced, unbalanced, randomized blocks, Latin square, the Model I, the Model II, the Model III, equal replication, proportional replication, disproportional replication, no replication, Kruskal-Wallis, Tukey (yes, the FFT guy), Newman-Keuls, Scheffé, Dunnett, Bonferroni, Nemenyi). To which I reply: Oh, grow up. If you need this kind of firepower, learn GLM (a story for another time).

The whole ZOMG MULTIPLE T-TESTS!!11! panic that opened Pandora's box and loosed ANOVA upon an unsuspecting world always stuck in my craw. Back in undergraduate stats, I was keeping up pretty well up until ANOVA arrived. I think we all were. But the mathozoids couldn't leave well enough alone. So I'm not feeling particularly charitable today. Today, you textbook authors can go stand in the corner.

I'll let you know when you're done being punished.

