LabKitty: Why is the Negative Binomial called the Negative Binomial?

One bumps into the negative binomial distribution fairly early in probability and statistics. Some tormentor is flipping a coin or whatever, and we seek to know how many flips it takes to get to the center of the Tootsie Pop. It's not terribly complicated, the neg nomy, which is a welcome relief, probability frequently making about as much sense as a Mozart opera.

However, the question came up the other day: where does the moniker come from? That is to say, why is the negative binomial called the negative binomial? Sure, the "binomial" part makes sense if you look at the formula, but there's really nothing negative about it. What gives?

Here is a question that can surely be answered by Google.

A two-minute Google search after which we can all get on with our lives.

Two minutes.

Tops.

This can only mean one thing: I am now filled with rage.

Great googly moogly. One would think you'd type "why is the negative binomial called the negative binomial" at Google and out would pop the answer. Because X, where X is "unicorns" or the earned income credit or whatever. But no. Instead, you get a bunch of Oracle of Delphi amicus briefs. Be plain, good son, and homely in thy drift, Friar Laurence admonished Romeo (his admonishment, ironically, neither plain nor homely in drift). And apparently lost on those who dwell in the Venn diagram intersection of "knows the answer" and "writes it on the interwebs."

Alas, it once again falls to LabKitty to be the god of the gaps. The Internet would revel in our double suicides, you and I, and we're not going to give them the satisfaction. Note to the harried: I provide a TL;DR at the end of all this, if you'd like to skip over the 5,000 word explanation I sweated blood to provide to you free-of-charge. Sniff.

First, a brief recap of the negative binomial. Three word mnemonic: repeat until n. We have independent Bernoulli trials with probability of success p. We seek the probability that the n-th success occurs on trial k. A plain language example: We have a coin. Let p be the probability of the coin landing heads (for a "fair" coin, p = 0.5). We ask: how many flips does it take to get n heads? For example, if n = 1, we are asking how many flips does it take until heads shows up for the first time. Easy peasy.

Converting all this talk of numbers into numbers we get the negative binomial distribution:

   P( X = k ) = _k–1C_n–1 pⁿ q^k–n ; k = n, n+1, ...

Here, q = 1–p and _k–1C_n–1is the combination of k–1 things taken n–1 at a time. The latter is the binomial coefficient, which you get from a button on your calculator or Pascal's triangle or compute longhand: _fooC _bar= foo! / [ bar! (foo–bar)! ]. It's often written using a big bracket but I don't know how to do that in HTML so I've used the big-C notation. X is the quantity of interest -- the number of trials until n-th success -- and it is a capital letter because it is a random variable. The number that the formula poops out given k is the probability that X will equal k. It's impossible to get n successes with fewer than n attempts; that's why k starts at n and goes up.

Footnote: It's easy to make a mistake writing down the formula, what with all the k's and n's and whatnot. Two sanity checks: (1) in any _fooC _bar you must have foo ≥ bar (you can't take a combination of more things than there are things, (2) each trial is either a success or a failure, so the exponents on p and q must add to k, the number of trials that occurred. If the expression you've written down violates either of these, something is wrong.

Footnote: to understand where the above expression comes from, note it's just k–1 successes in the first n–1 trials and a final success which terminates the experiment. It's easier to see this if you separate out a factor of p:

   P( X = k ) = { _k–1C_n–1 p^n–1 q^k–n} p

However, you usually find the first version of the formula in textbooks.

As pointed out before the jump, although the binomial coefficient appears here, there doesn't really seem to be anything negative about it. What gives?

To answer that, we need to use a different form of the distribution. Instead of asking "how many trials does it take to get n successes?" we ask "how many failures happen before n successes?" Pause and convince yourself that these descriptions are interchangeable. For example, if it took 10 trials to get 6 successes, that is the same thing as saying there were 4 failures before we got 6 successes.

When recast from this perspective, the distribution takes on the following form:

   P( X = k ) = _n+k–1C_k pⁿ q^k ; k = 0, 1, ...

I probably should have used a different variable name here (Y instead of X, say). Before, X was the number of trials before n successes. Now, X is the number of failures before n successes. But we're describing the same situation (in fact, you can get the new X by subtracting n from the old X) and I think you're smart enough to keep things straight.

Footnote: Just to make things even more confusing, some sources (including Wikipedia) define this form using successes before failures. So their p and q are reversed.

The important change here is the values that k now takes on (aka the "support" of the distribution). In the first version, k started at n; in the new version k starts at zero. This is handy when dealing with infinite series, as it often makes expressions easier to simplify. We won't use that factoid here, but it's why many authors prefer the second version.

We will return to this expression in a moment. But first, a brief refresher on the binomial formula and binomial series.

A Brief Refresher on the Binomial Formula and Binomial Series

You hopefully recall that if we have an expression (x + y)ⁿ where n is a positive integer, there is a handy trick for multiplying out the mess. That trick is the binomial formula. It looks like this:

   (x + y)ⁿ = Σ_k=0,n  _nC_kx^k y^n–k

The sum on k goes from zero to n -- my odd notation is more of my HTML stupidness (the n is usually written as a superscript on sigma). As before, _nC_k are binomial coefficients: _nC_k = n! / [ k! (n–k)! ]. The expression works for any value of x and y and for any positive integer n. It's just polynomial multiplication.

In what follows we don't need (x + y) -- we'll need (x + 1) or, equivalently, (1 + x). So we may as well do that now. Setting y = 1 in the above we get:

   (1 + x)ⁿ = Σ_k=0,n _nC_kx^k

The y terms vanished because they are all equal to one.

A cool thing about the binomial formula (discovered by Newton) is that -- slightly modified -- it works for any n, not just positive integers. Let's write this as:

   (1 + x)^r = Σ_k=0,∞ _rC_kx^k

I'm using "r" now instead of "n" to emphasize the fact we're raising the expression to a real power. The obvious modification is the sum now goes to infinity. For that reason we call this the binomial series rather than the binomial formula. The not-so-obvious modification is that _rC_kare generalized binomial Coefficients:

_rC_k= [ (r)(r–1)(r–2)...(r–(k–1)) ] / k!

This formula gives the same thing as r! / [ k! (r–k)! ] when r is a positive integer, which is how we defined C earlier. But if r is not a positive integer then r! isn't defined so the old definition doesn't work and we have to use the generalized form. Most authors write the last term as r–k+1 but I like it my way because it tells you at a glance how many terms you multiply together.

Footnote: An important difference between the binomial formula and the binomial series is that the formula works for any x but the series only works for | x | ≤ 1. You may recognize this as the radius of convergence which I will not get into here except to point it out. When we finally get around to applying the binomial series, our x will be a probability so indeed we will have | x | ≤ 1 because probabilities are always between zero and one.

The Negative Binomial Series

Let's flip the LHS and RHS in our binomial series with y = 1:

   Σ_k=0,∞ _rC_kx^k = (1 + x)^r<

This might seem silly to write as a separate step, but it emphasizes how we typically use these kinds of expressions. We have some ugly infinite summation and we would like to find a simple expression that represents it. The useful thing is that we're allowed to tweak various parameters and the expression remains valid as long as we do the same thing to both sides. For example, we can replace x with –x:

   Σ_k=0,∞ _rC_k–x^k = (1 + (–x))^r = (1 – x)^r

Or, we can replace r with –r:

   Σ_k=0,∞ _–rC_kx^k = (1 + x)^–r

This muy importante series is called the negative binomial series. That's a sensible moniker, as the series clearly has binomial and negative thingys in it.

We combine both tweaks to get an expression we'll use later:

   Σ_k=0,∞ _–rC_k(–x)^k = (1 – x)^–r

We are now ready to explain what this has to do with the negative binomial distribution.

What this has to do with the Negative Binomial Distribution

Recall our second form of the negative binomial distribution, the one describing number of failures before n successes:

   P( X = k ) = _n+k–1C_k pⁿ q^k ; k = 0, 1, ...

First, observe that _n+k–1C_k = (–1)^k _–nC_k . There is nothing deep about this; it's simply a happy coincidence of arithmetic. Here, have an example using n = 5 and k = 3:

_5+3–1C₃ = ₇C₃
= [ (7)(7–1)(7–2) ] / 3! = 210 / 6 = 35

(–1)³_–5C₃
= (–1) [ (–5)(–5–1)(–5–2) ] / 3!
= 210 / 6 = 35

Using this coincidence, we rewrite the negative binomial distribution as:

   P( X = k ) = (–1)^k_–nC_k pⁿ q^k ; k = 0, 1, ...

Now, write down the normalization condition for the distribution, which is the sum of the probabilities of all the possible outcomes (these had better sum to one and I will show they do so keep your fur on):

Σ_k=0,∞P( X = k )

= Σ_k=0,∞ (–1)^k_–nC_k pⁿ q^k

= pⁿ Σ_k=0,∞ _–nC_k (–q)^k

In the last step, I moved pⁿ outside the summation (because it doesn't depend on k), and I combined (–1)^k and q^k to get (–q)^k.

Now the punchline. Look at the part in bold in the expression above. It's a negative binomial series! We'll deal with what it sums to in just a moment. But it's a negative binomial series, is my point. Negative. Binomial. Series.

THAT's why the negative binomial distribution is called the negative binomial distribution.

Finishing Up

I suppose I should simply stop here and gloat, but we can do one more thing. What does the series sum to? Here, "q" plays the role of "x" and q is a probability so | q | ≤ 1. Therefore, the series converges, and as noted earlier it converges to (1 – q)^–n. Ergo:

pⁿΣ_k=0,∞ _–nC_k (–q)^k
= pⁿ(1 – q)^–n= pⁿp^–n= p⁰= 1

or Σ _k=0,∞P( X = k ) = 1, as required. We have satisfied the normalization condition. The sum of all possible values of any probability distribution must sum to one. And ours does (yay!).

Epilogue

And so, from this day forth, if Googling "why is the negative binomial called the negative binomial?" does not bring you to LabKitty (and I don't mean buried down in the results with the North Korean domain name squatters and the diet pills) then the Internet Gods are unjust and undeserving of your worship. Ya hear that, Google? [ LabKitty shakes paw at Google. ]

TL;DR: The negative binomial is called the negative binomial because if you rewrite the PMF in failures-before-success form, applying the normalization condition generates a negative binomial series.

Friday, July 11, 2014

Why is the Negative Binomial called the Negative Binomial?

No comments:

Post a Comment