There was a time when I had committed to memory the formula for the Gaussian distribution (that time was the predawn hours leading to my Stats 101 final). And not the standard Gaussian. No, sir. The Full Monty, complete with non-zero mean and variance. I sorta kinda remember it to this day, but I always seem to be misplacing a square root or factor of two somewhere.
To refresh our memory, here is the Gaussian in all its poorly-formatted glory:
f_X(x) = [1/(σ√(2π))] ⋅ exp(−½ [(x − μ)/σ]²)
Describer of all things, the Gaussian distribution is. Hewer of stone, drawer of water. The classic bell-shaped curve. The peak is centered at the mean, denoted μ, and the spread on either side is set by the standard deviation, denoted σ (square it and you have the variance, σ²). However, a question arises: Where does the formula come from? There are lots of equations that give a bell-shaped curve. Why is the Gaussian what it is? What's the exponential doing in there, for Cat's sake? And what about the rest of it?
It is here that the mathozoids will cluck their tongues and adjust their pince-nezzes and smirk that the Gaussian is Simply and Properly Derived using moments or characteristic functions or the weak law of large numbers or whatever. To which I have two things to say. Number one: Shut up. And number two: Shut up.
Do you want to hear my charming math fable or not?
A Charming Math Fable (or not)
Suppose I hand you a coin and ask you to assign a probability to the two sides, heads and tails. You know that each probability must lie between 0 and 1 and that the two probabilities must sum to one, but beyond that you have no further information. Perhaps you would agree a reasonable choice in this case is to assign equal probability to each side. That is, p(heads) = p(tails) = 1/2, where p(wawa) is our way of writing the probability of the outcome wawa (pronounced "waa-waa").
The great mathematician Laplace called this approach to assigning probability The Principle of Insufficient Reason. (Although it may have been the great mathematician Lagrange or the great mathematician Legendre -- I do so often confuse them, what with their names all beginning with the same letter. But it was certainly not l'Hopital, for l'Hopital was a charlatan and a cheat. But that is a story for another time. And, yes, I know there's supposed to be diacritical frufru on l'Hopital: I'm omitting it to demonstrate my contempt.)
The Principle of Insufficient Reason (henceforth PoIR), which we used to assign probabilities to a coin having two possible outcomes, is equally applicable to assigning probabilities to a die having six, or a roulette wheel having 37 (38 in the colonies), or some general event having a million. In the absence of further information, we count the number of outcomes (call it "n") and assign to each possible outcome the probability 1/n.
Mathematicians call a distribution in which the same probability is assigned to each possible outcome a Uniform distribution. You have likely encountered this beast in your travels.
We now ask the question: What if we aren't completely ignorant about the coin or the die or the roulette wheel? Suppose we are asked to assign probabilities to the six faces of a die. We immediately reach for the PoIR and assign each face a probability of 1/6. But wait! Suppose we are told that someone rolled this die 100 times and that the mean of the 100 rolls was 4.5 (or 2 or 3.15 or whatever). Can we still use the PoIR? Well, unless they tell us the mean is 3.5, the answer is no. (Stop here and convince yourself that the Uniform distribution over the set { 1, 2, 3, 4, 5, 6 } gives a mean of 3.5. Ergo, if the reported mean is not 3.5, the Uniform cannot be the distribution.)
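If you don't trust the mental arithmetic, here is the check in two lines of Python (a minimal sketch; nothing needed beyond the standard library):

    # Mean of the Uniform distribution over the six faces of a die
    faces = [1, 2, 3, 4, 5, 6]
    print(sum(x / 6 for x in faces))  # 3.5 -- so a reported mean of 4.5 rules out the Uniform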
Is there a rational way to make use of such additional information? If so, how?
This seemingly-innocent question of how to use a priori information when assigning probabilities created one of the biggest rifts in the history of science. On one side are the Frequentists -- those who claim the only valid majick for assigning probabilities is counting how often each outcome occurs and dividing by the number of trials. On the other side are the Bayesians -- ethereal wizards who believe a priori knowledge literally alters the mathematical fabric of the universe. Their feud rages to this day. Rarely does a lunch go by when you don't see some poor Frequentist cornered in the faculty lounge by a gaggle of Bayesian nogoodniks, pocket protector a-kimble, its little garter-snake arms slapping ineffectively as they circle for the kill. The Bayesians will return to their offices only to find their livestock and graduate students violated in retaliation.
Dogs and Cats. Montagues and Capulets. Airbus and Boeing. We must add to this tragic roll call: Frequentists and Bayesians. It was as if all would lead to sorrow, so long as man refused to forget the past. Thus did King Arthur lament as the forces of Mordred gathered about his battle-camp.
Enter Claude Shannon.
It is 1948. While working at Bell Labs, Shannon publishes a paper that will change the world as we know it. In his A Mathematical Theory of Communication he connects probability with entropy. Shannon is not the first to do so -- that honor belongs to Ludwig Boltzmann, the brilliant physicist whose work in thermodynamics would eventually drive him to madness and suicide. But Shannon goes further. He generalizes the PoIR: The Principle of Insufficient Reason is replaced with entropy maximization.
For a moment, Laplace's corpse stirs in its grave. Then all is still. At last, Pierre has found peace.
Footnote: a purist might argue that Edwin Jaynes and not Claude Shannon should be credited with the maximum entropy principle. I'm just streamlining the narrative, like Peter Jackson leaving Tom Bombadil out of LoTR. It ain't like Blogger is paying me by the word (believe it or not).
Now we can proceed.
In any problem with an unknown probability distribution, Shannon tells us, the universe demands we assign the distribution having maximal entropy that is consistent with any given constraints. In mathematical lingo, this is called a constrained optimization problem.
It can be shown that the maximal entropy distribution having only the constraints that probabilities must be between 0 and 1 and must sum to 1 is the Uniform distribution. In this case, the Law of Maximum Entropy is equivalent to the Principle of Insufficient Reason.
It can be shown that the maximal entropy distribution having the constraints that probabilities must be between 0 and 1 and must sum to 1 and has a specified mean -- for a quantity confined to non-negative values -- is the Exponential distribution.
We are almost done.
It can be shown that the maximal entropy distribution having the constraints that probabilities must be between 0 and 1 and must sum to 1 and has a specified mean and a specified variance -- the quantity now free to roam the entire real line -- is the Gaussian distribution.
It can be shown that. It can be shown that. It can be shown that. Does that leave you unsatisfied? Alas, going further requires calculus. Like the making of laws and sausages, you may prefer to avert your eyes from such things. But coming this far and going no further would be like the time Donny Kerabatsos drove the guys into the city to enjoy the forbidden fruit of professional love and made LabKitty wait in the car.
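Before the calculus, though, a quick numerical peek at the die from earlier. The sketch below (my own illustration, assuming SciPy is installed; the mean of 4.5 is the made-up number from before) asks an off-the-shelf optimizer for the distribution on six faces with the largest entropy subject to a mean of 4.5. What pops out is decidedly non-uniform, with probability piled toward the high faces -- exactly the sort of answer the PoIR alone could never give us.

    import numpy as np
    from scipy.optimize import minimize

    faces = np.arange(1, 7)

    def neg_entropy(p):
        # minimizing sum(p*log(p)) is the same as maximizing the entropy -sum(p*log(p))
        return np.sum(p * np.log(p))

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1},            # probabilities sum to 1
        {"type": "eq", "fun": lambda p: np.sum(p * faces) - 4.5},  # specified mean of 4.5
    ]
    bounds = [(1e-9, 1.0)] * 6  # keep each probability strictly positive

    result = minimize(neg_entropy, x0=np.full(6, 1/6), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    print(result.x)  # probabilities climb from face 1 to face 6 -- not uniform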
By "entropy" Shannon meant "information entropy" which is exactly the same as or has absolutely nothing to do with the concept of entropy used in physics depending on who you talk to. According to Shannon, the information of a message is a measure of its probability of occurrence. A message you can predict in advance (probability = 1) contains zero information. If I already know what you're going to say, there's no point in you saying it (cf. right-wing talk show hosts, left-wing talk show hosts, my mom). Shannon showed this must be expressed as -log(p), where p is the probability of the message (0 ≤ p ≤ 1, so log(p) ≤ 0, so we include a negative sign out front to make the quantity positive because negative information seems weird).
A probability distribution describes the probability of a collection of values, so we need to think about a collection of messages. Shannon defined the entropy of a collection of messages (S) as their average information:
S = − Σₖ pₖ ⋅ log(pₖ)
where the index k runs over all possible values of the message (note this expression is just the expected value -- aka the average -- of the information −log(pₖ)).
Footnote: The use of log to define information likely seems bizarre to the casual reader. Briefly: Shannon assumed information should be positive, additive, and the information of a sure message should be zero. The logarithm is the only function that satisfies these requirements. You can use any log you want -- it just changes the units. In base-2 the unit is the bit (for "binary digit"), in base-e the nat (for "natural log"), and in base-10 the Hartley (for "Nina Hartley," the famous adult film star). See any information theory textbook for a more thorough explanation.
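For the concretely minded, here is the entropy of a couple of coins in a few lines of Python (a minimal sketch; only the standard library's math module is used, and the 0.9/0.1 coin is just a made-up example). The fair coin comes in at exactly one bit; the rigged coin at less, because its flips are more predictable:

    import math

    def entropy(probs, base=2):
        # S = -sum p*log(p); terms with p == 0 contribute nothing
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))               # 1.0 bit   (fair coin)
    print(entropy([0.9, 0.1]))               # ~0.47 bits (rigged coin)
    print(entropy([0.5, 0.5], base=math.e))  # ~0.69 nats (fair coin again, different units)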
We get an expression for entropy in the continuous case just as you might think we would: we simply trade summation for integration. Hoping you don't notice I'm glossing over just what the heck a continuous collection of messages means, and much more besides, we obtain:
S = − ∫ f(x) ⋅ log(f(x)) dx [1]
I have switched from probability mass function pₖ to probability density f(x), which I should write as f_X(x) but I think you get the gist. Also, anytime I leave limits off an integral -- which is always, because they are a pain in the formatting ass -- read it as "integration over all possible values of x." In [1] those limits are (−∞,+∞).
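As a sanity check on [1], here is the integral done numerically for a Gaussian density (a sketch, assuming NumPy and SciPy are available; the particular μ and σ are arbitrary choices). The answer agrees with the known closed form ½ log(2πeσ²) for the entropy of a Gaussian:

    import numpy as np
    from scipy.integrate import quad

    mu, sigma = 2.0, 1.5  # arbitrary choices

    def f(x):
        # Gaussian density with mean mu and standard deviation sigma
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # integrate -f*log(f); twelve standard deviations out is as good as infinity here
    S_numeric, _ = quad(lambda x: -f(x) * np.log(f(x)), mu - 12 * sigma, mu + 12 * sigma)
    S_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)  # known result for a Gaussian
    print(S_numeric, S_closed)  # both ~1.82 nats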
We need to find f(x) that maximizes S, subject to the following constraints:
(1) ∫ f(x) dx = 1 (aka normalization)
(2) ∫ x ⋅ f(x) dx = μ
(3) ∫ (x − μ)² ⋅ f(x) dx = σ²
In these expressions, μ and σ² are the specified mean and variance, respectively.
We could just start mashing equations at this point, but there's a slick trick we can use to simplify life. The mean of the distribution just shifts it left or right, and shifting f(x) left or right has no effect on [1]. (Try it! Substitute u = x − c in [1] and you'll get the exact same expression written in terms of u.) This means we can ignore constraint (2) while we optimize -- the particular value of μ has no effect on the entropy -- and, as it happens, the solution we obtain below is already centered at μ, so the mean constraint gets satisfied for free.
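If substitution is not your thing, the same point can be made numerically (same assumptions as the sketch above -- NumPy and SciPy -- with a Gaussian-shaped f as the stand-in, though any density would do): shift the mean, and the entropy doesn't budge.

    import numpy as np
    from scipy.integrate import quad

    def entropy_of_gaussian(mu, sigma):
        f = lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        S, _ = quad(lambda x: -f(x) * np.log(f(x)), mu - 12 * sigma, mu + 12 * sigma)
        return S

    print(entropy_of_gaussian(0.0, 1.5))  # same number...
    print(entropy_of_gaussian(7.0, 1.5))  # ...shifting the mean changes nothing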
Time to break out the Lagrange multipliers. Step 1: Form the Lagrangian. There are two constraints, so there will be two multipliers:
L = − ∫ f(x) ⋅ log(f(x)) dx
+ λ₁ [ ∫ f(x) dx − 1 ]
+ λ₂ [ ∫ (x−μ)² ⋅ f(x) dx − σ² ]
Step 2: Set all partial derivatives equal to zero. I'll only show the derivative with respect to f, because the derivatives with respect to the multipliers just recover the constraints. The weirdness here is we have derivatives of integrals. We will assume f(x) is well-behaved so we can move differentiation inside the integral as needed:
∂L/∂f = ∫ [ −log(f(x)) − 1 + λ₁ + λ₂ (x−μ)² ] dx
Footnote: Although ∂L/∂f looks like a standard partial derivative -- and we can indeed treat it as such to obtain a solution -- it's technically a functional derivative (f is a function not a variable). Putting this animal on firm mathematical ground requires the machinery of variational calculus which opens quite the can of worms. Fortunately, LabKitty recently went berserk and wrote a 10,000 word introduction to the subject which you should read if you're not doing anything else this weekend.
Setting ∂L/∂f = 0, and dragging out the standard trick -- the integral must vanish no matter how we perturb f, ergo the integrand itself must vanish -- we defuse much nastiness and obtain:
− log(f(x)) − 1 + λ₁ + λ₂ (x−μ)² = 0
or, exponentiating both sides:
f(x) = exp(−1 + λ₁ + λ₂ (x−μ)² )
Defining constants a = exp(−1 + λ₁) and −b² = λ₂, we can write this as f(x) = a ⋅ exp(−b²(x−μ)²).
Apply constraint #1 (and a table of integrals). ∫ f(x) dx = 1, so:
a ∫ exp(−b²(x−μ)²) dx = a√π / b = 1 [2]
Apply constraint #3 (and a table of integrals). ∫ (x−μ)² ⋅ f(x) dx = σ², so:
a ∫ (x−μ)² ⋅ exp(−b²(x−μ)²) dx
= a√π / (2b³) = σ² [3]
Solving [2] and [3] simultaneously, we find a = 1/(σ√(2π)) and b² = 1/(2σ²). Substitute these expressions into f(x) and witness the Gaussian emerge like a greasy calf wriggling free of its mother's birth canal:
f(x) = [1/(σ√(2π))] ⋅ exp(−½ [(x−μ)/σ]²)
Viola!, as dyslexic magicians say.
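If you left your table of integrals in your other pants, a computer algebra system will stand in. The sketch below (assuming SymPy is installed) evaluates the two integrals in [2] and [3] symbolically, solves for a and b, and hands back the Gaussian:

    import sympy as sp

    x = sp.symbols("x", real=True)
    a, b, mu, sigma = sp.symbols("a b mu sigma", positive=True)

    f = a * sp.exp(-b**2 * (x - mu)**2)

    I2 = sp.integrate(f, (x, -sp.oo, sp.oo))                # a*sqrt(pi)/b, as in [2]
    I3 = sp.integrate((x - mu)**2 * f, (x, -sp.oo, sp.oo))  # a*sqrt(pi)/(2*b**3), as in [3]

    sol = sp.solve([sp.Eq(I2, 1), sp.Eq(I3, sigma**2)], [a, b], dict=True)
    print(sol)                          # a = 1/(sigma*sqrt(2*pi)), b = 1/(sigma*sqrt(2))
    print(sp.simplify(f.subs(sol[0])))  # the Gaussian density, as promised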
Epilogue
Describer of all things, the Gaussian distribution is. Hewer of stone, drawer of water. (Note my use here of the clever literary device known as the bookend, in which a phrase used in the beginning of the piece is repeated to open the Epilogue. Now, we simply need to steer the narrative to a conclusion, say by mentioning where the Gaussian comes from physically.) Physically, the Gaussian comes from the Central Limit Theorem, which states that a quantity resulting from the additive effects of a large number of independent random contributions will tend toward a Gaussian: clustered about some central value, with a bell-shaped spread to the left and right of the peak. We find the CLT borne out by experience, a cornucopia of reality described or nearly-described by the Gaussian: the heights of students in your class, the absorption spectrum of photoreceptors, the airspeed velocity of unladen swallows.
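Here is the CLT you can watch happen (a sketch, assuming NumPy; the 30 uniform ingredients and the sample size are arbitrary choices): sum up a pile of uniform random variables -- about as un-bell-shaped an ingredient as they come -- and the totals arrange themselves into a bell curve anyway.

    import numpy as np

    rng = np.random.default_rng(0)
    sums = rng.uniform(0, 1, size=(100_000, 30)).sum(axis=1)  # each total = 30 additive effects

    # A Gaussian puts ~68% of its mass within one standard deviation of the mean;
    # the empirical totals agree.
    mean, std = sums.mean(), sums.std()
    print(mean, std)                           # ~15.0 and ~1.58
    print(np.mean(np.abs(sums - mean) < std))  # ~0.68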
Note carefully: Additive effects. If the random effects boinking your system are multiplicative, nature hands you a log normal, with concomitant mnemonic: the log of a log normal variable is normal.
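And the multiplicative counterpart (same assumption -- NumPy -- and the same arbitrary ingredient counts): multiply random positive effects together, take the log, and the bell curve reappears, which is the mnemonic in action.

    import numpy as np

    rng = np.random.default_rng(1)
    products = rng.uniform(0.5, 1.5, size=(100_000, 30)).prod(axis=1)  # multiplicative effects

    logs = np.log(products)  # the log of a (roughly) log normal variable...
    mean, std = logs.mean(), logs.std()
    print(np.mean(np.abs(logs - mean) < std))  # ~0.68 -- ...behaves like a normal one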
Whether your random effects are additive or not, and what all else, is of course between you and your data. When subtlety prevails, there are formal tests for whether data are normal, the statistics literature full-to-bursting with articles on such things written by an ebubulence of mathegonical faculty desperate for employment and tenure, some of whom are willing to make up words to prove a point. Alas, mercifully, a topic I will leave for another day. For I fear I have tested your patience to near breaking with my math fable, charming though it may have been (or not). Here I simply admonish you to pause and reflect, to nurture that small voice that says tomorrow I shall try again.
Time to close; in any event, it is time to close. For this absinthe has made me logy, and the Sirens are calling from the bed chamber.