A LabKitty Primer is an introduction to some technical topic I am smitten with. They differ from most primers as to the degree of mathematical blasphemy on display. I gloss over details aplenty and mostly work only the simplest cases. I'll show mistakes and dead ends and things any normal person would try that don't work. Mostly I want to get you, too, excited about the topic, perhaps enough to get you motivated to learn it properly. And, just maybe, my little irreverent introduction might help you understand a more sober treatment when the time comes.
Principal Component Analysis (PCA) is a black art of the statistical majicks having a reputation of equal parts usefulness and difficulty. It's a way of simplifying complex data, a way of extracting structure hiding in mess, sometimes even pointing you to an underlying physical process that explains what you measured. And it does so without any assumptions or restrictions, without even knowing what the data are or where they came from. When PCA works, it's a glorious thing to behold.
Yet, the method is cursed with an air of impenetrableness. It's a (relatively) new invention, made possible only with the rise of computers and computational statistics. PCA ain't a t-test or ANOVA -- you can't do it with a pocket calculator and a table at the back of the textbook. So it is we poor punters have been at the mercy of the mathematicians and their inscrutableness when it comes to building user-friendly programs, or explaining anything really.
Well, LabKitty is here to help. The what of PCA is fairly straightforward; it's the how that presents a problem, assuming you aren't fluent in singular value decomposition and eigenvalues and covariance matrices. I dare say PCA is like making a baby: easy to describe in pictures (see, e.g., the Internet) or in cold clinical detail (gametofusion, embryo implantation, growth and development, dilation and contraction, fetus expulsion), but inside the box, so to speak, is some rather unnerving grossocology. As such, I will describe PCA in pictures, and show you how to do it in code (it's just a couple of lines of Matlab) but not have much to say about what happens behind the curtain. If you're looking for mathematical details, you need to look elsewhere.
After having talked about the what and how of PCA, we'll then turn briefly to the why, although if you are reading this you probably already have an application in mind. I'll wrap up things with what can go wrong, and present a few leads into the literature which I have found helpful, although nothing that can't be bested with a tailored Google search.
So, then. I believe someone mentioned baby making. Let's get at it.
PCA - The What
We are dealing with a two-dimensional data set. We can think of this as simultaneous multichannel recordings, each measurement going on for some interval of time. For example, we might record the EEG at a collection of places on the scalp for 100 milliseconds after a flash of light is presented. We might record the grayscale values of the pixels in a linear CCD detector while protons collide inside some expensive physics gizmo. We might record the luminosity of the stars in the constellation Virgo for a decade. We might record the rectal temperature of all the ungulates at the zoo once a day for a year. There's really nothing special about two-dimensional data in time x space. We could use data in time x time or space x space. And PCA can also be extended to more than two dimensions. But time x space is a convenient context in which to present the method.
For illustrative purposes, rather than talking about a specific application let's work with fake data that we have control over. This is easy to generate in Matlab; it's just the outer product of two vectors. Have some code (the double chevrons are my Matlab command prompt):
>> L1 = sin(2*pi*[0:0.1:1]);
>> S1 = sin(8*pi*[0.01:0.01:1]);
>> dataset = L1' * S1;
>> multiplot(dataset, 5.0, 'k', 1);
This produces the following:
My code makes use of a little convenience function I wrote called multiplot which plots an nxm array as multichannel data (n channels with m samples):
function multiplot(A, gain, style, flag)
[n, m] = size(A);
shift = 10;
hold on;
for ichan = 1:n
    if (flag)
        text(-5, ichan*shift, num2str(ichan));
    end
    temp = gain*A(ichan,:) + ichan*shift;
    plot(temp, style);
end
axis off;
axis tight;
The function takes four arguments: "A" -- the matrix to plot, "gain" -- a multiplier applied to the data, "style" -- a string containing a standard Matlab plotting style such as 'r' (red lines) or 'k*' (black asterisks), and a flag to indicate if the channel numbers should be plotted. Copy and paste this code into a file called "multiplot.m" and put that m-file in a directory in your Matlab search path.
Here's a schematic that shows how the data was generated:
Stare at this until you are certain you understand how the data was generated as an outer product: dataset(i,j) = L1(i) ⋅ S1(j). In this example, i goes from one to eleven and j goes from one to one hundred. Even though I've drawn the data as continuous traces, remember these are really a collection of discrete samples. I've superimposed the samples on L1 (the channel positions). I didn't superimpose the samples on S1 because there's too many of them and it would be a mess.
Footnote: There's no hidden reason I used sinusoids for L1 and S1 here; it was just an easy way to make mildly interesting looking data.
In PCA, the two vectors used to generate the data have names. The one that goes down the channels is called the load and the one that goes across time is called the score. (Mnemonic: the "load" goes down, like the mother "lode" (load) in a mine; the "score" goes across, like the scores of the innings in a baseball game.) As above: dataset(i,j) = load(i) ⋅ score(j).
So far, we have data = load x score. In PCA, we ask: Can we go in the other direction? That is, given a 2D data set, can we find a score and load that generate it?
That's it. That's what PCA is about.
Or almost about. There's one more twist. When dealing with real data, it's usually impossible to find a load and a score that generate the data perfectly. We find a load/score that almost generates (aka "models") the data; the difference -- the literal difference obtained by subtracting the model from the data -- is called the residual. We can then seek a load/score that models the residual. This, also, may not perfectly match. So we repeat the process on the residual of the residual. Shampoo, rinse, repeat. Continue until including yet another load/score pair doesn't buy very much additional improvement. We then add together all the results, which, hopefully, reproduces the data to some acceptable degree of accuracy.
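If it helps to see the shampoo-rinse-repeat business written out, here's a minimal Matlab sketch. It peeks ahead and uses svd (which we meet properly in the next section) to grab the best load/score pair at each pass -- treat it as a preview of the idea, not the recipe we'll actually use below:

% Sketch of the repeat-on-the-residual idea (assumes dataset from above).
residual = dataset;
model = zeros(size(dataset));
n_pairs = 3;                          % how many load/score pairs to pull out
for k = 1:n_pairs
    [u, s, v] = svd(residual);
    this_load  = u(:,1);              % best load for the current residual
    this_score = s(1,1) * v(:,1)';    % best score for the current residual
    model    = model + this_load * this_score;     % running reconstruction
    residual = residual - this_load * this_score;  % what's left unexplained
end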
The product of a load and its corresponding score is called a component. We seek to reconstruct the data using (the sum of) a collection of these components. These components are not unique. The components that make up the smallest possible set that "explains the data" (formally: that minimizes the residual) are called the principal components. Ergo, the technique is called Principal Component Analysis.
That's it. That's what PCA is about.
PCA - The How
In some sense, I have told you everything about PCA. In another sense, I have told you nothing. For there is a large mathematical gap between wanting to explain a data set in terms of its principal components and actually computing them. At first glance it would seem hopeless. If, for example, we were doing Fourier analysis, we would at least have someplace to start -- we know we have to find a collection of sinusoids that describe our data. But loads and scores don't have to be sinusoids; they can be anything. In mathematical lingo, we say PCA uses arbitrary basis functions to model the data. We don't know what those are, we only know we want the smallest set of them. That's not much to go on.
I thought long and hard about this and failed to come up with a simple explanation of the math behind PCA. By "the math behind PCA" I mean explaining it in enough detail that you could go off and do it (this is, I believe, possible with Fourier analysis. A story for another time, perhaps). Explaining it badly would be a double whammy; I would have accomplished nothing and you would feel bad for not understanding it.
So instead, a compromise. I'm simply going to give you Matlab code to do PCA. Later, I'll give a few references that try to explain the math. The pictures and code might suffice for your needs; if not you can attack the details at your leisure.
You might think the Matlab function for doing PCA would be called pca. No such luck. We do PCA in Matlab using singular value decomposition, and that right there gives you some hint there is mathematical weirdness afoot. The good news is this is all pretty straightforward, even if you don't know what SVD is.
Footnote: I think there is a version of PCA provided in the Statistics Toolbox, but the svd function is available in the basic Matlab installation and I've used that so you don't have to pony up any extra scratch if you have Matlab and want to try PCA. I'm pretty sure svd is also provided in Octave and works the same way but I can't test it at present. I had a laptop running Octave but it died recently, so unless one of y'all wants to mail me a new MacBook, you're on your own.
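For what it's worth, if you do have the Statistics Toolbox, the call looks something like the following (the exact names and defaults here are from memory, so treat them as an assumption on my part; also note pca wants observations in rows, so our nchan x npnts matrix goes in transposed, and it mean-centers the data by default, which the raw svd route below does not, so the numbers won't line up exactly):

% Statistics Toolbox route (assumption: you have a toolbox version with pca).
% coeff: loads, one per column; scores: the corresponding scores; latent: variances.
[coeff, scores, latent] = pca(dataset');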
The heart of the matter is the call to svd. Let's look at the call itself, how to use what it returns, and some example output. After that, I'll give you a big block of code you can cut and paste into your own m-file.
The call to svd goes as follows:
>> [ u,s,v ] = svd(dataset);
As the name implies, dataset is our nchan x npnts data. The meat is the three return values u, s, and v. It is from these that we extract loads and scores and generate the principal components using a bit of matrix gymnastics.
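For our 11 x 100 example, a quick size check shows what came back: u is 11 x 11 (its columns are candidate loads), s is 11 x 100 with the singular values down its diagonal, largest first, and v is 100 x 100 (its columns, scaled by the singular values, give the scores). You can confirm this yourself:

>> size(u), size(s), size(v)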
For example, to generate the first component after calling svd, we do the following:
>> c1 = 1;
>> c2 = 1;
>> w = zeros(size(dataset));
>> w(c1:c2,c1:c2) = s(c1:c2,c1:c2);
>> c = u*w*v';
The end result c is what's important here -- it's the first component. If you want the second component rather than the first, you would set c1 = c2 = 2. If you want the sum of the first two components, then set c1 = 1 and c2 = 2 (alternatively, you could get the components individually and add them).
Extracting the load and score for a given component requires a few extra lines of code:
>> load = u(:,c1);
>> temp = w*v';
>> score = temp(c1,:);
Easy peasy. To show you PCA does what I claim, let's apply it to the data we made earlier. Here's a side-by-side plot of our data and a one-component model. Underneath are the load and score of the first component superimposed on L1 and S1:
I'm now plotting the load horizontally because that's the way stuff comes out of Matlab and it saves plot real estate. Note the one-component model nicely reproduces the data. However, the amplitude is off in the load and score. The load is half as large as L1 and the score is twice as big as S1. This is typical of PCA, or at least the implementation I am showing you. The loads returned by SVD are normalized, so unless the load(s) you use to generate the data happen to have unit length, some scaling weirdness will occur. Note this is a nonissue when using PCA on real data because there you don't have fake loads and scores for comparison.
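If you want to see the normalization business for yourself, here's a quick check at the command window (this continues from the one-component example, so L1 and dataset need to still be around):

>> [u, s, v] = svd(dataset);
>> norm(u(:,1))   % always 1 -- svd hands back unit-length loads
>> norm(L1)       % not 1 for our sinusoid, hence the mismatch
>> s(1,1)         % the "missing" amplitude lives in the singular value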
Let's put everything together in one big fat example. The following code generates a data set using two components, then applies PCA to create a two component model and plots the results. With a little luck the loads and scores it extracts will match the vectors we use to create the data (alas, often they will not -- see PCA Gone Bad, further down the page) which would be a nice way to conclude our demonstration:
% LabKitty PCA primer -- two component PCA
L1 = sin(2*pi*[0:0.1:1]);
S1 = 2*sin(2*pi*[0.01:0.01:1]);
L2 = cos(2*pi*[0:0.1:1]);
S2 = cos(2*pi*[0.01:0.01:1]);
dataset = L1' * S1 + L2' * S2;
[u,s,v]=svd(dataset);
% component 1
i1=1;
i2=1;
w=zeros(size(dataset));
w(i1:i2,i1:i2)=s(i1:i2,i1:i2);
component_1 = u *w *v';
load1 =u(:,i1);
temp=w*v';
score1 =temp(i1,:);
% component 2
i1=2;
i2=2;
w=zeros(size(dataset));
w(i1:i2,i1:i2)=s(i1:i2,i1:i2);
component_2 = u *w *v';
load2 =u(:,i1);
temp=w*v';
score2 =temp(i1,:);
% component 1+2
i1=1;
i2=2;
w=zeros(size(dataset));
w(i1:i2,i1:i2)=s(i1:i2,i1:i2);
component_12 = u *w *v';
% plot all the things
subplot(2,2,1);
multiplot(dataset, 5.0, 'k', 0);
title('dataset');
subplot(2,2,2);
multiplot(component_1, 5.0, 'r',0);
title('component 1');
subplot(2,2,3);
multiplot(component_2, 5.0, 'b', 0);
title('component 2');
subplot(2,2,4);
multiplot(component_12, 5.0, 'k', 0);
title('component 1 + 2')
Here's the output, which is a plot of the data, the two components, and the model of the data (the sum of the two components). The 2-component model nicely reproduces the data:
Finally, here's the two loads and scores PCA generated superimposed on L1/S1 and L2/S2:
In this figure, data stuff is plotted in black and PCA stuff related to the first and second components is plotted in red and blue, respectively. I've taken the liberty of scaling the loads and scores so it's easy to see they match: PCA indeed extracted the vectors I used to generate the data. That's pretty cool.
Interlude: A Glimpse Inside the Box
I promised I wouldn't discuss details of PCA mathematics. While you may hold me to my word, every presentation of PCA ever talks about "reducing the dimensionality of the data" so I should probably say a few words about that as a way of explaining what the algorithm does. This will all be high level and abstract, and you can skip it if you just want to get on with the number crunching.
Consider a simple case: We have a set of data points, each having an x coordinate and a y coordinate. Perhaps x and y are velocity and position. Or height and weight. Perhaps they are the value of a test subject's home and the value of their car.
Suppose we collect a bunch of these samples and plot them and we find this:
It doesn't take fancy math to tell us we don't really have two dimensional data; the data are one dimensional (perhaps with a bit of noise) and we have simply chosen a bad way to measure it. Instead of velocity and position, a better measurement may have been "energy." Or "age" instead of height and weight. Or "income" instead of home and car value.
Just by plotting the data we've revealed its true dimensionality, but in the real world things are more complicated and we need an algorithm. Enter PCA. Not only would it tell us we are dumb, it would calculate the proper "data axes" we should have used (these are the loads). Reducing dimensionality from two dimensions to one may not seem like much of a gain for so much pain. However, picture instead the line in three dimensions, or four, or ten, or a hundred, and the simplification becomes substantial. Next, imagine not a line in two dimensions, but a plane in three dimensions (or four or ten or a hundred). Here the true dimensionality of the data is two, even though we may be stupidly measuring it in three or four or ten or a hundred. It will take only two principal components to describe our data.
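Here's a toy version of that scatter plot you can run yourself: two-dimensional measurements that are secretly one-dimensional (points along a line, plus a little noise). The line direction and noise level are just numbers I made up for the demo. The first singular value dwarfs the second, and the first load points along the line we should have measured on:

% 2-D data that is really 1-D: a line plus noise.
t = randn(1, 200);                     % the "true" 1-D quantity
direction = [2; 1] / norm([2; 1]);     % the axis we didn't know to measure along
noisy_xy = direction * t + 0.05 * randn(2, 200);
[u, s, v] = svd(noisy_xy);
diag(s)'      % first value big, second tiny -- essentially 1-D data
u(:,1)        % close to 'direction' (possibly flipped in sign)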
Line to plane, plane to, um, whatever it's called (I think the term is "hyperplane"), the song remains the same. Our data has lower dimensionality and we have simply made a poor choice of coordinates. That being said, it may well be that our data is genuinely blobular and cannot be reduced to a simple explanation in terms of a few components. If so, we may either (a) be happy with the approximation provided by PCA or (b) give up and go home. Some days you eat the bear, some days the bear eats you.
Footnote: You may remember something about eigenvalues and eigenvectors representing a matrix in a special coordinate system that simplifies things. Is that what PCA does? The short answer is yes, but PCA does a spectral analysis on the covariance matrix of the data, not the data itself. That's as far as I'll explain, for going further raises a murder of testy questions. For example: What's a covariance matrix? More embarrassing questions quickly follow like: Why didn't we compute it? If you'll check the code you'll note we passed the raw data to svd and not some covariance matrix derived from it. Which makes no sense, because the data matrix isn't square, and eigenvalues are only defined for a square matrix. For answers to these questions, and much else besides, I will refer you to the literature.
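That said, if you want a numerical peek at the connection, the following sketch checks that the eigenvectors of dataset*dataset' (a covariance-like matrix that is square even though dataset isn't, and uncentered, so not quite the textbook covariance) match the loads svd returns, up to sign and ordering, with eigenvalues equal to the singular values squared:

% Eigen-decomposition of the (uncentered) covariance-like matrix vs svd of the raw data.
[u, s, v] = svd(dataset);
[evec, eval] = eig(dataset * dataset');
[vals, order] = sort(diag(eval), 'descend');   % eig doesn't sort largest-first
vals(1)               % compare with...
s(1,1)^2              % ...the first singular value squared
evec(:, order(1))     % compare with u(:,1), possibly times -1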
As our imaginary example here demonstrates, PCA provides economy in describing data. In the extreme case, if we can describe an nxm data set in terms of a single component, we have reduced storage/transmission requirements from n⋅m to n+m, a substantial savings in bandwidth. I don't know if PCA is literally used in signal processing algorithms (IIRC, in image compression techniques such as JPEG it is not), but I'm not an EE. Maybe a DSP wonk can throw in here.
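Here's the bandwidth argument in code, using the original one-component data (a sketch of the idea, not an actual codec): ship the load and score (n + m numbers) instead of the matrix (n x m numbers) and rebuild the data with an outer product at the other end.

% Rank-one "compression": 11 + 100 numbers instead of 11 * 100.
L1 = sin(2*pi*[0:0.1:1]);           % rebuild the original one-component data
S1 = sin(8*pi*[0.01:0.01:1]);
dataset1 = L1' * S1;
[u, s, v] = svd(dataset1);
load1  = u(:,1);                    % 11 numbers...
score1 = s(1,1) * v(:,1)';          % ...plus 100 numbers...
rebuilt = load1 * score1;           % ...rebuild all 11*100 numbers
max(abs(rebuilt(:) - dataset1(:)))  % essentially zero, since one component suffices here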
More importantly, simplifying data assists in the interpretation of data. If we had never heard of "energy" or "wealth" or "age" our analysis may have inspired us to create these concepts. Conversely, if these concepts were familiar to us, then our PCA results would point us to these quantities hiding in our data. In the perfect PCA scenario, we are able to model our data using a small number of components, and all the loads and scores have a physical interpretation.
However, alas, the world is not perfect and PCA goes astray more often than we might like to admit. In closing let's take a brief look at some of the difficulties that can be encountered.
PCA Gone Bad
Things worked well enough in our examples but, as you might expect, once you start trying PCA on real world data any number of things can go wrong. While I can't possibly catalog every possible eventuality here, I will mention two common problems. These concern the number of components generated in your analysis, which can be either too many or too few.
You may have noticed the first component in the examples seems to capture the overall look-and-feel of the data, whereas the second (and higher, if we had bothered to compute them) feels more like a correction. In PCA lingo, we say the first component explains most of the variance of the data. A problem arises when the first component doesn't explain most of the variance, and you find you need too many components to produce a good match between the model and the data. There is no hard and fast rule on what "too many" is, but the more components you need, the more likely it is people will start to look at your model funny. A good guideline is that each component better have an underlying justification in terms of physics or biology or chemistry or whatever, otherwise you're just doing curve fitting.
Footnote: the amount of variance explained by a component is usually reported. It's straightforward to calculate: (1) subtract the component from the data, (2) square all those differences and add them up, (3) divide that sum by the total sum of squares of the data (the sum of the squared differences between each value and the mean of the data), (4) subtract the result from one and multiply by 100 to get a percent.
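Here's that recipe as code, using component_1 and dataset from the big example above (a sketch; the sums of squares stand in for variances since the 1/N factors cancel in the ratio):

% Percent of variance explained by the first component.
residual    = dataset - component_1;
ss_residual = sum(residual(:).^2);                       % squared differences, summed
ss_total    = sum((dataset(:) - mean(dataset(:))).^2);   % total sum of squares about the mean
pct_explained = 100 * (1 - ss_residual / ss_total)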
PCA can also give you too few components. The algorithm searches for basis functions that explain variance. That's all it knows. As such, PCA can be maddeningly ruthless in sniffing out loads and scores that are efficient but unhelpful. There can be two plump loads staring out of the data that have a clear physical interpretation and PCA may well ignore them, or tweak them, or combine them into a single giant mutant component that has no physical meaning but explains lots of variance. You can often tell your office mates are using PCA by their frequent cries of "come on!" and "what the hell is this?".
Sometimes it's possible to work around PCA's variance fetish by allowing it to generate one or two additional components and then rotating the loads to see if something explainable emerges. In essence, you let PCA reduce the dimensionality but your eyeballs have final say on the results. It's not difficult to code (it's literally multiplication by a rotation matrix of sines and cosines) but the process can be time consuming, especially in anything more than two dimensions. YMMV.
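If you want to try it, here's the core of the two-dimensional case as a sketch (theta is yours to sweep; rotating the loads one way and the scores the other leaves the two-component model load1*score1 + load2*score2 unchanged -- only the split between the components moves):

% Hand-rotating a pair of loads/scores from the big example above.
theta = 30 * pi/180;                          % pick an angle, eyeball the result, repeat
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];
rotated_loads  = [load1, load2] * R;          % nchan x 2: the rotated pair of loads
rotated_scores = R' * [score1; score2];       % 2 x npnts: counter-rotated scores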
Further Reading
Googling PCA will get you an avalanche of stuff on PCA. This is not surprising, for that is what Google is supposed to do. However, winnowing these results into useful resources is a different kettle of fish. Only one site on the entire Internet explains PCA clearly (wink, wink), yet that author punts when it comes to details of the math.
The problem is there's an entire swamp of linear algebra crap you have to wade through before getting to what you want to know. Numerical Recipes explains the SVD algorithm in gory detail, though it requires eleven (!) pages, and this only at the end of an entire chapter on computational linear algebra. Bronson's Schaum's Outline on Matrix Methods covers SVD (and gives you lots of solved problems), but only in Chapter 21 of a twenty-one-chapter book. There is no "royal road" to PCA, as far as I know.
Your best bet may well be a good multivariate statistics book. I've found Kachigan's Multivariate Statistical Analysis: A Conceptual Introduction a very approachable presentation of PCA, albeit in the guise of factor analysis. There's a stats book to suit every taste, and you just need to find one that speaks to you.
However, if you must have more right this very instant, here's a few pdfs I rather like. (I tend to go with pdfs for reference because I can print them out and mark them up and read them with coffee and while pooping. Perhaps I have shared too much.) With the usual disclaimer that LabKitty is in no way affiliated with any of these people:
www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf
www.utdallas.edu/~herve/abdi-awPCA2010.pdf
One last pdf I'll point you to that's sorta fun is a summary of how Robert Bell and Yehuda Koren won the million dollar Netflix challenge using PCA (and much else besides) to devise a movie recommendation algorithm. If that don't convince you PCA is useful, I don't know what will.
buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf
On a final note, there is a technique related to PCA called Independent Component Analysis. My introduction to ICA was having my entire dissertation research dismissed by some math freak at a conference because I used PCA and not ICA. The episode left a bad taste, so if you want to learn ICA you're on your own.
In closing, remember: PrinciPAL component, not princiPLE. Don't be a noob.