Because we are currently living in the Season of the Bitch (a.k.a. an election year), you can't swing a dead cat* without hitting a prediction of some sort. These predictions usually come in the form of a poll, or at least the respectable ones do (my Uncle Randall predicting Obama is going to cancel the election so lizard people can extract our vital juices is a different kind of prediction). For example, a recent poll on the CNN website claims Hillary Clinton and Bernie Sanders are tied in the lead-up to the Nevada primaries. "Overall, 48% of likely caucus attendees say they support Clinton, 47% Sanders..." the article reads (Source: CNN Digital Dashboard, 2/17/2016). However, further down the page we find this:
"The CNN/ORC Nevada Poll was conducted by telephone February 10-15 among a random sample of 1,006 adult residents of the state. Results among the 245 likely Republican caucusgoers [sic] have a margin of sampling error of plus or minus 6.5 percentage points. For results among the 282 likely Democratic primary voters, it is plus or minus 6 percentage points."

Such a caveat should give any thinking person pause. How can you possibly predict the behavior of millions of voters with just 1,006 phone calls? Isn't this just guesswork? (Short answer: no, with a but.) Can the result be totally wrong? (Short answer: yes, with a however.) What is this voodoo? And what the heck is a margin of sampling error?
Put such questions to a mathozoid, and they will drone on about the null hypothesis and Type-I error and confidence intervals and the standard deviation.
Ask LabKitty and you will get enlightenment.
So, ask away.
We would like to understand how knowing the preferences of a relatively small number of voters can accurately predict the outcome of an election. More generally, we would like to understand how knowing the behavior of a relatively small number of anything allows us to predict the behavior of some larger group. In math lingo, the group we measure/talk to/poll is called the sample, the larger group from which the sample is drawn is called the population, and the number we get is called a statistic. The terminology isn't terribly important, but it appears in what follows a few times so I thought I would define things up front to avoid confusion.
As often happens in math, it's helpful to begin by looking at a toy version of the problem. So, instead of thinking about a sample of 1,006 registered voters out of a population of millions, consider a population of just four voters from which we sample two. Furthermore, let's assume the actual voter preference is split right down the middle: 50% of the voters are going to vote for Candidate-A and 50% will vote for Candidate-B. Clearly, we don't know this in a real poll -- it's the very number we want to determine! The idea is to work backwards: assume we know the answer and then examine how close polling comes to finding it. The results we obtain are valid no matter what the split is -- 50/50, 60/40, 90/10, whatever. However, we have to assume something to get started, so let's assume 50/50.
I'll represent our population graphically like so:
The two red guys are our two Candidate-A voters and the two blue guys are our two Candidate-B voters. Now, let's draw every possible sample of two voters from this population. There are six of these, which I draw below:
The two voters included in a given poll have checkboxes drawn on them. The number at the right is the sample result (i.e., the voting preference calculated in that poll). I've expressed this as a preference for Candidate-A but I could have just as well used Candidate-B. Since the sample size is two, the poll result can be 2/0, 1/1, or 0/2, resulting in a calculated voter preference for Candidate-A of 100%, 50%, or 0%, respectively. Easy peasy.
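If you'd rather let MATLAB do the bookkeeping, here is a minimal sketch (separate from the Appendix code) that enumerates all six possible two-voter samples and computes the Candidate-A preference each one reports:

% Toy population: 1 = Candidate-A voter, 0 = Candidate-B voter
population = [1 1 0 0];

% Every way of choosing 2 voters out of 4 (six rows, one per possible poll)
samples = nchoosek(1:4, 2);

% Candidate-A preference reported by each poll
for k = 1:size(samples,1)
    preference = mean(population(samples(k,:)));
    fprintf('Poll %d: %3.0f%% for Candidate-A\n', k, 100*preference);
end

Running it prints one 100% poll, four 50% polls, and one 0% poll -- the same tally as the picture.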
Now, let's make a histogram of our polling results:
The x-axis is voter preference and the y-axis is the number of polls in which that preference occurred. One poll found a 0% preference for Candidate-A, four polls found a 50% preference (the correct answer), and one found a 100% preference. Note more polls give the correct number than give an incorrect number. This is super important (which is why I typed it in bold). It's the first fundamental lesson of polling. (There will be two additional lessons before we are done.)
The idea of "taking every possible poll" is just a way of thinking about the problem. In practice, you or I or CNN doesn't take every possible poll, we just take one. How do we know "which" poll we got? We don't. However, here's the thing: Because there are more polls that give the correct number than polls that do not, any one poll is most likely to be one that gives the correct number. Think of writing each potential poll result on a slip of paper and putting the slip of paper into a hat. In our example, there would be one slip written with 100%, one with 0%, and four with 50% (the correct answer). If you reach into the hat and pull one out at random, your gambler intuition should be telling you that you will probably pull out a slip of paper with the correct answer. Call it what you will -- the Law of the Mean, chance, or just plain dumb luck -- it's why you can read the minds of millions of voters with just a relative handful of phone calls. It's why polling usually works.
Note: usually. Can our poll be wrong? Absolutely. In our toy poll, the chance of being right is four times out of six. Although that's 2/3 of the time, as far as professional polling is concerned that's pretty crappy. But our example was just a toy. There are ways to generate much better odds of being right, and real polls usually do better. Read on.
Size Matters
Let's start working our way back to a real poll from our toy example. Suppose our population now contains 1000 voters (instead of four) and we sample 10 voters (instead of two). Again, let's assume an actual 50/50 voter split. We want to play the same game as before, constructing every possible sample and looking at how many give the correct answer versus how many do not. However, there's a problem. In contrast to the toy example, I can no longer show you every possible poll. There are approximately 2.6E23 (26 followed by 22 zeros) different ways of picking 10 people from a population of 1000. I can't possibly draw all of these. Heck, a computer can't even compute all of these. Assuming we could compute a billion polls per second, constructing all 2.6E23 would take more than eight million years!
Footnote: You're probably wondering where the number 2.6E23 comes from. The number of ways to pick a sample of s whatevers out of a population of p whatevers is given by the binomial coefficient p! / (s! (p-s)!), where ! indicates factorial. Plug in p = 1000 and s = 10 and you get 2.6E23 plus change. This gets into a topic called combinatorics and, yes, it's counterintuitive how fast the numbers get huge (or YUGE, to use the parlance of the times).
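If you want to check that arithmetic yourself, MATLAB will do it (a quick sketch; nchoosek grumbles that the result may not be exact for numbers this big, but it's plenty close for our purposes):

% Number of ways to pick 10 voters out of 1000
n_polls = nchoosek(1000, 10)            % about 2.63e23 (MATLAB warns the result is approximate)

% Time to grind through all of them at a billion polls per second
seconds = n_polls / 1e9;
years = seconds / (60*60*24*365)        % about 8.3 million years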
The point being, we must change tactics. So, instead of examining every possible poll, let's examine a large number of polls picked at random -- say 1000. Even though 1000 does not exhaust all 2.6E23 possibilities, hopefully you would agree it should give us some sense of what to expect.
Here's the histogram -- the results of 1000 polls in which a sample of ten voters was picked at random from a population of 1000 split equally (50/50) for Candidate-A and Candidate-B:
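(If you want to regenerate this figure yourself, the Appendix code produces it as-is; the relevant settings are:

repeat = 1000;              % number of simulated polls
sample_size = 10;           % voters contacted per poll
real_preference = 0.50;     % true Candidate-A preference

The later figures just change sample_size and real_preference.)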
Two important features emerge. First, the x-axis resolution is finer than the histogram we obtained from polling two people in a population of four. There, we only had three categories (0%, 50%, and 100%). Here, with a sample size of 10, we obtain 11 histogram categories (0%, 10%, ... , 90%, 100%). This will become important later when we look at sneaky ways to reduce the chance of being wrong.
Second, we can see the polls still center around the real result (the actual Candidate-A preference of 50%). As before, polling "wants" to give us the right answer. Once again, we play the pick a poll out of the hat game. If you squint at the histogram, you'll see that in our collection of 1000 random polls, about 250 fell into the 50% category. Thus, we're correct about 1/4 of the time. Hey, wait a sec. This is worse than our toy poll! We seem to be going backwards! What, if anything, have we accomplished?
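Before answering that, a quick sanity check: the squint-read of roughly 250 agrees with the exact binomial math (computed outside the simulation):

% Chance a 10-voter poll lands exactly on 50% when the true split is 50/50
p_exact = nchoosek(10, 5) * 0.5^10      % = 252/1024, about 0.246

So about 246 of every 1000 polls should land exactly on 50%, which is what the histogram shows, give or take simulation noise.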
We shouldn't be too quick to dismiss our new results -- even though the toy poll was correct 2/3 of the time with a sample of only two, that sample size represented half the population. Sampling ten people from 1000 is only 1% of the population, so a comparison is kind of apples and oranges. Still, no matter how you slice it, only being right 1/4 of the time is pretty crappy odds. However, there is a different kind of improvement obtained by increasing the sample size. Here -- look what happens when we increase the sample size from 10 to 50:
We get a bump in the histogram peak -- the correct answer now shows up in about 1/2 of the polls. That's still not so great, all things considered, but we're going to see how to fix that in the next section. More importantly, the predictions cluster more tightly around the correct result (I'm still using 11 categories here so we can compare this histogram to the previous histogram even though the larger sample size (50) would permit a finer x-axis resolution). With a sample size of 50, almost no polls predict a voter preference less than 40% or greater than 60%. Thus we have the second fundamental lesson of polling: The bigger the sample size, the more confident you can be that you won't conclude something completely boneheaded.
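That "about 1/2" can also be backed out of the binomial distribution. With the 11-category binning, the 50% bin collects every poll that finds 23 to 27 Candidate-A voters out of 50 (i.e., results between 46% and 54%) -- a sketch, assuming that binning:

% Chance a 50-voter poll lands in the 50% bin (23 to 27 Candidate-A voters)
% when the true split is 50/50
k = 23:27;
p_bin = sum(arrayfun(@(x) nchoosek(50, x), k)) * 0.5^50    % about 0.52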
Not being a bonehead is nice. Still, it's not as nice as being right. The good news is there's a trade-off between how often you are wrong and how bad you are wrong. We can trade in clustering for a better hat, so to speak. Read on.
Fear of a Bad Hat
To sum up our story so far, we found polling relies on the quirk that there are more ways to get the correct answer in a poll than ways to get a wrong answer, so any one random poll is likely to be correct. However, a nasty feature of gambling is you can do everything right and still be wrong. People lose their shirts in casinos every day. And in any real-world poll, there's always a chance you're going to end up looking stupid, in a "Dewey defeats Truman" sort of way. Unless you sample every person in the population it's unavoidable, and sampling every person defeats the purpose of polling. (FYI: If you sample every person in the population, that's called a census and the number you obtain is called a parameter.)
There are really two interrelated questions here: 1) are you wrong? and, 2) how wrong are you?
The first question we have bumped into already; it's the pick a poll out of the hat game. We might attack this by increasing the sample size. That helps, but, as our simulations have shown, a larger sample isn't a magic bullet. Additionally, real polling costs money and large samples cost more than small samples. As such, pollsters have come up with an easier (read: free) way of increasing the odds of being right: You make a less-precise prediction.
It's not rocket science. The more precise your prediction, the more likely it is you will be wrong. The less precise your prediction, the less likely it is you will be wrong. For example, if I say exactly 50% of the voters prefer Candidate-A, then I'm going to have to accept a larger chance of being wrong than if I say 40-60% of voters prefer Candidate-A. Heck, if I say between 0% and 100% of voters prefer Candidate-A, I will be correct 100% of the time. Silly, yes, but it demonstrates the trade-off between the precision of a prediction and the chance of it being wrong.
In short, we lump histogram bars together. By enlarging our definition of "correct," we increase the likelihood of picking a correct poll from the hat. Thus, the third (and final) fundamental lesson of polling: The less specific your prediction, the greater the chance you will be right.
Let's have a look at lesson three in action. Here again is a histogram of 1000 simulated poll results using a sample of 10 from a population of 1000 voters. To mix things up, I increased the actual voter preference for Candidate-A to 80%:
As you should expect by now, the polls center around the true voter preference. If you squint, you may be able to see that the tallest bar on the histogram occurs in the 80% group and has a height of 306. If we claim voter preference for Candidate-A is exactly 80%, we have a chance of being correct of about 1/3 of the time (306/1000). Instead, if we make a less specific prediction -- say, that voter preference is between 70% and 90% -- we lump the three tallest bars together and our chance of being correct increases to (208+306+251) / 1000 or about 3/4.
Footnote: I don't expect you to be able to read bar heights off the plot. I'm looking them up using the MATLAB program I wrote that generates the histogram (see Appendix).
But why stop there? Keeping with our election motif, all that really matters is whether the voter preference is greater than 50% or not. If you summed up all the histogram bars residing in the "greater than 50%" categories, you would find 969/1000 or >95% of polls predict victory for Candidate-A. Now, our prediction is almost certainly right.
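If you like, you can check all three of those claims against the exact binomial probabilities (a sketch; the simulated counts above wobble around these values from run to run):

% True preference 80%, sample size 10
p = 0.80;  n = 10;
binom = @(k) nchoosek(n, k) * p^k * (1-p)^(n-k);

p_exact   = binom(8)                     % claim "exactly 80%": about 0.30
p_lumped  = sum(arrayfun(binom, 7:9))    % claim "70% to 90%": about 0.77
p_greater = sum(arrayfun(binom, 6:10))   % claim "more than 50%": about 0.97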
Again the lesson: Increase the slop and you increase the chance of being right. In this example, the "slop" is not really slop at all -- in an election, nobody really cares what the exact percent is; what's important is who won. However, polling (or, to use the general term, sampling) is a cornerstone of statistical analysis, and the amount of slop we can tolerate before a prediction is rendered unhelpful depends on the application at hand.
Footnote: The fancy words for this wrongness business are p-value and margin of error, which you can usually find in small print somewhere on a published poll. The p-value is just the chance of being wrong. A p-value of 5% (i.e., 95% confidence) is pretty standard. You pick the value beforehand, then use a magic formula that tells you how big the sample size must be. The margin of error (sometimes called the "margin of sampling error" or sometimes just the "margin") is the precision (or lack thereof) of the prediction and is usually specified as a percent range. The margin also affects the sample size for a desired p-value, and it goes into the formula as well.
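For the curious, the simplest version of that magic formula is the normal approximation to the binomial. Here is a sketch assuming the textbook setup -- 95% confidence and the worst-case split of 50/50 -- which is not necessarily the exact procedure CNN/ORC used:

% Margin of error for a given sample size (95% confidence, worst-case p = 0.5)
z = 1.96;                                 % z-score for 95% confidence
n = 245;                                  % e.g., CNN's likely Republican caucusgoers
moe = z * sqrt(0.25 / n)                  % about 0.063, i.e., roughly plus or minus 6 points

% Run it backwards: sample size needed for a desired margin of error
target_moe = 0.03;                        % want plus or minus 3 points
n_needed = ceil((z / target_moe)^2 * 0.25)   % about 1068 respondents

Which is in the same ballpark as CNN's 1,006 phone calls.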
Footnote: Because we used a sample size of 10, our histogram categories have been literally 0% (zero Candidate-A voters found out of 10 voters sampled), 10% (one out of 10), and so on. In a real poll -- for example, the CNN phone poll which is based on 1,006 phone calls -- you wouldn't include a bazillion categories in the histogram even though that's the true resolution of the data (i.e., 1/the sample size). Instead, each histogram category would represent a range of responses, say 0-5%, 5-15%, and so on. The point being, there is some lumping built into most polls from the very start. This is often all you need for an acceptable margin of error. You usually don't have to resort to ridiculousness like summing up all histogram bars >50%.
One final note. For polling to work, your sample must truly be selected at random. Many times when a poll is wrong, what screwed the pooch was a bias that snuck into the data and tainted the results. For example, it's illegal to auto-dial cellphones (IIRC), so phone polls tend to go to landlines. Landlines are overwhelmingly owned by older voters, so phone polls tend to leave out younger voters. Conversely, Internet polls famously attract Mongolian hordes like 4chan or le Reddit Army which tend to skew the numbers, often on purpose for comic effect. (Who among us has not reveled in the electoral success of Dickbutt?) These are two obvious offenders, but the seemingly innocent phrase "selected at random" opens a monster can of worms in statistics. The take home is a poll can be worthless from the get-go. You must train your spidey sense to suss out potential polling weirdness.
Epilogue
You now have a basic understanding of how polling works -- how you can predict the behavior of millions of voters with just a few phone calls or exit surveys. Random sampling. Margin of error. p-value. Histograms. Lumping. The three great lessons. Combinatorics. Yuge numbers. Please clap.
Still, we have only scratched the surface. If you would like to learn more, including gory details of the equations that generate this mess, consult any statistics textbook. We used Zar's classic Biostatistical Analysis in grad school, which I don't really recommend. For a gentler introduction, I like Perry Hinton's Statistics Explained, which you can find cheap on Amazon. However, as this is the Internet, y'alls prolly gonna go to YouTube or Khan or Google up some pdfs for free. The Year of the Scavenger, as they say. Fair dinkum. Pick whatever reference works for you.
Sashay on the boardwalk, scurry to the ditch.
* Please do not swing a cat, dead or otherwise.
APPENDIX -- MATLAB CODE
Here, with only light annotation, is the MATLAB code I used to generate the random polling histograms. Do with this information what you will.
% Simulate many random polls and histogram the results
repeat = 1000;               % number of simulated polls
sample_size = 10;            % voters contacted per poll
real_preference = 0.50;      % true fraction of voters preferring Candidate-A

all_polls = zeros(repeat,1);
for poll = 1:repeat
    this_poll = zeros(sample_size,1);
    for response = 1:sample_size
        % each respondent independently prefers Candidate-A
        % with probability real_preference
        if (rand < real_preference)
            this_poll(response) = 1;
        end
    end
    all_polls(poll) = mean(this_poll);   % fraction of this sample preferring Candidate-A
end

% Bin the poll results into 11 categories (0%, 10%, ..., 100%)
categories = [0 10 20 30 40 50 60 70 80 90 100];
hist_data = hist(100*all_polls, categories)   % no semicolon: prints the bar heights
bar(categories, hist_data);
axis tight;
grid on;