Tuesday, September 23, 2008

Why Barack Obama Will Win Pennsylvania

A seriously long post to make up for months of not posting at all.

I’m originally from Pittsburgh, Pennsylvania (as most of you probably know), so who wins my home state means quite a bit to me. Also, I’ve been having an ongoing argument with my parents (who still live in Pittsburgh), trying to persuade them that Barack Obama will win the Keystone State.

Now, polls have the state very close. Pollster.com says it’s 47.4% to 44.7% in Obama’s favor. RealClearPolitics has it as 47% to 44.5%, and fivethirtyeight.com says 46.8% to 43.6%. All together, it’s close. And despite Obama’s recent uptick in the national (and several state) polls, he doesn’t appear to be gaining much ground in PA.

This being the case, why am I so confident of Obama’s chances in Pennsylvania? Two words: registration advantage.

Let me back up a bit and go over some fundamental statistical concepts (feel free to skip down if this gets boring). Correlation is a measure of the strength of the relationship between two sets of numbers. For example, there is a strong correlation between the height and weight of an adult human male (female too). If you know someone’s height, you can guess their weight with relative accuracy. If two sets of numbers have a perfect correlation, that means if you know one, then you know for certain what the other is. Here’s a simple example. I give a test to 100 students. The test has 10 questions on it, and every student gives an answer to every question on the test. There is a perfect correlation between the number of right answers a student gives and the number of wrong answers that same student gives (i.e. if I say Student A got 4 right, can you tell me with certainty how many she got wrong?).

In math terms, we measure perfect correlation as the number 1. We measure no correlation as the number 0. Everything in between is some correlation, with higher numbers meaning a stronger predictive relationship between the two sets of numbers. (Negative correlation also exists, but it’s not really relevant here; indeed, the above example of the test scores is actually an example of negative correlation, but I digress.)
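For the curious, here is a minimal sketch of how that test example works out numerically. The student scores are made up for illustration, and `pearson_r` is just a hand-rolled helper for the usual Pearson correlation coefficient, not anything from my actual analysis.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for a handful of students on the 10-question test.
right = [4, 7, 10, 2, 9, 6]
# Every question gets an answer, so wrong answers = 10 - right answers.
wrong = [10 - r for r in right]

r = pearson_r(right, wrong)
print(round(r, 6))  # -1.0: a perfect (negative) correlation
```

The sign is negative because one number goes down as the other goes up, but the strength of the relationship is as high as it gets.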

Ok, enough with the stats for a moment. In 2004, there was an exceedingly strong correlation between the number of people registered as Democrats, by county, in Pennsylvania and the number of votes cast for the Democratic candidate for President, John Kerry. In fact, the correlation was approximately 0.98 (that’s really high!). What that means is that, after a bit of fancy math (essentially conducting a bivariate regression), we can take the registration numbers in each county and predict, with a high degree of accuracy, what the vote totals look like. For those of you who are interested, the regression equation for the Democrats in 2004 was:

Votes for Kerry = (0.6986 x Dem Registration) + 2295.1

What that means is that if you tell me that, in November 2004, there were 92,922 Democrats registered in Erie County (there were), then the regression equation would suggest that Kerry should receive 67,210 votes in Erie County. In fact, he got 67,921 votes. Not too shabby (there are some counties which don’t work as well, which I will come to later).
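If you want to check the Erie County arithmetic yourself, here is a small sketch. The function name is mine; the coefficients and the Erie numbers are the ones quoted above.

```python
def predict_kerry_votes(dem_registration):
    """2004 Democratic regression equation from this post."""
    return 0.6986 * dem_registration + 2295.1

erie_predicted = predict_kerry_votes(92_922)
erie_actual = 67_921
miss = erie_actual - erie_predicted

print(round(erie_predicted))                # 67210
print(round(miss))                          # 711
print(round(100 * miss / erie_actual, 1))   # 1.0 (percent of the actual vote)
```

So in this county the equation misses by roughly one percent of the actual vote.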

You might see where I’m going with this, but before I get there, let me mention a few other things. The Republicans also have a regression equation that helps us predict vote totals from registration numbers. The correlation for the GOP is not quite as strong as the Democrats’, but at 0.97, it’s still quite strong. Here’s the Republican 2004 equation:

Votes for Bush = (0.7694 x GOP Registration) + 2594.3

Those of you who remember your high school math will recognize that decimal in both equations as the slope of the line. In this context the slope can be thought of as the number of votes each candidate can expect from every 1 voter registered in his or her party. In other words, it’s kind of like turnout. In 2004, the GOP turned out nearly 77% of its party members to the Democrats’ 70%. It’s not precise, of course, because there are independents and third-party registrants who end up voting for the major candidates. Nevertheless, the slope is certainly a good measure of relative turnout, if not absolute turnout: in 2004, the GOP got more bang for its registration buck. (The number after the plus sign is the intercept. The way to conceptualize this is that the intercept is the number of votes you’d get in a county that has zero registrants in your party.)
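For anyone wondering where a slope and an intercept come from in the first place, here is a rough sketch of an ordinary least-squares fit. The county numbers below are invented (they are NOT the real Pennsylvania data) and are built to sit exactly on a line so the result is easy to check; `fit_line` is a hypothetical helper, and real data would of course scatter around the line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up registration counts for five imaginary counties.
registration = [10_000, 25_000, 40_000, 90_000, 150_000]
# Made-up vote totals: exactly 0.7 votes per registrant plus 2,000.
votes = [0.7 * r + 2000 for r in registration]

slope, intercept = fit_line(registration, votes)
print(round(slope, 4), round(intercept, 1))  # 0.7 2000.0
```

The slope is the "votes per registrant" number and the intercept is the leftover constant, just as described above.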

Before turning to how all this can be used to predict the outcome of the 2008 election, it is important to remember that one year of data, no matter how robust, could be just an aberration. Have no fear. I looked at 2000 as well. Though the two elections have some significant differences, overall, the basic outline is the same. The correlation between registration and votes is 0.98 for the Democrats and 0.96 for the Republicans; so again, registration is a good predictor of vote tallies. The biggest difference is that turnout was much lower for both parties. Here are the regression equations for 2000:

Votes for Gore = (0.6056 x Dem Registration) + 3330.2
Votes for Bush = (0.6298 x GOP Registration) + 3490.7

From 2000 to 2004, both parties increased their turnout, getting more votes per registered voter than they had in the previous election (though the GOP did better in both years), but overall, the predictive strength of the equations is very good for both parties in both elections.

One last thing, and then I promise we’re getting to the good stuff. As you’ve no doubt figured out by now, my intention is to apply the 2004 regression equations to the 2008 registration numbers to get a projection for the county by county vote in Pennsylvania, but before I do that, let’s take a quick look at how this method would have fared four years ago. Taking the 2000 regression equation and then applying it to 2004 registration data would have resulted in the following projected statewide vote tallies:

Kerry – 2,413,651
Bush – 2,378,521

And the actual vote tallies were:

Kerry – 2,938,095
Bush – 2,793,847

“Gosh those projections were off,” you’re probably thinking right now, and of course you’d be right. The 2000 regression equations assume 2000 turnout levels. But of course turnout in 2004 was much higher for both parties, so the actual vote totals are higher than the projections. But forget the absolute numbers for a moment. Let’s look at the percentage of the two-party vote. First the projection:

Kerry – 50.4%
Bush – 49.6%

Now the actual:

Kerry – 51.3%
Bush – 48.7%

Not bad, not bad at all. And don’t forget, the key thing is getting the order right. In other words, based on the number and distribution of registered voters in 2004, we would have accurately predicted a slim Kerry win.
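Here is a quick sketch of that back-test, using only the statewide tallies quoted above. The helper name `two_party_share` is mine.

```python
def two_party_share(dem_votes, gop_votes):
    """Return each side's percentage of the combined two-party vote."""
    total = dem_votes + gop_votes
    return 100 * dem_votes / total, 100 * gop_votes / total

# Projection: 2000 equations applied to 2004 registration.
projected = two_party_share(2_413_651, 2_378_521)
# What actually happened in 2004.
actual = two_party_share(2_938_095, 2_793_847)

# The key check: did the projection pick the same winner?
same_winner = (projected[0] > projected[1]) == (actual[0] > actual[1])

print(round(projected[0], 1), round(projected[1], 1))  # 50.4 49.6
print(same_winner)                                     # True
```

The raw totals are way off because turnout changed, but the shares, and more importantly the order, hold up.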

Ok, the moment you’ve been waiting for…let’s take a look at the projection of votes resulting from an application of the 2004 regression equations onto the 2008 (as of September 9) registration numbers:

Obama – 3,162,681 (54.6%)
McCain – 2,627,249 (45.4%)

Whoa, mama! A nine-point win! What the heck? How could that be? Well, here’s what happened. Since 2004, the Democrats have added 320,000 registered voters. That’s what accounts for Obama’s projected improvement of more than 200,000 votes over Kerry. And what accounts for McCain actually doing worse than Bush? Well, since 2004, the number of GOP registered voters has dropped by 215,000. So even though the GOP gets more “bang for its buck” (more votes per registered voter), a net registration swing of more than 500,000 toward the Democrats is a deep hole to climb out of.
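To see how the registration changes translate into votes, here is a back-of-the-envelope sketch. It multiplies the rounded statewide registration changes by the 2004 slopes, so it only approximates the county-by-county projection.

```python
DEM_SLOPE = 0.6986   # 2004 votes per registered Democrat
GOP_SLOPE = 0.7694   # 2004 votes per registered Republican

dem_reg_change = 320_000    # Democrats added since 2004 (rounded)
gop_reg_change = -215_000   # Republicans lost since 2004 (rounded)

dem_vote_change = DEM_SLOPE * dem_reg_change
gop_vote_change = GOP_SLOPE * gop_reg_change

print(round(dem_vote_change))   # 223552
print(round(gop_vote_change))   # -165421

# Compare with the projection minus the actual 2004 tallies.
print(3_162_681 - 2_938_095)    # 224586 (Obama vs. Kerry)
print(2_627_249 - 2_793_847)    # -166598 (McCain vs. Bush)
```

The rough numbers land within about 1,200 votes of the full projection on each side, which is about what you would expect from using rounded registration figures.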

But now you’re thinking, “Yeah, but 2008 isn’t 2004, and besides aren’t some of those new ‘Democrats’ really Republicans who just switched so they could vote in the primary, and what about all those Democrats who voted for Clinton but who’ll never vote for Obama and…” All good points. Let’s address them one at a time.

First, how many of those new “Democrats” are actually just Republicans who switched in order to vote in a more meaningful primary? Well, between November 2007 and April 2008, the number of registered GOP voters fell by 60,000. The Dem rolls increased by much more than that, but let’s assume that all 60,000 of those former GOP voters simply switched over. Though that seems somewhat unlikely, it does jibe with the exit polls from the primary, which suggested about 3% of voters in the Democratic primary were Republican. In any case, let’s be generous and assume that all 60,000 of those people should be treated like Republican registered voters, not Democratic registered voters. With these new registration numbers the projection looks like this:

Obama – 3,119,457 (53.8%)
McCain – 2,674,854 (46.2%)
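Moving registrants from one column to the other is easy to approximate at the statewide level. The sketch below assumes exactly 60,000 switchers and applies the 2004 slopes to the statewide totals; `shift_registrants` is a hypothetical helper, not the county-level calculation behind the numbers above.

```python
DEM_SLOPE = 0.6986   # 2004 votes per registered Democrat
GOP_SLOPE = 0.7694   # 2004 votes per registered Republican

def shift_registrants(dem_votes, gop_votes, n_switched):
    """Treat n_switched registered Democrats as Republicans instead."""
    new_dem = dem_votes - DEM_SLOPE * n_switched
    new_gop = gop_votes + GOP_SLOPE * n_switched
    return new_dem, new_gop

# Start from the unadjusted 2008 projection and move 60,000 registrants.
dem, gop = shift_registrants(3_162_681, 2_627_249, 60_000)

print(round(dem), round(gop))   # 3120765 2673413
```

That comes within about 1,500 votes of the adjusted projection on each side; the small gap presumably comes from the 60,000 being a rounded figure.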

Ok, but what about Clinton! Presumably some of those “Democrats” who voted in the primary for Senator Clinton should really be treated as Republicans, right? Sure, I buy that to some degree. Fortunately, we can take the primary results, pull out some of those Clinton supporters (who are, after all, registered Democrats), and have the regression equation treat them as if they were registered Republicans. But how many? Well, polls seem to suggest that Obama is “having trouble” winning over anywhere between 10% and 20% of former Clinton supporters. So let’s start at 15% and see where that gets us.

Just to be clear, we’re taking 15% of the Clinton voters in every county and telling the regression equation to treat them as Republicans rather than Democrats. Basically, we’re giving the GOP a registration boost of about 190,000, and taking away the same from the Democrats (a net gain for the GOP of 380,000 voters). Also, we’re keeping in our switch of 60,000 faux Democrats, despite the fact that we’re certainly double counting (some of those 60,000 are also in the 190,000 Clinton voters), so that brings the total net registration gain for the GOP to 500,000. Not only that, we’ll assume that McCain can turn out these lapsed Democrats at the same rate as actual registered Republicans, another generous assumption. What happens to our projection now?

Obama – 2,985,846 (51.4%)
McCain – 2,822,006 (48.6%)

Obama still wins, albeit by a much smaller margin. Finally, one last tweak. There are four large counties in which the regression equation either greatly overestimates or underestimates the turnout per registered voter for one or both parties. In Allegheny County, for example, in both 2000 and 2004 Bush actually earned more votes than there were registered Republicans. On the other hand, in Delaware County, in both 2000 and 2004 Bush’s vote total was only a bit more than half the total number of registered Republicans. There are a few of these counties for Democrats too. After adjusting the projection for county-specific effects, here is the last projection:

Obama – 3,018,099 (51.7%)
McCain – 2,818,630 (48.3%)

So…why am I confident that Barack Obama will win Pennsylvania? Because right now, he holds a registration advantage of over 1.1 million voters. Because in the past two elections, the correlation between registrations and vote tallies approached perfection. Because even if 20% of people who voted for Clinton in the primary decide to act more like Republicans than Democrats then Obama still wins (in fact, with my assumptions, the “Clinton voter as Republican” rate would have to be 26% to switch the outcome). Because this whole exercise doesn’t even take into account the fact that in 2004, there were 250,000 registered Democrats in Philadelphia alone who didn’t vote!
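For the skeptics, here is a rough way to reproduce that break-even rate. It is a statewide approximation with two assumptions of mine: that the 2004 slopes apply to every additional voter moved (ignoring the county-specific adjustments), and that the total number of Clinton primary voters can be backed out from "about 190,000 is 15%".

```python
DEM_SLOPE = 0.6986   # 2004 votes per registered Democrat
GOP_SLOPE = 0.7694   # 2004 votes per registered Republican

# Final projection, with 15% of Clinton voters already treated as Republicans.
obama, mccain = 3_018_099, 2_818_630
margin = obama - mccain

# Each registrant moved costs Obama DEM_SLOPE votes and hands McCain GOP_SLOPE.
swing_per_registrant = DEM_SLOPE + GOP_SLOPE

# Additional registrants that would have to move to erase the margin.
extra_registrants = margin / swing_per_registrant

# Assumption: about 190,000 registrants is 15% of Clinton's primary voters.
clinton_voters = 190_000 / 0.15

break_even_rate = 0.15 + extra_registrants / clinton_voters
print(round(100 * break_even_rate))   # 26
```

In other words, more than a quarter of Clinton's primary voters would have to behave like registered Republicans before the projection flips.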

The polls suggest Pennsylvania will be close, and I think it will be. But close isn’t going to do it for John McCain. When we get closer to the election, I’ll write a post about what vote totals to look for on election night to see if McCain has a chance to pull off a major upset.

For those of you who made it to the end of this post, congratulations. I hope it was worth your time.



Registration data and election results come from PA Dept. of State.
