One of my favorite scenes from the West Wing is when Joey Lucas, their pollster, tells Josh not to let the White House get led around by bad poll numbers. Not because data isn’t important or because the math doesn’t add up, but because people are often complicated liars who don’t always know what they want.
How does this relate to data science?
Most voter targeting models use a binary classifier (with a bit of secret sauce running under the hood) to predict the probability that a person in the state’s voter file will support or oppose the intended candidate.
If we wanted to generate some example data, the first five rows of a voter file might look like the table below, plus a support column as a predicted column, where 0 means non-support and 1 means support, since the decision to vote in an election usually comes down to a binary choice.
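As a sketch, here is one way to generate a few rows of synthetic data like this. The column names (`voter_id`, `age`, `gender`, `party_id`) are hypothetical; real voter files vary by state and vendor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5

# Hypothetical voter file: a handful of attributes plus a predicted
# support column, where 1 = support and 0 = non-support.
voter_file = pd.DataFrame({
    "voter_id": range(1001, 1001 + n),
    "age": rng.integers(18, 90, n),
    "gender": rng.choice(["F", "M"], n),
    "party_id": rng.choice(["DEM", "REP", "IND"], n),
    "support": rng.integers(0, 2, n),
})
print(voter_file)
```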
Most predictive models for voting work by polling people in the voter file to learn their likely candidate preference, then using attributes like age, gender, party ID, and other x-factors to predict what other people, who share similar characteristics with the polled sample, might do.
There’s a more complicated way to do this that generates a direct probability for each individual voter, but for the purposes of this post, let’s keep this hypothetical restricted to a [0, 1] style outcome.
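The approach described above can be sketched as: fit a binary classifier on the polled subsample, then score the rest of the voter file. This is a minimal illustration with made-up features and made-up poll responses, not a real campaign model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical voter file features (encoded as numbers for the model).
voters = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "is_female": rng.integers(0, 2, n),
    "party_dem": rng.integers(0, 2, n),
})

# Pretend we polled 200 of these voters and recorded stated support (0/1).
polled = voters.sample(200, random_state=0).copy()
polled["support"] = rng.integers(0, 2, len(polled))

features = ["age", "is_female", "party_dem"]
model = LogisticRegression()
model.fit(polled[features], polled["support"])

# Score everyone we did NOT poll: a predicted 0/1 support label for each.
unpolled = voters.drop(polled.index).copy()
unpolled["support_pred"] = model.predict(unpolled[features])
```

In practice a campaign would use the predicted probabilities (`model.predict_proba`) rather than hard 0/1 labels, so voters can be ranked for persuasion or turnout outreach.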
The important thing to remember is that the target support column:
- Is something that we can only predict by polling people.
- Is almost impossible to validate after the fact, because we have a secret ballot in this country.
- Changes constantly throughout the course of the election, and is therefore extremely prone to error.
Let’s circle back to the West Wing episode where Lucas accuses the voters of lying. Now, lying is a bit of a stretch in my opinion, but there are certainly issues that come up when collecting opinions from the public.
- Response rates are sharply declining for pollsters, limiting the distribution of people who participate in polls used for targeting.
- One example of this problem: younger voters who have cell phones & aren’t picking up for unfamiliar numbers.
- Because there’s a limited sample of respondents, not all geographical regions & demographics may be represented in the pool.
If there’s one common data adage that applies to predicting voter behavior, it’s: “garbage in, garbage out.” You can have as many MIT-trained data scientists as you like working for your campaign, but if the public is uncooperative, there’s a limit to how useful your model can be.
The good news is that not all hope is lost. Many believe that the increased volume of robocalls has discouraged people from picking up the phone for pollsters.
The volume of robocalls has skyrocketed in recent years, reaching an estimated 3.4 billion per month. Since public opinion polls typically appear as an unknown or unfamiliar number, they are easily mistaken for telemarketing appeals. (Source: https://www.pewresearch.org/fact-tank/2019/02/27/response-rates-in-telephone-surveys-have-resumed-their-decline/)
But in light of the onslaught, Congress passed a new law in 2019 that puts restrictions on robocalls. My hope is that people will be more responsive once the law takes full effect.