November 8, 10:20pm: “Hillary Clinton has an 85% chance to win.” – New York Times
November 9, 2:52am: Donald Trump is elected President of the United States.
Over the last month, part of my daily routine has been visiting RealClearPolitics to get an unbiased prediction of who was going to be the next Commander-in-Chief of the United States. Every time I refreshed the site, which aggregates dozens of the most statistically rigorous national polls, the verdict was the same: Clinton was going to win.
Yet, here we are a week later. Donald Trump is now President-Elect Trump.
The implications of the inaccurate polls run deeper than a stunned electorate. As a society, we are making more and more of our most important decisions based on data. When that data is wrong, the consequences are severe. It means multimillion-dollar marketing campaigns gone awry. It means major product flops.
Several articles have attempted to suss out the bias. Collectively, they point to inaccurate voter turnout, imprecise survey methods, inaccurate usage of poll data, fear of being associated with a controversial candidate, people being hard to reach, and so on.
All of these are important, but when you have a whole industry off the mark, you have a deeper bias than survey design. That bias is that big data, no matter how rigorous, is still always flawed. I call this the Big Data False Certainty Bias.
The problem with data science is also its strength: collecting data on features of past events, building a model from them, and then making predictions about the future. The more the future is like the past, the more predictive these models are. The more novel it is, the less predictive. Therefore, the underlying bias of all big data is the assumption that the future will repeat the past.
This bias is insidious because it gives us a false certainty. In the case of the election, people visiting sites like RealClearPolitics or FiveThirtyEight thought they were eliminating all bias when they actually weren’t.
What we just witnessed in this election is exactly what conventional data science is not well suited for: events that diverge from the patterns of the past. As we move into a world of increasing complexity, we can expect more of this.
No one has warned more loudly about the Big Data False Certainty Bias than Nassim Taleb, investor, author of The Black Swan and Professor of Risk Engineering at NYU. Fittingly, the election results did not surprise him. For months, he had been taking to Twitter to publicly condemn Nate Silver and his company, FiveThirtyEight, one of the leading data journalism media companies covering the election.
Taleb calls rare, highly influential events in which the future parts ways with the past (like Trump’s victory) Black Swan events. The name is based on the fact that, at one point in history, everyone thought there were only white swans. They had never seen anything else, so they had 100% certainty based on the data. However, with one single observation of a black swan when Europeans reached Australia, this certainty went to 0%.
To bring home the gravity of Black Swan events, Taleb talks about the “turkey problem”: “Consider a turkey that is fed every day. Every single feeding will firm up the bird’s belief that it is the general rule of life to be fed every day by friendly members of the human race… On the afternoon of the Wednesday before Thanksgiving, something unexpected will happen to the turkey. It will incur a revision of belief.”
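To make the turkey's reasoning concrete, here is a minimal Python sketch. It assumes the turkey updates its belief with Laplace's rule of succession, which is my choice of illustration and not something Taleb specifies, and the feeding counts are hypothetical. It shows how a purely historical estimate climbs toward certainty while saying nothing about Thanksgiving.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of P(next trial succeeds) after `successes` out of `trials`."""
    return Fraction(successes + 1, trials + 2)


def turkey_confidence(days_fed: int) -> float:
    """The turkey's belief that it will be fed tomorrow, given an unbroken feeding record."""
    return float(rule_of_succession(days_fed, days_fed))


# Hypothetical feeding counts: the estimate only ever moves closer to 1.
for days in (1, 10, 100, 1000):
    print(f"after {days:>4} feedings: P(fed tomorrow) = {turkey_confidence(days):.4f}")
```

The estimate is highest on the last day before the surprise. Nothing in the feeding record can warn the turkey, because the event that matters has never appeared in its data.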
So how can we eliminate this Big Data False Certainty Bias?
First, we can learn to plan for the unexpected and hedge against it. At a basic level, you can do this by being more humble about your data model’s conclusions. They are never 100% certain.
Second, you can also ask yourself a series of simple questions with every big decision you make based on data:
1. Is the future going to look like the past?
2. If not, what are the levers that could cause that change?
3. How much could those levers impact the end result?
For example, in this election, one of the levers was turnout of rural and suburban voters. As a researcher, you could ask yourself, “If the participation rate of these voters increases by 4%, would that impact the model?”
If the answer is, “If just 4% more of these voters show up, that could change everything,” then you’d know that you need to find ways to test if those people are going to actually come out and vote.
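The turnout question above can be sketched as a small sensitivity check. Every number below is hypothetical and chosen only for illustration; none of it is election data. The sketch assumes a two-candidate race, reads "4%" as four percentage points of the group's eligible voters, and assumes the extra voters split between the candidates at a fixed share.

```python
def margin_after_turnout_shift(
    votes_a: int,
    votes_b: int,
    group_eligible: int,
    group_share_b: float,
    turnout_increase: float,
) -> int:
    """Return candidate A's margin over B after extra voters from one group show up.

    turnout_increase is the fraction of the group's eligible voters who vote
    in addition to the baseline (0.04 means four percentage points).
    """
    extra_voters = round(group_eligible * turnout_increase)
    extra_b = round(extra_voters * group_share_b)
    extra_a = extra_voters - extra_b
    return (votes_a + extra_a) - (votes_b + extra_b)


# Hypothetical baseline: the model has candidate A ahead by 15,000 votes.
BASELINE = dict(
    votes_a=2_400_000,
    votes_b=2_385_000,
    group_eligible=1_500_000,  # hypothetical size of the rural/suburban group
    group_share_b=0.65,        # hypothetical share of the group favoring B
)

# Sweep the lever from 0 to 6 percentage points and watch for a sign change.
for points in range(0, 7):
    margin = margin_after_turnout_shift(**BASELINE, turnout_increase=points / 100)
    leader = "A" if margin > 0 else "B" if margin < 0 else "tie"
    print(f"+{points} pts turnout: margin for A = {margin:+,} ({leader})")
```

With these made-up inputs, the lead flips somewhere between three and four points of extra turnout. The exact numbers are not the point. The sweep tells you which lever deserves a closer look before you trust the headline forecast.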
With this method, you don’t have to know the answers; you just have to know the questions.
Once you know the key questions, you can do behavioral studies (in the business world) that track what people actually do rather than just what they say they’re going to do. The problem with just asking people what they think is that we humans lie to ourselves and others in order to present a version of ourselves that will be viewed favorably by others. People lied with Brexit, and they lied with Trump. They lie when brands ask them what they prefer. They even lie when their spouses ask them hard questions.
Bottom line: This year’s presidential election is a wake-up call for anybody dealing with big data. Big data without awareness of its bias leads to big mistakes.