If casting predictions is your bread and butter, you know how hard it is to be spot on. Luckily, in most cases it does not matter much when we are a bit off target because the implications are modest at best. This is why every prediction comes with a margin of error or a confidence interval. Still, when Trump defied the odds of the poll predictions on election night and edged out a victory, I felt deeply troubled. Stats let me down on this important occasion, and it was tough to take.
On the one hand, I felt troubled because I could not conceive how someone could pay attention to what had been said during the presidential race and still decide to vote for Trump. Clearly, this was a false generalization of my friends’ and my own personal perception of the political landscape. On the other hand, I was simply not expecting Trump to win given all the available poll data and the projected electoral votes. I knew that there were many toss-up states, but how could Hillary possibly lose in so many of them that it would matter at crunch time? The reason is as simple as it is devastating: The results across the toss-up states are not statistically independent.
Voting across swing states is not quite like tossing a coin
Statistical independence of observations is an important assumption of many common approaches to data analysis. However, it is often not met, and if you fail to incorporate the correlation of error terms, you may end up with the surprise election of Trump. Arguably, it may not be as bad every time, but the only positive I can see in it is that we try to learn and improve. Let’s do a simple example to see how it works. Assume that we toss a fair coin 10 times. How likely is it that we get 8 heads or more? To answer this question, we can use the binomial distribution with p(head) = .5, n(trials) = 10 and x(head) >= 8. It turns out that p(x) is .055, so a result like one candidate winning 8 out of 10 toss-up states is quite unlikely. Nevertheless, the probability of ~5% indicates that in 20 elections, this could happen once simply by chance. So was Trump just very lucky that day?
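Under the independence assumption, this tail probability takes only a few lines of Python (standard library only) to verify:

```python
from math import comb

# P(X >= 8) for X ~ Binomial(n=10, p=0.5): the chance of one candidate
# winning 8 or more of 10 truly independent 50/50 toss-up states
n, p = 10, 0.5
p_upset = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, n + 1))
print(round(p_upset, 3))  # 0.055
```

The exact value is 56/1024 ≈ .0547, which is where the ~5% in the text comes from.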
Yes, but not just lucky. There was a systematic bias in the polls that worked in his favor in the end. You may think of this bias in the following way: If results in Wisconsin are better for Trump than expected, then it becomes more likely that the results in Pennsylvania will be in his favor as well. When the results across states are interdependent to some degree, then the ~5% is only a lower-bound estimate of an upset. Thus, as soon as a prediction model accounted for the correlations of errors across states and modeled bias as a “random effect” at the state level, the estimated probability of a Trump win suddenly rose, to as high as ~30%. The model by Nate Silver did that and outperformed many others. Then again, one might say he was already ahead of the crowd back in the Obama elections. The advantage of the approach is very nicely illustrated by a blog post at simplystatistics.org (http://simplystatistics.org/2016/11/09/not-all-forecasters-got-it-wrong/ ), so there is no need for me to reiterate that.
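A quick Monte Carlo sketch makes the point (this is my own toy model with an invented bias scale, not Silver’s actual one): a single shared polling error that shifts every state’s odds in the same direction inflates the chance of an 8-out-of-10 sweep well beyond the independent-coin value.

```python
import random

random.seed(42)

def sweep_prob(bias_sd, n_states=10, n_sims=20_000):
    """Estimate the probability that one candidate wins >= 8 of 10
    50/50 toss-up states when a single shared polling error ('bias')
    shifts every state's odds in the same direction."""
    sweeps = 0
    for _ in range(n_sims):
        bias = random.gauss(0, bias_sd)        # common error across all states
        p = min(max(0.5 + bias, 0.0), 1.0)
        wins = sum(random.random() < p for _ in range(n_states))
        if wins >= 8:
            sweeps += 1
    return sweeps / n_sims

print(sweep_prob(0.0))    # ~0.055, matches the binomial result
print(sweep_prob(0.15))   # considerably higher once errors are shared
```

With no shared bias, the simulation reproduces the ~5% from above; with a correlated error term, the upset probability climbs several-fold, which is exactly the lower-bound argument in the text.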
Assumptions of statistical independence in neuroimaging
Perhaps ironically, the neuroimaging community had already been hit hard by the “cluster failure” earthquake of headlines that resulted from a PNAS paper by Eklund et al. (http://www.pnas.org/content/113/28/7900.abstract ) earlier in 2016. If you managed to miss headlines such as “a bug in fMRI software may invalidate 15 years of brain research” or “40,000 neuroscience papers on fMRI might be wrong”, let me briefly introduce the problem and summarize the results and implications of the paper. There are basically two popular ways to tell if there is a significant response in the brain. This step is important because fMRI has a low signal-to-noise ratio, so it is not as if you were simply looking at a scale and getting a proper result in an intuitive unit. A better analogy might be measuring the weight of something very light, like a pillow, with your bathroom scale: Whereas the reading is inaccurate if you just put the pillow on the scale, you can weigh yourself while carrying the pillow, subtract your own weight, and get a better readout of the true weight of the pillow. Likewise, in fMRI, you have to put the signal in perspective and use relative signal change, which is small in magnitude. When you have such noisy data, you have to apply statistics to determine if you can trust a signal peak as an indication of the underlying brain activation.
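In code, the pillow-on-the-scale idea corresponds to expressing the signal as percent change from baseline (all numbers below are invented purely for illustration):

```python
# Put a small raw fMRI signal in perspective by expressing it
# relative to its baseline (numbers invented for illustration).
baseline = 1000.0   # mean scanner units in a voxel at rest
task = 1012.0       # mean scanner units during the task condition
pct_change = 100.0 * (task - baseline) / baseline
print(pct_change)   # 1.2 -- BOLD effects are often on the order of ~1%
```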
Peaks of unlikely height may indicate that you found a mountain
The first way is to look at the strength of the signal relative to its variability. In statistical terms, we may ask: Is the strength of the brain response to food pictures so high that it is unlikely to occur if there was no association? Now you apply a criterion called the alpha level, which caps the probability that you are wrong in drawing this conclusion; it is typically set to 1 in 20. If you run multiple tests, you inevitably increase your chances of getting false-positive results, so you need to adjust for that to keep the overall error rate in check. The Eklund et al. paper shows that if you apply such a correction, you are on the safe side. You might turn out to be a little too conservative, which means that you conclude more often than necessary that there is no brain activation at a small spot in the brain where there actually is something, but these false negatives are typically only a minor concern.
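How quickly false positives pile up across many tests, and how a classic Bonferroni-style correction reins them in, can be sketched in a few lines (the voxel count is just a ballpark figure):

```python
# Family-wise error: chance of at least one false positive among m
# independent tests, each run at an uncorrected alpha of .05
alpha = 0.05
for m in (1, 10, 100_000):  # ~100k voxels is a plausible whole-brain count
    fwer = 1 - (1 - alpha) ** m
    print(m, round(fwer, 3))

# Bonferroni correction: test each voxel at alpha/m so that the
# family-wise error rate stays at (or just below) ~alpha
m = 100_000
fwer_corrected = 1 - (1 - alpha / m) ** m
print(round(fwer_corrected, 3))
```

Bonferroni is slightly conservative by construction, which matches the false-negative trade-off described above.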
An extensive plateau might also indicate that you found a mountain
The second way to determine brain activation is potentially more problematic. Here, the assumption is that brain activation is distributed across multiple nearby spots in the brain called voxels (3D pixels). This makes sense because the distributed nature of the (hemodynamic) brain response is well known. To calculate how many voxels have to be activated in concert, you have to set an initial threshold of signal strength that they have to surpass before you even consider them. The important message of the Eklund et al. paper is that this correction does not work as intended, particularly when the initial threshold is too low. Why did common fMRI software packages fail to account for that?
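To make the cluster-extent idea concrete, here is a minimal, hypothetical 1-D sketch of the thresholding step: pick a cluster-forming threshold, then count runs of contiguous supra-threshold voxels.

```python
def cluster_sizes(values, threshold):
    """Return the sizes of contiguous runs of 1-D 'voxels' whose test
    statistic exceeds the cluster-forming threshold."""
    sizes, run = [], 0
    for v in values:
        if v > threshold:
            run += 1
        elif run:
            sizes.append(run)
            run = 0
    if run:
        sizes.append(run)
    return sizes

# Toy z-values along one line of voxels (invented for illustration)
z = [0.1, 2.4, 2.7, 3.1, 0.3, 2.2, 0.5, 2.9, 3.3, 3.0, 2.6, 0.2]
print(cluster_sizes(z, 2.3))  # [3, 4]
```

Only clusters larger than some critical size are then declared significant, and that critical size is where the flawed smoothness assumptions enter.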
“The first assumption is that the spatial smoothness of the fMRI signal is constant over the brain, and the second assumption is that the spatial autocorrelation function has a specific shape (a squared exponential).”
Eklund et al. (2016), p. 3; http://www.pnas.org/cgi/doi/10.1073/pnas.1602413113
Local and global correlation patterns are intertwined
The first assumption is incorrect, as you can see in the figure. I also put the images next to the results of the presidential election for Ohio, split by county. The data come from CNN’s collection of county-level results, which is worth checking out. The Ohio map nicely illustrates the divide between urban and rural counties that is evident throughout the US: Hillary won the urban parts of Ohio whereas Trump won the rural parts. Athens County is an outlier, but the home of Ohio University is known to be a Democratic stronghold, much like other counties with big state universities (another error term one might consider). At the bottom of the figure, you can see that spatial smoothness differs considerably throughout the brain. Much like population density hints at voters’ preference at the level of the county, smoothness in fMRI data is heavily influenced by gray matter density, or anatomy in general. The challenge for prediction becomes clear at this point. If I had to predict the result of a county in Ohio, I would say: “Trump!”, and do well. However, if I had the additional information that it is an urban county, the expectation from other observed urban counties might convince me to say: “Hillary!”, regardless of the neighboring evidence in favor of Trump. However, current fMRI analyses do not really capture the intertwined nature of local and global contributions yet, which likely contributes to the “cluster failure” phenomenon.
Long-range autocorrelations: from coast to coast
The second reason for cluster failure is that the spatial autocorrelation function does not follow the simple assumed shape. In general, one would expect that autocorrelation decreases with increasing distance between voxels. However, there is a systematic deviation between the assumed decrease of spatial correlations (a squared exponential) and the observed function, which has heavier tails than predicted. In other words, there is a markedly stronger spatial autocorrelation for long distances in the brain. This observation could partly be due to a scanner artifact because it is also seen when phantoms, not brains, are scanned, as Eklund et al. pointed out.
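The mismatch is easy to see numerically. Below, I compare the assumed squared-exponential decay with a plain exponential as a stand-in for a heavier-tailed function (the length scale is an arbitrary choice for illustration, not an empirical estimate):

```python
import math

def squared_exponential(d, scale=10.0):
    """ACF shape assumed by the classic cluster-extent methods."""
    return math.exp(-(d / scale) ** 2)

def heavier_tailed(d, scale=10.0):
    """Plain exponential decay: a stand-in for an ACF with heavier tails."""
    return math.exp(-d / scale)

# Similar at short range, drastically different at long range
for d in (5, 20, 40):
    print(d, squared_exponential(d), heavier_tailed(d))
```

At 40 units of distance, the squared exponential has decayed to practically zero while the heavier-tailed function still retains noticeable correlation; this residual long-range correlation is exactly what the assumed shape misses.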
Nevertheless, spatial autocorrelation may also arise from long-range structural and functional connections within brain networks. For example, the default mode network spans quite a distance in the brain, and shared signaling dynamics within such spatially distributed networks might violate the independence assumptions for distant voxels. In my understanding, this resembles the case of the election models that struggled to predict the outcome of the presidential election because they did not model error terms across states correctly. When you look at the election results, for example, it is fair to hypothesize that the East Coast and West Coast share a common bias that is not attributable to spatial proximity. Hence, if you fail to model the potential interdependence of errors, an unexpected ensemble of shared outcomes might strike you by surprise.
Evidently, false certainty is not a good thing, especially when it drags down voter turnout on top. There are more lessons that we can take from this upset. For example, how can we as neuroscientists help improve the prediction of behavior based on a single measure of intention? Why are point estimates deceptive at times? Stay tuned for the second part of the saga and follow neurocopiae along the trail. But for now, we can conclude that good statistical modeling is very hard, but indispensable at the same time if we want to reduce the number of shocking surprises in the future. And I don’t need another election like that, to be honest.