One more week has passed since I posted the first part of my take on the presidential upset in the US elections. First of all, I want to say that I was pleasantly surprised to see that it received good attention and was picked by the editors of scienceseeker.org (thanks!). Once you wake up on the wrong side of the error bar, you start to wonder if there is any chance to do better next time. What worked in Trump’s favor has also led to erroneous estimations of brain activation clusters in fMRI research. Correlated errors are omnipresent in data but hardly present in statistical models, regardless of the domain, and I covered this aspect in Part 1. In the second part of my post, I will deal with two more statistical issues that surfaced after the election but are not reflected in common data-handling practice in neuroscience: 1) misconceptions about what the margin of error truly reflects and 2) the gap between a sample (what you got) and the underlying population (what you want to get at).
Despite its name the standard error does not account for all sorts of “standard” errors
The definition of the standard error seems plain and clear. On Wikipedia, the first sentence simply reads “The standard error (SE) is the standard deviation of the sampling distribution of a statistic, most commonly of the mean.” That sounds like we are pretty much covered, right? Well, we are covered, yet only for one sort of error, namely the sampling error of a test statistic. We will stick to the mean because we typically care most about the mean, but it equally applies to other statistics of interest. Every time we run a study, we need to sample from a population if we cannot sample the whole population at once. Imagine that we ask 1,000 inhabitants of Florida who they intend to vote for. Let’s say we get 460 responses for Hillary, 440 for Trump and 100 for third party. This would be the corresponding bar plot of the results:
You might miss the error bars, so how do we get them when we have “counts” on the y-axis? The simplest way to calculate the standard deviation (SD) and the corresponding standard error of the mean (SEM) here is by splitting up the variable into three “dummy” variables. Each dummy variable is assigned the value 1 if the individual responded that they would vote for Hillary, Trump, or third party, respectively, and 0 otherwise. The mean of each dummy variable is then the percentage score for that category. For a dummy variable with mean p, the SD is approximately the square root of p × (1 - p), and the SEM is the SD divided by the square root of the sample size. In our example, we get SEMs of .016 for Hillary, .016 for Trump, and .009 for third party. Assuming that the obtained percentages are representative of the population, what the SEM covers is the following: if we were to repeat the same survey many times, ~68% of the sample averages would fall within one SEM of the population value. Often, we extend this margin of error and use confidence intervals to be on the safe side. To obtain a symmetric 95% confidence interval, we multiply the SEM by 1.96. This number comes from the area under the curve of the normal distribution, but we need not be concerned with it at this point. For visualization, I have resampled the survey data 10 times and plotted the results here:
I added colored reference lines so that it is easier to see when the 95% confidence interval of the resample includes the initial value of the reference sample. We can see that despite the 2-point advantage for Hillary in the initial sample, Trump would win the state in 1 out of 10 resamples and tie in 1 more instance. This is exactly why a 2-point margin is still considered a toss-up: there is not enough evidence to predict the outcome with sufficient certainty to safely call the state for one party or the other.
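For readers who want to retrace the numbers, here is a minimal sketch of the dummy-variable SEM, the 95% confidence interval, and a resampling exercise for the Florida example. It is not necessarily how the plots in this post were produced: I am assuming that "resampling" means drawing new surveys of 1,000 respondents with the observed shares treated as the truth, and the function names (`sem_of_share`, `simulate_resamples`), the seed, and the number of resamples are choices made for this illustration only.

```python
import math
import random

# Survey counts from the example: 1,000 respondents in Florida.
counts = {"Hillary": 460, "Trump": 440, "Third party": 100}
n = sum(counts.values())


def sem_of_share(k, n):
    """SEM of a 0/1 dummy variable with k ones among n responses."""
    p = k / n
    # Sample SD of the dummy variable (n - 1 in the denominator).
    sd = math.sqrt(p * (1 - p) * n / (n - 1))
    return sd / math.sqrt(n)


sems = {name: sem_of_share(k, n) for name, k in counts.items()}
# Symmetric 95% confidence interval: share +/- 1.96 * SEM.
ci95 = {
    name: (k / n - 1.96 * sems[name], k / n + 1.96 * sems[name])
    for name, k in counts.items()
}


def simulate_resamples(counts, n_resamples, seed=0):
    """Draw new surveys of the same size, treating the observed shares as the truth."""
    rng = random.Random(seed)
    names = list(counts)
    weights = [counts[name] for name in names]
    size = sum(weights)
    results = []
    for _ in range(n_resamples):
        draw = rng.choices(names, weights=weights, k=size)
        results.append({name: draw.count(name) / size for name in names})
    return results


# Assumption: 2,000 resamples instead of 10, to get a stable proportion.
resamples = simulate_resamples(counts, n_resamples=2000)
trump_not_behind = sum(r["Trump"] >= r["Hillary"] for r in resamples) / len(resamples)

for name in counts:
    low, high = ci95[name]
    print(f"{name}: share={counts[name] / n:.3f}, SEM={sems[name]:.3f}, "
          f"95% CI=({low:.3f}, {high:.3f})")
print(f"Share of resamples where Trump leads or ties: {trump_not_behind:.2f}")
```

Under these assumptions, the share of resamples in which Trump leads or ties should come out at roughly a quarter, which is in the same ballpark as the 2 out of 10 in the plot. Note that this simulation only reshuffles sampling error. It says nothing about whether the initial 1,000 respondents were a fair draw from the voters in the first place.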
How many uncertainties are contained? I am not sure
Still, it is important to reiterate that when we talk SEM, we are only talking about the uncertainty that is introduced by randomly resampling from a bigger population, where the initial sample is assumed to be representative of the population. If our initial estimate was off because sampling was systematically biased to begin with or there is a bias in the reporting of the voters’ preferences (e.g. landline phone sampling vs. online polling was identified as one), the margin of error might fall short and leave your flanks open for attack. In other words, the standard error refers to the error introduced by resampling from a population, but not the error that is involved in selecting the sample unless this bias were the same across (real) resamples. The latter source of error also made the aggregated polls (as provided by realclearpolitics.com, for example) fail to some extent because there was a small but decisive difference in the likelihood of taking part in a landline phone survey versus casting the vote on election day. The Upshot posted a great analysis in advance of the election that is well worth a read. As Andrew Gelman (one of the pollsters involved) pointed out in his blog post, the key difference between the sample and population of voters in Florida was party registration. When the sample has more registered Democrats (by chance? by distorted sampling routines?) than a representative sample, you may eventually carry over this bias to the population level. Among other things, this is what they took into account to make a prediction, and it turned out to be quite accurate since their model’s prediction was Trump +1 instead of Clinton +1-3 (vs. Trump +1.3 in the election). So we see that sampling bias can shift a result in one direction or the other independently of our comforting margin of error. In fact, problems 1 and 2 that I sketched at the beginning of my post most frequently go hand in hand.
If you falsely believe that the error bars got you covered, you might become complacent in selecting your sample. As a result, your research might only apply to the bubble that you live in.
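To make the party-registration point more tangible, here is a minimal sketch of reweighting a survey so that each registration group counts in proportion to its share among voters (often called post-stratification). All numbers are invented for illustration and are not taken from the Upshot poll, Gelman's model, or any Florida statistics. The names `sample`, `population_share`, and `margin` are likewise hypothetical.

```python
# Hypothetical survey: respondents per registration group and their stated preferences.
# None of these numbers come from the Upshot poll or any real data set.
sample = {
    "Democrat":   {"n": 400, "Clinton": 0.88, "Trump": 0.08},
    "Republican": {"n": 350, "Clinton": 0.07, "Trump": 0.89},
    "Other":      {"n": 250, "Clinton": 0.42, "Trump": 0.44},
}

# Hypothetical share of each group among the people who actually vote.
population_share = {"Democrat": 0.38, "Republican": 0.37, "Other": 0.25}


def margin(group_weights):
    """Clinton minus Trump share when groups are combined with the given weights."""
    total = sum(group_weights.values())
    clinton = sum(w * sample[g]["Clinton"] for g, w in group_weights.items()) / total
    trump = sum(w * sample[g]["Trump"] for g, w in group_weights.items()) / total
    return clinton - trump


# Raw estimate: every respondent counts once, so over-sampled groups dominate.
raw_margin = margin({g: v["n"] for g, v in sample.items()})
# Reweighted estimate: each group counts according to its assumed population share.
weighted_margin = margin(population_share)

print(f"Raw margin (Clinton - Trump):        {raw_margin:+.1%}")
print(f"Reweighted margin (Clinton - Trump): {weighted_margin:+.1%}")
```

With these made-up inputs, a raw Clinton lead of about 3 points turns into a slight deficit after reweighting, although the within-group preferences have not changed at all. The SEM of the raw estimate would not have warned us about this shift.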
How representative are samples in neuroscience? That’s hard to tell because we don’t care a lot about it
Historically, many techniques in neuroscience have been very laborious and expensive so that they could not be employed at a huge scale. Consequently, small samples have dominated research for a long time, leading to low power for most phenomena that cognitive neuroscientists are interested in. A second reason is the widely held belief that many aspects of brain function are preserved even across species (just a bunch of neurons!) so that research can be generalized to a reasonable extent regardless of the setting.
A third reason comes from the practice in psychology to recruit undergraduate students to conduct research on topics of general psychology such as learning, memory, and motivation. Certainly, we all need to do that and how many different ways can there possibly be? Thus, it is not surprising that I cannot think of an fMRI study where the results of the sample were projected onto the level of the population based on known sample characteristics, similar to the survey data that included too many Democrats to be representative of Florida. Notwithstanding, many high-profile papers claim to solve important health problems one blob at a time. A fourth reason is the practice of inclusion and exclusion criteria for fMRI studies. Of course, we need some criteria for safety reasons. However, I recall that one of my reviewers objected to the publication of a paper because lifetime depression was not strictly exclusionary, and similar points have been made repeatedly. The rationale is simple and pervasive: We know that study xyz showed differences in the “motivational circuit” in depressed patients vs. controls, so that might be a confound in the current study. I agree that it might be, but it is virtually impossible to demonstrate that it is not in a single study. But if we routinely exclude ~20% of the adult population based on such a criterion + ~15% due to a history of substance-use disorder + ~10% left-handers and so on and so on, representativeness becomes not much more than an illusory ambition.
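As a back-of-the-envelope check, the sketch below shows how quickly such exclusion criteria add up. It uses the rough percentages from the paragraph above. The independence calculation is an assumption for illustration (depression and substance use, for instance, tend to co-occur), so I also compute the two extremes of complete overlap and no overlap between the criteria.

```python
# Rough exclusion rates quoted in the text (approximate, not exact prevalences).
exclusion_rates = {
    "lifetime depression": 0.20,
    "substance-use disorder": 0.15,
    "left-handedness": 0.10,
}
rates = list(exclusion_rates.values())

# Assumption: criteria are statistically independent of each other.
remaining_if_independent = 1.0
for rate in rates:
    remaining_if_independent *= 1 - rate
excluded_if_independent = 1 - remaining_if_independent

# Bounds that hold regardless of how the criteria overlap.
excluded_lower_bound = max(rates)            # smaller groups fully inside the largest
excluded_upper_bound = min(1.0, sum(rates))  # no person meets two criteria

print(f"Excluded if independent: {excluded_if_independent:.1%}")
print(f"Excluded, lower bound:   {excluded_lower_bound:.1%}")
print(f"Excluded, upper bound:   {excluded_upper_bound:.1%}")
```

Even in the most forgiving case, a fifth of the adult population is out before the first scan, and with only these three criteria the figure may approach 40% or more.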
The problem is not a lack of solutions, since there are great suggestions for how to deal with these types of problems. For example, the power of recruitment modeling was nicely illustrated by Earl Hunt, but obviously, we are not yet at the point of employing such methods in earnest. Perhaps we have to understand our little bubbly world first. Nevertheless, we have to start putting these issues on the table and raise awareness in the neuroscience community. Otherwise, we may never understand how so many “brains” could end up voting for Trump if we simply focus on rationality being expressed in artificially preselected settings.
More work lies ahead of us: Big data needs careful attention
To summarize, I have discussed how spatial autocorrelations (Part 1) can help to make an unlikely upset become very real and why the margin of error only covers a subset of potential errors that can occur in real-life data. Personally, I find it alarming that few studies take sample recruitment into account in trying to understand the function of the brain. Arguably, there is an intricate variability even in the most general mental processes within a given population, and everyone who uses Facebook or Twitter can easily make this observation. There is one more problem that I think is highly prevalent and misleading at times, namely the approach of building global accounts of brain function from a number of (thresholded) parcels, but I will save this for another day.
Now that I have taken the chance to talk at length about some of my thoughts on statistical modeling in real life, I am curious to hear about your perspective as well. What can we learn from other domains of data analysis? Where did you notice that your models perhaps fall short of the hidden complexities of real data? Please use the comments section or email me to weigh in.