Amping up control? Bad research practices and poor reliability raise concerns about brain stimulation

There is a lot of buzz around brain stimulation, but new problems are starting to surface. Neurocopiae reviews news on bad practices and poor reliability.

It hasn’t been a very good week for proponents of the popular brain stimulation method called transcranial direct current stimulation (tDCS). tDCS is a non-invasive technique that uses electrodes to deliver weak current to a person’s forehead. Numerous papers have claimed that tDCS can enhance mood, alleviate pain, or improve cognitive function. Such reports have sparked interest in tDCS at a broader scale. When you enter tDCS in the YouTube search, you will find DIY tutorials on how to assemble a device so that you can amp up your brain at home, including enthusiastic reports of the resulting changes in brain function. To put it in Richard Dawkins’ words: Science? It works, bitches. In particular, it works when you know what the outcome should be.

Yes, we certainly want to believe that cognitive enhancement could be so easy to attain. Strap a battery onto your forehead and thoughts will flow free just like current. However, two recent studies published in PLOS One and NeuroImage cast new doubts on the over-enthusiastic early reports. Coincidentally, in April last year, the “cadaver study” had already caused a lot of bad press for tDCS. György Buzsáki of New York University and his colleague demonstrated that hardly any of the applied current entered the brain. The researchers reported that up to 90% of the tDCS current had been redirected. The skin on the skull basically acted as a shunt. Although these results could challenge the hypothesized mechanism, the cadaver study appears to be unpublished to date. Hence, it is still “only” a report from a conference that was heavily discussed because Science magazine covered the story. One could simply dismiss concerns at this stage as premature if there were no other reasons for concern. So back to the two new studies.

Questionable science and reproducibility in electrical brain stimulation research

In PLOS One, Héroux et al. report on an online survey of researchers and an audit of 100 randomly selected publications. In the survey, they found that only half of the researchers reported being able to routinely reproduce published results. Also, whereas 61% of researchers said they used power calculations to design studies, only 6 out of 100 published papers reported such calculations. Notably, 43% of researchers also said that other researchers “adjust statistical analyses in order to optimise the results” and even 25% admitted that they did so themselves as well. Results for other bad practices leading to inflation of published effect sizes such as selective reporting of outcomes or omission of experimental conditions were comparable. Not surprisingly though, the audited literature did not indicate that this is common practice. This seems to be in line with the perception of Vincent Walsh of University College London, who was quoted in the Science post on the cadaver study saying that the tDCS field is “a sea of bullshit and bad science—and I say that as someone who has contributed some of the papers that have put gas in the tDCS tank”.

Test-retest reliability of prefrontal tDCS effects on functional connectivity

Yes, this is just a survey of researchers. It might be as biased as the literature on tDCS is. No hard evidence. We need test-retest data on tDCS to know if it works or not, I guess. Wait no longer, we have it now, thanks to Jana Wörsching of LMU Munich and colleagues. In this case, the authors decided not to fuel the controversy on tDCS effects in the title. In fact, the publication comes from a study “designed [as] a pilot for further [test-retest] experiments with larger sample sizes” so I guess the disappointingly low reliability estimates came as a surprise to the researchers as well. Nevertheless, the main result continues to pile misery on an already troubled field of research:

Analyses of individual [resting-state functional connectivity] MRI responses to active tDCS across three single sessions revealed no to low reliability, whereas reliability of RS-fcMRI baselines and RS-fcMRI responses to sham tDCS was low to moderate.

Wörsching et al., 2017

Although this is a small study with only 10 subjects per condition, Wörsching et al. made good use of the repeated sessions of active vs. sham stimulation to estimate the test-retest reliability:

To our knowledge, this is the first study investigating the [test-retest] reliability of prefrontal tDCS-induced modulation in RS fcMRI. For this purpose, effects of active or sham tDCS on RS fcMRI were measured on three different days in the same healthy subjects. In a first step, RS fcMRI at baseline and post tDCS was determined at an individual level. In a second step, reproducibility of intra-individual baseline and post-tDCS RS-fcMRI was tested using voxel-wise intra-class correlations, enabling comparisons between baseline RSfcMRI reliability and reliability following active-tDCS or sham-tDCS intervention.

Wörsching et al., 2017

This figure, based on Table 1, provides a good summary of the main results (y axis = reliability, x axis = seed regions in the prefrontal cortex that should be affected by the stimulation):


In the upper “sham stimulation” panel, we see that the median voxel-wise intra-class correlation coefficient (ICC) is mostly between .3 and .5, which corresponds to a low to moderate reliability of the measurement. Among other applications, the ICC is commonly used to quantify the degree to which observations from the same participant resemble each other compared to observations from other participants. In this case, when the ICC is 0, it indicates that the measurements do not capture anything specific to that particular person. As an example, imagine several individuals tossing coins and recording the outcomes. As long as the coins are similarly fair, the ICC will be indistinguishable from 0 in the long run because the outcomes are only governed by a global stochastic process and not substantially influenced by individual differences in tossing a coin. Back to the plot. We see that there is also very little difference between pre and post stimulation runs, so the ICCs are about the same. The magnitude is still not great, but the reported values are in line with previous studies on functional connectivity indices.
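To make the coin-toss intuition concrete, here is a minimal simulation sketch. It assumes a one-way random-effects ICC, often written ICC(1,1); the quoted passages do not say which ICC variant Wörsching et al. computed, so treat the formula as an illustration and not as their pipeline. The function name icc_oneway, the sample sizes, and the "stable trait" scenario are all made up for this example.

```python
import numpy as np


def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for an (n_subjects, k_sessions) array.

    Assumption: this is one common ICC variant, chosen for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    subject_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Variability between people (how much subject means differ from each other)
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    # Variability within people (how much sessions differ inside one person)
    ms_within = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


rng = np.random.default_rng(0)
n_people, n_sessions = 2000, 3  # hypothetical sizes, large enough for stable estimates

# Scenario 1: everyone tosses a fair coin once per session (0 = tails, 1 = heads).
coin = rng.integers(0, 2, size=(n_people, n_sessions))
icc_coin = icc_oneway(coin)

# Scenario 2: each person has a stable trait, measured with equally large noise.
trait = rng.normal(0.0, 1.0, size=(n_people, 1))
noise = rng.normal(0.0, 1.0, size=(n_people, n_sessions))
icc_trait = icc_oneway(trait + noise)

print(f"ICC, fair coins:           {icc_coin:.2f}")
print(f"ICC, stable trait + noise: {icc_trait:.2f}")
```

With fair coins the estimate should hover around 0, whereas a stable trait with equal trait and noise variance should land near .5, which is roughly the upper end of the range seen in the sham panel. Note that with only 10 subjects, as in the study, single estimates would scatter far more widely than in this large simulated sample.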

Now, the lower “active stimulation” panel shows that the ICCs drop to ~0 after tDCS. In other words, knowing the stimulation effect from one testing session in a participant tells you virtually nothing about the stimulation effect in a second testing session in the same participant. This is bad news because it suggests that the effects of tDCS cannot be reproduced in the same person. Such reproducibility would seem, however, to be a prerequisite for any fruitful application at the individual level. Suppose this was a painkiller and your response to the drug on day 1 was not indicative of your response on day 2 or day 3, even though we know from previous research that some of the effects of the intervention should depend on your individual characteristics. Moreover, subgroup analyses based on “responders” and “non-responders” or any other individual characteristic seem very problematic in light of these results. That doesn’t mean that they aren’t a common way to make a “negative” result appear more favorable, as the survey suggests.

“I always knew this is the rotten apple!”

Again, we could point the finger at tDCS researchers and mock them because their fancy new technique does not seem to work very well, despite the overwhelming support by positive findings reported in the literature. Yet, problems with reliability are far from specific to the tDCS field. Yes, we see that tDCS disrupts the (low to moderate) reproducibility of the functional connectivity outcomes used in the Wörsching et al. study. So tDCS does something at the very least, something that we can’t capture with a reasonable connectivity analysis.

But how often do we see comprehensive assessments of the reliability of new techniques in human neuroscience at all? These studies are commonly seen as less interesting and they won’t help you get a paper in a fancy journal unless you trash the whole field. If you only trash the reliability of your newly developed paradigm or method, it might be difficult to claim that the effects somewhere around the edge of the significance threshold truly warrant any enthusiastic conclusions that attract future citations.

Still, in differential psychology, everyone would expect you to conduct and report such basic assessments first before you make any strong claim about individual differences. My take-home from the tDCS story is that basic checks of the reliability of every method in neuroscience should always come first and that they should be reported as well. They are important. Now we have kids building tDCS devices based on YouTube tutorials and we know very little about what tDCS does and what it does not do reliably. The call for more focus on reliability of outcomes might seem trivial, but think of your favorite fMRI studies and how often you have seen a discussion of the reliability of the method itself. It’s not a lot.


If you need more cognitive stimulation, here is one more example of the enthusiastic media coverage:
