In Depth | Neuroscience

Brain scans are prone to false positives, study says


Science  15 Jul 2016:
Vol. 353, Issue 6296, pp. 208-209
DOI: 10.1126/science.353.6296.208

To neuroscientists, functional magnetic resonance imaging (fMRI) has delivered both unmatched insights and occasional doses of embarrassment. In the 1990s, fMRI revolutionized the field by providing a way to study brain activity in human subjects. More recently, however, a handful of studies have pointed to flaws in how fMRI researchers analyze their data and in some of the assumptions underlying the technique, which infers brain activity from changes in blood flow. Media coverage of both the field's more sensational “this is your brain on politics”-type findings and its woes (such as the “voodoo correlations” fracas a few years ago) hasn't helped its credibility.

Now, the field is buzzing about an analysis published online 28 June in the Proceedings of the National Academy of Sciences (PNAS). Anders Eklund, an electrical engineer at Linköping University in Sweden, and colleagues examined statistical methods in three software packages commonly used to analyze fMRI data. They found that certain common settings in the software gave rise to a false positive result up to 70% of the time. In the context of a typical fMRI experiment, that could lead researchers to wrongly conclude that activity in a certain area of the brain plays a role in a cognitive function such as perception or memory.

Some other neuroscientists note that these pitfalls in fMRI interpretation have been known for decades, and savvy researchers know how to avoid them. But Geoffrey Aguirre, a neuroscientist and neurologist at the University of Pennsylvania, says the new study has demonstrated the problem “more clearly than anyone had done before, using a larger data set than has been used before.” When Eklund and colleagues posted their paper on arXiv.org last November, they suggested that their findings raised questions about all 40,000 fMRI studies published in the last 2 decades, but they have since scaled back that estimate after a closer look at the literature. In a blog post last week, Eklund's co-author Thomas Nichols, a statistician at the University of Warwick in Coventry, U.K., estimated that about 3500 studies used the problematic software settings, still an alarming number.

A new study has renewed the debate over the use of statistics in functional magnetic resonance imaging research.

PHOTO: JAMES KING-HOLMES/SCIENCE PHOTO LIBRARY

The problem arises from a fundamental challenge in fMRI research. In a typical experiment, subjects do a task inside the scanner, something to engage their memory, attention, social skills, or whatever interests the researchers, while the scanner monitors their brain activity. Then, the researchers have to determine whether the patches of activity picked up by the scanner are really related to the experiment or just a random fluctuation.

One common solution is a statistical method called cluster-wise inference, which considers both the strength of activity at spots throughout the brain and the size of those spots. Cluster-wise inference is built into the three popular software packages analyzed in the study—Statistical Parametric Mapping (SPM), the FMRIB Software Library (FSL), and Analysis of Functional NeuroImages (AFNI). All three let researchers set a few parameters to adjust how the analysis is done and how stringent it will be. The problem, Eklund and colleagues say, is that when one of these parameters—something called the cluster-defining threshold (CDT)—is set too low, the analysis is more likely to result in a false positive.
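To make that two-step logic concrete, here is a minimal Python sketch of cluster-wise thresholding. It is purely illustrative, not the SPM, FSL, or AFNI implementation; the function name, array sizes, and threshold values are all made up for the example.

```python
import numpy as np
from scipy import ndimage

def cluster_inference(stat_map, cdt_z=2.3, cluster_extent=50):
    """Two-step cluster-wise thresholding on a 3D statistic map (illustrative only)."""
    above_cdt = stat_map > cdt_z                   # step 1: voxel-wise cut at the CDT
    labels, n_clusters = ndimage.label(above_cdt)  # step 2: group survivors into clusters
    sizes = ndimage.sum(above_cdt, labels, index=range(1, n_clusters + 1))
    # keep only clusters whose extent reaches the (separately derived) cluster cutoff
    keep = [i + 1 for i, size in enumerate(sizes) if size >= cluster_extent]
    return np.isin(labels, keep)

# On pure noise, a liberal CDT (z > 2.3, roughly p < 0.01) leaves larger chance
# clusters than a stricter one (z > 3.1, roughly p < 0.001), which is why the
# CDT setting matters so much for the false positive rate.
noise = np.random.randn(40, 48, 40)
liberal = cluster_inference(noise, cdt_z=2.3, cluster_extent=20)
strict = cluster_inference(noise, cdt_z=3.1, cluster_extent=20)
print(liberal.sum(), strict.sum())
```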

In the new study, the researchers drew on several public databases that contain fMRI data collected while subjects were resting in the scanner, not engaged in any particular task. The researchers analyzed those data as if they were running a typical fMRI experiment, looking for regions of brain activation related to a task. The team simulated nearly 3 million fMRI experiments. Based on the statistical threshold they'd set, they expected to get a false positive result (that is, a positive hit for task-related activity even though there was no task) 5% of the time. Instead, depending on the software and the settings, up to 70% of the results were positive. They also identified a bug in the AFNI software package that had existed for 15 years and may have contributed to false positives. The team alerted the developers last year, and the bug has been fixed.
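The counting logic behind that validation can be mimicked in a few lines. The sketch below is not the authors' pipeline: it substitutes pure Gaussian noise for real resting-state scans and uses invented threshold values. But it shows the idea of repeating a null analysis many times and recording how often at least one cluster clears the cutoff; under a nominal 5% familywise error rate, about 5 in 100 such runs should produce a hit.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
n_experiments = 500          # stand-in for the millions of analyses in the paper
cdt_z = 2.3                  # liberal cluster-defining threshold
cluster_extent = 40          # illustrative extent cutoff
false_positives = 0

for _ in range(n_experiments):
    # stand-in for one group analysis on task-free (null) data; real fMRI noise
    # is spatially smooth, unlike this toy noise, so real cluster sizes differ
    stat_map = rng.standard_normal((30, 36, 30))
    above_cdt = stat_map > cdt_z
    labels, n_clusters = ndimage.label(above_cdt)
    if n_clusters:
        sizes = ndimage.sum(above_cdt, labels, index=range(1, n_clusters + 1))
        if sizes.max() >= cluster_extent:
            false_positives += 1   # at least one "significant" cluster in null data

print(f"empirical familywise false positive rate: {false_positives / n_experiments:.2f}")
```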

The pitfalls identified in the PNAS paper are nothing new, says Karl Friston, who directs the Wellcome Trust Centre for Neuroimaging at University College London and pioneered the development of statistical methods and software for analyzing fMRI data. Indeed, Friston and colleagues first identified these problems in a 1994 paper. In a technical response published on arXiv.org last month, Friston and Guillaume Flandin, also at the center, argue that the trouble can be avoided by choosing more conservative statistical thresholds. “Scientifically, I have nothing to add,” Friston wrote in an email. But the work has continued to generate discussion elsewhere, from an initial frenzy on Twitter when Eklund and colleagues published their preprint to last month's Human Brain Mapping meeting in Geneva, Switzerland.

Much of the software used in fMRI research hasn't been validated with actual data, as Eklund and colleagues have done, says Russell Poldrack, a neuroscientist at Stanford University in Palo Alto, California. “You'd hope that when we build a whole [scientific] field that the fundamental tools would have been validated with real data, not just theory and simulation,” he says. “It took 20 years to happen.”

In this case, it seems that both researchers and the software developers deserve some of the blame. “[CDT] is a parameter that the user must set, but they are presented with a default that they can accept, and most do,” says Nichols, who is on the development team for two of the software packages, SPM and FSL. Both packages are being modified to discourage users from picking the problematic settings.

What this means for the last 2 decades of fMRI research isn't entirely clear. “Just because the statistical significance of a particular finding is overestimated doesn't automatically mean the scientific findings of the paper are wrong,” Nichols says. In other words, even if 3500 papers used the problematic methods, the number whose conclusions are actually invalid is surely smaller.

It will largely be up to the original labs to reanalyze the work—if they choose to. “We would hope that researchers would be interested to know whether their previous claims stand, but realistically there is very little incentive (and lots of disincentives) to show that one's previous results are wrong,” Poldrack says.

The new work is yet another reason fMRI researchers should share their data more freely than they do, adds Poldrack, who in 2010 established the OpenfMRI repository, one of the sources for the data used in the new study. Data sharing not only enables researchers to try to replicate one another's findings, but also makes it possible to reevaluate old studies when a methodological flaw comes to light. “Without widespread data sharing,” Poldrack says, “it's basically impossible to know whether any particular finding is robust to these issues.”

Greg Miller is a science and technology journalist based in Portland, Oregon.
