Sources of data pollution: ill-posed problems

Janine Bijsterbosch Organizer
Washington University in St Louis
St Louis, MO 
United States
Ty Easley Co Organizer
Washington University in St Louis
St Louis, MO 
United States
Tuesday, Jun 25: 4:00 PM - 5:15 PM
Room: Grand Ballroom 104-105 
Brain-behavior neuroimaging research is at an unprecedented inflection point. With the increasing availability of sufficient data, expanding computing resources, and advances in computational approaches, a thoughtful discussion on data pollution is especially timely to educate existing and new members in the field. Recent shifts in the brain-behavior data landscape have altered benchmarks, standards, and constructs, bringing new ethical questions, skill needs, and research paradigms to the fore. This symposium is intended to disseminate important caveats, highlight opportunities for future research, and ultimately improve the insights gained from neuroimaging research.


1. Articulate the impact of sources of data pollution, namely: ill-defined phenotypic constructs, analytical flexibility, biased datasets, and disease heterogeneity.
2. Consider and address potential areas of data pollution in the design and execution of current and future research.

Target Audience

This symposium will be of interest to anyone performing brain-behavior modeling, regardless of the domain (clinical versus basic science) and methodology (ranging from linear regression models to deep learning).  


1. Establishing the joint reliability bottleneck for reproducible neuroscience

Biomarkers of behavior and mental health remain out of reach for cognitive and clinical neuroscience. Suboptimal reliability of functional magnetic resonance imaging (fMRI) has been cited as a primary culprit for the poor reproducibility of brain-based biomarker discovery, leading to unfeasibly large sample-size recommendations. In response, steps are being taken towards optimizing MRI reliability and increasing sample size, but this will not be enough. We show that optimizing biological measurement reliability and increasing sample sizes are necessary but insufficient steps for biomarker discovery; this focus overlooks the ‘other side of the equation,’ namely that human neuroscience studies need to optimize the reliability of behavioral assessments as well. Through a combination of simulation and empirical studies using neuroimaging data, we demonstrate that the joint reliability of both brain and behavioral measurements should be optimized to ensure biomarkers are reproducible and accurate. Even in the best-case scenario - that is, high-reliability neuroimaging measurements and large sample sizes - we show that behavioral data (e.g., symptoms, cognitive measurements, surveys, objective markers of behavior) often have test-retest reliability levels that are suboptimal for the discovery of reproducible brain-behavior associations and biomarkers. Developing new assessments continues to be critical for improving the validity, specificity, and reliability of our characterization of the brain, behavior, and mental health, but in the short term, other solutions can also be pursued. Specifically, we emphasize the power of using existing assessments in ways that optimize their reliability, for example by aggregating across repeated measurements or following established guidelines for improving behavioral data quality.
These improvements are becoming increasingly feasible with recent innovations in data acquisition (e.g., web- and smartphone-based administration, ecological momentary assessment, burst sampling, wearable devices, multimodal recordings). We demonstrate that these relatively simple changes to study design can improve behavioral measurement reliability and achieve better biomarker discovery at a fraction of the cost of enormous samples. Although the current study has been motivated by ongoing developments in neuroimaging, prioritizing reliable measurements of behavior can transform human neuroscience and broader scientific and clinical endeavors focused on the brain and behavior.
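The abstract does not give a formula for the joint reliability bottleneck. As a minimal illustrative sketch, assuming classical test theory, the observed brain-behavior correlation is the true correlation scaled by the square root of the product of the two reliabilities (Spearman's attenuation formula), and averaging k parallel repeated measurements raises reliability according to the Spearman-Brown formula. The function names and all numeric values below are hypothetical and are not taken from the presented study.

```python
import math


def attenuated_r(true_r, rel_brain, rel_behavior):
    """Expected observed correlation under classical test theory
    (Spearman's attenuation formula)."""
    return true_r * math.sqrt(rel_brain * rel_behavior)


def spearman_brown(rel_single, k):
    """Reliability of the mean of k parallel measurements
    (Spearman-Brown prophecy formula)."""
    return k * rel_single / (1 + (k - 1) * rel_single)


# Hypothetical values chosen only for illustration.
true_r = 0.30        # assumed true brain-behavior correlation
rel_brain = 0.80     # assumed reliability of the imaging measure
rel_behavior = 0.50  # assumed reliability of a single behavioral assessment

# Observed effect with a single behavioral assessment.
single = attenuated_r(true_r, rel_brain, rel_behavior)

# Observed effect after averaging four repeated behavioral assessments.
rel_avg = spearman_brown(rel_behavior, k=4)
averaged = attenuated_r(true_r, rel_brain, rel_avg)

print(f"single assessment:   observed r = {single:.3f}")
print(f"mean of 4 repeats:   reliability = {rel_avg:.2f}, observed r = {averaged:.3f}")
```

Under these assumed values, the behavioral reliability caps the observable effect even when the imaging reliability is high, and aggregating repeated behavioral measurements recovers part of the attenuated effect without changing the sample size.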


Aki Nikolaidis, Child Mind Institute New York, NY 
United States

2. How analytical flexibility affects neuroimaging results

The analysis of neuroimaging data is complex and flexible, consisting of many steps, with multiple possible choices at each step. New methods and approaches are continually being developed, increasing analytical flexibility, often without agreement on optimal choices. How does this analytical variability affect results in practice? This talk will describe the Neuroimaging Analysis Replication and Prediction Study (NARPS), in which seventy independent analysis teams tested the same pre-defined hypotheses with the same fMRI dataset, and the variability of their methods and results was examined. The implications of the findings and potential solutions will be discussed, along with related studies in other modalities and fields.


Rotem Botvinik-Nezer, The Hebrew University of Jerusalem Jerusalem, N.A. 

3. Demographic Sampling Inequalities Hinder Generalization of Neuroimaging Studies

Representative samples are crucial for generalizing scientific findings, yet the characteristics of study samples are often overlooked. Neuroimaging studies, which adopt participant recruitment methods from psychological science, may face the same challenges seen in psychological research, where samples are dominated by young, educated individuals from economically affluent, industrialized, and urban areas. Our meta-research, in conjunction with that of others, highlights three significant issues in neuroimaging study samples: under-reporting or non-reporting of many sample characteristics in published studies, disparities in samples across countries, and neglect of heterogeneity within countries. This talk will also explore the underlying reasons for these patterns, discuss the implications of these findings, and propose potential solutions.


Chuan-Peng Hu, Nanjing Normal University Nanjing, N.A. 

4. The many sources of disease heterogeneity

While phenotype constructs, analytical choices, and sample bias do add to the apparent heterogeneity of a disorder, the true variability of the disorder is an important hurdle to overcome in understanding its etiology. Efforts to parse the true variability in mental health disorders (such as subtyping) have been stymied by how wide-ranging and complex the sources of heterogeneity in these disorders are. These include sources of heterogeneity from clinical, neurobiological, and genetic domains, among many others. The relationships between these sources of heterogeneity are not necessarily one-to-one; many-to-one and possibly many-to-many relationships have been suggested. My work has shown a many-to-one relationship between neurobiological and clinical sources of heterogeneity in depression. However, I will go beyond my own work to give a comprehensive overview of the many sources of heterogeneity in mental health disorders and touch on possible ways for future work to overcome the hurdles caused by this heterogeneity.


Kayla Hannon, Washington University in St Louis St Louis, MO 
United States