The effects of data leakage on connectome-based machine learning models

Presented During:

Tuesday, June 25, 2024: 12:00 PM - 1:15 PM
COEX  
Room: ASEM Ballroom 202  

Poster No:

1463 

Submission Type:

Abstract Submission 

Authors:

Matthew Rosenblatt¹, Link Tejavibulya¹, Rongtao Jiang¹, Stephanie Noble², Dustin Scheinost¹

Institutions:

¹Yale University, New Haven, CT; ²Northeastern University, Boston, MA

Introduction:

Understanding individual differences in brain-behavior relationships is a central goal of neuroscience. As such, machine learning approaches that use neuroimaging data, such as functional connectivity, have grown increasingly popular for predicting numerous phenotypes. The reproducibility of such studies is hindered by data leakage, in which information about the test data is introduced into the model during training (1). Although leakage is never appropriate, its pervasiveness in neuroimaging makes quantifying its effects important. Here, we evaluate the effects of leakage on functional connectome-based machine learning models in four large datasets for the prediction of three phenotypes.

Methods:

We obtained resting-state fMRI data from the Adolescent Brain Cognitive Development (ABCD) Study (2) (N=7822-7969), the Healthy Brain Network (HBN) Dataset (3) (N=1024-1201), the Human Connectome Project Development (HCPD) Dataset (4) (N=424-605), and the Philadelphia Neurodevelopmental Cohort (PNC) Dataset (5,6) (N=1119-1126). Resting-state functional connectomes were formed using the Shen 268 atlas (7). Throughout this work, we predicted age, attention problems, and matrix reasoning from functional connectivity using ridge regression (8) with 5-fold cross-validation.
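As a point of reference, the following minimal sketch (not the authors' released code; the alpha value, number of selected features, and simulated data shapes are illustrative assumptions) shows how feature selection and ridge regression can be kept inside the cross-validation folds using scikit-learn (8):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_predict
    from sklearn.pipeline import Pipeline

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 2000))  # stand-in for vectorized connectomes
                                          # (a Shen 268-node matrix has 35,778 edges)
    y = rng.standard_normal(200)          # stand-in phenotype (e.g., age)

    # Both steps are re-fit on each training fold, so no test-set
    # information leaks into feature selection or model fitting.
    pipe = Pipeline([
        ("select", SelectKBest(f_regression, k=500)),
        ("ridge", Ridge(alpha=1.0)),
    ])
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    y_pred = cross_val_predict(pipe, X, y, cv=cv)

Because the selector lives inside the Pipeline, cross_val_predict re-fits it on each training fold before scoring the corresponding held-out fold.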

We evaluated a gold standard model, which performed covariate regression, site correction, and feature selection within the cross-validation scheme and used splits that accounted for family structure. We also evaluated four other categories of models. First, several alternative analysis choices that do not contain leakage were included as reference points, such as omitting site correction, covariate regression, or both. Second, feature leakage involves selecting features in the combined training/test data instead of only in the training data. Third, covariate-related forms of leakage in this study included correcting for site differences and performing covariate regression in the combined training and test data (i.e., outside the cross-validation folds). Fourth, subject-level leakage was evaluated in two forms: family leakage and repeated-subject leakage. For family leakage, the family structure of the data was ignored, so one family member could fall in the training set and another in the test set. For repeated-subject leakage, a percentage of participants was randomly duplicated in the dataset, mimicking the possible mishandling of datasets with repeated measurements.
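To make the leaky variants concrete, the sketch below contrasts leaky feature selection with its fold-wise counterpart and shows one way to respect family structure when splitting. This is an illustration, not the authors' implementation: the family IDs and parameters are hypothetical, and the same train-only principle applies to site correction and covariate regression.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GroupKFold
    from sklearn.pipeline import Pipeline

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 2000))
    y = rng.standard_normal(200)
    family = rng.integers(0, 80, size=200)  # hypothetical family IDs

    # Feature leakage: the selector sees ALL subjects (train and test)
    # before cross-validation, so test data shape the chosen features.
    X_leaky = SelectKBest(f_regression, k=500).fit_transform(X, y)

    # Leakage-free alternative: selection is re-fit within each training fold.
    pipe = Pipeline([
        ("select", SelectKBest(f_regression, k=500)),
        ("ridge", Ridge(alpha=1.0)),
    ])

    # Family leakage is avoided with group-aware splits, which keep all
    # members of a family on the same side of every split.
    cv = GroupKFold(n_splits=5)
    for train_idx, test_idx in cv.split(X, y, groups=family):
        pipe.fit(X[train_idx], y[train_idx])
        preds = pipe.predict(X[test_idx])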

Results:

We first analyzed leakage in HCPD and found that leaky feature selection and repeated-subject leakage (20%) most inflated performance, whereas leaky covariate regression deflated performance (Figure 1). Other forms of leakage, including family leakage and leaky site correction, had little to no effect on performance (Figure 1). Results were similar across all datasets, where leaky feature selection (Δr = 0.03 to 0.52, Δq² = 0.01 to 0.47) and repeated-subject leakage (20%) (Δr = 0.06 to 0.29, Δq² = 0.03 to 0.24) led to the greatest performance inflation (Figure 2). Notably, weaker baseline models were more affected by feature leakage. Leaky covariate regression was the only form of leakage that consistently deflated performance (Δr = -0.09 to 0.00, Δq² = -0.17 to 0.00). Family leakage (Δr = 0.00 to 0.02, Δq² = 0.00) and leaky site correction (Δr = -0.01 to 0.00, Δq² = -0.01 to 0.01) had little effect. We repeated the analyses with support vector regression (8) and connectome-based predictive modeling (9) and observed similar results.
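For completeness, one common way to compute these metrics from cross-validated predictions is sketched below. The exact definitions used in the abstract are an assumption here: r is taken as the Pearson correlation between observed and predicted scores, and q² as the cross-validated coefficient of determination.

    import numpy as np

    def score_predictions(y_true, y_pred):
        # Pearson correlation between observed and predicted scores.
        r = np.corrcoef(y_true, y_pred)[0, 1]
        # Cross-validated coefficient of determination (q2): 1 minus the
        # ratio of residual to total sum of squares over held-out predictions.
        q2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
        return r, q2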
Supporting Image: fig1.png (Figure 1)
Supporting Image: fig2.png (Figure 2)

Conclusions:

Concerns about reproducibility in machine learning can be partially attributed to leakage (1). Some forms of leakage greatly affected the results, whereas others had little effect on predictions, meaning that published results containing these forms of leakage likely remain valid. Since the effects of leakage vary greatly, the best practice remains to avoid data leakage altogether through the careful development and sharing of code, alternative validation strategies (e.g., a lock-box approach (10) or external validation), and model information sheets (1).

Modeling and Analysis Methods:

Classification and Predictive Modeling 1
Connectivity (e.g., functional, effective, structural) 2
Methods Development
Multivariate Approaches

Keywords:

FUNCTIONAL MRI
Machine Learning
Multivariate
Statistical Methods
Other - reproducibility; data leakage

1|2 indicates the priority used for review

References:

1. Kapoor, S. & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4, 100804.
2. Casey, B. J. et al. (2018). The Adolescent Brain Cognitive Development (ABCD) study: Imaging acquisition across 21 sites. Dev. Cogn. Neurosci. 32, 43–54.
3. Alexander, L. M. et al. (2017). An open resource for transdiagnostic research in pediatric mental health and learning disorders. Sci. Data 4, 170181.
4. Somerville, L. H. et al. (2018). The Lifespan Human Connectome Project in Development: A large-scale study of brain connectivity development in 5-21 year olds. Neuroimage 183, 456–468.
5. Satterthwaite, T. D. et al. (2014). Neuroimaging of the Philadelphia neurodevelopmental cohort. Neuroimage 86, 544–553.
6. Satterthwaite, T. D. et al. (2016). The Philadelphia Neurodevelopmental Cohort: A publicly available resource for the study of normal and abnormal brain development in youth. Neuroimage 124, 1115–1119.
7. Shen, X., Tokoglu, F., Papademetris, X. & Constable, R. T. (2013). Groupwise whole-brain parcellation from resting-state fMRI data for network node identification. Neuroimage 82, 403–415.
8. Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
9. Shen, X. et al. (2017). Using connectome-based predictive modeling to predict individual behavior from brain connectivity. Nat. Protoc. 12, 506–518.
10. Hosseini, M. et al. (2020). I tried a bunch of things: The dangers of unexpected overfitting in classification of brain data. Neurosci. Biobehav. Rev. 119, 456–467.