Fast Forward Selection in the Presence of Missingness with Application to UKB Imaging Confounds

Poster No:

1512 

Submission Type:

Abstract Submission 

Authors:

Lav Radosavljevic¹, Fidel Alfaro Almagro², Thomas Maullin-Sapey³, Stephen Smith¹, Thomas Nichols¹

Institutions:

¹University of Oxford, Oxford, Oxfordshire, ²WiN FMRIB - University of Oxford, Oxford, State/Province, ³University of Bristol, Bristol, Bristol

First Author:

Lav Radosavljevic  
University of Oxford
Oxford, Oxfordshire

Co-Author(s):

Fidel Alfaro Almagro  
WiN FMRIB - University of Oxford
Oxford, State/Province
Thomas Maullin-Sapey, Dr.  
University of Bristol
Bristol, Bristol
Stephen Smith  
University of Oxford
Oxford, Oxfordshire
Thomas Nichols, PhD  
University of Oxford
Oxford, Oxfordshire

Introduction:

The UK Biobank (UKB) imaging pipeline [1] produces more than 4000 Imaging Derived Phenotypes (IDPs), along with over 10,000 quality control (QC) and diagnostic variables that may confound associations between IDPs and other outcome variables (non-IDPs). In previous work [2] we systematically identified ≈1000 confounds that explained non-trivial variance in IDP/non-IDP associations. That approach selects confounds one by one, retaining those that explain a high proportion of variance across all IDPs and/or a very high proportion of variance in individual IDPs [2]. This has the disadvantage that highly correlated confounds can be selected together, leading to redundancy in the confound set.
To avoid this redundancy, we propose a forward stepwise selection procedure that, at each step, greedily selects the confound giving the largest increase in R² (proportion of variance explained) over IDPs, ΔR². However, because missingness varies across IDPs, stepwise regression over thousands of IDPs is computationally burdensome: a unique linear model must be fitted for each IDP. Using non-stochastic single imputation (such as mean or one-shot imputation) as a pre-processing step solves the problem of computational time, but biases the estimates of R² and therefore the selection process.
Supporting Image: Confounds_Forward_Selection.png
   ·Illustration of the sparse linear relationship between confounds and IDPs that is assumed in the selection process.
 

Methods:

We propose a fast stepwise forward selection approach that uses a corrected estimate of R² calculated on mean-imputed IDP data. Mean imputation eliminates the need to fit a unique linear model for each IDP, greatly speeding up computation, and the corrected estimate gives results similar to the R² estimates obtained using Complete Cases (CC).
Under a missing completely at random (MCAR) assumption, it can be shown that mean imputation deflates R² by a factor of 1 − p_miss, where p_miss is the rate of missingness. We thus propose the following correction:

$$R^2_{\mathrm{cor}} = \frac{R^2_{\mathrm{imp}}}{1 - p_{\mathrm{miss}}}.$$
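
The deflation factor can be motivated with a short large-sample argument. The sketch below is not taken from the abstract itself. It assumes a single IDP $y$, fully observed and centred confounds $X$, and an observation indicator $m_i \in \{0, 1\}$ that is independent of $(X, y)$ with $P(m_i = 1) = 1 - p_{\mathrm{miss}}$. All symbols other than $R^2_{\mathrm{imp}}$, $R^2_{\mathrm{cor}}$ and $p_{\mathrm{miss}}$ are introduced here for illustration only.

```latex
\begin{align*}
\tilde{y}_i &= m_i\,(y_i - \bar{y}_{\mathrm{obs}})
  && \text{(mean-imputed, centred IDP)} \\
X^\top \tilde{y} &\approx (1 - p_{\mathrm{miss}})\, X^\top (y - \bar{y})
  && \text{(MCAR, large } n\text{)} \\
\hat{\beta}_{\mathrm{imp}} = (X^\top X)^{-1} X^\top \tilde{y}
  &\approx (1 - p_{\mathrm{miss}})\, \hat{\beta} \\
\mathrm{SS}_{\mathrm{reg,imp}} = \hat{\beta}_{\mathrm{imp}}^\top X^\top X\, \hat{\beta}_{\mathrm{imp}}
  &\approx (1 - p_{\mathrm{miss}})^2\, \mathrm{SS}_{\mathrm{reg}} \\
\mathrm{SS}_{\mathrm{tot,imp}} = \textstyle\sum_i m_i\,(y_i - \bar{y}_{\mathrm{obs}})^2
  &\approx (1 - p_{\mathrm{miss}})\, \mathrm{SS}_{\mathrm{tot}} \\
R^2_{\mathrm{imp}} = \frac{\mathrm{SS}_{\mathrm{reg,imp}}}{\mathrm{SS}_{\mathrm{tot,imp}}}
  &\approx (1 - p_{\mathrm{miss}})\, R^2
\end{align*}
```

The explained sum of squares shrinks quadratically in $1 - p_{\mathrm{miss}}$ while the total sum of squares shrinks only linearly, so dividing $R^2_{\mathrm{imp}}$ by $1 - p_{\mathrm{miss}}$ recovers $R^2$ to first order under these assumptions.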

The proposed forward selection algorithm is fast: each step is implemented as a single matrix operation. In contrast, the equivalent stepwise selection approach using CC requires one continuously updated linear model per IDP, and therefore about 4000 separate pseudo-inversions of the model design matrix at each step. Since matrix inversion is the computational bottleneck of this problem, we can expect our approach to be up to 4000 times faster than the naive implementation using CC.
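
To make the "single matrix operation per step" idea concrete, here is a minimal NumPy sketch. The abstract does not specify the exact operation, so this is one possible reading rather than the authors' implementation: it assumes candidates are scored by the corrected ΔR² averaged over IDPs, and that already-selected confounds are projected out of both the candidates and the IDPs after each step. The function name `forward_select`, its arguments, and the synthetic data are all hypothetical.

```python
import numpy as np

def forward_select(X, Y, n_steps, tol=1e-10):
    """Greedy forward selection on mean-imputed IDPs using corrected delta-R^2.

    X: (n, p) fully observed confounds. Y: (n, q) IDPs, NaN where missing.
    Returns the selected column indices of X, in order of selection.
    Illustrative sketch only; assumes no IDP is entirely missing.
    """
    observed = ~np.isnan(Y)
    p_miss = 1.0 - observed.mean(axis=0)           # missingness rate per IDP
    # Mean imputation followed by centring: missing entries become exactly 0.
    Y_res = np.where(observed, Y - np.nanmean(Y, axis=0), 0.0)
    ss_tot = (Y_res ** 2).sum(axis=0)              # total sum of squares per IDP
    X_res = X - X.mean(axis=0)
    norms0 = (X_res ** 2).sum(axis=0)
    selected = []
    for _ in range(n_steps):
        norms = (X_res ** 2).sum(axis=0)
        usable = norms > tol * norms0              # skip (near-)collinear candidates
        usable[selected] = False
        cross = X_res.T @ Y_res                    # (p, q): one matrix product per step
        safe_norms = np.where(usable, norms, 1.0)
        delta_r2_imp = cross ** 2 / (safe_norms[:, None] * ss_tot[None, :])
        delta_r2_cor = delta_r2_imp / (1.0 - p_miss)[None, :]
        score = np.where(usable, delta_r2_cor.mean(axis=1), -np.inf)
        best = int(np.argmax(score))
        selected.append(best)
        # Project the chosen confound out of the candidates and the IDPs.
        u = X_res[:, best] / np.sqrt(norms[best])
        X_res = X_res - np.outer(u, u @ X_res)
        Y_res = Y_res - np.outer(u, u @ Y_res)
    return selected

# Synthetic example: confounds 3 and 11 drive all IDPs; column 4 nearly duplicates 3.
rng = np.random.default_rng(0)
n, p, q = 2000, 20, 30
X = rng.standard_normal((n, p))
X[:, 4] = X[:, 3] + 0.05 * rng.standard_normal(n)
Y = 1.0 * X[:, [3]] + 0.7 * X[:, [11]] + rng.standard_normal((n, q))
rates = rng.uniform(0.0, 0.5, size=q)              # a different MCAR rate per IDP
Y[rng.random((n, q)) < rates] = np.nan
selected = forward_select(X, Y, n_steps=2)
print(selected)
```

Because every IDP shares the same imputed design, the per-step cost is dominated by one `(p, n) @ (n, q)` product instead of one model fit per IDP. In this toy setup, once one of the two near-duplicate columns has been chosen, the other has almost nothing left to explain, which is the redundancy-avoiding behaviour that motivates forward selection.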

Results:

To verify that our post-imputation correction gives a sensible approximation of the variance in IDPs explained by UKB imaging confounds, we compare the estimates of R² obtained using CC with the estimates obtained using our method. The figure below shows, for each IDP, the missingness rate plotted against the relative difference (in percent) between the two estimates. For the vast majority of IDPs, the two estimates are within 5% of each other, which is sufficiently close for our purposes. A few IDPs show larger deviations, most probably caused by violations of the MCAR assumption.
Supporting Image: scatter_R2.png
   ·Scatterplot of the relative difference between the two estimates, against the percentage of missing data. We see good agreement at different levels of missingness.
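
The same kind of comparison can be tried on synthetic data. The sketch below is not the UKB analysis and does not reproduce the figure. It simulates a single IDP under MCAR at a few missingness rates and computes the relative difference between the CC estimate and the corrected mean-imputation estimate. The helper names `r2_complete_cases` and `r2_corrected`, the effect sizes, and the sample size are all assumptions made for illustration.

```python
import numpy as np

def r2_complete_cases(X, y):
    """R^2 from an ordinary least-squares fit on rows where y is observed."""
    keep = ~np.isnan(y)
    Xc = X[keep] - X[keep].mean(axis=0)
    yc = y[keep] - y[keep].mean()
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)

def r2_corrected(X, y):
    """R^2 on mean-imputed y, divided by (1 - p_miss)."""
    observed = ~np.isnan(y)
    p_miss = 1.0 - observed.mean()
    # Mean imputation then centring: missing entries become exactly 0.
    y_imp = np.where(observed, y - np.nanmean(y), 0.0)
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, y_imp, rcond=None)
    resid = y_imp - Xc @ beta
    r2_imp = 1.0 - (resid @ resid) / (y_imp @ y_imp)
    return r2_imp / (1.0 - p_miss)

rng = np.random.default_rng(1)
n = 200_000
X = rng.standard_normal((n, 3))
y_full = X @ np.array([1.0, 0.5, 0.5]) + rng.standard_normal(n)

rel_diff_pct = {}
for rate in (0.05, 0.2, 0.4):
    y = y_full.copy()
    y[rng.random(n) < rate] = np.nan               # MCAR: mask ignores X and y
    cc = r2_complete_cases(X, y)
    cor = r2_corrected(X, y)
    rel_diff_pct[rate] = 100.0 * (cor - cc) / cc
    print(f"p_miss={rate:.2f}  CC={cc:.4f}  corrected={cor:.4f}  "
          f"diff={rel_diff_pct[rate]:+.2f}%")
```

Under MCAR the two estimates should agree up to sampling noise. Making the mask depend on `y` or on `X` is a simple way to explore how violations of the MCAR assumption affect the correction.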
 

Conclusions:

Our proposed method is sufficiently fast to be used for forward selection on real IDP and confound data. It has the advantage of being neutral to the rate of missingness in each IDP, meaning that it does not systematically inflate or deflate the estimates of R². This is not the case for uncorrected estimates of R² computed on imputed IDP data, where (depending on the imputation method) higher missingness leads to inflation or deflation of the R² estimate. One major limitation is the MCAR assumption, which in general is unlikely to hold for UKB data. Future work on this topic involves finding computationally fast corrections for R² that are robust to data that are Missing at Random (MAR) and possibly Missing Not at Random (MNAR).

Modeling and Analysis Methods:

Methods Development 1
Motion Correction and Preprocessing 2
Multivariate Approaches

Keywords:

Data analysis
Multivariate
Statistical Methods

1|2 Indicates the priority used for review

Abstract Information

The Open Science Special Interest Group (OSSIG) is introducing a reproducibility challenge for OHBM 2025. This new initiative aims to enhance the reproducibility of scientific results and foster collaborations between labs. Teams will consist of a “source” party and a “reproducing” party, and will be evaluated on the success of their replication, the openness of the source work, and additional deliverables. Propose your OHBM abstract(s) as source work for future OHBM meetings by selecting one of the following options:

I am submitting this abstract as an original work to be reproduced. I am available to be the “source party” in an upcoming team and consent to have this work listed on the OSSIG website. I agree to be contacted by OSSIG regarding the challenge and may share data used in this abstract with another team.

Please indicate below if your study was a "resting state" or "task-activation" study.

Other

Healthy subjects only or patients (note that patient studies may also involve healthy subjects):

Healthy subjects

Was this research conducted in the United States?

No

Were any human subjects research approved by the relevant Institutional Review Board or ethics panel? NOTE: Any human subjects studies without IRB approval will be automatically rejected.

Yes

Were any animal research approved by the relevant IACUC or other animal research panel? NOTE: Any animal studies without IACUC approval will be automatically rejected.

Not applicable

Please indicate which methods were used in your research:

Functional MRI
Structural MRI
Diffusion MRI
Computational modeling

For human MRI, what field strength scanner do you use?

3.0T

Provide references using APA citation style.

[1] Alfaro-Almagro, F., et al. (2018). Image processing and Quality Control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage, 166, 400–424. https://doi.org/10.1016/j.neuroimage.2017.10.034

[2] Alfaro-Almagro, F., et al. (2021). Confound modelling in UK Biobank brain imaging. NeuroImage, 224, 117002. https://doi.org/10.1016/j.neuroimage.2020.117002
