1. Identifying sources of population covariation in large datasets to protect against model bias

Sarah Yip Presenter
Yale University
New Haven, CT 
United States
 
Monday, Jun 24: 9:00 AM - 10:15 AM
Symposium 
COEX 
Room: Grand Ballroom 101-102 
Large-scale data collection initiatives, such as the UK Biobank (UKB) and the Adolescent Brain Cognitive Development (ABCD) Study, are poised to provide unprecedented insights into brain development in the context of both health and disease. Given the breadth of ongoing large-scale data collection, both in terms of the number of variables collected and the number of individuals studied, such datasets are also natural candidates for artificial intelligence (AI) models. However, because AI models can embed biases arising from under-representation of diverse populations in training data, significant caution should be taken when applying such approaches to neuroimaging and other forms of data. To illustrate this point, we present recent data demonstrating that basic demographic and social determinants of inequity were the primary drivers of day-to-day experiences of hardship during the COVID-19 pandemic. Specifically, using a multivariate pattern-learning approach applied to >17,000 variables collected from 9,267 families in ABCD to identify baseline predictors of pandemic experiences, as defined by both child and parent report, we find that non-White and/or Spanish-speaking families had decreased resources and an increased likelihood of financial worry and food insecurity. In contrast, families with higher pre-pandemic income and a parent with a postgraduate degree experienced reduced COVID-19-related impact.
More recently, we leveraged a deep learning framework (a conditional variational autoencoder) in conjunction with the entirety of ABCD behavioral data (n = 11,875; p = 8,902) to identify sources of interindividual differences. We find distinct dimensions of diversity driven by socioeconomic status and other environmental factors that can be broadly categorized as social determinants of health. One underlying source of variation reflects material poverty and its health correlates, while another captures densely populated living conditions and their disproportionate effects across ethnic groups. Other key stratifications capture privilege, both via measures of education and income tied to healthy home environments and via European ancestry and neighborhoods that are desirable in terms of location and air quality. Cognitive ability, specifically executive function, was related to variation across dimensions. By beginning to untangle the intricate web of such complex associations, we hope that our findings can guide future studies toward relevant covarying diversity measures to include in brain-behavior modeling efforts when investigating a phenotype of interest. Collectively, these results demonstrate the importance of considering basic diversity factors in data-driven analyses of large datasets. Coupled with other work demonstrating that basic individual difference factors may bias brain-behavior models, they further suggest that, if not explicitly considered, such diversity factors will likely have hidden effects within AI models of neuroimaging data, opening up the potential for significant bias.

1. Yip SW, Jordan A, Kohler RJ, Holmes A, and Bzdok D. Multivariate, Transgenerational Associations of the COVID-19 Pandemic Across Minoritized and Marginalized Communities. JAMA Psychiatry, 2022. 79(4): 350-358. PMC8829750.
2. Greene AS, Shen X, Noble S, Horien C, Hahn CA, Arora J, Tokoglu F, Spann MN, Carrión CI, Barron DS, Sanacora G, Srihari VH, Woods SW, Scheinost D, and Constable RT. Brain–phenotype models fail for individuals who defy sample stereotypes. Nature, 2022. 609(7925): 109-118.