Evaluating AI-powered Assessments of Neuroimaging Research

Poster No:

1586 

Submission Type:

Late-Breaking Abstract Submission 

Authors:

Brock Pluimer1, Apeksha Sridhar1, Ishtiaq Mawla1, Sarah Hennessy2, Eric Ichesco3, Rishab Iyer4, Anson Kairys3, Roshni Lulla5, Helen Wu5, Richard Harris1

Institutions:

1University of California, Irvine, Irvine, CA, 2University of Arizona, Tucson, AZ, 3University of Michigan, Ann Arbor, MI, 4Princeton University, Princeton, NJ, 5University of Southern California, Los Angeles, CA

First Author:

Brock Pluimer  
University of California, Irvine
Irvine, CA

Co-Author(s):

Apeksha Sridhar  
University of California, Irvine
Irvine, CA
Ishtiaq Mawla  
University of California, Irvine
Irvine, CA
Sarah Hennessy, PhD  
University of Arizona
Tucson, AZ
Eric Ichesco  
University of Michigan
Ann Arbor, MI
Rishab Iyer  
Princeton University
Princeton, NJ
Anson Kairys  
University of Michigan
Ann Arbor, MI
Roshni Lulla  
University of Southern California
Los Angeles, CA
Helen Wu  
University of Southern California
Los Angeles, CA
Richard Harris  
University of California, Irvine
Irvine, CA

Late Breaking Reviewer(s):

Giulia Baracchini  
The University of Sydney
Sydney, New South Wales
Naomi Gaggi, PhD  
New York University Grossman School of Medicine
Rockaway Park, NY
Wei Zhang  
Washington University in St. Louis
Saint Louis, MO

Introduction:

Artificial intelligence tools are transforming academic research workflows, with platforms like Elicit emerging as automated assistants that enhance the efficiency of systematic review and quality assessment (Whitfield et al., 2023). While promising for automating literature analysis, full integration of AI tools into academic settings requires rigorous evaluation against human expert performance (Bolanos et al., 2024). Our group recently assessed how well Elicit could appraise 249 clinical acupuncture papers using guidelines from the Oregon CONSORT STRICTA Instrument (OCSI) and found a strong correlation (ICC = 0.91) between Elicit scores and expert human scoring. Building on this work, we evaluated Elicit's performance in appraising human fMRI studies using a rubric derived from the OHBM COBIDAS guidelines (Nichols et al., 2017).

Methods:

We identified 13,635 candidate papers through PubMed, targeting human fMRI studies published from 2016 to 2022. This timeframe captures papers published after the COBIDAS guidelines' initial conception in October 2015 while allowing sufficient time for citation accumulation. Our search string focused on experimental fMRI studies while excluding reviews and meta-analyses: "(fMRI OR functional MRI OR functional magnetic resonance imaging) AND (experiment OR task-based OR cognitive task) AND (humans[MeSH]) NOT review[Publication Type] NOT meta-analysis[Publication Type] NOT systematic review[Publication Type]". We then implemented stratified sampling, dividing papers into quartiles by citation count and randomly sampling 50 papers, allocated proportionally by publication year within each quartile, to maintain temporal representation. Fifteen papers were excluded during screening because they were structural MRI, simulation, theoretical, or hardware papers, leaving 35 fMRI articles for human scoring. We recruited eight scorers with varying levels of expertise to score these papers: three PhD students and five experienced researchers (four PhD holders and one career imaging professional). The same scoring guidelines were given to both Elicit and the human scorers.
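A minimal sketch of the citation-stratified sampling step is shown below. It assumes a pandas DataFrame of the PubMed candidates with hypothetical column names (pmid, year, citation_count); the rounding-based proportional allocation is illustrative, not the exact procedure used.

    import pandas as pd

    # Hypothetical export of the 13,635 PubMed candidates (2016-2022)
    papers = pd.read_csv("pubmed_fmri_candidates.csv")  # columns: pmid, year, citation_count

    # Split candidates into citation quartiles (rank-based to tolerate tied counts)
    papers["quartile"] = pd.qcut(
        papers["citation_count"].rank(method="first"), q=4, labels=["Q1", "Q2", "Q3", "Q4"]
    )

    # Draw roughly 50 papers in total, allocated proportionally to each (quartile, year) stratum
    def sample_stratum(stratum, n_total=50, n_candidates=len(papers)):
        n = round(len(stratum) / n_candidates * n_total)
        return stratum.sample(n=min(n, len(stratum)), random_state=0)

    sample = (
        papers.groupby(["quartile", "year"], group_keys=False, observed=True)
              .apply(sample_stratum)
    )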

Results:

We compared Elicit's evaluations against human expert scoring for 35 fMRI papers. The overall correlation between Elicit and human scores was modest (r = 0.278), with a mean absolute difference of 8.32 points between scoring methods. Both scoring approaches yielded similar overall average scores (Elicit: 68.82; human: 68.39), with the human scores showing greater variability (Elicit SD = 5.77; human SD = 10.43). Given the relatively low correlation between scoring methods, we conducted a subgroup analysis comparing PhD students against experienced researchers. The correlation between Elicit and experienced researchers was considerably stronger (r = 0.491) than that with PhD student evaluators (r = 0.158).
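The agreement statistics above can be reproduced from paired per-paper totals along the following lines (illustrative sketch; the input file and column names are hypothetical, and pandas/SciPy are assumed):

    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    # Hypothetical table: one row per paper, with the Elicit total and the mean human total
    scores = pd.read_csv("paper_scores.csv")            # columns: elicit_total, human_mean
    elicit = scores["elicit_total"].to_numpy()
    human = scores["human_mean"].to_numpy()

    r, p = pearsonr(elicit, human)                      # overall Elicit-human correlation
    mad = np.mean(np.abs(elicit - human))               # mean absolute difference
    print(f"r = {r:.3f} (p = {p:.3f}); mean |Elicit - human| = {mad:.2f}")
    print(f"Elicit mean = {elicit.mean():.2f} (SD = {elicit.std(ddof=1):.2f}); "
          f"human mean = {human.mean():.2f} (SD = {human.std(ddof=1):.2f})")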

Conclusions:

Our findings reveal a modest correlation between Elicit AI and human expert evaluations, with stronger agreement observed among experienced researchers than among PhD students. Compared with the strong correlation observed in our acupuncture clinical trial study, we attribute the greater scoring variability here to the methodological diversity of neuroimaging studies, which presents additional challenges for standardized assessment. Moving forward, we aim to 1) refine our COBIDAS-derived rubric based on expert feedback and 2) implement a duplicate-scoring approach to reduce variability caused by individual differences. Collectively, these results suggest that while AI-assisted quality assessment of the neuroimaging literature is feasible, significant improvements in rubric design and AI training are necessary before such tools can reliably supplement human expertise in research workflows.

Modeling and Analysis Methods:

Activation (eg. BOLD task-fMRI) 2
fMRI Connectivity and Network Modeling
Methods Development 1

Keywords:

FUNCTIONAL MRI
Machine Learning
Other - Artificial Intelligence

1|2 Indicates the priority used for review

Abstract Information

By submitting your proposal, you grant permission for the Organization for Human Brain Mapping (OHBM) to distribute your work in any format, including video, audio, print, and electronic text through OHBM OnDemand, social media channels, the OHBM website, or other electronic publications and media.

I accept

The Open Science Special Interest Group (OSSIG) is introducing a reproducibility challenge for OHBM 2025. This new initiative aims to enhance the reproducibility of scientific results and foster collaborations between labs. Teams will consist of a “source” party and a “reproducing” party, and will be evaluated on the success of their replication, the openness of the source work, and additional deliverables. Propose your OHBM abstract(s) as source work for future OHBM meetings by selecting one of the following options:

I do not want to participate in the reproducibility challenge.

Please indicate below if your study was a "resting state" or "task-activation" study.

Resting state
Task-activation
Other

Healthy subjects only or patients (note that patient studies may also involve healthy subjects):

Patients

Was this research conducted in the United States?

Yes

Are you Institutional Review Board (IRB) certified? Please note: Failure to have IRB approval, if applicable, will lead to automatic rejection of the abstract.

Not applicable

Was any human subjects research approved by the relevant Institutional Review Board or ethics panel? NOTE: Any human subjects studies without IRB approval will be automatically rejected.

Not applicable

Was any animal research approved by the relevant IACUC or other animal research panel? NOTE: Any animal studies without IACUC approval will be automatically rejected.

Not applicable

Please indicate which methods were used in your research:

Functional MRI
Other, Please specify  -   Artificial intelligence scoring of fMRI papers

Provide references using APA citation style.

Bolanos, F., et al. (2024). Artificial intelligence for literature reviews: Opportunities and challenges. arXiv.

Nichols, T., Das, S., Eickhoff, S., et al. (2017). Best practices in data analysis and sharing in neuroimaging using MRI. Nature Neuroscience, 20, 299–303.

Whitfield, S., et al. (2023). Elicit: AI literature review research assistant. Public Services Quarterly.

UNESCO Institute of Statistics and World Bank Waiver Form

I attest that I currently live, work, or study in a country on the UNESCO Institute of Statistics and World Bank List of Low and Middle Income Countries list provided.

No