Brain-score correlates with large language model performance throughout pre-training

Poster No:

806 

Submission Type:

Abstract Submission 

Authors:

Lang Qin1, Zhiang Yan1, Jianbo Xu2, Bingjiang Lyu3

Institutions:

1 Peking University, Beijing, Beijing, 2 01.AI, Beijing, Beijing, 3 Changping Laboratory, Beijing, Beijing

First Author:

Lang Qin  
Peking University
Beijing, Beijing

Co-Author(s):

Zhiang Yan  
Peking University
Beijing, Beijing
Jianbo Xu  
01.AI
Beijing, Beijing
Bingjiang Lyu  
Changping Laboratory
Beijing, Beijing

Introduction:

Large language models (LLMs) have demonstrated remarkable performance across diverse tasks, yet understanding their alignment with human brain activity remains an ongoing challenge. Previous studies explored the correlation between neural activity and LLM hidden states, known as brain-score, by comparing different LLMs with humans on the same language input (e.g., Schrimpf et al., 2021). However, little is known about the evolution of brain-score during LLM pre-training and its relationship to task performance. Here we address this gap by analysing the dynamic changes in brain-score throughout the pre-training of a 6B LLM. Specifically, we regressed model hidden states across layers from a series of checkpoints onto spatio-temporally resolved brain activity recorded via magnetoencephalography (MEG) during speech comprehension, and correlated the brain-score with task performance across pre-training checkpoints.

Methods:

We used an open MEG dataset (Armeni et al., 2022) from 3 subjects, each undergoing 10 one-hour recording sessions while listening to narrative audiobooks (~30 hours total). MEG data were preprocessed, aligned to word onsets, and source-localized using dSPM in MNE. The text of the same audiobooks was input to 42 checkpoints saved during the pre-training of a 6B LLM with 33 layers (the first and last layers were excluded from the following analysis). The LLM was pre-trained on 1.6T tokens, and checkpoints were saved every 40B tokens (~10^4 iterations). At each checkpoint, the LLM was evaluated on 23 tasks spanning common-sense reasoning, math & code, reading comprehension and other language tests. We fitted an encoding model using ridge regression to map LLM hidden states (reduced to 600 PCA components) onto source-localized MEG signals (Caucheteux et al., 2023; Caucheteux & King, 2022; Goldstein et al., 2022). Brain-score, quantifying the extent to which LLM hidden states matched brain activity during speech comprehension, was computed for each cortical vertex within an epoch aligned to word onset, separately for each layer and checkpoint. Changes in brain-score relative to the first checkpoint were assessed across pre-training, and these changes were further correlated with LLM task performance across checkpoints.
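For illustration, a minimal Python sketch of this encoding analysis is given below. The array names (hidden, meg), the cross-validation scheme, and the ridge penalty grid are assumptions made for the sketch, not the authors' implementation.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def brain_score(hidden, meg, n_components=600, alphas=np.logspace(-1, 6, 8)):
    """Sketch of a brain-score computation: ridge-regress LLM hidden states onto MEG epochs.

    hidden : (n_words, n_units) hidden states from one layer/checkpoint (assumed input)
    meg    : (n_words, n_vertices, n_times) source-localized epochs aligned to word onset
    Returns a (n_vertices, n_times) array of cross-validated prediction correlations.
    """
    n_words, n_vertices, n_times = meg.shape
    X = PCA(n_components=min(n_components, hidden.shape[1])).fit_transform(hidden)
    Y = meg.reshape(n_words, -1)               # flatten vertices x time into regression targets
    scores = np.zeros(Y.shape[1])
    cv = KFold(n_splits=5, shuffle=False)      # contiguous folds preserve temporal order
    for train, test in cv.split(X):
        pred = RidgeCV(alphas=alphas).fit(X[train], Y[train]).predict(X[test])
        # Pearson correlation between predicted and observed signals, per target
        pz = (pred - pred.mean(0)) / pred.std(0)
        tz = (Y[test] - Y[test].mean(0)) / Y[test].std(0)
        scores += (pz * tz).mean(0) / cv.get_n_splits()
    return scores.reshape(n_vertices, n_times)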

Results:

We focused on brain-scores in Heschl's gyrus (HG), the superior temporal gyrus (STG) and the inferior frontal gyrus (IFG), which form a functional hierarchy from low-level to high-level processing for speech comprehension. All three regions exhibited high brain-scores between 0 and 0.4 s post word onset across checkpoints, particularly in the shallow layers (i.e., 1-10), with HG and STG showing higher scores than IFG (Figure 1A, upper panel). As shown in the lower panel of Figure 1A, brain-scores of the shallow layers tended to increase during pre-training, whereas those of the middle layers (i.e., 11-20) tended to decrease. In contrast, deep-layer scores (i.e., 21-31) in IFG increased with training, both before and after word onset, suggesting a shared role of prediction based on contextual information in both the brain and the LLM. Moreover, we found a similar pattern for the correlation between brain-score and LLM task performance across checkpoints. As shown in Figure 1B, positive correlations were mainly found in the shallow layers for HG and STG, but in the deep layers for IFG, while negative correlations were seen in the middle layers of all three regions. These findings suggest that shallow LLM layers resemble low-level processing areas in the brain, while deep layers increasingly align with IFG as pre-training progresses.
Supporting Image: results_.png
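As a rough sketch of the checkpoint-level analysis reported above (again with hypothetical inputs, not the authors' code): given region-averaged brain-scores per layer at each checkpoint and the corresponding mean task accuracy, the brain-score change and its per-layer correlation with task performance could be computed as follows.

import numpy as np
from scipy.stats import pearsonr

def score_vs_performance(layer_scores, task_perf):
    """layer_scores : (n_checkpoints, n_layers) region-averaged brain-scores (assumed input)
    task_perf    : (n_checkpoints,) mean accuracy over the evaluation tasks (assumed input)
    Returns the brain-score change relative to the first checkpoint and the per-layer
    Pearson correlation between brain-score and task performance across checkpoints.
    """
    delta = layer_scores - layer_scores[0]          # change relative to the first checkpoint
    r = np.array([pearsonr(layer_scores[:, l], task_perf)[0]
                  for l in range(layer_scores.shape[1])])
    return delta, r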
 

Conclusions:

Our findings suggest that brain-score not only reflects the alignment between neural responses and model activations but is also associated with the task performance of LLMs, highlighting its potential as a valuable evaluation metric.

Language:

Language Comprehension and Semantics 1
Language Other

Modeling and Analysis Methods:

EEG/MEG Modeling and Analysis 2

Keywords:

Language
MEG
Modeling

1|2 Indicates the priority used for review

Abstract Information

By submitting your proposal, you grant permission for the Organization for Human Brain Mapping (OHBM) to distribute your work in any format, including video, audio, print and electronic text through OHBM OnDemand, social media channels, the OHBM website, or other electronic publications and media.

I accept

The Open Science Special Interest Group (OSSIG) is introducing a reproducibility challenge for OHBM 2025. This new initiative aims to enhance the reproducibility of scientific results and foster collaborations between labs. Teams will consist of a “source” party and a “reproducing” party, and will be evaluated on the success of their replication, the openness of the source work, and additional deliverables. Click here for more information. Propose your OHBM abstract(s) as source work for future OHBM meetings by selecting one of the following options:

I do not want to participate in the reproducibility challenge.

Please indicate below if your study was a "resting state" or "task-activation” study.

Task-activation

Healthy subjects only or patients (note that patient studies may also involve healthy subjects):

Healthy subjects

Was this research conducted in the United States?

No

Were any human subjects research approved by the relevant Institutional Review Board or ethics panel? NOTE: Any human subjects studies without IRB approval will be automatically rejected.

Yes

Were any animal research approved by the relevant IACUC or other animal research panel? NOTE: Any animal studies without IACUC approval will be automatically rejected.

Not applicable

Please indicate which methods were used in your research:

MEG

Which processing packages did you use for your study?

FreeSurfer
Other, Please list - MNE

Provide references using APA citation style.

Armeni, K., Güçlü, U., van Gerven, M., & Schoffelen, J.-M. (2022). A 10-hour within-participant magnetoencephalography narrative dataset to test models of language comprehension. Scientific Data, 9(1), 278. https://doi.org/10.1038/s41597-022-01382-7
Caucheteux, C., Gramfort, A., & King, J. R. (2023). Evidence of a predictive coding hierarchy in the human brain listening to speech. Nature Human Behaviour, 7(3), 430-441. https://doi.org/10.1038/s41562-022-01516-2
Caucheteux, C., & King, J. R. (2022). Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1), 134. https://doi.org/10.1038/s42003-022-03036-1
Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey, B., Nastase, S. A., Feder, A., Emanuel, D., Cohen, A., Jansen, A., Gazula, H., Choe, G., Rao, A., Kim, C., Casto, C., Fanda, L., Doyle, W., Friedman, D., . . . Hasson, U. (2022). Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3), 369-380. https://doi.org/10.1038/s41593-022-01026-4
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences of the United States of America, 118(45), e2105646118. https://doi.org/10.1073/pnas.2105646118

UNESCO Institute of Statistics and World Bank Waiver Form

I attest that I currently live, work, or study in a country on the UNESCO Institute of Statistics and World Bank List of Low and Middle Income Countries list provided.

No