Poster No:
1894
Submission Type:
Abstract Submission
Authors:
Panpan Chen1, Chi Zhang1, Bao Li1, Tong Li1, Shuxiao Ma1, Linyuan Wang1, Long Cao1, Bin Yan2
Institutions:
1Information Engineering University, Zhengzhou, Henan, 2Information Engineering University, Zhengzhou, Henan
First Author:
Panpan Chen
Information Engineering University
Zhengzhou, Henan
Co-Author(s):
Chi Zhang
Information Engineering University
Zhengzhou, Henan
Bao Li
Information Engineering University
Zhengzhou, Henan
Tong Li
Information Engineering University
Zhengzhou, Henan
Shuxiao Ma
Information Engineering University
Zhengzhou, Henan
Linyuan Wang
Information Engineering University
Zhengzhou, Henan
Long Cao
Information Engineering University
Zhengzhou, Henan
Bin Yan
Information Engineering University
Zhengzhou, Henan
Introduction:
Current DNN models for neural encoding of visual stimuli face alignment issues and training difficulties when predicting brain responses to dynamic stimuli, and they lack multi-modal information integration (Khosla et al., 2020; Wen et al., 2018). To overcome these limitations, this paper proposes a brain-aware multi-modal prompt learning framework that leverages pre-trained foundation models and fine-tunes them with textual and visual prompts tailored to specific regions of interest (ROIs). Our goal is to bridge foundation models and neural encoding tasks, showcasing the potential of prompt-based fine-tuning.
Methods:
We conduct our experiments on the Natural Facial Expressions Dataset (NFED) (Chen et al., 2024). The Brain-Aware Multi-Modal Prompt Learning Visual Encoding (BMPL-VE) model (Figure 1) includes a visual feature extractor, a textual feature extractor, brain-aware multi-modal prompt learning, a feature fusion module, and a voxel-wise mapping module.
In the BMPL-VE model, we treat neural encoding for each brain region of interest (ROI) as a distinct downstream task. To fine-tune each ROI separately, we introduce brain-aware multi-modal prompt learning, which incorporates Textual Prompt Learning and Visual Prompt Learning tailored to each specific ROI. For textual feature extraction, we use the pre-trained CLIP4Clip model (Luo et al., 2021); for visual feature extraction, we use the VideoMAE V2 model (Wang et al., 2023). The visual prompts are derived by mapping the textual prompts through a linear coupling function, which serves as a bridge between the textual and visual modalities.
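A minimal PyTorch sketch of this idea (our illustration, not the authors' implementation; prompt count and feature dimensions are assumptions) shows ROI-specific textual prompt tokens coupled to visual prompts through a learnable linear function:

import torch
import torch.nn as nn

class BrainAwarePrompts(nn.Module):
    def __init__(self, n_prompts=8, text_dim=512, vis_dim=768):
        super().__init__()
        # Learnable textual prompt tokens for one ROI (dimensions are illustrative)
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
        # Linear coupling function bridging the textual and visual modalities
        self.coupling = nn.Linear(text_dim, vis_dim)

    def forward(self):
        # Visual prompts are derived from the textual prompts via the coupling
        vis_prompts = self.coupling(self.text_prompts)
        return self.text_prompts, vis_prompts

In this reading, one prompt module would be instantiated and fine-tuned per ROI while the pre-trained CLIP4Clip and VideoMAE V2 backbones remain frozen.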
Finally, the prediction head of the BMPL-VE model is a multi-layer perceptron that maps the extracted features to the voxel-wise brain responses.
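As a rough sketch of the fusion and voxel-wise mapping stages (an assumption on our part; the actual fusion operation, depth, and voxel count may differ), the prompt-conditioned textual and visual features could be concatenated and passed to an MLP head that predicts one ROI's voxel responses:

import torch
import torch.nn as nn

class FusionEncodingHead(nn.Module):
    def __init__(self, text_dim=512, vis_dim=768, hidden_dim=1024, n_voxels=3000):
        super().__init__()
        # Concatenation-based fusion followed by an MLP voxel-wise mapping
        self.head = nn.Sequential(
            nn.Linear(text_dim + vis_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_voxels),  # predicted responses for one ROI
        )

    def forward(self, text_feat, vis_feat):
        fused = torch.cat([text_feat, vis_feat], dim=-1)
        return self.head(fused)

# Example usage with random features for a batch of 4 stimuli:
# predicted = FusionEncodingHead()(torch.randn(4, 512), torch.randn(4, 768))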

Results:
The BMPL-VE model is evaluated against several fine-tuning baselines, including full fine-tuning (F), AdaptFormer (AF), visual prompt learning (VPL), and a two-stage model.
In Figure 2a, we visualize the encoding results of our BMPL-VE model alongside the comparison models. BMPL-VE outperforms the comparison models across all visual ROIs, particularly in high-level regions such as LO, pSTS, TPJ, MT, IPS, V3A, and V3B.
To further validate our method, we compared the performance of the BMPL-VE and F-VE models in predicting voxel-wise responses. Comparing Figures 2b and 2c shows that BMPL-VE predicts cortical responses more accurately across the entire visual cortex, particularly in high-level regions, demonstrating its capability to capture and encode critical brain activity. The prediction accuracy for each ROI was computed as the average of the top 300 voxels within the ROI, averaged across the 5 participants; error bars indicate the standard error. The black dashed lines denote the significance threshold of 0.27 (p < 0.001).
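A short NumPy sketch of how such an ROI-level summary can be computed (our reading of the reported procedure, not the authors' script; the correlation arrays are hypothetical inputs):

import numpy as np

def roi_accuracy(voxel_corrs_per_subject, top_k=300):
    # voxel_corrs_per_subject: list of 1-D arrays of voxel-wise prediction
    # correlations, one array per participant (5 participants here)
    per_subject = np.asarray(
        [np.sort(c)[-top_k:].mean() for c in voxel_corrs_per_subject]
    )
    mean = per_subject.mean()                                   # ROI accuracy
    sem = per_subject.std(ddof=1) / np.sqrt(len(per_subject))   # standard error
    return mean, sem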

Conclusions:
Our study shows that incorporating textual descriptions of dynamic natural stimuli significantly enhances the representational capacity for high-level visual areas, enabling a more nuanced and accurate capture of the unique characteristics of different ROIs. This fusion of multi-modal information not only improves encoding performance in these regions but also demonstrates the potential of our BMPL-VE model for understanding and representing complex visual information.
Modeling and Analysis Methods:
Methods Development
Novel Imaging Acquisition Methods:
BOLD fMRI 1
Perception, Attention and Motor Behavior:
Perception: Visual 2
Keywords:
Computational Neuroscience
FUNCTIONAL MRI
Vision
1|2 Indicates the priority used for review
By submitting your proposal, you grant permission for the Organization for Human Brain Mapping (OHBM) to distribute your work in any format, including video, audio, print and electronic text through OHBM OnDemand, social media channels, the OHBM website, or other electronic publications and media.
I accept
The Open Science Special Interest Group (OSSIG) is introducing a reproducibility challenge for OHBM 2025. This new initiative aims to enhance the reproducibility of scientific results and foster collaborations between labs. Teams will consist of a “source” party and a “reproducing” party, and will be evaluated on the success of their replication, the openness of the source work, and additional deliverables.
Propose your OHBM abstract(s) as source work for future OHBM meetings by selecting one of the following options:
I am submitting this abstract as the outcome of the OHBM-OSSIG reproducibility challenge, having reproduced previously submitted work with the original author(s)’ agreement. I have cited the original work and acknowledged the origin team in the abstract.
Please indicate below if your study was a "resting state" or "task-activation” study.
Task-activation
Healthy subjects only or patients (note that patient studies may also involve healthy subjects):
Healthy subjects
Was this research conducted in the United States?
No
Were any human subjects research approved by the relevant Institutional Review Board or ethics panel?
NOTE: Any human subjects studies without IRB approval will be automatically rejected.
Yes
Were any animal research approved by the relevant IACUC or other animal research panel?
NOTE: Any animal studies without IACUC approval will be automatically rejected.
Not applicable
Please indicate which methods were used in your research:
Functional MRI
For human MRI, what field strength scanner do you use?
3.0T
Which processing packages did you use for your study?
SPM
FSL
FreeSurfer
Provide references using APA citation style.
Chen, P., et al. (2024). An fMRI dataset in response to large-scale short natural dynamic facial expression videos. Scientific Data, 11(1), 1247.
Khosla, M., et al. (2020). Neural encoding with visual attention. arXiv:2010.00516. https://doi.org/10.48550/arXiv.2010.00516
Luo, H., et al. (2021). CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval. arXiv:2104.08860.
Wang, L., et al. (2023). VideoMAE V2: Scaling video masked autoencoders with dual masking. arXiv:2303.16727.
Wen, H., et al. (2018). Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral Cortex, 28(12), 4136–4160.
No