Emotion Prediction Using Multimodal Integration of fMRI Signals

Poster No:

1125 

Submission Type:

Abstract Submission 

Authors:

Jihyuk Ahn1, Jinsu Kim1, Hyun-Chul Kim1

Institutions:

1Kyungpook National University, Daegu, Korea, Republic of

First Author:

Jihyuk Ahn  
Kyungpook National University
Daegu, Korea, Republic of

Co-Author(s):

Jinsu Kim  
Kyungpook National University
Daegu, Korea, Republic of
Hyun-Chul Kim  
Kyungpook National University
Daegu, Korea, Republic of

Introduction:

The CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset is one of the most extensive resources for multimodal sentiment analysis and emotion recognition (Zadeh et al., 2018). Recent research has shown that leveraging multiple modalities, such as text, audio, and video, improves emotion prediction accuracy compared with single-modality approaches (Zhang et al., 2023). In this study, we propose a novel deep learning model that predicts emotional responses by integrating the CMU-MOSEI dataset with functional magnetic resonance imaging (fMRI) data.

Methods:

For fMRI data collection, 10 healthy right-handed adult participants (mean age = 24.6 ± 2.6 years; 4 males, 6 females) watched a total of 120 video clips selected from the CMU-MOSEI dataset. Each clip lasted between 45 and 75 seconds. After watching each clip, participants rated their emotional experience using the Self-Assessment Manikin (SAM) scale (Bradley et al., 1994), a non-verbal pictorial tool that measures pleasure (valence), arousal, and dominance on a 9-point scale (Fig 1a). The fMRI data were preprocessed using SPM8.
We developed the Multi-Modal Attention Emotion Model (MAEM) to efficiently process and integrate information from multiple modalities (Fig 1b). Audio, video, and text features were extracted using COVAREP (Degottex et al., 2014), OpenFace 2.0 (Baltrušaitis et al., 2016), and BERT (Devlin et al., 2019), respectively. The fMRI signals extracted from the insula, a region that plays a critical role in the experience and regulation of emotion (Phan et al., 2002), were used as fMRI features. In MAEM, one modality is designated as the base modality for learning inter-modality relationships. The base modality learns intra-modality relationships through multi-head self-attention modules, while relationships between the base modality and the other modalities are learned through multi-head cross-attention modules. After the attention layers, the tokens are merged into a single vector, which is fed into a Multi-Layer Perceptron (MLP) to predict the emotion valence score. The performance of MAEM was evaluated using subject-wise 5-fold cross-validation.
Supporting Image: fig1_final.png
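
To make the attention-based fusion concrete, the following PyTorch sketch illustrates a MAEM-style block: the base modality is processed with multi-head self-attention, attends to each remaining modality through multi-head cross-attention, and the merged token vector is passed to an MLP that regresses the valence score. The dimensions, layer counts, residual fusion, and mean-pooling merge are illustrative assumptions; the original hyperparameters are not reported in this abstract.

import torch
import torch.nn as nn


class MAEMSketch(nn.Module):
    """Minimal sketch of a MAEM-style fusion block (hypothetical hyperparameters)."""

    def __init__(self, d_model=128, n_heads=4, n_other=3):
        super().__init__()
        # Self-attention over tokens of the base modality (e.g., video frames).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One cross-attention module per non-base modality (e.g., audio, text, fMRI).
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_other)]
        )
        # MLP regressor mapping the merged token vector to a single valence score.
        self.mlp = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, base, others):
        # base: (B, T_base, d_model); others: list of (B, T_i, d_model) tensors.
        x, _ = self.self_attn(base, base, base)      # intra-modality relationships
        for attn, other in zip(self.cross_attn, others):
            c, _ = attn(x, other, other)             # base modality attends to another modality
            x = x + c                                # residual fusion (assumed)
        merged = x.mean(dim=1)                       # merge tokens into a single vector (assumed)
        return self.mlp(merged).squeeze(-1)          # predicted valence score

# Toy usage with random features standing in for video (base), audio, text, and fMRI.
model = MAEMSketch()
video = torch.randn(2, 30, 128)
audio, text, fmri = (torch.randn(2, t, 128) for t in (50, 20, 10))
print(model(video, [audio, text, fmri]).shape)  # torch.Size([2])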
 

Results:

The performance of different modality combinations was evaluated using the Pearson correlation coefficient between the predicted and actual valence scores across all test folds (102–118 video clips). Single-modality fMRI showed low performance, with an average correlation of 0.08 ± 0.01, and adding one or two additional modalities did not yield significant improvements. However, the highest performance was achieved when all four modalities (audio, text, video, and fMRI) were combined, yielding a correlation of 0.25 ± 0.07 (mean ± standard error). The improvement of the fMRI-based MAEM over the MLP baseline model was statistically significant (p < 10⁻²) (Fig 2a). When the base modality was varied in MAEM, the video-based MAEM showed the best performance, achieving a correlation of 0.33 ± 0.01. This was higher than the other base-modality configurations: text-based MAEM (0.27 ± 0.01), fMRI-based MAEM (0.25 ± 0.01), and audio-based MAEM (0.23 ± 0.02) (Fig 2b).
Supporting Image: fig2_final.png
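
For reference, the sketch below shows how a subject-wise 5-fold split and the per-fold Pearson correlation between predicted and rated valence could be computed. The grouping layout, feature shapes, and placeholder predictions are illustrative assumptions only; the actual model fitting is not reproduced here.

import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import GroupKFold

# Toy layout: 10 subjects, 120 clips each; held-out folds contain all clips of the held-out subjects.
subject_ids = np.repeat(np.arange(10), 120)
features = np.random.randn(1200, 64)            # placeholder for fused multimodal features
valence = np.random.uniform(1, 9, size=1200)    # placeholder SAM valence ratings (9-point scale)

fold_r = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(features, valence, groups=subject_ids):
    # Fitting and applying MAEM on each fold would go here; random values stand in for predictions.
    preds = np.random.uniform(1, 9, size=len(test_idx))
    r, _ = pearsonr(preds, valence[test_idx])
    fold_r.append(r)

# Report mean correlation and standard error across folds.
print(f"mean r = {np.mean(fold_r):.2f} ± {np.std(fold_r) / np.sqrt(len(fold_r)):.2f} (SE)")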
 

Conclusions:

Our findings indicate that integrating multimodal data with fMRI signals enhances emotion prediction performance. Notably, the video-based cross-attention configuration of MAEM outperformed the other base-modality configurations, suggesting that video features play a pivotal role in eliciting fMRI signals and capturing individual emotional responses. Future research should examine prediction performance for arousal and dominance scores, as well as conduct ablation studies to assess the specific contributions of the attention modules.

Emotion, Motivation and Social Neuroscience:

Emotional Perception 2

Modeling and Analysis Methods:

Activation (eg. BOLD task-fMRI)
Classification and Predictive Modeling 1

Keywords:

Emotions
FUNCTIONAL MRI
Other - CMU-MOSEI, Multimodal Sentiment Analysis, Self-Assessment Manikin

1|2 Indicates the priority used for review

Abstract Information

By submitting your proposal, you grant permission for the Organization for Human Brain Mapping (OHBM) to distribute your work in any format, including video, audio print and electronic text through OHBM OnDemand, social media channels, the OHBM website, or other electronic publications and media.

I accept

The Open Science Special Interest Group (OSSIG) is introducing a reproducibility challenge for OHBM 2025. This new initiative aims to enhance the reproducibility of scientific results and foster collaborations between labs. Teams will consist of a “source” party and a “reproducing” party, and will be evaluated on the success of their replication, the openness of the source work, and additional deliverables. Click here for more information. Propose your OHBM abstract(s) as source work for future OHBM meetings by selecting one of the following options:

I do not want to participate in the reproducibility challenge.

Please indicate below if your study was a "resting state" or "task-activation" study.

Task-activation

Healthy subjects only or patients (note that patient studies may also involve healthy subjects):

Healthy subjects

Was this research conducted in the United States?

No

Were any human subjects research approved by the relevant Institutional Review Board or ethics panel? NOTE: Any human subjects studies without IRB approval will be automatically rejected.

Yes

Were any animal research approved by the relevant IACUC or other animal research panel? NOTE: Any animal studies without IACUC approval will be automatically rejected.

Not applicable

Please indicate which methods were used in your research:

Functional MRI

For human MRI, what field strength scanner do you use?

3.0T

Which processing packages did you use for your study?

SPM

Provide references using APA citation style.

1. Baltrušaitis, T., Robinson, P., & Morency, L. P. (2016). OpenFace: An open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 1-10). IEEE.
2. Bradley, M. M., & Lang, P. J. (1994). Measuring emotion: The Self-Assessment Manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1), 49-59.
3. Degottex, G., Kane, J., Drugman, T., Raitio, T., & Scherer, S. (2014). COVAREP—A collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 960-964). IEEE.
4. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers) (pp. 4171-4186).
5. Phan, K. L., Wager, T., Taylor, S. F., & Liberzon, I. (2002). Functional neuroanatomy of emotion: A meta-analysis of emotion activation studies in PET and fMRI. NeuroImage, 16(2), 331-348.
6. Zadeh, A. B., Liang, P. P., Poria, S., Cambria, E., & Morency, L. P. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2236-2246).
7. Zhang, H., Wang, Y., Yin, G., Liu, K., Liu, Y., & Yu, T. (2023). Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 756-767).

Acknowledgment: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2022-00166735 & No. RS-2023-00218987).

UNESCO Institute of Statistics and World Bank Waiver Form

I attest that I currently live, work, or study in a country on the UNESCO Institute of Statistics and World Bank List of Low and Middle Income Countries list provided.

No