Deep Speech-to-Text Models Capture the Neural Basis of Spontaneous Speech in Everyday Conversations

Presented During:

Tuesday, June 25, 2024: 12:00 PM - 1:15 PM
COEX  
Room: Grand Ballroom 101-102  

Poster No:

1053 

Submission Type:

Abstract Submission 

Authors:

Ariel Goldstein1, Haocheng Wang2, Leonard Niekerken2, Zaid Zada2, Bobbi Aubrey2, Tom Sheffer3, Samuel Nastase2, Mariano Schain3, Harshvardhan Gazula2, Aditi Singh2, Aditi Rao2, Gina Choe2, Catherine Kim2, Werner Doyle4, Daniel Friedman4, Sasha Devore4, Patricia Dugan4, Avinatan Hassidim3, Michael Brenner3, Yossi Matias3, Orrin Devinsky4, Adeen Flinker4, Uri Hasson2

Institutions:

1Hebrew University, Jerusalem, Israel, 2Princeton University, Princeton, NJ, 3Google Research, Mountain View, CA, 4New York University School of Medicine, New York, NY

First Author:

Ariel Goldstein  
Hebrew University
Jerusalem, Israel

Co-Author(s):

Haocheng Wang  
Princeton University
Princeton, NJ
Leonard Niekerken  
Princeton University
Princeton, NJ
Zaid Zada  
Princeton University
Princeton, NJ
Bobbi Aubrey  
Princeton University
Princeton, NJ
Tom Sheffer  
Google Research
Mountain View, CA
Samuel Nastase  
Princeton University
Princeton, NJ
Mariano Schain  
Google Research
Mountain View, CA
Harshvardhan Gazula  
Princeton University
Princeton, NJ
Aditi Singh  
Princeton University
Princeton, NJ
Aditi Rao  
Princeton University
Princeton, NJ
Gina Choe  
Princeton University
Princeton, NJ
Catherine Kim  
Princeton University
Princeton, NJ
Werner Doyle  
New York University School of Medicine
New York, NY
Daniel Friedman  
New York University School of Medicine
New York, NY
Sasha Devore  
New York University School of Medicine
New York, NY
Patricia Dugan  
New York University School of Medicine
New York, NY
Avinatan Hassidim  
Google Research
Mountain View, CA
Michael Brenner  
Google Research
Mountain View, CA
Yossi Matias  
Google Research
Mountain View, CA
Orrin Devinsky  
New York University School of Medicine
New York, NY
Adeen Flinker  
New York University School of Medicine
New York, NY
Uri Hasson  
Princeton University
Princeton, NJ

Introduction:

One of the most distinctively human behaviors is our ability to use language for communication during spontaneous conversations. We collected continuous speech recordings and concurrent neural signals from epilepsy patients during their week-long hospital stay, resulting in a uniquely large ECoG dataset comprising 100 hours of spontaneous, open-ended conversation. Deep learning provides a novel computational framework that embraces the multidimensional and context-dependent nature of language (Goldstein et al., 2022; Schrimpf et al., 2021). Here, we use Whisper, a deep multimodal speech-to-text model (Radford et al., 2022), to investigate the neural basis of speech processing.

Methods:

We extracted "speech embeddings" from the Whisper encoder network based on continuous speech input and "language embeddings" from the decoder network based on transcript input. To test whether these embeddings capture neural activity during natural conversations, we built separate linear encoding models from the speech and language embeddings. Electrode-wise encoding models were fit at each lag from -2000 ms to +2000 ms relative to word onset. To evaluate encoding model performance, we calculated the correlation between predicted and actual neural signals for held-out test words using ten-fold cross-validation.
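
To make the pipeline concrete, below is a minimal Python sketch of one way to extract the two embedding types and fit lag-wise encoding models, assuming the Hugging Face Transformers implementation of Whisper and scikit-learn; the model size ("openai/whisper-medium"), the ridge regularization, and all variable names are illustrative assumptions rather than the authors' actual code.

```python
# Minimal sketch: Whisper speech/language embeddings + lag-wise electrode encoding.
# Assumed libraries: transformers, torch, scikit-learn, numpy.
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperModel
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperModel.from_pretrained("openai/whisper-medium").eval()

def whisper_embeddings(audio, sampling_rate, transcript):
    """Return speech (encoder) and language (decoder) embeddings for one ~30 s segment."""
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    token_ids = processor.tokenizer(transcript, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_features=inputs.input_features,
                    decoder_input_ids=token_ids)
    speech_emb = out.encoder_last_hidden_state.squeeze(0).numpy()   # audio frames x dim
    language_emb = out.last_hidden_state.squeeze(0).numpy()         # text tokens x dim
    return speech_emb, language_emb

def lagged_encoding(word_embeddings, neural):
    """Electrode-wise linear encoding at each lag, scored by ten-fold CV correlation.

    word_embeddings: words x features, one embedding aligned to each word
        (e.g., the encoder frame or decoder token nearest each word's onset).
    neural: words x lags x electrodes, the signal of each electrode around word onset.
    """
    n_words, n_lags, n_elecs = neural.shape
    scores = np.zeros((n_lags, n_elecs))
    cv = KFold(n_splits=10, shuffle=False)  # contiguous folds to limit temporal leakage
    for lag in range(n_lags):
        fold_scores = []
        for train, test in cv.split(word_embeddings):
            reg = RidgeCV(alphas=np.logspace(-1, 4, 6))
            reg.fit(word_embeddings[train], neural[train, lag])
            pred = reg.predict(word_embeddings[test])
            rs = [np.corrcoef(pred[:, e], neural[test, lag, e])[0, 1]
                  for e in range(n_elecs)]
            fold_scores.append(rs)
        scores[lag] = np.mean(fold_scores, axis=0)
    return scores  # mean held-out correlation per lag and electrode
```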

Results:

We observed encoding patterns indicating a distributed cortical hierarchy of speech processing (Fig. 1A, 1B): electrodes in the superior temporal gyrus (STG) and somatomotor areas (SM) showed higher correlations with speech embeddings, whereas higher-level language areas such as the inferior frontal gyrus (IFG), posterior middle temporal gyrus (pMTG), and angular gyrus (AG) correlated more strongly with language embeddings.

Furthermore, we observed a spatial distribution of electrodes preferentially engaged in speech production versus speech comprehension. High-level language areas showed mixed selectivity, indicating a shared neural mechanism for speech production and comprehension (Fig. 1C). During speech production, some electrodes showed two encoding peaks, one before and one after word onset. We therefore trained encoding models on comprehension data and tested their predictions on production data. These 'flipped' encoding models, which learned a comprehension-specific mapping, successfully predicted the neural signal after word onset during speech production (Fig. 1D).
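
The 'flipped' analysis can be sketched as follows, reusing the lag-wise structure above; the array names and ridge regularization are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the "flipped" encoding models: fit weights on comprehension
# words, then evaluate them on production words at each lag.
import numpy as np
from sklearn.linear_model import RidgeCV

def flipped_encoding(comp_emb, comp_neural, prod_emb, prod_neural):
    """comp_emb/prod_emb: words x features; comp_neural/prod_neural: words x lags x electrodes."""
    n_lags, n_elecs = comp_neural.shape[1], comp_neural.shape[2]
    scores = np.zeros((n_lags, n_elecs))
    for lag in range(n_lags):
        reg = RidgeCV(alphas=np.logspace(-1, 4, 6))
        reg.fit(comp_emb, comp_neural[:, lag])   # comprehension-specific mapping
        pred = reg.predict(prod_emb)             # applied to production words
        scores[lag] = [np.corrcoef(pred[:, e], prod_neural[:, lag, e])[0, 1]
                       for e in range(n_elecs)]
    return scores  # correlation per lag and electrode, evaluated on production data
```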

Evaluating encoding models at each lag relative to word onset allows us to trace the temporal flow of linguistic information across speech-related ROIs. Language encoding in IFG peaked significantly earlier than speech encoding in STG during speech production (Fig. 2A), and this order reversed during speech comprehension (Fig. 2B).
Supporting Image: OHBM-figure-1.png
   ·Figure 1. Contrast between speech and language encoding (A, B) and between speech production and comprehension (C). Encoding model trained on comprehension and tested on production (D).
Supporting Image: OHBM-figure-2.png
   ·Figure 2. Temporal dynamics of speech production and speech comprehension across different brain areas.
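
As an illustration of how the peak encoding lag per ROI could be extracted and compared (e.g., language encoding in IFG versus speech encoding in STG), here is a minimal Python sketch; the `scores` arrays, ROI names, and the 25 ms lag step are assumptions, not the authors' reported parameters.

```python
# Illustrative lag analysis: find the lag at which each ROI's encoding
# performance peaks, then compare peak timing across ROIs.
import numpy as np

lags_ms = np.arange(-2000, 2001, 25)  # assumed 25 ms step around word onset

def peak_lag(roi_scores):
    """Lag (ms) of the maximum electrode-averaged encoding correlation.

    roi_scores: lags x electrodes correlation array from the encoding step above.
    """
    mean_curve = np.nanmean(roi_scores, axis=1)  # average over electrodes in the ROI
    return lags_ms[np.argmax(mean_curve)]

# e.g., during production one would compare
# peak_lag(language_scores["IFG"]) vs. peak_lag(speech_scores["STG"])
```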
 

Conclusions:

Our encoding models reveal a distributed cortical hierarchy in which auditory and somatomotor areas are better aligned with speech embeddings, while high-level frontal and parietal language areas are better aligned with language embeddings (Fig. 1). These findings are in line with established theories of the cortical hierarchy of language processing (Hickok & Poeppel, 2007). At the same time, electrode-wise selectivity for speech or linguistic information was mixed across most brain areas. Such mixed selectivity is common in both biological and artificial learning systems that are "directly" fit to the complex structure of their inputs (Hasson, Nastase, & Goldstein, 2020). We identified shared mechanisms between speech production and comprehension (Fig. 1C, 1D) and mapped the temporal flow of information during spontaneous speech production and comprehension (Fig. 2A, 2B). This study demonstrates that deep speech-to-text models are a powerful computational tool for building comprehensive models of speech processing in the brain, without compromising the rich dynamic and contextual qualities inherent in everyday language.

Language:

Language Comprehension and Semantics
Speech Perception 2
Speech Production 1

Modeling and Analysis Methods:

EEG/MEG Modeling and Analysis
Other Methods

Keywords:

Language
Machine Learning
Modeling
Other - speech production; speech comprehension; large language models; multimodal language models; deep learning

1|2 Indicates the priority used for review

References:

Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey, B., . . . Cohen, A. (2022). Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3), 369-380.

Hasson, U., Nastase, S. A., & Goldstein, A. (2020). Direct fit to nature: An evolutionary perspective on biological and artificial neural networks. Neuron, 105(3), 416-434.

Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393-402.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., . . . Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45), e2105646118.