Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and/or (4) implausibly large memory (e.g. thousands of contextual words). These elements highlight the need to identify algorithms that, under these limitations, would suffice to account for both behavioral and brain responses. Focusing on the issue of speech processing, we hypothesize here that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin individuals recorded with functional Magnetic Resonance Imaging (fMRI) while they listened to ~1h of audio books. Our results are four-fold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech, a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to that of the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific, and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These elements, resulting from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition that shape the human brain.
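The abstract's core operation, quantifying how "brain-like" the model's representations are, is typically implemented in this literature as a linear "brain score": a cross-validated ridge regression from model activations to fMRI responses. The sketch below illustrates that idea under stated assumptions; the checkpoint name, the single-voxel placeholder data, and the ridge/cross-validation settings are illustrative, not the authors' actual pipeline.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Assumed public checkpoint; the paper's own models may differ.
model_name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

def layer_activations(waveform, sr=16_000, layer=8):
    """Return (n_frames, hidden_dim) activations of one transformer layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0).numpy()

# Placeholder stimulus and recording: 60 s of "audio" and one voxel's
# BOLD signal over 20 fMRI volumes (TRs). Real data would come from
# the audio-book stimuli and fMRI recordings described above.
waveform = np.random.randn(16_000 * 60).astype(np.float32)
acts = layer_activations(waveform)                    # ~(2999, 768)
n_trs = 20
frames_per_tr = len(acts) // n_trs
X = acts[: n_trs * frames_per_tr]
X = X.reshape(n_trs, frames_per_tr, -1).mean(axis=1)  # one row per TR
Y = np.random.randn(n_trs)                            # placeholder voxel signal

# "Brain score": cross-validated fit of a ridge map from activations to BOLD.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 7))
print("brain score (R^2):", cross_val_score(ridge, X, Y, cv=5).mean())
```

In practice such scores are computed per voxel and per model layer, which is how a network's functional hierarchy can be aligned with the cortical hierarchy of speech processing mentioned above.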