"Who" Is Saying "What"? Brain-Based Decoding of Human Voice and Speech


Science  07 Nov 2008:
Vol. 322, Issue 5903, pp. 970-973
DOI: 10.1126/science.1164318


Can we decipher speech content (“what” is being said) and speaker identity (“who” is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.

In everyday life, we automatically and effortlessly decode speech into language independently of who speaks. Similarly, we recognize a speaker's voice independently of what she or he says. Cognitive and connectionist models postulate that this efficiency depends on the ability of our speech perception and speaker identification systems to extract relevant features from the sensory input and to form efficient abstract representations (1–3). These representations are invariant to changes of the acoustic input, which ensures efficient processing and confers high robustness to noise and signal distortion. Relevant psycholinguistic models consider abstract entities such as phonemes to be the building blocks of the computational chain that transforms an acoustic waveform into a meaningful concept (2, 3). There is also psychoacoustic evidence that the identification of a speaker relies on the extraction of invariant paralinguistic features of his/her voice, such as fundamental frequency (1).

Numerous functional neuroimaging studies have provided important insights into the cortical organization of speech (4–11) and voice (12, 13) processing. However, the subtraction-based experimental logic and the limited neuroanatomical detail allow only partial and indirect inferences on what distinguishes the auditory cortical representations of two natural speech or vocal sounds. Furthermore, it remains unclear how a speech sound is transformed into the more abstract entity of “phoneme” or “speaker” identity. Beyond subtraction, results from functional magnetic resonance adaptation suggest the involvement in voice identification of a specialized region in the right anterior superior temporal sulcus (STS) (14). For speech processing, a hierarchical fractionation of cortical regions for sound-based and for more abstract, higher-level processing has been suggested [(15) and supporting online text].

In the present study, we investigate speech and voice recognition and abstraction at the level of representation and processing of individual sounds. By combining multivariate statistical pattern recognition with single-trial functional magnetic resonance imaging (fMRI) (16–20), we estimate and decode the distinct activation patterns elicited by different speech sounds and directly assess the invariance of the estimated neural representations to acoustic variations of the sensory input.

High spatial resolution (1.5 mm × 1.5 mm × 2 mm) functional images of the auditory cortex were collected while participants (n = 7) listened to speech sounds consisting of three Dutch vowels (/a/, /i/, /u/) recorded from three native Dutch speakers (Fig. 1) (21). Consistent with previous studies (4–6, 8, 10, 12, 14, 15, 22), all sounds evoked significant fMRI responses in a wide expanse of the superior temporal cortex, including early auditory areas (Heschl's gyrus) and multiple regions in the planum temporale (PT), along the superior temporal gyrus, the STS, and the middle temporal gyrus. Univariate statistical contrasts, however, yielded only weak response differences (below significance) or no differences between conditions (fig. S2).

Fig. 1.

Experimental design and stimuli. (A) Example of spectrograms of the stimuli from the nine conditions (three vowels × three speakers). Stimuli were presented during the silent intervals of fMRI measurements and were three natural Dutch vowels (/a/, /i/, and /u/) spoken by three native Dutch speakers (sp1: female, sp2: male, and sp3: male). (B) Representation of the vowels based on the first two formants (F1, F2). Each of the conditions was formed by grouping three different utterances from the same speaker. The inset indicates the mean value and standard deviation of the fundamental frequency (F0) for each of the speakers.

After this initial analysis, we asked whether the estimation of a multivoxel activation fingerprint of a sound would allow us to decipher its content and the identity of the speaker. With a method based on a machine learning classification algorithm (support vector machine) and recursive feature elimination (23, 24), we performed two complementary analyses. We labeled the stimuli and corresponding response patterns either according to the vowel dimension irrespective of the speaker dimension (“vowel learning”) or according to the speaker dimension irrespective of the vowel dimension (“speaker learning”). This led to the grouping of stimuli and responses into three conditions each: /a/, /i/, and /u/, or sp1, sp2, and sp3, respectively. We then examined whether our algorithm, after being trained with a subset of labeled responses (50 trials), would correctly classify the remaining unlabeled responses (10 trials). In all subjects and in all possible pairwise comparisons, the algorithm successfully learned the functional relation between sounds and corresponding spatial patterns and correctly classified the unlabeled sound-evoked patterns, both in the case of vowel learning [/a/ versus /i/ = 0.65 (mean correctness), P = 6 × 10^–5; /a/ versus /u/ = 0.69, P = 2 × 10^–5; /i/ versus /u/ = 0.63, P = 4 × 10^–4 (Fig. 2A)] and in the case of speaker learning [sp1 versus sp2 = 0.70, P = 3 × 10^–5; sp1 versus sp3 = 0.67, P = 8 × 10^–5; sp2 versus sp3 = 0.62, P = 2 × 10^–5 (Fig. 2B)].
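The analysis logic described above can be sketched in code. The snippet below is a simplified, illustrative stand-in, not the authors' implementation: it uses synthetic "activation patterns" instead of fMRI data, and a nearest-centroid linear discriminant in place of the support vector machine, but it reproduces the overall scheme of recursive feature elimination followed by training on 50 labeled trials and testing on 10 held-out trials. All names, voxel counts, and signal strengths are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 60 single-trial "activation patterns"
# (30 per condition, e.g. /a/ vs /i/) over 100 voxels; only a few
# voxels carry condition-related signal, the rest are noise.
n_voxels = 100
informative = [3, 17, 42, 77]

def make_trials(mean_shift, n=30):
    X = rng.normal(0.0, 1.0, size=(n, n_voxels))
    X[:, informative] += mean_shift
    return X

X = np.vstack([make_trials(+1.5), make_trials(-1.5)])
y = np.array([0] * 30 + [1] * 30)

# Split into training (50 trials) and test (10 trials), as in the text.
idx = rng.permutation(60)
train, test = idx[:50], idx[50:]

def fit_linear(X, y):
    """Nearest-centroid linear discriminant: a simplified stand-in
    for the linear support vector machine used in the paper."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w = m0 - m1
    b = -w @ (m0 + m1) / 2
    return w, b

def rfe(X, y, keep=10, step=0.5):
    """Recursive feature elimination: repeatedly refit and discard
    the voxels with the smallest absolute discriminant weights."""
    active = np.arange(X.shape[1])
    while active.size > keep:
        w, _ = fit_linear(X[:, active], y)
        n_drop = min(max(1, int(active.size * step)), active.size - keep)
        order = np.argsort(np.abs(w))      # least informative first
        active = np.sort(active[order[n_drop:]])
    return active

voxels = rfe(X[train], y[train])
w, b = fit_linear(X[train][:, voxels], y[train])
pred = (X[test][:, voxels] @ w + b < 0).astype(int)
accuracy = (pred == y[test]).mean()
print("kept voxels:", voxels.tolist())
print(f"test accuracy: {accuracy:.2f}")
```

With the signal-carrying voxels this well separated, the elimination step retains them and the held-out trials classify well above the 0.5 chance level reported as the baseline in Fig. 2.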

Fig. 2.

Performance of the brain-based decoding of vowels and speakers. Correctness (median value and distribution) of all pairwise classifications when training and testing of the algorithm were based on different subsets of responses to the same stimuli [(A) vowels; (B) speakers] or when training and testing were based on responses to different speakers (C) and vowels (D) (chance level is 0.5).

To investigate the layout and consistency across subjects of the spatial patterns that make this decoding possible, we generated group discriminative maps (Fig. 3 and fig. S3), i.e., maps of the cortical locations that contribute most to the discrimination of conditions. Single-subject reliability maps are reported in fig. S4. Discriminative patterns for vowels [red color (Fig. 3B and fig. S3)] were widely distributed bilaterally in the superior temporal cortex and included regions in the anterior-lateral portion of Heschl's gyrus–Heschl's sulcus, in the PT (mainly in the left hemisphere), and extended portions of the STS/STG (both hemispheres). Discriminative patterns for speakers [blue color (Fig. 3C and fig. S3)] were more confined and right-lateralized than those obtained for vowel discrimination. These patterns included the lateral portion of Heschl's gyrus–Heschl's sulcus, located posteriorly adjacent to a similar region described for vowel discrimination, and three clustered regions along the anterior-posterior axis of the right STS, also interspersed with vowel regions (fig. S3). These findings indicate a spatially distributed model for the representation of both vowel and speaker identity (see supporting online text).

Fig. 3.

Cortical discriminative maps and activation fingerprints for decoding of vowels and speakers. (A to C) Group discriminative maps obtained from cortex-based realignment of individual maps. Maps are visualized on the folded (A) or inflated representation of the cortex [auditory cortex detail in (B) and (C); light gray, gyri; dark gray, sulci] resulting from the realignment of the cortices of the seven participants. A location was color-coded (vowels, red; speakers, blue) if it was present on the individual maps of at least four of the seven subjects. This corresponds to a false discovery rate–corrected threshold of q = 6 × 10^–4 for vowels and q = 9 × 10^–4 for speakers (21). Outlined regions in (B) and (C) indicate cortical regions that were also included in the group maps obtained with the generalization analysis. (D and E) Activation fingerprints of the sounds created from the 15 most discriminative voxels for decoding of vowels (D) and speakers (E) (single-subject data, subject 1). Each axis of the polar plot forming a fingerprint displays the normalized activation level in a voxel. Note the similarity among the fingerprints of the same vowel [horizontal direction in (D)] or speaker [vertical direction in (E)].

Encouraged by these results, we tested the capability of our algorithm to decode brain activity into speech content and speaker identity for completely novel stimuli (i.e., stimuli not used during training). We trained the algorithm to discriminate vowels with samples from one speaker (e.g., /a/ versus /i/ for sp1), or to discriminate speakers with samples from one vowel (e.g., sp1 versus sp2 for /a/), and tested the correctness of this discrimination on the other speakers (e.g., sp2 and sp3) or the other vowels (e.g., /i/ and /u/). With this strategy, stimuli used for training and for testing differ in many acoustic dimensions. Accurate decoding of activation patterns associated with the test stimuli would thus indicate that the learned functional relation between a cortical activation pattern and a vowel (or a speaker) entails information on that vowel (or speaker) beyond the contingent mapping of its acoustic properties. Despite the small number of training samples (20 trials), classification of novel stimuli was accurate in all subjects and in all possible pairwise comparisons, both in the case of vowels [/a/ versus /i/ = 0.66 (mean accuracy), P = 1 × 10^–6; /a/ versus /u/ = 0.62, P = 3 × 10^–5; /i/ versus /u/ = 0.60, P = 7 × 10^–5 (Fig. 2C)] and in the case of speakers [sp1 versus sp2 = 0.62 (mean accuracy), P = 6 × 10^–6; sp1 versus sp3 = 0.65, P = 8 × 10^–7; sp2 versus sp3 = 0.63, P = 2 × 10^–6 (Fig. 2D)]. Although sparser, the corresponding discriminative maps included a subset of the locations highlighted by the previous analyses (outlined regions in Fig. 3, B and C).
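The generalization scheme above amounts to a particular way of splitting trials by condition labels. The toy bookkeeping below illustrates one such split, training a vowel classifier on one speaker's utterances and testing on the speakers never seen in training; the trial counts (10 per condition, hence 20 training trials as in the text) and field names are assumptions for illustration.

```python
from itertools import product

VOWELS = ("a", "i", "u")
SPEAKERS = ("sp1", "sp2", "sp3")

# Toy 3 x 3 design: each trial tagged with its vowel and speaker,
# 10 trials per condition.
trials = [
    {"vowel": v, "speaker": s, "trial": t}
    for v, s in product(VOWELS, SPEAKERS)
    for t in range(10)
]

def vowel_generalization_split(trials, pair=("a", "i"), train_speaker="sp1"):
    """Train a vowel classifier (e.g. /a/ vs /i/) on one speaker's
    trials; test it on trials from the speakers left out of training."""
    train = [tr for tr in trials
             if tr["vowel"] in pair and tr["speaker"] == train_speaker]
    test = [tr for tr in trials
            if tr["vowel"] in pair and tr["speaker"] != train_speaker]
    return train, test

train, test = vowel_generalization_split(trials)
print(len(train), "training trials,", len(test), "test trials")
```

Because training and test sets share no speaker, any successful classification must rely on speaker-invariant pattern information, which is the point of the analysis.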

The abstract nature of the estimated cortical representations is illustrated in Figs. 3, D and E, and 4. First, we visualized the speaker-invariant cortical representation of vowels and the vowel-invariant representation of speakers using multidimensional displays [fingerprints (Fig. 3, D and E)]. Second, we visualized the relation between the discriminative patterns of activations for the nine conditions using self-organizing maps (SOMs) (21), which convert complex relations between high-dimensional items into simple geometric relations. The spatial proximity and grouping of the conditions in the SOM-based two-dimensional display thus reflect the level of abstraction and categorical information entailed in the discriminative patterns of vowels (Fig. 4, A and B) and speakers (Fig. 4, C and D). To investigate which acoustic features in the original sounds drive this neural abstraction, we examined the relative distance between the brain-based representations of the stimuli and their description in terms of typical acoustic features (formants, Fig. 1B). We found that the distances between the cortical representations of the sounds correlated best with a description of the stimulus based on the first two formants (F1, F2) in the case of vowels [r = 0.75, P = 2 × 10^–7 (Fig. 4E and fig. S5)] and on the fundamental frequency (F0) in the case of speakers [r = 0.64, P = 2 × 10^–5 (Fig. 4F and fig. S5)]. These results provide empirical support for cognitive models of speech and voice processing postulating the existence of intermediate computational entities resulting from the transformation of relevant acoustic features [the (F1, F2) pair for vowels and F0 for speakers] and the suppression of the irrelevant ones.
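The distance-correlation analysis in the paragraph above can be sketched as follows. This is an illustrative reconstruction only: the (F1, F2) values are rough textbook figures for the three vowels, not the paper's measurements, and the "activation patterns" are synthetic vectors built to encode the formants, so the resulting correlation is high by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (F1, F2) formant values in Hz for /a/, /i/, /u/ --
# rough textbook figures, not the measurements from the study.
formants = {"a": (750.0, 1200.0), "i": (300.0, 2300.0), "u": (350.0, 800.0)}
vowels = list(formants)

# Toy "activation patterns": each vowel's pattern linearly encodes its
# formants (tiled over 10 voxel pairs) plus a little measurement noise.
patterns = {
    v: np.tile(np.array(formants[v]), 10) + rng.normal(0.0, 1.0, 20)
    for v in vowels
}

def pairwise_distances(vectors):
    """Condensed vector of Euclidean distances between all item pairs."""
    keys = list(vectors)
    return np.array([
        np.linalg.norm(np.asarray(vectors[a]) - np.asarray(vectors[b]))
        for i, a in enumerate(keys) for b in keys[i + 1:]
    ])

d_brain = pairwise_distances(patterns)
d_acoustic = pairwise_distances({v: np.array(formants[v]) for v in vowels})

# Pearson correlation between neural and acoustic distance structure.
r = np.corrcoef(d_brain, d_acoustic)[0, 1]
print(f"correlation between neural and (F1, F2) distances: r = {r:.2f}")
```

In the actual analysis the pattern distances come from measured fMRI responses, so the strength of the correlation with a given feature space (formants versus F0) is what reveals which acoustic dimensions the cortical representation preserves.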

Fig. 4.

Visualization of the brain-based representation of the sounds and relation with acoustical features. (A to D) SOM-based display of the discriminative patterns in a single-subject (subject 1, A and C) and in the group of seven subjects (B and D) for vowel (A and B) and speaker learning (C and D). (E and F) Relation between normalized distances of the multidimensional auditory cortical activation patterns and normalized distances of the vowels in the (F1, F2) space of formants (E) and of the speakers in the space of fundamental frequency (F0) (F).

Our findings demonstrate that an abstract representation of a vowel or speaker emerges from the joint encoding of information occurring not only in specialized higher-level regions but also in auditory regions, which—because of their anatomical connectivity and response properties—have been associated with early stages of sound processing. This is in agreement with recent neurophysiological findings indicating that neurons in early auditory regions may exhibit complex spectrotemporal receptive fields and may participate in high-level encoding of auditory objects (25–29), e.g., via local feedback loops and reentrant processing. Taken together, these results prompt a revision of models of phoneme and voice abstraction, which assume that a hierarchy of processing steps is “mapped” onto a functional hierarchy of specialized neural modules.

In conclusion, we demonstrated the feasibility of decoding speech content and speaker identity from observations of the listener's auditory cortical activation patterns. Our analyses provided a detailed empirical demonstration of how the human brain forms the computationally efficient representations required for speech comprehension and speaker identification. Our experimental settings, however, were restricted to three vowels and three speakers; furthermore, all sounds were presented in isolation to obtain distinct fMRI activation patterns. Extending these results to the identification of words or word sequences in longer speech streams provides a compelling challenge and will contribute to creating a general brain-based decoder of sounds in real-life situations.

Supporting Online Material

Materials and Methods

Supporting Text

Figs. S1 to S5


References and Notes
