Sensorimotor Adaptation in Speech Production


Science  20 Feb 1998:
Vol. 279, Issue 5354, pp. 1213-1216
DOI: 10.1126/science.279.5354.1213


Human subjects are known to adapt their motor behavior to a shift of the visual field brought about by wearing prism glasses over their eyes. The analog of this phenomenon was studied in the speech domain. Using a device that can feed back transformed speech signals in real time, subjects were exposed to phonetically sensible, online perturbations of their own speech patterns. It was found that speakers learn to adjust their production of a vowel to compensate for feedback alterations that change the vowel's perceived phonetic identity; moreover, the effect generalizes across phonetic contexts and to different vowels.

When human subjects are asked to reach to a visual target while wearing displacing prisms over their eyes, they are observed to miss the target initially, but to adapt rapidly such that within a few movements their reaching appears once again to be rapid and natural. Moreover, when the displacing prisms are subsequently removed, subjects are observed to show an aftereffect; in particular, they miss the target in the direction opposite to the displacement. This basic result has provided an important tool for investigating the nature of the sensorimotor control system and its adaptive response to perturbations (1).

The experiment described in this report is based on an analogy between reaching movements in limb control and articulatory movements in speech production. Although reaching and speaking are qualitatively very different motor acts, they nonetheless share the similarity of having sensory goals—reaching movements are made to touch or grasp a target, and articulatory movements are made to produce a desired acoustic pattern. It is therefore reasonable to ask whether the speech motor control system might also respond adaptively to alterations of sensory feedback (2). However, beyond the intrinsic interest of speech motor control and the importance of discovering commonalities between different effector systems, there are also advantages to studying sensorimotor adaptation in the speech domain. Whereas in arm movement research there is little agreement as to the nature of the underlying discrete units of complex movements (and indeed there is controversy as to whether or not such discrete units exist), in speech there is substantial evidence regarding an underlying discrete control system. In particular, the disciplines of phonology and phonetics have provided linguistic and psychological evidence for the existence of discrete units such as syllables (3), phonemes (4), and features (5). There are still major controversies, however, regarding the role of such discrete units in the online control of speech production (6). An important reason for the lack of agreement is methodological; in particular, there is no agreed-upon methodology for decomposing articulatory and acoustic records into segments that might be identified with underlying control structures. Thus, while linguistic and psychological evidence has provided useful hypotheses as to the putative discrete control structures underlying speech motor control, it has proven difficult to evaluate these hypotheses directly in experiments on speech motor control.

Our research provides a new line of attack on this problem. In an adaptation paradigm, we can expose subjects to acoustic perturbations of their articulatory output in one linguistic context and ask whether any adaptation that is found transfers to another linguistic context. For example, if the formants of the vowel [ɛ] are altered in the context of “pep,” we can ask whether adaptation generalizes to [ɛ] in the context of “set” or in the context of “forget.” We can also ask whether adaptation is observed for other vowels. Such manipulations provide a direct probe of the putative hierarchical, segmental control of speech production.

We built an apparatus to alter subjects' feedback in real time (Fig. 1). The apparatus allows us to shift formant frequencies independently so as to impose arbitrary perturbations on the speech signal within the two-dimensional (F1, F2) formant space (7-9). This apparatus was used in an experiment in which a subject whispered 4220 prompted words over approximately 2 hours. The experiment consisted of the following: a 10-min acclimation phase; a 17-min baseline phase; a 20-min ramp phase; a 1-hour training phase; and a 17-min test phase. During the ramp phase, the feedback heard by the subject was increasingly altered, reaching a maximal alteration strength at which it was held for the duration of the training and test phases (10).

Figure 1

The apparatus used in the experiments. CVC words were prompted on the personal computer (PC) video monitor. Subjects were instructed to whisper the word; we used whispered speech to minimize the effects of bone conduction, which are strong in voiced speech. While the subject whispered, the speech signal was picked up by a microphone and sent to a digital signal processing (DSP) board in the PC. The DSP board processed successive intervals of the subject's speech into synthesized, formant-altered feedback with only a 16-ms processing delay [such a delay is nondisruptive; see reference to DAF in (2)]. Each interval was first analyzed into a 64-channel, 4 kHz–wide magnitude spectrum from which formants (which are generally peaks in the spectrum) were estimated (all graphs are schematic plots of magnitude versus frequency). The frequencies of the three lowest frequency formants (F1, F2, and F3) were then shifted to implement a desired feedback alteration (as explained below). The shifted formants were then used to synthesize formant-altered whispered speech. This synthesized speech was fed back to the subject via earphones at sufficient volume that he essentially heard only the synthesized feedback of his whispering.
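The analyze–shift–resynthesize loop described above can be sketched in simplified form. The sketch below is illustrative only: the Gaussian test spectrum, the naive peak-picking rule, and the 200-Hz shift are hypothetical stand-ins for the apparatus's actual DSP processing, which ran on dedicated hardware.

```python
import numpy as np

def estimate_formants(magnitude_spectrum, freqs, n_formants=3):
    """Take the n_formants lowest-frequency local peaks of a magnitude
    spectrum as crude formant estimates (a real tracker would use LPC or
    similar; simple peak picking is enough for this sketch)."""
    peaks = [i for i in range(1, len(magnitude_spectrum) - 1)
             if magnitude_spectrum[i] > magnitude_spectrum[i - 1]
             and magnitude_spectrum[i] >= magnitude_spectrum[i + 1]]
    return [float(freqs[i]) for i in peaks[:n_formants]]

# A 64-channel, 4 kHz-wide spectrum, as in the apparatus description;
# the three Gaussian bumps are hypothetical formants near 500, 1800,
# and 2500 Hz.
freqs = np.linspace(0.0, 4000.0, 64)
spectrum = (np.exp(-((freqs - 500.0) / 150.0) ** 2)
            + 0.8 * np.exp(-((freqs - 1800.0) / 200.0) ** 2)
            + 0.6 * np.exp(-((freqs - 2500.0) / 250.0) ** 2))

formants = estimate_formants(spectrum, freqs)
shift = 200.0  # Hz; an arbitrary example perturbation
shifted = [f + shift for f in formants]  # these would drive resynthesis
```

In the actual apparatus, the shifted formants drive a formant synthesizer whose output replaces the subject's air-conducted feedback within 16 ms.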

During the experiment, the subject was prompted to produce words randomly selected from two different sets: a set of training words (in which adaptation was induced) and a set of testing words (in which carryover of the training word adaptation was measured). Test and training words were interspersed with one another throughout the experiment. However, only when the subject produced training words was he exposed to the altered feedback. The training words were all bilabial consonant-vowel-consonants (CVC) with [ɛ] as the vowel (“pep,” “peb,” “bep,” and “beb”) and the subject produced them while hearing either feedback of his whispering or masking noise that blocked his auditory feedback (11). The set of testing words was divided into two subsets, each designed to assess different types of carryover of the training word adaptation. Three of the testing words—“peg,” “gep,” and “teg”—were included to determine if the adaptation of [ɛ] in the bilabial training word context carried over to [ɛ] in different word contexts. The remaining testing words—“pip,” “peep,” “pap,” and “pop”—were included to determine if the adaptation of [ɛ] caused similar production changes in other vowels.

Eight male Massachusetts Institute of Technology (MIT) students participated in the study. All were native speakers of North American English and all were naïve to the purpose of the study (12). Each was run in the adaptation experiment and also in a control experiment that was identical to the adaptation experiment except that no feedback perturbations were introduced.

Figure 2 shows the feedback transformations and resulting compensation and adaptation for a single subject. The diamonds show mean formant positions of the subject's productions of the vowels [i], [ι], [ɛ], [æ], and [α], as measured in a pretest procedure several days before the actual adaptation experiment. Formants were shifted along the path linking the positions of these vowels (dotted line) (13). Formants were shifted in one direction along this path for half the subjects; they were shifted in the opposite direction for the other subjects. The formant shifts were large enough that if the subject produced [ɛ], he heard either [i] or [α], depending on the direction of shift.

Figure 2

Altered feedback and resulting compensation and adaptation for a single subject (subject OB).

For the subject in Fig. 2, formants were shifted toward [i]. Formants were shifted in proportion to the spacing between vowels on the path: If the subject produced [ɛ] his formants were shifted so he heard [i]; if he produced [æ] he heard [ι]; and if he produced [α] he heard [ɛ]. Position B (Fig. 2) corresponds to the mean vowel formants for the training words produced by the subject in the baseline phase of the adaptation experiment. B′ shows the formants presented to the subject as a result of the altered feedback.
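The proportional shift along the vowel path can be made concrete with a small sketch. The (F1, F2) vowel positions below are hypothetical (the real values were measured per subject in the pretest), and the mapping moves each production two vowel steps toward [i] by piecewise-linear interpolation of an along-path coordinate, so that [ɛ] is heard as [i], [æ] as [ι], and [α] as [ɛ]:

```python
import numpy as np

# Hypothetical mean (F1, F2) positions in Hz for [i], [I], [E], [ae], [A],
# ordered along the path linking them; not the paper's measured values.
vowels = np.array([[400.0, 2600.0],
                   [550.0, 2300.0],
                   [700.0, 2000.0],
                   [850.0, 1700.0],
                   [950.0, 1300.0]])

# Arc-length coordinate of each vowel along the path
seg = np.linalg.norm(np.diff(vowels, axis=0), axis=1)
s = np.concatenate([[0.0], np.cumsum(seg)])

def shifted_position(s_prod):
    """Map the along-path coordinate of a production to the coordinate of
    its feedback, two vowel steps toward [i] (productions below [E] on the
    path simply clamp to [i] in this sketch).  For the other half of the
    subjects the shift ran the opposite way along the path."""
    return np.interp(s_prod, s[2:], s[:3])

def path_point(s_coord):
    """(F1, F2) of the point at along-path coordinate s_coord."""
    return np.array([np.interp(s_coord, s, vowels[:, 0]),
                     np.interp(s_coord, s, vowels[:, 1])])
```

With this mapping, a production at the [ɛ] position is fed back at the [i] position, and a production at [α] is fed back at [ɛ], with intermediate productions shifted in proportion to the local inter-vowel spacing.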

The arrow labeled “compensation” shows the subject's compensation to the altered feedback: in response to hearing B as B′, the subject has, by the test phase of the experiment, changed his production of B to T. The arrow labeled “altered feedback” shows that the alteration causes the subject to hear this production change as a shift from B′ to T′; by the experiment's test phase, the subject thus hears his formants at T′, close to the baseline B, and has therefore compensated for the altered feedback. The arrow labeled “adaptation” shows how much of the compensation is retained when the feedback is blocked by noise (in this case, about 72%).
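These measures can be illustrated numerically. The (F1, F2) values below are hypothetical, chosen only to mirror the geometry of Fig. 2; one natural way to quantify the arrows is to project each production change onto the relevant reference direction:

```python
import numpy as np

# Hypothetical (F1, F2) means in Hz, loosely following Fig. 2's geometry;
# the paper's actual values are per-subject measurements.
B  = np.array([700.0, 2000.0])   # baseline production of the training vowel
Bp = np.array([400.0, 2600.0])   # B as heard under the altered feedback (B')
T  = np.array([880.0, 1640.0])   # production by the test phase, with feedback
A  = np.array([830.0, 1740.0])   # production with feedback blocked by noise

pert = Bp - B                    # what the apparatus adds to each production
compensation = T - B             # production change under altered feedback

# Fraction of full compensation: project the production change onto the
# direction opposite the perturbation (full compensation would equal -pert).
comp_frac = np.dot(compensation, -pert) / np.dot(pert, pert)

# Adaptation: fraction of the compensation retained without feedback.
adapt_frac = np.dot(A - B, compensation) / np.dot(compensation, compensation)
```

With these example numbers the subject compensates for 60% of the perturbation and retains about 72% of that compensation when feedback is masked, matching the magnitude discussed for subject OB.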

The analysis of mean compensation and mean adaptation across subjects is shown in Fig. 3 (14). The figure shows that the majority of subjects significantly compensated (P < 0.006) and adapted (P < 0.023) (15). The figure also shows other features commonly seen in adaptation experiments in the reaching domain: compensation varies across subjects, each subject compensates more than he adapts, and subjects who tend to compensate more also tend to adapt more.

Figure 3

Mean compensation (top) and adaptation (bottom) for all subjects (designated CW through AH) in the adaptation (black bars) and control (white bars) experiments.

Figure 4 shows mean generalization for the test words—a ratio expressing the fraction of the adaptation of [ɛ] in the training words that carried over to the vowel production in a testing word (16). Adaptation to the training set affected the production of the vowels in test words containing the same vowel but in different consonant contexts (Fig. 4A). Overall, there is significant generalization of the training word adaptation to these test words (P < 0.040) (17). However, the apparently greater mean generalization to “peg” than to “gep” and “teg” is not statistically significant. This lack of significance is traceable to coarticulatory influences that caused imperfect estimates of steady-state vowel formants of [ɛ] in “gep” and “teg”.
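One plausible reading of this ratio (its exact definition is given in the paper's note 16, which is not reproduced here) is the projection of a test word's vowel-formant change onto the training words' adaptation, normalized by the adaptation's magnitude:

```python
import numpy as np

def generalization(train_base, train_adapted, test_base, test_adapted):
    """Fraction of the training-word adaptation carried over to a test
    word: project the test word's (F1, F2) change onto the training
    words' adaptation vector and normalize by its squared length.
    This is an illustrative reconstruction, not the paper's definition."""
    train_shift = np.asarray(train_adapted) - np.asarray(train_base)
    test_shift = np.asarray(test_adapted) - np.asarray(test_base)
    return float(np.dot(test_shift, train_shift)
                 / np.dot(train_shift, train_shift))

# Hypothetical example: the test word shows half the training shift.
r = generalization([700.0, 2000.0], [830.0, 1740.0],
                   [700.0, 2050.0], [765.0, 1920.0])
```

A ratio of 1 would mean the test word's vowel moved exactly as far as the trained vowel; 0 would mean no carryover.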

Figure 4

Mean generalization for the analyzable testing words in the experiment. Shown are (A) words with the same vowel ([ɛ]) used in the training words, but different consonants; and (B) words with different vowels.

Adaptation to the training set affected the production of the vowels in words containing different vowels (Fig. 4B) (18). Again, there is overall significant generalization of the training word adaptation to these test words (P < 0.013), but again, the apparent differences in mean generalization between the words is not statistically significant.

In summary, our experimental results show that control of the production of vowels adapts to perturbations of auditory feedback. This adaptation is analogous to the adaptation seen in the control of reaching. Moreover, the generalization observed for [ɛ] in the testing words provides direct evidence that the testing and the training words share a common representation of the production of [ɛ]; it is of course natural to hypothesize that this common representation is the phoneme [ɛ]. Finally, the significant generalization to “pip” and “pap” considered together shows that the adaptation of a vowel can spread not only across contexts but also to other vowels. This suggests that the control process underlying the production of the trained vowel is partially shared in the control of the productions of other vowels; moreover, it is natural to attempt to identify these control structures with the featural decompositions studied in phonology.

* To whom correspondence should be addressed. E-mail: houde{at}

Present address: University of California San Francisco, Keck Center, 513 Parnassus Avenue, S-877, San Francisco, CA 94143–0732, USA.

