Unlocking the Potential of the Spoken Word

Science  26 Sep 2008:
Vol. 321, Issue 5897, pp. 1787-1788
DOI: 10.1126/science.1157353

The best available evidence suggests that the human brain, and the human facility for language, were already well developed by at least 50,000 years ago (1). For most of the time since then, the spoken word provided the only practical way of using language to share our understanding of the world with others. To this day, people find spoken expression and its visual correlates (such as gesture and facial expression) to be a fluid and compelling way of communicating. It was the invention of writing, however, that ignited the continuing cycle of innovation that we associate with modern society. We now stand at the threshold of a new era, one in which the spoken word can again rise to prominence.

About 5000 years ago, we see the first indications of the emergence of written language (2). Writing has important features that the spoken word lacks, including a degree of permanence that can help to overcome some limitations of human memory. It rapidly proliferated well beyond mere commercial records to play a multifaceted role in complex forms of social organization. This proliferation inspired other innovations: ways of finding documents again, and ways of writing that conveyed the needed context to a reader. The written word also has other attractive qualities (for example, you can read at your own pace), but permanence, findability, and contextualization are responsible for its foundational role in human civilization.

For the past century and a half, inventors have chipped away at those advantages. The earliest known recording of a human voice was made in 1860 with Édouard-Léon Scott de Martinville's phonautograph, although it was not until Edison's better-known phonograph of 1877 that the human voice could also be reproduced using technology from the same era (3). Later technologies, from wire recorders through reel-to-reel tape recording, were widely adopted for commercial purposes. It was, however, the introduction of the compact cassette in 1963 that ultimately made sound recording technology robust and affordable. By the end of that decade, ordinary people could record hundreds of hours of speech, for media costs of about a dollar an hour. Today, digitized speech is easily acquired (for example, using any of the world's 2.5 billion mobile phones), easily transferred over digital networks, and easily stored, all for just a few cents per hour. It would take just $100 or so of networked disk storage to record everything that you will speak or hear this year.
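The two cost figures above can be checked against each other with simple arithmetic. The sketch below is only illustrative: it assumes that "a few cents per hour" means 2 or 3 cents (a hypothetical reading, since the text gives no exact figure) and asks how many hours of speech a $100 yearly budget would then cover. The function names are invented for this example.

```python
DAYS_PER_YEAR = 365

def hours_recordable(budget_dollars, cents_per_hour):
    """Hours of stored speech that a budget buys at a given storage cost."""
    return budget_dollars * 100 / cents_per_hour

def hours_per_day(budget_dollars, cents_per_hour):
    """The same total, spread evenly over every day of one year."""
    return hours_recordable(budget_dollars, cents_per_hour) / DAYS_PER_YEAR

# Assumed readings of "a few cents per hour": 2 and 3 cents (not from the text).
for cents in (2, 3):
    total = hours_recordable(100, cents)
    daily = hours_per_day(100, cents)
    print(f"{cents} cents/hour: {total:.0f} hours/year, {daily:.1f} hours/day")
```

Under these assumed rates, $100 covers roughly 9 to 14 hours of recording per day, which is on the order of a person's waking hours. The "$100 or so" figure is therefore consistent with storage costs of a few cents per hour.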

Digital storage is a great equalizer with regard to permanence: The same infrastructure that can reliably store digital text can equally well store digital speech. Why, then, do we not record our lives in this way? Actually, some people do. For example, researchers at Carnegie Mellon University crafted a memory aid by recording their side of conversations and then using face recognition to cue up audio from an earlier meeting—no more forgetting people's names (4)! Gordon Bell at Microsoft has gone further, assembling digital materials from his entire life (5). This works well for some things (such as e-mail), but speech is not one of them—searching through large collections of spontaneously produced speech has remained a challenge.

This situation is about to change. Commercial “media management” systems can now reliably find specific content in the well-articulated speech of news announcers, and laboratory systems can handle much of the substantial variation in speaking styles that has made automatic transcription of interviews, meetings, and telephone conversations difficult. Hardware costs are higher for speech than for born-digital text (around a factor of 100 for storage, and perhaps a factor of 1000 for processing), but it is possible today to acquire, store, and process digitized speech at lower cost than was possible for born-digital text at the dawn of the Web. Robust accommodation of noisy environments and unfamiliar words remains an important challenge, however, limiting the tasks to which present speech technology can be applied.


As increasingly capable systems emerge from the laboratory, we will soon find ourselves in a world in which speech need no longer be ephemeral. How will that change our society? No one can know for sure, but it is not difficult to envision some questions that might arise. The Carnegie Mellon system recorded only one side of the conversation because it is illegal in Pennsylvania (and 11 other U.S. states) to record full conversations without the explicit consent of all parties. Will a new balance between social costs and benefits lead us to think in more nuanced ways about when recording conversations should be permissible, just as many of us have learned to think differently about e-mail privacy at home and at the office? The wide diffusion of writing required standardization to facilitate mutual intelligibility. Will increasingly broad dissemination of spoken language accelerate the demise of regional dialects and less widely spoken languages? Written contracts today have greater legal standing than verbal ones. Will that distinction persist in a world in which spoken and written words have equal permanence? How can we harness this new technology to accelerate access to new knowledge, and what would be the implications of the resulting compression of innovation cycles?

Our parents complained that our generation relied on calculators rather than learning arithmetic. Will we complain when our grandchildren rely on speech-enabled systems rather than learning to read and write? Near-universal literacy has been one of humankind's greatest accomplishments, with 82% of the world's adult population now able to read and write. But it was the ephemeral nature of speech that gave rise to the imperative for literacy, and it is intriguing to imagine what will happen as that imperative abates. In Plato's Phaedrus, the Pharaoh Thamus says of writing, “If men learn this, it will implant forgetfulness in their souls: They will cease to exercise memory because they rely on that which is written” (6). Plato could not anticipate all the ways in which writing would be used for so much more than merely to augment memory—from an Internet that transports ideas through time and space, to great works of literature that transport our imagination to places that do not exist. What would a modern-day Plato have to say about the rise of speech to stand alongside writing as a cornerstone for our society? Our generation will unlock the full potential of the spoken word, but it may fall to our children, and to their children, to learn how best to use that gift.


