Every language has phonological rules, but what are they for? You may have noticed that they often make articulation easier. Flapping a t or a d between two vowels is faster than keeping the tongue in place long enough for air pressure to build up. Spreading voicelessness from the end of a word to its suffix spares the talker from having to turn the larynx off while pronouncing the end of the stem and then turn it back on again for the suffix. At first glance, phonological rules seem to be a mere summary of articulatory laziness. And from here it is a small step to notice phonological adjustments in some dialect other than one’s own and conclude that they typify the slovenliness of the speakers. Neither side of the Atlantic is safe. George Bernard Shaw wrote:
The English have no respect for their language and will not teach their children to speak it. They cannot spell it because they have nothing to spell it with but an old foreign alphabet of which only the consonants—and not all of them—have any agreed speech value. Consequently it is impossible for an Englishman to open his mouth without making some other Englishman despise him.
In his article “Howta Reckanize American Slurvian,” Richard Lederer writes:
Language lovers have long bewailed the sad state of pronunciation and articulation in the United States. Both in sorrow and in anger, speakers afflicted with sensitive ears wince at such mumblings as guvmint for government and assessories for accessories. Indeed, everywhere we turn we are assaulted by a slew of slurrings.
But if their ears were even more sensitive, these sorrowful speakers might notice that in fact there is no dialect in which sloppiness prevails. Phonological rules give with one hand and take away with the other. The same bumpkins who are derided for dropping g’s in Nothin’ doin’ are likely to enunciate the vowels in pólice and accidént that pointy-headed intellectuals reduce to a neutral “uh” sound. When the Brooklyn Dodgers pitcher Waite Hoyt was hit by a ball, a fan in the bleachers shouted, “Hurt’s hoit!” Bostonians who pahk their cah in Hahvahd Yahd name their daughters Sheiler and Linder. In 1992 an ordinance was proposed that would have banned the hiring of any immigrant teacher who “speaks with an accent” in—I am not making this up—Westfield, Massachusetts. An incredulous woman wrote to the Boston Globe recalling how her native New England teacher defined “homonym” using the example orphan and often. Another amused reader remembered incurring the teacher’s wrath when he spelled “cuh-rée-uh” k-o-r-e-a and “cuh-rée-ur” c-a-r-e-e-r, rather than vice versa. The proposal was quickly withdrawn.
There is a good reason why so-called laziness in pronunciation is in fact tightly regulated by phonological rules, and why, as a consequence, no dialect allows its speakers to cut corners at will. Every act of sloppiness on the part of a speaker demands a compensating measure of mental effort on the part of the conversational partner. A society of lazy talkers would be a society of hard-working listeners. If speakers were to have their way, all rules of phonology would spread and reduce and delete. But if listeners were to have their way, phonology would do the opposite: it would enhance the acoustic differences between confusable phonemes by forcing speakers to exaggerate or embroider them. And indeed, many rules of phonology do that. (For example, there is a rule that forces English speakers to round their lips while saying sh but not while saying s. The benefit of forcing everyone to make this extra gesture is that the long resonant chamber formed by the pursed lips enhances the lower-frequency noise that distinguishes sh from s, allowing for easier identification of the sh by the listener.) Although every speaker soon becomes a listener, human hypocrisy would make it unwise to depend on the speaker’s foresight and consideration. Instead, a single, partly arbitrary set of phonological rules, some reducing, some enhancing, is adopted by every member of a linguistic community when he or she acquires the local dialect as a child.
Phonological rules help listeners even when they do not exaggerate some acoustic difference. By making speech patterns predictable, they add redundancy to a language; English text has been estimated as being between two and four times as long as it has to be for its information content. For example, this book takes up about 900,000 characters on my computer disk, but my file compression program can exploit the redundancy in the letter sequences and squeeze it into about 400,000 characters; computer files that do not contain English text cannot be squished nearly that much. The logician Quine explains why many systems have redundancy built in:
It is the judicious excess over minimum requisite support. It is why a good bridge does not crumble when subjected to stress beyond what reasonably could have been foreseen. It is fallback and failsafe. It is why we address our mail to city and state in so many words, despite the zip code. One indistinct digit in the zip code would spoil everything…. A kingdom, legend tells us, was lost for want of a horseshoe nail. Redundancy is our safeguard against such instability.
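To put a rough number on the compression claim above, here is a minimal sketch, again mine rather than Pinker's actual measurement; the filename book.txt is a hypothetical stand-in for any plain-text file of ordinary English prose. Python's standard zlib module exploits the redundancy in the letter sequences, while an equal amount of random bytes, having no redundancy, barely shrinks at all.

```python
# Minimal sketch (not the author's measurement): compress English prose and
# an equal amount of random bytes with zlib and compare the results.
# "book.txt" is a hypothetical plain-text file; substitute any prose on hand.
import os
import zlib

with open("book.txt", "rb") as f:
    english = f.read()

noise = os.urandom(len(english))  # random bytes give the compressor nothing to predict

for label, data in [("English prose", english), ("random bytes", noise)]:
    packed = zlib.compress(data, 9)
    print(f"{label}: {len(data):,} -> {len(packed):,} bytes "
          f"({len(packed) / len(data):.0%} of original)")
```

Ordinary prose typically squeezes to roughly half its original size or less, consistent with the two-to-four-times estimate, while the random bytes stay close to their full length.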
Thanks to the redundancy of language, yxx cxn xndxrstxnd whxt x xm wrxtxng xvxn xf x rxplxcx xll thx vxwxls wxth xn “x” (t gts lttl hrdr f y dn’t vn kn whr th vwls r). In the comprehension of speech, the redundancy conferred by phonological rules can compensate for some of the ambiguity in the sound wave. For example, a listener can know that “thisrip” must be this rip and not the srip because the English consonant cluster sr is illegal.
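A toy sketch of that last point, using a deliberately tiny inventory of legal word-initial clusters (my own illustrative list, with hypothetical names like LEGAL_ONSETS and plausible_splits, not a full grammar of English onsets), shows how phonotactic knowledge lets a listener discard the impossible reading of “thisrip”:

```python
# Toy sketch: phonotactic redundancy as a filter on word segmentations.
# LEGAL_ONSETS is a small illustrative sample of consonant clusters that can
# begin an English word, not a complete inventory.
LEGAL_ONSETS = {"", "th", "r", "s", "sp", "st", "sk", "sl", "sn", "sm", "str", "spr"}
VOWELS = set("aeiou")

def onset(word):
    """The consonants a word starts with, up to its first vowel."""
    cluster = ""
    for ch in word:
        if ch in VOWELS:
            break
        cluster += ch
    return cluster

def plausible_splits(candidates):
    """Keep only segmentations in which every word begins with a legal onset."""
    return [words for words in candidates
            if all(onset(w) in LEGAL_ONSETS for w in words)]

# "thisrip" could be cut as "this rip" or "the srip"; only the first survives,
# because no English word may begin with the cluster "sr".
print(plausible_splits([("this", "rip"), ("the", "srip")]))  # [('this', 'rip')]
```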
So why is it that a nation that can put a man on the moon cannot build a computer that can take dictation? According to what I have explained so far, each phoneme should have a telltale acoustic signature: a set of resonances for vowels, a noise band for fricatives, a silence-burst-transition sequence for stops. The sequences of phonemes are massaged in predictable ways by ordered phonological rules, whose effects could presumably be undone by applying them in reverse.
The reason that speech recognition is so hard is that there’s many a slip ’twixt brain and lip. No two people’s voices are alike, either in the shape of the vocal tract that sculpts the sounds, or in the person’s precise habits of articulation. Phonemes also sound very different depending on how much they are stressed and how quickly they are spoken; in rapid speech, many are swallowed outright.
But the main reason an electric stenographer is not just around the corner has to do with a general phenomenon in muscle control called coarticulation. Put a saucer in front of you and a coffee cup a foot or so away from it on one side. Now quickly touch the saucer and pick up the cup. You probably touched the saucer at the edge nearest the cup, not dead center. Your fingers probably assumed the handle-grasping posture while your hand was making its way to the cup, well before it arrived. This graceful smoothing and overlapping of gestures is ubiquitous in motor control. It reduces the forces necessary to move body parts around and lessens the wear and tear on the joints. The tongue and throat are no different. When we want to articulate a phoneme, our tongue cannot assume the target posture instantaneously; it is a heavy slab of meat that takes time to heft into place. So while we are moving it, our brains are anticipating the next posture in planning the trajectory, just like the cup-and-saucer maneuver. Among the range of positions in the mouth that can define a phoneme, we place the tongue in the one that offers the shortest path to the target for the next phoneme. If the current phoneme does not specify where a speech organ should be, we anticipate where the next phoneme wants it to be and put it there in advance. Most of us are completely unaware of these adjustments until they are called to our attention. Say Cape Cod. Until now you probably never noticed that your tongue body is in different positions for the two k sounds. In horseshoe, the first s becomes a sh; in NPR, the n becomes an m; in month and width, the n and d are articulated at the teeth, not the usual gum ridge.
Because sound waves are minutely sensitive to the shapes of the cavities they pass through, this coarticulation wreaks havoc with the speech sound. Each phoneme’s sound signature is colored by the phonemes that come before and after, sometimes to the point of having nothing in common with its sound signature in the company of a different set of phonemes. That is why you cannot cut up a tape of the sound cat and hope to find a beginning piece that contains the k alone. As you make earlier and earlier cuts, the piece may go from sounding like ka to sounding like a chirp or whistle. This shingling of phonemes in the speech stream could, in principle, be a boon to an optimally designed speech recognizer. Consonants and vowels are being signaled simultaneously, greatly increasing the rate of phonemes per second, as I noted at the beginning of this chapter, and there are many redundant sound cues to a given phoneme. But this advantage can be enjoyed only by a high-tech speech recognizer, one that has some kind of knowledge of how vocal tracts blend sounds.
The human brain, of course, is a high-tech speech recognizer, but no one knows how it succeeds. For this reason psychologists who study speech perception and engineers who build speech recognition machines keep a close eye on each other’s work. Speech recognition may be so hard that there are only a few ways it could be solved in principle. If so, the way the brain does it may offer hints as to the best way to build a machine to do it, and how a successful machine does it may suggest hypotheses about how the brain does it.
Early in the history of speech research, it became clear that human listeners might somehow take advantage of their expectations of the kinds of things a speaker is likely to say. This could narrow down the alternatives left open by the acoustic analysis of the speech signal. We have already noted that the rules of phonology provide one sort of redundancy that can be exploited, but people might go even farther. The psychologist George Miller played tapes of sentences in background noise and asked people to repeat back exactly what they heard. Some of the sentences followed the rules of English syntax and made sense.
Furry wildcats fight furious battles.
Respectable jewelers give accurate appraisals.
Lighted cigarettes create smoky fumes.
Gallant gentlemen save distressed damsels.
Soapy detergents dissolve greasy stains.
Others were created by scrambling the words within phrases to create colorless-green-ideas sentences, grammatical but nonsensical:
Furry jewelers create distressed stains.
Respectable cigarettes save greasy battles.
Lighted gentlemen dissolve furious appraisals.
Gallant detergents fight accurate fumes.
Soapy wildcats give smoky damsels.
A third kind was created by scrambling the phrase structure but keeping related words together, as in
Furry fight furious wildcat battles.
Jewelers respectable appraisals accurate give.
Finally, some sentences were utter word salad, like
Furry create distressed jewelers stains.
Cigarettes respectable battles greasy save.
People did best with the grammatical sensible sentences, worse with the grammatical nonsense and the ungrammatical sense, and worst of all with the ungrammatical nonsense. A few years later the psychologist Richard Warren taped sentences like The state governors met with their respective legislatures convening in the capital city, excised the first s from legislatures, and spliced in a cough. Listeners could not tell that any sound was missing.
If one thinks of the sound wave as sitting at the bottom of a hierarchy from sounds to phonemes to words to phrases to the meanings of sentences to general knowledge, these demonstrations seem to imply that human speech perception works from the top down rather than just from the bottom up. Maybe we are constantly guessing what a speaker will say next, using every scrap of conscious and unconscious knowledge at our disposal, from how coarticulation distorts sounds, to the rules of English phonology, to the rules of English syntax, to stereotypes about who tends to do what to whom in the world, to hunches about what our conversational partner has in mind at that very moment. If the expectations are accurate enough, the acoustic analysis can be fairly crude; what the sound wave lacks, the context can fill in. For example, if you are listening to a discussion about the destruction of ecological habitats, you might be on the lookout for words pertaining to threatened animals and plants, and then when you hear speech sounds whose phonemes you cannot pick out like “eesees,” you would perceive it correctly as species—unless you are Emily Litella, the hearing-impaired editorialist on Saturday Night Live who argued passionately against the campaign to protect endangered feces. (Indeed, the humor in the Gilda Radner character, who also fulminated against saving Soviet jewelry, stopping violins in the streets, and preserving natural racehorses, comes not from her impairment at the bottom of the speech-processing system but from her ditziness at the top, the level that should have prevented her from arriving at her interpretations.)