What about kinds of things, or categories? Isn’t it true that no two individuals are exactly alike? Yes, but they are not arbitrary collections of properties, either. Things that have long furry ears and tails like pom-poms also tend to eat carrots, scurry into burrows, and breed like, well, rabbits. Lumping objects into categories—giving them a category label in mentalese—allows one, when viewing an entity, to infer some of the properties one cannot directly observe, using the properties one can observe. If Flopsy has long furry ears, he is a “rabbit”; if he is a rabbit, he might scurry into a burrow and quickly make more rabbits.
Moreover, it pays to give objects several labels in mentalese, designating different-sized categories like “cottontail rabbit,” “rabbit,” “mammal,” “animal,” and “living thing.” There is a tradeoff involved in choosing one category over another. It takes less effort to determine that Peter Cottontail is an animal than that he is a cottontail (for example, an animallike motion will suffice for us to recognize that he is an animal, leaving it open whether or not he is a cottontail). But we can predict more new things about Peter if we know he is a cottontail than if we merely know he is an animal. If he is a cottontail, he likes carrots and inhabits open country or woodland clearings; if he is merely an animal, he could eat anything and live anywhere, for all one knows. The middle-sized or “basic-level” category “rabbit” represents a compromise between how easy it is to label something and how much good the label does you.
Finally, why separate the rabbit from the scurry? Presumably because there are predictable consequences of rabbithood that cut across whether it is scurrying, eating, or sleeping: make a loud sound, and in all cases it will be down a hole lickety-split. The consequences of making a loud noise in the presence of lionhood, whether eating or sleeping, are predictably different, and that is a difference that makes a difference. Likewise, scurrying has certain consequences regardless of who is doing it; whether it be rabbit or lion, a scurrier does not remain in the same place for long. With sleeping, a silent approach will generally work to keep a sleeper—rabbit or lion—motionless. Therefore a powerful prognosticator should have separate sets of mental labels for kinds of objects and kinds of actions. That way, it does not have to learn separately what happens when a rabbit scurries, what happens when a lion scurries, what happens when a rabbit sleeps, what happens when a lion sleeps, what happens when a gazelle scurries, what happens when a gazelle sleeps, and on and on; knowing about rabbits and lions and gazelles in general, and scurrying and sleeping in general, will suffice. With m objects and n actions, a knower needn’t go through m × n learning experiences; it can get away with m + n of them.
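To see the arithmetic concretely, here is a toy sketch in Python (mine, not the book’s) of the savings from keeping separate labels for objects and actions:

```python
# A toy illustration of the m + n versus m x n point: a learner with
# separate mental labels for kinds of objects and kinds of actions.
# The particular objects and actions are just the examples from the text.
objects = ["rabbit", "lion", "gazelle"]   # m = 3
actions = ["scurries", "sleeps", "eats"]  # n = 3

# Without separate labels: one learning experience per object-action pair.
pairwise_lessons = [(o, a) for o in objects for a in actions]
print(len(pairwise_lessons))              # 9, i.e. m * n

# With separate labels: one lesson per object, plus one per action.
categorical_lessons = objects + actions
print(len(categorical_lessons))           # 6, i.e. m + n
# With 100 objects and 100 actions the gap is 10,000 versus 200.
```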
So even a wordless thinker does well to chop continuously flowing experience into things, kinds of things, and actions (not to mention places, paths, events, states, kinds of stuff, properties, and other types of concepts). Indeed, experimental studies of baby cognition have shown that infants have the concept of an object before they learn any words for objects, just as we would expect. Well before their first birthday, when first words appear, babies seem to keep track of the bits of stuff that we would call objects: they show surprise if the parts of an object suddenly go their own ways, or if the object magically appears or disappears, passes through another solid object, or hovers in the air without visible means of support.
Attaching words to these concepts, of course, allows one to share one’s hard-won discoveries and insights about the world with the less experienced or the less observant. Figuring out which word to attach to which concept is the gavagai problem, and if infants start out with concepts corresponding to the kinds of meanings that languages use, the problem is partly solved. Laboratory studies confirm that young children assume that certain kinds of concepts get certain types of words, and other kinds of concepts cannot be the meaning of a word at all. The developmental psychologists Ellen Markman and Jeanne Hutchinson gave two- and three-year-old children a set of pictures, and for each picture asked them to “find another one that is the same as this.” Children are intrigued by objects that interact, and when faced with these instructions they tend to select pictures that make groups of role-players like a blue jay and a nest or a dog and a bone. But when Markman and Hutchinson told them to “find another dax that is the same as this dax,” the children’s criterion shifted. A word must label a kind of thing, they seemed to be reasoning, so they put together a bird with another type of bird, a dog with another type of dog. For a child, a dax simply cannot mean “a dog or its bone,” interesting though the combination may be.
Of course, more than one word can be applied to a thing: Peter Cottontail is not only a rabbit but an animal and a cottontail. Children have a bias to interpret nouns as middle-level kinds of objects like “rabbit,” but they also must overcome that bias, to learn other types of words like animal. Children seem to manage this by being in sync with a striking feature of language. Though most common words have many meanings, few meanings have more than one word. That is, homonyms are plentiful, synonyms rare. (Virtually all supposed synonyms have some difference in meaning, however small. For example, skinny and slim differ in their connotation of desirability; policeman and cop differ in formality.) No one really knows why languages are so stingy with words and profligate with meanings, but children seem to expect it (or perhaps it is this expectation that causes it!), and that helps them further with the gavagai problem. If a child already knows a word for a kind of thing, then when another word is used for it, he or she does not take the easy but wrong way and treat it as a synonym. Instead, the child tries out some other possible concept. For example, Markman found that if you show a child a pair of pewter tongs and call it biff, the child interprets biff as meaning tongs in general, showing the usual bias for middle-level objects, so when asked for “more biffs,” the child picks out a pair of plastic tongs. But if you show the child a pewter cup and call it biff, the child does not interpret biff as meaning “cup,” because most children already know a word that means “cup,” namely, cup. Loathing synonyms, the children guess that biff must mean something else, and the stuff the cup is made of is the next most readily available concept. When asked for more biffs, the child chooses a pewter spoon or pewter tongs.
Many other ingenious studies have shown how children home in on the correct meanings for different kinds of words. Once children know some syntax, they can use it to sort out different kinds of meaning. For example, the psychologist Roger Brown showed children a picture of hands kneading a mass of little squares in a bowl. If he asked them, “Can you see any sibbing?,” the children pointed to the hands. If instead he asked them, “Can you see a sib?,” they pointed to the bowl. And if he asked, “Can you see any sib?,” they pointed to the stuff inside the bowl. Other experiments have uncovered great sophistication in children’s understanding of how classes of words fit into sentence structures and how they relate to concepts and kinds.
So what’s in a name? The answer, we have seen, is, a great deal. In the sense of a morphological product, a name is an intricate structure, elegantly assembled by layers of rules and lawful even at its quirkiest. And in the sense of a listeme, a name is a pure symbol, part of a cast of thousands, rapidly acquired because of a harmony between the mind of the child, the mind of the adult, and the texture of reality.
When I was a student I worked in a laboratory at McGill University that studied auditory perception. Using a computer, I would synthesize trains of overlapping tones and determine whether they sounded like one rich sound or two pure ones. One Monday morning I had an odd experience: the tones suddenly turned into a chorus of screaming munchkins. Like this: (beep boop-boop) (beep boop-boop) (beep boop-boop) HUMPTY-DUMPTY-HUMPTY-DUMPTY-HUMPTY-DUMPTY (beep boop-boop) (beep boop-boop) HUMPTY-DUMPTY-HUMPTY-DUMPTY-HUMPTY-HUMPTY-DUMPTY-DUMPTY (beep boop-boop) (beep boop-boop) (beep boop-boop) HUMPTY-DUMPTY (beep boop-boop) HUMPTY-HUMPTY-HUMPTY-DUMPTY (beep boop-boop). I checked the oscilloscope: two streams of tones, as programmed. The effect had to be perceptual. With a bit of effort I could go back and forth, hearing the sound as either beeps or munchkins. When a fellow student entered, I recounted my discovery, mentioning that I couldn’t wait to tell Professor Bregman, who directed the laboratory. She offered some advice: don’t tell anyone, except perhaps Professor Poser (who directed the psychopathology program).
Years later I discovered what I had discovered. The psychologists Robert Remez, David Pisoni, and their colleagues, braver men than I am, published an article in Science on “sine-wave speech.” They synthesized three simultaneous wavering tones. Physically, the sound was nothing at all like speech, but the tones followed the same contours as the bands of energy in the sentence “Where were you a year ago?” Volunteers described what they heard as “science fiction sounds” or “computer bleeps.” A second group of volunteers was told that the sounds had been generated by a bad speech synthesizer. They were able to make out many of the words, and a quarter of them could write down the sentence perfectly. The brain can hear speech content in sounds that have only the remotest resemblance to speech. Indeed, sine-wave speech is how mynah birds fool us. They have a valve on each bronchial tube and can control them independently, producing two wavering tones which we hear as speech.
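Remez and Pisoni’s exact procedure is not given here; the following Python sketch only illustrates the general idea, assuming you already have the contours of the three lowest bands of energy (the tracks below are invented placeholders, not measurements of any real sentence):

```python
# A minimal sketch of sine-wave speech: replace each band-of-energy contour
# with a pure tone that wavers along the same frequency track.
import numpy as np

def sine_wave_speech(formant_tracks, frame_rate=100, sample_rate=16000):
    """Turn frame-rate frequency contours (Hz) into a summed audio signal."""
    n_frames = len(formant_tracks[0])
    n_samples = int(n_frames * sample_rate / frame_rate)
    t_frames = np.arange(n_frames) / frame_rate
    t = np.arange(n_samples) / sample_rate
    out = np.zeros(n_samples)
    for track in formant_tracks:
        # Interpolate the contour up to audio rate, then integrate the
        # frequency to get a phase, so the tone glides rather than jumps.
        freq = np.interp(t, t_frames, track)
        phase = 2 * np.pi * np.cumsum(freq) / sample_rate
        out += np.sin(phase)
    return out / len(formant_tracks)

# Hypothetical hand-made contours in typical formant ranges (2 seconds).
f1 = 500 + 200 * np.sin(np.linspace(0, 6, 200))
f2 = 1500 + 400 * np.sin(np.linspace(0, 4, 200))
f3 = 2500 + 300 * np.sin(np.linspace(0, 5, 200))
audio = sine_wave_speech([f1, f2, f3])
```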
Our brains can flip between hearing something as a bleep and hearing it as a word because phonetic perception is like a sixth sense. When we listen to speech the actual sounds go in one ear and out the other; what we perceive is language. Our experience of words and syllables, of the “b”-ness of b and the “ee”-ness of ee, is as separable from our experience of pitch and loudness as lyrics are from a score. Sometimes, as in sine-wave speech, the senses of hearing and phonetics compete over which gets to interpret a sound, and our perception jumps back and forth. Sometimes the two senses simultaneously interpret a single sound. If one takes a tape recording of da, electronically removes the initial chirplike portion that distinguishes the da from ga and ka, and plays the chirp to one ear and the residue to the other, what people hear is a chirp in one ear and da in the other—a single clip of sound is perceived simultaneously as d-ness and a chirp. And sometimes phonetic perception can transcend the auditory channel. If you watch an English-subtitled movie in a language you know poorly, after a few minutes you may feel as if you are actually understanding the speech. In the laboratory, researchers can dub a speech sound like ga onto a close-up video of a mouth articulating va, ba, tha, or da. Viewers literally hear a consonant like the one they see the mouth making—an astonishing illusion with the pleasing name “McGurk effect,” after one of its discoverers.
Actually, one does not need electronic wizardry to create a speech illusion. All speech is an illusion. We hear speech as a string of separate words, but unlike the tree falling in the forest with no one to hear it, a word boundary with no one to hear it has no sound. In the speech sound wave, one word runs into the next seamlessly; there are no little silences between spoken words the way there are white spaces between written words. We simply hallucinate word boundaries when we reach the edge of a stretch of sound that matches some entry in our mental dictionary. This becomes apparent when we listen to speech in a foreign language: it is impossible to tell where one word ends and the next begins. The seamlessness of speech is also apparent in “oronyms,” strings of sound that can be carved into words in two different ways:
The good can decay many ways.
The good candy came anyways.
The stuffy nose can lead to problems.
The stuff he knows can lead to problems.
Some others I’ve seen.
Some mothers I’ve seen.
Oronyms are often used in songs and nursery rhymes:
I scream,
You scream,
We all scream
For ice cream.
Mairzy doats and dozy doats
And little lamsey divey,
A kiddley-divey do,
Wouldn’t you?
Fuzzy Wuzzy was a bear,
Fuzzy Wuzzy had no hair.
Fuzzy Wuzzy wasn’t fuzzy,
Was he?
In fir tar is,
In oak none is.
In mud eel is,
In clay none is.
Goats eat ivy.
Mares eat oats.
And some are discovered inadvertently by teachers reading their students’ term papers and homework assignments:
Jose can you see by the donzerly light? [Oh say can you see by the dawn’s early light?]
It’s a doggy-dog world. [dog-eat-dog]
Eugene O’Neill won a Pullet Surprise. [Pulitzer Prize]
My mother comes from Pencil Vanea. [Pennsylvania]
He was a notor republic. [notary public]
They played the Bohemian Rap City. [Bohemian Rhapsody]
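The dictionary-matching story behind these mis-hearings can be sketched in code. This is a toy model (mine, not the book’s), using spelling as a stand-in for sound; a boundary is hallucinated wherever a stretch of input matches a lexical entry:

```python
# A toy model of hallucinated word boundaries: carve the input wherever a
# stretch matches an entry in the "mental dictionary." The lexicon and the
# input string are invented for illustration.
def segmentations(sounds, lexicon):
    """Return every way to carve the string into dictionary entries."""
    if not sounds:
        return [[]]            # one way to segment nothing: the empty parse
    parses = []
    for i in range(1, len(sounds) + 1):
        word = sounds[:i]
        if word in lexicon:    # a boundary is heard only at a lexical match
            for rest in segmentations(sounds[i:], lexicon):
                parses.append([word] + rest)
    return parses

lexicon = {"a", "an", "ice", "nice", "man", "iceman"}
for parse in segmentations("aniceman", lexicon):
    print(" ".join(parse))
# a nice man
# an ice man
# an iceman
```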
Even the sequence of sounds we think we hear within a word is an illusion. If you were to cut up a tape of someone’s saying cat, you would not get pieces that sounded like k, a, and t (the units called “phonemes” that correspond roughly to the letters of the alphabet). And if you spliced the pieces together in the reverse order, they would be unintelligible, not tack. As we shall see, information about each component of a word is smeared over the entire word.
Speech perception is another one of the biological miracles making up the language instinct. There are obvious advantages to using the mouth and ear as a channel of communication, and we do not find any hearing community opting for sign language, though it is just as expressive. Speech does not require good lighting, face-to-face contact, or monopolizing the hands and eyes, and it can be shouted over long distances or whispered to conceal the message. But to take advantage of the medium of sound, speech has to overcome the problem that the ear is a narrow informational bottleneck. When engineers first tried to develop reading machines for the blind in the 1940s, they devised a set of noises that corresponded to the letters of the alphabet. Even with heroic training, people could not recognize the sounds at a rate faster than good Morse code operators, about three units a second. Real speech, somehow, is perceived an order of magnitude faster: ten to fifteen phonemes per second for casual speech, twenty to thirty per second for the man in the late-night Veg-O-Matic ads, and as many as forty to fifty per second for artificially sped-up speech. Given how the human auditory system works, this is almost unbelievable. When a sound like a click is repeated at a rate of twenty times a second or faster, we no longer hear it as a sequence of separate sounds but as a low buzz. If we can hear forty-five phonemes per second, the phonemes cannot possibly be consecutive bits of sound; each moment of sound must have several phonemes packed into it that our brains somehow unpack. As a result, speech is by far the fastest way of getting information into the head through the ear.
No human-made system can match a human in decoding speech. It is not for lack of need or trying. A speech recognizer would be a boon to quadriplegics and other disabled people, to professionals who have to get information into a computer while their eyes or hands are busy, to people who never learned to type, to users of telephone services, and to the growing number of typists who are victims of repetitive-motion syndromes. So it is not surprising that engineers have been working for more than forty years to get computers to recognize the spoken word. The engineers have been frustrated by a tradeoff. If a system has to be able to listen to many different people, it can recognize only a tiny number of words. For example, telephone companies are beginning to install directory assistance systems that can recognize anyone saying the word yes, or, in the more advanced systems, the ten English digits (which, fortunately for the engineers, have very different sounds). But if a system has to recognize a large number of words, it has to be trained to the voice of a single speaker. No system today can duplicate a person’s ability to recognize both many words and many speakers. Perhaps the state of the art is a system called DragonDictate, which runs on a personal computer and can recognize 30,000 words. But it has severe limitations. It has to be trained extensively on the voice of the user. You…have…to…talk…to…it…like…this, with quarter-second pauses between the words (so it operates at about one-fifth the rate of ordinary speech). If you have to use a word that is not in its dictionary, like a name, you have to spell it out using the “Alpha, Bravo, Charlie” alphabet. And the program still garbles words about fifteen percent of the time, more than once per sentence. It is an impressive product but no match for even a mediocre stenographer.
The physical and neural machinery of speech is a solution to two problems in the design of the human communication system. A person might know 60,000 words, but a person’s mouth cannot make 60,000 different noises (at least, not ones that the ear can easily discriminate). So language has exploited the principle of the discrete combinatorial system again. Sentences and phrases are built out of words, words are built out of morphemes, and morphemes, in turn, are built out of phonemes. Unlike words and morphemes, though, phonemes do not contribute bits of meaning to the whole. The meaning of dog is not predictable from the meaning of d, the meaning of o, the meaning of g, and their order. Phonemes are a different kind of linguistic object. They connect outward to speech, not inward to mentalese: a phoneme corresponds to an act of making a sound. A division into independent discrete combinatorial systems, one combining meaningless sounds into meaningful morphemes, the others combining meaningful morphemes into meaningful words, phrases, and sentences, is a fundamental design feature of human language, which the linguist Charles Hockett has called “duality of patterning.”
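A toy sketch of duality of patterning, under obvious simplifications (a handful of made-up entries, with letters standing in for phonemes): the lower layer maps meaningless sound sequences to meaningful morphemes by pure, arbitrary lookup, while the upper layer composes meanings from meaningful parts:

```python
# A toy sketch of duality of patterning. Lower layer: meaningless sounds
# combine into meaningful morphemes purely by arbitrary lookup.
MORPHEMES = {
    ("d", "o", "g"): "DOG",
    ("g", "o", "d"): "GOD",    # same phonemes, different order, new meaning
    ("z",): "PLURAL",
}

def lookup_morpheme(phonemes):
    """No phoneme contributes meaning; the whole string is an arbitrary key."""
    return MORPHEMES.get(tuple(phonemes))

# Upper layer: morphemes DO contribute their meanings to the whole.
def compose_word(*morphemes):
    return "+".join(morphemes)

stem = lookup_morpheme(["d", "o", "g"])    # 'DOG'
suffix = lookup_morpheme(["z"])            # 'PLURAL'
print(compose_word(stem, suffix))          # DOG+PLURAL, i.e. "dogs"
```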
But the phonological module of the language instinct has to do more than spell out the morphemes. The rules of language are discrete combinatorial systems: phonemes snap cleanly into morphemes, morphemes into words, words into phrases. They do not blend or melt or coalesce: Dog bites man differs from Man bites dog, and believing in God is different from believing in Dog. But to get these structures out of one head and into another, they must be converted to audible signals. The audible signals people can produce are not a series of crisp beeps like on a touch-tone phone. Speech is a river of breath, bent into hisses and hums by the soft flesh of the mouth and throat. The problems Mother Nature faced are digital-to-analog conversion when the talker encodes strings of discrete symbols into a continuous stream of sound, and analog-to-digital conversion when the listener decodes continuous speech back into discrete symbols.
The sounds of language, then, are put together in several steps. A finite inventory of phonemes is sampled and permuted to define words, and the resulting strings of phonemes are then massaged to make them easier to pronounce and understand before they are actually articulated. I will trace out these steps for you and show you how they shape some of our everyday encounters with speech: poetry and song, slips of the ear, accents, speech recognition machines, and crazy English spelling.
One easy way to understand speech sounds is to track a glob of air through the vocal tract into the world, starting in the lungs.
When we talk, we depart from our usual rhythmic breathing and take in quick breaths of air, then release them steadily, using the muscles of the ribs to counteract the elastic recoil force of the lungs. (If we did not, our speech would sound like the pathetic whine of a released balloon.) Syntax overrides carbon dioxide: we suppress the delicately tuned feedback loop that controls our breathing rate to regulate oxygen intake, and instead we time our exhalations to the length of the phrase or sentence we intend to utter. This can lead to mild hyperventilation or hypoxia, which is why public speaking is so exhausting and why it is difficult to carry on a conversation with a jogging partner.