Kanzi: The Ape at the Brink of the Human Mind
by Sue Savage-Rumbaugh
Why are consonants important? Couldn’t apes and monkeys simply use the sounds that they can make to construct a language all of their own? The issue is more complicated than it seems. Studies of the vocal repertoires of chimpanzees reveal that they, like many other mammals, possess a “graded” system of vocal communication. This means that instead of producing distinct calls that can easily be distinguished from each other, they produce a set of sounds that grade into one another with no clear boundaries. In one sense, a graded system permits richer communication than a system of fixed calls, such as characterizes many bird species. The graded vocal system of chimpanzees permits them to use pitch, intensity, and duration to add specific affective information to their vocal signals. For example, food calls signal the degree of pleasure felt about the food, as well as the interest in the food per se.
More important, these affectively loaded signals are exchanged rapidly back and forth, and the parameters of pitch, intensity, and duration serve communicative functions very similar to those of human speech. We can say a phrase like “Oh, I am very happy” with such feeling that the happiness almost leaps out of the speaker, or with a cynicism that lets the listener know the speaker is not really happy at all. Graded systems are well designed for transmitting emotional information that is itself graded in content. A feeling of happiness is something like a color in its endless variations. However, a word such as “fruit” or “nut” does not lend itself to a graded system. Words are units of specific information, and while they may themselves generate affect, they are not dependent on that affect for their information-bearing qualities. Consequently, unlike affective signals, which constantly intergrade, words have a definite beginning and ending.
If apes are indeed intelligent enough to do so, why have they not elaborated their graded system into one built of units, as we have? Unfortunately, vowels are ill-equipped to permit such “packed” communication. Even in human speech, vowels grade into one another, making it impossible to determine where one ends and another begins. When tested with a computer-generated sound that slowly transitions from the vowel sound “Ah” to that of “Eh,” humans exhibit a sort of “fuzzy boundary.” There is a large area of the transition space that we label either “Ah” or “Eh” without much consistency. This is true across different listeners and for the same listener on different trials.
It is consonants that permit us to package vowels and therefore produce a speech stream that can be readily segmented into distinct auditory units, or word packages. Here we experience what is called a “categorical shift.” If the computer presents us with consonants rather than vowels, as in a test where the sound slowly changes from “Ba” to “Pa,” we continue to hear “Ba” until all of a sudden it sounds as though the computer decided to switch to “Pa.” Although the computer has indeed presented us with a gray area of transition, just as it did when it played the vowels, with consonants we no longer recognize that the transition is happening. It is as though we have an auditory system equipped with filters designed to let us hear either “Ba” or “Pa,” but nothing in between. When we hear a “Ba,” it either fits the “Ba” filter parameters or it does not. If it does not fit, we cannot make a judgment about it, as we can with a vowel, because we simply don’t hear any mixture of “Ba + Pa” to judge.
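One way to picture the contrast is as two labeling curves. The sketch below is a toy model, not data from these experiments; the boundary location and the steepness values are illustrative assumptions, chosen only to show how vowel judgments drift across a continuum while consonant judgments flip abruptly.

```python
# Toy model of graded versus categorical labeling along an 11-step
# continuum between two sounds. The steepness values are illustrative
# assumptions; what matters is the contrast between a shallow (graded)
# curve and a steep (categorical) one.
import numpy as np

def p_second_label(step, boundary=0.5, steepness=4.0):
    """Probability of reporting the second sound, for a stimulus on a 0-1 continuum."""
    return 1.0 / (1.0 + np.exp(-steepness * (step - boundary)))

continuum = np.linspace(0.0, 1.0, 11)

print("step   P('Eh') vowel   P('Pa') consonant")
for step in continuum:
    vowel = p_second_label(step, steepness=4.0)       # fuzzy boundary: answers drift
    consonant = p_second_label(step, steepness=60.0)  # categorical shift: answers flip
    print(f"{step:.1f}    {vowel:.2f}            {consonant:.2f}")
```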
For some time after this phenomenon of “categorical shift” was discovered, scientists thought that humans alone among mammals possessed the ability to process speech sounds categorically. Moreover, it was widely accepted that this capacity was a genetically predetermined aspect of our auditory system. Even though many scientists recognized that animals could learn to respond to single-word spoken commands, it was assumed that they were doing so on the basis of intonational contours, rather than the phonemic units themselves.
This view held sway until a method was devised to ask animals what they heard as they listened to consonants and vowels that graded into one another. The techniques used in these tests were modeled on those that had been applied to human infants, who had been asked a similar question about their categorical skills. Human infants proved able to categorize consonants in a manner similar to that of adults, a fact that was initially viewed as strong support for the belief that these capacities were genetically programmed into our auditory systems. However, tests with mammals as different as chinchillas and rhesus monkeys revealed clearly that man was not unique in the capacity to make categorical judgments about consonants. Other animals could form acoustic boundaries that categorically differentiated consonants, even though they employed no such sounds in their own vocal systems. Thus, speech sounds are unique to humans only with regard to our ability to produce them, not with regard to our ability to hear them. On reflection, it seems odd that it should have surprised us that auditory systems are capable of far greater sound definition than the organism is able to produce with its vocal cords. After all, we live in a very noisy environment, and to get along in the forest we certainly need to be able to discriminate and make sense of many sounds that we ourselves cannot produce.
Consonants are rather “funny” sounds. They must, for example, always be linked to a vowel if they are to be heard as consonants. We cannot separate vowels and consonants in normal speech because it is impossible for humans to say a consonant without also saying a vowel. Thus we cannot utter “G,” but rather must say “Gee” or “Ga” or “Ghuh” or some similar sound. However, it is possible, with the aid of a computer, to chop apart vowels and consonants and thus have the computer say “G” in a way that we cannot. To accomplish this, you need only record some speech into a computer using a program that can transform auditory information into visual information. Once you have a picture of the sound on the screen, you can play it back and watch as a time pointer moves through the wave form. As you watch the pointer move while listening, you can determine the point at which you stop hearing the “G” and are instead listening to the “ah.” If you cut the word at this point and play the two halves, you will find something astonishing. The “ah” sounds like a normal “ah,” but the “G” is not recognizable at all. It sounds like some sort of clicking, hissing noise, and you will think that somehow the computer has made a mistake. But you have only to paste this hissing, clicking noise back onto the “ah” sound to hear the “G” again, as clear as can be.
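This cut-and-paste experiment is easy to approximate with a few lines of code. The sketch below assumes a mono recording named ga.wav and a cut point of 40 ms; both the file name and the cut time are placeholders for whatever you find when you watch the pointer move through your own wave form.

```python
# Cut a recorded "Ga" syllable into its consonant burst and its vowel,
# then paste them back together. "ga.wav" and the 40 ms cut point are
# placeholder assumptions; choose the cut point by ear, as described above.
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("ga.wav")  # sample rate in Hz, samples as an array
cut = int(0.040 * rate)                 # convert the 40 ms cut point to a sample index

consonant = samples[:cut]               # played alone: a brief click/hiss, no "G" heard
vowel = samples[cut:]                   # played alone: a normal-sounding "ah"
restored = np.concatenate([consonant, vowel])  # pasted back: "Ga" is heard again

wavfile.write("g_alone.wav", rate, consonant)
wavfile.write("ah_alone.wav", rate, vowel)
wavfile.write("ga_restored.wav", rate, restored)
```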
What does this tell us? Variations in our perceptions are more a property of the auditory and neurological systems we listen with than of the sound pattern itself. Sharp, short sounds like clicks and hisses are perceived differently from longer, tonal sounds. Why would this be? It may be the result of another unusual fact about clicks and hisses: we can localize them extremely well in space, a skill we probably owe to the fact that a broken branch or a disturbed leaf can signal the approach of a predator. Most mammals, including ourselves, need to be able to turn in the right direction quickly and respond without hesitation when such a sound portends danger in the forest. By contrast, longer vowel-like sounds are produced by most mammalian and avian vocal tracts and are used for communication, not for hunting; we cannot localize them as well. When animals hunt they are quiet, and the clicks and scraping noises they make as they move through the forest are the only clue to their approach. Thus it seems that auditory systems have evolved different ways of listening to different sorts of sounds.
The fact that clicks and hissing sounds are so distinct and easily perceived gives them unusual properties when they are linked to vowel-like tonal sounds. The merging together of these two sound types results in what we hear as consonants. Without consonants it is doubtful that we would have spoken language. Why not? The answer is that vowel sounds are difficult to tell apart. It is hard to determine when an “ee” sound turns into an “ii” sound. At the extremes, you can determine which vowel is being produced, but this ability fails us rapidly as one vowel sound begins to grade into another.
The same is not true of short sounds like hisses and clicks. We hear them as discrete, staccato-like events well localized in space. When these clicks are merged with vowels, consonants appear and act as boundaries around the vowels, permitting us to determine readily where one syllable stops and another starts, so that we hear words as individual units.
It is startling to learn that these things we call words, which we hear as such distinct entities, are really not distinct at all. When we look at a visual wave form of a sentence, we find that the distinctions between words vanish completely. If we pause in our speaking for a break or for emphasis, we see a break in the wave form, but for regular speech, a sentence looks like one continuous word. Thus the units that we hear are not present in the physical energy we generate as we talk. We hear the sound spectrum of speech as segmented into words only because the consonants allow our brains to break the lump down at just the right joints—the joints we call words.
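You can verify this with a simple plot. The sketch below assumes a mono recording named sentence.wav (a hypothetical file); spoken without deliberate pauses, the sentence shows up as one unbroken wave form, with nothing in the signal marking the word boundaries you so clearly hear.

```python
# Plot the wave form of a recorded sentence. Assumes "sentence.wav" is a
# mono recording of ordinary connected speech (the file name is a placeholder).
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

rate, samples = wavfile.read("sentence.wav")
time = np.arange(len(samples)) / rate  # sample times in seconds

plt.plot(time, samples, linewidth=0.3)
plt.xlabel("time (s)")
plt.ylabel("amplitude")
plt.title("Connected speech: one continuous wave form, no gaps between words")
plt.show()
```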
Seen from this perspective, the fascinating thing about human language becomes our ability to produce the actual units of speech. If we did not have the ability to attach clicks to vowels, we could not make consonants. Without consonants, it would be difficult to create a spoken language that could be understood, regardless of how intelligent we were.
It seems odd that the human animal is the only one that has gained the ability to produce consonants. Of course, it is also the case that we are the only animal that is a habitual biped, and the demands of bipedality have pressed some rather important constraints upon our skull: as the head came to balance atop an upright spine, the throat was bent into the sharply angled, two-tube vocal tract that distinguishes us from the other primates. Of course, we paid some prices for these changes. Our small teeth could no longer serve as weapons, and the sharp bend in our throats left us forever prone to choking. But the ability to form consonants readily gave Homo a way to package vowel sounds in many different envelopes, making possible a multitude of discriminable sounds. For the first time in primate evolutionary history, it became physically possible for us to invent a language. I suspect that our intellect had the potential for language long before, but it took the serendipitous physical changes that accompanied bipedalism to permit us to package vowels and consonants together in a way that made possible the open-ended generation of discriminable sound units: the crucial step leading to speech around the world.
These unusual properties of the auditory system are paralleled by similar phenomena in the visual system. Suppose we look at a row of marquee lights flashing off and on. If the time between the flashing of light A and light B is brief enough, we will perceive the lights as a single moving point of light. That is, we will not see any breaks or holes in the movement; our brain will fill in the gaps. If the flashing is slowed down, however, we will perceive the light as jumping from one marquee bulb to another, with gaps in between. Thus, at one speed we see only a moving light; at another, a jumping light. Like the categorical shift, this visual phenomenon is a property of the visual systems of many primates, not of humans alone.
Given the perceptual constraints of the auditory system, it is evident that the appearance of language awaited the development of a vocal system capable of packaging vowel sounds with consonants. Regardless of brain size, if the vocal system of an organism cannot produce consonants, language is not likely to emerge. The majority of land-dwelling mammals are quadrupedal and consequently have retained the sloping vocal tract designed to modify vowels, to convey affect, and to enable them to swallow easily without choking. This elongated configuration makes rapid consonant-vowel transitions physically implausible, even if the neural circuitry were to permit the ape to attempt them.
What of early hominids? Would their vocal tracts have permitted them to produce consonants and thus package their vowels into discriminable units of sound? Edmund Crelin has constructed model vocal tracts for Australopithecus, Homo erectus, Neanderthals, and other archaic Homo sapiens. While the reconstruction of soft tissue is always difficult, and the testing of a rubber mold is also subject to numerous subtle variations, such tests are nonetheless the current best way to approximate the speech capacities of extinct species. Crelin concluded that the ability to produce vowel-like sounds typical of modern speech would not have appeared until the advent of archaic Homo sapiens, around two hundred and fifty thousand years ago. These creatures had a brain capacity similar to our own.
If Crelin is correct, then language cannot have been responsible for the creation of Homo sapiens. Rather, it appears that gaining the vocal tract that made language possible may simply have been a free benefit as we evolved into being better bipeds. How we achieved the fine neuroanatomical control required to orchestrate the co-articulatory movements and the voluntary respiratory control to operate our vocal tract, however, remains something of a mystery.