Moog Monday - On Synthesizers: Vocal Sounds, Part I


The human voice is the oldest musical instrument, and it remains the most important. Its ability to convey expression and emotion arises partly from the literal content of voice sounds (the words themselves), and partly from their acoustic properties. In discussing the production and processing of vocal sounds for musical purposes, I will focus on the sound quality of the voice, and ignore those factors that relate only to the intelligibility of the literal message.

In several of my past columns, I have pointed out that the characteristic qualities of musical sounds often depend more on how the sound parameters move than on what the sound parameters are at a given instant of time. In no case is this more true than in that of the human voice. The vocal tract is the only acoustic musical tone source whose shape can be rapidly, precisely, and drastically changed by the "player." In order to understand what the constituent properties of vocal sounds are, let's examine how they are produced, and then see how we can describe and classify them.

The Vocal Tract
Located directly behind your Adam's apple is a pair of flaps, the vocal cords, each of which is fastened down along three of its four sides. The unattached sides of the flaps are close together, forming an opening through which air from your lungs blows. By tightening these flaps, you can close the opening, thus forming a vibrating system not unlike a trumpet player's lips. When you breathe out, the vocal cords periodically let puffs of air through, and these puffs form the initial waveform of a voiced sound. This initial waveform has the approximate shape and spectrum of a rounded rectangular wave: it is generally rich in harmonics, but with some of the harmonics attenuated. The more you tighten the vocal cords, the faster they flap back and forth. This is how you control the pitch of your speech and singing. In addition, if your voice is trained, you can probably influence the "roundness" of the air puffs, thereby affecting the overall harmonic content of the initial waveform.
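
The puffs of air described above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the sample rate, the duty cycle, and the one-pole smoothing used to "round" the rectangle are illustrative choices, not measurements of a real voice, and the function names are hypothetical.

```python
def glottal_source(f0_hz, duration_s, sample_rate=16000, duty=0.4, rounding=0.7):
    """Rectangular pulse train, rounded off by a one-pole low-pass filter.

    f0_hz    -- pulse rate, standing in for vocal cord tension (pitch)
    duty     -- assumed fraction of each cycle during which air gets through
    rounding -- 0.0 leaves a sharp rectangle; values near 1.0 round it heavily
    """
    samples = []
    y = 0.0
    for i in range(int(duration_s * sample_rate)):
        phase = (i * f0_hz / sample_rate) % 1.0
        x = 1.0 if phase < duty else 0.0             # cords open or closed
        y = rounding * y + (1.0 - rounding) * x      # soften the edges
        samples.append(y)
    return samples


def count_puffs(samples, threshold=0.5):
    """Count upward crossings of the threshold, one per puff of air."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < threshold <= b)


def largest_step(samples):
    """Largest sample-to-sample jump, a crude measure of edge sharpness."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))


low_note = glottal_source(110.0, 1.0)
high_note = glottal_source(220.0, 1.0)
print(count_puffs(low_note), count_puffs(high_note))
```

Doubling f0_hz doubles the number of puffs per second, which is the pitch control described above. Raising rounding gives smoother edges and therefore weaker upper harmonics, a crude stand-in for the trained singer's control over "roundness."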

The space between the vocal cords and your lips is a complex cavity, which the initial waveform must go through in order to be heard. When you speak, you actually divide this cavity into two parts by raising some portion of your tongue toward the roof of your mouth. The back portion of the cavity has two important resonances, generally called F1 and F3, while the front portion of the cavity has one important resonance, called F2. Besides these, the cavity between the mouth and nose sometimes comes into play. By varying the height of your tongue, the place where it forms a bump, the position of the back of the roof of your mouth, and the opening between your lips, you control the resonant frequencies of F1, F2 and F3. To see all of this happening, stand in front of a mirror and slowly say "ah-eh-ee-aw-ooh." Watch the inside of your mouth closely to see everything that moves.

Many vocal sounds do not involve the vibration of the vocal cords. These are called "voiceless" sounds, and include consonants such as s, sh, f, p, k, and t. The initial waveform for these sounds is actually white noise formed by the turbulence of air as it rushes past the tongue, teeth, and lips.

Finally, articulation (the rapid starting and stopping of sounds) is performed by the tongue and lips. Rapid sounds such as b, k, and ch are formed in this way.

Characteristics Of Vocal Sounds
Vowels are the easiest sounds to understand. The vibration of the vocal cords excites the resonances F1, F2, and F3; the nasal cavity is shut off. Here are representative values of F1, F2, and F3 for the five vowels that you just said to yourself in front of your mirror:

Vowel   F1 (Hz)   F2 (Hz)   F3 (Hz)
ah      750       1100      2450
eh      550       1850      2500
ee      275       2300      3025
aw      575       850       2425
ooh     300       875       2250

Now you can correlate what you saw in the mirror with what the above chart says. The cavities on either side of the raised part of your tongue can be thought of as Helmholtz resonators, just like empty beer bottles or tuned speaker cabinets. As the volume of such a cavity goes up, its resonant frequency goes down (bigger beer bottles have lower resonant frequencies than smaller ones). As the size of the cavity opening goes up, the resonant frequency also goes up. Now, F1 is the main resonance of the space at the back of your mouth, and F2 is the main resonance of the space at the front of your mouth. "Aw" and "ooh" are spoken with the lips partially closed, thus lowering F2; "ee" is spoken with the lips open and the front cavity shortened by the raised tongue, thus raising F2; and so on. Of course, no two voices are exactly alike. The resonant frequencies given above are approximate. For instance, the formant frequencies of women's voices are typically 15% higher than those of men. The relationships among the sets of formants of the various vowels are, however, relatively consistent.
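
The beer-bottle behavior can be checked numerically. The sketch below uses the textbook formula for an idealized Helmholtz resonator, f = (c / 2π) · sqrt(A / (V · L)), where V is the cavity volume, A the area of the opening, and L the effective length of the neck. The bottle dimensions are hypothetical, end corrections are ignored, and a real vocal tract is far less tidy than this model. The second function simply applies the rough 15% upward shift mentioned above to a set of formants.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # air at roughly room temperature


def helmholtz_hz(volume_m3, opening_area_m2, neck_length_m):
    """Resonant frequency of an idealized Helmholtz resonator, in Hz."""
    return (SPEED_OF_SOUND_M_PER_S / (2.0 * math.pi)) * math.sqrt(
        opening_area_m2 / (volume_m3 * neck_length_m)
    )


# Hypothetical bottles: same neck, different volumes and openings.
small_bottle = helmholtz_hz(0.00035, 0.0003, 0.05)
large_bottle = helmholtz_hz(0.00070, 0.0003, 0.05)   # twice the volume
wide_mouth = helmholtz_hz(0.00035, 0.0006, 0.05)     # twice the opening


def scale_formants(formants_hz, percent_higher=15.0):
    """Shift a set of formant frequencies up by a fixed percentage."""
    factor = 1.0 + percent_higher / 100.0
    return tuple(f * factor for f in formants_hz)


print(large_bottle < small_bottle < wide_mouth)
print(scale_formants((750, 1100, 2450)))  # "ah" from the chart, shifted up
```

Because the frequency depends on the square root of A / V, doubling the volume lowers the resonance by a factor of about 1.4 rather than 2, and doubling the opening raises it by the same factor.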

Not all vowels can be described by three constant formant frequencies. Vowels such as long "u" (as in "you") and "ay" actually consist of transitions between two "steady" vowels. Say "you" slowly. You will hear that the word starts off as "ee" and ends as "ooh." Thus the formants move from one set of frequencies to another during the course of the sound. Other dynamic vowels (technically known as diphthongs) are "ow" and long "i."
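
A diphthong can be pictured as a glide between two rows of the vowel chart. The sketch below moves the three formants of "ee" toward those of "ooh" in equal steps. The straight-line glide and the number of steps are assumptions made for illustration; a real tongue does not move linearly, and the names used here are hypothetical.

```python
# Formant frequencies in Hz (F1, F2, F3), taken from the vowel chart.
FORMANTS_HZ = {
    "ah": (750, 1100, 2450),
    "eh": (550, 1850, 2500),
    "ee": (275, 2300, 3025),
    "aw": (575, 850, 2425),
    "ooh": (300, 875, 2250),
}


def formant_glide(start_vowel, end_vowel, steps):
    """Return `steps` formant triples moving linearly from one vowel to another."""
    if steps < 2:
        raise ValueError("a glide needs at least a start and an end frame")
    start = FORMANTS_HZ[start_vowel]
    end = FORMANTS_HZ[end_vowel]
    frames = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append(tuple(a + t * (b - a) for a, b in zip(start, end)))
    return frames


# "you" as a glide from "ee" to "ooh"
for f1, f2, f3 in formant_glide("ee", "ooh", 5):
    print(round(f1), round(f2), round(f3))
```

In this glide F2 does most of the traveling, falling from 2300 Hz to 875 Hz, while F1 barely moves. Each frame could serve as the set of control values for three resonant filters.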

Simulation Of Vocal Sounds
The initial waveform of voiced vocal sounds can be accurately simulated by a filtered rectangular waveform; white noise provides the initial waveform for unvoiced sounds. Three dynamic resonant filters provide formants F1, F2 and F3, while a fourth resonance provides nasal emphasis. The usual voltage-controlled amplifier and contour generator can provide articulation to define the consonants and shape the vowels. Finally, a sequencer can be employed to connect the components of a composite vocal sound. Standard synthesizer modules and sections to perform these functions are available. In fact, experimental speech synthesizers of this type have been around for at least forty years.
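
The signal chain described above can be sketched in software as follows. Everything specific here is an assumption made for illustration: the sample rate, the 80 Hz bandwidth, wiring the three resonators in parallel, the simple two-pole filter design, and the linear attack and release of the contour. The nasal resonance and the sequencer are left out, and none of this is presented as the circuitry of any particular synthesizer.

```python
import math
import random

SAMPLE_RATE = 16000


def rectangular_wave(f0_hz, n_samples, duty=0.4):
    """Voiced source: a rectangular wave at the pitch f0_hz."""
    return [1.0 if (i * f0_hz / SAMPLE_RATE) % 1.0 < duty else -1.0
            for i in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced source: uniformly distributed noise."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(samples, center_hz, bandwidth_hz):
    """Two-pole resonant filter standing in for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / SAMPLE_RATE)
    a2 = -r * r
    gain = 1.0 - r  # rough level control, not an exact normalization
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def contour(n_samples, attack_frac=0.1, release_frac=0.2):
    """Linear attack, sustain, and release, playing the contour generator."""
    attack = max(1, int(n_samples * attack_frac))
    release = max(1, int(n_samples * release_frac))
    return [min(1.0, i / attack, (n_samples - 1 - i) / release)
            for i in range(n_samples)]


def vowel(formants_hz, f0_hz=110.0, duration_s=0.5, bandwidth_hz=80.0):
    """Source -> three parallel formant filters -> amplitude contour."""
    n = int(duration_s * SAMPLE_RATE)
    source = rectangular_wave(f0_hz, n)
    bands = [resonator(source, f, bandwidth_hz) for f in formants_hz]
    return [env * sum(parts) for env, parts in zip(contour(n), zip(*bands))]


ah = vowel((750, 1100, 2450))  # formants of "ah" from the vowel chart
print(len(ah))
```

Swapping rectangular_wave for white_noise inside vowel gives a whispered version of the same vowel, since the formant filters are unchanged and only the initial waveform differs.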

Another approach to simulating vocal sounds is the use of a vocoder, a device that analyzes and then reconstructs the frequency bands of real speech sounds. My next column will describe several vocoder systems currently in use. After that, we'll talk about vocal sound simulation using conventional synthesizer components.

For more articles by Bob Moog, please visit
