Intro from Jay Allison: If you work in sound or film, you will come to know the name Walter Murch by your colleagues' tone when they say it. This is the man responsible for movies you remember for the dance between sound and picture--he shaped them both --The Conversation, The English Patient, Apocalypse Now, Cold Mountain -- and those are just a few of his picture editing and sound mixing credits. He has won multiple Oscars in both categories and is, well, generally regarded with some awe. Walter has created for Transom a new essay called Womb Tone as a companion to his lecture, Dense Clarity - Clear Density, now illustrated here with sound and film clips, detailing Walter's process. It's amazing. Take a chair in the classroom, and sit quietly. In case you think this will be a gut, let me quote this from Walter's bio, "Between films, he pursues interests in the science of human perception, cosmology and the history of science. Since 1995, he has been working on a reinterpretation of the Titius-Bode Law of planetary spacing, based on data from the Voyager Probe, the Hubble telescope, and recent discoveries of exoplanets orbiting distant stars." Walter will be around to answer your questions, but only intermittently because he is now editing and mixing Jarhead, about which he noted in email, "...the strange thing is that there is a clip from Apocalypse Now in Jarhead: a scene of the marines watching the helicopter attack as they get themselves pumped up to go to Kuwait. The experience, for me, is like being trapped inside an Escher drawing."
Hearing is the first of our senses to be switched on, four-and-a-half months after we are conceived. And for the rest of our time in the womb—another four-and-a-half months—we are pickled in a rich brine of sound that permeates and nourishes our developing consciousness: the intimate and varied pulses of our mother’s heart and breath; her song and voice; the low rumbling and sudden flights of her intestinal trumpeting; the sudden, mysterious, alluring or frightening fragments of the outside world — all of these swirl ceaselessly around the womb-bound child, with no competition from dormant Sight, Smell, Taste or Touch.
Birth wakens those four sleepyhead senses and they scramble for the child’s attention—a race ultimately won by the darting and powerfully insistent Sight—but there is no circumventing the fact that Sound was there before any of the other senses, waiting in the womb’s darkness as consciousness emerged, and was its tender midwife.
So although our mature consciousness may be betrothed to sight, it was suckled by sound, and if we are looking for the source of sound’s ability—in all its forms—to move us more deeply than the other senses and occasionally give us a mysterious feeling of connectedness to the universe, this primal intimacy is a good place to begin.
One of the infant’s first discoveries about the outside world is silence, which was never experienced in the womb. In later life, the absence of sound may come to seem a blessed relief, but for the newly-born, silence must be intensely threatening, with its implications of cessation and death. In radio, accordingly, a gap longer than the distance between a few heartbeats is taboo. In film, however, silence can be a glowing and powerful force if, like any potentially dangerous substance, it is handled correctly.
Another of the infant’s momentous discoveries about the world is its synchronization: our mother speaks and we see her lips move, they close and she falls silent; a plate tumbles off the table and crashes to the floor; we clap our hands and hear (as well as feel) the smack of flesh against flesh. Sounds remembered from the womb are discovered to have an external point of origin. The consequent realization that there is a world “outside” separate from the self (which must therefore be somehow “inside”) is a profound and earth-shaking discovery, and it deserves more attention than we can give it here. Perhaps it is enough to say that this feeling of separation between the self and the world is a hallmark of human existence, and the source of equal amounts of joy when it is overcome and pain when it is not.
Synchronization of sight and sound, which naturally does not exist in radio, can be the glory or the curse of cinema. A curse, because if overused, a string of images relentlessly chained to literal sound has the tyrannical power to strangle the very things it is trying to represent, stifling the imagination of the audience in the bargain. Yet the accommodating technology of cinema gives us the ability to loosen those chains and to re-associate the film’s images with other, carefully-chosen sounds which at first hearing may be “wrong” in the literal sense, but which can offer instead richly descriptive sonic metaphors.
This metaphoric use of sound is one of the most flexible and productive means of opening up a conceptual gap into which the fertile imagination of the audience will reflexively rush, eager (even if unconsciously so) to complete circles that are only suggested, to answer questions that are only half-posed. What each person perceives on screen, then, will have entangled within it fragments of their own personal history, creating that paradoxical state of mass intimacy where—though the audience is being addressed as a whole—each individual feels the film is addressing things known only to him or her.
So the weakness of present-day cinema is paradoxically its strength of representation: it doesn’t automatically possess the built-in escape valves of ambiguity that painting, music, literature, black-and-white silent film, and radio have simply by virtue of their sensory incompleteness —an incompleteness that automatically engages the imagination of the viewer/listener as compensation for what can only be suggested by the artist. In film, therefore, we go to considerable lengths to achieve what comes naturally to radio and the other arts: the space to evoke and inspire, rather than to overwhelm and crush, the imagination of the audience.
The essay that follows asks some questions about multilayered density in sound: are there limits to the number and nature of different elements we can superimpose? Can the border between sparse clarity and obscure density be located in advance?
These questions are, at heart, about how many separate ideas the mind can handle at the same time, and on this topic there seems, surprisingly, to be a common thread linking many different realms of human experience—music, Chinese writing, and Dagwood sandwiches, to name a few—and so I hope some of the tentative answers presented here, even though derived from film, will find their fruitful equivalents in radio.
Note About the Womb Tone Clip: This was recorded by my wife, Muriel (aka Aggie), who was a midwife for fifteen years and currently works in radio.
Before proceeding, I offer you a nut to crack, whose mysterious meat may help to qualify some of the less-than-obvious differences between sound in film and radio.Back in the decades B.D. (Before Digital) we mixed to 35mm black-and-white copies of the film in order to save the fragile workprint and, not incidentally, money. Photographically, these ‘dupes’ were dismal, but in a perverse way they helped the creative process by encouraging us to make the sound as good as possible to compensate for the low quality of the image.At the completion of the mix, still smarting from those sound-moments we felt we hadn’t quite pulled off, we would finally have the opportunity to screen the soundtrack with the ‘answer print’ — the first high-quality print from the lab. This was always an astonishing moment: if the sound had been good, it was now better, and even those less-than-successful moments seemed passable. It was as if the new print had cast a spell—which in a way is exactly what it had done.
This was not a unique situation by any means: it was a rule of thumb throughout the industry never to let producers or studio executives hear the final mix unless it could be screened with an answer print.What was going on?
Part 2: Dense Clarity – Clear Density
Simple and Complex
One of the deepest impressions on someone who happens to wander into a film mixing studio is that there is no necessary connection between ends and means. Sometimes, to create the natural simplicity of an ordinary scene between two people, dozens and dozens of soundtracks have to be created and seamlessly blended into one. At other times an apparently complex ‘action’ soundtrack can be conveyed with just a few carefully selected elements. In other words, it is not always obvious what it took to get the final result: it can be simple to be complex, and complicated to be simple.
The general level of complexity, though, has been steadily increasing over the eight decades since film sound was invented. And starting with Dolby Stereo in the 1970’s, continuing with computerized mixing in the 1980’s and various digital formats in the 1990’s, that increase has accelerated even further. Seventy years ago, for instance, it would not be unusual for an entire film to need only fifteen to twenty sound effects. Today that number could be hundreds to thousands of times greater.
Well, the film business is not unique: compare the single-take, single-track 78rpm discs of the 1930’s to the multiple-take, multi-track surround-sound CDs of today. Or look at what has happened with visual effects: compare King Kong of the 1930’s to the Jurassic dinosaurs of the 1990’s. The general level of detail, fidelity, and what might be called the “hormonal level” of sound and image has been vastly increased, but at the price of much greater complexity in preparation.
The consequence of this, for sound, is that during the final recording of almost every film there are moments when the balance of dialogue, music, and sound effects will suddenly (and sometimes unpredictably) turn into a logjam so extreme that even the most experienced of directors, editors, and mixers can be overwhelmed by the choices they have to make.
So what I’d like to focus on are these ‘logjam’ moments: how they come about, and how to deal with them when they do. How to choose which sounds should predominate when they can’t all be included? Which sounds should play second fiddle? And which sounds – if any – should be eliminated? As difficult as these questions are, and as vulnerable as such choices are to the politics of the filmmaking process, I’d like to suggest some conceptual and practical guidelines for threading your way through, and perhaps even disentangling these logjams.
Or– better yet — not permitting them to occur in the first place.
Code and Body
To begin to get a handle on this, I’d like you to think about sound in terms of light.
White light, for instance, which looks so simple, is in fact a tangled superimposure of every wavelength (that is to say, every color) of light simultaneously. You can observe this in reverse when you shine a flashlight through a prism and see the white beam fan out into the familiar rainbow of colors from violet (the shortest wavelength of visible light) – through indigo, blue, green, yellow, and orange – to red (the longest wavelength).
Keeping this in mind, I’d now like you to imagine white sound – every imaginable sound heard together at the same time: the sound of New York City, for instance – cries and whispers, sirens and shrieks, motors, subways, jackhammers, street music, Grand Opera and Shea Stadium. Now imagine that you could ‘shine’ this white sound through some kind of magic prism that would reveal to us its hidden spectrum.
Just as the spectrum of colors is bracketed by violet and red, this sound-spectrum will have its own brackets, or limits. Usually, in this kind of discussion, we would now start talking about the lowest audible (20 cycles) and highest audible (20,000 cycles) frequencies of sound. But for the purposes of our discussion I am going to ask you to imagine limits of a completely different conceptual order – something I’ll call Encoded sound, which I’ll put over here on the left (where we had violet); and something else I’ll call Embodied sound, which I’ll put over on the right (red).
The clearest example of Encoded sound is speech.
The clearest example of Embodied sound is music.
When you think about it, every language is basically a code, with its own particular set of rules. You have to understand those rules in order to break open the husk of language and extract whatever meaning is inside. Just because we usually do this automatically, without realizing it, doesn’t mean it isn’t happening. It happens every time someone speaks to you: the meaning of what they are saying is encoded in the words they use. Sound, in this case, is acting simply as a vehicle with which to deliver the code.
Music, however, is completely different: it is sound experienced directly, without any code intervening between you and it. Naked. Whatever meaning there is in a piece of music is ‘embodied’ in the sound itself. This is why music is sometimes called the Universal Language.
What lies between these outer limits? Just as every audible sound falls somewhere between the lower and upper limits of 20 and 20,000 cycles, so all sounds will be found somewhere on this conceptual spectrum from speech to music.
Most sound effects, for instance, fall mid-way: like ‘sound-centaurs,’ they are half language, half music. Since a sound effect usually refers to something specific – the steam engine of a train, the knocking at a door, the chirping of birds, the firing of a gun – it is not as ‘pure’ a sound as music. But on the other hand, the language of sound effects, if I may call it that, is more universally and immediately understood than any spoken language.
Green and Orange
But now I’m going to throw you a curve (you expected this, I’m sure) and say that in practice things are not quite as simple as I have just made them out to be. There are musical elements that make their way into almost all speech – think of how someone says something as a kind of music. For instance, you can usually tell if someone is angry or happy, even if you don’t understand what they are saying, just by listening to the tone (the music) of their voice. We understand R2-D2 entirely through the music of his beeps and boops, not from his ‘words’ (only C-3PO and Luke Skywalker can do that). Stephen Hawking’s computerized speech, on the other hand, is perfectly understandable, but monotonous – it has very little musical content – and so we have to listen carefully to what he says, not how he says it.
To the degree that speech has music in it, its ‘color’ will drift toward the warmer (musical) end of the spectrum. In this regard, R2D2 is warmer than Stephen Hawking, and Mr. Spock is cooler than Rambo.
By the same token, there are elements of code that underlie every piece of music. Just think of the difficulty of listening to Chinese Opera (unless you are Chinese!). If it seems strange to you, it is because you do not understand its code, its underlying assumptions. In fact, much of your taste in music is dependent on how many musical languages you have become familiar with, and how difficult those languages are. Rock and Roll has a simple underlying code (and a huge audience); modern European classical music has a complicated underlying code (and a smaller audience).
To the extent that this underlying code is an important element in the music, the ‘color’ of the music will drift toward the cooler (linguistic) end of the spectrum. Schoenberg is cooler than Santana.
And sound effects can mercurially slip away from their home base of yellow towards either edge, tinting themselves warmer and more ‘musical,’ or cooler and more ‘linguistic’ in the process. Sometimes a sound effect can be almost pure music. It doesn’t declare itself openly as music because it is not melodic, but it can have a musical effect on you anyway: think of the dense (“orange”) background sounds in Eraserhead. And sometimes a sound effect can deliver discrete packets of meaning that are almost like words. A door-knock, for instance, might be a “blue” micro-language that says: “Someone’s here!” And certain kinds of footsteps say simply: “Step! Step! Step!”
Such distinctions have a basic function in helping you to classify – conceptually – the sounds for your film. Just as a well-balanced painting will have an interesting and proportioned spread of colors from complementary parts of the spectrum, so the sound-track of a film will appear balanced and interesting if it is made up of a well-proportioned spread of elements from our spectrum of ‘sound-colors.’ I would like to emphasize, however, that these colors are completely independent of any emotional tone associated with “warmth” or “coolness.” Although I have put music at the red (warm) end of the spectrum, a piece of music can be emotionally cool, just as easily as a line of dialogue – at the cool end of the spectrum – can be emotionally hot.
In addition, there is a practical consideration to all this when it comes to the final mix: It seems that the combination of certain sounds will take on a correspondingly different character depending on which part of the spectrum they come from – some sounds will superimpose transparently and effectively, whereas others will tend to interfere destructively with each other and ‘block up,’ creating a muddy and unintelligible mix.
Before we get into the specifics of this, though, let me say a few words about the differences of superimposing images and sounds.
Harmonic and Non-Harmonic
When you look at a painting or a photograph, or the view outside your window, you see distinct areas of color – a yellow dress on a washing line, for instance, outlined against a blue sky. The dress and the sky occupy separate areas of the image. If they didn’t – if the foreground dress was transparent, the wavelengths of yellow and blue would add together and create a new color – green, in this case. This is just the nature of the way we perceive light.
You can superimpose sounds, though, and they still retain their original identity. The notes C, E, and G create something new: a harmonic C-major chord. But if you listen carefully you can still hear the original notes. It is as if, looking at something green, you still also could see the blue and the yellow that went into making it.
And it is a good thing that it works this way, because a film’s soundtrack (as well as music itself) is utterly dependent on the ability of different sounds (‘notes’) to superimpose transparently upon each other, creating new ‘chords,’ without themselves being transformed into something totally different.
Are there limits to how much superimposure can be achieved?
Well, it depends on what we mean by superimposure. Every note played by every instrument is actually a superimposure of a series of tones. A cello playing “A”, for instance, will vibrate strongly at that string’s fundamental frequency, say 110 per second. But the string also vibrates at exact multiples of that fundamental: 220, 330, 440, 550, 660, 770, 880, etc. These extra vibrations are called the harmonic overtones of the fundamental frequency.
Harmonics, as the name indicates, are sounds whose wave forms are tightly linked – literally ‘nested’ together. In the example above, 220, 440, and 880 are all higher octaves of the fundamental note “A” (110). And the other harmonics – 330, 550, 660, and 770 – correspond to the notes E, Db, E, and G which, along with A, are the four notes of the A-major chord (A-Db-E-G-A). So when the note A is played on the violin (or piano, or any other instrument) what you actually hear is a chord. But because the harmonic linkage is so tight, and because the fundamental (110 in this case) is almost twice as loud as all of its overtones put together, we perceive the “A” as a single note, albeit a note with ‘character.’ This character – or timbre – is slightly different for each instrument, and that difference is what allows us to distinguish not only between types of instrument – clarinets from violins, for example – but also sometimes between individual instruments of the same type – a Stradivarius violin from a Guarnieri.
This kind of harmonic superimposure has no practical limits to speak of. As long as the sounds are harmonically linked, you can superimpose as many elements as you want. Imagine an orchestra, with all the instruments playing octaves of the same note. Add an organ, playing more octaves. Then a chorus of 200, singing still more octaves. We are superimposing hundreds and hundreds of individual instruments and voices, but it will all still sound unified. If everyone started playing and singing whatever they felt like, however, that unity would immediately turn into chaos.
To give an example of non-musical harmonic superimposure: in Apocalypse Now we wanted to create the sound of a field of crickets for one of the beginning scenes (Willard alone in his hotel room at night), but for story reasons we wanted the crickets to have a hallucinatory degree of precision and focus. So rather than going out and simply recording a field of crickets, we decided to build the sound up layer by layer out of individually recorded crickets. We brought a few of them into our basement studio, recorded them one by one on a multitrack machine, and then kept adding track after track, recombining these tracks and then recording even more until we had finally many thousands of chirps superimposed. The end result sounded unified – a field of crickets – even though it had been built up out of many individual recordings, because the basic unit (the cricket’s chirp) is so similar – each chirp sounds pretty much like the last. This was not music, but it would still qualify, in my mind, as an example of harmonic superimposure.
(Incidentally, you’ll be happy to know that the crickets escaped and lived happily behind the walls of this basement studio for the next few years, chirping at the most inappropriate moments.)
Dagwood and Blondie
What happens, though, when the superimposure is not harmonic?
Technically, of course, you can superimpose as much as you want: you can create huge “Dagwood sandwiches” of sound – a layer of dialogue, two layers of traffic, a layer of automobile horns, of seagulls, of crowd hubbub, of footsteps, waves hitting the beach, foghorns, outboard motors, distant thunder, fireworks, and on, and on. All playing together at the same time. (For the purposes of this discussion, let’s define a layer as a conceptually-unified series of sounds which run more or less continuously, without any large gaps between individual sounds. A single seagull cry, for instance, does not make a layer.)
The problem, of course, is that sooner or later (mostly sooner) this kind of intense layering winds up sounding like the rush of sound between radio stations – white noise – which is where we began our discussion. The trouble with white noise is that, like white light , there is not a lot of information to be extracted from it. Or rather there is so much information tangled together that it is impossible for the mind to separate it back out. It is as indigestible as one of Dagwood’s sandwiches. You still hear everything, technically speaking, but it is impossible to listen to it – to appreciate or even truly distinguish any single element. So the filmmakers would have done all that work, put all those sounds together, for nothing. They could have just tuned between radio stations and gotten the same result.
I have here a short section from Apocalypse Now which I hope will show you what I mean. You will be seeing the same minute of film six times over, but you will be hearing different things each pass: one separate layer of sound after another, which should give you an almost geological feel for the sound landscape of this film. This particular scene runs for a minute or so, from Kilgore’s helicopters landing on the beach to the explosion of the helicopter and Kilgore saying “I want my men out!” But it is part of a much longer action sequence.
Originally, back in 1978, we organized the sound this way because we didn’t have enough playback machines – we couldn’t run everything together: there were over a hundred and seventy-five separate soundtracks for this section of film alone. It was my very own Dagwood sandwich. So I had to break the sound down into smaller, more manageable groups, called premixes, of about 30 tracks each. But I still do the same thing today even though I may have eight times as many faders as I did back then.
The six Apocalypse Now premix Layer were:
- Music (The Valkyries)
- Small Arms Fire (AK47’s and M16s)
- Explosions (Mortars, Grenades, Heavy Artillery)
- Footsteps and other foley-type sounds
These layers are listed in order of importance, in somewhat the same way that you might arrange the instrumental groups in an orchestra. Mural painters do somewhat the same thing when they grid a wall into squares and just deal with one square at a time. What murals and mixing and music all have in common is that in each of them the detail has to be so exactly proportioned to the immense scale of the work that it is easy to go wrong – either the details will overwhelm the eye (or ear) but give no sense of the whole, or the whole will be complete but without convincing details.
The human voice must be understood clearly in almost all circumstances, whether it is singing in an opera or dialogue in a film, so the first thing I did was mix the dialogue for this scene, isolated from any competing elements. Clips Courtesy of and Copyright American Zoetrope, 1979
Then I asked myself: what is the next most dominant sound in the scene? In this case it happened to be the helicopters, so I mixed all the helicopter tracks together onto a separate roll of 35mm film, while listening to the playback of the dialogue, to make sure I didn’t do anything with the helicopters to obscure the dialogue.
Then I progressed to the third most dominant sound, which was the “Ride of the Valkyries” as played through the amplifiers of Kilgore’s helicopters. I mixed this to a third roll of film while monitoring the two previous premixes of helicopters and dialogue.
In the end, I had six premixes of film, each one a six-channel master (three channels behind the screen: left, center, and right; two channels in the back of the theater: left and right; and one channel for low frequency enhancement). Each premix was balanced against the others so that – theoretically, anyway — the final mix should simply have been a question of playing everything together at one set level.
What I found to my dismay, however, was that in the first rehearsal of the final mix everything seemed to collapse into that big ball of noise I was talking about earlier. Each of the sound-groups I had premixed was justified by what was happening on screen, but by some devilish alchemy they all melted into an unimpressive racket when they were played together.
The challenge seemed to be to somehow find a balance point where there were enough interesting sounds to add meaning and help tell the story, but not so many that they overwhelmed each other.
The question was: where was that balance point?
Suddenly I remembered my experience ten years earlier with Robot Footsteps, and my first encounter with the mysterious Law of Two-and-a-Half.
Final Mixdown from Apocalypse Now:
Note About the Apocalypse Premix Clips: The picture for these six clips is black and white, which is what we mixed to except when checking the final, when we had a color print. The sound on these clips is mono, but in the premixes it was full six-track split-surround with boom (what we call 5.1 today). Clips Courtesy of and Copyright by American Zoetrope 1979
Robots and Grapes
This had happened in 1969, on one of the first films I worked on: George Lucas’s THX-1138. It was a low-budget film, but it was also science fiction, so my job was to produce an otherworldly soundtrack on a shoestring. The shoestring part was easy, because that was the only way I had worked up till then. The otherworldly part, though, meant that most of the sounds that automatically ‘came with’ the image (the sync sound) had to be replaced. A case in point: the footsteps of the policemen in the film, who were supposed to be robots made out of six hundred pounds of steel and chrome. During filming, of course, these robots were actors in costume who made the normal sound that anyone would make when they walked. But in the film we wanted them to sound massive, so I built some special metal shoes, fitted with springs and iron plates, and went to the Museum of Natural History in San Francisco at 2am, put them on and recorded lots of separate ‘walk-bys’ in different sonic environments, stalking around like some kind of Frankenstein’s monster.
They sounded great, but I now had to sync all these footstep up. We would do this differently today – the footsteps would be recorded on what is called a Foley stage, in sync with the picture right from the beginning. But I was young and idealistic – I wanted it to sound right! – and besides we didn’t have the money to go to Los Angeles and rent a Foley stage.
So there I was with my overflowing basket of footsteps, laying them in the film one at a time, like doing embroidery or something. It was going well, but too slowly, and I was afraid I wouldn’t finish in time for the mix. Luckily, one morning at 2am a good fairy came to my rescue in the form of a sudden and accidental realization: that if there was one robot, his footsteps had to be in sync; if there were two robots, also, their footsteps had to be in sync; but if there were three robots, nothing had to be in sync. Or rather, any sync point was as good as any other!
Robots from THX 1138 Clip Courtesy of Warner Brothers
This discovery broke the logjam, and I was able to finish in time for the mix. But…
But why does something like this happen?
Somehow, it seems that our minds can keep track of one person’s footsteps, or even the footsteps of two people, but with three or more people our minds just give up – there are too many steps happening too quickly. As a result, each footstep is no longer evaluated individually, but rather the group of footsteps is evaluated as a single entity, like a musical chord. If the pace of the steps is roughly correct, and it seems as if they are on the right surface, this is apparently enough. In effect, the mind says “Yes, I see a group of people walking down a corridor and what I hear sounds like a group of people walking down a corridor.”
Sometime during the mid-19th century, one of Edouard Manet’s students was painting a bunch of grapes, diligently outlining every single one, and Manet suddenly knocked the brush out of her hand and shouted: “Not like that! I don’t give a damn about Every Single Grape! I want you to get the feel of the grapes, how they taste, their color, how the dust shapes them and softens them at the same time.”
Similarly, if you have gotten Every Single Footstep in sync but failed to capture the energy of the group, the space through which they are moving, the surface on which they are walking, and so on, you have made the same kind of mistake that Manet’s student was making. You have paid too much attention to something that the mind is incapable of assimilating anyway, even if it wanted to.
Note About the THX-1138 Clip: In the final mix, the music dominates, so it is hard to hear the multiple footstep effect except briefly at the beginning and the end. You will have to take my word that there was a full footsteps premix for this section, but it was in creating the sound for this scene, in 1970, that the idea of two-and-a-half layers first came to me. A clearer example of the multiple footstep effect can be found on the #6 Apocalypse Now premix.
Trees and Forests
At any rate, after my robot experience I became sensitive to the transformation that appears to happen as soon as you have three of anything. On a practical level, it saved me a lot of work – I found many places where I didn’t have to count the grapes, so to speak – but I began to see the same pattern occurring in other areas as well, and it had implications far beyond footsteps.
The clearest example of what I mean can be seen in the Chinese symbols for “tree” and “forest.” In Chinese, the word “tree” actually looks like a tree – sort of a pine tree with drooping limbs. And the Chinese word for “forest” is three trees. Now, it was obviously up to the Chinese how many trees were needed to convey the idea of “forest,” but two didn’t seem to be enough, I guess, and sixteen, say, was way too many – it would have taken too long to write and would have just messed up the page. But three trees seems to be just right. So in evolving their writing system, the ancient Chinese came across the same fact that I blundered into with my robot footsteps: that three is the borderline where you cross over from “individual things” to “group.”
It turns out Bach also had some things to say about this phenomenon in music, relative to the maximum number of melodic lines a listener can appreciate simultaneously, which he believed was three. And I think it is the reason that Barnum’s circuses have three rings, not five, or two. Even in religion you can detect its influence when you compare Zoroastrian Duality to the mysterious ‘multiple singularity’ of the Christian Trinity. And the counting systems of many primitive tribes (and some animals) end at three, beyond which more is simply “many.”
So what began to interest me from a creative point of view was the point where I could see the forest and the trees – where there was simultaneously Clarity, which comes through a feeling for the individual elements (the notes), and Density, which comes through a feeling for the whole (the chord). And I found this balance point to occur most often when there were not quite three layers of something. I came to nickname this my “Law of Two-and-a-half.”
Left and Right
Now, a practical result of our earlier distinction between Encoded sound and Embodied sound seems to be that this law of two-and-a-half applies only to sounds of the same ‘color’ – sounds from the same part of the conceptual spectrum. (With sounds from different parts of the spectrum – different colored sounds – there seems to be more leeway.)
The robot footsteps, for instance, were all the same ‘green,’ so by the time there were three layers, they had congealed into a new singularity: robots walking in a group. Similarly, it is just possible to follow two ‘violet’ conversations simultaneously, but not three. Listen again to the scene in The Godfather where the family is sitting around wondering what to do if the Godfather (Marlon Brando) dies. Sonny is talking to Tom, and Clemenza is talking to Tessio – you can follow both conversations and also pay attention to Michael making a phone call to Luca Brasi (Michael on the phone is the “half” of the two-and-a-half), but only because the scene was carefully written and performed and recorded. Or think about two pieces of ‘red’ music playing simultaneously: a background radio and a thematic score. It can be pulled off, but it has to be done carefully.
But if you blend sounds from different parts of the spectrum, you get some extra latitude. Dialogue and Music can live together quite happily. Add some Sound Effects, too, and everything still sounds transparent: two people talking, with an accompanying musical score, and some birds in the background, maybe some traffic. Very nice, even though we already have four layers.
Why is this? Well, it probably has something to do with the areas of the brain in which this information is processed. It appears that Encoded sound (language) is dealt with mostly on the left side of the brain, and Embodied sound (music) is taken care of across the hall, on the right. There are exceptions, of course: for instance, it appears that the rhythmic elements of music are dealt with on the left, and the vowels of speech on the right. But generally speaking, the two departments seem to be able to operate simultaneously without getting in each other’s way. What this means is that by dividing up the work they can deal with a total number of layers that would be impossible for either side individually.
Transom isn't possible without your input.
Help Transom get new work and voices to public radio by donating now.
Density and Clarity
In fact, it seems that the total number of layers, if the burden is evenly spread across the spectrum from Encoded to Embodied (from “violet” dialogue to “red” music) is double what it would be if the layers were stacked up in any one region (color) of the spectrum. In other words, you can manage five layers instead of two-and-a-half, thanks to the left-right duality of the human brain.
What this might mean, in practical terms, is:
- One layer of “violet” dialogue;
- One layer of “red” music;
- One layer of “cool” (linguistic) effects (eg: footsteps);
- One layer of “warm” (musical) effects (eg: atmospheric tonalities);
- One layer of “yellow” (equally balanced ‘centaur’) effects.
What I am suggesting is that, at any one moment (for practical purposes, let’s say that a ‘moment’ is any five-second section of film), five layers is the maximum that can be tolerated by an audience if you also want them to maintain a clear sense of the individual elements that are contributing to the mix. In other words, if you want the experience to be simultaneously Dense and Clear.
But the precondition for being able to sustain five layers is that the layers be spread evenly across the conceptual spectrum. If the sounds stack up in one region (one color), the limit shrinks to two-and-a-half. If you want to have two-and-a-half layers of dialogue, for instance, and you want people to understand every word, you had better eliminate the competition from any other sounds which might be running at the same time.
To highlight the differences in our perception of Encoded vs. Embodied sound, it is interesting to note the paradox that in almost all stereo films produced over the last twenty-five years the dialogue is always placed in the center no matter what the actual position of the actors on the screen: they could be on the far left, but their voices still come out of the center. And yet everyone (including us mixers) still believes the voices are ‘coming from’ the actors. This is a completely different treatment than is given sound effects of the “yellow” variety – car pass-bys, for instance – which are routinely (and almost obligatorily) moved around the screen with the action. And certainly different from “red” music, which is usually arranged so that it come out of all speakers in the theater (including the surrounds) simultaneously. Embodied “orange” sound effects (atmospheres, room tones) are also given a full stereo treatment. “Blue-green” sound effects like footsteps, however, are usually placed in the center like dialogue, unless the filmmakers want to call special attention to the steps, and then they will be placed and moved along with the action. But in this case the actors almost always have no dialogue.
As a general rule, then, the “warmer” the sound, the more it tends to be given a full stereo (multitrack) treatment, whereas the “cooler” the sound, the more it tends to be monophonically placed in the center. And yet we seem to have no problem with this incongruity – just the opposite, in fact. The early experiments (in the 1950’s) which involved moving the dialogue around the screen were eventually abandoned as seeming “artificial.”
Monophonic films have always been this way – that part is not new. What is new and peculiar, though, is that we are able to tolerate – even enjoy – the mixture of mono and stereo in the same film.
Why is this? I believe it has something to do with the way we decode language, and that when our brains are busy with Encoded sound, we willingly abandon any question of its origin to the visual, allowing the image to “steer” the source of the sound. When the sound is Embodied, however, and little linguistic decoding is going on, the location of the sound in space becomes increasingly important the less linguistic it is. In the terms of this lecture, the “warmer” it is. The fact that we can process both Encoded mono and Embodied stereo simultaneously seems to clearly demonstrate some of the differences in the way our two hemispheres operate.
Getting back to my problem on Apocalypse: it appeared to be caused by having six layers of sound, and six layers is essentially the same as sixteen, or sixty: I had passed a threshold beyond which the sounds congeal into a new singularity – dense noise in which a fragment or two can perhaps be distinguished, but not the developmental lines of the layers themselves. With six layers, I had achieved Density, but at the expense of Clarity.
What I did as a result was to restrict the layers for that section of film to a maximum of five. By luck or by design, I could do this because my sounds were spread evenly across the conceptual spectrum.
- Dialogue (violet)
- Small arms fire (blue-green ‘words’ which say “Shot! Shot! Shot!”)
- Explosions (yellow “kettle drums” with content)
- Footsteps and miscellaneous (blue to orange)
- Helicopters (orange music-like drones)
- Valkyries Music (red)
Final Mixdown from Apocalypse Now Clips Courtesy of and Copyright by American Zoetrope 1979
Note About the Apocalypse Final Clip: The soundtrack on this clip is stereo, but the final mix of the film is in 5.1, a format which we created specifically for Apocalypse Now in 1977-9 and which is currently the industry standard.
If the layers had not been as evenly spread out, the limit would be less than five. And as I mentioned before, if they had all been concentrated in one “color zone” of the spectrum, (all violet or all red, for instance) the limit would shrink to two-and-a-half. It seems, then, that the more monochrome the palette, the fewer the layers that can be super¬imposed; the more polychrome the palette, on the other hand, the more layers you get to play with.
So in this section of Apocalypse, I found I could build a “sandwich” with five layers to it. If I wanted to add something new, I had to take something else away. For instance, when the boy in the helicopter says “I’m not going, I’m not going!” I chose to remove all the music. On a certain logical level, that is not reasonable, because he is actually in the helicopter that is producing the music, so it should be louder there than anywhere else. But for story reasons we needed to hear his dialogue, of course, and I also wanted to emphasize the chaos outside — the AK47’s and mortar fire that he was resisting going into – and the helicopter sound that represented “safety,” as well as the voices of the other members of his unit. So for that brief section, here are the layers:
- Dialogue (“I’m not going! I’m not going!”)
- Other voices, shouts, etc.
- AK-47’s and M-16s
- Mortar fire.
Under the circumstances, music was the sacrificial victim. The miraculous thing is that you do not hear it go away – you believe that it is still playing even though, as I mentioned earlier, it should be louder here than anywhere else. And, in fact, as soon as this line of dialogue was over, we brought the music back in and sacrified something else. Every moment in this section is similarly fluid, a kind of shell game where layers are disappearing and reappearing according to the dramatic focus of the moment. It is necessitated by the ‘five-layer’ law, but it is also one of the things that makes the soundtrack exciting to listen to.
But I should emphasize that this does not mean I always had five layers cooking. Conceptual density is something that should obey the same rules as loudness dynamics. Your mix, moment by moment, should be as dense (or as loud) as the story and events warrant. A monotonously dense soundtrack is just as wearing as a monotonously loud film. Just as a symphony would be unendurable if all the instruments played together all the time. But my point is that, under the most favorable of circumstances, five layers is a threshold which should not be surpassed thoughtlessly, just as you should not thoughtlessly surpass loudness thresholds. Both thresholds seem to have some basis in our neurobiology.
The bottom line is that the audience is primarily involved in following the story: despite everything I have said, the right thing to do is ultimately whatever serves the storytelling, in the widest sense. When this helicopter landing scene is over, however, my hope was the lasting impression of everything happening at once – Density – yet everything heard distinctly – Clarity. In fact, as you can see, simultaneous Density and Clarity can only be achieved by a kind of subterfuge.
As I said at the beginning, it can be complicated to be simple and simple to be complicated.
But sometimes it is just complicated to be complicated.
The Transom Online Workshop, with support from the Knight Prototype Fund, helped update this article.