Setting the Bar

For those of us who tend to seek the cutting edge, I am about to pull a Bill Gates. Yes, the richest man in the world is said to have once looked into the future and declared that “640K ought to be enough for anybody.” Now, my statement won’t be quite as simple as one sentence. Nor will I pretend to know where this is all headed. But we are audio junkies, so I’ll try.

In this case, the bar I speak of in the title concerns audio capability in a game engine on PC and current-gen consoles. With regret, I’m leaving out handhelds and mobile devices, but they’ll soon follow. Keep in mind this bar is net, not gross, when it comes to allocation of system resources in both hardware and software.

To break it all down, we’re going to take a look at music playback, DSP resources and, finally, the tools that will take us up to the bar.

Let’s start with taking a hit for the team in the interests of getting more valuable audio capability later. Bit depth and sample rate will be that hit, though not much of one. If you have 16-bit/44.1kHz sound across the board, with or without compression, and the ability to play up to 64 sounds of any size, that’s all you need to start. Why create an entire crowd? An entire flock of birds with individual SFX? Madness! That isn’t what games are about, or films for that matter. Rod Abernethy and Jason Graves, partners at Rednote Audio, recently scored Star Trek: Legacy. What do they say to 16/44 across the board with up to 10 stereo music streams at once?

Rod and Jason: Next-gen could mean the death of the dreaded 60-second loop. Imagine music that never repeats itself and changes dynamically with the game, seamlessly moving from one mood to another. We’re not talking about the typical “fade-in, fade-out” transitions, either. This would be one continuous piece of music that immerses the player even further into the game experience.

Ten music streams would provide the means to create true adaptive music — if the developer wants to take the time, energy and budget to implement it.

Most music for entertainment is still composed in a linear, beginning-to-end fashion. With the next-gen consoles, composers can write in true vertical fashion, instead of the traditional linear loop. Interrelated stems and variants of themes would be continuously triggered to a specific action in the game, all independent of each other. This means more planning and thought into the architecture of the music, but that’s why composing music for games is so much fun. It’s true multichannel, adaptive, interactive music.
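To make the “vertical” idea concrete, here is a minimal Python sketch of one way stems might be layered against a single game-intensity value. The stem names, thresholds and fade width are invented for illustration; nothing here comes from Rednote Audio or any particular engine.

```python
def stem_gains(intensity, thresholds, fade_width=0.2):
    """Return a gain in [0, 1] for each stem, given a game intensity in [0, 1].

    Each stem fades in linearly once intensity passes its threshold and is
    fully up once intensity reaches threshold + fade_width.
    """
    gains = {}
    for name, start in thresholds.items():
        ramp = (intensity - start) / fade_width
        gains[name] = max(0.0, min(1.0, ramp))
    return gains


# Hypothetical stems. The negative threshold keeps the pad bed fully up
# even when intensity is zero.
STEMS = {"pads": -0.2, "percussion": 0.3, "brass": 0.6}

gains = stem_gains(0.4, STEMS)  # pads full, percussion halfway in, brass silent
```

Because every stem keeps playing in sync and only its gain changes, the music can move between moods without a cut or a conventional fade-out.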

Now, on to manipulation. You must be able to route any of the 64 sounds to any output channel at any time, in any speaker configuration from stereo to 11.1 surround. If an individual sound needs to be routed to a particular channel in real time in the engine, this should be done using the game’s editor. You should be able to assign reverb and other plug-ins (yes, TDM, VST, DirectX, you name it) to any channel without slowdown and without sacrificing resources to another process.
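One simple way to picture this kind of routing is as a table of sends, where each sound carries a gain for every output channel it feeds. The Python sketch below is an assumption about how such a table could look, not a description of any shipping engine; the function name and channel layout are hypothetical.

```python
def mix_to_outputs(samples, routing, num_outputs):
    """Mix one sample per sound into a frame of output channels.

    samples: list of floats, one per active sound
    routing: routing[i] is a dict {output_index: gain} for sound i
    """
    out = [0.0] * num_outputs
    for value, sends in zip(samples, routing):
        for channel, gain in sends.items():
            out[channel] += value * gain
    return out


# Three sounds into a 5.1 layout (6 outputs). Indices are illustrative.
routing = [
    {0: 1.0},          # sound 0: front left only
    {0: 0.5, 1: 0.5},  # sound 1: split between front left and front right
    {3: 1.0},          # sound 2: sent to output 3 only
]
frame = mix_to_outputs([1.0, 0.5, -0.5], routing, num_outputs=6)
```

Re-routing a sound in real time then amounts to editing its entry in the table between frames.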

It doesn’t seem like a lot to ask. So I talked to Brian Schmidt, who runs the Xbox audio group.

Brian Schmidt: First, I have to take some issue with your “bar” of “64 16-bit/44kHz sounds.” Sixty-four?! That’s so five years ago. A typical next-gen console has more processing power in its multiple CPUs available for audio than a high-end Pro Tools system did just a couple of years ago. Windows PCs are increasingly multicore systems. These extra CPU cores essentially double — or even quadruple — the amount of CPU power available.

To be fair, on the PC, 64 can still be a nice number to shoot for. However, for consoles, 64 just barely gets you started. Many of the current-gen console games use literally hundreds of concurrent sounds. Looking for a helicopter sound? It’s not nearly enough to simply get a halfway-decent loop of a helicopter sound. You need to use separate waves for the rotors, engine, air, pistons and so on. Similarly, a good crowd sound will have multiple layers; even though something might be considered a single “sound,” under the hood, a good videogame sound may have several (or, in some cases, dozens of) concurrent waves playing at any one time.

But you asked me about “manipulation.” We specifically designed the Xbox 360 audio system to facilitate the use of DSP algorithms as an integral part of game audio. DSP usage in games falls into a few basic categories: environmental DSP, effect DSP, inherent DSP and mixing/mastering DSP.

Environmental DSP is used to take an existing sound and manipulate it in some sort of context-dependent way. The purpose is to place the sound within the specific environment where the action is taking place. The prototypical example of this is adding some reverb to a sound when the action is taking place in a large, presumably reverberant environment or a more enclosed space. Another example of environmental DSP would be a lowpass filter to mimic the effect of a sound coming from behind a closed door or wall.
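As a rough illustration of the behind-a-door case, the sketch below runs a sound through a one-pole lowpass filter in Python. It is a generic textbook filter, not the Xbox 360 implementation, and the 800Hz cutoff is an arbitrary assumption.

```python
import math


def one_pole_lowpass(samples, cutoff_hz, sample_rate=44100):
    """Filter a list of samples with a simple one-pole lowpass."""
    # Common one-pole coefficient: a = 1 - exp(-2*pi*fc/fs)
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out = []
    y = 0.0
    for x in samples:
        y += a * (x - y)  # move the output a fraction of the way toward the input
        out.append(y)
    return out


# A very bright test signal (alternating +1/-1) comes out heavily attenuated,
# while a steady level passes through almost untouched.
muffled = one_pole_lowpass([1.0, -1.0] * 500, cutoff_hz=800)
steady = one_pole_lowpass([1.0] * 1000, cutoff_hz=800)
```

In a game, the cutoff would typically be driven by what sits between the listener and the source, dropping as more geometry gets in the way.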

Effect DSP is used when an existing sound needs to be processed to achieve some specific, desired effect. For example, a character’s dialog may end up as radio communication in the game, so the dialog may need to be processed with DSP to emulate the effect of radio transmission. Performing DSP in real time during the game allows the same dialog to be used for both the “radioized” case and normal, non-radio dialog.

Inherent DSP starts to get interesting, as it uses DSP as part of the sound itself. Now some folks no doubt would ask, “If I know I want to put some DSP on a sound, why wouldn’t I just do that ahead of time and save the resultant sound as its own WAV file?” For certain effects, that may be fine. However, for the majority of sounds in a videogame, that kind of DSP is going to depend on the playback. The easy example is a gunshot. If the character can shoot his/her gun in any of a number of different rooms, the gunshot needs to have a number of different reverbs. It’s impossible to store multiple versions of every sound (dry, and with each of a number of different reverbs). It’s far more efficient and flexible to just store the sound “dry” and run the reverb in real time during game-play.
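The storage argument is easy to check with back-of-the-envelope arithmetic. The Python sketch below assumes uncompressed 16-bit mono audio at 44.1kHz; the sound count, average length and number of reverb environments are made-up figures, and reverb tails (which would make the baked versions longer still) are ignored.

```python
BYTES_PER_SAMPLE = 2   # 16-bit
SAMPLE_RATE = 44100    # samples per second, mono


def storage_bytes(num_sounds, avg_seconds, versions):
    """Uncompressed size of a sound set stored in `versions` copies."""
    return num_sounds * avg_seconds * SAMPLE_RATE * BYTES_PER_SAMPLE * versions


# Hypothetical game: 500 sounds averaging 2 seconds, 8 reverb environments.
dry_only = storage_bytes(500, 2, versions=1)   # 88,200,000 bytes (about 88 MB)
baked = storage_bytes(500, 2, versions=1 + 8)  # 793,800,000 bytes (about 794 MB)
```

Under these assumptions, baking every reverb multiplies the footprint by nine, while running the reverb at playback keeps it at the dry size.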

Finally, we’re just now starting to see mixing/mastering DSP. These are the DSP to which most linear media people are accustomed — per-sound EQ, dynamic range control, final-stage compression/limiting. Mixing/mastering DSP is probably more important for games than for traditional media due to the unpredictable nature of game audio — at any given moment, it’s virtually impossible to predict exactly what the sonic landscape will be. Having control over the mix therefore becomes more important than ever.
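For a sense of what final-stage limiting does, here is a deliberately crude block-based peak limiter in Python. It is only a sketch: real limiters smooth their gain changes with attack and release times, and the ceiling value and function name are assumptions.

```python
def limit_block(block, ceiling=1.0):
    """Scale a block of mixed samples down so its peak does not exceed ceiling."""
    peak = max((abs(s) for s in block), default=0.0)
    if peak <= ceiling:
        return list(block)  # already within range, leave untouched
    gain = ceiling / peak
    return [s * gain for s in block]


# An unpredictable pile-up of sounds overshoots; the limiter pulls it back.
safe = limit_block([0.5, -2.0, 1.0])  # [0.25, -1.0, 0.5]
```

The point is that the mix bus has to react to whatever the game throws at it, since nobody can know ahead of time which sounds will stack up.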

Finally, we need tools: a tool set that allows us to set mix groups; adjust volume and plug-in settings; set markers within files; hook files to animations, zones and events; and organize the files as objects that can exist as single or multiple files with all the settings (including good randomization). Want a planar or mesh emitter that can have sounds emanating from an object rather than a point? Go for it. Using beat detection and variable/selectable crossfades combined with event mapping will create just about any kind of adaptive soundtrack you’d want.
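Of the crossfade shapes a tool might offer, the equal-power curve is a common choice because it keeps perceived loudness roughly steady through the transition. A minimal Python sketch, with a hypothetical function name, might look like this:

```python
import math


def equal_power_gains(position):
    """Return (outgoing_gain, incoming_gain) for a crossfade.

    position 0.0 means only the outgoing cue is heard, 1.0 only the incoming.
    """
    position = max(0.0, min(1.0, position))
    angle = position * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


out_gain, in_gain = equal_power_gains(0.5)  # both about 0.707 at the midpoint
```

The squares of the two gains always sum to one, which is what distinguishes this from a plain linear fade that dips in level at the midpoint.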

Dan Forden and Marty O’Donnell handle some awfully big franchises — one doesn’t sniff at Halo 3 or the upcoming Stranglehold. What tool sets do they use and how do they match up against “the bar”?

Dan Forden: We use the Unreal 3 Engine at Midway and are therefore able to work directly in the game, just as the artists do. This enables us to tightly integrate the audio with the other game-play elements. We can attach sound notifications to any frame of an animation. We use the art in the game to tell us what type of material it is so we know what to play when someone walks on it or blows it up. We can also create submixes to ease the task of mixing. However, one of the best things about this environment is the ease of iteration. Working in the editor is essentially working in the game. Any parameter change to any loaded sound asset is heard immediately. Any game designer will tell you that short iteration time is one of the most critical factors when trying to create a successful game, and the ability to preview changes quickly is equally important to making great game audio. We have also created some home-brew functionality to allow us to organize and iterate on large quantities of character dialog and “non-verbal” voice-over — i.e., pain, effort, death, attacks, etc.

Moving ahead, as we approach the second round of games for this generation of hardware, we need to develop high-level scripting tools to more effectively manage the emotional context of game-play. That is, the ability to take stock of the game’s state at a given time and modify the audio accordingly, whether it’s changing the music, changing the mix or applying effects. Having control over these elements and being able to modify them in response to game-play mechanisms is what will create truly interactive, immersive and next-generation game audio.
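The kind of high-level scripting described here can be pictured as a function from game state to audio decisions. The Python sketch below is purely illustrative: the state fields, cue names and numbers are all invented, and a real tool would expose far more than three rules.

```python
def choose_mix(state):
    """Map a snapshot of game state to music and mix settings."""
    if state["player_health"] < 0.25:
        # Near death: tense music, muffled effects
        return {"music": "tension", "sfx_lowpass_hz": 1200, "music_gain": 1.0}
    if state["enemies_nearby"] > 0:
        return {"music": "combat", "sfx_lowpass_hz": None, "music_gain": 0.8}
    return {"music": "explore", "sfx_lowpass_hz": None, "music_gain": 0.6}


# Hypothetical snapshot taken by the game each time its state changes.
settings = choose_mix({"player_health": 0.9, "enemies_nearby": 3})
```

Keeping the rules in one place like this is what would let an audio designer, rather than a programmer, tune how the soundtrack responds to play.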

Well, there you have it: 16/44, 128 simultaneous sounds and a tool set that allows audio editors into the objects and animations. Not a lot to ask, right? And don’t even get me started about online, which has become a major component of game-play with next-gen. A great deal changes when your design depends on what happens in multiplayer games, and at this point your bandwidth is limited by streaming capability, which in turn affects audio.

There. Boom. The bar has been set. Now, has anybody bellied up?

Alexander Brandon is the audio director for Midway Home Entertainment in San Diego, Calif.