The Importance of Being Earnest


People are beginning to take games seriously, but not as seriously as they regard film in terms of dramatic presentation. This is in large part because of one of the weakest links in game sound, as opposed to film: the way voice acting and screen action work (or don’t work) together. Look at a game and you’ll instantly know it’s a game, not a movie. For one thing, game characters don’t look like onscreen live actors. Motion is robotic or unrealistic, lighting sometimes cuts off, or a texture might get jittery. It just isn’t the realism that we get on film. So it doesn’t help that game characters’ voices don’t seem real either. This incongruity exists largely because actors performing game voice-overs rarely record or do ADR (automated dialog replacement) to picture. It’s also difficult to pull off a successful performance of the possibly thousands of lines that an actor must read, particularly in the case of role-playing games. How hard is it to do one line well without seeing the picture? Now multiply that by several thousand. But the goal of capturing game voices that lend reality, rather than unreality, to a game soundtrack is not unattainable. This column explores the voice-recording techniques that are currently in use and ways to use them to reach that goal.

A bit of history first. Back in the ’90s, when games first started to incorporate recorded voices rather than having the player rely on reading text, the voice-over technique was way ahead of the visuals. Facial motions were based on static painted pics or extremely simplified, generic open-and-close-mouthed animations on 3-D characters, even before the concept of cameras was introduced. Now thanks to motion capture, we have realistically rendered models of characters that are tough to distinguish from the real thing, even when they are walking — that is, until they talk.

For you film folks who are reading this, there are two different schools of thought for speech in games. There are cut-scenes and modes, in which the player has no control over how the story is communicated, and there is gameplay, in which the player can be close to or far from a conversation, with that position changing in real time. Increasingly, the latter is taking precedence, and real-time camera shots become critical to communicating the story without interrupting the player’s control, which, of course, is impossible to predict.

A recent example of a character that looks very realistic statically is John Shepard from the successful role-playing game Mass Effect by BioWare. Shepard is so well-rendered and motion-captured that when his character is idling (a term used for standing still but moving slightly or fidgeting while shifting weight), you might just mistake him for the real thing for a few seconds. Eyelids, eyes and head tracking (which refers to how a head moves independently of a body to observe surroundings) are all extremely lifelike, not unlike the character Aki Ross from the film Final Fantasy: The Spirits Within. However, once Shepard opens his mouth, the realism ends. His lip-sync quality is mostly very unconvincing. Sometimes his mouth doesn’t even move while he is speaking, or he appears to be pronouncing a consonant when he should be pronouncing a vowel.

It is one thing to create a photorealistic model with recorded live motion; it is quite another to successfully implement a system that takes input data from WAV files in real time and generates phonemes, which represent the simplest elements of human speech. In the case of lip-syncing, a phoneme refers to the shape that a mouth makes when it forms vowels, consonants and short phrases. The process is incredibly complex, and the results of systematic approaches have been mediocre at best.
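To make the idea concrete, here is a minimal Python sketch of the last step in such a system: turning a list of timed phonemes into mouth-shape keyframes. It assumes the audio analysis has already happened, and the phoneme labels, shape names and the mouth_keyframes function are all hypothetical, invented for illustration rather than taken from any shipping tool. Notice that a coarse table like this one assigns several sounds the same shape.

```python
# Hypothetical phoneme labels and mouth-shape names, chosen only for illustration.
PHONEME_TO_SHAPE = {
    "AA": "open",     # "ah"
    "AO": "open",     # "aw" gets the same shape as "ah" in this coarse table
    "S": "narrow",
    "W": "rounded",
    "M": "closed",
    "sil": "rest",    # silence
}


def mouth_keyframes(timed_phonemes):
    """Turn (start_seconds, phoneme) pairs into (start_seconds, shape) keyframes.

    Consecutive phonemes that map to the same shape are merged into one
    keyframe, and unknown phonemes fall back to the rest pose.
    """
    keyframes = []
    for start, phoneme in timed_phonemes:
        shape = PHONEME_TO_SHAPE.get(phoneme, "rest")
        if keyframes and keyframes[-1][1] == shape:
            continue  # the mouth is already in this shape
        keyframes.append((start, shape))
    return keyframes


# An assumed phoneme track, as an audio analysis stage might hand it over.
track = [(0.00, "sil"), (0.10, "W"), (0.18, "AA"),
         (0.30, "AO"), (0.42, "S"), (0.55, "sil")]
print(mouth_keyframes(track))
```

The lookup is the easy part. The hard part, which this sketch skips entirely, is extracting reliable phonemes and timings from the audio in the first place.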

For example, say the letter “a” to yourself and pronounce it “ah.” Your lower jaw drops a bit and your upper jaw rises. This is a simple characteristic, but in phoneme lip-sync systems, pronouncing “ah” may not be distinguishable from “aw,” which presents a rounder mouth shape. In addition, going between phonemes, and especially indicating emotion in words, presents further challenges. For example, saying “s” by itself usually draws the lips back, especially when the character is being emphatic. But when saying the word “waves,” the “s” forces the mouth into a more closed position if the character is more relaxed. Films such as Pirates of the Caribbean: Dead Man’s Chest and At World’s End use motion-captured facial animations at the time voice-over is recorded so that the animations will sync exactly with the voice-over files. As a result, we’re convinced that the character Davy Jones (played by Bill Nighy) puts on a great performance, even though most of his head is animated CGI.
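The transition and emphasis problems can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how any particular lip-sync system works: each mouth pose is reduced to two made-up numbers (how far the jaw is open and how wide the lips are), and blend_pose, the POSES table and the emphasis parameter are hypothetical names.

```python
# Hypothetical mouth poses as (jaw_open, lip_width) pairs in the range 0..1.
POSES = {
    "rest": (0.0, 0.5),
    "ah": (0.8, 0.5),
    "s": (0.1, 0.9),  # lips drawn back
}


def blend_pose(from_name, to_name, t, emphasis=1.0):
    """Linearly blend two poses; t runs from 0 (from_name) to 1 (to_name).

    emphasis scales how far the target pose departs from the rest pose, so a
    relaxed delivery (emphasis below 1) produces a smaller movement than an
    emphatic one.
    """
    t = max(0.0, min(1.0, t))
    rest = POSES["rest"]
    start = POSES[from_name]
    # Pull the target toward the rest pose, or push it away from it.
    target = tuple(r + emphasis * (p - r) for r, p in zip(rest, POSES[to_name]))
    # Clamp so an emphasis above 1 cannot leave the valid 0..1 range.
    target = tuple(max(0.0, min(1.0, v)) for v in target)
    return tuple(round(a + t * (b - a), 3) for a, b in zip(start, target))


# The same "s" at the end of a word, delivered emphatically and then relaxed.
print(blend_pose("ah", "s", 1.0, emphasis=1.0))
print(blend_pose("ah", "s", 1.0, emphasis=0.5))
```

A straight linear blend is the simplest possible choice. Real mouths overlap neighboring sounds, which is one reason systems built on this kind of interpolation alone tend to look mechanical.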

Many films (such as Pirates of the Caribbean, pictured) use motion-captured facial animations during voice-over.

The lesson for us game snobs is pretty simple: Either we shape up by providing more convincing systems for real-time phoneme generation using markers or indicators for emphasis and better mid-word/sentence phoneme translation, or we capture lip movement during recordings. In my view, when it comes to an audience that is increasingly demanding more realistic gameplay, there is no in-between, even though there are technical limitations such as memory and CPU resources. You can’t use partial generic animations for facial expressions, nor can you partially use custom animations; that will only create even more disappointment for the player. Don’t set a bar and then go below it. There is a third option, which is incredibly expensive, and that is hand-animating all lip movement. But in games that have 30,000 lines of dialog or more, such methods are out of the question.

Various studios, such as Technicolor Interactive, provide services in their ADR suites for capturing lip movement. The company’s director of audio services, Tom Hays, mentions that the studio runs a Canon GL-2 into a capture box to deliver video takes that are edited alongside dialog takes, and the process also involves a custom teleprompter.

A few middleware options offer the systems-based approach: FaceFX (www.oc3ent.com/facefx.html) is used on Mass Effect and a number of other titles. It provides moderate (but by no means perfect) lip synchronization by analyzing audio files and extracting phonemes that are then translated onto reference points on a character’s face model.

Take a look at the Image Metrics Website (www.image-metrics.com) to see a demo of its technology. You’ll see that the results are pretty impressive, yet still not quite there. (Don’t look at the live actor; just look at the larger, 3-D-rendered face.) Image Metrics uses either straight-on video or a head-mounted camera, and analyzes video data rather than audio data.

Famous3D’s proFACE (http://famous3d.com/3d/index.html) is a more cost-effective solution that also uses either video or motion capture to achieve results. Also, Sega plans to roll out a new facial animation tool called the Magical V Engine, claiming that it is very realistic.

Visage (www.visagetechnologies.com) features an SDK (software development kit) that allows games to incorporate lip-sync animation generated automatically from audio files. A tool that interprets the audio data spits out new data that is applied to “morph targets,” which are a set of vertices controlled individually. (The corner of a mouth could be considered a morph target, the upper lip could be another, and so on; there are dozens that could be applied to the lips alone.) The morph targets then react according to the movement interpreted by the audio interpretation tool.
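For readers who have not worked with morph targets, the following Python sketch shows the general idea in its most common form: each target stores an offset for every vertex, and a weight says how much of that offset to apply. This is an assumption about the usual technique, not a description of the Visage SDK; the function name, the target names and the three-vertex “mouth” are all made up for illustration.

```python
def apply_morph_targets(base_vertices, targets, weights):
    """Return base vertices displaced by the weighted sum of morph-target offsets.

    base_vertices: list of (x, y, z) tuples for the neutral face.
    targets: dict of target name -> list of per-vertex (dx, dy, dz) offsets.
    weights: dict of target name -> weight, usually between 0 and 1.
    """
    result = [list(v) for v in base_vertices]
    for name, weight in weights.items():
        for vertex, offset in zip(result, targets[name]):
            for axis in range(3):
                vertex[axis] += weight * offset[axis]
    return [tuple(v) for v in result]


# A three-vertex "mouth", purely illustrative: left corner, right corner, upper lip.
base = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.5, 0.0)]
targets = {
    "corners_back": [(-0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 0.0)],
    "upper_lip_up": [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.25, 0.0)],
}

# Weights like these are what an audio interpretation tool would supply per frame.
frame_weights = {"corners_back": 0.5, "upper_lip_up": 1.0}
print(apply_morph_targets(base, targets, frame_weights))
```

In a game, the weights would change every frame while the base mesh and the targets stay fixed, which is why the approach is attractive when memory and CPU are tight.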

So to all who are involved in game industry voice-over: Fight for your right to have the lines spoken properly, and consider that the industry is genuinely on the cusp of realistic performances. This is probably why Steven Spielberg has struck a three-game deal with Electronic Arts.

Alex Brandon is audio director at Obsidian Entertainment.