SOUNDS IN SPACE, SPACE FOR SOUNDS: DIALOG MEETS MATH AT M.I.T.

One of the beauties of the academic world is that research in one area often has cross-disciplinary benefits in another. Mechanical engineering breakthroughs might end up benefiting NASA; genetic research can end up being a boon to farmers. Now, sound-for-picture pros may soon reap exciting benefits from research into making impractical locations viable for filmmakers, thanks to some bold work by the folks at the Massachusetts Institute of Technology’s Media Lab and others on the digital frontier.

“Integration of Observations” is the term used at M.I.T. to describe merging a group of still-photographic views of a scene into a three-dimensional digital model of the actual scene, allowing complete freedom of digital manipulation for picture and sound.

Imagine a background matte in which the plate could be viewed from any angle or perspective imaginable and into which the camera could enter and perform moves such as a 360-degree spin. This technique could eventually allow films to be set in locations that are impractical, too dangerous or too small to accommodate a film crew.

Dr. V. Michael Bove Jr., principal research scientist and head of the object-based imaging group at the M.I.T. Media Lab in Cambridge, Mass., along with his students, made a 3-D model of a house from still photographs and integrated actors into the scenes in NTSC video resolution. Bove and his students then did some interesting things with the audio as well.

“We sent a photographer to a house none of us had ever visited and asked him to photograph the entire interior, making sure each photograph overlapped to some degree. Using computers programmed to understand perspective vanishing points, we built a 3-D model of the rooms in that house and shot actors on a three-sided blue screen. We then were able, in post, to synthesize any viewpoint in the house and drop the actors into that view,” Bove explains.

Bove and his researchers also recorded audio and processed the tracks in a similar manner. “The audio sources are assigned to correspond to objects in the video model so that when we render the model to produce an image, we also render the sound in a corresponding way. As a result, the sound level is calculated to correspond to the spatial location of the object as it appears in the video.”
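To make the idea of tying sound level to an object's position concrete, here is a minimal Python sketch. Bove does not say which formula his group uses, so the inverse-distance rule below is an assumption, as are the function names, the coordinates and the reference distance; it only shows how a level could follow an object's location relative to the synthesized viewpoint.

```python
import math

def distance_gain(listener, source, ref_distance=1.0):
    """Assumed inverse-distance rule: full level inside ref_distance,
    then level falls off in proportion to 1/distance."""
    d = math.dist(listener, source)  # straight-line distance in scene units
    return ref_distance / max(d, ref_distance)

def render_level(samples, listener, source):
    """Scale one object's audio samples for a given viewpoint."""
    g = distance_gain(listener, source)
    return [g * s for s in samples]

# Hypothetical scene: the virtual camera sits at the origin and the
# sounding object is four units away along the x axis.
camera = (0.0, 0.0, 0.0)
actor = (4.0, 0.0, 0.0)
quiet = render_level([1.0, -1.0], camera, actor)
```

Moving the virtual camera closer to the object and rendering again would raise the level, which is the kind of picture-to-sound linkage the quote describes.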

Bove thinks this linkage between the visual and audio will eventually result in much more productive post-production sessions. “This approach will combine some aspects of post with production,” he says. “The idea is not to put anyone out of a job but to allow post-production people to be more productive” by removing some of their more mundane tasks and, more importantly, by giving them more creative tools and freedom.

Bove’s group has also been doing exciting work digitally separating different audio sources that have been recorded to the same track. This research might eventually result in eliminating the need for actors to be followed around by boom microphones to obtain clean dialog tracks, and it could also eliminate the need for looping necessitated by unwanted noise on the tracks.

“Imagine a film or video set with several stationary microphones placed around the space,” Bove says. “Say there are two actors speaking at the same time. Each mic will pick up a different mix of the two voices; one mic will have more of one voice, and another will have more of the other voice, depending on the mic’s position relative to the sound sources.

“The key to separating these audio sources is knowing the percentage of each sound that went into each microphone in terms of intensity,” he continues. “To learn those proportions you need a search strategy that will figure out those weightings. We have developed equations that tell you the percentages of each sound source picked up by each mic.”

It is all in the math, of course, Bove says: “If mic 1 gets 70 percent of Source A and 30 percent of Source B, that’s a linear system of equations, and those equations are written as a matrix. You can compute the inverse of a matrix and multiply the output of the various mics by that inverse and get the individual sound sources back again, in effect unmixed, assuming you have as many microphones as sound sources.”
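As a rough illustration of the arithmetic Bove describes, here is a minimal Python sketch of the two-microphone, two-source case. The 70/30 weights for mic 1 come from his example; the 40/60 weights for mic 2, the sample values and the function names are assumptions made up for illustration. The sketch also assumes the weights are already known and that the room adds no reverberation.

```python
# Row i holds the share of each source picked up by mic i.
# Mic 1 (70/30) is from the article; mic 2 (40/60) is an assumed value.
MIX = [[0.7, 0.3],
       [0.4, 0.6]]

def invert_2x2(m):
    """Return the inverse of a 2x2 matrix, or fail if it has none."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("singular matrix: both mics hear the same blend")
    return [[d / det, -b / det],
            [-c / det, a / det]]

def apply_matrix(m, channels):
    """Multiply a 2x2 matrix by two equal-length lists of samples."""
    x, y = channels
    return [[row[0] * xs + row[1] * ys for xs, ys in zip(x, y)]
            for row in m]

# Made-up sample values standing in for two actors' voices.
source_a = [1.0, 0.0, -1.0, 0.5]
source_b = [0.2, 0.8, 0.2, -0.4]

# What each microphone records: a weighted blend of both sources.
mics = apply_matrix(MIX, [source_a, source_b])

# Multiplying the mic signals by the inverse recovers the sources.
recovered = apply_matrix(invert_2x2(MIX), mics)
```

Here `recovered[0]` matches `source_a` and `recovered[1]` matches `source_b` to within rounding error. On a real set the weights are not known in advance, which is why the group needs a search strategy to estimate them.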

However scientifically straightforward that sounds, the real world, as usual, presents some pesky problems.

“If the space in which you’re recording has reverberations, as most do to some degree, then you also have those acoustical effects imposed on the sound sources,” Bove explains. “You have to first undo those effects in order to unmix the various sound sources. Our research is yielding useful results in allowing us to do the unmixing of the overlapping sounds and to undo the acoustical effects of the room. However, at this point we do not have enough separation to make this approach actually useful for recording. We’re not at the point of being able to get rid of boom mics that follow actors around, but that is the direction in which we’re heading.”

This unmixing requires a huge amount of computational power, however. “You’re not going to be doing this on a PC anytime soon,” he acknowledges. “We’re using several 500MHz UNIX Alpha workstations, and even with all that power we’re not doing it in anything like real time.”

Another benefit of this approach to post-production would be the elimination of unwanted sounds from tracks. “Say you’re shooting a 19th-century historical drama and a jet airplane goes overhead during an otherwise wonderful take,” Bove says. “Rather than scrapping the take or bringing the actors into the studio at a later date to loop the dialog, you would be able to move the aircraft sound to another audio channel and have a perfectly clean dialog track.”

This all points to a future where “a lot of the decisions that are being made in visual and audio production will be postponed to post-production. You will be able to change things in post, such as moving the camera three feet to the left, or using a different focal length. And analogous things will be done in audio.

“Basically what will happen is that people who work with real sound and real moving images will enjoy the same degree of freedom that those who work with synthetic audio and video have had for a while,” Bove concludes. “In the virtual world you can manipulate the images and sound in myriad ways and it all still works. But in the real world, you either got it on the set or you didn’t. Increasingly, those working with real sound and pictures will have the same flexibility as those dealing with virtual material.”