Late for the Future

I guess I first noticed it a couple of years ago while I was watching an HD documentary on my local PBS channel. “Welcome to the future!” said the breathless hi-res promo just before the show. But then a talking head appeared, and it seemed that the future was going to be, well, a little delayed. I was seeing a very odd thing on my old 27-inch Panasonic CRT TV set: The person who was supposed to be an expert on whatever the program was about looked more like an actor playing a scientist in a badly dubbed, post-World War II, Japanese atomic-monster movie. The movements of his mouth and the sound I was hearing seemed to have very little to do with each other.


After a few minutes I figured it out: The program was out of sync, and the picture was later than the sound by something like half a second. A few minutes after that, something changed and the video and audio locked up.

In the months to come, I saw this happen a lot. Sometimes when I watched the news on a local commercial station that had switched to HD, the lip-sync would seem to shift whenever there was a remote pickup, and some reporters were more out of sync than others. Sometimes even the anchors were out of sync for a little while, and then somehow it would get corrected.

When television programs relied on videotape and simple studio-to-transmitter coaxial or microwave links, the chances of the sound and picture getting away from each other were essentially nil. When the world of video was analog, if sound and picture left the plant in sync, then you could be sure that they would show up on the viewer’s TV set that way. But as distribution systems became more numerous and elaborate, and digital technology entered the delivery path, opportunities were created for all sorts of gremlins to creep in. One memorable event was described by veteran TV mixer Ed Greene in his Hall of Fame acceptance speech at the recent TEC Awards banquet.

“I was watching an awards show — which I wasn’t doing — that George Lucas, of all people, was speaking at, and he was seconds out of sync. The program was seconds out of sync for 20 minutes. So I called the mixer the next day, and I said, ‘What happened here?’ And he said, ‘Well, when they took the program in back in New York, the primary was sent on fiber and the backup was sent on satellite. And they took picture from one and audio from the other.’ [Cue giggles from the audience.] And then about 10 minutes into the program, they figured out it was wrong, and they switched — both of them. [Screams from the audience.] So I said, ‘How come at the end of it they didn’t fix this for the West Coast [broadcast later]?’ And he said, ‘Well, it was a Sunday, and that would have meant bringing somebody in on overtime.’” Loud groans. (You can watch the video at www.mixonline.com.)

But now that digital television signals are going right to the home, the chances of things screwing up have grown dramatically. And it’s not just a problem with careless engineering. According to a lot of people whose business it is to make audio and video stay together, these problems are built into the system, and they’re not going to go away soon.

In fact, this was the subject of a fascinating, if not particularly well-attended, session at the AES conference this past fall entitled “Audio for HDTV: The Lip Sync Issue.” This seminar featured three presenters, none of whom were particularly optimistic.

Randy Conrod, digital products manager at Harris Corporation, discussed the nature of the problem. When it comes to viewer awareness of sync problems, he said, there are two timing thresholds: detectability, the point at which if the viewer tries to look for problems he will see them; and noticeability, the point at which the viewer notices them without trying. Both thresholds are much shorter if the audio leads the video, which is no surprise if you think about it for a moment. Sound arriving later than vision is part of the natural world, he explained, so when the sound is ahead, “Our brains find the experience particularly rattling.” He also pointed out that once a sync problem is noticed, the viewer is more likely to continue to be aware of it if the sound leads than if the picture leads.

Another member of the panel, Andrew Mason, who is an R&D engineer with the BBC, talked about some preliminary research he has done that suggests the problem is worse with HDTV — the acceptable delay window in either direction seems to be smaller with higher-resolution broadcasts.

Conrod pointed to an additional problem that is caused by delays, over and above their being annoying: When the audio is ahead, even if the timing differential drops below the detectability threshold, speech intelligibility and comprehension go way down. Maybe sportscasters don’t care much whether their mouth movements match their words, but you can bet that the last thing an advertiser wants to hear is that because of some esoteric sync discrepancies, no one in the audience can remember their message.

It’s not hard to understand why — now that the whole broadcast chain is digital — this has become a big problem. Digital audio processing is fast: Typical processing delays in most plug-ins are a few dozen samples or fewer, and even a look-ahead limiter will only delay the signal 1,000 or so samples, which at the lowliest 44.1kHz sampling rate is still way less than a single video frame. But video processing is not fast, and the nature of the way it is clocked means that delays tend to be in multiples of fields (i.e., half-frames) or frames, not samples. Even a single CCD camera might have a delay on its video output, and if you follow it with character generators, switchers, routers, digital video effects, frame synchronizers, and especially compressors and decompressors, then you’ve got a signal chain that can put a significant chunk of time between when something happens in front of the camera and when it shows up onscreen. And the delay is not consistent, as I had noticed: In a news program, adding a picture over the anchor’s shoulder — because the two frames have to be processed and synchronized — will add at least a one-frame delay to the picture. If that picture is of a reporter doing a live remote, then when the reporter goes to full screen, the video delay will jump back.
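To put rough numbers on the audio-versus-video comparison above, here is a minimal sketch in Python. It assumes an NTSC-style frame rate of 30000/1001 fps (about 29.97), which the column does not specify, and the function names are made up for illustration.

```python
from fractions import Fraction

# Assumption: NTSC-style frame rate (about 29.97 fps); the column names no rate.
FRAME_RATE = Fraction(30000, 1001)


def samples_to_ms(samples, sample_rate=44100):
    """Length of an audio delay given in samples, in milliseconds."""
    return 1000.0 * samples / sample_rate


def frames_to_ms(frames, frame_rate=FRAME_RATE):
    """Length of a video delay given in frames (0.5 = one field), in milliseconds."""
    return float(1000 * Fraction(frames) / frame_rate)


if __name__ == "__main__":
    print(f"look-ahead limiter, 1,000 samples: {samples_to_ms(1000):.1f} ms")
    print(f"typical plug-in, 48 samples:       {samples_to_ms(48):.2f} ms")
    print(f"one video field:                   {frames_to_ms(0.5):.1f} ms")
    print(f"one video frame:                   {frames_to_ms(1):.1f} ms")
```

Under these assumptions, a 1,000-sample look-ahead comes to about 22.7 ms against roughly 33.4 ms for one frame, and a few dozen samples is on the order of a millisecond. Video delays, by contrast, stack up a field or a frame at a time.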

So the program originator, whether it’s a network or a local station, has to be able to insert delays into the audio to keep pace with the video. Some video equipment will provide proper audio compensation automatically, but often it has to be done by hand and that can be tricky. One engineer I know at a network affiliate says, “It’s hard to chase it down because it’s going through so many different paths, and it’s hard to see where the delay is happening.” Some stations are using slewing delays on the audio to make these quick changes in sync, but they have had reports from listeners that the slewing is quite audible — and sounds pretty weird.
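The bookkeeping behind that compensation can be sketched in a few lines of Python. This is an illustration only: the stage names and their frame delays are hypothetical, and the 48 kHz audio rate and 30000/1001 fps frame rate are assumptions rather than anything the engineers quoted here specified.

```python
from fractions import Fraction

FRAME_RATE = Fraction(30000, 1001)  # assumption: NTSC-style, about 29.97 fps
SAMPLE_RATE = 48000                 # assumption: a common broadcast audio rate


def audio_delay_samples(video_delays_frames,
                        frame_rate=FRAME_RATE, sample_rate=SAMPLE_RATE):
    """Audio delay (in samples) that matches the total video delay of a chain.

    video_delays_frames maps a stage name to its delay in frames.
    """
    total_frames = sum(Fraction(d) for d in video_delays_frames.values())
    seconds = total_frames / frame_rate
    return round(seconds * sample_rate)


# Hypothetical signal paths; every name and delay value is invented.
full_screen = {"frame_synchronizer": 1, "switcher": 1}
over_shoulder = {"frame_synchronizer": 1, "switcher": 1, "dve_keyer": 1}

if __name__ == "__main__":
    print("full screen:  ", audio_delay_samples(full_screen), "samples")
    print("over shoulder:", audio_delay_samples(over_shoulder), "samples")
```

The point of the sketch is that the required audio delay is a property of whichever path the picture is currently taking, so it changes every time the switcher changes the path. Patching around one stage means removing its entry from the sum.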

They have to be careful with those delays, too. “Sometimes, people patch around equipment they’re having trouble with,” my network-affiliate engineer friend says, “and don’t realize that the delays change.” His station just moved into new quarters and this issue reared its ugly head. “Before we moved, we had a delay line on the audio from Studio A to the equipment room because the audio was analog, but the video was being digitized and reconverted, and that caused a delay. But when we moved, we needed the analog equipment to do something else in the new place, so we patched around it. Everything went out of sync, but it took days before anyone noticed, and there really wasn’t anything we could do about it. A lot of stations are run these days without engineers for much of the day, and if there are lip-sync problems, they literally can’t do anything.”

Compression introduces its own dangers. Kenneth Hunold, a broadcast applications engineer with Dolby Labs who was the third member of the panel, pointed out that decoders and encoders that work with MPEG-2, the most common professional digital video codec, use audio and video “packets” that are intentionally sent out of sync, with delays of many frames. In fact, video frames are often sent in the wrong order, and time stamps and buffers are supposed to make everything right before the signals are output. Although MPEG-2 itself is not the problem, said Hunold, the fact that the delays on the encoders and decoders can be set independently of each other — with one end not knowing what the other end is doing — can lead to big problems.
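A toy model of the time-stamp mechanism Hunold described might look like the Python sketch below. MPEG-2 presentation time stamps count ticks of a 90 kHz clock, so one frame at 30000/1001 fps is 3,003 ticks. Everything else here (the packet layout, the function names, the extra-delay parameters) is a simplification invented for illustration, not how any particular encoder or decoder works.

```python
PTS_CLOCK_HZ = 90_000  # MPEG-2 presentation time stamps count 90 kHz ticks


def presentation_order(packets):
    """Return buffered packets sorted by time stamp, whatever order they arrived in."""
    return sorted(packets, key=lambda p: p["pts"])


def av_skew_ms(audio_pts, video_pts, audio_extra_ms=0.0, video_extra_ms=0.0):
    """Audio output time minus video output time, in ms. Negative: audio leads.

    The *_extra_ms arguments stand in for delays added downstream of the
    time stamps, which the time stamps themselves know nothing about.
    """
    audio_ms = 1000.0 * audio_pts / PTS_CLOCK_HZ + audio_extra_ms
    video_ms = 1000.0 * video_pts / PTS_CLOCK_HZ + video_extra_ms
    return audio_ms - video_ms


# Video frames as they might arrive: the P frame is sent ahead of the B frames.
arrived = [
    {"name": "I0", "pts": 0},
    {"name": "P3", "pts": 9009},
    {"name": "B1", "pts": 3003},
    {"name": "B2", "pts": 6006},
]

if __name__ == "__main__":
    print([p["name"] for p in presentation_order(arrived)])
    print(av_skew_ms(9009, 9009))                      # matching stamps
    print(av_skew_ms(9009, 9009, video_extra_ms=100))  # unreported video delay
```

When both ends honor the stamps, the skew is zero no matter how scrambled the arrival order was. The trouble starts when one side adds a delay the other side is never told about, which is what the extra-delay arguments are meant to mimic.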

Both Conrod and Hunold prescribed a number of measures that originators and broadcasters can take to keep these problems at bay. Good old-fashioned clap boards, 2-pops and beep-and-flash frames might seem anachronistic in the digital age, but, in fact, they can be tremendously helpful in determining proper sync in a complex signal chain. Testing, calibrating and documenting video-processing equipment are important so that the broadcaster knows what each stage does to the signal and can compensate intelligently.

But no matter how hard broadcasters may work to ensure the integrity of their programs, once a DTV signal leaves their hands, they are helpless against the biggest threat of all: the “STB” — the set-top box in the viewer’s home. Why is this the case? Because when the Federal Communications Commission adopted the new DTV “standard,” it gave the industry no fewer than 18 different formats to play with. (In a regulation-averse political climate, the FCC somehow reasoned that specifying a single format — as it did with AM radio, FM stereo, color TV and stereo analog TV, and we all know what failures those media have been — would be “anti-business.”)

If a TV manufacturer wants to have a ghost of a chance of succeeding in the market, that company has to make sure that all of its new sets can handle all of the different flavors. A flat-panel display with a fixed number of pixels can have only a small selection of closely related native formats, which will, by necessity, be only a subset of the possible formats it will be asked to show. That means there will be lots of circuitry to deal with varying frame rates, line counts, aspect ratios and scanning formats (interlaced vs. progressive). Well, guess what? All that conversion takes time! How much time? It depends, but in most cases where the signal has to be de-interlaced or scaled, you can count on it being at least 200 ms.

Now, if you are getting audio right from your television set, then it should be smart enough to delay the audio automatically to match the conversion delay that it’s imposing on the video. But a whole lot of people who want to enjoy DTV — especially HDTV — have separate audio systems (with their DVD and TiVo players connected to them), and those audio systems, which are fed right from the cable box or DTV receiver, don’t know nothin’ ’bout no delays. So here we are again: audio leading video by a very noticeable, and very annoying, amount.

So now what do we do? Well, if you’re very conscientious, then you can read all the documentation that comes with your TV set and surround sound system, and try to figure out what kind of delay is being introduced on the video side and how to set your Dolby or DTS or whatever decoder to match it. But not everyone receiving DTV is going to want to deal with this level of complexity (there are still people, after all, who think the rear speakers of their 5.1 system belong in the spare bedroom), so it’s not a particularly practical solution.

Is there such a thing as a practical solution? According to the panelists at the AES session, the answer is no, not really. Hunold suggested that information about status and delay times of different devices in the home could be shared among them. But this data would not come from the broadcaster because the broadcaster has no idea of what the viewer’s system is; instead, it would have to be generated locally and distributed through some kind of home network, perhaps using FireWire or Bluetooth or even infrared.

Until that happens, and until every manufacturer of home audio and video equipment gets onboard with it — which means, basically, never — you’re on your own. Feel free to blame the government for being too chicken to keep a tight enough handle on DTV so that you can watch the news without getting nauseous. As Greene said in his speech, with tongue only slightly in cheek, “If a program gets through in good shape, it’s almost a mistake.” If it doesn’t, you can always pretend you’re watching a Godzilla flick.

Paul D. Lehrman sometimes has trouble catching up with himself. You can catch up with his past 10 years of scribblings in The Insider Audio Bathroom Reader, available from mixbooks.com and the usual suspects.