How many Elvis clones does it take to change a lightbulb? None. We have machines for that now. Remember when drummers were the butt of that joke? It was after the first programmable drum machines came out. But they weren’t alone for long. Soon, bass players, horn players, string ensembles and even keyboard players were being rendered obsolete by advances in synthesis, sampling, control, DSP and AI technology. Studio musicians adjusted and learned to use the new tools, or else they stuck to playing weddings or got out of the business entirely. But a few groups of traditional music-makers managed to keep themselves working in their old ways, hovering a safe distance above where inanimate surrogates could threaten them. Among them were virtuoso horn and string players and, of course, singers.
This is not to say that singers have been totally exempt from electronic competition. I used my first choral sample on a session in 1977. It came from an optical disk-reading device called a Vako Orchestron (if you want to hear what it sounded like, check out www.hollowsun.com/vintage/orchestron), and it was pretty startling in its realism, at least if I made sure to keep it in the back of the mix. A few years ago, a cappella group Take 6 did a marvelous disc of scat-syllable samples for Kurzweil, and when these samples are used carefully, they’re good enough to be right up front. But real singers singing real words have avoided being replaced by electronics. So far.
Now, the end of vocalists’ immunity may be in sight. Thanks to developments in several areas of audio technology, good singers may find themselves supplanted by not-so-good ones, and all of them by some very clever software. Interestingly, none of this stuff emerged from the think tanks and labs that usually supply us with wondrous new toys, like MIT, Stanford and IRCAM. It’s resulted from collaborations between companies and institutions in England, Japan, Denmark, Western Canada and Spain. And a lot of it came about not from the demands of high-end studios and post houses, but from that bastard stepchild of the pro audio industry — karaoke. And it’s all pretty cool, if not downright scary.
One of the consortia at the forefront of this is TC-Helicon, a relatively new company combining the resources of the Danish TC Electronic and Victoria, B.C.’s own IVL Technologies. The Danish partner started corporate life in 1976 with guitar stomp boxes, eventually working its way up the audio food chain to its current high-end lines, which include the System 6000 reverb, Finalizer 96k mastering processor and the PowerCore PC-hosted audio system. The Canadian company dates from the early 1980s. From its beginning, IVL Technologies led the field in pitch-recognition technology, providing the guts for products like the Ovation/Takamine MIDI guitar systems. Though the company has produced a few products under its own name, most notably the IVL Pitchrider, most of its output has found its way into other companies’ products, like those of Digitech (whose harmony generators are based on IVL’s technology), Mackie, Korg, dbx, TC Electronic and even Yamaha. And they’re big in Japan: Many high-end karaoke rooms, in which the owners have invested upward of $15,000 in their sound systems, use IVL-based voice processing to make the paying customers sound far better than they are.
While TC-Helicon is not creating synthetic singers, the company is quite intent on improving the singers whom you happen to have or turning them into other people — or other creatures entirely, if that’s what you want. The company was born in 2000 when the two firms decided to merge their parallel paths in computer modeling: TC was working on acoustic spaces while IVL was working on the human voice. The results include voice-modifying software that runs on TC’s hardware, and hardware dynamics and reverb processors that also do pitch correction, timbral modification and harmonization. And they’re getting a lot of notice.
Kevin Alexander, an IVL veteran and now managing director of the new company, sees two approaches to voice processing. “One is to manipulate the voice to enhance it and make it better, keeping it natural-sounding so the listener doesn’t know processing is happening,” he explains. “Most people who sing or speak like their voice and their sound and don’t want someone to change it, like a guitar player doesn’t want to change into a tuba. The other is to treat the voice as an instrument and work toward creating new voices and effects.
“It’s totally accepted that the voice can be manipulated in ways other than EQ, compression and slapping on some reverb,” he continues. “Pitch correction is now a fact in most pop music, and it’s there whether the singer really needs it or not.” In fact, pitch correction is nothing new: New England Digital’s Synclaviers were fixing famous divas’ mistakes in the late ’80s, and by the mid-’90s, Opcode Systems’ StudioVision had a slick pitch-to-MIDI-to-pitch function that let you literally draw new pitch contours onto a vocal line. But TC-Helicon is putting a lot of effort into making the process more transparent, both to the engineer and to the listener.
“Two years ago, you couldn’t do this much voice processing without it being audible,” says Alexander. “Now there are artists known for their ‘pure’ voice work, who are actually getting pitch correction and modeling. Maybe they had a cold that day or they don’t sound as good in the upper range. To make it transparent takes some manual control using graphical edit. If you know what key the tune is in, you can set the key and scale, the attack time, correction window and amount. They work 80 percent of the time, but that means they don’t work 20 percent of the time. We want to get them better so they work almost 100 percent of the time so the amount of input necessary goes to zero.”
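For readers who want to see how those knobs might interact, here is a minimal sketch of a scale-based pitch corrector. It is not TC-Helicon's algorithm. The function names, the default values, and my readings of the parameters are all assumptions: "window" is taken to mean how far off a note can be and still get corrected, "amount" is how far it is pulled toward the target, and "attack" is how quickly the correction is applied from one analysis frame to the next.

```python
import math

# Semitone offsets of a major scale from its tonic (an assumed scale choice).
MAJOR_SCALE = (0, 2, 4, 5, 7, 9, 11)

def hz_to_midi(freq_hz):
    """Frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)

def midi_to_hz(note):
    """Fractional MIDI note number back to frequency in Hz."""
    return 440.0 * 2.0 ** ((note - 69.0) / 12.0)

def nearest_scale_note(note, key=0, scale=MAJOR_SCALE):
    """Closest note of the given key and scale to a fractional MIDI note."""
    base = math.floor(note)
    candidates = [n for n in range(base - 12, base + 13)
                  if (n - key) % 12 in scale]
    return min(candidates, key=lambda n: abs(n - note))

def correct_pitch(freq_hz, key=0, scale=MAJOR_SCALE,
                  window_cents=80.0, amount=1.0):
    """Pull one detected pitch toward the nearest scale note.

    Errors larger than window_cents are left alone (assumed meaning of
    "correction window"); amount scales the pull from 0.0 to 1.0.
    """
    note = hz_to_midi(freq_hz)
    target = nearest_scale_note(note, key, scale)
    error_cents = (target - note) * 100.0
    if abs(error_cents) > window_cents:
        return freq_hz  # too far off: possibly a deliberate scoop or slide
    return midi_to_hz(note + amount * (target - note))

def smooth_correction(freqs_hz, frame_ms=10.0, attack_ms=50.0, **kwargs):
    """Apply the correction gradually across frames (assumed "attack time")."""
    coeff = 1.0 - math.exp(-frame_ms / attack_ms)
    shift = 0.0  # pitch shift currently applied, in semitones
    out = []
    for f in freqs_hz:
        wanted = hz_to_midi(correct_pitch(f, **kwargs)) - hz_to_midi(f)
        shift += coeff * (wanted - shift)  # one-pole glide toward the target
        out.append(midi_to_hz(hz_to_midi(f) + shift))
    return out
```

Even in a toy like this, you can see where the 20 percent comes from: the code has no way of knowing whether a note that sits 90 cents flat is a mistake or a stylistic choice, so someone has to set the window by ear.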
The key to effective correction and modeling, Alexander says, is in the analysis of the voice: “How do you break down the voice into its elements? Pitch, formant structure, voiced versus unvoiced consonants. With all of these, you can always go into more detail. We separate the voice into those components and process them all separately and recombine them. The more you can break it down, the better the modeling sounds. And with better analysis, you have more flexibility. So you are asking, what else can you use pitch detection for to drive dynamics processing and EQ? What else can you use detecting voiced and unvoiced sounds for? What’s important?
“For example, Spectral, our voice-modeled EQ, detects these factors and can choose different EQ for different sounds,” he adds. “So it can punch the voice without adding too much ‘ess’ sound. Old approaches looked for unvoiced sounds by analyzing the noise content or transients, but our algorithm actually trains itself to be much more discriminating.
“Our first product, Intonator, had an adaptive low-cut filter that changes with the input pitch. It gets rid of hum, but it also gets rid of room noise. Pitch correction sounds better when you do this, since you’re not pitch-shifting the noise.
“Getting the right level of correction of the voice is the hardest thing,” he adds. “You can’t mess with pitch contour and vibrato, since that alters the style of the singer. For example, someone can scoop to a note and get a stable, confident pitch. You don’t want to change that. But there are a lot of heuristics that you can apply. With more intelligence in the tools, you can get more done. Of course, it’s much easier if you have the file and you’re not trying to do it in real time, so you can look forward and back for stylistic characteristics.”
Taking a completely different tack is Yamaha’s Vocaloid. It’s been getting lots of press, not only from our industry but also from the likes of Popular Science and The New York Times because, at first glance, it looks like it does something far more radical than expand the range of human singers: It could actually make them obsolete.
Like a lot of technologies that Yamaha has brought to the music industry, the company didn’t invent this one itself. It actually came out of the 10-year-old Music Technology Group at the Universitat Pompeu Fabra in Barcelona, Spain. The group is currently involved with (or just finished) more than a dozen research projects that deal with subjects such as algorithmic orchestration, machine recognition of music, how to intelligently browse millions of online music files, and developing and distributing free Linux- and GNU-based tools for audio.
The group began working with Yamaha in 1996 on frequency-domain-based signal processing, and its initial research resulted in a paper titled “Voice Morphing System for Impersonating in Karaoke Systems”; in other words, “how to make anyone who walks into a bar sound like Elvis, Celine Dion or Tom Waits,” which doesn’t sound terribly different from things IVL has done. However, Yamaha decided that the idea couldn’t be made into a marketable product, and in March of 2000, the consortium switched its attention to vocal synthesis. The first product was released this past fall.
While the Spanish group developed the basic technology, Yamaha built the application engine and development software. The voice “libraries” that are sold with the product are from yet another source: The English company Zero-G came up with Leon and Lola, the male and female “vocalists” that you get when you purchase Vocaloid in Europe or America. And by the time you read this, the company will have released a third library, Miriam, which is based on the voice of Miriam Stockley, a solo and backup singer well-known in Europe. Yamaha also built several virtual vocalists who sing in English and Japanese, but they’re only for in-house use.
The way that you create a library, very simply, is this: You make recordings of the subject’s voice, put them through proprietary frequency-domain processing and break them up into segments, or phonemes. The segments are put into a database, where front-end software extracts them. Besides playing the basic phonemes, the software also needs to interpolate between segments so that transitions come out smooth.
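To make the "interpolate between segments" step concrete, here is a rough sketch of the general idea. It swaps in a plain time-domain linear crossfade, which is far simpler than the proprietary frequency-domain processing the product uses, and the function names and the dictionary-style database are hypothetical stand-ins of my own.

```python
def crossfade_concat(segments, overlap):
    """Join lists of samples, linearly crossfading `overlap` samples per joint.

    A time-domain stand-in for smoothing transitions between segments;
    it is not the frequency-domain method described in the text.
    """
    out = list(segments[0])
    for seg in segments[1:]:
        n = min(overlap, len(out), len(seg))
        for i in range(n):
            w = (i + 1) / (n + 1)        # fade-in weight for the new segment
            idx = len(out) - n + i       # matching sample in the old tail
            out[idx] = (1.0 - w) * out[idx] + w * seg[i]
        out.extend(seg[n:])              # append the part that did not overlap
    return out

def render(phonemes, database, overlap=4):
    """Look up each phoneme's samples in a (hypothetical) database and join them."""
    return crossfade_concat([database[p] for p in phonemes], overlap)
```

With a made-up database such as `{"l": [...], "a": [...]}`, calling `render(["l", "a"], database)` returns one continuous list of samples in which the end of the first segment fades into the start of the second.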
Immodestly, I have to admit that this is an idea I came up with a number of years ago. Why not create a sample set, I postulated, containing all of the known phonemes in the English language, assign them to MIDI notes and then control their pitch and inflection with various MIDI controllers and pitch bend? Of course, I never attempted to actually make such a system, and now I’m glad I didn’t; it would have been a heck of a lot more work than I realized. According to Hideki Kenmochi, group VP at Yamaha’s Advanced System Development Center, it takes some two to three months for a team of engineers (who know what they’re doing and use specialized software) to develop a single voice, and the resulting database fills a CD-ROM.
Yamaha licenses its development software to very few companies: “It’s not very user-friendly,” says Kenmochi. “There will probably be more voices available soon from other sources, but their identities are so far not public knowledge.”
So how is it? Well, if you’ve heard the demos, you know that Vocaloid can be pretty impressive. But in the small amount of time I’ve worked on it, and the slightly larger amount I’ve helped a student create a project on it, I’ve found that getting really useful results out of Vocaloid is far from trivial. Yes, creating perfectly tuned, pristinely modulated voices is not too hard, but just like making a drum machine sound like a real drummer, getting Vocaloid to sound close to human takes a whole lot of work. Drummers have always made the best drum programmers (at least until the machines’ mechanical feel somehow became their own reward), and similarly, you have to know quite a bit about singing to get anything interesting out of the program: phrasing, timing, how to use vibrato, when and how to scoop and slide the pitch, when to voice and un-voice consonants, and the many, many other parameters that determine whether a vocal is actually conveying meaning and emotion or is just carrying a tune. The amount of control the software gives you over the generated vocal track is phenomenal, but you have to know what you’re doing to take advantage of even a small portion of it.
Using the Vocaloid engine for speech synthesis is not yet possible, says Kenmochi, because speech requires complex nonmusical pitch contours to sound realistic, and the pitch and timing of speech has to follow complex linguistic rules: The relationships between words in a sentence have a major bearing on how the words are spoken. I suppose if you have a terrific musical ear, then you might be able to fake speech using the tools that Vocaloid provides, but it would require an even more formidable amount of tweaking, much of it nonintuitive, and I suspect that you’d never really get away from having it sound “sing-song.”
But it’s just this problem that may prove to be Vocaloid’s major limitation. Although you specify pitches and rhythms while writing a song, a real singer doesn’t perform the song exactly that way, any more than a real saxophonist plays a chart precisely the way it’s written. Beyond what a saxophonist does, however, a singer uses words, and when a vocal performance is convincing, it’s in large part because the words inform the music. The inflections of speech are laid on top of the notes and rhythms, altering them in minute but critical ways. Because Vocaloid doesn’t “understand” what it’s singing, there’s no way to automate that, and the amount of work involved in doing such exquisite adjustments by hand could be truly daunting. For a few phrases here and there, Vocaloid seems manageable, but to create a whole song, making sure each and every syllable sounds right and real, seems like far more trouble than it’s worth.
And here’s another thing to consider: When you want to use synths or samplers to simulate horn parts in a sequence or recording, by far the best way to make them sound at all realistic is to play the parts in “live,” ideally using some kind of wind or breath controller, but a keyboard will do fine if you know how to simulate a horn’s natural phrasing, breathing, vibrato and so on. (To me, it’s heartbreaking to see classical composers invest thousands of dollars in computers and soundware so that they can hear what the music sounds like without hiring an orchestra, and then enter all of the music into the system by drawing notes into Finale or Sibelius.)
However, you can’t record into Vocaloid in real time: not with a microphone, a MIDI keyboard or anything else. “Information must be sent to the engine far in advance,” says Kenmochi, “so the software has time to set up the phonemes. For example, a consonant at the beginning of a syllable may actually sound before the beat that the syllable is on so that the vowel can happen right on the beat.” Of course, a look-ahead buffer on a fast machine might overcome that, but then there’s the problem of what to use for an input device. You can import MIDI files created elsewhere into the software, which can be helpful in that this allows you to play an instrumental line and have the synthetic vocalist more or less follow it. But there’s no direct feedback from the vocal sound while you’re inputting, which is a crucial part of achieving musical realism.
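The timing problem Kenmochi describes is easy to see in a small sketch. This is my own illustration, not Vocaloid's scheduler; the function names and the millisecond durations are invented for the example.

```python
def schedule_syllable(beat_ms, consonant_ms, vowel_ms):
    """Return (kind, start_ms, end_ms) events with the vowel landing on the beat.

    Any leading consonant is placed *before* the beat, so the engine has to
    know about the syllable at least consonant_ms ahead of time.
    """
    events = []
    if consonant_ms > 0:
        events.append(("consonant", beat_ms - consonant_ms, beat_ms))
    events.append(("vowel", beat_ms, beat_ms + vowel_ms))
    return events

def required_lookahead_ms(syllables):
    """Longest consonant lead-in among (beat_ms, consonant_ms, vowel_ms) tuples."""
    return max((consonant for _, consonant, _ in syllables), default=0)
```

A syllable on a beat at 1,000 ms with an 80 ms consonant has to start sounding at 920 ms, which is why a note played "now" on a keyboard arrives too late for the vowel to land where you played it, unless everything else is delayed to match.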
Conversely, the software generates MIDI files so that you can import a Vocaloid “track” into another — VSTi-compatible — sequencer, which then uses the track to play Vocaloid as a VST instrument in sync with whatever other MIDI or audio tracks you want. But if you look at the Vocaloid tracks, you’ll see a bewildering procession of dozens of text events, followed by hundreds of Non-Registered Parameter Number controller commands and no notes at all. In other words, although they’re legal MIDI files, no other device on earth could generate them or understand them.
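If you have never looked at a Non-Registered Parameter Number up close, this sketch shows how one is usually assembled: two controller messages (CC 99 and CC 98) select a 14-bit parameter number, and two more (CC 6 and CC 38) carry a 14-bit value. What a given number means is left to each manufacturer, which is why nothing else can interpret these tracks. The parameter and value used in the usage note are arbitrary placeholders, not real Vocaloid parameters, and `nrpn_messages` is a name I made up.

```python
def nrpn_messages(channel, param, value):
    """Encode one NRPN change as four MIDI Control Change messages.

    channel: 0-15; param and value: 0-16383 (14 bits each).
    Returns the raw bytes: CC 99/98 select the parameter, CC 6/38 set its value.
    """
    if not 0 <= channel <= 15:
        raise ValueError("channel must be 0-15")
    if not 0 <= param <= 0x3FFF or not 0 <= value <= 0x3FFF:
        raise ValueError("param and value must fit in 14 bits")
    status = 0xB0 | channel  # Control Change status byte for this channel
    return bytes([
        status, 99, (param >> 7) & 0x7F,   # NRPN parameter number, MSB
        status, 98, param & 0x7F,          # NRPN parameter number, LSB
        status, 6, (value >> 7) & 0x7F,    # Data Entry, MSB
        status, 38, value & 0x7F,          # Data Entry, LSB
    ])
```

Calling `nrpn_messages(0, 130, 388)` yields 12 bytes: four three-byte controller messages on channel 1. Multiply that by hundreds and you have a fair picture of what one of these tracks looks like in an event list.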
Will Vocaloid overcome these problems? Quite possibly. Many of us are old enough to remember when it was an ironclad rule that the maximum number of simultaneous audio tracks you could ever hope to stream off a hard disk was four. Computers will get faster, synthesis engines will get more efficient and perhaps we’ll see the day when someone with a 3-D holographic control console will produce stunning vocals — in any language, in four- and six-part counterpoint, in real time — by blowing through a straw and waving his or her fingers in the air. Or maybe we’d just rather hear somebody sing.
So while you might be able to save a few bucks on backup singers by using Vocaloid (if you get really good at it and the parts are pretty simple) or a TC-Helicon voice processor, I wouldn’t fire your lead singer just yet.
Paul Lehrman has invented many things that he never got around to making, which is just as well, as he enjoys telling the people who do make them what they did wrong.