


Pro Audio and Machine Learning: Ready for Prime Time?

By Craig Anderton. Will machine learning make audio engineers obsolete?

When people said synthesizers would replace musicians, my standard reply was, “Who do you think plays them? Accountants?” And now some engineers wonder whether they’re going to be replaced by machine learning—but those machines may become our assistants, not our masters.

Machine learning is a machine’s ability to analyze data, learn from it, and derive outcomes based on what it learns. It is essentially an application of the broader concept of artificial intelligence, in which machines carry out tasks in a way that simulates human intelligence. Machine learning comes in two main flavors. With supervised learning, a machine trains on examples whose correct answers are known, so it can predict future events; it can also compare its predictions to what actually occurs and modify its future behavior accordingly. Unsupervised learning is more about making sense of a set of data (finding patterns and groupings) without being told what to look for. It is supervised learning that will likely have the most impact on pro audio in the short term.
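To make the distinction concrete, here is a minimal, hypothetical sketch in pure Python. The "supervised" function predicts a label for a new loudness measurement from labeled examples; the "unsupervised" function groups the same numbers with no labels at all. The data, labels, and function names are invented for illustration.

```python
# Toy loudness measurements (hypothetical numbers, no external libraries).

def nearest_neighbor_predict(labeled, x):
    """Supervised: predict a label for x from (value, label) example pairs."""
    value, label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return label

def split_into_two_clusters(values, passes=10):
    """Unsupervised: group values around two centers (crude 1-D k-means)."""
    lo, hi = min(values), max(values)
    for _ in range(passes):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo = sum(a) / len(a)  # recenter each group on its average
        hi = sum(b) / len(b)
    return sorted(a), sorted(b)

# Supervised: past mixes labeled "quiet"/"loud" predict a new one.
history = [(-23.0, "quiet"), (-20.0, "quiet"), (-9.0, "loud"), (-7.0, "loud")]
print(nearest_neighbor_predict(history, -21.5))  # -> quiet

# Unsupervised: the same numbers, grouped with no labels at all.
print(split_into_two_clusters([-23.0, -20.0, -9.0, -7.0]))
```

Real systems use far richer models, but the division of labor is the same: supervised learning needs answers to train on, unsupervised learning finds structure on its own.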

Adobe VoCo, a prototype of which was showcased at Adobe Max in November 2016, is a good example of technology on the edge of machine learning. Consider this: You’ve recorded and edited narration for a commercial, but the client decides that “King Rex” dog food needs to be changed to “King Canine” dog food. Unfortunately, your talent is on vacation and unavailable for an ADR session. Fortunately, you’ve got access to VoCo, which is theoretically able to generate new audio to change words in a voiceover. (Adobe VoCo is not a commercially available product.)


Reduced to basics, you type “canine,” VoCo searches through the existing audio takes, finds the phonemes (individual sounds that make up words) needed to create “canine,” and substitutes them seamlessly for “Rex.” Because the sound comes from the original speaker, the generated word sounds far more natural in context than, for example, Siri. Of course, an engineer could spend hours trying to do the same thing. But VoCo can analyze massive amounts of data and create speech that never existed before—not just edit existing speech—because it has been trained to recognize phonemes.

To take this idea further, I was recently on a panel with an engineer who couldn’t conceive of a world where we don’t comp vocals. Neither can I—but how we assemble those takes into a final vocal might be a job for machine learning. Such a machine could analyze the various takes, then pick the parts that are the most intelligible, have the best pitch, include spaces between words to allow for seamless transitions, and offer consistent timbre. Furthermore, it could present a few options and let you pick your favorite, which would become part of the data it uses to make future decisions about your preferences when stringing vocals together. You could set constraints (every phrase must be at least three seconds long, for example) or deal with the comping on a more granular level.
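As a rough illustration of the kind of decision such an assistant might make, here is a hypothetical sketch: each take gets scores for pitch, intelligibility, and timbre (all numbers are invented, and the weights stand in for learned user preferences), and a minimum-length constraint filters the candidates before the highest score wins.

```python
# Hypothetical per-take metrics (0 to 1, higher is better) and
# preference weights; in a real system both would be learned.

def score_take(take, weights):
    """Weighted sum of a take's quality metrics."""
    return sum(weights[key] * take[key] for key in weights)

def pick_comp(takes, weights, min_length=3.0):
    """Pick the best-scoring take at least min_length seconds long."""
    eligible = [t for t in takes if t["length"] >= min_length]
    return max(eligible, key=lambda t: score_take(t, weights))

takes = [
    {"name": "take 1", "length": 4.2, "pitch": 0.90, "intelligibility": 0.70, "timbre": 0.80},
    {"name": "take 2", "length": 2.1, "pitch": 0.95, "intelligibility": 0.90, "timbre": 0.90},
    {"name": "take 3", "length": 3.5, "pitch": 0.80, "intelligibility": 0.85, "timbre": 0.75},
]
weights = {"pitch": 0.4, "intelligibility": 0.4, "timbre": 0.2}

best = pick_comp(takes, weights)
print(best["name"])  # take 2 scores highest but fails the 3-second constraint
```

When you pick a different comp than the machine suggested, a learning system would nudge those weights toward your choice; that feedback loop is what separates machine learning from a fixed formula.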

In this case, the intelligent machine doesn’t replace us, but it makes our tasks easier. Ultimately we still have to judge whether the machine’s comp worked or not, but as it assimilates more data about our preferences, it will do its job more efficiently. (If you don’t believe me, aren’t you finding that the endless ads peppering your online existence are ever-closer to your particular interests?)

Another simple, practical example involves mastering. Although many of my clients want a master with maximum levels, I prefer more dynamics. To split the difference, I analyze a mixed file and look for half-cycle peaks that exceed a threshold of, say, -3 dB below full scale. There can easily be 30 or 40 such peaks. I can then normalize each half-cycle down to -3 dB, which allows raising the overall level by 3 dB without introducing artifacts like pumping, while maintaining the dynamics. But doing this over and over is tiresome. A machine learning process could analyze the audio, reduce the peaks automatically, run a LUFS analysis, and finally add enough limiting or maximizing (if needed) to hit -14 to -12 LUFS. Over time, as the process generated more files and learned which ones I preferred, it would learn to strike the right balance of half-cycle normalization, dynamics processing and overall level to achieve the desired result. This would help automate a pretty tedious process.
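The peak-taming step could be sketched along these lines. This is a simplified illustration, assuming audio arrives as a list of float samples between -1 and 1; the function names are my own, and a real tool would work on files and do proper LUFS metering afterward.

```python
# Find half-cycles (spans between zero crossings) whose peak exceeds
# a -3 dBFS threshold, and scale each one down to that threshold.

def db_to_linear(db):
    """Convert a dB value to a linear amplitude factor."""
    return 10 ** (db / 20.0)

def half_cycles(samples):
    """Yield (start, end) index spans between sign changes."""
    start = 0
    for i in range(1, len(samples)):
        if samples[i] * samples[i - 1] < 0:  # zero crossing
            yield start, i
            start = i
    yield start, len(samples)

def tame_peaks(samples, threshold_db=-3.0):
    """Normalize each offending half-cycle down to the threshold."""
    threshold = db_to_linear(threshold_db)  # -3 dBFS is about 0.708
    out = list(samples)
    for start, end in list(half_cycles(out)):
        peak = max(abs(s) for s in out[start:end])
        if peak > threshold:
            gain = threshold / peak
            for i in range(start, end):
                out[i] *= gain
    return out

audio = [0.2, 0.9, 0.3, -0.1, -0.95, -0.4, 0.5]
tamed = tame_peaks(audio)
print(max(abs(s) for s in tamed))  # no sample now exceeds -3 dBFS
```

Because each half-cycle is scaled by its own factor and the waveform still crosses zero in the same places, the result avoids the pumping a broadband gain change would cause; the whole file can then be raised 3 dB, as described above.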


The primary use of machine learning will probably be in marketing, because the process can sift through amounts of data that would be dauntingly large for humans, recognize patterns, and target people with very specific messages.

There are also going to be audio applications of this technology, and it’s not going to make engineers obsolete. Likely, it will just make our jobs that much easier, at least in the near future.

Author/musician Craig Anderton posts news and tips every Friday. His latest album, Simplicity, is now available on Spotify and cdbaby.