By now, you’ve probably come across some appliance or service that recognizes human speech: your cell phone, perhaps, or the customer service call-in line at your credit card company. What you may not have realized is that a related technology is at work, instigated by “The Man,” and put in place to listen in on radio and TV transmissions solely to recognize songs and performances. Why would anyone set up these little music spies? What’s going on with this technology?
There are several different machines that recognize audio, whether it is speech or music. By and large, they all share one thing in common: These machines “listen to” and process a sample of any material that they later recognize or match. This is an application of heuristics, learning from practical experience. Several specialized audio recognizers of human speech are available from IBM, MacSpeech and ScanSoft, and I can tell you from seemingly endless hours of “practical experience” that machine recognition of continuous or natural speech is one of the toughest problems in computing.
In contrast, music recognition is a good bit easier, as any particular performance, once it’s recorded, is “etched in stone,” so to speak. The spectral makeup, timing and amplitude variations are fixed; only global gain changes, noise and distortion are added when the performance is reproduced. That fact has prompted several vendors to sell recognition tools and services. One of these is Comparisonics Corporation, maker of the findsounds.com service. Findsounds.com lets you type in descriptors, and its engine will return the URLs of sites that host sounds matching your needs. This can be useful for multimedia producers and musicians who are hunting for that perfect effect or sample. Another heuristic audio search product is SoundFisher, a cross-platform database-management system featuring content-based recognition, matching and retrieval.
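To see why a fixed recording is an easier target than live speech, here is a toy sketch in Python. It is not any vendor’s method; the function names, frame size and test signals are all made up for illustration. It reduces a signal to one bit per frame boundary (did the energy rise or fall?), so a global gain change cancels out and a little added noise flips few, if any, bits.

```python
import math
import random

def frame_energies(samples, frame_size=256):
    """Sum of squared samples for each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def fingerprint(samples, frame_size=256):
    """One bit per frame boundary: 1 if energy rose, 0 otherwise."""
    e = frame_energies(samples, frame_size)
    return [1 if b > a else 0 for a, b in zip(e, e[1:])]

def similarity(fp_a, fp_b):
    """Fraction of matching bits over the shorter fingerprint."""
    n = min(len(fp_a), len(fp_b))
    return sum(a == b for a, b in zip(fp_a, fp_b)) / n

def make_tone_sequence(seed, n_frames=64, frame_size=256, rate=8000):
    """Hypothetical stand-in for a recording: random-level tone bursts."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_frames):
        amp = rng.uniform(0.1, 1.0)
        freq = rng.choice([220.0, 330.0, 440.0, 660.0])
        out.extend(
            amp * math.sin(2 * math.pi * freq * t / rate)
            for t in range(frame_size)
        )
    return out

original = make_tone_sequence(seed=1)

# "Playback": same performance, lower gain, a little added noise.
noise = random.Random(99)
playback = [0.3 * x + noise.gauss(0.0, 0.001) for x in original]

# A different "performance" altogether.
other = make_tone_sequence(seed=2)

same = similarity(fingerprint(original), fingerprint(playback))
different = similarity(fingerprint(original), fingerprint(other))
print(f"original vs. playback: {same:.2f}")
print(f"original vs. other:    {different:.2f}")
```

The quieter, noisier playback should score near 1.0 against the original, while the unrelated sequence should hover around chance. Real products presumably work from much richer spectral features, but the principle of matching against a stored, unchanging reference is the same.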
A more interesting and difficult application of music-recognition technology deals with digital-rights management and performance metrics. This is where those machine spies come in. Two companies, Audible Magic and Relatable, are using their audio feature-identification smarts to monitor network traffic, especially P2P activity, recordings on optical and magnetic media and radio broadcasts. Audible Magic, in particular, has acquired quite a few companies, including SoundFisher’s developer, in an effort to be the one-stop shop to control content in modern media’s chaotic world.
Both Relatable and Audible Magic have products that “sniff” IP packets and “listen” to the audio being carried within file transfers. They’ve tried to go beyond mere identification to actually block illegal files, but so far, it hasn’t worked as planned. The computational and network resources to recognize, validate and block illegal music-carrying packets in real time are still some ways away.
A third company, the solution provider formerly known as Cantametrix, is now part of Gracenote, those CDDB guys. For those of you who don’t get out much, CDDB is the largest commercial database of CD metadata, which many MP3 player applications rely on to provide disc and song titles. According to Gracenote, its “information services are used by leading media players including AOL’s Winamp, Apple’s iTunes and RealNetworks’ RealOne Player.” Leading CE manufacturers, including Pioneer, Philips and Sony, incorporate Gracenote’s CDDB technology into their latest generation of home, mobile and portable music products.
In addition to the commercial products I’ve already mentioned, there are several Open Source or freely downloadable software whatsitz that also do the heuristics dance. One is MusicBrainz’s Tagger, a Win application that “allows you to automatically look up the tracks in your music collection and then write clean metadata tags [ID3 tags or Vorbis comment fields] to your files. As you tag the files in your collection that MusicBrainz didn’t recognize, you submit the acoustic fingerprints [TRM IDs] of your files back to the server. Submitting acoustic fingerprints will allow MusicBrainz to automatically identify these tracks in the future so that other people using the Tagger can benefit.” TRM IDs are profiles typically generated by Relatable’s TRM audio fingerprinting technology. A version of TRM’s audio feature extraction client was used by the MusicBrainz project.
Another no-cost machine is SWMUMDIS, a “universal tool to develop and explore audio representations that process the ridges” of a preprocessed spectrogram. SWMUMDIS is a demonstration of research principles and not a product, even by Open Source standards, but it does serve as a point of reference for further development by pointy-headed programmers.
Other music-recognition uses include automatic quality assessment and visualization of parameters such as spectral content, which makes rapid identification of sections easier for editing. Another utility application is quality control. The International Telecommunication Union (ITU) created the PEAQ (Perceptual Evaluation of Audio Quality) standard for objective machine evaluation of perceptually coded audio, of which the MP3 codec is a widespread example. Basically, PEAQ software “listens” to incoming audio, makes an evaluation based on a model of human hearing and that subjective factor we refer to as “quality,” and then rates the audio in real time. This is invaluable for broadcasters, replicators and anyone who needs a way to monitor their “product” while never tiring or growing bored with the program material. PEAQ’s quality assessment is based on a group of trained human listeners whose talents were baked into software. PEAQ-based products are available as software-only and hardware implementations.
These days, the audio data-sniffing field is crowded enough that participants are vying for mind share by claiming the fastest recognition time (“I can name that tune in a dozen notes!” “Hah! I laugh at your algorithms! I can name that tune in half as many!”), and so it goes until, at some point, the programs will be able to name that tune with just one note, and then we can all retire and let the computers do our work. The world of machine intelligence and audio recognition may someday provide a truly useful product to, say, automatically assemble a soundtrack for your life. But until then, audio recognition remains a useful tool primarily for bean counters and intellectual-property cops. Just remember that, even in space, something can hear your Stratocaster scream.
OMas’ computer auto-assembled this column while he was preparing a delicately toasted cheese sandwich. All that time, he and his PowerBook were under the influence of Morcheeba’s latest, Charango, and the wide-ranging styles of new Brit-pop kids, Delays.
Pedant in a Box: Spectrogram
A spectrogram is a visualization technique for acoustic events or audio material. Spectrograms plot frequency and amplitude against time and can be generated in real time or out of real time. Nowadays, most spectrograms map amplitude to a predefined color table to visually clarify the plot. Forensic investigators, audio restorers and speech pathologists routinely employ spectrograms in their work.
The following two spectrograms are from SoundHack and Frequency, the poor man’s Retouch. The color plot from SoundHack shows a stereo folk-rock .AIFF file. Notice the tempo appears as almost a grid of vertical beats, while the monochrome Frequency screenshot displays my voice. The selected utterance is the word “SCSI.” For both, the X axis is time from left to right, while the Y axis is frequency.
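For the terminally curious, the time-versus-frequency plot described in this sidebar can be sketched in a few lines of Python. This is only a minimal illustration using a brute-force DFT from the standard library, not how SoundHack or Frequency works; the frame size, hop, test tone and character ramp (a stand-in for a color table) are all assumptions.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return magnitude rows: one row per time frame, one column per bin."""
    # Hann window to reduce leakage between neighboring frequency bins.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
        for n in range(frame_size)
    ]
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT of the windowed frame at bin k (slow but simple).
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            row.append(abs(acc))
        rows.append(row)
    return rows

def render(rows, ramp=" .:*#"):
    """Text plot: time runs left to right, frequency bottom to top."""
    peak = max(max(r) for r in rows) or 1.0
    lines = []
    for k in reversed(range(len(rows[0]))):
        lines.append("".join(
            ramp[min(len(ramp) - 1, int(len(ramp) * r[k] / peak))]
            for r in rows
        ))
    return "\n".join(lines)

rate = 8000       # assumed sample rate in Hz
tone_hz = 1000    # steady test tone
signal = [math.sin(2 * math.pi * tone_hz * t / rate) for t in range(512)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * rate / 64   # bin spacing is rate / frame_size
print(render(spec))
print("strongest bin in first frame:", peak_bin, "=", peak_hz, "Hz")
```

A steady tone shows up as a single horizontal stripe. Material with a beat would instead show the vertical grid mentioned above.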