THE MP3 Machine …or, What Is Happening to My Audio?

Editor's note: This article appeared in the January issue of Mix's new Internet Audio magazine. If you'd like a free subscription to this quarterly publication, then click here.

You’re an up-to-date audio pro, savvy to audio on the Internet. You’ve uploaded and downloaded, optimized your material for compression and use only the very best encoders. But in the back of your mind, there may be one little question that just won’t go away: “What the heck is this MP3 stuff actually doing to my audio?” This article will attempt to explain what really happens in the most prevalent compressed-audio format on the Internet, MP3.

A LITTLE HISTORY

Research on audio data compression has been going on almost as long as there’s been audio, driven by both military and civilian needs. Military organizations are interested in getting intelligible voice in terrible conditions, and this means bringing down the amount of data transmitted. Civilian telecoms, on the other hand, have a potent interest in cramming as many phone calls as possible onto a single cable.

This has added up to significant budgets for many years of private and academic researchers delving into the mysteries of human hearing, as well as really neat coding schemes. For decades, the research focus was on speech intelligibility, but in the 1980s and early ’90s, it became feasible to code in such a way that good-sounding music could be compressed into a fraction of the space required by CD audio.

In the early 1990s, as the Internet grew in popularity, compression issues became even more important as more people wanted to transmit audio across a medium with limited bandwidth and speed. It was about this time that the MP3 standard was developed.

ACRONYMS AND LAYERS

MP3 is an acronym of an acronym, expanding into “MPEG-1 Layer 3 Audio.” MPEG stands for the Moving Picture Experts Group, and MPEG-1 is the first in a series of standards for the compressed coding of audio and video for presentation in various media. MPEG-1 video is commonly used on Video CD (a major industry in Asia). MPEG-2 is well known today in association with DVD, Direct Broadcast Satellite (DBS) and Digital Video Broadcasting (DVB) outside the U.S. MP3 audio is part of the original MPEG-1 specification. So what is this Layer 3 thing about, and what happened to Layers 1 and 2?

When the MPEG-1 specification was defined, there were two promising candidates, MUSICAM and ASPEC, vying to become the standard for audio encoding. Listening tests didn’t produce a winner, so MPEG decided to codify three options, which became known as Layers 1, 2 and 3.

MPEG-1 Layer 1 Audio requires the least processing to encode and has the lowest latency, or encoding delay. Its intended application is direct recording, with encoding in real time. The compression algorithm used in the Digital Compact Cassette, known as PASC (Precision Adaptive Subband Coding), is a close relation.

MPEG-1 Layer 2 Audio is the same in principle as Layer 1 but uses more refined processing for encoding. Layer 2 is in widespread use today for DVD (especially in PAL markets), as well as Video CD, digital audio and video broadcasting.

MPEG-1 Layer 3 Audio (MP3) takes the next step: The basic framework of Layer 1/Layer 2 encoding is preserved, but additional elements are added for more efficient compression.

SO HOW DOES IT ALL WORK?

It’s easier to understand MP3 if you first understand Layers 1 and 2, MP1 and MP2, respectively. MP1 and MP2 are examples of a class of compression techniques known as Subband Coding, or SBC. SBC techniques vary in details but have the same general structure and approach.

THE FILTER BANK

Uncompressed audio data (such as CD audio) goes into the system. The first step toward compression is division of the signal into 32 frequency bands by a filter bank. In decoding, these subbands are summed together to reconstruct the signal.

Layers 1 and 2 use filters, called polyphase filters, that are comparatively simple and, therefore, imperfect. These diverge from the ideal in two respects: For one, the filter bands are spaced evenly with regard to frequency (each is about 690 Hz wide at a 44.1kHz sample rate, or 500 Hz at 32 kHz), making them uneven to the ear, which hears frequency on a roughly logarithmic scale. The other imperfection is the sharpness of the filter cutoff curve. Polyphase filters have only moderately sharp roll-off and must overlap one another to reconstruct the signal accurately.
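As a quick check on the band spacing, the width of each of the 32 equal bands follows directly from the sample rate. This is a minimal sketch; the function name is hypothetical, and the three rates shown are the ones MPEG-1 audio supports.

```python
def subband_width_hz(sample_rate_hz, num_bands=32):
    """Width of one equal-width subband: the range 0..Nyquist split evenly."""
    nyquist_hz = sample_rate_hz / 2
    return nyquist_hz / num_bands

for rate in (32000, 44100, 48000):
    print(rate, "Hz sample rate ->", round(subband_width_hz(rate), 1), "Hz per band")
```

At the low end of the spectrum a single band of this width spans several octaves, while at the top it covers a small fraction of one, which is what "uneven to the ear" means in practice.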

PSYCHOACOUSTIC ANALYSIS

As the input signal goes to the filter bank, it also feeds a section that analyzes the signal content using a Fast Fourier Transform. FFT analysis is also frequency-based, and it resolves frequency more finely than the 32-band filter bank, but it is used only for analysis and plays no part in reconstructing the signal. Rather, it identifies where there are strong (high-amplitude) tonal or nontonal components that will mask lower-amplitude sounds in the vicinity.

A psychoacoustic model is applied to determine the effects of frequency masking, which is an upward shift of the hearing threshold of some relatively weaker tones by louder tones. Each component identified in the analysis is assigned a spreading function that determines how strongly it will mask the sounds around it. These are added together to assign a threshold of audibility for each of the 32 subbands.
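The idea of a spreading function can be sketched in a few lines. This is a toy illustration only: the slopes and offset below are made-up values, not those of either MPEG-1 psychoacoustic model, and the band index stands in loosely for position on the frequency axis. It does reflect one well-known property of masking, namely that a loud component masks sounds above it in frequency more strongly than sounds below it.

```python
import math

# Illustrative values only, NOT taken from the MPEG-1 psychoacoustic models.
LOWER_SLOPE_DB = 25.0   # masking falls off steeply below the masker (dB per band)
UPPER_SLOPE_DB = 10.0   # masking falls off more gently above the masker
MASK_OFFSET_DB = 10.0   # masked threshold sits this far below the masker's level

def spread(masker_band, masker_db, band):
    """Masking contribution (dB) of a single masker at the given band."""
    distance = band - masker_band
    slope = UPPER_SLOPE_DB if distance >= 0 else LOWER_SLOPE_DB
    return masker_db - MASK_OFFSET_DB - slope * abs(distance)

def masking_threshold(maskers, num_bands):
    """Add all maskers' contributions in the power domain, band by band."""
    thresholds = []
    for band in range(num_bands):
        power = sum(10 ** (spread(b, db, band) / 10) for b, db in maskers)
        thresholds.append(10 * math.log10(power))
    return thresholds

maskers = [(4, 70.0), (12, 60.0)]   # (band index, level in dB), hypothetical components
for band, level in enumerate(masking_threshold(maskers, 16)):
    print(band, round(level, 1))
```

A real model would also fold in the absolute threshold of hearing and would treat tonal and nontonal maskers differently.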

The MPEG-1 specification defines two optional psychoacoustic models, one relatively basic (Model 1) and one more elaborate (Model 2). Each of these models can be applied in any of the three Layers. Model 2 includes specific features applicable to the more sophisticated processing of MP3.

VARIABLE QUANTIZATION

The psychoacoustic analysis results are then applied to the filter bank outputs. If the energy in any band falls below the overall threshold of audibility, then that band is deleted. Most explanations of audio compression focus on the elimination of inaudible components, but what happens to the audible portions is equally important.

Each audible band is then quantized individually. Signal-to-noise ratio is proportional to bit resolution. By looking at the energy and audibility threshold for each band, the encoder can determine the lowest bit resolution that can be applied.
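A rough sketch of that decision, assuming the common rule of thumb that each bit of resolution is worth about 6 dB of signal-to-noise ratio. The band levels are made up for illustration, and a real encoder chooses from a table of allowed quantizers rather than computing a bit count directly.

```python
import math

DB_PER_BIT = 6.02  # rule of thumb: each extra bit buys roughly 6 dB of SNR

def bits_needed(signal_db, threshold_db):
    """Fewest bits that keep quantization noise at or below the masking threshold."""
    signal_to_mask_db = signal_db - threshold_db
    if signal_to_mask_db <= 0:
        return 0  # the band is inaudible, so no bits are spent on it
    return math.ceil(signal_to_mask_db / DB_PER_BIT)

# (signal level, audibility threshold) in dB for three hypothetical bands
bands = [(70.0, 40.0), (55.0, 50.0), (30.0, 45.0)]
print([bits_needed(s, t) for s, t in bands])  # prints [5, 1, 0]
```

The first band stands 30 dB above its threshold and needs five bits; the third falls below its threshold and is dropped entirely.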

Thus, subband coding both adds content, in the form of quantization noise, and removes content, by dropping bands that fall below audibility. The combination of these processes results in the overall compression ratio.

CODING AND FORMATTING

We now have a set of bandpass filter outputs, rendered at various resolutions. These are assembled to distribute the bits evenly into a bitstream with a constant data rate. For MPEG-1 audio in hi-fi stereo applications, data rates of 128 to 256 kbps (thousands of bits per second) are typical.

The coded bitstream is then formatted by dividing the raw bitstream into sections, or frames. Additional useful information, such as the stereo mode, bit rate, coding layer, etc., is embedded in headers that mark the beginning of each frame.
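To make the header idea concrete, here is a sketch that decodes the four-byte header of an MPEG-1 Layer 3 frame. The field positions and lookup tables are the standard ones for that one combination of version and layer; the function name and the returned dictionary are choices made for this example, and the sketch ignores the remaining header bits (CRC flag, copyright, emphasis and so on).

```python
# Lookup tables valid for MPEG-1 Layer 3 only; other layers use different tables.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]

def parse_header(header):
    """Decode a 32-bit MPEG-1 Layer 3 frame header passed as an integer."""
    if (header >> 21) & 0x7FF != 0x7FF:
        raise ValueError("frame sync not found")
    if (header >> 19) & 0x3 != 0b11 or (header >> 17) & 0x3 != 0b01:
        raise ValueError("not an MPEG-1 Layer 3 header")
    bitrate_kbps = BITRATES_KBPS[(header >> 12) & 0xF]
    sample_rate_hz = SAMPLE_RATES_HZ[(header >> 10) & 0x3]
    if bitrate_kbps is None or sample_rate_hz is None:
        raise ValueError("free-format or reserved value, not handled in this sketch")
    padding = (header >> 9) & 0x1
    return {
        "bitrate_kbps": bitrate_kbps,
        "sample_rate_hz": sample_rate_hz,
        "channel_mode": CHANNEL_MODES[(header >> 6) & 0x3],
        # Each Layer 3 frame carries 1,152 samples, which works out to this many bytes.
        "frame_bytes": 144 * bitrate_kbps * 1000 // sample_rate_hz + padding,
    }

print(parse_header(0xFFFB9064))
```

The example value 0xFFFB9064 decodes as a 128kbps, 44.1kHz joint-stereo frame of 417 bytes.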

STEREO MODES

Stereo audio typically has a large share of redundant information between channels. Greater efficiency results if that redundant information is identified and coded only once.

The sensitivities of the ear to different cues can also be exploited. At low frequencies, localization is weak, and differential information can be combined into a single channel. At high frequencies, cues from the temporal envelope become more important than instantaneous amplitude.

MPEG-1 audio provides three modes for dual-channel input: Dual-Monophonic, Stereo and Joint-Stereo. Dual-Monophonic coding treats each channel separately, with no advantage from redundancy. Stereo mode exploits only direct bit-for-bit redundancy. Joint-Stereo uses more sophisticated analysis and applies psychoacoustic mapping for different frequency ranges.
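One technique used under the Joint-Stereo umbrella in MP3 is mid/side coding, which is not spelled out above: the encoder codes the sum and the difference of the two channels instead of left and right. The sketch below uses made-up sample values and shows only the transform, not the decision about when to apply it.

```python
import math

SQRT2 = math.sqrt(2)

def ms_encode(left, right):
    """Convert left/right samples to mid (sum) and side (difference)."""
    mid = [(l + r) / SQRT2 for l, r in zip(left, right)]
    side = [(l - r) / SQRT2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover left/right from mid/side; the transform is exactly invertible."""
    left = [(m + s) / SQRT2 for m, s in zip(mid, side)]
    right = [(m - s) / SQRT2 for m, s in zip(mid, side)]
    return left, right

def energy(samples):
    return sum(x * x for x in samples)

# Two nearly identical channels, as with a centered vocal (hypothetical values).
left = [0.50, 0.80, -0.30, -0.60]
right = [0.48, 0.82, -0.31, -0.58]
mid, side = ms_encode(left, right)
print("mid energy:", energy(mid), "side energy:", energy(side))
```

When the channels are similar, nearly all of the energy lands in the mid signal, so the side signal can be coded with very few bits.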

A BETTER FILTER BANK

MP3 uses a more elaborate design for the first stage filter bank, allowing filters to be placed at intervals matching critical bands of the ear so that each band more closely matches the frequency regions where the ear will experience masking.

MP3 achieves this by following the polyphase filter with a frequency transform filter, known as Modified Discrete Cosine Transform, or “MDCT.” MDCT is optimized for audio, and its output can be reconstructed into an accurate representation of the original waveform.
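The claim that MDCT output can be turned back into the original waveform is easy to test in miniature. The sketch below is a direct, slow implementation of the textbook MDCT with a sine window and 50 percent overlap; it is not the MP3 filter bank itself, which applies the MDCT to the polyphase filter outputs and uses its own block sizes. All names are chosen for this example.

```python
import math

def mdct(block):
    """Forward MDCT: 2N windowed samples in, N coefficients out."""
    n = len(block) // 2
    return [
        sum(x * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
            for t, x in enumerate(block))
        for k in range(n)
    ]

def imdct(coeffs):
    """Inverse MDCT: N coefficients in, 2N time-aliased samples out."""
    n = len(coeffs)
    return [
        (2.0 / n) * sum(c * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                        for k, c in enumerate(coeffs))
        for t in range(2 * n)
    ]

def roundtrip(signal, n):
    """Window, transform, invert, window again, then overlap-add with a hop of N."""
    window = [math.sin(math.pi * (t + 0.5) / (2 * n)) for t in range(2 * n)]
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [signal[start + t] * window[t] for t in range(2 * n)]
        rebuilt = imdct(mdct(block))
        for t in range(2 * n):
            out[start + t] += rebuilt[t] * window[t]
    return out

n = 8
signal = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(6 * n)]
rebuilt = roundtrip(signal, n)
# The first and last N samples are covered by only one block, so compare the interior.
error = max(abs(a - b) for a, b in zip(signal[n:-n], rebuilt[n:-n]))
print("largest reconstruction error:", error)
```

Each block on its own comes back scrambled by time-domain aliasing; the aliasing cancels only when neighboring blocks are overlapped and added. In a real coder the coefficients are quantized between the two transforms, so reconstruction is no longer exact.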

The downside of the MDCT filter is that it is subject to a form of transient smearing, known as pre-echo. MP3 compensates by detecting the conditions for pre-echo and modifying the filter settings for that interval.

SMARTER QUANTIZATION

MP3 also uses more clever means than Layers 1 and 2 to determine the threshold of audibility in each frequency band. The more sophisticated Model 2 psychoacoustic model is used with better estimates of masking effects. These are used to determine optimal quantization levels for each subband.

LOSSLESS COMPRESSION

MP3 applies a form of lossless compression, called entropy coding, to the output of the filtering and quantization process. In entropy coding, data words (or “symbols”) that occur more frequently are given shorter codes, like an abbreviation.

The type of entropy coding used in MP3, known as Huffman Coding, uses a fixed lookup table to match more commonly used values with shorter codes.
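The principle can be shown with a small Huffman code builder. Note the difference from MP3 itself: this sketch derives a code table from the data it is given, whereas MP3 selects from fixed tables defined in the standard. The input values are made up to resemble quantized output, where small values dominate.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    counts = Counter(symbols)
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far})
    heap = [(weight, i, {sym: ""}) for i, (sym, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)   # the two least frequent groups
        w2, _, codes2 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in codes1.items()}
        merged.update({sym: "1" + code for sym, code in codes2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

values = [0] * 12 + [1] * 5 + [-1] * 4 + [2] * 2 + [3] * 1
code = huffman_code(values)
total_bits = sum(len(code[v]) for v in values)
print(code)
print(total_bits, "bits with Huffman coding vs", 3 * len(values), "bits at a fixed 3 bits each")
```

For these 24 values the variable-length code needs 46 bits where a fixed three-bit code would need 72, and nothing is lost in the process.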

VARIABLE BIT RATE

MP3 makes a provision for a type of variable bit rate encoding, called a “bit reservoir.” When data is multiplexed and formatted, extra space is allowed so that when a difficult-to-encode section comes along, the extra space is used to store additional information. Overall bit rate remains constant.
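The bookkeeping behind a bit reservoir can be sketched as follows. This is a simplified model with made-up numbers: `frame_budget`, `reservoir_max` and the per-frame demands are hypothetical, and a real encoder derives demand from its psychoacoustic analysis.

```python
def allocate_with_reservoir(demands, frame_budget, reservoir_max):
    """Grant each frame what it asks for where possible, banking unused bits."""
    reservoir = 0
    granted = []
    for demand in demands:
        available = frame_budget + reservoir   # this frame's share plus savings
        bits = min(demand, available)
        granted.append(bits)
        # Whatever is left over is carried forward, up to the reservoir's capacity.
        reservoir = min(available - bits, reservoir_max)
    return granted

demands = [2600, 2400, 5000, 3200]   # bits each frame would like (hypothetical)
print(allocate_with_reservoir(demands, frame_budget=3000, reservoir_max=4000))
# prints [2600, 2400, 4000, 3000]
```

The third frame gets 4,000 bits, a third more than its nominal share, paid for by the two easy frames before it. Bits can only be borrowed from the past, never from frames yet to come, and the total never exceeds the constant-rate budget.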

MP3 also allows for true variable bit rate encoding, in which the bit rate changes from frame to frame. Many MP3 encoders do not implement variable bit rate, but some of the newer encoders support this technology.

MPEG LAYER IMPLEMENTATIONS AND PERFORMANCE

The three MPEG-1 audio Layers are numbered in order of increasing coding efficiency. Nominally, the MPEG spec posits that for MPEG-1 Layer 1, a bit rate of 256 kbps should produce results that are close to CD quality. The corresponding rates for Layers 2 and 3 are 192 and 128 kbps, respectively.
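For a sense of scale, these rates can be set against the data rate of CD audio, which is 44,100 samples per second at 16 bits in two channels. A small sketch (the function name is hypothetical):

```python
CD_KBPS = 44100 * 16 * 2 / 1000   # sample rate x bit depth x channels = 1411.2 kbps

def compression_ratio(coded_kbps):
    """How many times smaller the coded stream is than CD audio."""
    return CD_KBPS / coded_kbps

for layer, kbps in ((1, 256), (2, 192), (3, 128)):
    print(f"Layer {layer} at {kbps} kbps: about {compression_ratio(kbps):.1f}:1")
```

Layer 3 at 128 kbps works out to roughly 11:1, about twice the compression of Layer 1 at its nominal rate.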

Can this claim be verified? Well, yes and no. Most people will cheerfully listen to 128kbps MP3s and be quite satisfied. (Note that there are substantial differences between encoders.) If you A/B between a CD original and its MP3-coded form over headphones, then it won’t take long to convince yourself that output is not identical to input. The goal of the standard is that the difference between the original and the encoded audio be “perceptible, but not annoying.”

CONCLUSION

We’ve taken a lightning-quick look at the way MP3 compresses audio data. There is a lot of detail in the process that has been omitted for brevity and (hopefully) clarity. While professional listeners are likely to find the limits of any audio data compression scheme, it’s indisputable that MP3 has satisfied most of the listeners most of the time. Perhaps the true measure of an audio coding scheme’s effectiveness should be the number of lawyers employed as a result of its application. On that basis, MP3 is a hands-down winner.

Gary Hall has been working in digital audio for over 20 years and has been responsible for many innovations in audio processing. His real ambition, however, is to dance in Bollywood films.

GLOSSARY

Bit Rate: The average number of bits that one second of audio data will consume. Measured in kbps (1,000 bits/sec).

Critical Band: “Reception bands” in the human auditory system, where frequency components will mask a given tone. There are 24 critical bands; bandwidth varies with frequency but is usually between 1/6 and 1/3 octave.

Fast Fourier Transform (FFT): An efficient algorithm for computing the discrete Fourier transform, used for analyzing the frequency content of a signal as it varies over time.

Lossless Compression: A method of data compression in which no data is lost; only redundancy is removed, so the original can be restored exactly. In terms of size reduction, it is not as efficient as lossy compression, where data judged perceptually irrelevant is discarded.

Moving Picture Experts Group: Established in 1988, a working group of ISO/IEC that develops standards for coded representation of digital audio and video.

Multiplex: To combine or transmit several data streams on the same circuit or channel.

Psychoacoustic Model: A hearing perception model based on the interaction of sounds with the human auditory system.

Threshold of Audibility: The minimum sound pressure level that can be perceived by the human auditory system, usually taken as the reference level for sound pressure, 0 dB SPL.

DRILLING DOWN

For more technical information on MP3 than you ever wanted to know, grab something caffeinated and check out the following Web links…

www.cselt.it/mpeg The home page of the Moving Picture Experts Group. Includes comprehensive technical references on MPEG standards, plus work group news and plans.

www.iis.fhg.de Home page of the Fraunhofer IIS-A, the main developer of MPEG Layer 3 and MPEG-2 AAC (Advanced Audio Coding).

www.mpeg.org A guide to MP3 resources on the Internet. Includes FAQs, search engines and links.

www.tnt.uni-hannover.de/project/mpeg/audio The MPEG audio page of the University of Hannover’s department of Electrical Engineering and Information Technology. Includes general information, plus sound quality assessments and other tests.
