What is MPEG?
MPEG stands for "Moving Picture Experts Group". It is a working group operating under the direction of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC).
The group's work concentrates on defining standards for the coding of moving pictures, audio and related data.
MPEG-1 defines a framework for coding moving video and audio, significantly reducing the amount of storage required with minimal perceived difference in quality. In addition, a System specification defines how audio and video streams can be combined to produce a system stream. This forms the basis of the coding used for the VCD format.
MPEG-2 builds on the MPEG-1 specification, adding further pixel resolutions, support for interlaced pictures, better error recovery possibilities, more chrominance information formats, non-linear macroblock quantization and the possibility of higher resolution DC components.
MPEG video compression
MPEG video compression uses several techniques to achieve high compression ratios with minimal impact on the perceived video quality.
Discrete Cosine Transformation (DCT)
The human vision system exhibits some characteristics that are exploited by MPEG video compression. One of these is that large objects are much more noticeable than detail within them. In other words, low spatial frequency information is much more noticeable than high spatial frequency information.
MPEG video compression discards some high spatial frequency information - the information which is less noticeable to the eye. The first step in this process is to convert a static picture into the frequency domain. The DCT performs this transformation.
A complete frame is split into blocks of 8x8 pixels. The DCT algorithm converts the spatial information within the block into the frequency domain. After the transformation, the top left value of the block represents the DC level (think of this as the average brightness) of the block. The value immediately to the right of this represents low frequency horizontal information. The value in the top right represents high frequency horizontal information. Similarly, the bottom left value represents high frequency vertical information.
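To make the layout of the transformed block concrete, here is a minimal sketch of a two-dimensional DCT on a square block. It is a direct, slow implementation of the textbook DCT-II formula with orthonormal scaling, not the optimized routine a real encoder would use, and the function name dct_2d is simply chosen for this illustration.

```python
import math

def dct_2d(block):
    """Return the 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        # Orthonormal scaling factors for frequency index k.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for v in range(n):          # vertical frequency (row of the output)
        for u in range(n):      # horizontal frequency (column of the output)
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[v][u] = scale(u) * scale(v) * total
    return out

# A flat grey 8x8 block has no detail, so only the top left (DC) value is non-zero.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

With this scaling the DC value of an 8x8 block is 8 times the average pixel value, which is why it can be thought of as the average brightness.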
The following diagrams show a simplified 4x4 block of pixels (rather than the full 8x8 block) and the resulting DCT values. Values in the DCT output matrices range from 0 to 15.
Our DCT transformed values contain an accurate representation of our original block. By applying an inverse DCT on the values we regain our original pixels. Our DCT output is currently held as high precision (e.g. floating point) values. We apply a technique called quantization to reduce the precision of the values. Quantization simply means storing the value using a discrete number of bits, discarding the least significant information. By using the knowledge that high spatial frequency information is less visible to the eye than low frequency information, we can quantize the high frequency parts using fewer bits. It is important that the DC component is accurately represented.
In our example blocks above, we have used 4 bit values (in the range 0 to 15) to represent the DCT matrix. With the knowledge that the eye cannot determine high frequency information as accurately as low frequency information, we can change the number of bits used to quantize each entry in the matrix. The DC component must be accurately represented, but we can reduce the number of bits required for other cells. The following shows an example of how many bits could be allocated for each cell in the DCT matrix:
The original matrix had 16 cells with 4 bits per cell, giving a total of 64 bits. The quantized matrix has a total of:
(4x1) + (3x4) + (2x7) + (1x4) = 4 + 12 + 14 + 4 = 34 bits.
A saving of about 47%. A real MPEG encoder varies the number of bits used to code the DCT matrix values from frame to frame.
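The bit counting above can be checked with a short sketch. The allocation table below is hypothetical: it was chosen only so that it contains one 4 bit cell, four 3 bit cells, seven 2 bit cells and four 1 bit cells, matching the sum above. The sample DCT values are also made up. Quantization is modelled here in the simplest possible way, by dropping the least significant bits of each 4 bit value.

```python
FULL_BITS = 4  # precision of the unquantized example values (0 to 15)

# Hypothetical allocation: most bits at the top left (DC), fewest at high frequencies.
BITS_PER_CELL = [
    [4, 3, 2, 1],
    [3, 3, 2, 1],
    [3, 2, 2, 1],
    [2, 2, 2, 1],
]

def quantize(block):
    """Keep only the most significant bits allowed for each cell."""
    return [[value >> (FULL_BITS - bits) for value, bits in zip(row, bit_row)]
            for row, bit_row in zip(block, BITS_PER_CELL)]

def dequantize(levels):
    """Scale the stored values back to the original 0 to 15 range."""
    return [[level << (FULL_BITS - bits) for level, bits in zip(row, bit_row)]
            for row, bit_row in zip(levels, BITS_PER_CELL)]

# Made-up DCT values for illustration only.
dct_values = [
    [15, 9, 4, 1],
    [8, 6, 3, 1],
    [5, 3, 2, 0],
    [2, 1, 1, 0],
]

levels = quantize(dct_values)
restored = dequantize(levels)

original_bits = 16 * FULL_BITS
quantized_bits = sum(sum(row) for row in BITS_PER_CELL)
saving = 1 - quantized_bits / original_bits
```

The DC value survives unchanged because it keeps all 4 bits, while the small high frequency values collapse to zero. Those zeros are what make the next coding stage effective.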
Modified Huffman Coding
Modified Huffman coding uses fixed tables to perform Huffman coding. The DCT output is encoded using this technique to reduce the number of bits required. The basis of Huffman encoding is that encoded symbols have a variable number of bits. Frequently used symbols consume fewer bits, less frequently used symbols consume more bits. The result is (hopefully!) a saving in the number of bits required.
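The variable length principle can be illustrated with a small sketch. Note the difference from MPEG itself: MPEG uses fixed, predefined tables, whereas this sketch builds a table from the symbol counts of its own input. The sample values stand in for quantized DCT output and are made up.

```python
import heapq
from collections import Counter

def build_huffman_table(symbols):
    """Return {symbol: bit string} built from the symbol frequencies."""
    counts = Counter(symbols)
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(weight, i, {sym: ""}) for i, (sym, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # only one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w0, _, low = heapq.heappop(heap)     # the two least frequent subtrees
        w1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, table):
    return "".join(table[s] for s in symbols)

def decode(bits, table):
    reverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:               # no code is a prefix of another
            out.append(reverse[current])
            current = ""
    return out

# Made-up quantized values: zero is by far the most common symbol.
values = [0] * 12 + [1] * 5 + [-1] * 3 + [2] * 2 + [7]
table = build_huffman_table(values)
bits = encode(values, table)
```

Here the 23 values take 43 bits, compared with 69 bits if each of the five distinct symbols were stored with a fixed 3 bit code.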
MPEG video frames are broken into blocks of 8x8 pixels which are DCT processed and quantized as outlined above. Blocks are combined into macroblocks of 16x16 or 16x8 (MPEG-2 only) pixels.
Let's consider a sequence of 6 frames. The encoder starts by encoding a complete representation of the first frame (similar to a static JPEG image). This is known as an Intra-Frame (or I-frame). I-frames are necessary to give the decoder a starting point.
The encoder could choose to encode the fourth frame in the video as a Predicted frame (or P-frame). To do this it scans the first frame (the reference frame) and the fourth frame, looking for macroblock-sized areas of the picture that appear similar. If the video contains moving objects, the encoder detects this. For areas of the image which have not changed between the first and fourth frames, macroblocks are skipped. Skipped macroblocks do not consume any data in the video stream. The decoder simply copies the macroblock from the previous reference frame. For areas that have changed slightly compared to the reference, the encoder takes the pixel difference and encodes this using the DCT and quantization techniques. For areas where the encoder can detect the movement of an object from one macroblock position to another, it encodes a motion vector and difference information. The motion vector tells the decoder how far and in what direction the macroblock has moved. Where the encoder cannot find a similar macroblock in the reference frame, the macroblock is encoded as if it were part of an I-frame.
The other frames in the sequence (second, third, fifth and sixth) could be encoded as Bidirectional Predicted frames (B-frames). Considering the second frame, this has two reference frames: the previous reference frame is frame one and the next reference frame is frame four. A B-frame can use macroblocks from either the previous or next reference frames, or preferably a combination of both. Using forward and backward motion vectors allows interpolation of 2 images, reducing noise at low bitrates.
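The decision process for a P-frame macroblock can be sketched as follows. This is a simplified illustration rather than what any particular encoder does: it uses an exhaustive search with the sum of absolute differences as the similarity measure, 4x4 macroblocks to keep the example small, and an arbitrary threshold for falling back to intra coding. All names and parameter values are assumptions made for this sketch.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size areas."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))

def classify_macroblock(cur, ref, cx, cy, size=4, search=2, intra_threshold=200):
    """Decide how the macroblock at (cx, cy) in the current frame could be coded.

    The returned vector (dx, dy) is the offset from the macroblock's position
    to the best matching area in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, cx, cy, rx, ry, size)
                # Prefer the smallest difference, then the shortest vector.
                key = (cost, abs(dx) + abs(dy))
                if best is None or key < best[0]:
                    best = (key, dx, dy)
    (cost, _), dx, dy = best
    if cost == 0 and (dx, dy) == (0, 0):
        return ("skipped", None, 0)          # unchanged: costs no data
    if cost > intra_threshold:
        return ("intra", None, cost)         # nothing similar found
    return ("predicted", (dx, dy), cost)     # motion vector plus difference

def blank(width, height):
    return [[0] * width for _ in range(height)]

ref = blank(8, 4)
cur = blank(8, 4)
for y in (1, 2):
    for x in (4, 5):
        ref[y][x] = 200      # object in the reference frame
    for x in (5, 6):
        cur[y][x] = 200      # same object, one pixel further right

left = classify_macroblock(cur, ref, 0, 0)    # background only
right = classify_macroblock(cur, ref, 4, 0)   # contains the moving object
```

In this example the left macroblock is skipped, and the right macroblock is predicted from an area one pixel to its left in the reference frame with no remaining difference. In a real encoder any remaining difference would then go through the DCT and quantization steps.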
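The noise reduction from combining both reference frames can be seen in a tiny sketch. The pixel values are made up: each prediction is the true block plus some unrelated error, and averaging the two lets much of that error cancel.

```python
def interpolate(forward, backward):
    """Average two predictions pixel by pixel, rounding to the nearest integer."""
    return [[(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(forward, backward)]

def total_error(prediction, actual):
    """Sum of absolute pixel errors between a prediction and the real block."""
    return sum(abs(p - a)
               for p_row, a_row in zip(prediction, actual)
               for p, a in zip(p_row, a_row))

actual = [[100, 100], [100, 100]]     # the block as it appears in the B-frame
forward = [[104, 97], [101, 95]]      # noisy match from the previous reference frame
backward = [[97, 102], [98, 104]]     # noisy match from the next reference frame

blended = interpolate(forward, backward)
errors = (total_error(forward, actual),
          total_error(backward, actual),
          total_error(blended, actual))
```

The smaller the error left after prediction, the less difference information has to be coded for the B-frame.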
Using our example, the video would be encoded using the frame sequence: IBBPBB
It is more normal for I-frames to appear less frequently than this, perhaps every 12 frames. A more sophisticated encoder would dynamically decide which frame type to use for each frame, e.g. a scene change would result in an I-frame being inserted. Thus the sequence could end up looking more irregular, such as:
IBBBPBBIBBBPBPBBBBPBBBPBBBPBI... ^ Scene change detected
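A very simple frame type policy can be sketched like this. It is only an illustration of the idea: the fixed spacing of P-frames, the default of one I-frame every 12 frames and the way scene changes are supplied are all assumptions, and a real encoder would make these decisions by analysing the pictures.

```python
def assign_frame_types(frame_count, scene_changes=(), gop_length=12, p_spacing=3):
    """Return a string of frame types, one character per frame.

    scene_changes holds the indexes of frames where a scene change was detected.
    """
    types = []
    since_i = 0   # frames since the last I-frame
    for index in range(frame_count):
        if index == 0 or index in scene_changes or since_i >= gop_length:
            types.append("I")
            since_i = 0
        elif since_i % p_spacing == 0:
            types.append("P")
        else:
            types.append("B")
        since_i += 1
    return "".join(types)

regular = assign_frame_types(13)
with_cut = assign_frame_types(14, scene_changes={7})
```

Without scene changes the pattern repeats every 12 frames. With a scene change at frame index 7, an I-frame is inserted early and the pattern restarts from there.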
By using information from previous and next pictures, substantial savings can be made in the bit requirements for P and B pictures compared with I pictures. Typically a P-frame requires 30% to 50% of the bits of an I-frame, and a B-frame requires 15% to 25%.
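These percentages can be turned into a rough estimate of the overall saving. The sketch below assumes the midpoints of the ranges quoted above (40% for P-frames, 20% for B-frames); real figures depend heavily on the video content.

```python
# Frame sizes relative to an I-frame (midpoints of the typical ranges).
RELATIVE_BITS = {"I": 1.0, "P": 0.40, "B": 0.20}

def relative_cost(sequence):
    """Bits for the sequence as a fraction of coding every frame as an I-frame."""
    total = sum(RELATIVE_BITS[frame_type] for frame_type in sequence)
    return total / len(sequence)

short_group = relative_cost("IBBPBB")         # one I-frame every 6 frames
long_group = relative_cost("IBBPBBPBBPBB")    # one I-frame every 12 frames
```

Under these assumptions the six frame pattern needs about 37% of the bits of an all I-frame stream, and the twelve frame pattern about 32%, which shows why I-frames are normally spaced further apart.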
MPEG audio compression
Audio data is sampled at a certain sample rate. That means that a number of measurements of the audio signal are taken every second (32,000, 44,100 or 48,000 samples per second for MPEG-1 audio). Each sample is taken at a certain precision (16 bits).
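To get a feel for how much data this is before compression, the raw bit rate can be computed directly. The two channel (stereo) default is an assumption made for this sketch.

```python
def pcm_bit_rate(sample_rate, bits_per_sample=16, channels=2):
    """Uncompressed audio bit rate in bits per second."""
    return sample_rate * bits_per_sample * channels

# Raw bit rate for each MPEG-1 sample rate, assuming 16 bit stereo.
rates = {rate: pcm_bit_rate(rate) for rate in (32_000, 44_100, 48_000)}
```

At 44,100 samples per second this comes to 1,411,200 bits per second, which is the figure the audio encoder has to reduce.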
MPEG audio compression uses a psycho-acoustic model of the human ear to determine portions of the audio information that can be encoded at a lower precision without impacting the listener's perception.
The first step in encoding MPEG audio is to convert a short burst of audio data (known as a frame) into the frequency domain, much as the DCT does for video. MPEG audio uses a polyphase filter bank for this step, splitting the time samples into 32 equally spaced frequency bands. Using the psycho-acoustic model, the number of data bits used to represent the sampled data can be varied for different frequency bands. Audio information that will not be heard is not allocated any bits.
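The idea of giving each band only the bits it needs can be sketched as follows. This is not the allocation procedure from the MPEG specification. It assumes that the psycho-acoustic model has already produced a masking threshold for every band, uses the common rule of thumb that each extra bit lowers quantization noise by about 6 dB, and shows four bands instead of 32 for brevity. The levels are made up.

```python
import math

DB_PER_BIT = 6.02  # approximate noise reduction gained from one extra bit

def allocate_bits(signal_db, mask_db, max_bits=15):
    """Give each band just enough bits to push quantization noise below its mask."""
    allocation = []
    for signal, mask in zip(signal_db, mask_db):
        if signal <= mask:
            allocation.append(0)             # will not be heard: no bits
        else:
            needed = math.ceil((signal - mask) / DB_PER_BIT)
            allocation.append(min(needed, max_bits))
    return allocation

signal_db = [60, 48, 30, 12]   # made-up signal level in each band
mask_db = [20, 30, 28, 25]     # made-up masking threshold in each band

bits = allocate_bits(signal_db, mask_db)
```

The last band lies below its masking threshold, so it receives no bits at all, while the first band stands well clear of its threshold and receives the most.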
Frequency (Auditory) Masking
Frequency masking or auditory masking is a term used to describe masking of a sound at one frequency by a sound at another frequency. If a loud sound is present at a particular frequency, it reduces the ability of the human ear to discern a softer sound at a second frequency. The louder the first sound is and the closer the two frequencies are, the greater the effect.
To illustrate, if a -6 dB signal is present at 1 kHz (1000 Hz) and another signal is present at 1.1 kHz (1100 Hz) with a loudness of -18 dB, the 1.1 kHz signal will not be heard.
Temporal Masking
Temporal masking is the masking of a sound by another sound that occurred before or after it in time. If a loud sound stops abruptly and is replaced by a soft (low volume) sound of short duration, the soft sound will not be heard until the ear can recover from the effects of the loud sound.
A similar effect in the other direction is also possible where a soft sound of short duration is followed by a loud sound. Once again the soft sound is not heard.