Showing posts with label Codecs. Show all posts

Codecs


The 8000 samples-per-second PCM signal, at 16 bits per sample, results in 128,000 bits per second of information. That's fairly high, especially in the world of wireline telephone networks, in which every bit represented some collection of additional copper lines that needed to be laid in the ground. Therefore, the concept of audio compression was brought to bear on the subject.
An audio or video compression mechanism is often referred to as a codec, short for coder-decoder. The reason is that the compressed signal is often thought of as being in a code, some sequence of bits that is meaningful to the decoder but not much else. (Unfortunately, in anything digital, the term code is used far too often.)
The simplest coder that can be thought of is a null codec. A null codec doesn't touch the audio: you get out what you put in. More meaningful codecs reduce the amount of information in the signal. All lossy compression algorithms, as most of the audio and video codecs are, stem from the realization that the human mind and senses cannot detect every slight variation in the media being presented. There is a lot of noise that can be added, in just the right ways, and no one will notice. The reason is that we are more sensitive to certain types of variations than others. For audio, we can think of it this way. As you drive along the highway, listening to AM radio, there is always some amount of noise creeping in, whether it be from your car passing behind a concrete building, or under power lines, or behind hills. This noise is always there, but you don't always hear it. Sometimes, the noise is excessive, and the station becomes annoying to listen to or incomprehensible, drowned out by static. Other times, however, the noise is there but does not interfere with your ability to hear what is being said. The human mind is able to compensate for quite a lot of background noise, silently deleting it from perception, as anyone who has noticed the refrigerator's compressor stop or realized that a crowded, noisy room has just gone quiet can attest to. Lossy compression, then, is the art of knowing which types of noise the listener can tolerate, which they cannot stand, and which they might not even be able to hear.
(Why noise? Lossy compression is a method of deleting information, which may or may not be needed. Clearly, every bit is needed to restore the signal to its original sampled state. Deleting a few bits requires that the decompressor or the decoder restore those deleted bits' worth of information on the other end, filling them in with whatever the algorithm states is appropriate. That results in a difference of the signal, compared to the original, and that difference is distortion. Subtract the two signals, and the resulting difference signal is the noise that was added to the original signal by the compression algorithm. One only need amplify this noise signal to appreciate how it sounds.)
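The subtraction described above can be tried on a handful of numbers. The following is a minimal sketch, assuming a deliberately crude "codec" that simply drops the low 8 bits of each 16-bit sample and restores them as zeros; the function names are hypothetical and the scheme is only an illustration, not any standardized codec.

```python
def drop_low_bits(sample, bits=8):
    """Crude lossy 'codec': discard the low bits, restore them as zeros."""
    return (sample >> bits) << bits


def noise_signal(original, decoded):
    """Difference between the original and the decoded signal."""
    return [o - d for o, d in zip(original, decoded)]


original = [1000, -1000, 12345, 255]
decoded = [drop_low_bits(s) for s in original]
noise = noise_signal(original, decoded)
print(decoded)
print(noise)
```

Adding the noise back to the decoded samples reproduces the original exactly, which is another way of saying that the noise is precisely the information the compression step threw away.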


G.711 and Logarithmic Compression

The first, and simplest, lossy compression codec for audio that we need to look at is called logarithmic compression. Sixteen bits is a lot to encode the intensity of an audio sample. The reason 16 bits was chosen is that it has fine enough detail to adequately represent the variations of the softer sounds that might be recorded. But louder sounds do not need such fine detail while they are loud. The higher the intensity of the sample, the more detailed the 16-bit sampling is relative to the intensity. In other words, the 16-bit resolution was chosen conservatively, and is excessively precise for higher intensities. As it turns out, higher intensities can tolerate even more error than lower ones, in a relative sense as well. A higher-intensity sample may tolerate four times as much error as a signal half as intense, rather than the two times you would expect for a linear process. The reason for this has to do with how the ear perceives sound, and is why sound levels are measured in decibels. Logarithmic compression takes advantage of precisely this. Convert the intensities to decibels, where a 1 dB change sounds roughly the same at all intensities, and a good half of the 16 bits can be thrown away. Thus, we get a 2:1 compression ratio.
The ITU G.711 standard is the first common codec we will see, and uses this logarithmic compression. There are two flavors of G.711: μ-law and A-law. μ-law is used in the United States, and bases its compression on a discrete form of taking the logarithm of the incoming signal. First, the signal is reduced to a 14-bit signal, discarding the two least-significant bits. Then, the signal is divided up into ranges, each range having 16 intervals, for four bits, with twice the spacing as that of the next smaller range. Table 1 shows the conversion table.
Table 1: μ-Law Encoding Table

Input Range       Number of Intervals   Spacing of    Left Four Bits of   Right Four Bits of
                  in Range              Intervals     Compressed Code     Compressed Code
8158 to 4063      16                    256           0x8                 number of interval
4062 to 2015      16                    128           0x9                 number of interval
2014 to 991       16                    64            0xa                 number of interval
990 to 479        16                    32            0xb                 number of interval
478 to 223        16                    16            0xc                 number of interval
222 to 95         16                    8             0xd                 number of interval
94 to 31          16                    4             0xe                 number of interval
30 to 1           15                    2             0xf                 number of interval
0                 1                     1             0xf                 0xf
-1                1                     1             0x7                 0xf
-31 to -2         15                    2             0x7                 number of interval
-95 to -32        16                    4             0x6                 number of interval
-223 to -96       16                    8             0x5                 number of interval
-479 to -224      16                    16            0x4                 number of interval
-991 to -480      16                    32            0x3                 number of interval
-2015 to -992     16                    64            0x2                 number of interval
-4063 to -2016    16                    128           0x1                 number of interval
-8159 to -4064    16                    256           0x0                 number of interval
The number of the interval is where the input falls within the range. 90, for example, would map to 0xee, as 90-31 = 59, which is 14.75, or 0xe (rounded down) away from zero, in steps of four. (Of course, the original 16-bit signal was four times, or two bits, larger, so 360 would have been one such 16-bit input, as would have any number between 348 and 363. This range represents the loss of information, as 363 and 348 come out the same.)
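The lookup just described can be written out in a few lines. This is a sketch that follows the table and the worked example in this post for non-negative inputs only; it is not offered as a bit-exact G.711 encoder. The names `RANGES` and `encode_table_sketch` are made up for illustration, and clamping inputs above the top of the table is an assumption.

```python
# (lowest 14-bit value in the range, spacing, left four bits), taken from
# the non-negative rows of the table, largest range first.
RANGES = [
    (4063, 256, 0x8),
    (2015, 128, 0x9),
    (991, 64, 0xA),
    (479, 32, 0xB),
    (223, 16, 0xC),
    (95, 8, 0xD),
    (31, 4, 0xE),
    (1, 2, 0xF),
]


def encode_table_sketch(sample16):
    """Map a non-negative 16-bit sample to an 8-bit code using the table."""
    if sample16 < 0:
        raise ValueError("this sketch only covers the non-negative rows")
    # Discard the two least-significant bits; clamp to the table's top
    # value (the clamp is an assumption, the text does not discuss it).
    value14 = min(sample16 >> 2, 8158)
    if value14 == 0:
        return 0xFF
    for low, spacing, left in RANGES:
        if value14 >= low:
            interval = (value14 - low) // spacing  # rounded down
            return (left << 4) | interval
    raise AssertionError("unreachable: every value of 1 or more is in a range")


print(hex(encode_table_sketch(360)))  # the 16-bit input from the example
```

Feeding in 348 and 363 gives the same code as 360, while 364 moves on to the next interval.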
A-law is similar, but uses a slightly different set of spacings, based on an algorithm that is easier to see when the numbers are written out in binary form. The process is simply to take the binary number and encode it by saving only four bits of significant digits (except the leading one), and to record the base-2 (binary) exponent. This is how floating-point numbers are encoded. Let's look at the previous example. The number 360 is encoded in 16-bit binary as
0000 0001 0110 1000
with spaces placed every four digits for readability. A-law only uses the top 13 bits. Thus, as this number is unsigned, it can be represented in floating point as
1.01101 (binary) x 2^5.
The first four significant digits (ignoring the first 1, which must be there for us to write the number in binary scientific notation, or floating point), are "0110", and the exponent is 5. A-law then records the number as
0001 0110
where the first bit is the sign (0), the next three are the exponent, minus four, and the last four are the significant digits.
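The same steps can be sketched in code. This follows the floating-point description given here and nothing more: the function name is hypothetical, the sign bit of 1 for negative inputs is an assumption, and small magnitudes (exponent below 4) are rejected because the description does not say how they are handled. It is not a bit-exact G.711 A-law encoder.

```python
def a_law_sketch(sample16):
    """Encode a 16-bit sample as sign, exponent minus four, four mantissa bits."""
    sign = 1 if sample16 < 0 else 0           # assumed: 1 marks a negative input
    magnitude = min(abs(sample16) >> 3, 0xFFF)  # keep only the top 13 bits
    exponent = magnitude.bit_length() - 1     # position of the leading one
    if exponent < 4:
        raise ValueError("small magnitudes are not covered by this sketch")
    # The four bits that follow the leading one.
    mantissa = (magnitude >> (exponent - 4)) & 0xF
    return (sign << 7) | ((exponent - 4) << 4) | mantissa


print(format(a_law_sketch(360), "08b"))
```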
A-law is used in Europe, on their telephone systems. For voice over IP, either will usually work, and most devices speak in both, no matter where they are sold. The distinctions are now mostly historical.
G.711 compression preserves the number of samples, and keeps each sample independent of the others. Therefore, it is easy to figure out how the samples can be packaged into packets or blocks. They can be cut arbitrarily, and a byte is a sample. This makes the codec quite flexible for voice mobility, and it should be a preferred option.
Error concealment, or packet loss concealment (PLC), is the means by which a codec can recover from packet loss, by faking the sound at the receiver until the stream catches up. G.711 has an extension, known as G.711I, or G.711, Appendix I. The most trivial error concealment technique is to just play silence. This does not really conceal the error. An additional technique is to repeat the last valid sample set (usually a 10ms or 20ms packet's worth) until the stream catches up. The problem is that, should the last sample set have had a plosive (any of the consonants that have a stop to them, like a p, d, t, k, and so on), the plosive will be repeated, providing an effect reminiscent of a quickly skipping record player or a 1980s science-fiction television character. Appendix I states that, to avoid this effect, the previous samples should be tested for the fundamental wavelength, and then blocks of those wavelengths should be cross-faded together to produce a more seamless recovery. This is a purely heuristic scheme for error recovery, and competes, to some extent, with just repeating the last segment then going silent.
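The simpler of these strategies, repeating the last segment and then going silent, is easy to sketch. The code below is only an illustration of that fallback, not of the Appendix I cross-fading method; the function name, the `max_repeats` limit, and the use of `None` to mark a lost packet are all assumptions.

```python
def conceal(packets, packet_len, max_repeats=2):
    """Fill lost packets (None) by repeating the last good one, then silence."""
    output = []
    last_good = None
    repeats = 0
    for packet in packets:
        if packet is not None:
            last_good = packet
            repeats = 0
            output.append(list(packet))
        elif last_good is not None and repeats < max_repeats:
            repeats += 1
            output.append(list(last_good))    # replay the previous segment
        else:
            output.append([0] * packet_len)   # give up and play silence
    return output


stream = [[1, 2], None, None, None, [3, 4]]
print(conceal(stream, packet_len=2))
```

With three packets lost in a row, the first two gaps are filled with a replay and the third falls back to silence.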
In many cases, G.711 is not even mentioned when it is being used. Instead, the codec may be referred to as PCM with μ-law or A-law encoding.


G.729 and Perceptual Compression

ITU G.729 and the related G.729a specify using a more advanced encoding scheme, which does not work sample by sample. Rather, it uses mathematical rules to try to relate neighboring samples together. The incoming sample stream is divided into 10ms blocks (with 5ms from the next block also required), and each block is then analyzed as a unit. G.729 provides a 16:1 compression ratio, as the incoming 80 samples of 16 bits each are brought down to a ten-byte encoded block.
The concept around G.729 compression is to use perceptual compression to classify the type of signal within the 10ms block. The concept here is to try to figure out how neighboring samples relate. Surely, they do relate, because they come from the same voice and the same pitch, and pitch is a concept that requires time (thus, more than one sample). G.729 uses a couple of techniques to try to figure out what the sample must "sound like," so it can then throw away much of the sample and transmit only the description of the sound.
To figure out what the sample block sounds like, G.729 uses Code-Excited Linear Prediction (CELP). The idea is that the encoder and decoder have a codebook of the basics of sounds. Each entry in the codebook can be used to generate some type of sound. G.729 maintains two codebooks: one fixed, and one that adapts with the signal. The model behind CELP is that the human voice is basically created by a simple set of flat vocal cords, which excite the airways. The airways (the mouth, tongue, and so on) are then thought of as signal filters, which have a rather specific, predictable effect on the sound coming up from the throat.
The signal is first brought in and linear prediction is used. Linear prediction tries to relate the samples in the block to the previous samples, and finds the optimal mapping. ("Optimal" does not always mean "good," as there is almost always an optimal way to approximate a function using a fixed number of parameters, even if the approximation is dead wrong.) The excitation represents the overall type of sound, a hum or a hiss, depending on the word being said. This is usually a simple sound, an "uhhh" or "ahhh." The linear predictor figures out how the humming gets shaped, as a simple filter. What's left over, then, is how the sound started in the first place, the excitation that makes up the more complicated, nuanced part of speech. The linear prediction's effects are removed, and the remaining signal is the residue, which must relate to the excitations. The nuances are looked up in the codebook, which contains some common residues and some others that are adaptive. Together, the information needed for the linear prediction and the codebook matches are packaged into the ten-byte output block, and the encoding is complete. The encoded block contains information on the pitch of the sound, the adaptive and fixed codebook entries that best match the excitation for the block, and the linear prediction match.
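To make the linear prediction step more concrete, here is a toy sketch that fits a two-coefficient predictor to one block by least squares and then computes the residue. It is an illustration of the general idea only: the predictor order, the fitting method, and the function names are assumptions, and the codebook search that CELP performs on the residue is not shown.

```python
import math


def fit_predictor(block):
    """Least-squares fit of x[n] ~ a1*x[n-1] + a2*x[n-2] over one block."""
    r11 = r12 = r22 = p1 = p2 = 0.0
    for n in range(2, len(block)):
        x0, x1, x2 = block[n], block[n - 1], block[n - 2]
        r11 += x1 * x1
        r12 += x1 * x2
        r22 += x2 * x2
        p1 += x0 * x1
        p2 += x0 * x2
    # Solve the 2x2 normal equations (det is zero for a silent block,
    # which this sketch does not handle).
    det = r11 * r22 - r12 * r12
    a1 = (p1 * r22 - p2 * r12) / det
    a2 = (p2 * r11 - p1 * r12) / det
    return a1, a2


def residue(block, a1, a2):
    """What is left of the block once the predictor's effect is removed."""
    return [block[n] - (a1 * block[n - 1] + a2 * block[n - 2])
            for n in range(2, len(block))]


# An 80-sample block (10 ms at 8000 samples per second) of a pure tone.
block = [math.cos(0.3 * n) for n in range(80)]
a1, a2 = fit_predictor(block)
leftover = residue(block, a1, a2)
print(a1, a2, max(abs(e) for e in leftover))
```

A pure tone is perfectly predictable from its two previous samples, so the residue here is essentially zero. Real speech leaves a residue that is not zero, and that leftover is what the codebooks are there to describe.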
On the other side, the decoding process looks up the codebooks for the excitations. These excitations get filtered through the linear predictor. The hope is that the results sound like human speech. And, often, it does. However, anyone who has used cellphones before is aware that, at times, they can render human speech into a facsimile that sounds quite like the person talking, but is made up of no recognizable syllables. That results from a CELP decoder struggling with a lossy channel, where some of the information is missing, and it is forced to fill in the blanks.
G.729a is an annex to G.729, or a modification, that uses a simpler structure to encode the signal. It is compatible with G.729, and so can be thought of as interchangeable for the purposes of this discussion.

Other Codecs

There are other voice codecs that are beginning to appear in the context of voice mobility. These codecs are not as prevalent as G.711 and G.729—some are not available in more than softphones and open-source implementations—but it is worth a paragraph on the subject. These newer coders are focused on improving the error concealment, or having better delay or jitter tolerances, or having a richer sound. One such example is the Internet Low Bitrate Codec (iLBC), which is used in a number of consumer-based peer-to-peer voice applications such as Skype.
Because the overhead of packets on most voice mobility networks is rather high, finding the highest amount of compression should not be the aim when establishing the network. Instead, it is better to find the codecs that are supported by the equipment and provide the highest quality of voice over the expected conditions of the network. For example, G.711 is fine in many conditions, and G.729 might not be necessary.

Codecs

When speech is carried on the Internet, it is of course carried in digital form. Speech is also carried digitally through most portions of modern telephone networks (although in the PSTN, it is still normally converted to analog form on the last mile of transmission over analog telephone lines). Having the speech signals available digitally provides the opportunity to use digital speech processing techniques, and particularly speech coders, which can compress the digital bit stream to low bit rates—with trade-offs against increased delay, implementation complexity/cost, and quality. In this section, we will discuss motivations for speech compression, review the basics of coding, discuss some specific coders, and look at the trade-offs that must be understood to decide which kind of coding is appropriate for specific applications (and, in particular, whether compression is desirable).

Motivations for Speech Coding in Internet-Telephony Integration

Classically, speech compression techniques have been used in situations where bandwidth is limited or very expensive. For example, prior to the development of high-bandwidth digital fiber-optic undersea cables, undersea telephone trunks were carried on very expensive analog coaxial cable systems, and the utilization of these expensive trunks was increased by the use of voice activity detection (VAD) techniques. The technique used by these early systems, known as time assignment speech interpolation (TASI), was to detect when speech was present on a given channel and then, in real time, switch it through to one of a set of undersea cable trunks. When the channel went silent due to a pause in the conversation (typically caused by Party A pausing to listen to Party B’s response—in normal, polite conversation, there are only short intervals when both parties talk simultaneously!), the channel would be disconnected and the trunk given over to another active speaker. Given the activity factor of normal conversation, these systems could achieve compression of about 2:1.

Another situation in which the need for speech compression seems obvious is that of wireless transmission—for example, in cellular telephony systems. Over-the-air bandwidth is limited both by government regulation and by the potential for multiple users of the free-space channel to interfere with each other. In addition, due to the tremendously high growth rate of cellular telephony, more and more callers seek to use the same limited bandwidth. As a result, modern cellular telephony standards do indeed provide speech coder options, which achieve various levels of compression and increased utilization of the scarce wireless bandwidth.

Turning to the subject—the integration of the Internet and telephony—the situation is somewhat less clear. On the telephony side, if we are talking about wireline telephony, the cost of bandwidth is a relatively small part of the total cost of providing service. On the Internet side, one of the defining characteristics of the explosive growth pattern that we have seen in recent years has been the utilization of higher- and higher-bandwidth channels to interconnect nodes in the Internet. To a great degree, this is caused by (and, in turn, enables) the running of multimedia applications over the Internet and especially the World Wide Web, involving the transfer of large image files, audio and video streams, and so on. With so much bandwidth, and with so many applications apparently using a larger average bandwidth than ordinary telephone speech, why would anyone want to compress speech on the Internet?

In fact, there are several possible reasons. One simple reason is that bandwidth for access to the Internet is often quite severely constrained, as is the case for consumers who dial up using modems with rates of 56 kbps or lower. In this case, voice and all other active applications have to share a bandwidth that is narrower than that normally provided in the telephone network for speech alone. Another motivation may be integration with wireless networks. As noted earlier, wireless voice typically uses coding for speech compression in order to conserve the scarce, expensive over-the-air bandwidth. So, wireless voice over IP may employ compression for exactly the same reason. Even in the case of an end-to-end path that is only partly wireless, keeping the voice encoded in a compressed format would avoid the degradation of voice quality that can come with repeated encoding and decoding (so-called tandem encoding).

Another whole class of applications for voice coding on the Internet may be revealed if we allow ourselves to step outside the traditional telephony definitions of 4 kHz, point-to-point, real-time voice. For example, integrated messaging applications require the storage of voice signals, and compression can then be important in reducing storage requirements.

Similarly, the flexibility afforded by an all-digital medium may be used to encode voice (or audio) for higher quality, including preserving more of the bandwidth of the input signal and employing multichannel communication techniques (for example, stereo). Applications include the creation of highly realistic teleconferences and the transmission of music. For these applications, efficient coding techniques may be used to keep the consumption of bandwidth by the high-quality service to reasonable levels.

Broadcast applications constitute another class whose needs differ from those of traditional point-to-point telephony. Broadcasting to n locations can consume n times the bandwidth of a point-to-point transmission, so there is a high motivation to code for compression. At the same time, broadcast applications typically are more tolerant of delay, which means that more sophisticated and complex coding algorithms may be used without introducing noticeable impairments.

Voice Coding Basics

Some specific references are given on the subject of voice coding in our bibliography. In this section, our goal is to provide some intuitive understanding of how coders work so that the material that follows makes sense to the general reader.

Everyone probably understands that voices and other sounds in general may differ in frequency (or, in musical terms, pitch). A woman’s voice is usually higher in frequency (or pitch) than a man’s, and a child’s may be higher still. The music of a bassoon is lower in frequency than that of a piccolo. Also, complex sounds actually consist of more than one frequency. In music, the overtones of the saxophone differ from those of the oboe, so that these two instruments sound different even if both are playing the same note. Similarly, we can distinguish between the voices of two people whom we know, even if their voices are about equal in overall pitch, due to the complex structure of frequencies produced by the human voice, which differs in detail from individual to individual.

An idea that is quite fundamental to voice coding is that the range of important frequencies in a given sound (and, in particular, the sound of the human voice) is limited. We need only reproduce this range of frequencies for the transmitted sound to be recognizable and useful. It is true that the quality and naturalness of the sound will be better if more frequencies are transmitted. However, we are all familiar with an example in which very useful voice transmission is accomplished by using a quite limited range of frequencies: In telephone networks, frequencies higher than 4 kHz (4000 vibrations per second) are not transmitted, and, in fact, the actual range is probably closer to between 200 and 3400 Hz. Nonetheless, we can not only understand what is said, but usually we can recognize and distinguish familiar voices as well.

Digital coding starts with sampling the continuous-sound waveform at discrete intervals (see Figure 1). An important theorem states that the waveform may be completely reproduced from samples if the sampling rate is at least twice as great as the highest frequency that is contained in the sound. Now you can see why we were so concerned with the range of frequencies to be transmitted—this tells us what sampling rate is needed. For telephone speech that is limited to 4 kHz, the sampling rate is 2 x 4000 = 8000 samples per second.
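A quick way to see why the sampling rate matters is to sample a tone that lies above half the sampling rate. The sketch below, with a hypothetical helper `sample_tone`, compares a 3 kHz tone and a 5 kHz tone, both sampled at 8000 samples per second; the choice of these two frequencies is only an example.

```python
import math


def sample_tone(freq_hz, rate_hz, count):
    """Samples of a cosine tone taken at the given sampling rate."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]


low = sample_tone(3000, 8000, 16)   # below the 4 kHz limit
high = sample_tone(5000, 8000, 16)  # above the 4 kHz limit
print(max(abs(a - b) for a, b in zip(low, high)))
```

The two sets of samples come out the same to within rounding error, so once sampled, the 5 kHz tone cannot be told apart from the 3 kHz one. This is why the input has to be limited to 4 kHz before it is sampled at this rate.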

Figure 1: Sound sampling.

Each of these samples is a measurement, typically of an electrical voltage somewhere inside the coding system. However, in order to be carried in packets on the Internet (or, indeed, on a digital telephone network), the output of the coder needs to be a string of bits. This is accomplished by encoding each voltage measurement sample as a binary number. Another critical parameter is how many bits will be used to encode each sample. Research done over 40 years ago showed that excellent speech reproduction could be achieved with 8 bits per sample. At 8 bits per sample and 8000 samples per second, this implies a bit rate out of the coder of 64,000 bits per second (64 kbps).

The system that we have just described is the most basic form of digital coder, called a pulse-code modulation (PCM) system. In practice, one additional step is taken to improve the performance of such coders and, in fact, to ensure that 8 bits per sample is sufficient to encode the samples. This is companding, in which the dynamic range (range between maximum and minimum values) is compressed at the coder and then decompressed (expanded) at the decoder. Telephone systems in the United States and some other places use a companding formula called μ-law, and those in Europe and some other places use a companding formula called A-law, leading to one of those troublesome international standards differences! Both the A- and μ-law systems have been standardized by ITU-T as G.711.
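Companding can be sketched with the continuous form of the μ-law curve. The formula and the value μ = 255 used below are not given in this text; they are the commonly quoted continuous version and are stated here as an assumption. G.711 itself works with a piecewise approximation on integer samples, so this is an illustration of the idea rather than the standardized encoder.

```python
import math

MU = 255.0  # assumed value, commonly quoted for mu-law


def compress(x):
    """Compress a sample in the range -1.0..1.0 (done at the coder)."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Undo the compression (done at the decoder)."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


for x in (0.01, 0.1, 0.5, 1.0):
    print(x, round(compress(x), 3), round(expand(compress(x)), 3))
```

A quiet sample at 1 percent of full scale is pushed up to more than a fifth of the compressed range, so quiet sounds get a much larger share of the 8-bit codes than they would with straight linear coding.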

By the way, how does the decoder work? For PCM, it is relatively simple to describe. The sequence of 8-bit binary numbers is turned back into a string of voltage pulses, the magnitude of each pulse corresponding to the encoded voltage measurement. This string of pulses is then passed through a circuit called a low-pass filter, which interpolates between the pulses, producing a smooth signal. (In terms of frequency, the action of the low-pass filter is to remove irrelevant high-frequency components; in the case of standard telephony, everything above 4 kHz is removed.) What remains is a close reproduction of the original voice waveform with quantization noise, which represents the slightly inaccurate encoding of the voltage measurements as finite-length binary numbers.

Achieving a Lower Bit Rate

Very simply, the goal of compressed speech coding for telephony is to use less than 64 kbps of bandwidth while preserving desirable characteristics of the speech. Here we will briefly discuss some basic approaches to reducing the bit rate that is required for carrying digital speech below the nominal 64 kbps.

A simple variation of PCM that achieves a lower bit rate with still quite good quality is differential PCM (DPCM). In DPCM, information about the difference between succeeding samples is transmitted instead of their absolute value. This takes advantage of the fact that two succeeding samples will often be quite close in value. With a slightly more sophisticated variant, called adaptive differential PCM (ADPCM), it is easy to get quite excellent speech reproduction while using only half the bit rate of straight PCM (that is, using 32 kbps). ADPCM has been standardized by ITU-T as G.726.
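A toy version of DPCM fits in a few lines. The sketch below sends a small code for the difference between each sample and the decoder's running reconstruction; the fixed step size of 16 and the 4-bit code width are arbitrary choices for illustration, and nothing here follows the G.726 algorithm, which adapts its step size.

```python
import math


def dpcm_encode(samples, step=16, bits=4):
    """Encode each sample as a small code for its difference from the prediction."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    predicted = 0  # tracks what the decoder will have reconstructed
    codes = []
    for sample in samples:
        code = round((sample - predicted) / step)
        code = max(low, min(high, code))  # the code must fit in `bits` bits
        codes.append(code)
        predicted += code * step
    return codes


def dpcm_decode(codes, step=16):
    """Rebuild the signal by accumulating the transmitted differences."""
    value = 0
    decoded = []
    for code in codes:
        value += code * step
        decoded.append(value)
    return decoded


samples = [round(1000 * math.sin(2 * math.pi * n / 80)) for n in range(80)]
codes = dpcm_encode(samples)
decoded = dpcm_decode(codes)
print(max(abs(s - d) for s, d in zip(samples, decoded)))
```

For this slowly varying 100 Hz tone, every difference fits in the 4-bit code and the reconstruction stays within half a step of the original. Four-bit codes at 8000 samples per second come to 32 kbps, the same rate quoted above for ADPCM. A signal that changes faster than the largest code can follow would lag behind, which is one reason practical coders adapt the step size.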

To get lower bit rates, it is necessary to adopt much more sophisticated approaches. Some low-bit-rate coders attempt to take advantage of the fact that the input signal is known to be human speech. Vocoding, in which an electrical model of the human vocal tract is constructed and used as the basis of a low-bit-rate coder/decoder system, is a decades-old idea from speech research that has been made practical by advances in high-speed electronics. Besides vocoders, other types of very-low-bit-rate coders include parametric coders and waveform interpolation coders. The ITU-T has standardized a number of low-bit-rate coders, including the following:

  • G.723.1, low-bit-rate coder for multimedia applications, 6.3 and 5.3 kbps

  • G.728, 16-kbps low-delay code-excited linear prediction (LD-CELP) coder

  • G.729, 8-kbps conjugate-structure algebraic-code-excited linear prediction (CS-ACELP)

For an excellent discussion of these low-bit-rate ITU coders, see Cox and Kroon (1996).

Since we are interested in integrating Internet and telephony, it is important to note that the coder bit rates we have quoted do not, of course, take into account various overheads that are introduced when voice is packetized, compared with circuit-switched voice. Packetization is quite a complex subject in its own right, and outside the scope of our present subject, speech coding. Suffice it to say that, depending on the specific choice of coder, packetization technique, and protocol stack, it is quite possible to use up most (or even all!) of the bandwidth gained in compression through packetization overhead. Other choices can result in a net bandwidth gain compared with uncompressed circuit-switched voice. Obviously, packetization is an area that requires careful attention if achieving actual bandwidth savings is important to the application.
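A rough calculation shows how much the headers can matter. The sketch below assumes a protocol stack of IPv4, UDP, and RTP with 20, 8, and 12 bytes of header respectively, and ignores link-layer framing; that stack and the packet intervals are assumptions chosen for illustration, not something specified in this text.

```python
HEADER_BYTES = 20 + 8 + 12  # assumed: IPv4 + UDP + RTP, no link-layer framing


def on_wire_kbps(codec_kbps, packet_ms):
    """Bit rate including headers, for one packet every packet_ms milliseconds."""
    payload_bytes = codec_kbps * packet_ms / 8  # kbps times ms gives bits
    total_bits = (payload_bytes + HEADER_BYTES) * 8
    return total_bits / packet_ms


print(on_wire_kbps(8, 20))   # 8-kbps coder, 20 ms of speech per packet
print(on_wire_kbps(8, 10))   # same coder, 10 ms per packet
print(on_wire_kbps(64, 20))  # 64-kbps PCM, 20 ms per packet
```

Under these assumptions the 8-kbps coder occupies 24 kbps on the wire with 20 ms packets and 40 kbps with 10 ms packets, while 64-kbps PCM occupies 80 kbps. The compressed stream still beats 64 kbps circuit-switched voice here, but most of what it sends is header rather than speech.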

Trade-offs

In spite of the truly impressive advances that have been made in the past few years both in developing more sophisticated algorithms for compression and in high-speed electronics to run them, the world of speech coding still provides many illustrations of the earthy adage: There’s no such thing as a free lunch. In general, lower-bit-rate coders introduce more delay in the signal path, are more complex and expensive to implement, and involve more compromises to voice quality. This section discusses these trade-offs and is intended to help you decide whether you want to use voice compression for your application and, if so, how aggressive you can afford to be.

Delay

Voice communication can be highly sensitive to total end-to-end delay. Excessive delay interrupts the normal conversational pattern in which speakers reply to each other’s utterances and also exacerbates the problem of echoes in communication circuits. Delay is the reason why links via geostationary satellites are, at present, only used on very thin traffic routes in the modern international telephone network. Even with echo cancellation systems in place, the hundreds of milliseconds of delay introduced by the trip up to the satellite and back is very disruptive to conversation, which you will notice immediately if you ever make a call over such a circuit. The strong preference is to use optical fiber routed over the earth’s surface (or under the ocean) wherever it is feasible.

The most fundamental component of delay introduced by speech processing is called algorithmic delay. Algorithmic delay comes about because most speech coders work by doing an analysis on a batch of speech samples. Some minimum amount of speech is needed to do this analysis, and the time to accumulate this number of samples is an irreducible delay component—the algorithmic delay. Another component added by the coder is processing delay, the time for the coder hardware to analyze the speech and the decoder hardware to reproduce it. This component can be reduced by using faster hardware. Cox and Kroon (1996) state that for ease of communication, the total system delay, which includes these coder components plus the one-way communication delay, should be less than 200 ms. The algorithmic delay for G.729 and G.723.1 coders is 15 and 37.5 ms, respectively. Assuming typical processing delays and communication over a serial connection (such as a circuit-switched transport), operating at the bit rate of the coder, the total system delays will be 35 and 97.5 ms, respectively. If a packet network such as the Internet is involved, there may be an additional packet filling delay. For example, for G.729, the coder outputs 80 bits of compressed speech every 10 ms. If the packet size is 160 bits, this means we have to wait an additional 10 ms before we can transmit the packet, thereby increasing the overall system delay.
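The packet filling delay in that last example is simple arithmetic, sketched below with a hypothetical helper. It assumes the packet holds a whole number of coder frames and that the packet is sent as soon as the last frame arrives.

```python
def packet_fill_delay_ms(frame_bits, frame_ms, packet_bits):
    """Extra wait, beyond the first frame, before a packet is full."""
    frames_per_packet = packet_bits // frame_bits
    return (frames_per_packet - 1) * frame_ms


# G.729 figures from the text: 80 bits every 10 ms, 160-bit packets.
extra = packet_fill_delay_ms(80, 10, 160)
print(extra, 35 + extra)  # extra delay, and the 35 ms total with it added
```

Packing more frames into each packet cuts the header overhead per bit of speech, but each additional frame adds another 10 ms to the delay budget.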

From an application point of view, you may want to avoid the use of aggressive low-bit-rate coding in situations where the quality of interaction counts for a lot—teleconferencing, for example, or calls that your salesforce makes to customers. By contrast, a one-way voice broadcast would not be much impaired by some extra delay. Another issue to look out for is added delay from other active electronics in the path.

Complexity

The issue of complexity is of direct concern to designers of equipment. The more demanding a speech processing algorithm is of processing power and memory, the bigger and more expensive the digital signal processor (DSP) or other specialized chip needs to be. For the purchaser of equipment, this primarily translates into an impact on price, but possibly to some other parameters of interest, such as power consumption in a wireless handset, which will determine how long you can talk before the battery runs out.

Quality

The tried-and-true method of measuring quality in voice communications, and the one that is still used to evaluate speech coders, is the subjective test of mean opinion score (MOS). This is a test in which people are asked to listen to the speech and rate its quality as bad, poor, fair, good, or excellent. Cox and Kroon (1996) have compiled the results of many MOS tests of ITU-T and other standardized coders.
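The score itself is just an average. The sketch below assumes the usual convention of mapping the five ratings to the numbers 1 through 5; that mapping is not stated in this text, and the ratings in the example are made up.

```python
SCORES = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}  # assumed mapping


def mean_opinion_score(ratings):
    """Average the listeners' ratings on the assumed 1-to-5 scale."""
    return sum(SCORES[r] for r in ratings) / len(ratings)


print(mean_opinion_score(["good", "good", "fair", "excellent"]))
```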

Behind the seeming scientific nature of mean opinion score testing are many issues that are difficult to quantify. How do the coders perform in the presence of a variety of types of background noise? Can individual speakers be recognized by the sound of their voices? What if the sound is something other than voice (music, for example)? The best thing for a prospective system purchaser to do is listen, of course, and test the system in as close an approximation of the intended environment as possible.

The bottom line is that integration of the PSTN and the Internet presents opportunities to use very sophisticated, modern voice coding techniques, but it is up to you as the system developer or purchaser to decide whether the advantages are worth the cost and potential trade-offs in quality.

Telecom Made Simple
