When speech is carried on the  Internet, it is of course carried in digital form. Speech is also carried  digitally through most portions of modern telephone networks (although in the  PSTN, it is still normally converted to analog form on the last mile of  transmission over analog telephone lines). Having the speech signals available  digitally provides the opportunity to use digital speech processing techniques,  and particularly speech coders, which can compress the digital bit stream to low  bit rates—with trade-offs against increased delay, implementation  complexity/cost, and quality. In this section, we will discuss motivations for  speech compression, review the basics of coding, discuss some specific coders,  and look at the trade-offs that must be understood to decide which kind of  coding is appropriate for specific applications (and, in particular, whether  compression is desirable).  
Motivations for Speech Coding in Internet-Telephony Integration
Classically, speech compression  techniques have been used in situations where bandwidth is limited or very  expensive. For example, prior to the development of high-bandwidth digital  fiber-optic undersea cables, undersea telephone trunks were carried on very  expensive analog coaxial cable systems, and the utilization of these expensive  trunks was increased by the use of voice activity  detection (VAD) techniques. The technique used by these early  systems, known as time assignment speech  interpolation (TASI), was to detect when speech was present on a  given channel and then, in real time, switch it through to one of a set of  undersea cable trunks. When the channel went silent due to a pause in the  conversation (typically caused by Party A pausing to listen to Party B’s  response—in normal, polite conversation, there are only short intervals when  both parties talk simultaneously!), the channel would be disconnected and the  trunk given over to another active speaker. Given the activity factor of normal  conversation, these systems could achieve compression of about 2:1.
Another situation in which the need for  speech compression seems obvious is that of wireless transmission—for example,  in cellular telephony systems. Over-the-air bandwidth is limited both by  government regulation and by the potential for multiple users of the free-space  channel to interfere with each other. In addition, due to the tremendously high  growth rate of cellular telephony, more and more callers seek to use the same  limited bandwidth. As a result, modern cellular telephony standards do indeed  provide speech coder options, which achieve various levels of compression and  increased utilization of the scarce wireless bandwidth.
Turning to the subject—the  integration of the Internet and telephony—the situation is somewhat less clear.  On the telephony side, if we are talking about wireline telephony, the cost of  bandwidth is a relatively small part of the total cost of providing service. On  the Internet side, one of the defining characteristics of the explosive growth  pattern that we have seen in recent years has been the utilization of higher-  and higher-bandwidth channels to interconnect nodes in the Internet. To a great  degree, this is caused by (and, in turn, enables) the running of multimedia  applications over the Internet and especially the World Wide Web, involving the  transfer of large image files, audio and video streams, and so on. With so much  bandwidth, and with so many applications apparently using a larger average  bandwidth than ordinary telephone speech, why would anyone want to compress  speech on the Internet?
In fact, there are several possible  reasons. One simple reason is that bandwidth for access to the Internet is often  quite severely constrained, as is the case for consumers who dial up using  modems with rates of 56 kbps or lower. In this case, voice and all other active  applications have to share a bandwidth that is narrower than that normally  provided in the telephone network for speech alone. Another motivation may be  integration with wireless networks. As noted earlier, wireless voice typically  uses coding for speech compression in order to conserve the scarce, expensive  over-the-air bandwidth. So, wireless voice over IP may employ compression for  exactly the same reason. Even in the case of an end-to-end path that is only  partly wireless, keeping the voice encoded in a compressed format would avoid  the degradation of voice quality that can come with repeated encoding and  decoding (so-called tandem encoding).  
Another whole class of applications for  voice coding on the Internet may be revealed if we allow ourselves to step  outside the traditional telephony definitions of 4 kHz, point-to-point,  real-time voice. For example, integrated messaging applications require the  storage of voice signals, and compression can then be important in reducing  storage requirements.
Similarly, the flexibility afforded by an  all-digital medium may be used to encode voice (or audio) for higher quality,  including preserving more of the bandwidth of the input signal and employing  multichannel communication techniques (for example, stereo). Applications  include the creation of highly realistic teleconferences and the transmission of  music. For these applications, efficient coding techniques may be used to keep  the consumption of bandwidth by the high-quality service to reasonable  levels.
Broadcast applications constitute  another class whose needs differ from those of traditional point-to-point  telephony. Broadcasting to n locations can  consume n times the bandwidth of a  point-to-point transmission, so there is a high motivation to code for  compression. At the same time, broadcast applications typically are more  tolerant of delay, which means that more sophisticated and complex coding  algorithms may be used without introducing noticeable impairments.
Voice Coding Basics
Some specific references are given  on the subject of voice coding in our bibliography. In this section, our goal is  to provide some intuitive understanding of how coders work so that the material  that follows makes sense to the general reader.
Everyone probably understands that voices  and other sounds in general may differ in frequency (or, in musical terms, pitch). A woman’s voice is usually higher in  frequency (or pitch) than a man’s, and a child’s may be higher still. The music  of a bassoon is lower in frequency than that of a piccolo. Also, complex sounds  actually consist of more than one frequency. In music, the overtones of the  saxophone differ from those of the oboe, so that these two instruments sound  different even if both are playing the same note. Similarly, we can distinguish  between the voices of two people whom we know, even if their voices are about  equal in overall pitch, due to the complex structure of frequencies produced by  the human voice, which differs in detail from individual to individual.  
An idea that is quite fundamental to  voice coding is that the range of important frequencies in a given sound (and,  in particular, the sound of the human voice) is limited. We need only reproduce  this range of frequencies for the transmitted sound to be recognizable and  useful. It is true that the quality and naturalness of the sound will be better  if more frequencies are transmitted. However, we are all familiar with an  example in which very useful voice transmission is accomplished by using a quite  limited range of frequencies: In telephone networks, frequencies higher than 4  kHz (4000 vibrations per second) are not transmitted, and, in fact, the actual  range is probably closer to between 200 and 3400 Hz. Nonetheless, we can not  only understand what is said, but usually we can recognize and distinguish  familiar voices as well.
Digital coding starts with sampling the continuous sound waveform at discrete intervals (see Figure 1). An important theorem states that the waveform may be completely reproduced from samples if the sampling rate is at least twice as great as the highest frequency that is contained in the sound. Now you can see why we were so concerned with the range of frequencies to be transmitted—this tells us what sampling rate is needed. For telephone speech that is limited to 4 kHz, the sampling rate is 2 × 4000 = 8000 samples per second.
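The theorem's requirement can be illustrated numerically. The sketch below (a minimal illustration, not part of any standard) shows that sampling at 8000 samples per second cannot distinguish a 5-kHz tone from a 3-kHz one, which is why the input must be band-limited before sampling:

```python
import math

FS = 8000  # telephone sampling rate, in samples per second

def sample_cosine(freq_hz, n_samples, fs=FS):
    """Sample a unit-amplitude cosine of the given frequency at rate fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs) for n in range(n_samples)]

# A 5-kHz tone is above the Nyquist limit (fs / 2 = 4 kHz), so its samples
# are identical to those of a 3-kHz tone (8000 - 5000 Hz): after sampling,
# the two signals cannot be told apart.
assert all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(sample_cosine(5000, 16), sample_cosine(3000, 16))
)
```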
Each of these samples is a measurement,  typically of an electrical voltage somewhere inside the coding system. However,  in order to be carried in packets on the Internet (or, indeed, on a digital  telephone network), the output of the coder needs to be a string of bits. This  is accomplished by encoding each voltage  measurement sample as a binary number. Another critical parameter is how many  bits will be used to encode each sample. Research done over 40 years ago showed  that excellent speech reproduction could be achieved with 8 bits per sample. At  8 bits per sample and 8000 samples per second, this implies a bit rate out of  the coder of 64,000 bits per second (64 kbps).
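The arithmetic above is easy to verify, and a toy uniform quantizer shows how each voltage sample becomes an 8-bit code. This is a sketch for illustration only; the symmetric voltage range and the simple rounding rule are our assumptions, not part of any standard:

```python
SAMPLE_RATE = 8000       # samples per second: 2 x 4 kHz, per the sampling theorem
BITS_PER_SAMPLE = 8

# 8 bits x 8000 samples per second gives the classic 64-kbps PCM rate.
assert SAMPLE_RATE * BITS_PER_SAMPLE == 64_000

def quantize_uniform(voltage, v_max=1.0, bits=BITS_PER_SAMPLE):
    """Map a voltage in [-v_max, +v_max] to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits
    v = max(-v_max, min(v_max, voltage))          # clamp out-of-range inputs
    return int((v + v_max) / (2 * v_max) * (levels - 1) + 0.5)

assert quantize_uniform(0.0) == 128   # mid-scale code
assert quantize_uniform(1.0) == 255   # full-scale code
```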
The system that we have just described is the most basic form of digital coder, called a pulse-code modulation (PCM) system. In practice, one additional step is taken to improve the performance of such coders and, in fact, to ensure that 8 bits per sample is sufficient to encode the samples. This is companding, in which the dynamic range (the range between maximum and minimum values) is compressed at the coder and then decompressed (expanded) at the decoder. Telephone systems in the United States and some other places use a companding formula called μ-law, and those in Europe and some other places use a companding formula called A-law, leading to one of those troublesome international standards differences! Both the A-law and μ-law systems have been standardized by ITU-T as G.711.
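The companding idea can be sketched with the continuous μ-law formula. Note that this is an illustration rather than a bit-exact G.711 codec: the deployed standard uses a segmented, piecewise-linear approximation of this curve.

```python
import math

MU = 255  # mu-law parameter used in North American telephony

def mu_compress(x):
    """Compress a sample x in [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Invert the compression at the decoder."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# The round trip is (nearly) lossless before quantization...
x = 0.01
assert math.isclose(mu_expand(mu_compress(x)), x, rel_tol=1e-9)

# ...and small signals are boosted before quantization, so quiet speech
# keeps more resolution than it would with uniform 8-bit coding.
assert mu_compress(0.01) > 0.1
```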
By the way, how does the decoder work?  For PCM, it is relatively simple to describe. The sequence of 8-bit binary  numbers is turned back into a string of voltage pulses, the magnitude of each  pulse corresponding to the encoded voltage measurement. This string of pulses is  then passed through a circuit called a low-pass  filter, which interpolates between the pulses, producing a smooth  signal. (In terms of frequency, the action of the low-pass filter is to remove  irrelevant high-frequency components; in the case of standard telephony,  everything above 4 kHz is removed.) What remains is a close reproduction of the  original voice waveform with quantization  noise, which represents the slightly inaccurate encoding of the  voltage measurements as finite-length binary numbers.
Achieving a Lower Bit Rate
Very simply, the goal of compressed  speech coding for telephony is to use less than 64 kbps of bandwidth while  preserving desirable characteristics of the speech. Here we will briefly discuss  some basic approaches to reducing the bit rate that is required for carrying  digital speech below the nominal 64 kbps.
A simple variation of PCM that achieves a  lower bit rate with still quite good quality is differential PCM (DPCM). In DPCM, information about  the difference between succeeding samples is transmitted instead of their  absolute value. This takes advantage of the fact that two succeeding samples  will often be quite close in value. With a slightly more sophisticated variant,  called adaptive differential PCM (ADPCM), it  is easy to get quite excellent speech reproduction while using only half the bit  rate of straight PCM (that is, using 32 kbps). ADPCM has been standardized by  ITU-T as G.726.
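A toy DPCM loop makes the idea concrete. This is a bare sketch with a fixed step size, not the G.726 algorithm, which adds adaptive step sizes and prediction:

```python
def dpcm_encode(samples, step=0.05):
    """Toy DPCM: transmit quantized differences from the previous
    reconstructed sample, mirroring the decoder to avoid drift."""
    codes, prediction = [], 0.0
    for s in samples:
        diff = s - prediction
        code = round(diff / step)          # quantize the (small) difference
        codes.append(code)
        prediction += code * step          # track what the decoder will rebuild
    return codes

def dpcm_decode(codes, step=0.05):
    out, value = [], 0.0
    for code in codes:
        value += code * step
        out.append(value)
    return out

# Neighboring speech samples are close in value, so the differences
# (and hence the codes) stay small and need fewer bits than raw PCM.
signal = [0.0, 0.04, 0.11, 0.18, 0.22, 0.20, 0.15]
decoded = dpcm_decode(dpcm_encode(signal))
assert all(abs(a - b) <= 0.025 for a, b in zip(signal, decoded))  # within half a step
```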
To get lower bit rates, it is necessary  to adopt much more sophisticated approaches. Some low-bit-rate coders attempt to  take advantage of the fact that the input signal is known to be human speech.  Vocoding, in which an electrical model of the  human vocal tract is constructed and used as the basis of a low-bit-rate  coder/decoder system, is a decades-old idea from speech research that has been  made practical by advances in high-speed electronics. Besides vocoders, other  types of very-low-bit-rate coders include parametric  coders and waveform interpolation  coders. The ITU-T has standardized a number of low-bit-rate coders,  including the following:
- G.723.1, low-bit-rate coder for multimedia applications, 6.3 and 5.3 kbps
- G.728, 16-kbps low-delay code-excited linear prediction (LD-CELP) coder
- G.729, 8-kbps conjugate-structure algebraic CELP (CS-ACELP) coder
For an excellent discussion of these  low-bit-rate ITU coders, see Cox and Kroon (1996).
Since we are interested in  integrating Internet and telephony, it is important to note that the coder bit  rates we have quoted do not, of course, take into account various overheads that  are introduced when voice is packetized, compared with circuit-switched voice.  Packetization is quite a complex subject in its own right, and outside the scope  of our present subject—speech coding. Suffice it to say that, depending on the  specific choice of coder, packetization technique, and protocol stack, it is  quite possible to use up most (or even all!) of the bandwidth gained in  compression through packetization overhead. Other choices can result in a net  bandwidth gain compared with uncompressed circuit-switched voice. Obviously,  packetization is an area that requires careful attention if achieving actual  bandwidth savings is important to the application.
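As a rough numerical illustration of this point: the 40-byte header figure below assumes an uncompressed IPv4 + UDP + RTP stack (20 + 8 + 12 bytes per packet), which is our assumption, not a figure from the text; real deployments vary and may use header compression.

```python
# Assumed per-packet overhead: IPv4 (20) + UDP (8) + RTP (12) bytes.
HEADER_BITS = (20 + 8 + 12) * 8

def wire_bit_rate(coder_bps, ms_per_packet):
    """Bits per second on the wire once each packet's headers are added."""
    payload_bits = coder_bps * ms_per_packet // 1000   # speech bits per packet
    return (payload_bits + HEADER_BITS) * 1000 // ms_per_packet

# An 8-kbps coder sending one 10-ms frame per packet balloons to 40 kbps,
# wiping out most of the compression gain over 64-kbps PCM...
assert wire_bit_rate(8000, 10) == 40_000
# ...while packing 40 ms of speech per packet keeps it to 16 kbps,
# at the cost of extra delay.
assert wire_bit_rate(8000, 40) == 16_000
```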
Trade-offs
In spite of the truly impressive advances that have been made in the past few years both in developing more sophisticated algorithms for compression and in high-speed electronics to run them, the world of speech coding still provides many illustrations of the earthy adage: There's no such thing as a free lunch. In general, lower-bit-rate coders introduce more delay in the signal path, are more complex and expensive to implement, and involve more compromises to voice quality. This section discusses these trade-offs and is intended to help you decide whether you want to use voice compression for your application and, if so, how aggressive you can afford to be.
Delay
Voice communication can be highly  sensitive to total end-to-end delay. Excessive delay interrupts the normal  conversational pattern in which speakers reply to each other’s utterances and  also exacerbates the problem of echoes in communication circuits. Delay is the  reason why links via geostationary satellites are, at present, only used on very  thin traffic routes in the modern international telephone network. Even with  echo cancellation systems in place, the hundreds of milliseconds of delay  introduced by the trip up to the satellite and back is very disruptive to  conversation, which you will notice immediately if you ever make a call over  such a circuit. The strong preference is to use optical fiber routed over the  earth’s surface (or under the ocean) wherever it is feasible.  
The most fundamental component of delay introduced by speech processing is called algorithmic delay. Algorithmic delay comes about because most speech coders work by analyzing a batch of speech samples. Some minimum amount of speech is needed to do this analysis, and the time to accumulate this number of samples is an irreducible delay component—the algorithmic delay. Another component added by the coder is processing delay, the time for the coder hardware to analyze the speech and the decoder hardware to reproduce it. This component can be reduced by using faster hardware. Cox and Kroon (1996) state that for ease of communication, the total system delay, which includes these coder components plus the one-way communication delay, should be less than 200 ms. The algorithmic delays for the G.729 and G.723.1 coders are 15 and 37.5 ms, respectively. Assuming typical processing delays and communication over a serial connection (such as a circuit-switched transport) operating at the bit rate of the coder, the total system delays will be 35 and 97.5 ms, respectively. If a packet network such as the Internet is involved, there may be an additional packet filling delay. For example, for G.729, the coder outputs 80 bits of compressed speech every 10 ms. If the packet size is 160 bits, this means we have to wait an additional 10 ms before we can transmit the packet, thereby increasing the overall system delay.
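The G.729 figures quoted above can be assembled into a simple delay budget. The 20-ms allowance for processing and transmission below is our assumption, chosen to be consistent with the 35-ms total quoted in the text:

```python
# G.729 figures from the text: 15-ms algorithmic delay, 80 bits every 10 ms.
ALGORITHMIC_MS = 15
FRAME_MS = 10
BITS_PER_FRAME = 80

def packet_fill_delay_ms(packet_bits):
    """Extra wait to accumulate enough 10-ms coder frames to fill one packet."""
    frames_needed = -(-packet_bits // BITS_PER_FRAME)   # ceiling division
    return (frames_needed - 1) * FRAME_MS               # first frame is already counted

assert packet_fill_delay_ms(80) == 0     # one frame per packet: no extra wait
assert packet_fill_delay_ms(160) == 10   # the text's 160-bit example

# Assumed allowance for processing and transmission delay (our assumption).
OTHER_MS = 20
total_ms = ALGORITHMIC_MS + OTHER_MS + packet_fill_delay_ms(160)
assert total_ms == 45                    # comfortably under the 200-ms target
```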
From an application point of view,  you may want to avoid the use of aggressive low-bit-rate coding in situations  where the quality of interaction counts for a lot—teleconferencing, for example,  or calls that your salesforce makes to customers. By contrast, a one-way voice  broadcast would not be much impaired by some extra delay. Another issue to look  out for is added delay from other active electronics in the path.
Complexity
The issue of complexity is of  direct concern to designers of equipment. The more demanding a speech processing  algorithm is of processing power and memory, the bigger and more expensive the  digital signal processor (DSP) or other  specialized chip needs to be. For the purchaser of equipment, this primarily  translates into an impact on price, but possibly to some other parameters of  interest, such as power consumption in a wireless handset, which will determine  how long you can talk before the battery runs out.
Quality
The tried-and-true method of  measuring quality in voice communications, and the one that is still used to  evaluate speech coders, is the subjective test of mean  opinion score (MOS). This is a test in which people are asked to  listen to the speech and rate its quality as bad, poor, fair, good, or  excellent. Cox and Kroon (1996) have compiled the results of many MOS tests of  ITU-T and other standardized coders.  
Behind the seeming scientific nature of  mean opinion score testing are many issues that are difficult to quantify. How  do the coders perform in the presence of a variety of types of background noise?  Can individual speakers be recognized by the sound of their voices? What if the  sound is something other than voice (music, for example)? The best thing for a  prospective system purchaser to do is listen, of course, and test the system in  as close an approximation of the intended environment as possible.
The bottom line is that integration  of the PSTN and the Internet presents opportunities to use very sophisticated,  modern voice coding techniques, but it is up to you as the system developer or  purchaser to decide whether the advantages are worth the cost and potential  trade-offs in quality.