Codecs


The 8000 samples-per-second PCM signal, at 16 bits per sample, results in 128,000 bits per second of information. That's fairly high, especially in the world of wireline telephone networks, in which every bit represented some collection of additional copper lines that needed to have been laid in the ground. Therefore, the concept of audio compression was brought to bear on the subject.
An audio or video compression mechanism is often referred to as a codec, short for coder-decoder. The reason is that the compressed signal is often thought of as being in a code: some sequence of bits that is meaningful to the decoder but not much else. (Unfortunately, in anything digital, the term code is used far too often.)
The simplest coder that can be thought of is a null codec. A null codec doesn't touch the audio: you get out what you put in. More meaningful codecs reduce the amount of information in the signal. All lossy compression algorithms (and most audio and video codecs are lossy) stem from the realization that the human mind and senses cannot detect every slight variation in the media being presented. There is a lot of noise that can be added, in just the right ways, and no one will notice, because we are more sensitive to certain types of variations than to others.
For audio, we can think of it this way. As you drive along the highway, listening to AM radio, there is always some amount of noise creeping in, whether from your car passing behind a concrete building, or under power lines, or behind hills. This noise is always there, but you don't always hear it. Sometimes, the noise is excessive, and the station becomes annoying to listen to or incomprehensible, drowned out by static. Other times, however, the noise is there but does not interfere with your ability to hear what is being said. The human mind is able to compensate for quite a lot of background noise, silently deleting it from perception, as anyone who has noticed the refrigerator's compressor stop, or realized that a crowded, noisy room has just gone quiet, can attest. Lossy compression, then, is the art of knowing which types of noise the listener can tolerate, which they cannot stand, and which they might not even be able to hear.
(Why noise? Lossy compression is a method of deleting information, which may or may not be needed. Clearly, every bit is needed to restore the signal to its original sampled state. Deleting a few bits requires that the decompressor, or decoder, restore those deleted bits' worth of information on the other end, filling them in with whatever the algorithm states is appropriate. That results in a difference between the restored signal and the original, and that difference is distortion. Subtract the two signals, and the resulting difference signal is the noise that was added to the original signal by the compression algorithm. One need only amplify this noise signal to appreciate how it sounds.)
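That subtraction is literal. Here is a minimal sketch in C (the function name compression_noise is made up for illustration) that recovers the noise a codec added, given the original and decoded 16-bit sample streams.

#include <stddef.h>

/* Recover the codec's noise signal: the sample-by-sample difference between
 * the decoded output and the original. The difference is assumed to fit in
 * 16 bits. Amplify diff[] and play it to hear what the codec discarded. */
void compression_noise(const short *original, const short *decoded,
                       short *diff, size_t n)
{
    for (size_t i = 0; i < n; i++)
        diff[i] = (short)(decoded[i] - original[i]);
}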


G.711 and Logarithmic Compression

The first, and simplest, lossy compression codec for audio that we need to look at is called logarithmic compression. Sixteen bits is a lot to encode the intensity of an audio sample. The reason 16 bits was chosen is that it has fine enough detail to adequately represent the variations of the softer sounds that might be recorded. But louder sounds do not need such fine detail. The higher the intensity of the sample, the more detailed the 16-bit sampling is relative to that intensity. In other words, the 16-bit resolution was chosen conservatively, and is excessively precise for higher intensities. As it turns out, higher intensities can tolerate even more error than lower ones, in a relative sense as well: a higher-intensity sample may tolerate four times as much error as a signal half as intense, rather than the two times you would expect for a linear process. The reason for this has to do with how the ear perceives sound, and is why sound levels are measured in decibels. This is precisely what logarithmic compression does. Convert the intensities to decibels, where a 1 dB change sounds roughly the same at all intensities, and a good half of the 16 bits can be thrown away. Thus, we get a 2:1 compression ratio.
The ITU G.711 standard is the first common codec we will see, and it uses this logarithmic compression. There are two flavors of G.711: μ-law and A-law. μ-law is used in the United States, and bases its compression on a discrete form of taking the logarithm of the incoming signal. First, the signal is reduced to a 14-bit signal, discarding the two least-significant bits. Then, the signal is divided up into ranges, each range having 16 intervals (four bits' worth), with twice the spacing of the next smaller range. Table 1 shows the conversion table.
Table 1: μ-Law Encoding Table

Input Range | Number of Intervals in Range | Spacing of Intervals | Left Four Bits of Compressed Code | Right Four Bits of Compressed Code
8158 to 4063 | 16 | 256 | 0x8 | number of interval
4062 to 2015 | 16 | 128 | 0x9 | number of interval
2014 to 991 | 16 | 64 | 0xa | number of interval
990 to 479 | 16 | 32 | 0xb | number of interval
478 to 223 | 16 | 16 | 0xc | number of interval
222 to 95 | 16 | 8 | 0xd | number of interval
94 to 31 | 16 | 4 | 0xe | number of interval
30 to 1 | 15 | 2 | 0xf | number of interval
0 | 1 | 1 | 0xf | 0xf
-1 | 1 | 1 | 0x7 | 0xf
-31 to -2 | 15 | 2 | 0x7 | number of interval
-95 to -32 | 16 | 4 | 0x6 | number of interval
-223 to -96 | 16 | 8 | 0x5 | number of interval
-479 to -224 | 16 | 16 | 0x4 | number of interval
-991 to -480 | 16 | 32 | 0x3 | number of interval
-2015 to -992 | 16 | 64 | 0x2 | number of interval
-4063 to -2016 | 16 | 128 | 0x1 | number of interval
-8159 to -4064 | 16 | 256 | 0x0 | number of interval
The number of the interval is where the input falls within the range. 90, for example, would map to 0xee: 90 falls in the range of 94 to 31, whose left code is 0xe, and 90 - 31 = 59, which in steps of four is 14.75 intervals, or 0xe rounded down, above the bottom of the range. (Of course, the original 16-bit signal was four times, or two bits, larger, so 360 would have been one such 16-bit input, as would any number between 348 and 363. This range represents the loss of information, as 363 and 348 come out the same.)
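Applying Table 1 is mechanical enough to sketch in a few lines of C. The sketch below covers only the positive half of the table and assumes the input has already been reduced to a 14-bit magnitude (0 to 8158); the function name mulaw_encode_positive is made up for illustration, and this is not a conformant G.711 encoder, which also handles negative samples.

#include <stdio.h>

/* Encode a positive 14-bit magnitude using the positive rows of Table 1. */
static unsigned char mulaw_encode_positive(int magnitude)
{
    /* Lower bound, spacing, and left code for each positive range of Table 1. */
    static const struct { int low; int spacing; int left; } range[] = {
        {4063, 256, 0x8}, {2015, 128, 0x9}, {991, 64, 0xa}, {479, 32, 0xb},
        {223,   16, 0xc}, {95,     8, 0xd}, {31,   4, 0xe}, {1,    2, 0xf},
    };

    if (magnitude > 8158) magnitude = 8158;   /* clip to the table's maximum */
    if (magnitude < 1)    return 0xff;        /* the zero row of the table */

    for (int i = 0; i < 8; i++) {
        if (magnitude >= range[i].low) {
            int interval = (magnitude - range[i].low) / range[i].spacing;
            return (unsigned char)((range[i].left << 4) | interval);
        }
    }
    return 0xff;   /* not reached */
}

int main(void)
{
    printf("90 -> 0x%02x\n", mulaw_encode_positive(90));   /* prints 0xee */
    return 0;
}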
A-law is similar, but uses a slightly different set of spacings, based on an algorithm that is easier to see when the numbers are written out in binary form. The process is simply to take the binary number and encode it by saving only four bits of significant digits (except the leading one), and to record the base-2 (binary) exponent. This is how floating-point numbers are encoded. Let's look at the previous example. The number 360 is encoded in 16-bit binary as
0000 0001 0110 1000
with spaces placed every four digits for readability. A-law only uses the top 13 bits. Thus, as this number is unsigned, it can be represented in floating point as
1.01101 (binary) × 2^5.
The first four significant digits (ignoring the first 1, which must be there for us to write the number in binary scientific notation, or floating point), are "0110", and the exponent is 5. A-law then records the number as
0001 0110
where the first bit is the sign (0), the next three are the exponent, minus four, and the last four are the significant digits.
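The same arithmetic can be sketched in C. This follows the example in the text for a positive 16-bit sample of at least 256, so that the 13-bit value is 32 or more and the exponent rule above applies; the name alaw_style_encode is hypothetical, and a full A-law coder also covers the small-value linear segment, negative samples, and other transmission details.

#include <stdio.h>

/* Sketch of the A-law-style "floating point" step described above, for a
 * positive sample whose 13-bit value is at least 32. */
static unsigned char alaw_style_encode(unsigned short sample16)
{
    unsigned int v = sample16 >> 3;   /* keep only the top 13 bits */
    if (v < 32)
        return 0;                     /* small values use a linear rule, not shown */

    int exponent = 0;
    for (unsigned int t = v; t > 1; t >>= 1)
        exponent++;                   /* position of the leading 1 bit */

    int mantissa = (v >> (exponent - 4)) & 0x0f;   /* four bits after the leading 1 */
    return (unsigned char)(((exponent - 4) << 4) | mantissa);   /* sign bit stays 0 */
}

int main(void)
{
    printf("360 encodes to 0x%02x\n", alaw_style_encode(360));  /* 0x16 = 0001 0110 */
    return 0;
}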
A-law is used in Europe, on their telephone systems. For voice over IP, either will usually work, and most devices speak in both, no matter where they are sold. The distinctions are now mostly historical.
G.711 compression preserves the number of samples, and keeps each sample independent of the others. Therefore, it is easy to figure out how the samples can be packaged into packets or blocks: they can be cut arbitrarily, and a byte is a sample. This allows the codec to be quite flexible for voice mobility, and it should be a preferred option.
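Because a byte is a sample, packetization needs no codec knowledge at all. Here is a minimal sketch under the common (but not mandatory) assumption of 20ms packets, or 160 bytes at 8000 samples per second; g711_next_packet is an illustrative name.

#include <string.h>

#define SAMPLES_PER_PACKET 160   /* 20 ms at 8000 one-byte samples per second */

/* Copies the next packet's worth of G.711 bytes out of the sample stream,
 * starting at offset. Returns how many bytes were taken; only the final
 * packet of a stream can come up short. */
size_t g711_next_packet(const unsigned char *stream, size_t stream_len,
                        size_t offset, unsigned char payload[SAMPLES_PER_PACKET])
{
    if (offset >= stream_len)
        return 0;
    size_t n = stream_len - offset;
    if (n > SAMPLES_PER_PACKET)
        n = SAMPLES_PER_PACKET;
    memcpy(payload, stream + offset, n);
    return n;
}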
Error concealment, or packet loss concealment (PLC), is the means by which a codec can recover from packet loss, by faking the sound at the receiver until the stream catches up. The most trivial concealment technique is to just play silence, which does not really conceal the error. A slightly better technique is to repeat the last valid sample set—usually, a 10ms or 20ms packet's worth—until the stream catches up. The problem is that, should the last sample have had a plosive—any of the consonants that have a stop to them, like a p, d, t, k, and so on—the plosive will be repeated, providing an effect reminiscent of a quickly skipping record player or a 1980s science-fiction television character.[*] For this, G.711 has an extension, known as G.711I, or G.711, Appendix I. Appendix I states that, to avoid this effect, the previous samples should be tested for the fundamental wavelength, and then blocks of those wavelengths should be cross-faded together to produce a more seamless recovery. This is a purely heuristic scheme for error recovery, and competes, to some extent, with just repeating the last segment and then going silent.
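To make the simplest of these strategies concrete, here is a toy sketch of "repeat the last good block, fading out, then go silent." It is not the Appendix I algorithm, which instead matches the fundamental wavelength and cross-fades; the block size and fade rate here are arbitrary choices.

#include <stddef.h>

#define BLOCK 80   /* one 10 ms block at 8000 samples per second */

/* Fill in a lost block from the last good one, attenuating more for each
 * consecutive loss, and giving up to silence after a few blocks. */
void conceal_lost_block(const short last_good[BLOCK], short out[BLOCK],
                        int consecutive_losses)
{
    for (size_t i = 0; i < BLOCK; i++) {
        if (consecutive_losses > 3)
            out[i] = 0;                                          /* play silence */
        else
            out[i] = (short)(last_good[i] >> consecutive_losses); /* repeat, quieter */
    }
}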
In many cases, G.711 is not even mentioned when it is being used. Instead, the codec may be referred to as PCM with μ-law or A-law encoding.


G.729 and Perceptual Compression

ITU G.729 and the related G.729a specify using a more advanced encoding scheme, which does not work sample by sample. Rather, it uses mathematical rules to try to relate neighboring samples together. The incoming sample stream is divided into 10ms blocks (with 5ms from the next block also required), and each block is then analyzed as a unit. G.729 provides a 16:1 compression ratio, as the incoming 80 samples of 16 bits each are brought down to a ten-byte encoded block.
The concept behind G.729 compression is to use perceptual compression to classify the type of signal within the 10ms block, by trying to figure out how neighboring samples relate. Surely, they do relate, because they come from the same voice and the same pitch, and pitch is a concept that requires time (thus, more than one sample). G.729 uses a couple of techniques to try to figure out what the sample must "sound like," so it can then throw away much of the sample and transmit only the description of the sound.
To figure out what the sample block sounds like, G.729 uses Code-Excited Linear Prediction (CELP). The idea is that the encoder and decoder have a codebook of the basics of sounds. Each entry in the codebook can be used to generate some type of sound. G.729 maintains two codebooks: one fixed, and one that adapts with the signal. The model behind CELP is that the human voice is basically created by a simple set of flat vocal cords, which excite the airways. The airways—the mouth, tongue, and so on—are then thought of as signal filters, which have a rather specific, predictable effect on the sound coming up from the throat.
The signal is first brought in and linear prediction is used. Linear prediction tries to relate the samples in the block to the previous samples, and finds the optimal mapping. ("Optimal" does not always mean "good," as there is almost always an optimal way to approximate a function using a fixed number of parameters, even if the approximation is dead wrong.) The excitation provides a representation of the overall type of sound, a hum or a hiss, depending on the word being said. This is usually a simple sound, an "uhhh" or "ahhh." The linear predictor figures out how the humming gets shaped, as a simple filter. What's left over, then, is how the sound started in the first place, the excitation that makes up the more complicated, nuanced part of speech. The linear prediction's effects are removed, and the remaining signal is the residue, which must relate to the excitations. The nuances are looked up in the codebook, which contains some common residues and some others that are adaptive. Together, the information needed for the linear prediction and the codebook matches are packaged into the ten-byte output block, and the encoding is complete. The encoded block contains information on the pitch of the sound, the adaptive and fixed codebook entries that best match the excitation for the block, and the linear prediction match.
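To give a flavor of the linear prediction step, the toy sketch below (not G.729's actual predictor or codebook search) predicts each sample from the two before it using made-up coefficients, and keeps whatever the predictor cannot explain as the residue.

/* Toy second-order linear predictor: a1 and a2 are placeholder coefficients;
 * a real coder solves for the best coefficients for each block. */
void lp_residual(const short *samples, float *residue, int n)
{
    const float a1 = 1.5f, a2 = -0.6f;   /* made-up predictor coefficients */

    for (int i = 0; i < n; i++) {
        float s1 = (i >= 1) ? samples[i - 1] : 0.0f;
        float s2 = (i >= 2) ? samples[i - 2] : 0.0f;
        float predicted = a1 * s1 + a2 * s2;          /* guess from the past two samples */
        residue[i] = (float)samples[i] - predicted;   /* what prediction cannot explain */
    }
}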
On the other side, the decoding process looks up the codebooks for the excitations. These excitations get filtered through the linear predictor. The hope is that the result sounds like human speech. And, often, it does. However, anyone who has used cellphones before is aware that, at times, they can render human speech into a facsimile that sounds quite like the person talking, but is made up of no recognizable syllables. That results from a CELP decoder struggling with a lossy channel, where some of the information is missing, and it is forced to fill in the blanks.
G.729a is an annex to G.729, or a modification, that uses a simpler structure to encode the signal. It is compatible with G.729, and so can be thought of as interchangeable for the purposes of this discussion.

Other Codecs

There are other voice codecs that are beginning to appear in the context of voice mobility. These codecs are not as prevalent as G.711 and G.729 (some are not available in anything more than softphones and open-source implementations), but the subject is worth a paragraph. These newer coders are focused on improving error concealment, having better delay or jitter tolerances, or having a richer sound. One such example is the Internet Low Bitrate Codec (iLBC), which is used in a number of consumer-based peer-to-peer voice applications such as Skype.
Because the overhead of packets on most voice mobility networks is rather high, finding the highest amount of compression should not be the aim when establishing the network. Instead, it is better to find the codecs that are supported by the equipment and provide the highest quality of voice over the expected conditions of the network. For example, G.711 is fine in many conditions, and G.729 might not be necessary.

Bearer Protocols in Detail


The bearer protocols are where the real work in voice gets done. The bearer channel carries the voice, sampled by microphones as digital data, compressed in some manner, and then placed into packets which need to be coordinated as they fly over the networks.
Voice, as you know, starts off as sound waves (Figure 1). These sound waves are picked up by the microphone in the handset, and are then converted into electrical signals, with the voltage of the signal varying with the pressure the sound waves apply to the microphone.
The signal (see Figure 2) is then sampled down into digital, using an analog-to-digital converter. Voice tends to have a frequency around 3000 Hz. Some sounds are higher—music especially needs the higher frequencies—but voice can be represented without significant distortion at the 3000Hz range. Digital sampling works by measuring the voltage of the signal at precise, instantaneous time intervals. Because sound waves are, well, wavy, as are the electrical signals produced by them, the digital sampling must occur at a high enough rate to capture the highest frequency of the voice. As you can see in the figure, the signal has a major oscillation, at what would roughly be said is the pitch of the voice. Finer variations, however, exist, as can be seen on closer inspection, and these variations make up the depth or richness of the voice. Voice for telephone communications is usually limited to 4000 Hz, which is high enough to capture the major pitch and enough of the texture to make the voice sound human, if a bit tinny. Capturing at even higher rates, as is done on compact discs and music recordings, provides an even stronger sense of the original voice.

 
Figure 2: Example Voice Signal, Zoomed in Three Times
Sampling audio so that frequencies up to 4000 Hz can be preserved requires sampling the signal at twice that speed, or 8000 times a second. This is according to the Nyquist Sampling Theorem. The intuition behind this is fairly obvious. Sampling at regular intervals means recording only the value of the signal at those given instants. The worst case for sampling would be if one sampled a 4000 Hz, say, sine wave at 4000 times a second. That would be guaranteed to produce a flat sample, as the top pair of graphs in Figure 3 shows. This is a severe case of undersampling, leading to aliasing effects. On the other hand, a more likely signal, with a more likely sampling rate, is shown in the bottom pair of graphs in the same figure. Here, the overall form of the signal, including its fundamental frequency, is preserved, but most of the higher-frequency texture is lost. The sampled signal would have the right pitch, but would sound off.

 
Figure 3: Sampling and Aliasing
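The aliasing argument can be checked numerically. The sketch below prints eight samples of a 4000 Hz tone taken only 4000 times a second (every sample lands at the same point of the wave, so the output is flat) next to eight samples of a 1000 Hz tone taken 8000 times a second, which trace the wave. The 1000 Hz tone and the 0.7-radian phase offset are arbitrary choices for the demonstration.

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void)
{
    for (int i = 0; i < 8; i++) {
        /* 4000 Hz tone, 4000 samples per second: one full cycle between samples */
        double undersampled = sin(2 * M_PI * 4000.0 * i / 4000.0 + 0.7);
        /* 1000 Hz tone, 8000 samples per second: eight samples per cycle */
        double wellsampled = sin(2 * M_PI * 1000.0 * i / 8000.0 + 0.7);
        printf("sample %d: undersampled %+.3f   well sampled %+.3f\n",
               i, undersampled, wellsampled);
    }
    return 0;
}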
The other aspect to the digital sampling, besides the 8000 samples-per-second rate, is the amount of detail captured vertically, in the intensity. The question becomes how many bits of information should be used to represent the intensity of each sample. In the quantization process, the infinitely variable, continuous scale of intensities is reduced to a discrete, quantized scale of digital values. Up to a constant factor, corresponding to the maximum intensity that can be represented, the common choice for voice is to quantize to 16 bits, for a number between -2^15 = -32,768 and 2^15 - 1 = 32,767.

The overall result is a digital stream of 16-bit values, and the process is called pulse code modulation (PCM), a term originating in other methods of encoding audio that are no longer used.
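Put together, the sample-and-quantize step is small enough to sketch. The function read_microphone_voltage below is a hypothetical stand-in for the analog-to-digital converter's input, returning a continuous value between -1.0 and +1.0; everything else follows the 8000 samples-per-second, 16-bit description above.

#include <stdint.h>

extern double read_microphone_voltage(double t);   /* hypothetical analog input, -1.0..+1.0 */

void sample_pcm(int16_t *pcm, int num_samples)
{
    for (int i = 0; i < num_samples; i++) {
        double t = i / 8000.0;                  /* sample instants, 8000 per second */
        double v = read_microphone_voltage(t);  /* continuous intensity at that instant */

        if (v > 1.0)  v = 1.0;                  /* clip to the representable range */
        if (v < -1.0) v = -1.0;

        pcm[i] = (int16_t)(v * 32767.0);        /* quantize to one of 65,536 levels */
    }
}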

SS7


Signaling System #7 (SS7) is the protocol that makes the public telephone networks operate, within themselves and across boundaries. Unlike Q.931, which is designed for simplicity, SS7 is a complete, Internet-like architecture and set of protocols, designed to allow call signaling and control to flow across a small, shared set of circuits dedicated for signaling, freeing up the rest of the circuits for real phone calls.
SS7 is an old protocol, from around 1980, and is, in fact, the seventh version of the protocol. The entire goal of the architecture was to free up lines for phone calls by removing the signaling from the bearer channel. This is the origin of the split signaling and bearer distinction. Before digital signaling, phone lines between networks were similar to phone lines into the home. One side would pick up the line, present a series of digits as tones, and then wait for the other side to route the call and present tones for success, or a busy network. The problem with this method of in-band signaling was that it required having the line held just for signaling, even for calls that could never go through. To free up the waste from the in-band signaling, the networks divided up the circuits into a large pool of voice-only bearer lines, and a smaller number of signaling-only lines. SS7 runs over the signaling lines.
It would be inappropriate here to go into any significant detail into SS7, as it is not seen as a part of voice mobility networks. However, it is useful to understand a bit of the architecture behind it.
SS7 is a packet-based network, structured rather like the Internet (or vice versa). The phone call first enters the network at the telephone exchange, starting at the Service Switching Point (SSP). This switching point takes the dialed digits and looks for where, in the network, the path to the other phone ought to be. It does this by sending requests, over the signaling network, to the Service Control Point (SCP). The SCP has the mapping of user-understandable telephone numbers to addresses on the SS7 network, known as point codes. The SCP responds to the SSP with the path the call ought to take. At this point, the switch (SSP) seeks out the destination switch (SSP), and establishes the call. All the while, routers called Signal Transfer Points (STPs) connect physical links of the network and route the SS7 messages between SSPs and SCPs.
The interesting part of this is that the SCP has this mapping of phone numbers to real, physical addresses. This means that phone numbers are abstract entities, like email addresses or domain names, and not like IP addresses or other numbers that are pinned down to some location. Of course, we already know the benefit of this, as anyone who has ever changed cellular carriers and kept their phone number has used this ability for that mapping to be changed. The mapping can also be regional, as toll-free 800 numbers take advantage of that mapping as well.

ISDN and Q.931


The ISDN protocol is where telephone calls to the outside world get started. ISDN is the digital telephone line standard, and is what the phone company provides to organizations that ask for digital lines. By itself, ISDN is not exactly a voice mobility protocol, but because a great number of voice calls from voice mobility devices must go over the public telephone network at some point, ISDN is important to understand.
With ISDN, however, we leave the world of packet-based voice, and look at tightly timed serial lines, divided into digital circuits. These circuits extend from the local public exchange (where analog phone lines sprout from before they run to the houses) over the same types of copper wires as for analog phones. The typical ISDN line that an enterprise uses carries the designation T1, referring to a digital line with 24 voice circuits multiplexed onto it, for 1536kbps. The concept of the T1 (also known, somewhat more correctly, as a DS1, with each of the 24 digital circuits known as DS0s) is rather simple. The T1 line acts as a constant source or sink for these 1536kbps, divided up into the 24 channels of 64kbps each. With a few extra bits for overhead, to make sure both sides agree on which channel is which, the T1 simply goes in round-robin order, dedicating an eight-bit chunk (the actual byte) for the first circuit (channel), then the second, and so on. The vast majority of traffic is bearer traffic, encoded as standard 64kbps audio. The 23 channels dedicated for bearer traffic are called B channels.
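The round-robin idea can be sketched directly: 8000 times a second, the line carries one byte from each of the 24 channels in turn, which is where the arithmetic of 24 channels × 8 bits × 8000 frames per second = 1,536,000 bits per second comes from. Framing overhead is left out of this sketch, and t1_build_frame is an illustrative name.

#include <stdint.h>

#define T1_CHANNELS 24

/* channel_bytes[ch] holds the next byte of channel ch's 64kbps stream;
 * frame_out receives the 24 bytes in the order they go onto the wire.
 * Called 8000 times a second; the framing bit is omitted here. */
void t1_build_frame(const uint8_t channel_bytes[T1_CHANNELS],
                    uint8_t frame_out[T1_CHANNELS])
{
    for (int ch = 0; ch < T1_CHANNELS; ch++)
        frame_out[ch] = channel_bytes[ch];   /* one byte per channel, round robin */
}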
As for signaling, an ISDN line that is running a signaling protocol uses the 24th channel, called the D channel. This runs as a 64kbps network link, and standards define how this continuous serial line is broken up into messages. The signaling that goes over this channel usually falls into the ITU Q.931 protocol.
Q.931's job is to coordinate the setting up and tearing down of the independent bearer channels. To do this, Q.931 uses a particular structure for its messages. Because Q.931 can run over any number of different protocols besides ISDN, with H.323 being the other major one, the descriptions provided here will steer clear of describing how the Q.931 messages are packaged.
Table 1 shows the basic format of the Q.931 message. The protocol discriminator is always the number 8. The call reference identifies the call to which the message applies, and is determined by the endpoints. The information elements contain the message body, stored in an extensible yet compact format.
Table 1: Q.931 Basic Format

Protocol Discriminator | Length of Call Reference | Call Reference | Message Type | Information Elements
1 byte | 1 byte | 1-15 bytes | 1 byte | variable
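The layout in Table 1 can be walked with a few lines of C. The sketch below is illustrative only (the struct and function names are made up, and a real parser would go on to validate and decode each information element), but it shows how little fixed structure a Q.931 message has.

#include <stddef.h>
#include <stdint.h>

struct q931_header {
    uint8_t protocol_discriminator;   /* always 8 for Q.931 */
    uint8_t call_ref_len;             /* how many call reference bytes follow */
    const uint8_t *call_ref;          /* points into the message */
    uint8_t message_type;             /* SETUP, ALERTING, CONNECT, ... */
    const uint8_t *info_elements;     /* the rest of the message */
    size_t info_len;
};

/* Returns 0 on success, -1 if the message is too short for its own header. */
int q931_parse_header(const uint8_t *msg, size_t len, struct q931_header *hdr)
{
    if (len < 3)
        return -1;
    hdr->protocol_discriminator = msg[0];
    hdr->call_ref_len = msg[1] & 0x0f;
    if (len < (size_t)3 + hdr->call_ref_len)
        return -1;
    hdr->call_ref = &msg[2];
    hdr->message_type = msg[2 + hdr->call_ref_len];
    hdr->info_elements = &msg[3 + hdr->call_ref_len];
    hdr->info_len = len - 3 - hdr->call_ref_len;
    return 0;
}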
The message type encompasses the activities of the protocol itself. To get a better sense for Q.931, the message types and meanings are:
  • SETUP: this message starts the call. Included in the setup message is the dialed number, the number of the caller, and the type of bearer to use.
  • CALL PROCEEDING: this message is returned by the other side, to inform the caller that the call is underway, and specifies which specific bearer channel can be used.
  • ALERTING: informs the caller that the other party is ringing.
  • CONNECT: the call has been answered, and the bearer channel is in use.
  • DISCONNECT: the phone call is hanging up.
  • RELEASE: releases the phone call and frees up the bearer.
  • RELEASE COMPLETE: acknowledges the release.
There are a few more messages, but it is easy to see that Q.931 might be the simplest protocol we have seen yet! There is a good reason for this: the public telephone system is remarkably uniform and homogeneous. There is no reason for flexible or complicated protocols, when the only action underway is to inform one side or the other of a call coming in, or to choose which companion bearer lines need to be used. Because Q.931 is designed from the point of view of the subscriber, network management issues do not need to be addressed by the protocol. In any event, a T1 line is limited to only 64kbps for the entire call signaling protocol, and that needs to be shared across the other 23 channels.
Digital PBXs use ISDN lines with Q.931 to communicate with each other and with the public telephone networks. IP PBXs, with IP links, will use one of the packet-based signaling protocols mentioned earlier.

Polycom SpectraLink Voice Priority (SVP)


Early in the days of voice over Wi-Fi, a company called SpectraLink—now owned by Polycom—created a Wi-Fi handset, gateway, and a protocol between them to allow the phones to have good voice quality, when Wi-Fi itself did not yet have Wi-Fi Multimedia (WMM) quality of service. SVP runs as a self-contained protocol, for both signaling and bearer traffic, over IP, using a proprietary IP type (neither UDP nor TCP) for all of the traffic.
SVP is not intended to be an end-to-end signaling protocol. Rather, like Cisco's SCCP, it is intended to bridge between a network server that speaks the real telephone protocol and the proprietary telephone. Therefore, SCCP and SVP have a roughly similar architecture. The major difference is that SVP was designed with wireless in mind to tackle the early quality-of-service issues over Wi-Fi, whereas SCCP was designed mostly as a way of simplifying the operation of phone terminals over wireline IP networks.
Figure 1 shows the SVP architecture. The SVP system integrates into a standard IP PBX deployment. The SVP gateway acts as the location for the extensions, as far as the PBX is concerned. The gateway also acts as the coordinator for all of the wireless phones. SVP phones connect with the gateway, where they are provisioned. The job of the SVP gateway is to perform all of the wireless voice resource management of the network. The gateway performs admission control for the phones, being configured with the maximum number of phones per access point and denying phones the ability to connect through access points that are oversubscribed. The gateway also performs timeslice coordination for each phone on a given access point.

 
Figure 1: SVP Architecture
This timeslicing function makes sense in the context of how SVP phones operate. SVP phones have proprietary Wi-Fi radios, and the protocol between the SVP gateway and the phone knows about Wi-Fi. Every phone reports back what access point it is associated to. When the phone is placed into a call, the SVP gateway and the phone connect their bearer channels. The timing of the packets sent by the phone is directly related to the timing of the packets sent by the gateway. Both the phone and the gateway have specific requirements on how the packets end up over the air. This, then, requires that the access points also be modified to be compatible with SVP. The role of the access point is to dutifully follow a few rules which are a part of the SVP protocol, to ensure that the packets access the air at high priority and are not reordered. There are additional requirements for how the access point must behave when a voice packet is lost and must be retransmitted by the access point. By following the rules, the access point allows the client to predict how traffic will perform, and thus ensures the quality of the voice.
SVP is a unique protocol and system, in that it is designed specifically for Wi-Fi, and in such a way that it tries to drive the quality of service of the entire SVP system on that network through intelligence placed in a separate, nonwireless gateway. SVP, and Polycom SpectraLink phones, are Wi-Fi-only devices that are common in hospitals and manufacturing, where there is a heavy mobile call load inside the building but essentially no roaming required to outside.

Skype


Skype is mentioned here because it is such an intriguing application. Famous for its resiliency when running over the Internet, or any other non-quality-of-service network, as well as for its chat feature and low-cost calls, questions will always come up about Skype. Undoubtedly, Skype has helped many organizations reduce long distance or international phone bills, and many business travelers have favored it when on the road and in a hotel, to avoid room and cell charges for telephone use.
Skype is a completely proprietary peer-to-peer protocol, encrypted hop-by-hop to prevent unauthorized snooping. There are plenty of resources available on how to use Skype, so it will be appropriate for us to stick with just the basics on how it applies for voice mobility.
The most important issue with Skype is that it is not manageable in an enterprise sense. Not only is it a service hosted outside the using enterprise, but the technology itself is encrypted to prevent even basic understanding or diagnosis. Furthermore, it cannot be run independent of Internet connectivity, and it is designed to find ways around firewalls. As a primarily consumer-oriented technology, Skype does not yet have the features necessary for enterprise deployments, and thus is severely limited in a sense useful for large-scale voice mobility.
Another main issue with Skype is that it does not take advantage of quality-of-service protocols to provide reliable or predictable, or even prioritized, voice quality. Traffic engineering with Skype is incredibly difficult, especially if one tries to predict how Skype will consume resources if large portions of the networked population choose to use it, inside or outside the office.
On the other hand, Skype comes with better, high-bitrate codecs that make voice sound much less tinny than the typical low-bitrate codecs used by telephones that may have to access the public switched telephone network (PSTN). Skype's ability to free itself from PSTN integration as the standard case (Skype's landline telephone services can be thought of more as special cases) has allowed it to be optimized for better voice quality in a lossy environment.
Skype is unlikely to be useful in current voice mobility deployments, so it will not be mentioned much further in this book. However, Skype will always be found performing somewhere within the enterprise, and so its usage should be understood. As time progresses, people may work out a fuller understanding of how to deploy Skype in the enterprise.

Cisco SCCP: "Skinny"


Cisco has a proprietary Skinny Client Control Protocol (SCCP), which is used by the Cisco Unified Communications Manager and Cisco phones as their own signaling protocol. SCCP requires the Cisco Unified Communications Manager or open-source PBXs to operate. Given the downside of proprietary protocols, the main reason for discussing SCCP within the context of voice mobility is only that Cisco's Wi-Fi-only handsets support SCCP, and so SCCP may be seen in some voice mobility networks. Unfortunately, SCCP internal documentation is not widely available or as well understood as an open protocol is, and so enterprise-grade implementations tend to lock the user into one vendor only.
SCCP runs on TCP, using port 2000. The design goal of SCCP was to keep it "skinny," to allow the phone to have as little intelligence as needed. In this sense, the Cisco Unified Communications Manager (or older Cisco Call Manager) is designed to interface with other telephone technologies as a proxy, leaving the phone to deal with supporting the one proprietary protocol.
SCCP has a markedly different architecture from what we have seen already. SCCP is still an IP-based protocol, and there is the one point of contact that the phone uses for all of its signaling. However, the signaling design of SCCP has the remarkable property, unlike with SIP or H.323, that the phone is not self-contained as an extension. Rather, SCCP is entirely user-event based. The phone's job is to report back to the call manager, in real time, whenever a button is pressed. The call manager then pushes down to the phone any change in state that should accompany the button press. In this way, the entire logic as to what buttons mean is contained in the call manager, which locally runs the various telephone endpoint logic. In this sense, SCCP has more in common with Remote Desktop than it has with telephone signaling protocols: the phone's logic really runs in some centralized terminal server, which is called the call manager. To emphasize this point, Table 1 lists a typical sequence of events between a phone and a call manager, from when the phone is taken off the hook.
Table 1: Example SCCP Call Setup Event Flow

# | Direction | Event Name | State | Meaning
1 | Phone → Call Manager | Offhook | Dialing | User has taken the phone off the hook.
2 | Call Manager → Phone | StationOutputDisplayText | | Displays a prompt that the phone is off hook and waiting for digits.
3 | Call Manager → Phone | SetRinger | | Turns off the ringer.
4 | Call Manager → Phone | SetLamp | | Turns on the light for the line that is being used.
5 | Call Manager → Phone | CallState | | Sets the phone up so that the user can hear audio and press buttons.
6 | Call Manager → Phone | DisplayPromptStatus | | The phone is not connected to any other extension yet.
7 | Call Manager → Phone | SelectSoftKeys | |
8 | Call Manager → Phone | ActivateCallPlane | |
9 | Call Manager → Phone | StartTone | | Starts a dial tone.
10 | Phone → Call Manager | KeypadButton (dialed 7) | | The user dialed the number 7.
11 | Call Manager → Phone | StopTone | | Stops the dial tone, acknowledging that a digit has been dialed.
12 | Call Manager → Phone | SelectSoftKeys | | Changes the keys of interest to just the number pad (no redial buttons, etc.).
13 | Phone → Call Manager | KeypadButton (dialed 0) | | The user dialed the number 0.
14 | Phone → Call Manager | KeypadButton (dialed 2) | | The user dialed the number 2.
15 | Phone → Call Manager | KeypadButton (dialed 0) | | The user dialed the number 0.
16 | Call Manager → Phone | SelectSoftKeys | Ringing | Changes the keys of interest.
17 | Call Manager → Phone | CallState | | Changes the state of the phone.
18 | Call Manager → Phone | CallInfo | |
19 | Call Manager → Phone | DialedNumber | | Reports that 7020 has been dialed.
20 | Call Manager → Phone | StartTone | | Starts playing a ringback tone.
21 | Call Manager → Phone | DisplayPromptStatus | | Changes the prompt to show that the other side of the phone is ringing.
22 | Call Manager → Phone | CallInfo | | The call is still ringing.
23 | Call Manager → Phone | StopTone | Connected | Stops playing the ringback tone.
24 | Call Manager → Phone | DisplayPromptStatus | | Displays that the phone call was answered.
25 | Call Manager → Phone | OpenReceiveChannel | | Prepares for the downward leg of the call.
26 | Phone → Call Manager | OpenReceiveChannelAck | | Acknowledges the downward leg.
27 | Call Manager → Phone | StartMediaTransmission | | The call's bearer channel starts flowing.
28 | Phone → Call Manager | OnHook | Hanging Up | The caller hung up.
29 | Call Manager → Phone | CloseReceiveChannel | | Tears down the receive leg.
30 | Call Manager → Phone | StopMediaTransmission | | Stops the bearer channel entirely.
31 | Call Manager → Phone | SetSpeakerMode | | Restores the phone to the original state.
32 | Call Manager → Phone | ClearPromptStatus | |
33 | Call Manager → Phone | CallState | |
34 | Call Manager → Phone | DisplayPromptStatus | |
35 | Call Manager → Phone | ActivateCallPlane | |
36 | Call Manager → Phone | SetLamp | | Turns off the light for the line that was in use.
As you can see, the phone's entire personality (the meaning of the buttons, what the display shows, which lights are lit, the tones generated) is entirely controlled by the call manager.
Overall, this is a marked difference from true telephone signaling protocols. In this sense, then, one can consider SCCP to be mostly a remote control protocol for phones, and the call manager is thus left with the burden of implementing the true telephone protocol. Unfortunately, however, when SCCP is used with a packet-based voice mobility network, the protocol going over the wireless or edge network is going to be SCCP, and not whatever protocol the call manager is enabled with.
Bearer traffic, on the other hand, still uses RTP, as do the other protocols we have looked at so far. Therefore, most of the discussion on bearer traffic, and on voice traffic in general, holds for SCCP networks.