The IP-Based Voice Network


Voice started out on analog phone lines. Each pair of copper wires was dedicated to one specific phone, and to nothing else. This notion of a dedicated circuit has its advantages. It provides complete isolation of whatever might be going on with that line from the circumstances and problems of other phones in the network. No amount of calls being placed on a neighbor's line can make the original line itself become busy. This isolation and invariance is necessary for voice networks to function when unexpected circumstances occur, and ensures that the voice network is reliable in the face of massive fluctuations in the system. Provisioning is simple, as well, with one line per phone at the edge.
The problem with the concept of the dedicated line is that it is extremely wasteful. When the phone is not in use, the line stays empty. No other calls can be placed on that line. Even when a call is in place, the copper wire is fully occupied with carrying the voice traffic, a small bandwidth application, and a tremendous amount of excess signal capacity exists. Dedicated wires might make sense for short distances between the phone and some next-level aggregation equipment, but these dedicated lines were used as trunks between the aggregators, causing tremendous waste from both idleness and lost bandwidth. But probably the property that caused the most complications with wireline networking was that the dedicated line is not robust. If network problems occur (the bundle of cables is cut, or some intermediate equipment fails and can't do its job), all lines that are attached along that path are brought down with it.
Digital telephone networks started to eliminate some of the problems inherent to the one-line dedication of early circuit switching. By having digital processes encode and carry the voice, more voice calls could be multiplexed onto each line, better using the bandwidth available on the copper wire. Furthermore, by allowing for hop-by-hop switching with smarter switches between trunks, failures along one trunk could be accommodated. However, the network was still circuit-switched. A voice line could be used only for voice. Even where voice circuits were set aside for data links, the link was either fully in use or not in use at all. The granularity of the 64kbps audio line, the DS0, became a burden. Running applications that are not always on and have massive peak throughput but equally meek average throughput requirements meant that provisioning was always an expensive proposition: either dedicate enough lines to cover the peak requirement case, and pay for all of the unused capacity, or cap the capacity offered to the application.
Furthermore, these circuits needed to be considered, managed, and monitored rather separately. The hard divisions between two circuits became a hard division between applications. Voice networks were famous for their reliability, their strict clockwork operation, and their complexity. They were not for easy-to-set-up, easy-to-move operations. The wires are drawn once and carefully, and the switches and intermediate equipment are set up by a team of dedicated and expensive experts who do nothing but voice all day. If you were serious about voice, you operated your own little phone company, complete with dedicated operators. If not, your only option was to have the phone company run your phone network for you.
Along came packet-switched networks. Sending small, self-contained messages between arbitrary endpoints on a network inherently made sense for computers. The idea of sending a message quickly, without tying up lines or going through cumbersome setup and teardown operations, removed the restrictions on wasted lines. Although it was still true that lines could remain idle when not being used, the notion of allowing these packets of information into the line as the fundamental concept, rather than requiring continuous occupation and streaming, meant that lines that carried aggregated traffic from multiple users and multiple messages could be used more efficiently. If the messages were short enough, one line might do. There were no concerns about running out of lines and having the needed, or only, path to the receiver blocked. Instead, these messages could just be queued until space was available.
Along with this whole new way of thinking about occupying the resources came a different way of thinking about addressing and connecting the resources. In the early days, a phone number used to encode the exact topological location of the extension. Each exchange, or switch with switchboard operator, had a name and number, and calls were routed from exchange to exchange based on that number first. Changes to the structure or layout of the telephone system would require changes to the numbers. Packet-switching technologies changed that. Lines themselves lost their names and numbers. Instead, those names and numbers were moved to the equipment that glued the lines together. Every device itself now had the address. The binding of the addresses to the topology of the network remained, at some level. Devices could not be given any arbitrary address. Rather, they needed to have addresses that were similar to their neighbors. The notion of exchange-to-exchange routing was retained.
This notion, though, proved to be a burden. Changes to the network were quite possible, as either more devices needed addresses, or more new "exchanges" were added to the network. Either way, the problem of figuring out how to route messages through the network remained. The original design had each router know which lines needed to be used to send the messages along their way. The router might not know how the message should get to the final destination, but it always knew the next step, and could direct traffic along the right roads to the next intersection, where the next router took over. As the number of intersections increased, and the number of devices expanded, the complexity of maintaining these routing tables exploded. A way was needed for neighboring routers to find out about each other, and more importantly, to find out about what end devices they knew routes to. Thus, the routing protocol was born. These protocols spoke from router to router, exchanging information on a regular basis, ensuring that routers always had recent information on what destinations were valid and how to get there from here. But another thing happened. This idea of exchanging the routes had another benefit, in that it allowed the network itself to be restructured, or to fail in spots, and yet still be able to send traffic. Routers did not need to know the entire path to the destination, only the next hop. If a router knew two different next hops for the same message, and one of the routes went down, the router could try the second one. If the router lost all of its paths to a particular set of destinations, the router before it could learn about that, and avoid using that path to get the messages through. If there was a way to get the message there, the network would find it, through the process of convergence, or agreement over time on the consistency of whether and how messages could be sent. The network became resilient, and point failures would not stop traffic from flowing.
This is the story of the Internet, and of all the protocols that make it work. Clearly, the story is simplified (and perhaps romanticized to highlight the point at hand), but the fundamentals are there. Circuit switching is difficult to manage, because it is incredibly wasteful and inflexible. Packet switching is much simpler to manage, and can recover from failures.
The Internet grew up on top of the lines offered by the circuit-switched technologies, but used a better way to dedicate the resources. It wasn't long before someone realized that voice itself could be put over these packet-switched lines. At first, that might sound wasteful, as using a digital line to carry a packet containing voice can never be more efficient than using that line to carry the same bits of voice directly because of the packet overhead. But packet networking technologies matured, and the throughputs offered on simple point-to-point links grew much faster than did the corresponding uses of the same copper line for digital voice, at least in the enterprise. And the advantages of using a multipurpose technology allowed these voice over IP pioneers to use the network's flexibility and lack of dedication to one purpose to add to the voice over IP offerings quickly, without requiring retooling of physical wires. The ways in which provisioning was thought about changed, and the idea that voice and data networks can perhaps use the same resources became a compelling reason to try to save deployment and management costs.
There are a tremendous number of resources available for understanding the intricacies of how IP networks operate, including details on how to manage routing protocols and large trunk lines. Here, we will explore how voice fits into the packet-based IP network.

How to Measure Voice Quality Yourself


The Expensive, Accurate Approach: End-to-End Voice Quality Testers

As mentioned in the discussion of PESQ, existing tools can measure the quality of the voice network by directly pumping in prerecorded voice samples and comparing the output. These tools are either expensive or home-grown, and are used to test large networks as a part of a planning or predeployment phase.
This sort of testing is more of a tuning exercise and, much like piano tuning, which is a rare and complicated enough exercise that it is not performed frequently, direct end-to-end testing is not diagnostic. Telephone equipment testing companies do make the sort of equipment to perform this end-to-end inspection, and these tools can be rented. Unfortunately, it is very difficult to know where to invest in this sort of heavily proactive effort.
More likely, the voice quality is measured by having administrators walk around the network with some number of phones in question, ensuring themselves that whatever problems they may face will likely be manageable. The problem with both forms of proactive testing is that they normally occur on only lightly loaded networks, and thus are not able to measure the effect of network load on voice quality. Network load is generally the largest impact on voice quality, in fact, partly because voice mobility network managers do a good job of testing their networks before they launch them for basic problems, which they quickly correct, and partly because voice mobility networks are more likely to be robust enough out of the box for basic voice connectivity.

Network Specific: Packet Capture Tests

Most of the major packet capture tools, for wireline and for wireless, make modules that are able to indirectly infer the MOS values using E-model calculations. Sometimes, these work by tracing the voice setup protocols, such as SIP, and determining what RTP flows map to phone calls and the properties of the phone calls. Other times, these tools will just look directly at the RTP streams, and not try to find out what phone numbers the streams map to. In both cases, the tools then use the sequence number and timestamp fields in the RTP stream to determine values such as loss, delay, and jitter. Using assumed values for the jitter buffer, with the option of having the user overwrite them, the tools then model the expected effect and produce a score.
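As a rough illustration of how a capture tool might derive loss and jitter from the RTP sequence number and timestamp fields, the following Python sketch counts gaps in the sequence numbers and applies the running interarrival jitter estimate described in RFC 3550. The RtpObservation type and the stream_stats function are hypothetical names chosen for this example, not part of any particular tool, and the sketch assumes an 8000 Hz clock and ignores sequence number wraparound.

```python
from dataclasses import dataclass


@dataclass
class RtpObservation:
    seq: int          # RTP sequence number as captured
    rtp_ts: int       # RTP timestamp, in clock-rate units
    arrival_s: float  # capture time, in seconds


def stream_stats(packets, clock_rate=8000):
    """Return (loss_fraction, jitter_ms) for packets listed in arrival order.

    Assumptions: no sequence number wraparound, and duplicates are
    counted once. A real tool has to handle both cases.
    """
    if not packets:
        return 0.0, 0.0

    # Loss: compare the distinct sequence numbers seen with the span expected.
    seqs = [p.seq for p in packets]
    expected = max(seqs) - min(seqs) + 1
    loss_fraction = 1.0 - len(set(seqs)) / expected

    # Jitter: RFC 3550 running estimate, J += (|D| - J) / 16, where D is the
    # change in relative transit time between consecutive packets.
    jitter_s = 0.0
    prev_transit = None
    for p in packets:
        transit = p.arrival_s - p.rtp_ts / clock_rate
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter_s += (d - jitter_s) / 16.0
        prev_transit = transit

    return loss_fraction, jitter_s * 1000.0
```

The transit time here is only relative, because the capture clock and the sender's clock are not synchronized; that is sufficient for jitter, but not for absolute delay.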
The major issue with these tools is that they show quality only up to the point where they are inserted. An easy example of the problem is to look at wireless networks. On a Wi-Fi network, a packet capture tool may be able to directly determine what packets it sees and come up with a score. By looking at the Wi-Fi protocol, the tool may do a good job of inferring whether the mobile phone received the packet from the access point, and at what time, and may produce a reasonably close call quality number. On the other hand, the upstream flow is likely to look quite good from the point of view of the test tool, because there is only one network in between the client and the tool. The entirety of the network upstream from the client goes missing, and the upstream MOS value can be entirely misleading.
Some network infrastructure devices are able to do these inferences within themselves, as they pass the data through. This may be a reasonable thing to do, again depending on the point of insertion and how well they are able to capture information as late into the network as possible. It is important, when using all of these tools, for you to consult with the vendor or maker of the tools to find out where the tools are measuring. For a wireless controller with voice metric capabilities, for example, make sure that the downstream metrics are measured on the access point, based on what happened over the air, and not just passing through the controller. For wireless overlay monitoring, make sure that there is an option to do a similar capture using a wired mirror port on one of the switches, for cases in which voice quality might begin to suffer and the network needs direct attention. Overall, do not rely on just one tool, and believe what the users say, no matter what the tool tells you.


The Device Itself

The most accurate and reasonable way to measure voice quality is from the endpoints themselves. Some handsets and PBXs can produce the one-way MOS value or R-value for the receive side at the device itself. These numbers are based entirely on E-model calculations, assuming best-case or known-default scenarios for the rest of the system, but are likely to be the most accurate. Of course, it is difficult to ask a user to determine what the voice quality is of a call while on it, especially given that voice quality is not something a user wants to measure. However, for diagnosing locations that are having troubles, this tool is valuable for the administrator herself, who is able to avoid having to guess as to whether the call sounds reasonable, and may be able to detect variations in the MOS value or R-value.
In the end, keep in mind that the absolute values produced by any of the methods deserve being taken with a grain of salt. As time goes on, the administrator of a voice mobility network should be able to learn what the real quality means for any given value the tool suggests, even when the tool is placing results a half a MOS point too high or too low. However, the variation of the scores, especially when the network has changed, can be a valuable tool for pointing the way toward the solution.

Jitter | What Makes Voice over IP Quality Suffer


Jitter is the variation in delays that the receiver experiences. Jitter is a nuisance that the user does not hear directly, because the phones employ a jitter buffer to correct for any delays. Jitter can be defined in a number of ways. One way is to use the standard deviation or maximum deviation around the mean delay per packet. Another way is to use the known arrival intervals (such as 20ms), and subtract consecutive delays of packets that were not lost from the known arrival time, then take the standard deviation or the maximum deviation. Either way, the jitter, measured in times or percentages against the mean, tells how variable the network is.
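The second definition can be sketched in a few lines of Python. This is one reading of it, under the assumption that the gap between consecutive received packets is compared against the nominal 20ms interval; the function name jitter_from_arrivals is made up for this example.

```python
import statistics


def jitter_from_arrivals(arrival_ms, interval_ms=20.0):
    """Return (standard deviation, maximum deviation) in milliseconds.

    arrival_ms lists the arrival times of consecutive packets that were
    not lost. Each inter-arrival gap is compared to the nominal interval.
    """
    gaps = [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]
    deviations = [gap - interval_ms for gap in gaps]
    std_dev = statistics.pstdev(deviations)
    max_dev = max(abs(d) for d in deviations)
    return std_dev, max_dev


# One packet arrives 5ms late, so the gap before it is 25ms and the gap
# after it is 15ms.
std_dev, max_dev = jitter_from_arrivals([0.0, 20.0, 45.0, 60.0, 80.0])
```

With these sample arrivals, the maximum deviation is 5ms and the standard deviation is about 3.5ms.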
Jitter is introduced by variable queuing delays within network equipment. Phones and PBXs are well known for having very regular transmission intervals. However, the intervening network may have variable traffic. As the queue depths change and the network loads fluctuate, and as contention-based media such as Wi-Fi links clog with density, packets are forced to wait. Wireless networks are the biggest culprit for introducing delay into an enterprise private network. This is because wireless packets can be lost and retransmitted, and the time it takes to retransmit a packet can usually be measured in units of a millisecond.
A jitter buffer's job is to sit on the receiver and prevent the jitter from causing an underrun of the voice decoder. An underrun is an awkward period of silence that happens when the phone has finished playing the previous packet and needs another packet to play, but one has not yet arrived. These underruns count as a form of error or loss, even if every packet does make it to the receiver, and loss concealment will work to disguise them. The problem with jitter becomes that an underrun must be followed by an increase in delay of the same amount, assuming no packets are lost. This can be seen by realizing that the delayed packet will hold up the line for packets behind it.
Here, the value of the jitter buffer can be seen. The jitter buffer lets the receiver build up a slight delay in the output. If this delay is greater than the amount of actual jitter on the network, the jitter buffer will be able to smooth things out without underrunning.
In this sense, the jitter buffer converts jitter directly into delay. If the jitter becomes too large, the jitter buffer may have limited room, and start dropping earlier samples in the buffer to let the call catch up to be closer to real time. In this way, the jitter buffer can convert the jitter directly into loss.
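The tradeoff can be seen with a small simulation. The sketch below models the simplest case only, a fixed jitter buffer with a rigid playout schedule, which is an assumption made for illustration; real phones often adapt the buffer depth. The function name simulate_fixed_jitter_buffer is hypothetical.

```python
def simulate_fixed_jitter_buffer(delays_ms, buffer_ms, interval_ms=20.0):
    """Count packets that arrive too late to be played.

    delays_ms[i] is the network delay of packet i, which was sent at time
    i * interval_ms. Playout starts buffer_ms after the first packet
    arrives, and packet i must be present by its playout deadline.
    """
    arrivals = [i * interval_ms + delay for i, delay in enumerate(delays_ms)]
    playout_start = arrivals[0] + buffer_ms
    late = 0
    for i, arrival in enumerate(arrivals):
        deadline = playout_start + i * interval_ms
        if arrival > deadline:
            late += 1  # underrun: the decoder needed this packet already
    return late


delays = [10, 10, 40, 10, 25]  # the third packet is held up by 30ms
late_without_buffer = simulate_fixed_jitter_buffer(delays, buffer_ms=0)
late_with_20ms = simulate_fixed_jitter_buffer(delays, buffer_ms=20)
late_with_40ms = simulate_fixed_jitter_buffer(delays, buffer_ms=40)
```

In this example, a buffer deeper than the 30ms of delay variation plays every packet on time, at the cost of adding that depth to the end-to-end delay, while shallower buffers turn the late packets into loss.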
Because jitter is always converted into delay first, then loss, it does not have a direct impact on the E-model by itself, but instead can be folded into the other measures. However, the complication arises because the user or administrator does not usually know the exact parameters of the jitter buffer. How many samples, how much delay, will the jitter buffer take before it starts to drop audio? Does the jitter buffer start off with a fixed delay? Does it build up the delay as jitter forces it to? Or does it try to proactively build in some delay, which can grow or shrink as the underruns occur? These all have an impact on the E-model call quality.
As a result, a rule of thumb here is to match the jitter tolerance to the delay tolerance. The network, at least, should not introduce more than 50ms of jitter.

Handoff Breaks | What Makes Voice over IP Quality Suffer


Handoffs cause consecutive packet losses. As mentioned in our previous discussion on packet loss, the impact of a handoff glitch can become large. The E-model does not make the best measurement of handoff break consternation, because it takes into account only the average burst length. Handoffs can cause burst loss far longer than the average, and these losses can delete entire words or parts of sentences.
Later chapters explore the details of where handoff breaks can occur. The two general categories are intratechnology handoffs, such as Wi-Fi access point to access point, and intertechnology handoffs, such as from Wi-Fi to cellular. Both handoffs can cause losses lasting up to a second, and the intertechnology handoff losses can be potentially far higher, if the line is busy or the network is congested when the handoff takes place.
The exact tolerance for handoff breaks depends on the mobility of the user, the density or cell sizes of the wireless technology currently in use, and the frequency of handoffs. Mobility tends to cut both ways: the more mobile the user is at the time of handoff, the more forgiving the user might be, so long as the handoff glitches stop when the user does. The density of the network base stations and the sizes of the cells determine how often a station hands off and how many choices a station has when doing so. These both add to the frequency of the glitches and the average delays the glitches see. Finally, the number of glitches a user sees during a call influences how they feel about the call and the technology.
There are no rules for how often the glitches should occur, except for the obvious one that the glitches should not be so many or for so long that they represent a packet loss rate beginning to approach a half of a percentage point. That represents one packet loss in a four second window, for 20ms packets. Therefore, a glitch of 100ms takes five packets, and so the glitch should certainly not occur more than once every 20 seconds. Glitches longer than that also run the risk of increasing the burst loss factor, and even more so run the risk of causing too many noticeable flaws in the voice call, even if they do not happen every few seconds. If, every two minutes, the caller is forced to repeat something because a choice word or two has been lost, then he would be right to consider that there is something wrong with the call or the technology, even though these cases do not fit well in the E-model.
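The arithmetic behind these figures can be checked with a short Python sketch. The function name min_seconds_between_glitches is made up for this example; the 20ms packet size and the half-percent loss ceiling come from the text above.

```python
def min_seconds_between_glitches(glitch_ms, packet_ms=20.0, max_loss_rate=0.005):
    """Return the shortest spacing, in seconds, between glitches of the
    given length that keeps the average packet loss at or below
    max_loss_rate."""
    packets_lost = glitch_ms / packet_ms            # packets wiped out per glitch
    packets_in_window = packets_lost / max_loss_rate  # packets that must surround them
    return packets_in_window * packet_ms / 1000.0


single_packet_window = min_seconds_between_glitches(20.0)   # one lost packet
handoff_window = min_seconds_between_glitches(100.0)        # a 100ms glitch
```

One lost 20ms packet works out to a window of about four seconds, and a 100ms glitch to about 20 seconds, matching the figures above.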
Furthermore, handoff glitches may not always result in a pure loss, but rather in a loss followed by a delay, as the packets may have been held during the handoff. This delay causes the jitter buffer (jitter is explained in Section 3.2.4) to grow, and forces the loss to happen at another time, possibly with more delay accumulated.
A good rule of thumb is to look for technologies that keep handoff glitches less than 50ms. This keeps the delaying effect and the loss effect to reasonable limits. The only exception to this would be for handoffs between technologies, such as a fixed-mobile convergence handoff between Wi-Fi and cellular. As long as those events are kept not only rare but predictable, such as that they happen only on entering or exiting the building, the user is likely to forgive the glitch because it represents the convenience of keeping the phone call alive, knowing that it would otherwise have died. In this case, it is reasonable to not want the handoff break to exceed two seconds, and to have it average around a half of a second.

PESQ: How to Predict MOS Using Mathematics


Therefore, we turn to how the predictions of voice quality can actually be made electronically. ITU P.862 introduces Perceptual Evaluation of Speech Quality, the PESQ metric. PESQ is designed to take into account all aspects of voice quality, from the distortion of the codecs themselves to the effects of filtering, delay variation, and dropouts or strange distortions. PESQ was verified with a number of real MOS experiments to make sure that the numbers are reasonable within the range of normal telephone voices.
PESQ is measured on a 1 to 4.5 scale, aligning exactly with the 1 to 5 MOS scale, in the sense that a 1 is a 1, a 2 is a 2, and so on. (The area from 4.5 to 5 in PESQ is not addressed.) PESQ is designed to take into account many different factors that alter the perception of the quality of voice.
The basic concept of PESQ is to have a piece of software or test measurement equipment compare two versions of a recording: the original one and the one distorted by the telephone equipment being measured. PESQ then returns the mean opinion score that a group of real listeners would be expected to give.
PESQ uses a perceptual model of voice, much the same way as perceptual voice codecs do. The two audio samples are mapped and remapped, until they take into account known perceptual qualities, such as the human change in sensitivity to loudness over frequency (sounds get quieter at the same pressure levels as they get higher in pitch). The samples are then matched up in time, eliminating any absolute delay, which affects the quality of a phone call but not a recording. The speech is then broken up into chunks, called utterances, which correspond to the same sound in both the original and distorted recording. The delays and distortions are then analyzed, counted, and correlated, and a number measuring how far removed the distorted signal is from the original signal is presented. This is the PESQ score.
PESQ is our first entry into the area of mathematical, or algorithmic, determination of call quality. It is good for measuring how well a new codec works, or how much noise is being injected into the sample. However, because it requires comparing what the talker said and what the listener heard, it is not practical for real-time call quality measurements.

Mean Opinion Score and How it Sounds


The Mean Opinion Score, or MOS (sometimes redundantly called the MOS score), is one way of ranking the quality of a phone call. This score is set on a five-point scale, according to the following ranking:
  • 5. Excellent
  • 4. Good
  • 3. Fair
  • 2. Poor
  • 1. Bad
MOS never goes below 1, or above 5.
There is quite a science to establishing how to measure MOS based on real-world human studies, and the depth they go into is astounding. ITU P.800 lays out procedures for measuring MOS. Annex B of P.800 defines listening tests to determine quality in an absolute manner. The test requirements are spelt out in detail. The room to be used should be between 30 and 120 cubic meters, to ensure the echo remains within known values. The phone under test is used to record a series of phrases. The listeners are brought in, having been selected from a group that has never heard the recorded sentence lists, in order to avoid bias. The listeners are asked to mark the quality of the played-back speech, distorted as it may be by the phone system. The listeners' scores, on the one-to-five scale, are averaged, and this becomes the MOS for the system. The goal of all of this is to attempt to increase the repeatability of such experiments.
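The final averaging step is simple enough to show directly. This is a minimal Python sketch with invented listener ratings; the function name mean_opinion_score and the sample data are assumptions for illustration, and none of the P.800 test conditions are modeled.

```python
# Labels for the five-point scale, as listed above.
SCALE = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mean_opinion_score(ratings):
    """Average a list of listener ratings, each an integer from 1 to 5."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in SCALE:
            raise ValueError("rating must be an integer from 1 to 5")
    return sum(ratings) / len(ratings)


# Five hypothetical listeners rating the same played-back sample.
mos = mean_opinion_score([4, 4, 5, 3, 4])
```

Because every rating lies between 1 and 5, the average can never fall outside that range.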
Clearly, performing MOS tests is not something that one would imagine can be done for most voice mobility networks. However, the MOS scale is so well known that the 1 to 5 scale is used as the standard yardstick for all voice quality metrics. The most important rule of thumb for the MOS scale is this: a MOS of 4.0 or better is toll-quality. This is the quality that voice mobility networks have to achieve, because this is the quality that nonmobility voice networks provide every day. Forgiveness will likely be offered by users when the problem is well known and entirely relatable, such as for bad-quality calls when in a poor cellular coverage area. But, once inside the building, enterprise voice mobility users expect the same quality wirelessly as they do when using their desk phone.
Thus, when a device reports the MOS for a call, the number you are seeing has been generated electronically, based on formulas that are thought to be reasonable facsimiles of the human experience.

SDP and Codec Negotiations


RTP only carries the voice, and there must be some associated way to signal the codecs which are supported by each end. This is fundamentally a property of signaling, but, unlike call progress messages and advanced PBX features, is tied specifically to the bearer channel.
SIP uses SDP to negotiate codecs and RTP endpoints, including transports, port numbers, and every other aspect necessary to start RTP streams flowing. SDP, defined in RFC 4566, is a text-based protocol, as SIP itself is, for setting up the various legs of media streams. Each line represents a different piece of information, in the format of type=value.
Table 1 shows an example of an SDP description. This description is for a phone at IP address 192.168.0.10, which wishes to receive RTP on UDP port 9000. Let's go through each of the fields.
  • Type "v" represents the protocol version, which is 0.
  • Type "o" holds information about the originator of this request, and the session IDs. Specifically, it is divided up into the username, session ID, session version, network type, address type, and address. "7010" happens to be the dialing phone number. The two large numbers afterward are identifiers, to keep the SDP exchanges straight. The "IN" refers to the address being an Internet protocol address; specifically, "IP4" for IPv4, of "192.168.0.10". This is where the originator is.
  • Type "s" is the session name. The value given here, "A_conversation", is not particularly meaningful.
  • Type "c" specifies where the originator can be reached: its connection data. This is a repetition of the IP address and type specifications for the phone.
  • Type "t" is the timing for the leg of the call. The first "0" represents the start time, and the second represents the end time. Therefore, there are no particular timing bounds for this call.
  • The "m" line specifies the media needed. In this case, as with most voice calls, there is only one voice stream from the device, so there is only one media line. The parameters are the media type, port, application, and then the list of RTP payload types. This call is an "audio" call, and the phone will be listening on port 9000. This is a UDP port, because the application is "RTP/AVP", meaning that it is plain RTP. ("AVP" is the standard audio/video profile, carried over UDP with no encryption. There is an "RTP/SAVP" option, mentioned shortly.) Finally, the RTP formats the phone can take are 0, 8, and 18.
  • The next three lines are the codecs that are supported in detail. The "a" field specifies an attribute. The "a=rtpmap" attribute means that the sender wants to map RTP packet types to specific codec setups. The line is formatted as packet type, encoded name/clock rate/parameters. In the first line, RTP packet type "0" is mapped to "PCMU" at 8000 samples per second. The default mapping of "0" is already PCM (G.711) with μ-law, so the new information is the sample rate. The second line asks for A-law, mapping it to 8. The third line asks for G.729, asking for 18 as the mapping. Because the phone only listed those three types, those are the only types it supports.
  • The last line is also an attribute. "a=ptime" is requesting that the other party send 20ms packets. The other party is not required to submit to this request, as it is only a suggestion. However, this is a pretty good sign that the sender of the SDP message will also send at 20ms.
Table 1: Example of an SDP Description
v=0
o=7010 1352822030 1434897705 IN IP4 192.168.0.10
s=A_conversation
c=IN IP4 192.168.0.10
t=0 0
m=audio 9000 RTP/AVP 0 8 18
a=rtpmap:0 PCMU/8000/1
a=rtpmap:8 PCMA/8000/1
a=rtpmap:18 G729/8000/1
a=ptime:20
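To make the field-by-field walkthrough concrete, here is a minimal Python sketch that splits the description in Table 1 into its type=value lines and pulls out the fields discussed above. The function name parse_sdp and the layout of the returned dictionary are choices made for this example, and the sketch assumes a single media line, as in the table; it is not a complete SDP parser.

```python
OFFER = """\
v=0
o=7010 1352822030 1434897705 IN IP4 192.168.0.10
s=A_conversation
c=IN IP4 192.168.0.10
t=0 0
m=audio 9000 RTP/AVP 0 8 18
a=rtpmap:0 PCMU/8000/1
a=rtpmap:8 PCMA/8000/1
a=rtpmap:18 G729/8000/1
a=ptime:20
"""


def parse_sdp(text):
    """Extract the connection address, media port, payload types,
    rtpmap entries, and ptime from a single-stream SDP description."""
    result = {"address": None, "port": None, "proto": None,
              "payload_types": [], "rtpmap": {}, "ptime": None}
    for line in text.strip().splitlines():
        kind, _, value = line.strip().partition("=")
        if kind == "c":
            # c=<network type> <address type> <address>
            result["address"] = value.split()[2]
        elif kind == "m":
            # m=<media> <port> <application> <payload types...>
            _media, port, proto, *formats = value.split()
            result["port"] = int(port)
            result["proto"] = proto
            result["payload_types"] = [int(f) for f in formats]
        elif kind == "a" and value.startswith("rtpmap:"):
            payload_type, encoding = value[len("rtpmap:"):].split(None, 1)
            result["rtpmap"][int(payload_type)] = encoding
        elif kind == "a" and value.startswith("ptime:"):
            result["ptime"] = int(value[len("ptime:"):])
    return result


offer = parse_sdp(OFFER)
```

For the description in Table 1, the result holds the address 192.168.0.10, port 9000, payload types 0, 8, and 18, and a packet time of 20ms.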
The setup message in Table 1 was originally given in a SIP INVITE message. The responding SIP OK message from the other party gave its SDP settings.
Table 2 shows this example response. Here, the other party, at IP address 10.0.0.10, wants to receive on UDP port 11690 an RTP stream with the three codecs PCMU, GSM, and PCMA. It can also receive a format known as "telephone-event." This corresponds to the RTP payload format for sending digits while in the middle of a call (RFC 4733). Some codecs, like G.729, can't carry a dialed digit as the usual audio beep, because the beep gets distorted by the codec. Instead, the digits have to be sent over RTP, embedded in the stream. The sender of this SDP is stating that it supports this format, and would like the digits to be sent as RTP payload type 101, a dynamic type that the sender was allowed to choose without restriction.
Table 2: Example of an SDP Responding Description
v=0
o=root 10871 10871 IN IP4 10.0.0.10
s=session
c=IN IP4 10.0.0.10
t=0 0
m=audio 11690 RTP/AVP 0 3 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=silenceSupp:off - - - -
a=ptime:20
a=sendrecv
Corresponding to this is the attribute "a=fmtp", which applies to payload type 101. The "fmtp" lines don't mean anything specific to SDP; instead, the request of "0-16" gets forwarded to the telephone event protocol handler. It is not necessary to go into further details here on what "0-16" means. The "a=silenceSupp" line would activate silence suppression, in which packets are not sent when the caller is not talking. Silence suppression has been disabled, however. Finally, the "a=sendrecv" line means that the originator can both send and receive streaming packets, meaning that the caller can both talk and listen. Some calls are intentionally one-way, such as lines into a voice conference where the listeners cannot speak. In that case, the listeners may have requested a flow with "a=recvonly".
After a device gets an SDP request, it knows enough information to send an RTP stream back to the requester. The receiver need only choose which media type it wishes to use. There is no requirement that both parties use the same codec; rather, if the receiver cannot handle the codec, the higher-layer signaling protocol needs to reject the setup. With SIP, the called party will not usually stream until it accepts the SIP INVITE, but there is no further handshaking necessary once the call is answered and there are packets to send.
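The choice of media type can be sketched as follows. This Python example assumes one simple policy, taking the first format in the peer's list that the local device can encode; the function name choose_send_payload and that policy are illustrative assumptions, not something SDP mandates.

```python
def choose_send_payload(peer_formats, local_encoders):
    """Pick the payload type to send toward the peer.

    peer_formats is a list of (payload_type, encoding_name) pairs in the
    order the peer advertised them. local_encoders is the set of codec
    names this device can encode. Returns None when nothing matches, in
    which case the signaling layer has to reject the setup.
    """
    for payload_type, name in peer_formats:
        if name in local_encoders:
            return payload_type, name
    return None


# Formats advertised in the example response (Table 2).
answer_formats = [(0, "PCMU"), (3, "GSM"), (8, "PCMA"), (101, "telephone-event")]

# The phone from Table 1 can encode PCMU, PCMA, and G.729.
choice = choose_send_payload(answer_formats, {"PCMU", "PCMA", "G729"})
```

Here the phone would send PCMU as payload type 0, the first entry in the peer's list that it can encode.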
For SRTP usage with SIPS, SDP allows for the SRTP key to be specified using a special attribute:
a=crypto:1 AES_CM_128_HMAC_SHA1_32 inline:c3bFaGA+Seagd117041az3g113geaG54aKgd50Gz
This specifies that SRTP is to use AES in counter mode with HMAC-SHA1, and gives the key, encoded in base-64, that is to be used. Both sides of the call send their own randomly generated keys, under the cover of the TLS-protected link. This forms the basis of RTP/SAVP.
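To make the crypto line concrete, here is a small Python sketch that pulls it apart and decodes the base-64 key. The helper name `parse_crypto` is made up for this example. The split of the decoded bytes into a 16-byte master key followed by a 14-byte master salt is an assumption taken from the SDP security descriptions specification (RFC 4568) for this cipher suite, not something stated in the text above.

```python
import base64

CRYPTO_LINE = ("a=crypto:1 AES_CM_128_HMAC_SHA1_32 "
               "inline:c3bFaGA+Seagd117041az3g113geaG54aKgd50Gz")


def parse_crypto(line):
    """Split an a=crypto line into tag, suite, and decoded key material.

    parse_crypto is a hypothetical helper; it handles only the simple
    "inline:<base64>" form shown in the text.
    """
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not a crypto attribute")
    tag, suite, key_params = line[len(prefix):].split(None, 2)
    key_param = key_params.split()[0]          # ignore any session parameters
    method, _, key_info = key_param.partition(":")
    if method != "inline":
        raise ValueError("only inline keys are handled in this sketch")
    key_b64 = key_info.split("|")[0]           # drop optional lifetime/MKI
    raw = base64.b64decode(key_b64)
    return {
        "tag": int(tag),
        "suite": suite,
        # Assumed layout for AES_CM_128 suites: 16-byte key, 14-byte salt.
        "master_key": raw[:16],
        "master_salt": raw[16:],
    }


crypto = parse_crypto(CRYPTO_LINE)
```

The 40 base-64 characters in the example decode to 30 bytes, which matches the 128-bit key plus 112-bit salt that this suite expects.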

Real-time Transport Protocol | RTP


The codec defines only how the voice is compressed and packaged. The voice still needs to be placed into well-defined packets and sent over the network.
The Real-time Transport Protocol (RTP), defined in RFC 3550, defines how voice is packetized on most IP-based networks. RTP is a general-purpose framework for sending real-time streaming traffic across networks, and is used for nearly all media streaming, including voice and video, where real-time delivery is essential.
RTP is usually sent over UDP, on any port that the applications negotiate. The typical RTP packet has the structure given in Table 1.
Table 1: RTP Format
| Flags | Sequence Number | Timestamp | SSRC | CSRCs | Extensions | Payload |
| --- | --- | --- | --- | --- | --- | --- |
| 2 bytes | 2 bytes | 4 bytes | 4 bytes | 4 bytes × number of contributors | variable | variable |
The idea behind RTP is that the sender sends the timestamp that the first byte of data in the payload belongs to. This timestamp gives a precise time that the receiver can use to reassemble incoming data. The sequence number also increases monotonically, and can also establish the order of incoming data. The SSRC, for Synchronization Source, is the stream identifier of the sender, and lets devices with multiple streams coming in figure out who is sending. The CSRCs, for Contributing Sources, are other devices that may have contributed to the packet, such as when a conference call has multiple talkers at once.
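The ordering role of the sequence number can be sketched in a few lines of Python. Because the field is only two bytes wide, it wraps from 65535 back to 0, so a receiver has to compare sequence numbers modulo 65536. The names `seq_distance` and `reorder`, and the tuple layout used for packets, are assumptions made for this illustration; a real receiver would fold this logic into its jitter buffer.

```python
def seq_distance(seq, base):
    """Forward distance from base to seq, modulo 2**16 (the field is 2 bytes)."""
    return (seq - base) & 0xFFFF


def reorder(packets, base_seq):
    """Sort (seq, timestamp, payload) tuples into sending order.

    base_seq is the oldest sequence number the receiver is still waiting for.
    """
    return sorted(packets, key=lambda p: seq_distance(p[0], base_seq))


# Four 20 ms packets (160 timestamp ticks apart at 8000 Hz) that arrived
# out of order, straddling the wrap from 65535 to 0.
arrived = [
    (65535, 1600, b"b"),
    (1, 1920, b"d"),
    (65534, 1440, b"a"),
    (0, 1760, b"c"),
]
ordered = reorder(arrived, base_seq=65534)
payloads = [p[2] for p in ordered]
```

After reordering, the timestamps increase in step with the sequence numbers, so the receiver can place each payload at the right point in the playout.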
The most important fields are the timestamp and the payload type. The payload type is carried in the flags field (see Table 2), and usually specifies the type of codec being used in the stream (see Table 3).
Table 2: The RTP Flags Field
|  | Version | Padding | Extension (X) | Contributor Count (CC) | Marker (M) | Payload Type (PT) |
| --- | --- | --- | --- | --- | --- | --- |
| Bit: | 0-1 | 2 | 3 | 4-7 | 8 | 9-15 |
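The layout of the fixed header and the flags field can be sketched as a short Python parser. The function name `parse_rtp` is invented for this example. The sketch does not interpret the extension header or strip padding, and the version value of 2 used in the sample packet comes from RFC 3550 rather than from the tables here.

```python
import struct


def parse_rtp(packet):
    """Split an RTP packet into its header fields and payload.

    parse_rtp is a hypothetical helper. If the extension bit is set, the
    extension header is left at the front of "payload" in this sketch.
    """
    if len(packet) < 12:
        raise ValueError("shorter than the fixed RTP header")
    # Flags (2 bytes), sequence number (2), timestamp (4), SSRC (4),
    # all in network byte order.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                      # bits 4-7: contributor count
    csrc_end = 12 + 4 * cc
    if len(packet) < csrc_end:
        raise ValueError("truncated CSRC list")
    csrcs = list(struct.unpack("!%dI" % cc, packet[12:csrc_end]))
    return {
        "version": b0 >> 6,             # bits 0-1
        "padding": (b0 >> 5) & 1,       # bit 2
        "extension": (b0 >> 4) & 1,     # bit 3
        "cc": cc,
        "marker": b1 >> 7,              # bit 8
        "payload_type": b1 & 0x7F,      # bits 9-15
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
        "payload": packet[csrc_end:],
    }


# A sample packet: version 2, no contributors, payload type 0,
# carrying 160 bytes of (silent) payload.
sample = struct.pack("!BBHII", 0x80, 0x00, 7, 160, 0x1234ABCD) + bytes(160)
header = parse_rtp(sample)
```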
Table 3 shows the most common voice RTP types. Numbers from 96 through 127 are also allowed; these are dynamic types, usually set up by the endpoints to carry a stream whose format they negotiate.
Table 3: Common RTP Packet Types
| Payload Type | Encoded Name | Meaning |
| --- | --- | --- |
| 0 | PCMU | G.711 with μ-law |
| 3 | GSM | GSM |
| 8 | PCMA | G.711 with A-law |
| 18 | G729 | G.729 or G.729a |
When the codec's output is packaged into RTP, it is done in a way that avoids both splitting necessary information across packets and sending too many packets per second. For G.711, an RTP packet can be created with as many samples as desired for the given packet rate. Common packet durations are 20ms and 30ms. Decoders know to append the samples across packets as if they were in one stream. For G.729, the RTP packet must come in 10ms multiples, because G.729 only encodes 10ms blocks. An RTP packet with G.729 can have multiple blocks, and the decoder knows to treat each block separately and sequentially. G.729 phones commonly stream with RTP packets holding 20ms of audio or more, to avoid having too many packets in the network.
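The arithmetic behind these packet sizes is simple enough to sketch in Python. The figures used are the standard codec rates, which are assumptions not spelled out in the paragraph above: G.711 produces 8000 one-byte samples per second (64kbps), and G.729 produces one 10-byte block per 10ms (8kbps). The function names are invented for this example.

```python
def rtp_payload_bytes(codec, ptime_ms):
    """Payload size in bytes for one RTP packet holding ptime_ms of audio.

    Assumes the standard rates: G.711 at 8 bytes per ms, G.729 at one
    10-byte block per 10 ms.
    """
    if codec in ("PCMU", "PCMA"):
        return 8 * ptime_ms
    if codec == "G729":
        if ptime_ms % 10 != 0:
            raise ValueError("G.729 packets must hold whole 10 ms blocks")
        return 10 * (ptime_ms // 10)
    raise ValueError("codec not covered by this sketch")


def packets_per_second(ptime_ms):
    """How many packets per second a stream sends at this packet duration."""
    return 1000 / ptime_ms


g711_20ms = rtp_payload_bytes("PCMU", 20)   # 160 bytes
g711_30ms = rtp_payload_bytes("PCMA", 30)   # 240 bytes
g729_20ms = rtp_payload_bytes("G729", 20)   # 20 bytes, two blocks
rate_20ms = packets_per_second(20)          # 50 packets per second
```

At 10ms per packet, a G.729 stream would send 100 packets per second, each carrying only 10 bytes of voice, which is why phones tend to pack two or more blocks into each packet.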

Secure RTP

RTP itself has a security option, designed to allow the contents of the RTP stream to be protected while still allowing the quick reassembly of a stream and the robustness of allowing parts of the stream to be lost on the network. Secure RTP (SRTP) uses the Advanced Encryption Standard (AES) to encrypt the packets. (AES will later have a starring role in Wi-Fi encryption, as well as for use with IPsec.) The RTP stream requires a key to be established. Each packet is then encrypted with AES running in counter mode, a mode where intervening packets can be lost without disrupting the decryptability of subsequent packets in the sequence. Integrity of the packets is ensured by the use of the HMAC-SHA1 keyed signature, for each packet.
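The per-packet integrity check can be sketched with Python's standard `hmac` module. This is only an illustration of the idea, under stated assumptions: the authentication key is a placeholder (in real SRTP it is derived from the master key, which is not shown), the function names are invented, and the detail that the tag covers the packet followed by a 32-bit rollover counter and is then truncated follows the author of this sketch's reading of the SRTP specification (RFC 3711). The AES encryption step is left out, since it needs a library beyond the standard one.

```python
import hashlib
import hmac


def auth_tag(auth_key, packet, roc, tag_len=4):
    """HMAC-SHA1 over the packet plus rollover counter, truncated.

    tag_len=4 corresponds to the 32-bit tag of an ..._HMAC_SHA1_32 suite.
    auth_key is assumed to have been derived elsewhere.
    """
    mac = hmac.new(auth_key, packet + roc.to_bytes(4, "big"), hashlib.sha1)
    return mac.digest()[:tag_len]


def verify(auth_key, packet, roc, tag):
    """Check a received tag in constant time."""
    expected = auth_tag(auth_key, packet, roc, len(tag))
    return hmac.compare_digest(expected, tag)


# Placeholder key and packet, for illustration only.
key = bytes(range(20))
pkt = b"\x80\x00\x00\x07" + bytes(8) + b"voice payload"
tag = auth_tag(key, pkt, roc=0)

ok = verify(key, pkt, 0, tag)
tampered = verify(key, pkt[:-1] + b"X", 0, tag)
```

Because each packet carries its own tag, a receiver can check any packet that arrives without needing the packets that were lost before it.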
How the SRTP stream gets its keys is not specified by SRTP. However, SIPS provides a way for this to be set up that is quite logical.