Telecom Made Simple: 2012

Reducing Security Handoff Overhead with Opportunistic Key Caching

The good news is that the 802.1X mechanisms can be taken out of the picture for handoffs, for wireless architectures with a controller (or a large number of radios in one access point). This mechanism, available today from many vendors, is known as opportunistic key caching (OKC). The name comes from the main concept underlying the technology. Once a client performs the authentication with the RADIUS server, and has a PMK, there is no reason for it to have to negotiate a new one just to hand off and create a new PTK for the new access point. The term "opportunistic" is used because the mechanism was designed to be a simple extension of 802.11i, and the client is not made aware that OKC is enabled. If it works, it works. If not, no problems arise except the increased time required for doing the handshake.

The main protocol for OKC is identical to that of ordinary key caching. The only difference is that whereas ordinary key caching requires that the client be returning to an access point where it had already performed 802.1X, opportunistic key caching requires only that the new access point somehow have access to the PMK, even though it was created on a different access point.

How can this work? The PMK, if you recall, does not have any information unique to the wireless network within it. It is a function purely of the EAP protocol in use between the wireline RADIUS server and the wireless client. There is no intrinsic reason that the same PMK cannot be used for different access points, as long as the following two restrictions are held to: the PMK must never be transmitted as plaintext or using weak encryption, and the PMK must not have expired.

In practice, opportunistic key caching implementations never move around the PMK. Instead, these implementations take advantage of the architecture of the WPA2 protocol and how it interacts with 802.1X. 802.1X doesn't know about clients and access points. Instead, it uses a different language, in which the role of the user is held by a supplicant, and the role of the network is held by an authenticator. The mapping of the supplicant to real devices is clear: the supplicant is a part of the client. The authenticator, on the other hand, has flexibility built in. For standalone access point architectures, the authenticator is a part of the access point. For controller-based architectures, however, the authenticator is almost always in the controller.

Now we get a sense for the scale of opportunistic key caching. The PMK was originally created in the authenticator, and most opportunistic key caching architectures leave the PMK inside the authenticator, never to come out. For controller-based architectures, the controller generates the PTK within the authenticator, and then distributes it to the encryption engine, which may be located locally in the controller or in the access points. With opportunistic key caching, then, the only change is to allow a client with a PMK to associate to a new access point, and to use the PMK for the new connection as if it had been negotiated on that access point.

There is no addition of protocols or state changes in opportunistic key caching, which explains why it is so prevalent within network implementations. The only changes are to clients, which have to create a new PMKID, based on the original PMK, when they associate to a new access point, and to the authenticator, which needs to look past the fact that it never issued a PMKID for this PMK on the new access point, create the new one, and then continue as if nothing unusual had happened.
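As a minimal sketch of the PMKID step described above: 802.11i defines the PMKID as the first 128 bits of HMAC-SHA1, keyed with the PMK, over the string "PMK Name", the authenticator address (the BSSID), and the supplicant address. The function names, the cache layout, and the PMK and MAC addresses below are hypothetical, chosen only for illustration; a real authenticator would also check that the PMK has not expired.

```python
import hashlib
import hmac
from typing import Dict


def compute_pmkid(pmk: bytes, bssid: bytes, client_mac: bytes) -> bytes:
    """PMKID = first 16 bytes of HMAC-SHA1(PMK, "PMK Name" || AA || SPA)."""
    if len(bssid) != 6 or len(client_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes long")
    data = b"PMK Name" + bssid + client_mac
    return hmac.new(pmk, data, hashlib.sha1).digest()[:16]


def authenticator_accepts(pmk_cache: Dict[bytes, bytes], bssid: bytes,
                          client_mac: bytes, offered_pmkid: bytes) -> bool:
    """Opportunistic check: derive the PMKID for this BSSID from a cached PMK."""
    pmk = pmk_cache.get(client_mac)
    if pmk is None:
        return False  # no cached PMK, so fall back to full 802.1X
    expected = compute_pmkid(pmk, bssid, client_mac)
    return hmac.compare_digest(expected, offered_pmkid)


# Hypothetical example values (not from any real network).
pmk = bytes(range(32))
client = bytes.fromhex("020000000001")
old_ap = bytes.fromhex("0200000000a1")
new_ap = bytes.fromhex("0200000000a2")
cache = {client: pmk}

# The client derives a fresh PMKID for the new access point from the old PMK.
offered = compute_pmkid(pmk, new_ap, client)
print(authenticator_accepts(cache, new_ap, client, offered))
```

Because the BSSID is an input, the same PMK yields a different PMKID on every access point, which is why the client has to compute a new one at each handoff.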

You should look for wireless clients and network infrastructure that support opportunistic key caching when rolling out a voice mobility network. OKC has been generally embraced by the industry, though there are a few notable exceptions, and is generally used as the solution to the 802.1X overhead.

The Wi-Fi Break-Before-Make Handoff

Basic Wi-Fi handoffs are always either break-before-make or just-in-time. In other words, there is no ability for a wireless phone to decide on a handoff and establish a relationship with a new access point without disconnecting from the previous one. The rules of 802.11 are rather simple here: no client is allowed to associate to two access points at the same time (that is, to send an Association message to one while maintaining data connectivity to another). The reason for this is to remove any ambiguity as to which access point should forward wireline traffic destined to the client; otherwise, both access points would have the requirement of receiving the client's traffic, and that would not work in a switched wireline environment.

However, almost all of the important protocols for Wi-Fi happen only after a data connection has been established. This prevents clients from gaining much of a head start on establishing a connection when the old one is at risk.

Let's look at the contents of the Wi-Fi handoff protocol itself, step by step. The details will be useful in the discussion that follows.

1. Once a client has decided to hand off, it need not break the connection to the original access point, but it must not use it any longer.
2. The client has the option of sending a Disassociation message to the old access point, a good practice that lets the old access point free up network resources.
3. At this point, if the new access point is on a different channel, the client will change the channel of its receiver.
4. If the new channel is a DFS channel, the client is required to wait until it receives a beacon frame from the access point, unless it has recently heard one as a part of a passive scanning procedure.
5. The client will send an Authentication message to the new access point, establishing the beginnings of a relationship with this new access point, but not yet enabling data services.
6. The access point will respond with its own Authentication message, accepting the client. A rejection can occur if load balancing is enabled, and the access point decides that it is oversubscribed, or if key state tables in the access point are full.
7. The client will send a Reassociation Request message to the access point, requesting data services.
8. The access point will send a Reassociation Response message to the client. If the message has a status code for success, the client is now associated with and connected to this access point, and only this access point. Controller-based wireless architectures will usually ensure this by immediately destroying any connection that may have been left over if step 2 has not been performed. The access point may reject the association if it is oversubscribed, or if the additional services the client requests (mostly security or quality-of-service) in the Reassociation Request will not be supported.

At this point, the client is associated and data services are available. Usually, the access point or controller behind it will send a broadcast frame, spoofed to appear as if it were sent by the client, to the connected Ethernet switch, informing it of the client's presence on that particular link and not on any one that may have been used previously.

If no security is employed, skip ahead to the admission control mechanisms, towards the end of the list. If PSK security is employed, skip ahead to the four-way handshake. Otherwise, if 802.1X and RADIUS authentication is employed (WPA/WPA2 Enterprise), we'll continue immediately next.
9. The access point and client can only exchange EAP messages at this point. The client may solicit the EAP exchange with an optional EAP Start message.
10. The access point will request the client to log in with an EAP Request Identity message.
11. Depending on the EAP method required by the RADIUS server on the network, the client and access point will continue to exchange a number of data frames, all EAPOL.
12. The access point relays the RADIUS server's EAP Success or EAP Failure message. If this is a failure, the access point will also likely send a Deauthentication or Disassociation message to the client, to kick it off of the access point.

At this point, the client and access point have agreed on the pairwise master key (PMK), based on the key material generated during the RADIUS exchange and sent to the access point when the authentication process concluded. But, the access point and client still need to generate a per-connection, pairwise transient key (PTK), which will be used to do the actual encryption. Pre-shared key (PSK) networks skipped the listed EAP exchanges, and use the PSK as the master key.
13. The access point sends the first message in the RSN (802.11i) four-way handshake. This is an EAPOL Key frame.
14. The client sends the second message in the four-way handshake.
15. The access point sends the third message in the four-way handshake.
16. The client sends the fourth message in the four-way handshake.

At this point, all data services are enabled, and the client and access point can exchange data frames. However, if a call is in progress, and WMM Admission Control is enabled, the client is required to request the voice resources before it can send or receive a single voice packet with priority. Until this point, both sides may either buffer the packets or send the voice packets as best-effort.
17. The client sends the access point an ADDTS Request Action frame, with a TSPEC that specifies the over-the-air resources that both the upstream and downstream part of the voice call will occupy.
18. The access point weighs whether it has enough resources to accept or deny the request. It sends an ADDTS Response Action frame with the results.
19. If the request was successful, the client and access point will be sending voice traffic and the call successfully handed off. On the other hand, if the request fails, the client will disconnect from the access point with a Disassociation message, because, although it is allowed to remain on the access point, it can't send or receive any voice traffic.

Hopefully, everything went well and the handoff completed. On the other hand, if any of the processes failed, the connection is broken. The old connection was abandoned early on (in step 8 for sure, and in step 2 for more charitable clients). In order to not drop the phone call, the phone will need to restart the process from the beginning with another access point, perhaps even the original access point it just left, if no other is available.
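The sequence walked through above can be sketched as a function that lists the over-the-air exchanges for a given configuration. This is an illustrative summary only: the function name, the parameter names, and the choice to represent the multi-frame EAP method exchange as a single entry are assumptions of this sketch, not anything defined by 802.11.

```python
from typing import List


def handoff_steps(security: str, pmk_cached: bool = False,
                  admission_control: bool = False,
                  send_disassociation: bool = True) -> List[str]:
    """Return the over-the-air exchanges for one handoff, in order.

    security is "open", "psk", or "802.1x". The EAP method exchange is
    listed as one entry even though it spans several frames.
    """
    if security not in ("open", "psk", "802.1x"):
        raise ValueError("unknown security mode: " + security)
    steps: List[str] = []
    if send_disassociation:
        steps.append("Disassociation (to old access point)")
    steps += ["Authentication (client)", "Authentication (access point)",
              "Reassociation Request", "Reassociation Response"]
    # Full 802.1X is needed only when no usable PMK is cached (for example via OKC).
    if security == "802.1x" and not pmk_cached:
        steps += ["EAP Start (optional)", "EAP Request Identity",
                  "EAP method exchange", "EAP Success or EAP Failure"]
    # Both PSK and 802.1X networks derive the PTK with the four-way handshake.
    if security != "open":
        steps += ["Four-way handshake message %d" % i for i in range(1, 5)]
    if admission_control:
        steps += ["ADDTS Request", "ADDTS Response"]
    return steps


full = handoff_steps("802.1x", admission_control=True)
cached = handoff_steps("802.1x", pmk_cached=True, admission_control=True)
for step in full:
    print(step)
```

Comparing `full` with `cached` shows where a cached PMK saves time: the whole EAP block drops out, while the four-way handshake and admission control remain.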

You will notice that the client has a lot of work to do to make the handoff successful, and there are many places where the procedure can go wrong. Even if every request were to be accepted, any loss of some of the messages can cause long timeouts, often up to a second, as each side waits to make sure that no messages are passing each other by.

If nothing at all is done to optimize this transition, the handoff mechanics can take an additional second or two, on top of the second or so taken by the scanning process before the handoff decision was made. In the worst case, the 802.1X communication can take a number of seconds.

Part of the issue is that the mechanisms are nearly the same for a handoff as they are for when the client initially connects. This lack of memory within the network in basic Wi-Fi prevents any optimizations and requires a fresh start each time.

When Scanning Happens

The client's handoff is only as good as its scanning table. The more the client scans, the more accurate the information it receives, and the better decision the client can make, thus ensuring a more robust call. However, scanning can cost as much in call quality as it saves, and most certainly diminishes battery life. So how do phones determine when to scan?

The most obvious way for a client to decide to scan is for it to be forced to scan. If the phone loses connection with the access point that it is currently attached to, then it will have no choice but to reach out and look for new options. Clients mainly determine that they have lost the connection with their current access point in three different ways.

The first method is to observe the beacons for loss. As mentioned earlier, beacon frames are transmitted on specific intervals, by default every 102.4ms. Because the beacons have such a strict transmission pattern, clients (even sleeping clients) know when to wake up to catch a beacon. In fact, they need to do this regularly, as a part of the power saving mechanisms built into the standard. A client can still miss a beacon, for two reasons: either the beacon frame was collided with (and, because beacon frames are sent as broadcast, there are no retransmissions), or the client is out of the range that the beacons' data rates allow. Therefore, clients will usually observe the beacon loss rate. If the client finds itself unable to receive enough beacons according to its internal thresholds, it can declare the access point either lost or possibly suffering from heavy congestion, and thus trigger a new scan, as well as deprioritize the access point in the scanning table. The loss thresholds used in real clients are often based on a combination of two or more different types of thresholds, such as triggering a scan if a certain number of beacons are lost consecutively, as well as triggering if a certain percentage is lost over time. These thresholds are unlikely to be directly specifiable by the user or administrator.
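The combination of thresholds described above might look something like the following sketch, which triggers a scan either on a run of consecutive missed beacons or on a high loss percentage over a sliding window. The class name and all three default thresholds are hypothetical; real clients use their own, usually unpublished, values.

```python
from collections import deque


class BeaconLossMonitor:
    """Tracks expected beacons and reports when a scan should be triggered."""

    def __init__(self, max_consecutive: int = 5, window: int = 50,
                 max_loss_ratio: float = 0.3) -> None:
        self.max_consecutive = max_consecutive
        self.max_loss_ratio = max_loss_ratio
        self._history = deque(maxlen=window)  # True = beacon received
        self._consecutive_lost = 0

    def record(self, received: bool) -> bool:
        """Record one expected beacon; return True if a scan should start."""
        self._history.append(received)
        self._consecutive_lost = 0 if received else self._consecutive_lost + 1
        # Threshold 1: too many beacons lost in a row.
        if self._consecutive_lost >= self.max_consecutive:
            return True
        # Threshold 2: too large a share lost over a full window.
        if len(self._history) == self._history.maxlen:
            lost = self._history.count(False)
            return lost / len(self._history) > self.max_loss_ratio
        return False


monitor = BeaconLossMonitor(max_consecutive=3)
results = [monitor.record(False) for _ in range(3)]
print(results)
```

The consecutive-loss rule reacts quickly when the client walks out of range, while the windowed rule catches a slowly degrading or congested access point.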

The second method is to observe data transmissions for loss. This can be done for received or transmitted frames. However, it is difficult for a client to adequately or accurately determine how many receive frames have been lost, given that the only evidence of a retransmission prior to a lost frame is the setting of the Retry bit in the frame's header, something that is not even required in the newer 802.11n radios. Therefore, clients tend to monitor transmission retries. The retry process is invoked for a frame when no acknowledgment arrives for it. Retransmissions are performed for both collisions and adapting to out-of-range conditions; because the transmitter does not know which problem caused the loss, both are handled by the transmitter simultaneously reducing the transmit data rate, in hopes of extending range, and increasing backoff, in hopes of avoiding further collisions for this one frame. Should a series of back-to-back frames be retransmitted until they time out, the client may decide that the root cause is being out of range of the access point. Again, the thresholds required are not typically visible or exposed to the user or administrator.

Voice clients tend to be more proactive in the process of scanning. The two methods just described are for when the client has strong evidence that it is departing the range of the access point. However, because the scanning process itself can take as long as it does, clients may choose to initiate the scan before the client has disconnected. (This may sound like the beginnings of a make-before-break handoff scheme, but read on to Section 6.2.3, where we see that such a scheme does not, in fact, happen.) Clients may choose to start scanning proactively when the signal strength from the access point begins to dip below a predetermined threshold (the signal strength itself is usually measured directly for the beacons). Or, they may take into account increasing, but not yet disruptive, losses for data. Or, they may take into account observed information about channel conditions, such as an increasing noise floor or the encountering of a higher density of competing clients, to trigger the scan. In any event, the client is attempting to make some sort of preprogrammed expense/reward tradeoff. This tradeoff is often related to the problems of handoff, as mentioned shortly.
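A signal-strength trigger of the kind just described could be sketched as below. The exponential smoothing is an assumption of this sketch (the text says only that a threshold is used), and the class name, the -70 dBm threshold, and the smoothing factor are hypothetical values rather than anything a particular client is known to use.

```python
from typing import Optional


class RssiScanTrigger:
    """Start scanning when the smoothed beacon signal strength dips too low."""

    def __init__(self, threshold_dbm: float = -70.0, alpha: float = 0.25) -> None:
        self.threshold_dbm = threshold_dbm
        self.alpha = alpha  # weight given to the newest beacon measurement
        self.smoothed_dbm: Optional[float] = None

    def update(self, beacon_rssi_dbm: float) -> bool:
        """Feed one beacon measurement; return True if a scan should start."""
        if self.smoothed_dbm is None:
            self.smoothed_dbm = beacon_rssi_dbm
        else:
            self.smoothed_dbm = (self.alpha * beacon_rssi_dbm
                                 + (1.0 - self.alpha) * self.smoothed_dbm)
        return self.smoothed_dbm < self.threshold_dbm


trigger = RssiScanTrigger(threshold_dbm=-70.0, alpha=0.5)
first = trigger.update(-60.0)   # strong signal, no scan
second = trigger.update(-90.0)  # smoothed value falls to -75 dBm
print(first, second)
```

Smoothing keeps a single weak beacon from starting a scan, at the cost of reacting a little later when the client really is moving away. That is one form of the expense/reward tradeoff mentioned above.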

Scanning may also happen in the background, for no reason at all. This is less common in voice clients, where the desire to ensure battery life acts as a deterrent, but nevertheless is employed from time to time. The main reason to do this sort of background scanning is to ensure that the client's scanning table is generally not as stale, or to serve as a failsafe in case the triggered scanning behavior does not go off as expected. One of the chief problems with determining when to scan is that the client has no way of knowing whether it is moving or how fast it may be moving. A phone held in the hands of a forklift driver can rapidly go from having been standing still for many minutes to racing by at 15 miles per hour in a warehouse. This sort of scanning, not being triggered, is the least likely to lead to a change in access point selection, but may still serve its appropriate place in a network. For data clients, as a comparison, this form of background scanning, triggered for no reason, is often driven by the operating system. Windows-based systems often scan, for example, every 65 seconds, just to ensure that the operating system has a good sense of the networks that are available, in case the user should want to hop from one network to another. This sort of scanning causes a noticeable hit in performance for a short period of time on a periodic basis.

The Scanning Process

The scanning table's contents come from beacons and probe responses. Scanning is a process that can be requested explicitly by the user, often by performing an operation that is labeled "Reconnect," "Update," or "Scan." But far more often, scanning is a process that happens in the background or when the client decides that it is needed. To understand why the client makes those choices, we will need to look at the mechanisms of scanning itself.

There are two ways that the scanning table can be updated. When a client is associated to an access point, it has the ability to gather information about other access points on that channel. Especially when the client is not in power save mode, the client will usually ask its hardware to let it receive all beacon frames from any access point. Each beacon frame is then used to update the scanning table entry for that access point.

On the other hand, the client may want to survey other channels to find out what other access point options are out there. To do this, the client clearly needs to leave the channel of its access point for at least a small amount of time. Therefore, before engaging in this process, the client will usually tell the access point that it is going into power save mode, even though it is doing no such thing. That way, the access point will buffer traffic for the client, who can then look around the network with impunity.

When the client changes channels, it has two methods it can use to find out about the access points. The quickest method is to send out the probe request mentioned earlier. This probe request contains the SSID the client desires (with the option of a null SSID, an empty string, if the client wants to learn about all SSIDs), and is picked up by all access points in range that support the SSID and wish to make themselves known to the client. Each access point that wishes to answer and that supports the SSID in question will respond with the probe response, a frame that is nearly identical to a beacon but is sent, unicast, directly to the client who asked for it. This procedure is called active scanning, though it can also be called probing, given the name of the frames that carry out the procedure. The other option is called passive scanning, and, as the name suggests, involves sending no frames out by the client. Instead, the client waits around for a beacon. Keep in mind that passive scanning clients do not know, ahead of time, how many access points are on a channel or when these access points may transmit the beacons. Therefore, a client may need to wait for at least one beacon period to maximize its chances of seeing beacons from every access point of possible interest.

In these two ways, the client goes from channel to channel, collecting as much information as possible about the available networks.

Clients may choose between active or passive scanning for a number of reasons. The advantage of active scanning is that the client will get definitive answers about the access points that are on that channel and in range in short order. Sometimes the client needs to send more than one probe request, just to make sure that none of those broadcast frames were lost because of transient RF effects or collisions. But the process itself concludes rather quickly. Furthermore, active scanning with probe requests is the only way to learn about which access points serve SSIDs that are hidden, where hidden SSIDs are not put in beacons and require the user to enter the SSID by hand. On the other hand, active scanning comes with two major penalties. The first one is for sheer network overhead. A probe request can trigger a storm of probe responses to the client, all of which take up valuable airtime. Especially when there is a network fluctuation (access point reboots, power outages, or RF interference), all of the probes pile onto an already fragile network, making traffic significantly worse. The second penalty is that active scanning is simply not allowed on the majority of the 5GHz channels. Any channel that is in a DFS band cannot be used with active scanning. Instead, the client is always required to wait for a beacon (an enabling signal), to know that the channel is allowed for operation, does not have a radar, and thus can be used. (Note that, once a client has an enabling signal, it is allowed to proceed with a probe request to discover hidden SSIDs. However, the time hit has been taken, and the process is no faster than a normal passive scan.)
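The DFS restriction described above reduces to a small decision rule, sketched here. Which 5GHz channels count as DFS depends on the regulatory domain, so the channel set below is only an illustrative assumption (and does not match the channel counts used elsewhere in this text), and the function name is hypothetical.

```python
from typing import FrozenSet

# Illustrative assumption: 5GHz channels treated as DFS in this example.
# The real list depends on the regulatory domain.
EXAMPLE_DFS_CHANNELS: FrozenSet[int] = frozenset(
    list(range(52, 65, 4)) + list(range(100, 141, 4))
)


def scan_mode(channel: int, dfs_channels: FrozenSet[int],
              heard_enabling_beacon: bool) -> str:
    """Return "active" if a probe request may be sent, else "passive"."""
    if channel in dfs_channels and not heard_enabling_beacon:
        # No enabling signal yet: the client must wait for a beacon.
        return "passive"
    return "active"


print(scan_mode(6, EXAMPLE_DFS_CHANNELS, False))
print(scan_mode(52, EXAMPLE_DFS_CHANNELS, False))
print(scan_mode(52, EXAMPLE_DFS_CHANNELS, True))
```

Even in the third case the client has already paid the wait for the enabling beacon, so being allowed to probe afterward does not make the scan any faster.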

Therefore, to better understand scanning, we need to look at the timing of scanning. Active scanning, of course, is the quicker process, but it too has a delay. Active scanning is limited by a probe delay, required by the standard to prevent clients from tuning into a channel in the middle of an existing transmission. The potential problem is that a client abruptly tuning into a channel might not be able to detect that a transmission is under way—carrier sense mechanisms that are based on detecting the preamble will miss out, and thus produce a false reading of a clear channel. Thus, if the client were then to send a probe request, the client could very well destroy the ongoing transmission and lose out on the access points' seeing the probe request, because of a collision. As it turns out, many voice clients set that probe delay to a trivial value, in order to not have to wait. But the common value for that delay is 12ms, which is a long time in the world of voice. Passive scanning is worse. Most access points send their beacons every 102.4ms, or as close as they can get. This means that a client who tunes to a channel has a good chance of having to wait 50ms just to get a beacon, and may have to wait the entire 100ms in the worst case, for just that one access point.

The timescale that dominates, for voice mobility, is the voice packet arrival interval. Normally, that value is 20ms (though it can be 30ms in some cases). A client will usually want to get all of the scanning it can get done in those 20ms, so that it can return to its original channel and not miss the next voice packet. Certainly, the client will not want to take 100ms unless it has to, because 100ms is a long enough jitter that it can be quite noticeable. Again, this tends to make active scanning the choice for voice clients, who are always in a hurry to learn about new access points.
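To put the numbers from the last two paragraphs side by side, here is a small worked calculation. It assumes the client arrives on the channel at a uniformly random point in the beacon cycle, and it ignores channel-switch time; the variable names are illustrative.

```python
BEACON_INTERVAL_MS = 102.4  # default beacon interval
PROBE_DELAY_MS = 12.0       # common probe delay
VOICE_INTERVAL_MS = 20.0    # usual voice packet arrival interval

# Passive scan: a client arriving at a random point in the beacon cycle
# waits half an interval on average and a full interval at worst.
passive_mean_wait_ms = BEACON_INTERVAL_MS / 2
passive_worst_wait_ms = BEACON_INTERVAL_MS

# Active scan: time left to hear probe responses if the whole visit
# must fit between two voice packets.
active_listen_budget_ms = VOICE_INTERVAL_MS - PROBE_DELAY_MS

# How many voice packets arrive during a worst-case passive wait.
voice_packets_at_risk = int(passive_worst_wait_ms // VOICE_INTERVAL_MS)

print(passive_mean_wait_ms, active_listen_budget_ms, voice_packets_at_risk)
```

With the common 12ms probe delay, only 8ms of each 20ms interval is left for listening, which helps explain why many voice clients set the delay to a trivial value.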

If the client is going to scan between the voice packets, then the client's ability to scan will probably be limited to one channel at a time. When limited this way, the client may take up to a second, easily, to scan every possible channel. There are 11 channels in 2.4GHz, 9 non-DFS channels in 5GHz, and 11 more in the DFS bands, for a total of 31 channels to scan (or 23 channels if clients make the assumption that service is provided only on channels 1, 6, and 11 in the 2.4GHz band). Of course, scanning is also a battery-intensive process, and so a client may choose to spread out the scanning activity over time.
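The channel arithmetic above can be checked with a short calculation, assuming one channel is visited per 20ms voice interval. The function and constant names are illustrative.

```python
CHANNELS_24GHZ = 11
CHANNELS_24GHZ_REDUCED = 3   # channels 1, 6, and 11 only
CHANNELS_5GHZ_NON_DFS = 9
CHANNELS_5GHZ_DFS = 11
VOICE_INTERVAL_MS = 20


def sweep_time_ms(channels: int, interval_ms: int = VOICE_INTERVAL_MS) -> int:
    """Time to visit every channel at one channel per voice interval."""
    return channels * interval_ms


all_channels = CHANNELS_24GHZ + CHANNELS_5GHZ_NON_DFS + CHANNELS_5GHZ_DFS
reduced_channels = (CHANNELS_24GHZ_REDUCED + CHANNELS_5GHZ_NON_DFS
                    + CHANNELS_5GHZ_DFS)

print(all_channels, sweep_time_ms(all_channels))          # 31 channels, 620 ms
print(reduced_channels, sweep_time_ms(reduced_channels))  # 23 channels, 460 ms
```

At 30ms voice intervals the full sweep grows to 930ms, which is where the "up to a second" figure comes from, before any channel-switch settling time is added.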

Furthermore, the process of changing channels is not always instantaneous. Depending on the radio chip vendor, some clients will have to wait through a multimillisecond radio settling and configuration time, reprogramming the various aspects of the radio in order to ensure proper transmission on the new channel. This adds additional padding time to the individual scanning channel transitions.

Overall, this scanning delay is a major source of handoff delays, and some methods for reducing the scanning time have been created, which we will examine shortly.

The Scanning Table

Let's look at the scanning table in a bit more detail. This table is primarily a list of access point addresses (BSSIDs), and the parameters that the access point advertises. The 802.11 standard lists at least some parameters that may be useful to hold in the client's scanning table, as in Table 1.

Table 1: Scanning table contents from 802.11
Field	Meaning
BSSID	The Ethernet address of the access point's service for this SSID
SSID	The SSID text string
BSS Type	Whether the access point is a real access point, or an ad hoc device
Beacon Period	Time between beacons, in time units (TU) of 1,024 microseconds each
DTIM Period	How many beacons must go by before broadcast/multicast frames are sent
Timestamp	The time the last beacon or probe response was scanned for this client
Local Time	The value of the access point's time counter
Physical Parameters	What type of radio the access point is using, and how it is configured
Channel	The channel of the access point
Capabilities	The capabilities the access point advertises in the Capabilities field
Basic Rate/MCS Set	The minimum rates (and MCS for 802.11n) that this client must support to gain entry
Operational Rate/MCS Set	The allowed rates (and MCS for 802.11n) that this client can use once it associates
Country	The country and regional information for the radio
Security Information	The required security algorithms
Load	How loaded the access point reports itself to be
WMM Parameters	The WMM parameters that the client must use once it associates
Other Information	Depends on the standards that the client and access point supports

This table contains the fields taken from the access point's beacons and probe responses. Most of the information is necessary for the client to possess before it can associate, because this information contains parameters that the client needs to adopt upon association. By looking at this table, clients can easily see which access points have the right SSID, but will not allow the client to associate. Examples are access points that require a higher grade of security than the client is configured for, or require a more advanced radio (such as 802.11n) than the client supports. Most of the time, however, a properly configured network will not advertise anything that would prevent a properly configured client from entering.
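As a rough sketch of how a client might use these advertised fields, the following keeps a handful of the Table 1 entries per access point and filters out those the client could not join. The class names, the simplified string labels for security, and the `requires_11n` flag are all assumptions made for this example; real beacons carry far more detailed information elements.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ScanEntry:
    """A small subset of the advertised fields for one access point."""
    bssid: str
    ssid: str
    channel: int
    security: str               # simplified label, e.g. "open" or "wpa2-enterprise"
    requires_11n: bool = False  # stands in for the basic rate/MCS requirements


@dataclass(frozen=True)
class ClientConfig:
    ssid: str
    supported_security: FrozenSet[str]
    supports_11n: bool


def associable(table: List[ScanEntry], client: ClientConfig) -> List[ScanEntry]:
    """Entries with the right SSID whose requirements the client can meet."""
    return [entry for entry in table
            if entry.ssid == client.ssid
            and entry.security in client.supported_security
            and (client.supports_11n or not entry.requires_11n)]


# Hypothetical scan results.
table = [
    ScanEntry("02:00:00:00:00:a1", "voice", 36, "wpa2-enterprise"),
    ScanEntry("02:00:00:00:00:a2", "voice", 149, "wpa2-enterprise", requires_11n=True),
    ScanEntry("02:00:00:00:00:a3", "guest", 6, "open"),
]
phone = ClientConfig("voice", frozenset({"wpa2-enterprise"}), supports_11n=False)
print([entry.bssid for entry in associable(table, phone)])
```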

In addition to all of this mostly static, configuration information that the access point reports, clients may collect other information that they may themselves find useful when deciding to which access point they should associate. This information is unique to the client, based on environmental factors. Generally, this information (not that in Table 1) is far more important in determining how a client chooses where to hand off or associate to. Table 2 contains some more frequent examples of information that different clients may choose to collect. Again, there is no standard here; clients may collect whatever information they want. Roughly, the information they collect is divided into two types: information observed about the access point, and information observed about the channel the access point is on. This split is necessary, because clients have to choose which channel to use as a part of choosing which access point to associate to. Properties like noise floor or observed over-the-air activity belong to the channel at the point in place and time that the client is in. On the other hand, some properties belong directly to the access point without regard to channel, such as the power level at which the client sees the access point's beacon frames. Furthermore, some of the per-access-point information may have been collected from previous periods when the client had been associated to that access point, and measured the quality of the connection.

Table 2: Other possible scanning table contents
Field	Meaning
Signal Strength	The power level of the beacon or probe response from the access point
Channel Noise	The measured noise floor value on the channel the access point is on
Channel Activity	How often the channel the access point is on is busy
Number of Observed Clients	How many clients are on the channel the access point is on
Beacon Loss Rate	How often beacons are missed on that channel, even though they are expected
Probe Request Loss Rate	How many times probe requests had to be sent to get a probe response
Previous Data Loss Rate	If associated earlier, how much loss was present between the access point and client
Probe Request Needed	Whether the client needed to send a probe request

The scanning table is something that the client maintains over time, as a fluid, "living" menu of options. One of the challenges the client has is in determining how old, or stale, the information may be—especially the performance information—and whether it has observed that channel or access point long enough to have some confidence in what it has seen. This is a constant struggle, and different clients (even different software versions from the same client vendor) can have widely different ways of judging how much of the table to trust and whether it needs to get new information. This is one of the sources of the variability present in Wi-Fi.
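One simple way a client could handle staleness is to ignore entries it has not refreshed recently and rank the rest by signal strength, as in this sketch. As the text notes, there is no standard here: the five-second age limit, the use of signal strength alone for ranking, and all names below are assumptions of this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """Client-side measurements for one access point (compare Table 2)."""
    bssid: str
    signal_dbm: float
    last_seen_s: float  # time of the last beacon or probe response, in seconds


def best_candidate(table: List[Observation], now_s: float,
                   max_age_s: float = 5.0) -> Optional[Observation]:
    """Strongest access point among entries refreshed within max_age_s."""
    fresh = [obs for obs in table if now_s - obs.last_seen_s <= max_age_s]
    return max(fresh, key=lambda obs: obs.signal_dbm, default=None)


# Hypothetical table: the strongest entry is also the stalest.
observations = [
    Observation("02:00:00:00:00:a1", -55.0, last_seen_s=90.0),
    Observation("02:00:00:00:00:a2", -67.0, last_seen_s=99.0),
    Observation("02:00:00:00:00:a3", -72.0, last_seen_s=98.5),
]
choice = best_candidate(observations, now_s=100.0)
print(choice.bssid if choice else None)
```

A tighter age limit gives more trustworthy choices but forces more scanning, which is the same battery and call-quality tradeoff discussed earlier.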

The Difference between Network Assistance and Network Control

If you have read the sections on cellular handoff, you'll know that there are broadly two different methods for phone handoffs to occur. The first method, network control, is how the network determines when the phone is to hand off and to which base station the phone is to connect. In this method, the mobile phone may participate by assisting in the handoff process, usually by providing information about the radio environment. The second method, network assistance, is where the network has the ability to provide that assistance, but the mobile phone is fundamentally the device that decides.

For transitions across basic service sets (BSSs) in Wi-Fi, the client is in control, and the network can only assist. Why is this? An early design decision in Wi-Fi was made, and the organization broke away from the comparatively long history of cellular networking. In the early days of Wi-Fi, each cell was unmanaged. An access point, compared to a client, was thought of as the dumber of the two devices. Although the access point was charged with operating the power saving features (because it is always plugged in), the client was charged with making sure the connection to the network stayed up. If anything goes wrong and a connection drops, the client is responsible for searching out one of any number of networks the client might be configured to connect to, and the network needs to learn about the client only at that point. It makes a fair amount of sense. Cellular networks are managed by service providers, and the force of law prevents people from introducing phones or other devices that are not sanctioned and already known about by the service provider. Therefore, a cell phone could be the slave in the master/slave relationship. On the other hand, with Wi-Fi putting the power of the connection directly into the hands of the client, the network never needs to have the client be provisioned beforehand, and any device can connect. In many ways, this fact alone is why Wi-Fi holds its appeal as a networking technology: just connect and go, for guest, employee, or owner.

This initial appeal, and the tremendous simplicity that comes with it, has its downsides and is quickly meeting its limitations. Cellular phones, being managed entities, never require the user to understand the nature of the network. There are no SSIDs, no passphrases to enter. The phone knows what it is doing, because it was built and provisioned by the service provider to do only that. It simply connects, and when it doesn't, the screen shows it and users know to drive around until they find more bars. But in Wi-Fi, as long as the handset owns the process of connecting, these other complexities will always exist.

Now, you might have noticed that SSIDs and passwords have to do only with selecting the "service provider" for Wi-Fi, and once the user has that down (which is hopefully only once, so long as the user is not moving into hotspots or other networks), the real problem is with the BSSID, or the actual, distinct identities of each cell. That way of thinking has a lot to it, but misses one point. The Wi-Fi client has no way of knowing that two access points, even with the same SSID, belong to the same "network." In the original Wi-Fi, there is not even a concept of a "network," as the term is never used. Access points exist, and each one is absolutely independent. No two need to know about each other. As long as some Ethernet bridge or switch sits behind a group of them, clients can simply pass from one to the other, with no network coordination. This is what I mean, then, by client control. In this view of the world, there really is no such thing as a handoff. Instead, there is just a disconnection. Perhaps the client will decide to reconnect with some access point after it disconnects from the first. Perhaps this connection will even be quick. Or perhaps it will require the user to do something to the phone first. The original standards remain silent, as would have phones, had the process not been improved a bit.

Network assistance can be added into this wild-west mixture, however. This slight shift in paradigm by the creators of the Wi-Fi and IEEE standards is to give the client more information, providing it with ways of knowing that two access points might belong to the same network, share the same backend resources, and even be able to perform some optimizations to reduce the connection overhead. This shift doesn't fundamentally change the nature of the client owning the connection, however. Instead, the client is empowered with increasingly detailed information. Each client, then, is still left to itself to determine what to do and when to do it. It is an article of faith, if you will, that how the client determines what to do is "beyond the scope of the standard," a phrase in the art meaning that client vendors want to do things their own way. The network is just a vessel—a pipe for packets.

You'll find, as you explore voice mobility deployments with Wi-Fi as a leg, that this way of thinking is as much the problem as it is a way to make things simple. Allowing the client to make the choice is putting the steering wheel of the network—or at least, a large portion of the driving task—in the hands of hundreds of different devices, each made by its own manufacturer in its own year, with its own software, and its own applications. The complexity can become overwhelming, and the more successful voice mobility networks find the right combinations of technologies to make that complexity manageable, or perhaps to make it go away entirely.

Inter-Access Point Handoffs

In a voice mobility network with Wi-Fi as a major component, we have to look at more than just the voice quality on a particular access point. The end-user of the network, the person with a phone in hand, has no idea where the access points are. He or she just walks around the building, going in and out of range of various access points in turn, oblivious to the state of the underlying wireless network. All the while, the user demands the same high degree of voice quality as if he or she had never started moving.

So now, we have to turn our focus towards the handoff aspect of Wi-Fi voice networks. Looking back on how Wi-Fi networks are made of multiple cells of overlapping coverage, we can see that the major problems with voice are going to come from four sources:

How well the coverage extends through the building
How well the phone can detect when it is exiting the coverage of one access point
How well the phone can detect what other options (access points) it has available
How quickly the phone can make the transition from the old access point to the new one

Let's try to gain some more appreciation of this problem. Figure 1 shows the wireless environment that a mobile phone is likely to be dwelling within.

Figure 1: The Handoff Environment

As the caller and the mobile phone move around the environment, the phone goes into range and out of range of different access points. At any given time, the number of access points that a client can see, and potentially connect to, can be on the order of a dozen or more in environments with substantial Wi-Fi coverage. The client's task: determine whether it is far enough out of range of one access point that it should start the potentially disruptive process of looking for another access point, and then make the transition to a new access point as quickly as possible. The top part of Figure 1 shows the phone zigzagging its way through a series of cells, each one from an access point on a different channel. Looking at the same process from the point of view of the client (who knows only time), you can see how the client sees the ever-varying hills and valleys of the differing access points' coverage areas. Many are always in range; hopefully, only one is strong at a time.
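The client's two-part decision described above (when to start looking, and where to go) can be sketched in a few lines. This is only an illustration: how a real client decides is vendor-specific, and the threshold values, the `ApSample` type, and both function names below are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds, chosen only for illustration; real clients use
# vendor-specific values and logic that the standard leaves unspecified.
SCAN_THRESHOLD_DBM = -70   # start looking for alternatives below this level
ROAM_MARGIN_DB = 8         # a candidate must beat the current AP by this much


@dataclass
class ApSample:
    bssid: str
    rssi_dbm: int


def should_scan(current: ApSample) -> bool:
    """Scanning is disruptive, so begin only once the current signal has faded."""
    return current.rssi_dbm < SCAN_THRESHOLD_DBM


def pick_roam_target(current: ApSample,
                     candidates: List[ApSample]) -> Optional[ApSample]:
    """Return the strongest candidate that clears the margin, or None to stay put."""
    better = [ap for ap in candidates
              if ap.bssid != current.bssid
              and ap.rssi_dbm >= current.rssi_dbm + ROAM_MARGIN_DB]
    return max(better, key=lambda ap: ap.rssi_dbm, default=None)
```

The margin is there to keep the phone from bouncing between two access points of nearly equal strength, which is one common reason for building hysteresis into this kind of decision.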

The phone is a multitasker. It must juggle the processes of searching for new access points and handing off while maintaining a good voice connection. In this section, we'll go into details on the particular processes the phone must go through, and what technologies exist to make the process simpler in the face of Wi-Fi. But first, we will need to get into some general philosophy.

Active Voice Quality Monitoring

A large part of determining whether a voice mobility network is successful is in monitoring the voice quality for devices on the network. When the network has the capability to measure this for the administrator on an ongoing basis, the administrator is able to devote attention to other, more pressing matters.

Active voice quality monitoring comes in a few flavors. SIP-based schemes are capable of determining when there is a voice call. This is often used in conjunction with SIP-based admission control. Because SIP calls generally use RTP as the bearer protocol to carry the actual voice, SIP-based call monitoring schemes can measure the loss and delay rates for the RTP traffic that makes up the call, and report back on whether there are phones with suffering quality. In these monitoring tools, call quality is measured using the standard MOS or R-value metrics.
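To make the link between measured loss and delay and the R-value and MOS metrics concrete, here is a minimal sketch. The R-to-MOS conversion follows the mapping in ITU-T G.107. Everything else is an assumption for illustration: the base value of 93.2, the simplified piecewise-linear delay impairment, and the codec parameters `ie` and `bpl` (set here to values often quoted for G.711 without loss concealment) are a commonly used reduction of the E-model, not what any particular monitoring tool implements.

```python
def r_to_mos(r: float) -> float:
    """Map an R-value to an estimated MOS using the ITU-T G.107 conversion."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def estimate_r(one_way_delay_ms: float, loss_pct: float,
               ie: float = 0.0, bpl: float = 4.3) -> float:
    """Simplified R-value from one-way delay and random packet loss.

    Assumptions (not from the text): base R of 93.2, a piecewise-linear
    delay impairment, and codec constants ie/bpl for G.711 without
    packet loss concealment.
    """
    delay_impairment = 0.024 * one_way_delay_ms
    if one_way_delay_ms > 177.3:
        delay_impairment += 0.11 * (one_way_delay_ms - 177.3)
    loss_impairment = ie + (95.0 - ie) * loss_pct / (loss_pct + bpl)
    return 93.2 - delay_impairment - loss_impairment


if __name__ == "__main__":
    for delay_ms, loss in ((20, 0.0), (50, 1.0), (150, 3.0)):
        r = estimate_r(delay_ms, loss)
        print(f"delay={delay_ms} ms loss={loss}% -> R={r:.1f} MOS={r_to_mos(r):.2f}")
```

The point of the sketch is the shape of the relationship: a small amount of packet loss costs far more quality than a comparable increase in delay, which is why loss rates are the first thing these monitoring tools report.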

SIP-based schemes can be found in a number of different manifestations. Wireline protocol analyzers are capable of listening in on a mirror port, entirely independent of the wireless network, and can report on upstream loss. Downstream loss, however, cannot be detected by these wireline mechanisms. Wireless networks themselves may offer built-in voice monitoring tools. These leverage the SIP-tracking functions already used for firewalling and admission control, and report on the quality as measured by both uplink and downlink loss. Purely wireless monitoring tools that monitor voice quality can also be employed. Either located as software on a laptop, or integrated into overlay wireless monitoring systems, these detect the voice quality using over-the-air packet analysis. They infer the uplink and downlink loss rates of the clients, and use this to build out the expected voice quality. Depending on the particular vendor, these tools can be thrown off when presented with WPA- and WPA2-encrypted voice traffic, although that can sometimes be worked around.

Voice call quality may also be monitored by measurements reported by the client or other endpoint. RTCP, the RTP Control Protocol, may be transmitted by the endpoints. RTCP is able to encode statistics about the receiver, and these statistics can be used to infer the expected quality of the call. RTCP may or may not be available in a network, based on the SIP implementation used at the endpoints. Where available, RTCP encodes the percentage of packets lost, the cumulative number of packets lost, and interarrival packet jitter, all of which are useful for inferring call quality. At a lower layer, 802.11k, where it is supported, provides for the notion of traffic stream metrics. These metrics also provide for loss and delay, and may also be used to determine call quality. However, 802.11k requires upgrades to the client and access point firmware, and so is not as prevalent as RTCP, and nowhere near as simple to set up as overlay or traffic-based quality measurements.
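The three RTCP statistics mentioned above travel in a fixed 24-byte reception report block, whose layout is defined in RFC 3550. The sketch below decodes one such block. The function name and the returned dictionary keys are made up for this example, and the 8000 Hz clock rate is an assumption that suits narrowband codecs such as G.711; jitter is carried in RTP timestamp units, so the clock rate is needed to convert it to milliseconds.

```python
import struct


def parse_report_block(block: bytes, clock_rate_hz: int = 8000) -> dict:
    """Decode one 24-byte RTCP reception report block (RFC 3550)."""
    if len(block) != 24:
        raise ValueError("a reception report block is exactly 24 bytes")
    # Six 32-bit big-endian words: SSRC, loss word, highest sequence number,
    # interarrival jitter, last SR timestamp, delay since last SR.
    ssrc, loss_word, ext_seq, jitter, _lsr, _dlsr = struct.unpack("!IIIIII", block)
    fraction = loss_word >> 24             # 8-bit fixed point, units of 1/256
    cumulative = loss_word & 0xFFFFFF      # 24-bit signed packet count
    if cumulative & 0x800000:
        cumulative -= 1 << 24
    return {
        "ssrc": ssrc,
        "loss_pct": 100.0 * fraction / 256.0,
        "cumulative_lost": cumulative,
        "highest_seq": ext_seq,
        "jitter_ms": 1000.0 * jitter / clock_rate_hz,
    }
```

The cumulative count is signed because duplicate packets can push the number of packets received above the number expected.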

^[*]Of course, there had to be a catch. Some devices can carry two calls simultaneously, if they renegotiate their one admitted traffic stream to take the capacity of both. Because WMM Admission Control views flows as being only between clients and access points, the ultimate other endpoint of the call does not matter. However, this is not something you would expect to see in practice.

Spectrum Management

Spectrum management is the technology used by virtualization architectures to manage the available wireless resources. Unlike radio resource management, which is focused on adjusting the available wireless resources on a per-access-point basis, to ensure that the clients of that access point receive reasonable service without regard to the neighbors, spectrum management takes a view of the entire unlicensed Wi-Fi spectrum within the network, and applies principles of capacity management to the network to organize and optimize the layout of channel layers. In many ways, spectrum management is radio resource management, applied to the virtualized spectrum, rather than individual radios.

Spectrum management focuses on determining which broad swaths of unlicensed spectrum are adequate for the network or for given applications within the network. One advantage of channel layering is that channels are freed from being used to avoid interference, and thus can be used to divide the spectrum up by purposes. Much as regulatory bodies, such as the FCC, divide up the entire radio spectrum by applications, setting aside one band for radio, another for television, some for wireless communications, and so on, administrators of virtualization architectures can use spectrum management to divide up the available channels into bands that maximize the mutual capacity between applications by separating out applications with the highest likely bandwidth needs onto separate channel layers.

The constraints of spectrum management are fairly simple. A deployment has only a given number of access points. The number and position of the access points limit the number of independent channel layers that can be provided over given areas of the wireless deployment area. It is not necessary for every channel layer to extend across the entire network; in fact, channel layers are often created more in places with higher traffic density, such as libraries or conference centers. The number of channel layers in a given area is called the network thickness. Spectrum management can detect the maximum number of channel layers that can be created given the current deployment of access points, and is then able to determine when to create multiple layers by spreading channel assignments of close access points, or when to maximize signal strength and SNR by setting close access points to the same channel. Thus, spectrum management can determine the appropriate thickness for each given square foot. For 802.11n networks, spectrum management is able to work with channel widths, as well as band and channel allocations, and is thus able to make very clear decisions about doubling capacity by arranging channels as needed.

Spectrum management also applies the neighboring-interference-avoidance aspects that RRM uses to prevent adjacent networks from being deployed in the same spectrum, if it can at all be avoided. Because there is no per-channel performance compromise in compressing the thickness of the network, spectrum management can avoid some of the troublesome aspects of radio resource management when dealing with edge effects from multiple, independent networks. Furthermore, spectrum management is not required to react to transient interference, as the channel layering mechanism is already better suited to handle transient changes through RF redundancy. This allows spectrum management to reserve network reconfigurations for periods of less network usage and potential disruption (such as night), or to make changes at a deliberate pace that ensures network convergence throughout the process.

Voice-Aware Radio Resource Management

The concept of voice-aware radio resource management is to build upon the measurements used for determining network capacity and topology, and integrate them into the decision-making process for dynamic microcell architectures.

Basic radio resource management is more concerned with establishing minimum levels of coverage while avoiding interference from neighboring access points and surrounding devices. This is more suitable for data networks. Voice-aware RRM shifts the focus towards providing a more consistent coverage that voice needs, often adjusting the nature of the RRM process to avoid destroying active voice calls. Voice-aware RRM is a crucial leg of voice mobility deployments based on microcell technology. (Layered or virtualized deployments do not use the same type of voice-aware RRM, as they have different means of ensuring high voice quality and available resources.)

The first aspect of voice-aware radio resource management is ironically to disable radio resource management. Radio resource management systems work by the access points performing scanning functions, rather similar to those performed by clients when trying to hand off. The access point halts service on a channel, and then exits the channel for a short amount of time to scan the other channels to determine the power levels, identities, and capacities of neighboring access points. These neighboring access points may be part of the same network, or may belong to other interests and other networks. Unlike with client scanning, in which the client can go into power save to inform the access point to buffer frames, however, access point scanning has no good way for clients to be told to buffer frames. Moreover, whereas client scanning can go off channel between the packets of the voice call, only to return when the next packet is ready, an access point with multiple voice calls will likely not have any available time to scan in a meaningful way. In these cases, scanning needs to be disabled. In RRM schemes without voice-aware services, administrators often have to disable RRM by hand, thus nullifying the RRM benefits for the entire network. Voice-aware RRM, however, has the capability to turn off scanning on a temporary basis for each access point, when the access point is carrying voice traffic. There are unfortunately two downsides to this. The first is that RRM is necessary for voice networks to ensure that coverage holes are filled and that the network adapts to varying density. Disabling the scanning portion of RRM disables RRM, effectively, and so voice-aware RRM scanning works best when each given access point does not carry voice traffic for uninterrupted periods of time. Second, RRM scanning is usually the same process by which the access points scan for wireless security problems, such as rogue access points and various intrusions. Disabling scanning in the presence of voice leaves access points with voice more vulnerable, which is unacceptable for voice mobility deployments. Here, the solution is to deploy dedicated air monitors, either as additional access points from the same network vendor, but set to monitor rather than serve, or from a dedicated WLAN security monitoring vendor, as an independent overlay solution.

The second aspect of voice-aware radio resource management is in using coverage hole detection and repair parameters that are more conservative. Although doing so increases the likelihood of co-channel interference, which can have a strong downside for voice mobility networks as the network scale and density grow, it is necessary to ensure that the radio resource management algorithms for microcells do not leave coverage holes unrepaired. Coverage holes disproportionately affect the quality of voice traffic compared with data traffic. Increasing the coverage hole parameters ensures that these coverage holes are reduced. Radio resource management techniques often detect the presence of a coverage hole by inferring it from the behavior of a client. RRM assumes that the client is choosing to hand off from an access point when the loss becomes too high. When this assumption is correct, the access point will infer the presence of a coverage hole by noticing when the loss rate for a client increases greatly for extended periods of time. This is used as a trigger that the client must be out of range, and informs the access point to increase its power levels. It is better for voice mobility networks for the coverage levels to be increased prior to the voice mobility deployment, and then for the coverage hole detection algorithm to be made less willing to reduce coverage levels. Unfortunately, the coverage hole detection algorithms in RRM schemes are proprietary, and there are no settings that are consistent from vendor to vendor. Consult your microcell wireless network manufacturer for details on how to make the coverage hole detection algorithm more conservative.
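The inference described above, in which a client whose loss rate stays high for an extended period is taken as a sign of a coverage hole, can be sketched as a simple run-length check. Because the real algorithms are proprietary, this is purely illustrative: the function name, the 20 percent loss threshold, and the five-sample window are all invented for the example and do not correspond to any vendor's settings.

```python
from typing import Iterable


def infer_coverage_hole(loss_samples: Iterable[float],
                        loss_threshold: float = 0.2,
                        min_consecutive: int = 5) -> bool:
    """Return True if loss stays above the threshold for a sustained run.

    loss_samples holds per-interval loss rates (0.0 to 1.0) for one client.
    Both tuning values are hypothetical.
    """
    run = 0
    for loss in loss_samples:
        # A single good interval resets the run, so brief fades do not trigger.
        run = run + 1 if loss > loss_threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

In a sketch like this, lowering `loss_threshold` or `min_consecutive` makes the detector quicker to conclude that a client is out of range, and therefore quicker to ask for more power.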

The final aspect of voice-aware RRM is for when proprietary extensions are used by the voice client and are supported by the network. These extensions can provide some benefit to microcell deployments, as they allow the network to alter some of the tuning parameters that clients use to hand off. Unfortunately, the aspects of voice-aware radio resource management trade off between coverage and quality of service, and so operating these networks can become a challenge, especially as the density or proportion of network use of voice increases. Monitoring tools for voice quality are especially important in these networks.

Power Control

Power control, also known as transmit power control (TPC), is the ability of the client or the access point to adjust its transmit power levels for the conditions. Power control comes in two flavors with two different purposes, both of which can help and hurt a voice mobility network. The first, most common flavor of power control is vested in the client. This TPC exists to allow the client to increase its battery life. When the client is within close range to the access point, transmitting at the highest power level and data rate may not be necessary to achieve a similar level of voice performance. Especially as the data rates approach 54 Mbps for 802.11a and 802.11g, or higher for 802.11n, the preamble and per-packet backoff overhead become comparable to the over-the-air resource usage of the voice data payload itself. For example, the payload of a voice packet at the higher data rates reduces to around 20 microseconds, on par with the preamble length for those data rates. In these scenarios, it makes sense for the client to back off on its power levels and turn off portions of the radio concerned with the more processing-intensive data rates, to extend battery life while in a call. To do this, the client will just directly reduce its transmit power levels, as a part of its power saving strategy. This mechanism can be used for good effect within the network, as long as the client is able to react to an increase in upstream data loss rates quickly enough to restore power levels should the client have turned power levels down too low for the range, or if increasing noise begins to permeate the channel.
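The comparison between payload time and preamble time is easy to check with a little arithmetic. The sketch below assumes a G.711 call with 20 ms packetization, which gives 160 bytes of voice per packet; that codec choice is an assumption for the example, not something the text specifies. The 20 microsecond figure for the OFDM preamble is the 16 microsecond PLCP preamble plus the 4 microsecond SIGNAL field used by 802.11a and 802.11g.

```python
OFDM_PREAMBLE_US = 20.0    # 16 us PLCP preamble + 4 us SIGNAL field (802.11a/g)
VOICE_PAYLOAD_BYTES = 160  # assumed: G.711 at 64 kbps with 20 ms packets


def payload_airtime_us(payload_bytes: int, rate_mbps: float) -> float:
    """Time on air for the payload bits alone, ignoring symbol padding."""
    return payload_bytes * 8 / rate_mbps


if __name__ == "__main__":
    for rate in (6, 24, 54):
        t = payload_airtime_us(VOICE_PAYLOAD_BYTES, rate)
        print(f"{rate:>2} Mbps: payload {t:6.1f} us, preamble {OFDM_PREAMBLE_US:.0f} us")
```

With these assumptions the payload at 54 Mbps takes roughly 24 microseconds, about the same as the preamble, while at 6 Mbps the payload dominates by an order of magnitude.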

The other TPC is vested within the network. Microcell networks, specifically, use access point TPC to reduce the amount of co-channel interference without having to relocate or disable access points. By reducing power levels, cell sizes in every direction are reduced, keeping in line with the goals of microcell. Reducing co-channel interference is necessary within microcell networks, to allow a better isolation of cells from fluctuations in their neighboring cells, especially those related to the density of mobile clients.

Network TPC has some side effects, however, that must be taken into account for voice mobility deployments. The greatest side effect is the lack of predictability of coverage patterns for the access points. This can have a strong effect on the quality of voice, because voice is more sensitive than data to weak coverage, and areas where voice performs poorly can come and go with the changing power levels of both the access point the client is associated to and its neighbors. Unfortunately, power levels in microcell networks usually fluctuate on the order of a few seconds or a few minutes, especially when clients are associated, as the network tries to adapt its coverage area to keep the increase in packet rate and traffic caused by the clients from affecting neighboring cells. Site surveys, which are performed to determine the coverage levels of the network, are always snapshots in time and cannot take TPC into account. However, the TPC variation is necessary for proper microcell operation, and unfortunately needs to happen when phones are associated and in calls. Therefore, it can cause a strong network-induced variation in call quality. It is imperative, in microcell deployments, for the coverage and call quality to be continuously monitored, to ensure that the TPC algorithms are behaving properly. Follow the manufacturer's recommendations, as you may find in Voice over WLAN design guides, to ensure that problems can be detected and handled accordingly.

Understanding the Balance | Load Balancing

Explicit in the concept of load balancing is that it is actually possible to balance load—that is, to transfer load from one access point to another in a predictable, meaningful fashion. To understand this, we need to look at how the load of a call behaves from one access point to another, assuming that neither the phone nor the access points have moved.

Picture the environment in Figure 1. There are two access points, the first one on channel 1 and the second on channel 11. A mobile phone is between the two access points, but physically closer to access point 1. The network has two choices to distribute, or balance, the load. The network can try to guide the phone into access point 1, as shown in the top of the figure, or it can try to guide the phone into access point 2, as shown in the bottom of the figure.

Figure 1: Load Balancing across Distances

This is the heart of load balancing. The network might choose to have the phone associate to access point 2. We can imagine that access point 1 is more congested—that is, it has more phone calls currently on it. In the extreme case, access point 1 can be completely full, and might be unable to accept new calls. The advantage of load balancing is that the network can use whatever information it sees fit—usually, loads—to guide clients to the right access points.

There are a few wrinkles, however. It is extremely unlikely that, in a non-channel-layered environment, the phone is at the same distance from each of the two access points. It is more likely that the phone is closer to one access point than another. The consequence of the phone being closer to an access point is that it can get higher data rates and SNR, which then allows it to take less airtime and fewer resources. It may turn out that, if the network chooses to move the phone from access point 2 to access point 1, the increase in data rate from the phone being closer to access point 1 frees room for possibly two calls on access point 2. In this case, the same call produces unequal load when it is applied to different access points, all else being equal.
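A rough airtime model shows why the same call weighs differently on different access points. Every number here is an assumption for illustration: a 200-byte frame (160 bytes of G.711 voice plus 40 bytes of RTP, UDP, and IPv4 headers), 100 frames per second (50 in each direction), and a flat 100 microseconds of per-frame overhead standing in for preamble, interframe spacing, backoff, and acknowledgment. The function name is likewise hypothetical.

```python
def call_airtime_fraction(rate_mbps: float,
                          frame_bytes: int = 200,
                          frames_per_s: int = 100,
                          overhead_us: float = 100.0) -> float:
    """Fraction of channel time one bidirectional call consumes (rough model)."""
    per_frame_us = overhead_us + frame_bytes * 8 / rate_mbps
    return frames_per_s * per_frame_us / 1_000_000


if __name__ == "__main__":
    near = call_airtime_fraction(54)   # phone close to its access point
    far = call_airtime_fraction(6)     # same phone, served from farther away
    print(f"near AP: {near:.2%} of airtime, far AP: {far:.2%} of airtime")
```

Under these assumptions the call uses about 1.3 percent of the channel at 54 Mbps and about 3.7 percent at 6 Mbps, so placing it on the farther access point costs nearly three times as much airtime for exactly the same conversation.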

For this reason, within-band load balancing has serious drawbacks for networks that do not use channel layering. Load balancing should be thought of as a way to distribute load across equal resources, but within-band load balancing tends to work rather differently and can lead to performance problems. If the voice side of the network is lightly used—such as having a small CAC limit—and if the impact of voice on data is not terribly important, then this sort of load balancing can work to ease the rough edges on networks that were not provisioned properly. However, for more dense voice mobility networks, we need to look further. The concept of load balancing among near equals does exist, however, with band load balancing. Band load balancing can be done when the phones support both the 2.4 GHz and 5 GHz bands (some newer ones do) and the access points are dual-radio, having one 2.4 GHz radio and another 5 GHz radio in the same access point. In this case, the two choices are collocated: the client can get similar SNR and data rates from either radio, and the choice is much closer to one-to-one. Figure 2 illustrates the point.

Figure 2: Load Balancing across Bands

A variant of band load balancing is band steering. With band steering, the access point is not trying to achieve load balancing across the two bands, but rather is prioritizing access to one band over the other, usually prioritizing access to the 5 GHz band for some devices. The notion is to help clear out traffic from certain devices, such as trying to dedicate one band for voice and another for data. Using differing SSIDs to accomplish the same task is also possible, and works across a broader range of infrastructures.

There are differences between the two bands, of course, most notably that the 5 GHz band does not propagate quite as far as the 2.4 GHz band. The 5 GHz band also tends to be unevenly accessed by multiband phones: sometimes, the phones will avoid the 5 GHz band unless absolutely forced to go there, leading to longer connection times. On the other hand, the 2.4 GHz band is subject to more microwave interference. And, finally, this mechanism will not work for single-band phones. (The merits of each band for voice mobility are summarized later in this chapter.) Nevertheless, band load balancing is an option for providing a more even, one-to-one balance.

For environments with even higher densities, where two channels per square foot are not enough, or where the phones support only one band or where environmental factors (such as heavy microwave use) preclude using other bands, channel layering can be employed to provide three, four, or many more choices per square foot. Channel layering exists as a benefit of the channel layering wireless architecture, for obvious reasons, and builds upon the concept of band load balancing to create collocated channel load balancing. The key to collocated channel load balancing is that the access points that are on different channels are placed in roughly the same areas, so that they provide similar coverage patterns. Because channels are being taken from use as preventatives for co-channel interference and are instead being deployed for coverage, channel layering architectures are best suited for this. In this case, the phone now has a choice of multiple channels per square foot, of roughly similar, one-to-one coverage. Figure 3 illustrates this.

Figure 3: Collocated Channel Load Balancing with Channel Layering

Bear in mind that the load balancing mechanisms are in general conflict with the client's inherent desire to gain access to whatever access point it chooses and to do so as quickly as possible (see Section 6.2.2). The network is required to choose an access point and then must ignore the client, if it should come in and attempt to learn about the nonchosen access points. This works reasonably well when the client is first powered up, as the scanning table may be empty and the client will blindly obey the hiding of access points as a part of steering the load. On the other hand, should the client already have a well-populated scanning table—as voice clients are far more likely to do—load balancing can become a time-consuming proposition, causing handoff delays and possible call loss. Specifically, what can happen is that the client determines to initiate a handoff and consults the information in its scanning table, gathered from a time when all of its entries were options, based on load. The client can then directly attempt to initiate a connection with an access point, sending an Authentication or Reassociation frame (depending on whether the client has visited the access point before) to an access point that may no longer wish to serve the client. The access point can ignore or reject the client at that point, but usually clients are far less likely to abandon an access point once they choose to associate than when they are scanning. Thus, the client can remain outside the access point, persistently knocking on the door, if you will, unwilling to take the rejection or the ignoring as an answer for possibly long periods of time. This provides an additional reason why load balancing in an environment where multiple handoffs are likely can have consequences for the quality of voice calls.
