Therefore, we turn to how predictions of voice quality can actually be made electronically. ITU-T Recommendation P.862 introduces Perceptual Evaluation of Speech Quality, the PESQ metric. PESQ is designed to take into account all aspects of voice quality, from the distortion of the codecs themselves to the effects of filtering, delay variation, and dropouts or other unusual distortions. PESQ was validated against a number of real MOS experiments to make sure that its numbers are reasonable within the range of normal telephone voices.
PESQ scores normally fall on a 1 to 4.5 scale, which is meant to line up with the 1 to 5 MOS scale, in the sense that a 1 is a 1, a 2 is a 2, and so on. (The area from 4.5 to 5 is not addressed by PESQ.)
The basic concept of PESQ is to have a piece of software or test measurement equipment compare two versions of a recording: the original one and the one distorted by the telephone equipment being measured. PESQ then returns the mean opinion score that a group of real listeners would be expected to give the distorted recording.
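To make the two-recording idea concrete, here is a minimal Python sketch of a full-reference comparison. It is not PESQ: it substitutes a much simpler stand-in measure (a clamped per-frame signal-to-noise ratio, in dB rather than MOS) purely to show the shape of the measurement, where both the original and the distorted samples go in and a single number comes out. The function name, the 80-sample frame length, and the clamping limits are all assumptions chosen for illustration.

```python
import math

def segmental_snr_db(reference, degraded, frame_len=80):
    """Toy full-reference score: mean per-frame SNR in dB (higher is better).

    NOT PESQ. It has no perceptual model and only illustrates the
    'original in, distorted in, one number out' interface.
    """
    usable = min(len(reference), len(degraded))
    frame_scores = []
    for start in range(0, usable - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - d) ** 2 for r, d in zip(ref, deg))
        if signal == 0.0:
            continue  # skip silent frames; they carry no speech to judge
        # Identical frames have zero noise, so give them the ceiling value.
        snr = 10.0 * math.log10(signal / noise) if noise > 0.0 else 35.0
        # Clamp each frame so one extreme frame cannot dominate the mean.
        frame_scores.append(max(-10.0, min(35.0, snr)))
    if not frame_scores:
        return float("nan")
    return sum(frame_scores) / len(frame_scores)

# Synthetic stand-in for an original recording (a plain tone, not speech).
reference = [math.sin(0.2 * n) for n in range(800)]
# "Equipment" that passes the signal through untouched.
clean_copy = list(reference)
# "Equipment" that adds a small interfering tone.
noisy_copy = [s + 0.1 * math.sin(1.7 * n) for n, s in enumerate(reference)]

clean_score = segmental_snr_db(reference, clean_copy)
noisy_score = segmental_snr_db(reference, noisy_copy)
print(clean_score, noisy_score)
```

The untouched copy scores at the ceiling and the noisy copy scores lower. A real PESQ implementation replaces the per-frame arithmetic with a perceptual model and reports its result on the MOS-like scale.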
PESQ uses a perceptual model of voice, much the same way as perceptual voice codecs do. The two audio samples are transformed in several stages so that they reflect known perceptual properties, such as the way human sensitivity to loudness changes with frequency (a tone at the same sound pressure level does not sound equally loud at every pitch). The samples are also matched up in time, eliminating any absolute delay, which affects the quality of a phone call but not of a recording. The speech is broken up into chunks, called utterances, which correspond to the same stretch of speech in both the original and the distorted recording. The delays and distortions are then analyzed, counted, and correlated, and a number measuring how far removed the distorted signal is from the original signal is produced. This is the PESQ score.
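The time-matching step can be illustrated with a short Python sketch. It estimates a single fixed delay by finding the lag that maximizes the cross-correlation between the two samples, then shifts the distorted sample back into line. This is only the basic idea, under simplifying assumptions: PESQ's own alignment is more elaborate and works utterance by utterance, and the function names, the synthetic signal, and the 12-sample delay here are hypothetical.

```python
import math

def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which degraded best matches reference.

    A positive lag means the degraded signal arrives that many samples late.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation at this lag, over the samples that overlap.
        score = 0.0
        for i, ref_sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += ref_sample * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def remove_delay(degraded, lag, length):
    """Shift degraded by lag so it lines up with the reference.

    The result is trimmed or zero-padded to exactly `length` samples.
    """
    if lag >= 0:
        shifted = list(degraded[lag:])
    else:
        shifted = [0.0] * (-lag) + list(degraded)
    shifted = shifted[:length]
    return shifted + [0.0] * (length - len(shifted))

# Synthetic stand-in for a recording: a decaying tone burst, not real speech.
reference = [math.sin(0.3 * n) * math.exp(-0.01 * n) for n in range(200)]
# Simulated path through the equipment: 12 samples of delay, mild attenuation.
degraded = [0.0] * 12 + [0.8 * s for s in reference]

lag = estimate_delay(reference, degraded, max_lag=40)
aligned = remove_delay(degraded, lag, len(reference))
print(lag)
```

Once the absolute delay has been removed, what remains between `reference` and `aligned` is distortion only (here, the attenuation), which is the part a recording-based score is meant to judge.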
PESQ is our first entry into the area of mathematical, or algorithmic, determination of call quality. It is good for measuring how well a new codec works, or how much noise is being injected into the sample. However, because it requires both what the talker said and what the listener heard, it is not practical for real-time quality measurements on live calls, where the original signal is not available at the point of measurement.

