NIQA Final Publish

8/9/2019 NIQA Final Publish


NIQA Non-Intrusive Voice Quality Analyzer

Sevana OY, Email: [email protected], +358 9 23164165

http://www.sevana.fi

Modern standard methods for evaluating quality of transmitted speech

Voice quality is one of the main characteristics of speech transmission systems. When analyzing voice quality, one must consider not only the audio signal degradation caused by transmission over telecom channels, but also the specifics of the speaker's voice, the condition of the listener's hearing, and how these parameters vary over time.

The best-known methods for evaluating the quality of voice transmission systems were developed by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) in the mid-1990s. The results of this work are presented in Recommendation P.800 (and P.830), Methods for subjective determination of transmission quality [1, 2]. This document describes the conditions for voice quality testing, the audio content, the scoring, and the methods for evaluating results. Typically these subjective methods are used to obtain a mean subjective quality score on a five-point scale (Mean Opinion Score, MOS).
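As a minimal illustration of how a MOS is obtained, the sketch below averages listener ratings given on the five-point scale. The function name and the sample ratings are hypothetical and are not taken from P.800, which also prescribes the test conditions around the rating.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average listener ratings given on the 1..5 opinion scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1..5 scale")
    return mean(ratings)

# Hypothetical ratings from eight listeners for one recording.
ratings = [4, 5, 3, 4, 4, 3, 5, 4]
print(mean_opinion_score(ratings))
```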

Unfortunately, tests performed according to Recommendation P.800 may lead to ambiguous results. The Recommendation itself warns against comparing MOS scores obtained under different conditions and considers such comparisons incorrect. In addition, performing tests according to P.800 takes a lot of time and requires many testers.

In order to move from subjective (MOS) scores to objective ones and to automate quality measurement, ITU-T developed Recommendation P.861, which is based on low-level quantitative measurements [3]. Recommendation P.861 builds on the PSQM method (Perceptual Speech Quality Measurement), developed by KPN Research and devoted to objective analysis of the performance of speech codecs with a low level of degradation. However, PSQM cannot be used to evaluate a real communication system because the method does not consider all the important factors influencing human perception. Among these factors are delay, jitter, and packet loss, as well as signal level clipping.

In February 2001 ITU-T issued another recommendation, ITU-T P.862 [4], which describes a more advanced algorithm for voice quality testing: PESQ (Perceptual Evaluation of Speech Quality). The algorithm includes level and time alignment, human perception modeling, and cognitive modeling. Thanks to these additional operations, the approach accounts for signal amplification or attenuation in a communication system, time delays and jitter, and the spectrum bands that are most significant for human perception. Based on cognitive modeling, PESQ also converts the objective quality score into MOS values.

A disadvantage of PESQ, as well as of other similar algorithms, is that they are based on comparing two signals: the original and the one transmitted through a communication system. This approach creates a range of difficulties in setting up and performing voice quality testing. One needs to arrange signal recording on both sides of the telecommunication system as well as transfer of the recordings to the test system. Real-time quality monitoring is also quite difficult with such an approach.

To address these issues, ITU-T developed a new recommendation, P.563 [5], introduced in May 2004. This recommendation defines an algorithm for evaluating speech quality by listening to communication sessions. The algorithm takes into account single-sided distortions, vocal tract parameters, noise, and speech naturalness. The developers of P.563 point out that it does not provide an overall quality estimate of speech transmission. Distortions caused by delay, echo, loss of loudness, and everything related to two-way interaction cannot be taken into consideration by this method.

It is widely thought that P.563 provides a high level of correlation between automated and expert quality scores. However, simple tests based on the ITU-T sound database for codec testing [6] may raise some doubts about the consistency of the algorithm implementation distributed together with its description.

Table 1. Comparison between results of P.563 and expert estimations

MOS Range   Average Score      Average Error
            MOS      P.563
4 - 5       4.25     2.45      1.79
3 - 4       3.42     1.70      1.69
2 - 3       2.56     1.71      0.97
1 - 2       1.68     1.49      0.55

The problem discovered in the distributed implementation of the P.563 algorithm called for an alternative solution. One possible solution, implemented in Sevana NIQA (Non-Intrusive Quality Analyzer), is described below.

General Structure of Sevana NIQA

The approach of NIQA (Non-Intrusive Quality Analyzer) is based on a database of trained reference patterns (etalons) called associations. Each association corresponds to a group of files that have similar expert estimates of sound quality and a common set of causes of sound quality degradation. For each association NIQA calculates and stores a distribution of parameter values.
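One way to picture an association is as a small record that summarises a group of files. The sketch below is only an assumption about the shape of such a record: the text does not say how the parameter distributions are stored, so a per-parameter mean and standard deviation stand in for them, and all names (`Association`, `train_association`, `snr_db`) are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Association:
    name: str                  # label of the degradation pattern (hypothetical)
    expert_score: float        # average expert score of the grouped files
    param_stats: dict = field(default_factory=dict)  # parameter -> (mean, std)

def train_association(name, expert_scores, param_vectors):
    """Summarise a group of files as per-parameter (mean, std) pairs.

    Assumption: a mean/std pair stands in for the stored distribution.
    """
    stats = {}
    for key in param_vectors[0]:
        values = [vector[key] for vector in param_vectors]
        stats[key] = (mean(values), pstdev(values))
    return Association(name, mean(expert_scores), stats)

# Two hypothetical files with similar scores and the same cause of degradation.
noisy = train_association("noisy", [2.0, 3.0],
                          [{"snr_db": 10.0}, {"snr_db": 20.0}])
print(noisy)
```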

The basic algorithm that NIQA uses to obtain sound quality scores is outlined below.

1. Loading sound data. Excluding low-level pauses. Audio signal energy normalization.
2. Detecting signal energy level threshold. VAD algorithm initialization.
3. Separating signal into active and passive components.
4. Calculating signal parameters in time domain.
5. Calculating signal spectrum.
6. Detecting DTMF.
7. Psychoacoustic filtering. First level of psycho-acoustic model.
8. Splitting spectrum into tone/noise components.
9. Level normalization. Second level of psycho-acoustic model.
10. Transforming levels into quantitative range of loudness. Third level of psycho-acoustic model.
11. Calculating signal spectrum parameters.
12. Search and selection from operational associations database, using the calculated signal parameters.
13. Score calculation.
14. Output of quality score and list of matched associations.



When loading the sound signal, the system excludes all fragments whose energy level is below a threshold. The excluded fragments correspond to absolute silence and are considered irrelevant for obtaining the sound quality score.
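The exclusion of silent fragments can be sketched as a simple energy gate. The frame length and the threshold value below are placeholder assumptions, since the text does not state how NIQA chooses them.

```python
def drop_low_energy(samples, frame_len=160, threshold=1e-4):
    """Keep only fragments whose mean energy reaches the threshold.

    frame_len and threshold are illustrative values, not NIQA's.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept

# Silence, a loud fragment, then silence again.
signal = [0.0] * 160 + [0.5] * 160 + [0.0] * 160
print(len(drop_low_energy(signal)))
```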

In the next phase the signal is split into frames used by the voice activity detection (VAD) algorithm. The system calculates energy values for each frame, which increases the accuracy of VAD. With the help of the VAD algorithm the signal is divided into active and inactive components that are processed separately. The system builds level histograms for both the active and the inactive signal components.
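A rough sketch of this phase follows, assuming a plain fixed-threshold energy detector in place of NIQA's actual VAD, which is not described in detail. The frame length, the threshold, and the histogram bin width are hypothetical.

```python
import math

def frame_levels_db(samples, frame_len=160):
    """Mean energy of each full frame, expressed in dB."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        levels.append(10.0 * math.log10(energy + 1e-12))  # offset avoids log(0)
    return levels

def split_by_activity(levels, threshold_db=-40.0):
    """Assumed VAD rule: a frame is active if its level reaches the threshold."""
    active = [lv for lv in levels if lv >= threshold_db]
    inactive = [lv for lv in levels if lv < threshold_db]
    return active, inactive

def level_histogram(levels, bin_db=10.0):
    """Count frames per level bin; keys are the lower bin edges in dB."""
    hist = {}
    for lv in levels:
        edge = math.floor(lv / bin_db) * bin_db
        hist[edge] = hist.get(edge, 0) + 1
    return hist

samples = [0.5] * 320 + [0.001] * 160   # two loud frames, one quiet frame
active, inactive = split_by_activity(frame_levels_db(samples))
print(level_histogram(active), level_histogram(inactive))
```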

Using the discrete cosine transform (DCT), the system obtains the signal spectrum, checks the frames of the active component for DTMF presence, and excludes the frames that resemble DTMF from further processing.
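The sketch below shows one possible way to flag DTMF-like frames from a DCT spectrum. It is not NIQA's detector: the Hann window, the band width, and the energy-share thresholds are all assumptions. Only the eight DTMF frequencies are standard. The DCT is written out directly for clarity rather than speed.

```python
import math

# Standard DTMF row and column frequencies in Hz.
LOW_GROUP = (697, 770, 852, 941)
HIGH_GROUP = (1209, 1336, 1477, 1633)

def dct_spectrum(frame):
    """Unnormalised DCT-II of a Hann-windowed frame (O(N^2), for clarity)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(windowed))
            for k in range(n)]

def band_energy(spectrum, freq_hz, fs, half_width_hz=50.0):
    """Energy of DCT coefficients within +/- half_width_hz of freq_hz."""
    hz_per_bin = fs / (2.0 * len(spectrum))  # DCT-II bin k sits at k*fs/(2N)
    lo = max(0, int((freq_hz - half_width_hz) / hz_per_bin))
    hi = min(len(spectrum) - 1, int((freq_hz + half_width_hz) / hz_per_bin))
    return sum(c * c for c in spectrum[lo:hi + 1])

def looks_like_dtmf(frame, fs=8000, min_share=0.8, min_each=0.2):
    """Assumed rule: one low-group and one high-group band hold most energy."""
    spectrum = dct_spectrum(frame)
    total = sum(c * c for c in spectrum)
    if total == 0.0:
        return False
    low = max(band_energy(spectrum, f, fs) for f in LOW_GROUP)
    high = max(band_energy(spectrum, f, fs) for f in HIGH_GROUP)
    return (low >= min_each * total and high >= min_each * total
            and low + high >= min_share * total)

fs, n = 8000, 400  # 50 ms frame at 8 kHz
key_1 = [math.sin(2 * math.pi * 697 * i / fs) +
         math.sin(2 * math.pi * 1209 * i / fs) for i in range(n)]
print(looks_like_dtmf(key_1, fs))
```

Frames for which `looks_like_dtmf` returns true would then be dropped before the psycho-acoustic stages.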

The next stage applies the first level of the psycho-acoustic model to the signal spectrum. This model checks different types of masking (including pre-masking and post-masking). Based on clear peaks of spectrum energy, the system splits the signal into tone and noise components.
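The tone/noise split can be illustrated with simple peak picking on a power spectrum. This sketch leaves out masking entirely. The peak criterion (a local maximum well above the median level) and the `ratio` and `spread` parameters are assumptions made for the example.

```python
from statistics import median

def split_tone_noise(power, ratio=10.0, spread=1):
    """Assign bins around clear spectral peaks to 'tone', the rest to 'noise'.

    A bin is a clear peak if it is a local maximum and exceeds
    ratio * median(power); both criteria are illustrative assumptions.
    """
    floor = median(power)
    tone = [0.0] * len(power)
    noise = list(power)
    for k in range(1, len(power) - 1):
        is_peak = power[k] > power[k - 1] and power[k] >= power[k + 1]
        if is_peak and power[k] > ratio * floor:
            # Move the peak and its immediate neighbours to the tone part.
            for j in range(max(0, k - spread), min(len(power), k + spread + 1)):
                tone[j] = power[j]
                noise[j] = 0.0
    return tone, noise

power = [1.0, 1.2, 0.9, 1.1, 50.0, 1.0, 0.8, 1.1, 1.0]
tone, noise = split_tone_noise(power)
print(tone)
print(noise)
```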

The second level of the psycho-acoustic model performs energy normalization of the signal: energy levels are transformed into loudness levels at 1 kHz. The third level of the psycho-acoustic model transforms loudness levels into several distinguishable grades of loudness, which makes it possible to ignore signal changes that the human ear does not perceive.
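The effect of the third level can be illustrated by quantizing levels into coarse grades so that small changes map to the same grade. The uniform 3 dB step and the 0 dB floor below are placeholders, not values from NIQA or from any psycho-acoustic standard.

```python
def to_loudness_grades(levels_db, step_db=3.0, floor_db=0.0):
    """Map loudness levels to integer grades of width step_db.

    step_db and floor_db are illustrative assumptions.
    """
    grades = []
    for level in levels_db:
        clipped = max(level, floor_db)          # levels below the floor share grade 0
        grades.append(int((clipped - floor_db) // step_db))
    return grades

# 60, 61 and 62.9 dB fall into one grade; 63 dB starts the next one.
print(to_loudness_grades([60.0, 61.0, 62.9, 63.0, -5.0]))
```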

The next step is to split the signal spectrum into bands that are critical to human ear perception and to calculate parameters both inside and outside these bands. Based on the computed signal parameters, the system selects the most similar associations from the database and performs matching. For the selected associations the system determines how much each of them influences the overall quality and then generates the final voice quality score as a combination of the scores of the selected associations, weighted accordingly.
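The selection and weighting step might look like the sketch below. The similarity measure (root mean squared z-score), the inverse-distance weights, and the choice of the two closest associations are assumptions, as are the plain-dictionary layout and the sample values. The text only states that the final score is a weighted combination of the scores of the selected associations.

```python
import math

def distance(params, stats):
    """Root mean squared z-score between parameters and an association."""
    total = 0.0
    for key, (mu, sigma) in stats.items():
        z = (params[key] - mu) / max(sigma, 1e-9)
        total += z * z
    return math.sqrt(total / len(stats))

def quality_score(params, associations, top_n=2):
    """Weighted combination of the scores of the closest associations."""
    ranked = sorted(associations,
                    key=lambda a: distance(params, a["stats"]))[:top_n]
    # Assumed weighting: closer associations influence the score more.
    weights = [1.0 / (1e-6 + distance(params, a["stats"])) for a in ranked]
    score = sum(w * a["score"] for w, a in zip(weights, ranked)) / sum(weights)
    return score, [a["name"] for a in ranked]

# Hypothetical database: stats map a parameter name to (mean, std).
database = [
    {"name": "clean", "score": 4.5, "stats": {"snr_db": (30.0, 5.0)}},
    {"name": "noisy", "score": 2.5, "stats": {"snr_db": (10.0, 5.0)}},
    {"name": "clipped", "score": 1.5, "stats": {"snr_db": (-20.0, 5.0)}},
]
print(quality_score({"snr_db": 20.0}, database))
```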

Sevana NIQA Testing and Evaluation

Sevana NIQA has been tested using the same ITU-T speech database that is used for conformance testing of the P.563 algorithm. In the tests we used a total of 376 English-language recordings. All recordings were sorted into 4 groups depending on their MOS scores (given in the documentation attached to the sound database). For each group of recordings we determined the average expert score and the average NIQA score (Table 2). To illustrate the comparison with P.563 we also calculated the average errors of the P.563 and NIQA scores for the same tests.
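The evaluation procedure can be sketched as follows. Two points are assumptions, because the text does not specify them: the average error is taken as the mean absolute difference between expert and predicted scores, and the ranges are treated as half-open, with 5.0 included in the top range. The sample records are invented for illustration.

```python
from statistics import mean

MOS_RANGES = ((4, 5), (3, 4), (2, 3), (1, 2))

def summarise(records, ranges=MOS_RANGES):
    """records: iterable of (expert_mos, predicted_score) pairs."""
    records = list(records)
    rows = []
    for lo, hi in ranges:
        # Assumption: ranges are [lo, hi); the top range also includes 5.0.
        group = [(m, p) for m, p in records
                 if lo <= m < hi or (hi == 5 and m == 5)]
        if not group:
            continue
        rows.append({
            "range": (lo, hi),
            "avg_mos": mean(m for m, _ in group),
            "avg_pred": mean(p for _, p in group),
            # Assumption: average error = mean absolute difference.
            "avg_error": mean(abs(m - p) for m, p in group),
        })
    return rows

# Hypothetical (expert MOS, predicted score) pairs.
sample = [(4.5, 4.0), (4.1, 3.5), (3.5, 3.9), (1.2, 2.0)]
for row in summarise(sample):
    print(row)
```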

Table 2. Comparison of NIQA scores against expert estimations

MOS Range   Average Score     Average Error
            MOS     NIQA      NIQA    P.563
4 - 5       4.25    3.44      0.83    1.79
3 - 4       3.42    3.06      0.51    1.69
2 - 3       2.56    2.61      0.43    0.97
1 - 2       1.68    2.36      0.68    0.55

The results show that NIQA scores agree with expert estimations much more closely than P.563 scores do. NIQA is less accurate than P.563 only for recordings with very low MOS scores (in the range from 1 to 2). In all other ranges the average error of NIQA is roughly 2 to 3 times lower than that of P.563.
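The ratios behind this comparison can be checked directly from the average errors in Table 2. The snippet below only restates those published values. The dictionary layout and the function name are chosen for this example.

```python
# Average errors copied from Table 2: MOS range -> (NIQA, P.563).
AVERAGE_ERROR = {
    "4-5": (0.83, 1.79),
    "3-4": (0.51, 1.69),
    "2-3": (0.43, 0.97),
    "1-2": (0.68, 0.55),
}

def error_ratios(table):
    """Ratio of P.563 error to NIQA error; values above 1 favour NIQA."""
    return {rng: p563 / niqa for rng, (niqa, p563) in table.items()}

for rng, ratio in error_ratios(AVERAGE_ERROR).items():
    print(f"MOS {rng}: P.563 error / NIQA error = {ratio:.2f}")
```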

References

1. Methods for subjective determination of transmission quality // ITU-T Recommendation P.800 / http://www.itu.int/rec/T-REC-P.800/en
2. Subjective performance assessment of telephone-band and wideband digital codecs // ITU-T Recommendation P.830 / http://www.itu.int/rec/T-REC-P.830/en
3. Objective quality measurement of telephone-band (300-3400 Hz) speech codecs // ITU-T Recommendation P.861 / http://www.itu.int/rec/T-REC-P.861/en
4. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs // ITU-T Recommendation P.862 / http://www.itu.int/rec/T-REC-P.862/en
5. Single-ended method for objective speech quality assessment in narrow-band telephony applications // ITU-T Recommendation P.563 / http://www.itu.int/rec/T-REC-P.563-200405-I/en
6. ITU-T coded-speech database // Supplement 23 to ITU-T P-series Recommendations / http://www.itu.int/rec/T-REC-P.Sup23-199802-I/en
