Real Time Speech Classification and Pitch Detection

JONATHAN A. MARKS, MEMBER, IEEE

INTERNATIONAL DIGITAL CORPORATION

P.O. Box 6270

Johannesburg

2000

Abstract: An accurate silence-unvoiced-voiced classification and pitch detection algorithm is described and its implementation for real time applications on a Texas Instruments TMS320C25 digital signal processor is evaluated. Speech classification is separated into silence detection and voiced-unvoiced classification. Only the signal's energy level and zero crossing rate are used in both classification processes. Pitch detection need only operate on voiced periods of speech. An elaborate peak picking technique is used to successively home in on the peaks that bound the pitch periods. Tests are performed on the found peaks to ensure that they are pitch period peaks. A real time implementation strategy is developed that combines silence detection with the signal acquisition and tightly couples voiced-unvoiced classification with pitch detection. The silence detection task is interrupt driven and the pitch detection task loops continuously. The execution speed and accuracy results for this algorithm are shown to compare favourably with those for other such algorithms published in Rabiner et al. [3].

I. INTRODUCTION

SPEECH CLASSIFICATION (silence, unvoiced and voiced) and pitch detection are fundamental to many digital speech processing applications, such as speaker and speech recognition, and bit rate reduction for storage and transmission. Many such algorithms exist [1]-[16]. Most are too computationally intensive for practical real time applications. The faster executing algorithms usually do not exhibit the accuracy required by many real time speech processing applications. Jayant [17] considers real time implementations feasible if they can be accomplished with a single microprocessor.

This paper describes a speech classification and segmentation algorithm implemented for real time processing on a Texas Instruments TMS320C25 Digital Signal Processor and evaluates its accuracy and speed performance. This section examines certain speech signal characteristics and concludes with a discussion of the problems associated with speech classification and pitch detection. Section II reviews some speech classification and pitch detection techniques and selects those used in this algorithm. In Section III, the real time implementation strategy for the algorithm is developed. Section IV describes the selection of speakers, the spoken messages and how they were recorded. Section V evaluates the accuracy of the algorithm by examining its silence detection, voiced-unvoiced classification, and pitch detection performance. These results are compared to those obtained for other such algorithms found in Rabiner et al. [3]. In Section VI the execution speed performance of the algorithm is evaluated. Section VII concludes with a discussion of the major results obtained from the evaluation of the algorithm.

A. Speech Signal Characteristics

Speech is the most natural form of communication known to man, yet it is one of the most complex signals to analyse or model. Segments of speech can be described (or classified) in terms of the sounds they produce. Broadly, there are four categories:

Silence.
Unvoiced utterances.
Voiced utterances.
Plosives.

Silence is that part of the signal where no speech is present. Unvoiced sounds, such as the /s/ in 'salt', are created by air passing through the vocal tract without the vocal cords vibrating. Voiced sounds, such as the /AH/ in 'and', are created by air passing through the glottis causing it to vibrate. Plosives, such as the /b/ in 'bee', are created by the sudden release of air from a constriction created in the vocal tract.

Unvoiced speech exhibits low signal energy, no pitch, and a frequency spectrum biased towards the higher frequencies of the audio band, normally peaking at four to five kilohertz. Voiced speech, on the other hand, has greater signal energy, pitch, and a spectrum biased towards the lower frequencies. Silence (background noise) is assumed to have a flat frequency response and low energy levels. Plosives are transient in nature with relatively high energy levels and low frequency content.

B. Problems in Speech Classification and Pitch Detection

It is generally recognised that accurate silence-unvoiced-voiced classification and pitch detection is extremely difficult for the following reasons:

Unvoiced and voiced speech usually overlap and flow into each other, making it nearly impossible to pinpoint the exact times where these transitions take place.
Speech occasionally produces waveforms that cannot be classified either voiced or unvoiced, even by visual inspection of amplitude time curves.
Low level voiced speech could be easily confused with unvoiced speech, especially when the formant frequencies are high and prominent.
The glottal excitation waveform of voiced speech is not a perfect train of periodic pulses. It varies in frequency, amplitude and shape.
The shape of the glottal excitation waveform is changed by the interaction of the vocal tract. The vocal tract creates formant frequencies and other harmonics. If these formants and harmonics contain comparable energy to the pitch frequency energy, they interfere with the pitch detection criteria and can be confused with the pitch frequency.
The vocal tract is continuously changing shape, and often builds up constrictions which block the flow of air through it. When the air is released, it rushes out of the vocal tract, creating a plosive. Plosives have similar amplitude characteristics to voiced speech, but, like unvoiced speech, have no pitch. Therefore it is possible for plosives to be confused as either voiced or unvoiced speech.

II. SPEECH CLASSIFICATION AND PITCH DETECTION TECHNIQUES

In this section, various techniques used to classify speech and locate the pitch period in voiced speech are discussed. The review of techniques is not exhaustive and is biased towards those methods that lend themselves to real time implementation.

COMSIG 88, PRETORIA. TH0219-6/88/0000-0001 $1.00 © 1988 IEEE


A. Speech Classification Techniques

Speech signal classification techniques can be divided into two categories:

Those that apply directly to the pitch detection problem, i.e. speech is usually classified unvoiced when the pitch cannot be found [5], [7], [9], [11], [13].
Those that extract parameters from the speech signal and attempt to recognise patterns. That is, they are based on differences in statistical distributions and/or characteristic features of silence, unvoiced and voiced utterances [18], [19], [20], [21], [22].

To continuously classify speech accurately, the algorithm must adapt to the incoming speech signal. It is difficult to make the methods in the former category adaptive to classification as their primary concern is pitch detection. It is generally easier to make the techniques in the latter category adapt to the incoming signal.

The most commonly used speech classification parameters are:

Relative energy level.
Zero crossing rate.
First autocorrelation coefficient.
First linear predictive coding (LPC) predictor coefficient.
Normalised prediction error.

The relative energy level of voiced speech is usually higher than that of unvoiced speech and silence. The zero crossing rate is a measure of frequency content in the signal. Unvoiced speech exhibits a higher zero crossing rate than voiced speech or silence. Voiced speech does not change as rapidly as unvoiced speech. Therefore voiced speech often has a higher first autocorrelation coefficient and a smaller prediction error than unvoiced speech. The first LPC coefficient, which is the cepstrum value of the signal at one unit delay, is smaller for voiced speech than for unvoiced speech.
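The two adopted parameters are cheap to compute from the raw samples. As a rough illustration only (the frame length and the use of the mean absolute sample value as the "energy" measure are assumptions, not the paper's exact definitions), short-time energy and zero-crossing rate can be sketched as:

```python
# Sketch of the two classification parameters the paper relies on:
# short-time energy and zero-crossing rate. Frame length and the use
# of mean absolute value as "energy" are illustrative assumptions.
from typing import Sequence

def frame_energy(frame: Sequence[float]) -> float:
    """Relative energy level: mean absolute sample value of the frame."""
    return sum(abs(s) for s in frame) / len(frame)

def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# A frame that alternates sign every sample has the maximum ZCR of 1.0.
frame = [1.0, -1.0] * 40
print(frame_energy(frame))        # 1.0
print(zero_crossing_rate(frame))  # 1.0
```

Both measures are pure time-domain sums over the frame, which is why the paper singles them out as the cheapest parameters to extract on a fixed-point DSP.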

Atal and Rabiner [22] apply pattern recognition techniques using probability theory to the five speech parameters. Mwangi and Xydeas [18] apply Fuzzy Set theory to these five speech parameters, and use a pattern recognition approach to classify speech signals. These techniques do not lend themselves to real time processing because of their high computational intensity.

The adopted classification approach only uses the relative energy level and zero crossing rate of the speech signal. These parameters are based on time domain measurements and require the least computational overheads to be extracted from the waveform. By using these two parameters and making the algorithm adapt to the 'conversation level' of the signal, it is possible to achieve the desired accuracy.

In the implemented algorithm, speech classification is divided into two separate tasks:

Silence Detection.
Voiced-Unvoiced Classification.

To detect silence, the relative energy level is a useful parameter to determine whether a speech signal is present or not. Davis [23] and Sims [21] have both implemented silence detection algorithms using only the signal energy level. Davis recommends the use of the zero crossing rate to detect silence periods accurately. Rabiner and Schafer [1] propose a silence detection algorithm using both the relative energy level and zero crossing rate. The zero crossing rate is required to discern between low amplitude fricatives and silence periods; both could have similar energy levels but fricatives have a higher frequency content.

In unvoiced-voiced speech classification, it sometimes happens that voiced speech is classified unvoiced and vice versa. This is due to the overlap in the relative signal level and zero crossing rate values that are used to classify the speech signal. By using the principle of hysteresis and delaying confirmation of the classification until more of the incoming signal has been considered, this problem is largely overcome.

Fig. 1 provides a graphical explanation of the hysteresis principle by plotting the zero crossing rate versus the relative energy level. The unvoiced level, unvoiced line, voiced level, and voiced line are determined empirically. Should the signal energy level be greater than the voiced level or less than the unvoiced level, then the signal is classified voiced or unvoiced respectively. If the relative signal level falls between the unvoiced level and the voiced level, then the zero crossing rate is considered. If the point described by the relative energy level and the zero crossing rate falls above the unvoiced line or below the voiced line, then the signal is classified unvoiced or voiced respectively. If the signal value falls between the voiced and unvoiced lines, then the signal assumes its prior classification.
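The decision cascade just described can be sketched as follows. All threshold values here are invented placeholders: the paper determines the actual levels and lines empirically, and the "lines" in Fig. 1 may in fact be functions of both parameters rather than the plain ZCR thresholds assumed below.

```python
# Illustrative sketch of the hysteresis classification rule.
# Every threshold value is an assumed placeholder, not from the paper.
UNVOICED_LEVEL = 0.05   # energy below this -> unvoiced (assumed)
VOICED_LEVEL = 0.30     # energy above this -> voiced   (assumed)
UNVOICED_ZCR = 0.40     # ZCR above this line -> unvoiced (assumed)
VOICED_ZCR = 0.10       # ZCR below this line -> voiced   (assumed)

def classify(energy: float, zcr: float, prior: str) -> str:
    """Return 'voiced' or 'unvoiced' using energy, ZCR and hysteresis."""
    if energy > VOICED_LEVEL:
        return "voiced"
    if energy < UNVOICED_LEVEL:
        return "unvoiced"
    # Ambiguous energy band: consult the zero-crossing rate.
    if zcr > UNVOICED_ZCR:
        return "unvoiced"
    if zcr < VOICED_ZCR:
        return "voiced"
    # Between the two lines: keep the previous classification (hysteresis).
    return prior

print(classify(0.5, 0.2, "unvoiced"))  # voiced (energy above voiced level)
print(classify(0.1, 0.25, "voiced"))   # voiced (hysteresis keeps prior)
```

The final branch is the hysteresis: an ambiguous frame inherits its previous label, so the classification only flips once the evidence clears one of the lines.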

Fig. 1. Voiced-Unvoiced Hysteresis Classification (zero crossing rate plotted against relative energy level, with the unvoiced level, conversation level and voiced level marked).

B. Pitch Detection Techniques

All the pitch detection algorithms surveyed [1]-[16] conformed to one of the following two categories:

Those that determine the pitch period by considering a segment of speech 10 to 30 ms long and process it using autocorrelation, filtering or frequency domain techniques to find the pitch period for the whole segment under consideration [9], [11], [12].
Those that determine the pitch period in the time domain by using combinations of peak picking, pattern recognition and statistical techniques [6], [14], [15].

LaGarde [4] points out that autocorrelative and frequency domain techniques for pitch detection are prohibitive for real time implementations due to their computational intensity. Therefore, only the latter category can be considered for real time implementations. Algorithms in the latter category include:

Rabiner and Gold's Parallel Processing Technique for Estimating Pitch Periods of Speech in the Time Domain [6].
Hassanein and Bryden's Implementation of the Gold-Rabiner Pitch Detector in Real Time using an Improved Voicing Detector [7].
Miller's Pitch Detection By Data Reduction Algorithm [15].
Gold's Computer Program for Pitch Extraction [14].

Further applications of the algorithm prohibit low pass filtering with a cut-off frequency of 900 Hz, as carried out by all these algorithms except Gold's Computer Program for Pitch Extraction. Gold's algorithm selects the maxima and minima of the waveform and labels them according to certain characteristics. Several regularity tests are performed on these labelled peaks to determine those peaks that mark the pitch periods.



The adopted approach, like Gold's, also employs elaborate peak picking techniques and pattern matching. The implemented algorithm divides pitch detection into the process of first finding the pitch and then tracking it, similar to the methods used by LaGarde [4] and Davis [23]. This algorithm is designed to locate pitch frequencies in the range from 64 Hz to 500 Hz.

Peak picking and pattern matching techniques capitalise on the slowly varying nature of the speech signal, i.e. the pitch period peaks appear to follow trends. Consecutive pitch period peaks seldom vary more than 50%, and variations in consecutive pitch periods seldom exceed 10%.

This algorithm performs searches for peaks, which progressively home in on the pitch period, by successively reducing the search range in which pitch period peaks are expected to be found. Tests are performed on the pitch period peaks to ensure that the found peak is broadly following the trends of the previous pitch period peaks. Pitch tracking locates the next pitch period based on the length of the last pitch period, again ensuring that the found peak broadly follows the trends of the previous pitch period peaks.
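A minimal sketch of pitch tracking with trend tests follows. The 10% period tolerance and 50% peak-amplitude tolerance come from the text above; everything else (the 0.8x to 1.2x search window around one period ahead, and picking the maximum sample in that window) is a simplifying assumption, not the paper's exact procedure.

```python
# Hypothetical sketch of time-domain pitch tracking with trend tests.
# Search-window bounds and max-sample picking are assumptions; the
# 10% period and 50% amplitude tolerances are taken from the text.

def next_pitch_peak(signal, last_peak_idx, last_period, last_amp):
    """Search near one period ahead of the last peak; validate trends."""
    lo = last_peak_idx + int(0.8 * last_period)   # assumed window start
    hi = last_peak_idx + int(1.2 * last_period)   # assumed window end
    window = signal[lo:hi]
    if not window:
        return None
    idx = lo + max(range(len(window)), key=window.__getitem__)
    period = idx - last_peak_idx
    amp = signal[idx]
    # Trend tests: consecutive periods within 10%, peaks within 50%.
    if abs(period - last_period) > 0.10 * last_period:
        return None
    if abs(amp - last_amp) > 0.50 * abs(last_amp):
        return None
    return idx, period, amp

# A synthetic pulse train with a 100-sample period tracks cleanly.
sig = [0.0] * 300
sig[100] = sig[200] = 1.0
print(next_pitch_peak(sig, 100, 100, 1.0))  # (200, 100, 1.0)
```

Returning None signals a failed trend test, at which point a real tracker would fall back to the slower pitch-finding search.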

III. REAL TIME IMPLEMENTATION STRATEGY

The developed algorithm accepts a digitised speech signal and outputs the processed waveform in segments which are classified silence, unvoiced or voiced. Silence segments approximate the length of the silence periods. Unvoiced segment lengths are determined by the previous voiced speech pitch period. Voiced speech is segmented according to its pitch period. Adjacent unvoiced segments are the same size, whereas voiced speech segments vary with the pitch period of the speech signal.

The real time implementation strategy decomposes the system into three tasks, as the data flow diagram in Fig. 2 illustrates:

Data Acquisition and Silence Extraction.
Speech Classification and Segmentation.
Segment Processing.

Fig. 2. System Level Data Flow Diagram

The Data Acquisition and Silence Extraction task accepts the digitised signal, sample by sample, from the signal input device, detects and extracts any periods of silence, and writes the 'silence extracted' signal values into the Data Stream. The Data Stream is an array large enough to buffer the 'silence extracted' signal values before the subsequent tasks can examine them.

The Speech Classification and Segmentation task reads segments of signal values from the Data Stream, classifies them, and locates the pitch of the voiced speech segments. This task stores the classified segment in the Segment Buffer, and places information about it in the Segment Characteristics data store.

The Segment Processing task is invoked by the Speech Classification and Segmentation task. This task is provided for the further processing of the segment in the Segment Buffer before it is output.

The Data Acquisition and Silence Extraction task is interrupt driven and writes the 'silence extracted' signal values to the Data Stream at a rate determined by the frequency at which signal values are read into the system. The Speech Classification and Segmentation task loops continuously, reading segments of signal values from the Data Stream, classifying them, and locating or tracking the pitch period peaks of voiced speech before dispatching the segments to the Segment Processing task. The Data Stream array is the means with which the algorithm synchronises the flow of signal values between the Data Acquisition and Silence Extraction task and the Speech Classification and Segmentation task.
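The Data Stream's role as the synchronising buffer between the interrupt-driven producer and the looping consumer can be sketched as a circular buffer. The class name, buffer size, and drop-on-overflow policy below are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of the Data Stream as a circular buffer between
# the interrupt-driven acquisition task (producer) and the looping
# classification task (consumer). Names and policies are assumptions.

class DataStream:
    def __init__(self, size: int):
        self.buf = [0] * size
        self.size = size
        self.head = 0   # next write position (acquisition task)
        self.tail = 0   # next read position (classification task)
        self.count = 0  # samples currently buffered

    def put(self, sample) -> bool:
        """Called per sample from the acquisition interrupt."""
        if self.count == self.size:
            return False          # buffer full: sample dropped (assumed)
        self.buf[self.head] = sample
        self.head = (self.head + 1) % self.size
        self.count += 1
        return True

    def get_segment(self, n: int):
        """Called by the classification loop; n samples or None."""
        if self.count < n:
            return None           # not enough data buffered yet
        out = [self.buf[(self.tail + i) % self.size] for i in range(n)]
        self.tail = (self.tail + n) % self.size
        self.count -= n
        return out

stream = DataStream(8)
for s in range(5):
    stream.put(s)
print(stream.get_segment(3))  # [0, 1, 2]
```

On the TMS320C25 the producer side would run inside the sample interrupt service routine, so the buffer must absorb the worst-case gap between interrupt arrivals and segment consumption.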

IV. SPEECH RECORDINGS FOR EVALUATION

Recordings for evaluation were taken from appropriately chosen speakers to cover a pitch span from 67 Hz to 500 Hz. The recorded message was chosen to cover a comprehensive range of phonemes found in English. Four speakers were selected:

Male with low pitch (SID).
Male with average pitch (JON).
Female with average pitch (CHRISTA).
Female child with very high pitch (NATASHA).

Each speaker recorded the same message: "Every salt breeze comes from the sea. I was stunned by the beauty of the view. I know when my lawyer is due."

All the messages were filtered before being digitised. The amplifier used has a low pass 60 dB per octave Butterworth filter with a -3 dB point at 3.5 kHz, and a high pass 36 dB per octave Butterworth filter with a -3 dB point at 60 Hz. The amplifier has a signal to noise ratio (SNR) of 65 dB. The messages were digitised at 8 kHz, and only the most significant 12 bits of the signal were recorded.

To evaluate the algorithm's performance in normally encountered transmission and recording conditions, each message was recorded in a certain environment:

SID was recorded in a laboratory area with a dot matrix printer operating and people talking in the background.
JON was recorded in a quiet environment.
CHRISTA was recorded in the same laboratory area as SID with people talking in the background.
NATASHA was recorded with an amplifier that injected 50 Hz hum and had an SNR of approximately 40 dB.

V. ACCURACY EVALUATION

To evaluate the speech classification and pitch detection accuracy of the algorithm, standards are required with which the results of the algorithm can be compared. The standards were generated manually by examining plots of the messages and inspecting their digitised values.

Fig. 3 shows the pitch period contours for each speaker (SID, JON, CHRISTA, and NATASHA). The histograms on the left are the standards found manually, and the histograms on the right are obtained from the messages processed by the algorithm. The vertical lines represent silences, the horizontal lines indicate unvoiced segments, and the dots represent the pitch periods of the voiced segments.

A. Accuracy Evaluation Criteria

The accuracy of the algorithm is evaluated according to the following criteria:

Silence detection error rate.
Erroneous unvoiced classification rate.
Erroneous voiced classification rate.
Gross pitch detection error count.
Fine pitch detection error standard deviation.



Fig. 3. Pitch Period Contour of each Speaker

1. Silence Detection Error Rate: This is a measure of the accuracy of the algorithm in locating the periods of silence in the speech signal. It is found by summing the magnitude of the differences between the lengths of the silence periods in the standard and those found by the algorithm. The percentage silence detection error is found by dividing this sum by the sum of the lengths of all the silence periods in the corresponding standard.
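A small sketch of this error-rate computation, assuming the standard and detected silence periods have already been paired one-to-one (the pairing itself is not specified above):

```python
# Sketch of the silence detection error rate: sum of absolute length
# differences between paired standard and detected silence periods,
# divided by the total standard silence length. One-to-one pairing
# of the periods is an assumption.

def silence_error_rate(standard_lengths, detected_lengths):
    diff = sum(abs(s - d) for s, d in zip(standard_lengths, detected_lengths))
    total = sum(standard_lengths)
    return 100.0 * diff / total

# Three silence periods, two of them off by 10 samples each.
print(silence_error_rate([100, 200, 50], [90, 210, 50]))
```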

2. Erroneous Unvoiced Classification Rate: This measurement indicates the error rate in classifying voiced segments as unvoiced. It is found by counting all the segments classified unvoiced by the algorithm that should have been classified voiced for each message, and dividing this count by the number of voiced segments in the standard for that message.

3. Erroneous Voiced Classification Rate: This measurement displays the error rate in classifying unvoiced segments as voiced. It is found by counting all the segments classified voiced by the algorithm that should have been classified unvoiced for each message, and dividing this count by the number of unvoiced segments in the standard for that message.

4. Gross Pitch Detection Error Count: If the algorithm erroneously finds the pitch period of a voiced segment, and the difference between it and that in the standard is greater than 1 ms, then this pitch detection error is a gross pitch detection error.

5. Fine Pitch Detection Error Standard Deviation: Should this algorithm erroneously estimate the pitch period in voiced speech with an error less than 1 ms relative to the same segment in the standard, then it is a fine pitch detection error. The method for computing the standard deviation is obtained from Rabiner et al. [3].
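Criteria 4 and 5 can be sketched together as below. Note that the standard deviation here is the ordinary sample standard deviation, which is an assumption standing in for the exact formula of Rabiner et al. [3].

```python
# Sketch of splitting pitch detection errors into gross (> 1 ms) and
# fine (<= 1 ms) errors per criteria 4 and 5. The ordinary sample
# standard deviation is an assumed stand-in for the formula of [3].
import math

def pitch_error_stats(standard_ms, detected_ms, threshold_ms=1.0):
    fine = []
    gross = 0
    for s, d in zip(standard_ms, detected_ms):
        err = d - s
        if abs(err) > threshold_ms:
            gross += 1              # gross pitch detection error
        elif err != 0:
            fine.append(err)        # fine pitch detection error
    if len(fine) > 1:
        mean = sum(fine) / len(fine)
        std = math.sqrt(sum((e - mean) ** 2 for e in fine) / (len(fine) - 1))
    else:
        std = 0.0
    return gross, std

# One 2 ms error is gross; the 0.5 ms and -0.3 ms errors are fine.
print(pitch_error_stats([10.0, 10.0, 10.0], [10.5, 12.0, 9.7]))
```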



B. Results of Accuracy Evaluation

Tests designed in accordance with the evaluation criteria discussed in Section V-A were performed on the four recorded messages. The results are presented in Tables I to V. The results of the classification errors and the pitch detection errors are tabulated with the corresponding results of other algorithms obtained by Rabiner et al. [3]. The algorithms used for comparison are:

Modified autocorrelation method using clipping (AUTOC).
Cepstrum method (CEP).
Simplified inverse filtering technique (SIFT).
Data Reduction Method (DARD).
Parallel Processing Method (PPROC).
Spectral equalisation LPC method using Newton's transformation (LPC).
Average magnitude difference function (AMDF).

In the following tables this algorithm is referred to as PPICK, for peak picking technique. Rabiner's speakers correspond to those in this paper as follows: LM to SID, M2 to JON, F1 to CHRISTA, and C1 to NATASHA. The overall performance of the algorithm over the range of speakers is evaluated by summing the results of the four speakers for each evaluation criterion. The better results are indicated by lower values in the tables. The best results for each criterion are typeset in bold.

1. Silence Detection Errors: The Silence Detection Error Rate percentage for each of the speakers is presented in Table I. The results indicate that background noise has an adverse effect on the silence detection performance. The message SID was recorded in the noisiest environment, and has the highest Silence Detection Error Rate. The message JON was recorded in a quiet environment and has the lowest error value. Low volume background noise appears to have less effect than amplifier noise on the algorithm's silence detection performance.

TABLE I. PERCENTAGE SILENCE DETECTION ERROR RATE

Speaker     SID    JON    CHRISTA  NATASHA
Error (%)   40.6   1.39   3.63     7.60

2. Erroneous Unvoiced Classification: Table II shows the percentage erroneous unvoiced classification scores for each of the four speakers and compares the results to similar results of the other algorithms obtained by Rabiner et al. [3]. The score for NATASHA is the highest, indicating that this algorithm is better at classifying speech as voiced when the pitch is not too high.

Comparing these scores to those from the other algorithms, this algorithm achieves the best results for low pitched male and average female voices. This algorithm achieved an overall performance score of 6.36, which is the lowest, followed closely by the LPC method with 7.6.

TABLE II. PERCENTAGE ERRONEOUS UNVOICED CLASSIFICATION

Speaker    PPICK  AUTOC  CEP    SIFT   DARD   PPROC  LPC   AMDF
SID        0.20   6.2    24.4   8.6    14.4   3.4    0.8   2.8
JON        1.88   4.1    18.6   1.8    5.6    2.3    3.9   2.4
CHRISTA    0.90   2.4    10.1   4.5    7.3    4.9    1.8   4.5
NATASHA    3.78   1.6    11.6   24.5   3.2    2.1    0.9   2.3
SUM        6.36   14.3   64.7   39.4   30.5   12.7   7.6   12.0

3. Erroneous Voiced Classification: Table III shows the erroneous voiced classification scores of this algorithm for each of the four speakers with similar results for the other methods obtained from Rabiner et al. [3]. The highest score achieved by this algorithm was for speaker SID, indicating that this algorithm is better at classifying segments unvoiced when the pitch is higher. The background noise appeared to have a small but significant effect on the unvoiced classification of the speech signals.

When these results are compared to the others, average male voices and high pitched voices fared best. This algorithm scored the best overall result of 10.51, followed by the Cepstrum method for pitch detection with 17.0.

TABLE III. PERCENTAGE ERRONEOUS VOICED CLASSIFICATION

Speaker    PPICK  AUTOC  CEP    SIFT   DARD   PPROC  LPC   AMDF
SID        5.10   17.8   1.7    18.3   13.9   16.1   25.6  17.2
JON        1.04   15.6   3.1    23.4   14.8   20.3   17.2  18.8
CHRISTA    2.38   10.0   6.9    16.9   1.9    5.6    9.4   8.1
NATASHA    1.99   11.4   5.3    15.9   12.9   11.4   8.3   6.8
SUM        10.51  54.8   17.0   74.5   43.5   53.4   60.5  50.9

4. Gross Pitch Errors: Table IV shows the results obtained for the gross pitch error measurements with similar results from other algorithms found in Rabiner et al. [3]. This algorithm achieves the best scores when the pitch is generally high. The result for CHRISTA was higher than expected. It was found that this speaker occasionally exhibits diplophonic characteristics, i.e. every second pitch period peak is lower than the first, caused by the glottis only closing properly on every second vibration.

The scores achieved by this algorithm are placed second and third when compared to similar results from the other algorithms. Nevertheless, when the overall performance is considered, this algorithm has the best score of 6.97, followed by the Cepstrum method for pitch detection with 17.6. This result shows that this algorithm detects pitch very well over a wide range of pitch frequencies.

TABLE IV. PERCENTACE GROSS ERROR COUNT

Pitch Detector Algorithm

Speaker     PPICK   AUTOC   CEP    SIFT   DARD   PPROC   LPC    AMDF
SID          1.61    19.5   1.3     4.5   13.8    23.8   13.0   23.8
JON          2.51     7.3   1.3     5.3   26.8    12.3    5.5    8.5
CHRISTA      2.58     2.0   2.5     3.8    8.5     5.0    2.0    1.8
NATASHA      0.27     0.0  12.5    40.8    3.0     7.3    5.5    6.5
SUM          6.97    28.8  17.6    54.4   52.1    48.4   26.0   40.6
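In Rabiner et al. [3], a gross pitch error is counted whenever an estimated pitch period deviates from a reference period by more than a fixed threshold. The function below is a minimal sketch of that tally, assuming frame-aligned estimate and reference arrays in milliseconds; the function name and the 1 ms threshold are illustrative, not taken from the paper.

```python
def gross_pitch_errors(estimated_ms, reference_ms, threshold_ms=1.0):
    """Percentage of voiced frames (reference period > 0) where the
    pitch period estimate deviates from the reference by more than
    the gross-error threshold."""
    voiced = [(e, r) for e, r in zip(estimated_ms, reference_ms) if r > 0]
    if not voiced:
        return 0.0
    gross = sum(1 for e, r in voiced if abs(e - r) > threshold_ms)
    return 100.0 * gross / len(voiced)

# A pitch doubling (10 ms reported as 20 ms) counts as a gross error;
# a 0.3 ms deviation is only a fine error; the 0.0 frame is unvoiced.
rate = gross_pitch_errors([10.2, 20.0, 9.7, 0.0], [10.0, 10.0, 10.0, 0.0])
```

Pitch-halving and pitch-doubling, the multiples the paper's extra tests guard against, are exactly the deviations this count catches.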

5. Fine Pitch Errors: Table V presents the results of the standard deviation analysis of the fine pitch errors over the range of speakers for this algorithm and the other algorithms as found by Rabiner et al. [3]. This algorithm achieved its best results for male speakers. The results became progressively worse as the pitch period of the speakers decreased.

TABLE V. STANDARD DEVIATION OF FINE PITCH ERRORS

Pitch Detector Algorithm

Speaker     PPICK   AUTOC   CEP    SIFT   DARD   PPROC   LPC    AMDF
SID          0.35     0.8   1.0     0.9    1.1     1.6    1.1    2.1
JON          0.51     1.0   0.9     1.0    1.2     1.4    0.9    1.6
CHRISTA      0.55     0.6   0.5     0.7    0.8     0.7    0.6    1.1
NATASHA      0.65     0.5   0.4     0.9    0.8     0.7    0.5    1.1
SUM          2.06     2.9   2.8     3.5    3.9     4.4    3.1    5.8

COMSIG 88. PRETORIA 5


In comparison to the results obtained by Rabiner, this algorithm has the best scores for male speakers, second place for average female speakers, and third place for very high pitched speakers. In terms of overall fine pitch errors this algorithm out-performs all the other methods with a score of 2.06. The Cepstrum method of pitch detection follows this algorithm with a score of 2.8.
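The fine pitch error measure used in Rabiner et al. [3] is the standard deviation of the small estimate-to-reference deviations that remain after gross errors are excluded. A minimal sketch, with illustrative names and an assumed 1 ms gross-error threshold:

```python
import statistics

def fine_pitch_error_std(estimated_ms, reference_ms, threshold_ms=1.0):
    """Standard deviation (ms) of the small estimate - reference
    deviations on voiced frames, excluding frames whose deviation
    exceeds the gross-error threshold."""
    fine = [e - r for e, r in zip(estimated_ms, reference_ms)
            if r > 0 and abs(e - r) <= threshold_ms]
    # Population standard deviation; degenerate cases score zero.
    return statistics.pstdev(fine) if len(fine) > 1 else 0.0

# The 20 ms doubling is a gross error and does not inflate the
# fine-error figure for the remaining frames.
sigma = fine_pitch_error_std([10.2, 9.8, 10.1, 20.0], [10.0, 10.0, 10.0, 10.0])
```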

VI. COMPUTATIONAL SPEED EVALUATION

The computational speed of this algorithm is judged in terms of the percentage of real time taken when processing a message. Percentage real time is defined as

    %RT = 100 (T_P f) / N

where:
    %RT = percentage real time,
    T_P = time taken to process N samples,
    N   = number of samples processed,
    f   = sampling frequency of the digitised message.

In preliminary tests over a wide range of messages using a number of speakers, the algorithm used between 11.7% and 14.9% of real time to classify the speech and detect the pitch of voiced speech. The messages with higher pitch tended to take longer to execute.
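Since N samples at sampling frequency f represent N/f seconds of speech, the definition above reduces to the processing time divided by the message duration. A minimal sketch with illustrative numbers (the 8 kHz rate is an assumption, not stated here):

```python
def percent_real_time(processing_time_s, num_samples, sample_rate_hz):
    """Percentage of real time consumed: time taken to process the
    samples, divided by the duration of speech they represent."""
    message_duration_s = num_samples / sample_rate_hz
    return 100.0 * processing_time_s / message_duration_s

# Processing one second of speech (8000 samples at 8 kHz) in 0.13 s
# consumes 13% of real time, within the 11.7-14.9% range reported.
usage = percent_real_time(0.13, 8000, 8000)
```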

The silence detection portion of the algorithm used between 4.2% and 4.6% of real time. The variation in performance for silence detection was found to depend more on the ratio between unvoiced and voiced speech than on the amount of silence in the message. Messages with a greater unvoiced speech content consume more time.

Classification and segmentation of the speech signal when it is unvoiced consumed between 3.3% and 4.1% of real time, with the higher pitch messages taking longer to process. This is because the unvoiced segment size is related to the pitch period of the signal, so there are more segments to process when the pitch is high.

Classification and pitch detection of voiced speech took between 4.5% and 6.7% of real time, again with the higher pitch messages taking longer to process. This is because the formant frequencies affect the success of locating the pitch at higher pitch frequencies, and because of the additional tests required to ensure that the algorithm has not found a multiple of the pitch period.

A worst case situation was imposed on the algorithm, where it was forced to fail all the pitch detection tests and never track the pitch. In this situation, classification and pitch detection of voiced speech took less than 16% of real time.

VII. DISCUSSION OF RESULTS AND CONCLUSION

The overall performance of this algorithm compares favourably with the results of other algorithms found in the literature. The accuracy scores achieved by this algorithm show it to classify speech and detect pitch very well over a wide range of pitch periods, whereas the algorithms examined by Rabiner et al. [3] showed preferences for certain categories of speakers.

The silence detection and voiced-unvoiced classification errors made by this algorithm are attributed to noise that was recorded with the messages. This algorithm's classification process tends to favour voiced classification when the pitch is high and unvoiced classification when the pitch is low.

Gross pitch errors mainly occur at the trailing edge of voiced utterances, where the air flow through the vocal tract is slowing down and the glottal muscles are relaxing, causing the glottis to vibrate more slowly. Occasionally the time between pitch period peaks at the trailing edge of utterances is greater than 20 ms, even for high pitched speakers.

The occurrence of fine pitch errors tends to be proportional to the pitch frequency. This is because the formant frequencies superimposed on the pitch frequency create peaks of similar height to the pitch period peak, but offset slightly from it. Low pass filtering at about 900 Hz will limit the number of pitch errors made by this algorithm. However, for the intended application of this algorithm, such filtering is not desirable.
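A 900 Hz low-pass stage of the kind mentioned above can be approximated very cheaply. The single-pole recursive filter below is an illustrative sketch only, not the paper's implementation; the coefficient comes from the standard RC time-constant mapping, and the 8 kHz sampling rate is an assumption.

```python
import math

def one_pole_lowpass(samples, cutoff_hz=900.0, sample_rate_hz=8000.0):
    """Single-pole IIR low-pass: y[n] = a*x[n] + (1 - a)*y[n-1],
    with a derived from the RC time-constant of the cutoff."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    a = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y = a * x + (1.0 - a) * y
        out.append(y)
    return out

# A constant (DC) input passes through: the output settles to 1.0,
# while high-frequency ripple riding on the pitch waveform is damped.
smoothed = one_pole_lowpass([1.0] * 200)
```

One multiply-accumulate per sample makes such a filter cheap even on a fixed-point DSP, which is why the trade-off discussed above is about accuracy for the intended application rather than computation.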

The speed performance of this algorithm on the Texas Instruments TMS320C25 digital signal processor permits its use in many real time applications. The algorithm utilises almost 15% of real time, leaving 85% of real time for further processing of the speech signal.

In comparison to Rabiner's results [3], the Cepstrum method of pitch detection came closest to this algorithm. The Cepstrum method requires 400 seconds of real time to process one second of speech on a Nova 800 computer with special floating point hardware.

The voiced-unvoiced classification of this algorithm is currently being improved to classify equally well over the required range of pitch periods. In many speech processing applications, analogue processing of the signal includes AGC amplification. This amplification interferes with the pattern recognition techniques used in this algorithm. By interfacing this algorithm to the gain control signal from the AGC amplifier, the pattern recognition techniques can compensate for the gain control fluctuations.
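The proposed interface can be pictured as undoing the AGC gain before the energy and zero crossing measurements are taken. Assuming the gain control signal is available per sample, a hypothetical sketch (the interface and names are illustrative, not from the paper):

```python
def undo_agc(samples, gains):
    """Restore the relative signal levels by dividing each
    AGC-amplified sample by the gain that was applied to it
    (hypothetical per-sample gain interface)."""
    return [s / g if g != 0 else 0.0 for s, g in zip(samples, gains)]

# An amplitude contour flattened by AGC (constant 0.8 output under
# rising gain) is restored to its original decaying shape.
restored = undo_agc([0.8, 0.8, 0.8], [1.0, 2.0, 4.0])
```

With the levels restored, an energy-based classifier sees the true dynamics of the signal rather than the AGC-flattened ones.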

REFERENCES

[1] Rabiner L.R. and Schafer R.W., "Digital Processing of Speech Signals," Prentice-Hall, 1978.

[2] Hess W., "Pitch Determination of Speech Signals," Springer-Verlag, 1983.

[3] Rabiner L.R. et al., "A Comparative Study of Several Pitch Detection Algorithms," IEEE Trans. Acoust., Speech, and Signal Process., vol. ASSP-24, pp. 399-417, Oct. 1976.

[4] LaGarde P.M. and Schommarz M.H., "Speech Segmentation for Real Time Processing," CSIR Internal Report I ELEK 198, Dec. 1985.

[5] Marks J.A., "A Low Bit Rate Speech Encoding Algorithm for Real-Time Applications," Proc. Fifth South African Digital Signal Processing Symposium, Rhodes University, pp. 2.4.1-2.4.4, July 6, 1987.

[6] Gold B. and Rabiner L., "Parallel Processing Techniques for Estimating Pitch Periods of Speech in the Time Domain," J. Acoust. Soc. Am., vol. 46, no. 2, pp. 442-448, Aug. 1969.

[7] Hassanein H. and Bryden B., "Implementation of the Gold-Rabiner Pitch Detector in a Real Time Environment Using an Improved Voicing Detector," IEEE Trans. Acoust., Speech, and Signal Proc., vol. ASSP-33, no. 1, pp. 319-322, Feb. 1985.

[8] Griffin D.W. and Lim J.S., "A New Pitch Detection Algorithm," Digital Signal Processing 84, Elsevier Science Publishers B.V. (North Holland), pp. 395-399, 1984.

[9] Markel J.D., "The SIFT Algorithm for Fundamental Frequency Estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 367-377, Dec. 1972.

[10] Sondhi M.M., "New Methods of Pitch Extraction," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.

[11] Noll A.M., "Cepstrum Pitch Determination," J. Acoust. Soc. Am., vol. 41, no. 2, pp. 293-309, Feb. 1967.

[12] Ross M.J. et al., "Average Magnitude Difference Function Pitch Extractor," IEEE Trans. Acoust., Speech, and Signal Proc., vol. ASSP-22, pp. 353-362, Oct. 1974.

[13] Ambikairajah E. and Carey M.J., "The Time Domain Periodogram Algorithm," Signal Processing 5, Elsevier Science Publishers B.V. (North Holland), pp. 491-513, 1983.

[14] Gold B., "Computer Program for Pitch Extraction," J. Acoust. Soc. Am., vol. 34, no. 7, pp. 916-921, July 1962.

[15] Miller N.J., "Pitch Detection by Data Reduction," IEEE Symp. Speech Recognition, pp. 122-130, Apr. 1974 (CMU).

[16] Oh K.A. and Un C.K., "A Performance Comparison of Pitch Extraction Algorithms for Noisy Speech," IEEE ICASSP 1984, pp. 18B.4.1-18B.4.4, 1984.

[17] Jayant N.S., "Coding Speech at Low Bit Rates," IEEE Spectrum, vol. 23, no. 8, pp. 58-63, Aug. 1986.

[18] Mwangi E. and Xydeas C., "Voiced-Unvoiced-Silence Classification of Speech Using Fuzzy Set Theory," MELECOM '85, vol. II, Digital Signal Processing, pp. 123-126, Elsevier Science Publishers B.V. (North Holland), 1985.

[19] van Rossum N.J.T.M. and Rietveld A.C.M., "A Perceptual Evaluation of V/U Detectors," Speech Communication 3, pp. 151-156, Elsevier Science Publishers B.V. (North Holland), 1981.

[20] Kobatake H., "Optimization of Voiced/Unvoiced Decisions in Nonstationary Noise Environments," IEEE Trans. Acoust., Speech, and Signal Proc., vol. ASSP-35, no. 1, Jan. 1987.

[21] Sims J.T., "A Speech-to-Noise Ratio Measurement Algorithm," J. Acoust. Soc. Am., vol. 78, pp. 1671-1674, Nov. 1985.

[22] Atal B. and Rabiner L.R., "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Trans. Acoust., Speech, and Signal Proc., vol. ASSP-24, no. 3, pp. 201-212, June 1976.

[23] Davis A.M., "The Real Time Implementation of a Speech Coding Algorithm," MSc. (Eng) project report, University of the Witwatersrand, Johannesburg, 1986.
