Nepali Speech Recognition

Supervisor: Dr. Basanta Joshi

Aavaas Gajurel (068/BCT/501)

Anup Pokhrel (068/BCT/505)

Manish K. Sharma (068/BCT/523)

System Overview

System Block Diagram - Training

Noise Reduction

Split Module (VAD Based)

Training Set

MFCC Features

Train HMM

System Block Diagram - Recognition

Audio Input Noise Reduction

Split Module (VAD Based)

MFCC Computation

HMM

Audio Classifier

Language Model

Output

SYSTEM DESIGN METHODOLOGY

NOISE REDUCTION

Creating Noise Profile

FOURIER TRANSFORM: FFT of 32 ms Audio Samples

AVERAGE OVER TIME: $\frac{1}{N}\left[\text{Sum of FFT Components}\right]$ over $>10$ frames

BUILD NOISE PROFILE: Update the Computed Noise Profile
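A minimal sketch of the noise-profile steps on this slide (Fourier transform of each frame, then averaging over time), assuming the profile is the per-bin mean of the magnitude spectra of noise-only frames. The function names and the toy 8-sample frames are hypothetical, and a naive DFT stands in for a real FFT so that only the standard library is needed.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude of every DFT bin of one frame (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n)
    ]

def build_noise_profile(noise_frames):
    """Average the magnitude spectra of the noise-only frames, bin by bin."""
    spectra = [dft_magnitudes(frame) for frame in noise_frames]
    count = len(spectra)
    return [sum(bins) / count for bins in zip(*spectra)]

# Hypothetical data: 10 short frames of a constant offset standing in for noise.
# The slide uses 32 ms frames; 8 samples keep the toy example fast.
noise_frames = [[0.1] * 8 for _ in range(10)]
profile = build_noise_profile(noise_frames)
```

Here all of the energy ends up in bin 0, because a constant signal has no other frequency content.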

Spectral Subtraction

FOURIER TRANSFORM OF SIGNAL: FFT of 32 ms Audio Samples

SUBTRACT NOISE PROFILE (STATIC AND MUSICAL): Over-Subtraction, Short Segment Removal

INVERSE FOURIER TRANSFORM: Rebuild the Signal
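The spectral subtraction steps on this slide can be sketched as follows, assuming magnitude-domain subtraction with an over-subtraction factor and reuse of the noisy phase. The parameter names `over` and `floor` and their values are hypothetical, the naive DFT stands in for an FFT, and the musical-noise step (short segment removal) is not shown.

```python
import cmath
import math

def dft(frame):
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n)
                 for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]

def spectral_subtract(frame, noise_profile, over=2.0, floor=0.0):
    """Subtract over * noise magnitude from each bin and rebuild the frame."""
    cleaned = []
    for c, noise_mag in zip(dft(frame), noise_profile):
        # Over-subtraction, clamped to a spectral floor so magnitudes stay >= 0.
        mag = max(abs(c) - over * noise_mag, floor * abs(c))
        # Keep the phase of the noisy signal.
        cleaned.append(cmath.rect(mag, cmath.phase(c)))
    return idft(cleaned)

# Hypothetical frame: one cosine cycle plus a constant offset acting as noise.
frame = [0.1 + math.cos(2 * math.pi * t / 8) for t in range(8)]
noise_profile = [0.4] + [0.0] * 7   # assumed noise magnitude per bin
cleaned = spectral_subtract(frame, noise_profile, over=2.0)
```

With these toy values the offset in bin 0 is removed and the cosine is left intact.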

[Figure panels: Before Noise Removal; Spectral Subtraction Output; After Spectral Subtraction; After Musical Noise Removal]

VOICE ACTIVITY DETECTION

Voice Activity Detection

Voice Activity Detection Process I

SAMPLE: 10-Frame Sampling

COMPUTE MEAN AND VARIANCE

CALCULATE THE TRIGGER: $t_w = \mu + \alpha\,\delta_w$
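A small sketch of the trigger computation, assuming that the 10 sampled frames contain no speech and that $\delta_w$ is the standard deviation (the square root of the variance the slide mentions). The value of `alpha` and the sample measures are hypothetical.

```python
import math

def trigger_threshold(measures, alpha=5.0):
    """t_w = mu + alpha * delta_w over the sampled (assumed non-speech) frames."""
    mu = sum(measures) / len(measures)
    variance = sum((m - mu) ** 2 for m in measures) / len(measures)
    delta = math.sqrt(variance)   # assumption: delta_w is the standard deviation
    return mu + alpha * delta

# Hypothetical per-frame measure values for the 10 sampled frames.
leading = [1.0, 3.0] * 5
t_w = trigger_threshold(leading, alpha=2.0)
```

For these values the mean is 2 and the standard deviation is 1, so the trigger is 2 + 2 * 1 = 4.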

Voice Activity Detection Process II

READ THE SAMPLE: Read the Frame

COMPUTE CLASSIFICATION MEASURE: $W_{s1}(m) = P_{s1}(m)\,\bigl(1 - Z_{s1}(m)\bigr)\,S_c$

CLASSIFY: If Greater Than Trigger, Then Voice
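The slide does not define the symbols in the classification measure. The sketch below assumes that $P_{s1}(m)$ is the short-time power of frame $m$, $Z_{s1}(m)$ is its zero-crossing rate, and $S_c$ is a constant scale factor. All function names and the value 1000 for the scale are hypothetical.

```python
def frame_power(frame):
    """Mean squared amplitude of the frame (assumed meaning of P)."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign (assumed meaning of Z)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def vad_measure(frame, scale=1000.0):
    """W(m) = P(m) * (1 - Z(m)) * S_c"""
    return frame_power(frame) * (1.0 - zero_crossing_rate(frame)) * scale

def is_voice(frame, trigger, scale=1000.0):
    """Classify the frame as voice when the measure exceeds the trigger."""
    return vad_measure(frame, scale) > trigger

# Hypothetical frames: a loud, slowly varying one and a quiet, rapidly flipping one.
voiced = [0.5, 0.5, 0.5, 0.5, -0.5, -0.5, -0.5, -0.5]
hiss = [0.01, -0.01] * 4
```

Under these assumptions, high power and a low zero-crossing rate both push the measure up, so the first frame scores far higher than the second.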

Feature Extraction

Audio Feature Extraction

Feature Extraction Process I

FRAMING: Divide Audio into Sections of 20 ms-40 ms

CALCULATE PERIODOGRAM ESTIMATE: $P_i(k) = \frac{1}{N}\,|S_i(k)|^2$

APPLY MEL FILTERBANK: Multiply Filterbank (20-40 Filters) by Periodogram Estimate
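The three steps of this slide (framing, periodogram estimate, mel filterbank) can be sketched as below. This is an illustration under assumptions: triangular filters spaced evenly on the mel scale, the common 2595/700 mel formula, no windowing, and a naive DFT in place of an FFT. The function names, the 8 kHz sample rate, and the choice of 6 filters in the example are hypothetical; the slide suggests 20-40 filters.

```python
import cmath
import math

def frame_signal(signal, frame_len, step):
    """Split the signal into overlapping frames, dropping any partial tail."""
    return [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, step)]

def periodogram(frame):
    """P_i(k) = |S_i(k)|^2 / N for bins 0..N/2."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        out.append(abs(s) ** 2 / n)
    return out

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge, peak of 1.0 at mid
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def filterbank_energies(frame, bank):
    """Multiply each filter by the periodogram and sum the result."""
    power = periodogram(frame)
    return [sum(w * p for w, p in zip(filt, power)) for filt in bank]

# Hypothetical input: a 500 Hz tone at 8 kHz, 20 ms frames (160 samples), 10 ms step.
rate = 8000
signal = [math.sin(2 * math.pi * 500 * t / rate) for t in range(400)]
frames = frame_signal(signal, frame_len=160, step=80)
bank = mel_filterbank(num_filters=6, n_fft=160, sample_rate=rate)
energies = filterbank_energies(frames[0], bank)
```

In this toy case the tone falls on the peak of the second filter, so almost all of the energy appears in `energies[1]`.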

Feature Extraction Process II

SCALING: Take Logarithm of Filterbank Energies

DISCRETE COSINE TRANSFORM OF ENERGIES: Take DCT of the Log Filterbank Energies

KEEP REQUIRED COEFFICIENTS: Keep Required Number of Coefficients
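A minimal sketch of the steps on this slide (logarithm, DCT, keeping the leading coefficients), assuming an unnormalized DCT-II and 13 retained coefficients. Both choices are assumptions for illustration; the slide only says to keep the required number. The small `eps` guards against taking the log of zero.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i, v in enumerate(values))
        for k in range(n)
    ]

def mfcc_from_energies(energies, num_coeffs=13, eps=1e-10):
    """Log of the filterbank energies, then DCT, then keep the first coefficients."""
    log_energies = [math.log(e + eps) for e in energies]
    return dct_ii(log_energies)[:num_coeffs]

# Hypothetical input: 26 identical filterbank energies, each with log close to 1.
coeffs = mfcc_from_energies([math.e] * 26)
```

A flat set of energies has no spectral shape, so only coefficient 0 is non-zero here (about 26); the higher coefficients describe how the log energies vary across filters.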

Language Model

Using Language Model

Language Model Training

Language Model Based Classification

READ PREVIOUS WORD

GET POSSIBLE CANDIDATES: From Acoustic Model

SELECT BEST: $\hat{P}(W_n \mid W_{n-1}) = \lambda_1\,P(W_n \mid W_{n-1}) + \lambda_2\,P(W_n)$
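The candidate selection on this slide can be sketched as an interpolated bigram model, assuming maximum-likelihood counts from a training corpus. The weights 0.7 and 0.3, the function names, and the tiny romanized corpus are hypothetical.

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams and adjacent word pairs in the training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def interpolated_prob(word, prev, unigrams, bigrams, lam1=0.7, lam2=0.3):
    """lam1 * P(word | prev) + lam2 * P(word), both estimated from counts."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam1 * p_bi + lam2 * p_uni

def select_best(prev, candidates, unigrams, bigrams):
    """Pick the acoustic-model candidate with the highest interpolated probability."""
    return max(candidates, key=lambda w: interpolated_prob(w, prev, unigrams, bigrams))

# Hypothetical training corpus and candidate list.
corpus = [["ma", "ghar", "janchu"], ["ma", "ghar", "aauchu"], ["ma", "bhat", "khanchu"]]
unigrams, bigrams = train_counts(corpus)
best = select_best("ma", ["ghar", "bhat"], unigrams, bigrams)
```

The unigram term keeps the score above zero for a candidate that was never seen after the previous word, which is the usual reason for interpolating.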

ACOUSTIC MODEL

HMM Based Classification

Training the Acoustic Model

READ MFCC COEFFICIENTS AND WORD

SELECT HMM MODEL

TRAIN USING BAUM-WELCH ALGORITHM

Using the Acoustic Model

READ MFCC COEFFICIENTS OF WORD

FIND LOG PROBABILITY OF WORD FOR EACH MODEL

SELECT MODEL WITH MAXIMUM PROBABILITY

OUTPUT WORD CORRESPONDING TO MODEL
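The recognition steps on this slide can be sketched with the forward algorithm in the log domain. The sketch assumes one discrete HMM per word and observations that are already discrete symbols (for example, vector-quantized MFCC frames); the slides do not state either detail. The model parameters and the word labels "ek" and "dui" are hypothetical.

```python
import math

def safe_log(p):
    return math.log(p) if p > 0 else -math.inf

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(obs, start, trans, emit):
    """Forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    alpha = [safe_log(start[s]) + safe_log(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        alpha = [
            log_sum_exp([alpha[p] + safe_log(trans[p][s]) for p in range(n)])
            + safe_log(emit[s][o])
            for s in range(n)
        ]
    return log_sum_exp(alpha)

def recognize(obs, models):
    """Return the word whose model gives the highest log probability."""
    return max(models, key=lambda word: log_likelihood(obs, *models[word]))

# Hypothetical two-state left-to-right models over symbols 0 and 1.
trans = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "ek": ([1.0, 0.0], trans, [[0.9, 0.1], [0.1, 0.9]]),   # prefers 0s then 1s
    "dui": ([1.0, 0.0], trans, [[0.1, 0.9], [0.9, 0.1]]),  # prefers 1s then 0s
}
word = recognize([0, 0, 1, 1], models)
```

Working with log probabilities avoids numerical underflow on long observation sequences.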

RESULTS

Trained vs. Untrained Input

• 3 Speakers
• 5×10 Words Each
• 5 Testing Sets Each

[Chart: Using Trained and Untrained Input. Accuracy of System (%)]

Trained Set: 86.67

Untrained Set: 66.67

Noise Reduced vs. Not Noise Reduced

• 3 Speakers
• 5×10 Words Each
• Untrained Input Files for Testing
• 5 Testing Sets Each

[Chart: Effect of Noise Reduction. Accuracy of System (%)]

Noise Not Reduced: 46.67

Noise Reduced: 66.67

Gender Based Results

• 7 Speakers
• 3 Females, 4 Males
• Animal Names as Test
• Untrained Input Files for Testing

[Chart: Gender Based Result. Accuracy (%)]

Training Set                      Male   Female   Male + Female
Female Voice Training             36     66       51
Male Voice Training               64     44       54
Female and Male Voice Training    59     56       58

LIMITATIONS AND RECOMMENDATIONS

Limitations

Limited Vocabulary

User Specific Noise Profiles

Static MFCC Coefficients Only

Training Data Storage Absent

Non-Continuous Recognition

Recommendations

Using Dynamic Coefficients

Continuous HMM Model

Extensive Training

Better Phonemic Modeling

Dynamic Noise Modeling

USAGE SCENARIO

Usage Scenario I

Easy Nepali Input

Automated Telecom Assistance

Speech Controlled Interface

Automated Transcribing

Usage Scenario II

Military Sector for Automated Wire Tapping

Public Guidance System

Automated User Support (banks, corporate houses, etc.)

Thank You!
