Kaldi-voice: Your personal speech recognition server using open source code
Xavier Anguera, CTO & CSO, [email protected]
Outline
• Intro
• What is speech recognition
  – Applications
• Approaches to ASR
  – Pattern matching approaches
  – Statistical-based approaches
• Available speech recognition engines
  – "open" source
  – Online commercial systems
• Building your own online system
  – Live demo
Automatic Speech Recognition
• Automatic Speech Recognition (ASR) is the process of converting an unknown speech waveform into the corresponding orthographic transcription.
Image: http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx
Content
• Search
• Summary
• Transcripts
• Meaning
Personal context
• Age
• Gender
• Height
• Spoken language
• Spoken dialect
• Spoken accent
• Literacy level
• Speaker ID
• Personality traits (OCEAN)
• Speech likability
• Speech intelligibility
• Sleepiness/tiredness
• Intoxication level
• Emotion
• State of interest
Image: Telefonica I+D
Applications of Speech Recognition/Understanding (ASR/ASU)
• Dictation
• Telephone-based information
  – directions, air travel, banking, etc.
• Polls, online shopping
• Call routing
• Hands-free
  – in car, computer, home (domotics), controlling tools
• Second language (accent reduction)
• Audio archive searching
• Help for disabled people
How do humans do it?
The articulation system of one person produces sound waves, which the ear of another person conveys to the brain for processing.
How can computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
Figure labels: acoustic waveform, acoustic signal, speech recognition
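As a rough illustration of the digitization and acoustic analysis steps, the sketch below samples a synthetic tone, cuts it into short overlapping frames and computes one log-energy value per frame. The sample rate, frame sizes and function names are assumptions chosen for illustration; real front ends usually compute richer features than a single energy value.

```python
import math

def digitize_tone(freq_hz, duration_s, sample_rate=16000):
    """Sample a sine wave: a stand-in for the output of an A/D converter."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

def frame_signal(samples, frame_len=400, hop=160):
    """Split samples into overlapping frames (25 ms long, 10 ms apart at 16 kHz)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame, floor=1e-10):
    """Log of the mean squared amplitude of one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(max(energy, floor))

samples = digitize_tone(440.0, 0.1)   # 0.1 s of a 440 Hz tone
frames = frame_signal(samples)
features = [log_energy(f) for f in frames]
print(len(samples), len(frames))
```

Each entry of `features` describes one short slice of the signal; the later stages of a recognizer work on such per-frame values rather than on raw samples.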
Challenges in ASR processing
• Speaker variability
  – Inter-speaker: vocal tract, gender, dialects
  – Intra-speaker: stress, age, mood, changes of articulation due to environment influence, …
• Language variability
  – From isolated words to continuous speech
  – Out-of-vocabulary words
• Vocabulary size and domain
  – From just a few words (e.g. isolated numbers) to large vocabulary speech recognition
  – Domain that is being recognized (medical, social, engineering, …)
• Noise
  – Convolutive: recording/transmission conditions, reverberation
  – Additive: recording environment, transmission SNR
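To make the additive noise case concrete, here is a minimal sketch that mixes a noise sequence into a clean signal and measures the signal-to-noise ratio (SNR) in decibels. The sample values and function names are made up for illustration.

```python
import math

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def snr_db(clean, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(power(clean) / power(noise))

def add_noise(clean, noise):
    """Additive noise model: each observed sample is clean + noise."""
    return [s + n for s, n in zip(clean, noise)]

clean = [1.0, -1.0, 1.0, -1.0]    # toy "speech" samples
noise = [0.1, 0.1, -0.1, -0.1]    # toy noise, ten times smaller in amplitude
noisy = add_noise(clean, noise)
print(round(snr_db(clean, noise), 1))
```

A noise amplitude ten times smaller than the signal corresponds to a power ratio of 100, that is, about 20 dB.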
Pattern-based speech recognition
• Feature measurement: filter bank, MFCC, LPC, DFT, ...
• Pattern training: creation of a reference pattern derived from an averaging technique
• Pattern classification: compare speech patterns with a local distance measure and a global time alignment procedure (DTW)
• Decision logic: similarity scores are used to decide which is the best reference pattern
Statistics-based approaches
• Can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools
• Sometimes seen as an "anti-linguistic" approach
  – Fred Jelinek (IBM, 1988): "Every time I fire a linguist my system improves"
• Process:
  1. Collect a large corpus of transcribed speech recordings
  2. Train the computer to learn the correspondences ("machine learning")
  3. At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
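The run-time step of picking the most likely solution is commonly written as a Bayes decision rule. The notation is introduced here and is not taken from the slides: X stands for the sequence of acoustic observations and W for a candidate word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

P(X) does not depend on W, so it can be dropped from the maximization. P(X | W) is usually called the acoustic model and P(W) the language model.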
Statistics-based approaches
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Deep Neural Networks (DNN)
Markov model
Output = sequence of states
Image: http://madhukaudantha.blogspot.pt/2014/05/markov-models-and-hidden-markov-models.html
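A small sketch of what "output = sequence of states" means in practice: in a Markov model the states are observed directly, so the probability of a sequence is the initial probability times the chain of transition probabilities. The two-state weather chain and its numbers are hypothetical.

```python
# Hypothetical two-state chain, used only for illustration.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def sequence_probability(states):
    """P(s1, ..., sT) = P(s1) * product over t of P(s_t | s_{t-1})."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob

print(sequence_probability(["sunny", "sunny", "rainy"]))
```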
Hidden Markov Models (HMM)
Output = observations linked to the states through a predefined probability distribution, modeled using GMM or DNN models
Image: http://izanami.tl.fukuoka-u.ac.jp/SLPL/HMM/HTKBook/node5.html
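Since the states of an HMM are hidden, a recognizer has to infer the most likely state sequence from the observations. The sketch below uses the Viterbi algorithm on a toy model; the states, observation symbols and probabilities are invented, and a discrete emission table stands in for the GMM or DNN a real system would use.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability, path) of the best path so far that ends in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prob, prev = max((best[p][0] * trans_p[p][s], p) for p in states)
            new_best[s] = (prob * emit_p[s][obs], best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical toy model: is each frame silence or speech, given its loudness?
states = ("silence", "speech")
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
emit_p = {
    "silence": {"quiet": 0.5, "medium": 0.4, "loud": 0.1},
    "speech": {"quiet": 0.1, "medium": 0.3, "loud": 0.6},
}

prob, path = viterbi(["quiet", "medium", "loud"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Production decoders apply the same idea in the log domain and over far larger state spaces, with pruning to keep the search tractable.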
A neuron in our brain
Image: http://www.medicalsciencenavigator.com/how-to-study-for-anatomy-and-physiology/why-sleep-improves-memory
DNN evolution
• We started to use multilayer perceptrons (MLPs) about 25 years ago [1]
  – Neural networks with 1 or few hidden layers
• Around 2010 G. Hinton and S. Bengio (separately) proposed methods to effectively train many hidden layers
  – Machines have become much more powerful
  – Lots of audio data with transcriptions are available
[1] "Merging Multilayer Perceptrons and Hidden Markov Models: some experiments in continuous speech recognition", Herve Bourlard and Nelson Morgan, Technical report ICSI, 1989
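For readers who have not met a multilayer perceptron, here is a minimal forward pass with one hidden layer. The weights are hand-picked placeholders rather than trained values, and the sigmoid activation is one common choice among several.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), one weight row per neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(inputs, layers):
    """Feed the input through each (weights, biases) layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = layer(activation, weights, biases)
    return activation

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5]),
    ([[1.0, 1.0]], [-1.0]),
]
output = mlp_forward([1.0, 1.0], layers)
print(output)
```

A deep network has the same structure with many more entries in `layers`; the difficulty the slide refers to lies in training those weights, which this sketch does not attempt.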
Speech recognition engines
• HTK (http://htk.eng.cam.ac.uk/), non-commercial license
• Sphinx (http://cmusphinx.sourceforge.net/), GPL
• Julius (http://julius.osdn.jp/en_index.php), open
• Kaldi (http://www.kaldi-asr.org/), Apache license
Online ASR-STT services
• Google voice (https://console.developers.google.com/project)
• ATT voice recognition (http://developer.att.com/apis/speech)
• Wit.ai (https://wit.ai/)
Building an ASR with open source tools
• We need:
  – Speech recognition engine
  – Speech databases / models
  – Online speech server
  – Frontend interfaces
My toolchain
• Kaldi ASR: http://www.kaldi-asr.org/
• Kaldi gstreamer server: https://github.com/alumae/kaldi-gstreamer-server
• Dictate.js: http://kaljurand.github.io/dictate.js/
• Kōnele app: https://kaljurand.github.io/K6nele/
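Besides the browser and Android frontends, a script can also talk to the server directly. The sketch below assumes a kaldi-gstreamer-server instance listening on localhost port 8888 and exposing the HTTP endpoint and JSON reply shape described in that project's README; the address, the reply format and the helper names are assumptions to check against your own setup.

```python
import json
import urllib.request

# Assumed address of a locally running kaldi-gstreamer-server instance.
SERVER_URL = "http://localhost:8888/client/dynamic/recognize"

def build_request(audio_bytes, url=SERVER_URL):
    """Build an HTTP PUT request carrying the raw audio as its body."""
    return urllib.request.Request(url, data=audio_bytes, method="PUT")

def parse_hypothesis(response_text):
    """Extract the top transcript from a JSON reply shaped like
    {"status": 0, "hypotheses": [{"utterance": "..."}]} (assumed format)."""
    reply = json.loads(response_text)
    if reply.get("status") != 0 or not reply.get("hypotheses"):
        return None
    return reply["hypotheses"][0]["utterance"]

def recognize(wav_path, url=SERVER_URL):
    """Send an audio file to the server and return the recognized text."""
    with open(wav_path, "rb") as f:
        request = build_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return parse_hypothesis(response.read().decode("utf-8"))
```

With a server running, `recognize("sample.wav")` would return the transcript as a string; `sample.wav` is a placeholder file name.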