Speech


Transcript of Speech

Page 4: Speech

Feature extractor: turns the speech signal into feature vectors of Mel-Frequency Cepstral Coefficients (MFCCs)
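
As a rough illustration of this step, here is a minimal sketch of MFCC extraction using the librosa Python library (librosa is not part of Sphinx4, and the file name and parameter choices are illustrative assumptions):

import librosa

# Load audio at 16 kHz; the sampling rate must match the acoustic model
# (see the WSJ vs. WSJ_8k note later in these slides).
signal, sr = librosa.load("speech.wav", sr=16000)

# 13 cepstral coefficients per frame is a common ASR front-end choice.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

print(mfccs.shape)  # (13, number_of_frames) -- one feature vector per frame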

Page 7: Speech

- Acoustic Observations
- Hidden States
- Acoustic Observation likelihoods

Page 8: Speech

“Six”

Page 12: Speech

- Constructs the HMMs for units of speech
- Produces observation likelihoods
- Sampling rate is critical! WSJ vs. WSJ_8k
- Example models: TIDIGITS, RM1, AN4, HUB4

Page 13: Speech

Word likelihoods

Page 14: Speech

ARPA format example:

\1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
\2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
\3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
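
To make the numbers concrete: each line holds a base-10 log probability, the n-gram itself, and an optional backoff weight. A minimal parsing sketch in Python, using a line from the example above (a real parser would also read the \data\ header, since the highest-order n-grams carry no backoff column):

line = "-0.7782 at the -0.2915"   # a 2-gram entry from the example above
fields = line.split()
logprob = float(fields[0])        # log10 P("the" | "at")
ngram = tuple(fields[1:-1])       # ("at", "the")
backoff = float(fields[-1])       # log10 backoff weight, applied when a
                                  # 3-gram starting with "at the" is unseen

print(ngram, 10 ** logprob)       # ('at', 'the') ~0.1666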

Page 15: Speech

Grammar example (JSGF):

public <basicCmd> = <startPolite> <command> <endPolite>;

public <startPolite> = (please | kindly | could you)*;

public <endPolite> = [ please | thanks | thank you ];

<command> = <action> <object>;

<command> continues with:

<action> = (open | close | delete | move);

<object> = [the | a] (window | file | menu);

This grammar accepts commands such as “please open the window thank you”.

Page 16: Speech

Maps words to phoneme sequences

Page 17: Speech

Example from cmudict.06d

POULTICE P OW L T AH S
POULTICES P OW L T AH S IH Z
POULTON P AW L T AH N
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T
POUNCEY P AW N S IY
POUNCING P AW N S IH NG
POUNCY P UW NG K IY
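
The lookup itself is simple. A minimal sketch of loading such a dictionary in Python (this ignores cmudict details such as ;;; comment lines and alternate pronunciations like WORD(2)):

pronunciations = {}
with open("cmudict.06d") as f:          # the file shown above
    for line in f:
        word, *phones = line.split()    # first field is the word, the rest are phones
        pronunciations[word] = phones

print(pronunciations["POULTRY"])        # ['P', 'OW', 'L', 'T', 'R', 'IY']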

Page 18: Speech

Constructs the search graph of HMMs from:
- Acoustic model
- Statistical Language model ~or~ Grammar
- Dictionary

Page 21: Speech

Can be statically or dynamically constructed

Page 24: Speech

- FlatLinguist
- DynamicFlatLinguist
- LexTreeLinguist

Page 25: Speech

Maps feature vectors to search graph

Page 27: Speech

Searches the graph for the “best fit”:

P(sequence of feature vectors | word/phone)

a.k.a. P(O|W)

-> “how likely is the input to have been generated by the word”

Page 28: Speech

Possible alignments of the phones in “five” (f ay v) to ten observation frames:

f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…

(With 3 phones and 10 frames there are C(9,2) = 36 such alignments; the count grows combinatorially with the length of the input, as the sketch below illustrates.)
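A quick way to see the combinatorics in Python (purely illustrative; the recognizer never enumerates alignments this way):

from itertools import combinations

# Each way of stretching f, ay, v over 10 frames corresponds to choosing
# the 2 boundaries between phones among the 9 gaps between frames.
boundaries = list(combinations(range(1, 10), 2))

def expand(b1, b2):
    # Build the frame labels for one choice of boundaries.
    return ["f"] * b1 + ["ay"] * (b2 - b1) + ["v"] * (10 - b2)

print(len(boundaries))             # 36 possible alignments
print(" ".join(expand(1, 5)))      # f ay ay ay ay v v v v v (first row above)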

Page 29: Speech

[Trellis diagram: observations O1, O2, O3 over time]

Page 30: Speech

Uses algorithms to weed out low-scoring paths during decoding

Page 31: Speech

Words!

Page 32: Speech

The most common metric. Measures the number of modifications needed to transform the recognized sentence into the reference sentence.

Page 36: Speech

Reference: “This is a reference sentence.”

Result: “This is neuroscience.”

Requires 2 deletions and 1 substitution.

WER = 100 × (deletions + substitutions + insertions) / length of reference

WER = 100 × (2 + 1 + 0) / 5 = 100 × 3/5 = 60%
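
The same computation in code: a minimal dynamic-programming sketch of WER (standard edit distance over words, not Sphinx4's implementation):

def wer(reference, result):
    # Word error rate via minimum edit distance (Levenshtein) over words.
    ref, hyp = reference.split(), result.split()
    # d[i][j] = fewest edits to turn the first i reference words
    # into the first j result words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("this is a reference sentence", "this is neuroscience"))  # 60.0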

Page 49: Speech

- Limited Vocab Multi-Speaker
- Extensive Vocab Single Speaker

Page 50: Speech

*If you have noisy audio input, multiply the expected error rate by 2.

Page 51: Speech

Other variables:
- Continuous vs. Isolated
- Conversational vs. Read
- Dialect

Page 52: Speech

Questions?

Page 55: Speech

[Trellis diagram: observations O1, O2, O3 over time]

Score for staying in f at O2: P(f|f) * P(O2|f)

Score for moving from f to ay at O2: P(ay|f) * P(O2|ay)

Full path score through O2: P(O1) * P(ay|f) * P(O2|ay)

Page 57: Speech

Common Sphinx4 FAQs can be found online: http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html

What follows are some less frequently asked questions.

Page 58: Speech

Q. Is a search graph created for every recognition result or one for the recognition app?

A. This depends on which Linguist is used. The FlatLinguist generates the entire search graph and holds it in memory; it is only useful for small vocab recognition tasks. The LexTreeLinguist dynamically generates search states, allowing it to handle very large vocabularies.

Page 59: Speech

Q. How does the Viterbi algorithm save computation over exhaustive search?

A. The Viterbi algorithm saves memory and computation by reusing the solutions to subproblems already solved within the larger solution. In this way, probability calculations that repeat in different paths through the search graph do not get calculated multiple times.

Viterbi cost ≈ n² – n³

Exhaustive search cost ≈ 2ⁿ – 3ⁿ
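
A toy sketch of the idea in Python, using the f-ay-v example from the earlier slides. All probabilities here are invented for illustration, and the state space is far simpler than Sphinx4's real search states:

states = ["f", "ay", "v"]
trans = {("f", "f"): 0.5, ("f", "ay"): 0.5,
         ("ay", "ay"): 0.5, ("ay", "v"): 0.5,
         ("v", "v"): 1.0}
emit = {"f": {"O1": 0.7, "O2": 0.2, "O3": 0.1},
        "ay": {"O1": 0.2, "O2": 0.6, "O3": 0.3},
        "v": {"O1": 0.1, "O2": 0.2, "O3": 0.6}}
obs = ["O1", "O2", "O3"]

# viterbi[t][s]: best score of any path ending in state s after t+1 observations.
# Reusing these per-state maxima is what avoids re-scoring every full path.
viterbi = [{"f": emit["f"]["O1"]}]  # all paths must start in "f"
for t in range(1, len(obs)):
    col = {}
    for s in states:
        scores = [viterbi[t - 1][p] * trans[(p, s)] * emit[s][obs[t]]
                  for p in viterbi[t - 1] if (p, s) in trans]
        if scores:
            col[s] = max(scores)
    viterbi.append(col)

print(viterbi[-1])  # best score per final state after O1, O2, O3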

Page 60: Speech

Q. Does the linguist use a grammar to construct the search graph if it is available?

A. Yes, a grammar graph is created.

Page 61: Speech

Q. What algorithm does the Pruner use?

A. Sphinx4 uses absolute and relative beam pruning.
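
A sketch of how the two beams interact, assuming active hypotheses are (score, state) pairs with plain probabilities. The default values mirror the properties on the following slides, but prune() is a made-up helper, not Sphinx4's API, and Sphinx4 actually works in the log domain:

def prune(hypotheses, absolute_beam_width=5000, relative_beam_width=1e-120):
    # Relative beam: drop paths scoring too far below the current best.
    best = max(score for score, _ in hypotheses)
    kept = [h for h in hypotheses if h[0] >= best * relative_beam_width]
    # Absolute beam: of the survivors, keep only the top N paths.
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:absolute_beam_width]

active = [(0.9, "path-a"), (0.5, "path-b"), (1e-200, "path-c")]
print(prune(active))   # path-c falls outside the relative beam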

Page 65: Speech

Absolute Beam Width – # of active search paths

<property name="absoluteBeamWidth" value="5000"/>

Relative Beam Width – probability threshold

<property name="relativeBeamWidth" value="1E-120"/>

Word Insertion Probability – word break likelihood

<property name="wordInsertionProbability" value="0.7"/>

Language Weight – boosts language model scores

<property name="languageWeight" value="10.5"/>

Page 67: Speech

Silence Insertion Probability – likelihood of inserting silence

<property name="silenceInsertionProbability" value=".1"/>

Filler Insertion Probability – likelihood of inserting filler words

<property name="fillerInsertionProbability" value="1E-10"/>

Page 68: Speech

To call a Java example from Python:

import subprocess

subprocess.call(["java", "-mx1000m", "-jar", "/Users/Username/sphinx4/bin/Transcriber.jar"])

Page 69: Speech

Speech and Language Processing, 2nd Ed. Daniel Jurafsky and James Martin. Pearson, 2009.

Artificial Intelligence, 6th Ed. George Luger. Addison Wesley, 2009.

Sphinx Whitepaper: http://cmusphinx.sourceforge.net/sphinx4/#whitepaper

Sphinx Forum: https://sourceforge.net/projects/cmusphinx/forums