
Unsupervised Lexicon Discovery from Acoustic Input

Chia-ying Lee1, Timothy J. O'Donnell2, and James Glass1

1Computer Science and Artificial Intelligence Laboratory
2Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

Transactions of the Association for Computational Linguistics, vol. 3, pp. 389–403, 2015. Action Editor: Eric Fosler-Lussier. Submission batch: 11/2014; Revision batch: 2/2015; Published 7/2015. © 2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license.

Abstract

We present a model of unsupervised phonological lexicon discovery—the problem of simultaneously learning phoneme-like and word-like units from acoustic input. Our model builds on earlier models of unsupervised phone-like unit discovery from acoustic data (Lee and Glass, 2012), and unsupervised symbolic lexicon discovery using the Adaptor Grammar framework (Johnson et al., 2006), integrating these earlier approaches using a probabilistic model of phonological variation. We show that the model is competitive with state-of-the-art spoken term discovery systems, and present analyses exploring the model's behavior and the kinds of linguistic structures it learns.

1 Introduction

One of the most basic problems of language acquisition is accounting for how children learn the inventory of word forms from speech—phonological lexicon discovery. In learning a language, children face a number of challenging, mutually interdependent inference problems. Words are represented in terms of phonemes, the basic phonological units of a language. However, phoneme inventories vary from language to language, and the underlying phonemes which make up individual words often have variable acoustic realizations due to systematic phonetic and phonological variation, dialect differences, speech style, environmental noise, and other factors. To learn the phonological form of words in their language children must determine the phoneme inventory of their language, identify which parts of the acoustic signal correspond to which phonemes—while discounting surface variation in the realization of individual units—and infer which sequences of phonemes correspond to which words (amongst other challenges).

Understanding the solution to this complex joint-learning problem is not only of fundamental scientific interest, but also has important applications in Spoken Language Processing (SLP). Even setting aside additional grammatical and semantic information available to child learners, there is still a sharp contrast between the type of phonological learning done by humans and current SLP methods. Tasks that involve recognizing words from acoustic input—such as automatic speech recognition and spoken term discovery—only tackle parts of the overall problem, and typically rely on linguistic resources such as phoneme inventories, pronunciation dictionaries, and annotated speech data. Such resources are unavailable for many languages, and expensive to create. Thus, a model that can jointly learn the sound patterns and the lexicon of a language would open up the possibility of automatically developing SLP capabilities for any language.

In this paper, we present a first step towards an unsupervised model of phonological lexicon discovery that is able to jointly learn, from unannotated speech, an underlying phoneme-like inventory, the pattern of surface realizations of those units, and a set of lexical units for a language. Our model builds on earlier work addressing the unsupervised discovery of phone-like units from acoustic data—in particular the Dirichlet Process Hidden Markov Model (DPHMM) of Lee and Glass (2012)—and the unsupervised learning of lexicons from unsegmented symbolic sequences using the Adaptor Grammar (AG) framework (Johnson et al., 2006; Johnson and Goldwater, 2009a). We integrate these models with a component modeling variability in the surface realization of phoneme-like units within lexical units.

In the next section, we give an overview of related work. Following this, we present our model and inference algorithms. We then turn to preliminary evaluations of the model's performance, showing that the model is competitive with state-of-the-art single-speaker spoken term discovery systems, and providing several analyses which examine the kinds of structures learned by the model. We also suggest that the ability of the system to successfully unify multiple acoustic sequences into single lexical items relies on the phonological-variability (noisy channel) component of the model—demonstrating the importance of modeling symbolic variation in phonological units. We provide preliminary evidence that simultaneously learning sound and lexical structure leads to synergistic interactions (Johnson, 2008b)—the various components of the model mutually constrain one another such that the linguistic structures learned by each are more accurate than if they had been learned independently.

2 Related Work

Previous models of lexical unit discovery have primarily fallen into two classes: models of spoken term discovery and models of word segmentation. Both kinds of models have sought to identify lexical items from input without direct supervision, but have simplified the joint learning problem discussed in the introduction in different ways.

Spoken Term Discovery Spoken term discovery is the problem of using unsupervised pattern discovery methods to find previously unknown keywords in speech. Most models in this literature have typically made use of a two-stage procedure: first, subsequences of the input that are similar in an acoustic feature space are identified, and then clustered to discover categories corresponding to lexical items (Park and Glass, 2008; Zhang and Glass, 2009; Zhang et al., 2012; Jansen et al., 2010; Aimetti, 2009; McInnes and Goldwater, 2011). This problem was first examined by Park and Glass (2008), who used Dynamic Time Warping to identify similar acoustic sequences across utterances. The input sequences discovered by this method were then treated as nodes in a similarity-weighted graph, and graph clustering algorithms were applied to produce a number of densely connected groups of acoustic sequences, corresponding to lexical items. Building on this work, Zhang and Glass (2009) and Zhang et al. (2012) proposed robust features that allowed lexical units to be discovered from spoken documents generated by different speakers. Jansen et al. (2010) present a similar framework for finding repeated acoustic patterns, based on line-segment detection in dotplots. Other variants of this approach include McInnes and Goldwater (2011), who compute similarity incrementally, and Aimetti (2009), who integrates a simplified, symbolic representation of visual information associated with each utterance.

Word Segmentation In contrast to spoken term discovery, models of word (or morpheme) segmentation start from unsegmented strings of symbols and attempt to identify subsequences corresponding to lexical items. The problem has been the focus of many years of intense research, and there are a large variety of proposals in the literature (Harris, 1955; Saffran et al., 1996a; Olivier, 1968; Saffran et al., 1996b; Brent, 1999b; Frank et al., 2010; Frank et al., 2013). Of particular interest here are models which treat segmentation as a secondary consequence of discovering a compact lexicon which explains the distribution of phoneme sequences in the input (Cartwright and Brent, 1994; Brent, 1999a; Goldsmith, 2001; Argamon et al., 2004; Goldsmith, 2006; Creutz and Lagus, 2007; Goldwater et al., 2009; Mochihashi et al., 2009; Elsner et al., 2013; Neubig et al., 2012; Heymann et al., 2013; De Marcken, 1996c; De Marcken, 1996a). Recently, a number of such models have been introduced which make use of Bayesian nonparametric distributions such as the Dirichlet Process (Ferguson, 1973) or its two-parameter generalization, the Pitman-Yor Process (Pitman, 1992), to define a prior which favors smaller lexicons with more reusable lexical items. The first such models were proposed in Goldwater (2006) and, subsequently, have been extended in a number of ways (Goldwater et al., 2009; Neubig et al., 2012; Heymann et al., 2013; Mochihashi et al., 2009; Elsner et al., 2013; Johnson et al., 2006; Johnson and Demuth, 2010; Johnson and Goldwater, 2009b; Johnson, 2008a; Johnson, 2008b).

One important lesson that has emerged from this literature is that models which jointly represent multiple levels of linguistic structure often benefit from synergistic interactions (Johnson, 2008b), where different levels of linguistic structure provide mutual constraints on one another which can be exploited simultaneously (Goldwater et al., 2009; Johnson et al., 2006; Johnson and Demuth, 2010; Johnson and Goldwater, 2009b; Johnson, 2008a; Borschinger and Johnson, 2014; Johnson, 2008b). For example, Elsner et al. (2013) show that explicitly modeling symbolic variation in phoneme realization improves lexical learning—we use a similar idea in this paper.

An important tool for studying such synergies has been the Adaptor Grammars framework of Johnson et al. (2006). Adaptor Grammars are a generalization of Probabilistic Context-Free Grammars (PCFGs) which allow the lexical storage of complete subtrees. Using Adaptor Grammars, it is possible to learn lexica which contain stored units at multiple levels of abstraction (e.g., phonemes, onsets, codas, syllables, morphemes, words, and multi-word collocations). A series of studies using the framework has shown that including such additional structure can markedly improve lexicon discovery (Johnson et al., 2006; Johnson and Demuth, 2010; Johnson and Goldwater, 2009b; Johnson, 2008a; Borschinger and Johnson, 2014; Johnson, 2008b).

Unsupervised Lexicon Discovery In contrast to models of spoken term discovery and word segmentation, our model addresses the problem of jointly inferring phonological and lexical structure directly from acoustic input. Spoken term discovery systems only attempt to detect keywords, finding lexical items that are isolated and scattered throughout the input data. They do not learn any intermediate levels of linguistic structure between the acoustic input and discovered lexical items. In contrast, our model attempts to find a complete segmentation of the input data into units at multiple levels of abstraction (e.g., phonemes, syllables, words, etc.). Unlike word segmentation models, our model works directly from speech input, integrating unsupervised acoustic modeling with an approach to symbolic lexicon discovery based on adaptor grammars.

Although some earlier systems have examined various parts of the joint learning problem (Bacchiani and Ostendorf, 1999; De Marcken, 1996b), to our knowledge, the only other system which addresses the entire problem is that of Chung et al. (2013). There are two main differences between the approaches. First, in Chung et al. (2013), word-like units are defined as unique sequences of sub-word-like units, so that any variability in the realization of word-parts must be accounted for by the acoustic models. In contrast, we explicitly model phonetic variability at the symbolic level, allowing our system to learn low-level units which tightly predict the acoustic realization of phonemes in particular contexts, while still ignoring this variability when it is irrelevant to distinguishing lexical items. Second, while Chung et al. (2013) employ a fixed, two-level representation of linguistic structure, our use of adaptor grammars to model symbolic lexicon discovery means that we can easily and flexibly vary our assumptions about the hierarchical makeup of utterances and lexical items. In this paper, we employ a simple adaptor grammar with three levels of hierarchical constituency (word-like, sub-word-like, and phone-like units) and with right-branching structure; future work could explore more articulated representations along the lines of Johnson (2008b).

3 Model

3.1 Problem Formulation and Model Overview

Given a corpus of spoken utterances, our model aims to jointly infer the phone-like, sub-lexical, and lexical units in each spoken utterance. To discover these hierarchical linguistic structures directly from acoustic signals, we divide the problem into three sub-tasks: 1) phone-like unit discovery, 2) variability modeling, and 3) sub-lexical and lexical unit learning. Each of the sub-tasks corresponds to certain latent structures embedded in the speech data that our model must identify. Here we briefly discuss the three sub-tasks as well as the latent variables associated with each, and provide an overview of the proposed model for each sub-task.

391

Figure 1: (a) An overview of the proposed model for inducing hierarchical linguistic structures directly from acoustic signals. As indicated in the graph, the model leverages partial knowledge learned from each level to drive discovery in the others. (b) An illustration of an input example, xi, and the associated latent structures in the acoustic signals: di, ui, oi, ~vi, zi. These latent structures can each be discovered by one of the three components of the model, as specified by the red horizontal bars between (a) and (b).

Phone-like unit discovery For this sub-task, the model converts the speech input xi into a sequence of Phone-Like Units (PLUs), ~vi, which implicitly determines the phone segmentation, zi, in the speech data as indicated in (iv)-(vi) of Fig. 1-(b). We use xi = {xi,t | xi,t ∈ R39, 1 ≤ t ≤ Ti} to denote the series of Mel-Frequency Cepstral Coefficients (MFCCs) representing the ith utterance (Davis and Mermelstein, 1980), where Ti stands for the total number of feature frames in utterance i. Each xi,t contains 13-dimensional MFCCs and their first- and second-order time derivatives at a 10 ms frame rate. Each xi,t is also associated with a binary variable zi,t, indicating whether a PLU boundary exists between xi,t and xi,t+1. The feature vectors with zi,t = 1 are highlighted by the dark blue bars in Fig. 1-(vi), which correspond to segment boundaries.

Each speech segment is labelled with a PLU id vi,j,k ∈ L, in which L is a set of integers that represent the PLU inventory embedded in the speech corpus. We denote the sequence of PLU ids associated with utterance i using ~vi as shown in Fig. 1-(iv), where ~vi = {vi,j | 1 ≤ j ≤ Ji} and vi,j = {vi,j,k | vi,j,k ∈ L, 1 ≤ k ≤ |vi,j|}. The variable Ji is defined in the discussion of the second sub-task. As depicted in Fig. 1-(a), we construct an acoustic model to approach this sub-problem. The acoustic model is composed of a set of Hidden Markov Models (HMMs), π, that are used to infer and model the PLU inventory from the given data.
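
For concreteness, the sketch below shows one way to build this kind of 39-dimensional feature representation. It is not the paper's code: the use of the librosa library, and the specific parameter settings, are our own assumptions; only the dimensionality (13 MFCCs plus first- and second-order derivatives) and the 10 ms frame rate follow the text.

```python
# Sketch (ours): a 39-dimensional MFCC(+delta+delta-delta) representation
# comparable to the x_i described above. librosa is our choice here, not the
# paper's; 13 MFCCs and a 10 ms hop follow the text.
import numpy as np
import librosa

def mfcc_39(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)                       # 10 ms frame rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)   # first-order time derivatives
    d2 = librosa.feature.delta(mfcc, order=2)   # second-order time derivatives
    return np.vstack([mfcc, d1, d2]).T          # shape: (T_i, 39)
```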

Phonological variability modeling In conversational speech, the phonetic realization of a word can easily vary because of phonological and phonetic context, stress pattern, etc. Without a mechanism that can map these speech production variations into a unique representation, any model that induces linguistic structures based on phonetic input would fail to recognize these pronunciations as instances of the same word type. We exploit a noisy-channel model to address this problem and design three edit operations for the noisy-channel model: substitute, split, and delete. Each of the operations takes a PLU as an input and is denoted as sub(u), split(u), and del(u), respectively. We assume that for every inferred sequence of PLUs ~vi in Fig. 1-(b)-(iv), there is a corresponding series of PLUs, ui = {ui,j | 1 ≤ j ≤ Ji}, in which the pronunciations for any repeated word in ~vi are identical. The variable Ji indicates the length of ui. By passing each ui,j through the noisy-channel model, which stochastically chooses an edit operation oi,j for ui,j, we obtain the noisy phonetic realization vi,j.1 The relationship among ui, oi, and ~vi is shown in (ii)-(iv) of Fig. 1-(b). For notational simplicity, we let oi,j encode ui,j and vi,j, which means we can read ui,j and vi,j directly from oi,j. We refer to the units that are input to the noisy-channel model ui as "top-layer" PLUs and the units that are output from the noisy-channel model ~vi as "bottom-layer" PLUs.

1 We denote vi,j as a vector since, if a split operation is chosen for ui,j, the noisy-channel model will output two PLUs.

392

Sub-word-like and word-like unit learning With the standardized phone-like representation ui, higher-level linguistic structures can be inferred for each spoken utterance. We employ Adaptor Grammars (AGs) (Johnson et al., 2006) to achieve this goal, and use di to denote the parse tree that encodes the hierarchical linguistic structures in Fig. 1-(b)-(i). We bracket sub-word-like (e.g., syllable) and word-like units using [·] and (·), respectively.

In summary, our model integrates adaptor grammars with a noisy-channel model of phonetic variability and an acoustic model to discover hierarchical linguistic structures directly from acoustic signals. Even though we have discussed these sub-tasks in a bottom-up manner, our model provides a joint learning framework, allowing knowledge learned from one sub-task to drive discovery in the others. We now review the formalization of adaptor grammars and define the noisy-channel and acoustic components of our model. We conclude this section by presenting the generative process implied by the three components of our model.

3.2 Adaptor Grammars

Adaptor grammars are a non-parametric Bayesian extension of Probabilistic Context-Free Grammars (PCFGs). A PCFG can be defined as a quintuple (N, T, R, S, {~θq}q∈N), which consists of disjoint finite sets of non-terminal symbols N and terminal symbols T, a finite set of production rules R ⊆ {N → (N ∪ T)*}, a start symbol S ∈ N, and vectors of probability distributions {~θq}q∈N. Each ~θq contains the probabilities associated with the rules that have the non-terminal q on their left-hand side. We use θr to indicate the probability of rule r ∈ R. We adopt a Bayesian approach and impose a Dirichlet prior on each ~θq ∼ Dir(~αq).

Let t denote a complete derivation, which represents either a tree that expands from a non-terminal q to its leaves, which contain only terminal symbols, or a tree that is composed of a single terminal symbol. We define root(t) as a function that returns the root node of t and denote the k immediate subtrees of the root node as t1, · · · , tk. The probability distribution over T^q, the set of trees that have q ∈ N ∪ T as the root, is recursively defined as follows.

$$
G^q_{\mathrm{pcfg}}(t) =
\begin{cases}
\sum_{r \in R^q} \theta_r \prod_{i=1}^{k} G^{\mathrm{root}(t_i)}_{\mathrm{pcfg}}(t_i) & \mathrm{root}(t) = q \in N \\[4pt]
1 & \mathrm{root}(t) = q \in T
\end{cases}
$$
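
As a concrete illustration of this recursion, the small sketch below (ours; the rule table is a toy example) computes G_pcfg for a derivation written as nested tuples. Note that only the rule actually used at the root of t contributes to the sum, so the sketch looks that rule up directly.

```python
# Sketch (ours): the PCFG recursion above, for trees written as nested tuples
# (label, child_1, ..., child_k); a bare string is a terminal (probability 1).
theta = {  # toy rule probabilities, indexed by (lhs, rhs-tuple)
    ("Word", ("Sub-word",)): 0.6,
    ("Word", ("Sub-word", "Sub-word")): 0.4,
    ("Sub-word", ("p1", "p2")): 1.0,
}

def g_pcfg(t):
    if isinstance(t, str):                       # root(t) is a terminal
        return 1.0
    label, children = t[0], t[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = theta[(label, rhs)]                      # theta_r for the rule used at the root
    for c in children:                           # product over immediate subtrees
        p *= g_pcfg(c)
    return p

print(g_pcfg(("Word", ("Sub-word", "p1", "p2"))))  # 0.6 * 1.0 = 0.6
```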

An adaptor grammar is a sextuple (N, T, R, S, {~θq}q∈N, {Y^q}q∈N), in which (N, T, R, S, {~θq}q∈N) is a PCFG, and {Y^q}q∈N is a set of adaptors for the non-terminals. An adaptor Y^q is a function that maps a base distribution over T^q to a distribution over distributions over T^q. The distribution G^q_ag(t) for q ∈ N of an AG is a sample from this distribution over distributions. Specifically,

$$
G^q_{\mathrm{ag}}(t) \sim Y^q(H^q(t)), \qquad
H^q(t) = \sum_{r \in R^q} \theta_r \prod_{i=1}^{k} G^{\mathrm{root}(t_i)}_{\mathrm{ag}}(t_i),
$$

where H^q(t) denotes the base distribution over T^q. In this paper, following Johnson et al. (2006), we use adaptors that are based on Pitman-Yor processes (Pitman and Yor, 1997). For terminal symbols q ∈ T, we define G^q_ag(t) = 1, which is a distribution that puts all its probability mass on the single-node tree labelled q. Conceptually, AGs can be regarded as PCFGs with memories that cache the complete derivations of adapted non-terminals, allowing the AG to choose either to reuse the cached trees or to select an underlying production rule in R to expand each non-terminal. For a more detailed description of AGs and their connection to PCFGs, we refer readers to Johnson et al. (2006) and Chapter 3 of O'Donnell (2015).
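
To give a feel for what a Pitman-Yor adaptor does, the sketch below (ours; the discount and concentration values are illustrative only) draws a tree either by reusing a cached derivation, with probability proportional to its count minus a discount, or by generating a fresh one from the base distribution H^q.

```python
# Sketch (ours): the caching behaviour of a Pitman-Yor adaptor Y^q.
# Cached derivations are reused in proportion to (count - d); a new derivation
# is drawn from the base distribution with mass (alpha + d * number-of-cached-types).
import random

def py_adaptor_sample(cache, sample_from_base, d=0.5, alpha=1.0):
    """cache: dict mapping a cached derivation to its reuse count (mutated in place)."""
    n = sum(cache.values())                     # total reuse count
    new_mass = alpha + d * len(cache)
    r = random.uniform(0, n + alpha)            # total mass = sum(c - d) + new_mass
    if r < new_mass:
        t = sample_from_base()                  # generate from H^q (may itself recurse)
    else:
        r -= new_mass
        for t, c in cache.items():              # reuse a cached tree
            r -= c - d
            if r <= 0:
                break
    cache[t] = cache.get(t, 0) + 1
    return t
```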

To discover the latent hierarchical linguistic structures in spoken utterances, we employ the following AG to parse each spoken utterance, where we adopt the notation of Johnson and Goldwater (2009b) and use underlines to indicate adapted non-terminals and + to abbreviate right-branching recursive rules for non-terminals. The last rule shows that the terminals of this AG are the PLU ids, which are represented as ui and depicted as the units in the squares of Fig. 1-(b)-(ii).

Sentence → Word+
Word → Sub-word+
Sub-word → Phone+
Phone → l for l ∈ L

Note that the grammar above only makes use of right-branching rules and therefore could be simulated using finite-state infrastructure, rather than the more complex context-free machinery implicit in the adaptor grammars framework. We nevertheless make use of the formalism for two reasons. First, on a theoretical level it provides a uniform framework for expressing many different assumptions about the symbolic component of segmentation models (Goldwater et al., 2009; Johnson, 2008b; Borschinger and Johnson, 2014; Johnson, 2008a; Johnson and Goldwater, 2009b; Johnson and Demuth, 2010). Using adaptor grammars to formalize the symbolic component of our model thus allows direct comparisons to this literature as well as transparent extensions following earlier work. Second, on a practical level, using the framework allowed us to make use of Mark Johnson's efficient implementation of the core adaptor grammar sampling loop, significantly reducing model development time.

3.3 Noisy-channel Model

We formulate the noisy-channel model as a PCFG and encode the substitute, split, and delete operations as grammar rules. In particular, for l ∈ L,

$$
\begin{aligned}
l &\rightarrow l' && \text{for } l' \in L \\
l &\rightarrow l'_1\, l'_2 && \text{for } l'_1, l'_2 \in L \\
l &\rightarrow \epsilon
\end{aligned}
\tag{1}
$$

where l ∈ L are the start symbols as well as the non-terminals of the PCFG. The terminals of this PCFG are l′ ∈ L, which correspond to bottom-layer PLUs ~vi that are depicted as units in circles in Fig. 1-(b)-(iv). Note that {l} and {l′} are drawn from the same inventory of PLUs, and the notation is meant to signal that {l′} are the terminal symbols of this grammar. The three sets of rules respectively map to the sub(·), split(·), and del(·) operations; thus, the probability of each edit operation is automatically captured by the corresponding rule probability. Note that to simultaneously infer a phonetic inventory of an unknown size and model phonetic variation, we can use the infinite PCFG (Liang et al., 2007) to formulate the noisy-channel model. However, for computational efficiency, in our experiments, we infer the size of the PLU inventory before training the full model, and impose a Dirichlet prior on the rule probability distribution associated with each non-terminal l. We explain how inventory size is determined in Sec. 5.2.
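
The sketch below (ours; the probability table is illustrative, not learned values) shows how a top-layer PLU sequence can be passed through these three rule types to produce bottom-layer PLUs.

```python
# Sketch (ours): applying the substitute/split/delete rules of the noisy-channel
# PCFG to a top-layer PLU sequence u_i. edit_probs[l] stands in for the
# Dirichlet-distributed rule probabilities for non-terminal l; values are toy.
import random

def apply_noisy_channel(top_plus, edit_probs):
    """top_plus: list of top-layer PLU ids. Returns (edit operations, bottom-layer PLUs)."""
    ops, bottom = [], []
    for u in top_plus:
        rules, weights = zip(*edit_probs[u].items())
        op = random.choices(rules, weights=weights)[0]
        ops.append((u, op))
        if op[0] == "sub":                       # l -> l'
            bottom.append(op[1])
        elif op[0] == "split":                   # l -> l'_1 l'_2
            bottom.extend(op[1])
        # op[0] == "del": l -> epsilon, emits nothing
    return ops, bottom

# toy usage: PLU 7 is usually kept, sometimes split or deleted
edit_probs = {7: {("sub", 7): 0.8, ("sub", 3): 0.05,
                  ("split", (7, 12)): 0.1, ("del",): 0.05}}
print(apply_noisy_channel([7, 7], edit_probs))
```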

3.4 Acoustic Model

Finally, we assign each discovered PLU l ∈ L to an HMM, πl, which is used to model the speech realization of each phonetic unit in the feature space. In particular, to capture the temporal dynamics of the features associated with a PLU, each HMM contains three emission states, which roughly correspond to the beginning, middle, and end of a phonetic unit (Jelinek, 1976). We model the emission distribution of each state by using 39-dimensional diagonal Gaussian Mixture Models (GMMs). The prior distributions embedded in the HMMs are the same as those described in (Lee and Glass, 2012).
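
As a rough illustration of this parameterization, the sketch below (ours; the number of mixture components, the transition values, and the exit probability are all placeholders) lays out one PLU's HMM as three left-to-right states with diagonal-covariance GMM emissions, and samples a synthetic segment from it.

```python
# Sketch (ours): the parameter layout of one PLU's HMM pi_l — three left-to-right
# emission states, each with a diagonal-covariance GMM over 39-dim features.
import numpy as np

DIM, STATES, MIX = 39, 3, 2                      # 2 mixtures per state is a placeholder
rng = np.random.default_rng(0)
hmm = {
    "trans": np.array([[0.6, 0.4, 0.0],          # left-to-right transitions
                       [0.0, 0.6, 0.4],
                       [0.0, 0.0, 1.0]]),
    "weights": np.full((STATES, MIX), 1.0 / MIX),
    "means": rng.normal(size=(STATES, MIX, DIM)),
    "vars": np.ones((STATES, MIX, DIM)),         # diagonal covariances
}

def sample_segment(hmm, rng, exit_prob=0.4):
    """Generate one synthetic speech segment (39-dim frames) from pi_l."""
    frames, s = [], 0
    while True:
        m = rng.choice(MIX, p=hmm["weights"][s])
        frames.append(rng.normal(hmm["means"][s, m], np.sqrt(hmm["vars"][s, m])))
        if s == STATES - 1 and rng.random() < exit_prob:   # leave the last state
            return np.stack(frames)
        s = rng.choice(STATES, p=hmm["trans"][s])
```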

3.5 Generative Process of the Proposed Model

With the adaptor grammar, the noisy-channel model, and the acoustic model defined, we summarize the generative process implied by our model as follows. For the ith utterance in the corpus, our model

1. Generates a parse tree di from G^Sentence_ag(·).

2. For each leaf node ui,j of di, samples an edit rule oi,j from ~θui,j to convert ui,j to vi,j.

3. For vi,j,k ∈ vi,j, 1 ≤ k ≤ |vi,j|, generates the speech features using πvi,j,k, which deterministically sets the value of zi,t.

Thus, the latent variables our model defines for each utterance are: di, ui, oi, ~vi, zi, π, {~θq}q∈Nag∪Nnoisy-channel. In the next section, we derive inference methods for all the latent variables except for {~θq}q∈Nag∪Nnoisy-channel, which we integrate out during the inference process.

4 Inference

We exploit Markov chain Monte Carlo algorithms to generate samples from the posterior distribution over the latent variables. In particular, we construct three Markov chain kernels: 1) jointly sampling di, oi, ui; 2) generating new samples for ~vi, zi; and 3) updating π. Here, we give an overview of each of the sampling moves.

4.1 Sampling di, oi, and implicitly ui

We employ the Metropolis-Hastings (MH) algorithm (Chib and Greenberg, 1995) to generate samples for di and oi, which implicitly determines ui.

394

Given d−i, o−i, the current parses and the current edit operations associated with all the sentences in the corpus except the ith utterance, we can construct a proposal distribution for d′i and o′i2 by using the approximating PCFG described in Johnson et al. (2006) and the approximated probability of o′i,j in o′i, which is defined in Eq. 2.

$$
p(o'_{i,j} \mid o_{-i}; \{\vec{\alpha}^q\}) \approx
\frac{C_{-i}(u'_{i,j} \rightarrow v'_{i,j}) + \alpha^{u'_{i,j}}_{u'_{i,j} \rightarrow v'_{i,j}}}
     {C_{-i}(u'_{i,j}) + \sum_{r \in R^{u'_{i,j}}} \alpha^{u'_{i,j}}_r}
\tag{2}
$$

where q ∈ Nnoisy-channel, and C−i(w) denotes the number of times that w is used in the analyses for the corpus, excluding the ith utterance, in which w can be any countable entity such as a rule or a symbol.

More specifically, we combine the PCFG that approximates the adaptor grammar with the noisy-channel PCFG whose rules are weighted as in Eq. 2 to form a new PCFG G′. The new PCFG G′ is thus a grammar that can be used to parse the terminals ~vi and generate derivations that are rooted at the start symbol of the AG. Therefore, we transform the task of sampling d′i and o′i into the task of generating a parse for ~vi using G′, which can be efficiently solved by using a variant of the Inside algorithm for PCFGs (Lari and Young, 1990; Johnson et al., 2007; Goodman, 1998; Finkel et al., 2006).
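
The approximation in Eq. 2 is simply the posterior predictive probability of an edit rule under its Dirichlet prior, computed from counts that exclude the current utterance. A small sketch of that computation is given below (ours; variable names are our own).

```python
# Sketch (ours): the count-based approximation of Eq. 2. counts[(u, v)] holds
# C_{-i}(u -> v), the number of times edit rule u -> v is used in the analyses of
# all utterances except the i-th; alpha[(u, v)] is the Dirichlet pseudo-count;
# rules_for[u] lists the right-hand sides allowed for non-terminal u.
def edit_rule_prob(u, v, counts, alpha, rules_for):
    """Approximate p(o'_{i,j} = (u -> v) | o_{-i}; alpha)."""
    numer = counts.get((u, v), 0.0) + alpha[(u, v)]
    denom = sum(counts.get((u, w), 0.0) + alpha[(u, w)] for w in rules_for[u])
    return numer / denom
```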

4.2 Sampling ~vi and zi

Given the top-layer PLUs ui and the speech data xi, sampling the boundary variables zi and the bottom-layer PLUs ~vi is equivalent to sampling an alignment between ui and xi. Therefore, we use the probabilities defined in Eq. 2 and employ the backward message-passing and forward-sampling algorithm described in Lee et al. (2013), designed for aligning a letter sequence and speech signals, to propose samples for ~vi and zi. The proposals are then accepted by using the standard MH criterion.

4.3 Sampling π

Given zi and ~vi of each utterance in the corpus, generating new samples for the parameters of each HMM πl for l ∈ L is straightforward.

2 We use di and d′i to denote the current and the proposed parses. The same relationship is also defined for oi and o′i.

Lecture topic         Duration
Economics             75 mins
Speech processing     85 mins
Clustering            78 mins
Speaker adaptation    74 mins
Physics               51 mins
Linear algebra        47 mins

Table 1: A brief summary of the six lectures used for the experiments reported in Section 6.

For each PLU l, we gather all speech segments that are mapped to a bottom-layer PLU vi,j,k = l. For every segment in this set, we use πl to block-sample the state id and the GMM mixture id for each feature vector. From the state and mixture assignments, we can collect the counts that are needed to update the priors for the transition probability and the emission distribution of each state in πl. New samples for the parameters of πl can thus be yielded from the updated priors.
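
As one piece of this step, the transition distributions of πl can be resampled from a Dirichlet posterior once state assignments have been collected; the sketch below (ours; the symmetric prior value is illustrative, and the left-to-right structural zeros are ignored for simplicity) shows that update, and the emission GMM parameters are updated analogously from their own sufficient statistics.

```python
# Sketch (ours): resampling the transition matrix of one PLU's HMM from its
# Dirichlet posterior, given state sequences block-sampled for its segments.
import numpy as np

def resample_transitions(state_seqs, num_states=3, prior=1.0, rng=None):
    rng = rng or np.random.default_rng()
    counts = np.full((num_states, num_states), prior)   # Dirichlet pseudo-counts
    for seq in state_seqs:                               # one state sequence per segment
        for s, s_next in zip(seq[:-1], seq[1:]):
            counts[s, s_next] += 1
    return np.stack([rng.dirichlet(row) for row in counts])
```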

5 Experimental Setup

5.1 Dataset

To the best of our knowledge, there are no standard corpora for evaluating models of unsupervised lexicon discovery. In this paper, we perform experiments on the six lecture recordings used in (Park and Glass, 2008; Zhang and Glass, 2009), a part of the MIT Lecture corpus (Glass et al., 2004). A brief summary of the six lectures is listed in Table 1.

5.2 Systems

Full system We constructed two systems based on the model described in Sec. 3. These systems, FullDP and Full50, differ only in the size of the PLU inventory (K). For FullDP, we set the value of K to be the number of PLUs discovered for each lecture by the DPHMM framework presented in (Lee and Glass, 2012). These numbers were: Economics, 99; Speech Processing, 111; Clustering, 91; Speaker Adaptation, 83; Physics, 90; and Algebra, 79. For Full50, we used a fixed number of PLUs, K = 50.

The acoustic component of the FullDP system was initialized by using the output of the DPHMM model for each lecture. Specifically, we made use of the HMMs, the phone boundaries, and the PLUs that the DPHMM model found as the initial values for π, zi, and ~vi of the FullDP system. After initialization, the training of FullDP proceeds by following the three sampling moves described in Sec. 4. Similarly, we employ a Hierarchical HMM (HHMM), which is presented in detail in (Lake et al., 2014), to find the initial values of π, zi, and ~vi for the Full50 system. The adaptor grammar component of all systems was initialized following the "batch initialization" method described in Johnson and Goldwater (2009b), which independently samples an AG analysis for each utterance. The lesioned systems that are described in the rest of this section were also initialized in the same manner.

To reduce the inference load on the HMMs, we exploit acoustic cues in the feature space to constrain phonetic boundaries to occur at a subset of all possible locations (Lee and Glass, 2012). We follow the pre-segmentation method described in (Glass, 2003) to achieve this goal. Empirically, this boundary elimination heuristic reduces the computational complexity of the inference algorithm by roughly an order of magnitude on clean speech corpora.
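
To illustrate the general idea of such a boundary elimination heuristic, the sketch below (ours) proposes candidate boundaries at local peaks of frame-to-frame spectral change; the actual pre-segmentation method of Glass (2003) differs in its details.

```python
# Sketch (ours): a simple acoustic-change heuristic for proposing candidate
# boundary locations. It only restricts boundaries to peaks in frame-to-frame
# feature change; the method used in the paper is more elaborate.
import numpy as np

def candidate_boundaries(x, min_gap=2):
    """x: (T, 39) feature matrix. Returns frame indices allowed to host a boundary."""
    diffs = np.linalg.norm(np.diff(x, axis=0), axis=1)   # change between adjacent frames
    peaks = [t for t in range(1, len(diffs) - 1)
             if diffs[t] > diffs[t - 1] and diffs[t] >= diffs[t + 1]]
    kept = []
    for t in peaks:                                       # enforce minimum spacing
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept
```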

No acoustic model We remove the acoustic model from FullDP and Full50 to obtain the -AM systems. Since the -AM systems do not have an acoustic model, they cannot resegment or relabel the data, which implies that there is no learning of phonetic units in the -AM systems, making them similar to symbolic segmentation models that include a noisy-channel component (Elsner et al., 2013). By comparing a -AM system to its full counterpart, we can investigate the synergies between phonetic and lexical unit acquisition in the full model.

No noisy-channel To evaluate the importance of modeling phonetic variability, we remove the noisy-channel model from the -AM systems to form the -NC systems. A -NC system can be regarded as a pipeline in which utterance phone sequences are discovered first and latent higher-level linguistic structures are learned in a second step; it is thus similar to models such as that of Jansen et al. (2010).

6 Results and Analyses

Training convergence Fig. 2 shows the negative log posterior probability of the sampled parses d and edit operations o for each lecture as a function of iteration, generated by the FullDP system.


Figure 2: The negative log posterior probability of the latent variables d and o as a function of iteration, obtained by the FullDP model for each lecture.

Given that each lecture consists of roughly only one hour of speech data, we can see that the model converges fairly quickly, within a couple hundred iterations. In this section, we report the performance of each system using the corresponding sample from the 200th iteration.

Phoneme segmentation We first evaluate our model on the task of phone segmentation for the six lectures. We use a speech recognizer to produce phone forced alignments for each lecture. The phone segmentation embedded in the forced alignments is then treated as the gold standard to which we compare the segmentation our model generates. We follow the suggestion of (Scharenborg et al., 2010) and use a 20-ms tolerance window to compute the F1 score of all proposed phone segmentations.
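
Concretely, boundary F1 with a tolerance window can be computed by greedily matching each hypothesized boundary to at most one reference boundary within the window, as in the sketch below (ours; the exact matching procedure used in the paper's evaluation is not specified beyond the tolerance).

```python
# Sketch (ours): boundary F1 with a +/-20 ms tolerance window. Boundary times
# are in seconds; each reference boundary may be matched at most once.
def boundary_f1(hyp, ref, tol=0.020):
    hyp, ref = sorted(hyp), sorted(ref)
    matched, used = 0, [False] * len(ref)
    for b in hyp:
        for j, r in enumerate(ref):
            if not used[j] and abs(b - r) <= tol:
                used[j] = True
                matched += 1
                break
    precision = matched / len(hyp) if hyp else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```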

Table 2 presents the F1 scores achieved by different systems. Because the -AM and -NC systems do not do inference over acoustic segmentations, we compare the phoneme-segmentation performance of each full system to its performance at initialization. Recall that the Full50 system is initialized using the output of a Hierarchical HMM (HHMM), and the FullDP system is initialized using DPHMM.

From Table 2 we can see that the two -AM systems achieve roughly the same segmentation performance for the first four lectures. Aside from using the boundary elimination method described in (Lee and Glass, 2012), these two systems are trained independently.


Lecture topic         Full50   HHMM   FullDP   DPHMM
Economics             74.4     74.6   74.6     75.0
Signal processing     76.2     76.0   76.0     76.3
Clustering            76.6     76.6   77.0     76.9
Speaker adaptation    76.5     76.9   76.7     76.9
Physics               75.9     74.9   75.7     75.8
Linear algebra        75.5     73.8   75.5     75.7

Table 2: The F1 scores for the phone segmentation task obtained by the full systems and their corresponding initialization systems.

Lecture topic         Full50   -AM    -NC    FullDP   -AM    -NC
Economics             15.4     15.4   14.5   16.1     14.9   13.8
Signal processing     17.5     16.4   12.1   18.3     17.0   14.5
Clustering            16.7     18.1   15.9   18.4     16.9   15.2
Speaker adaptation    17.3     17.4   15.4   18.7     17.6   16.2
Physics               17.7     17.9   15.6   20.0     18.0   15.2
Linear algebra        17.9     17.5   15.4   20.0     17.0   15.6

Table 3: F1 scores for word segmentation obtained by the full systems and their ablated systems.

The table also shows that the two initialization systems achieve roughly the same segmentation performance for the first four lectures. Thus, their narrow performance gap indicates that the two initialization systems may have already found a near-optimal segmentation.

Since our model also looks for the best segmentation in the same hypothesis space, by initializing the boundary variables around the optimum, our model should simply maintain the segmentation. In particular, as shown in Table 2, the full systems also achieve about the same performance as the -AM systems for the first four lectures, with the overall largest performance difference being 0.4%.

It is perhaps more interesting when the initialization system gets stuck at a local optimum. By comparing the performance of the two -AM systems for the last two lectures, we can see that the initialization of Full50 converges to local optima for those two lectures. Nonetheless, as shown in Table 2, the Full50 system is able to improve the given initial segmentation and reach a performance similar to that accomplished by the FullDP system and the initialization of the FullDP system.

Word segmentation In addition to phone segmentation, we also evaluate our model on the task of word segmentation. Similar to how we generate the gold standard segmentation for the previous task, we use a speech recognizer to produce word alignments and obtain the word segmentation for each lecture. We then compare the word segmentation that our systems generate to the gold standard and calculate the F1 scores by using a 20-ms tolerance window.

By comparing the full systems to their -NC counterparts, we can see that the noisy-channel model plays an important role in word segmentation, which resonates with the findings of Elsner et al. (2013). Without the capability of modeling phonetic variability, it is difficult, or even impossible, for the -NC systems to recognize word tokens of the same type but with different phonetic realizations.

We can also observe the advantage of jointly learning both word-level and phone-level representations by comparing the FullDP system to the corresponding -AM model. On average, the FullDP system outperforms its -AM ablated counterpart by 1.6% on the word segmentation task, which indicates that the top-down word-level information can help refine the phone-level knowledge that the model has learned.

While similar improvements are only observed in two lectures for the Full50 system and its -AM version, we believe this is because the Full50 system does not have as much flexibility to infer the phonetic embeddings as the FullDP system.


Lecture topic         Full50   -AM   -NC   FullDP   -AM   -NC   P&G 2008   Zhang 2013
Economics             12       4     2     12       9     6     11         14
Signal processing     16       16    5     20       19    14    15         19
Clustering            18       17    9     17       18    13    16         17
Speaker adaptation    14       14    8     19       17    13    13         19
Physics               20       14    12    20       18    16    17         18
Linear algebra        18       16    11    19       17    7     17         16

Table 4: The number of the 20 target words discovered by each system described in Sec. 5, and by the baseline (Park and Glass, 2008) and state-of-the-art system (Zhang, 2013). The best performance achieved for each lecture is shown in bold.

This inflexibility may have prevented the Full50 system from fully exploiting the top-down information for learning.

Finally, note that even though the F1 scores for the word segmentation task are low, we find similar performance reported in Jansen et al. (2013). We would like to raise the question of whether the conventional word segmentation task is a proper evaluation method for an unsupervised model such as the one described in this paper. Our thinking is twofold. First, correct segmentation is vaguely defined: by choosing different tolerance windows, different segmentation performance is obtained. Second, as we show later, many of the units discovered by our model are linguistically meaningful, although they do not always strictly correspond to words (i.e., the units may be morphemes or collocations, etc.). Since these are linguistically meaningful units which should be identified by an unsupervised lexical discovery model, it is not clear what advantage would be gained by privileging words in the evaluation. Nevertheless, we present the word segmentation performance achieved by our model in this paper for future reference.

Coverage of words with high TFIDF scores To assess the performance of our model, we evaluate the degree to which it was able to correctly recover the vocabulary used in the input corpora. To facilitate comparison with the baseline (Park and Glass, 2008) and state-of-the-art (Zhang, 2013) spoken term discovery systems, we restrict attention to the top 20 highest TFIDF-scoring words for each lecture. Note that the set of target words for each lecture was originally chosen in (Park and Glass, 2008) and used as the evaluation set in both Park and Glass (2008) and Zhang (2013). To compare our system directly to previous work, we use the same set of target words to test our model.3
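
For reference, a TFIDF ranking of this kind can be computed as in the sketch below (ours; the exact weighting used by Park and Glass (2008) to select the original target words may differ in detail).

```python
# Sketch (ours): ranking words by TFIDF across a set of lecture transcripts to
# select the top-20 target words of one lecture, using the standard
# tf * log(N / df) weighting.
import math
from collections import Counter

def top_tfidf_words(lectures, target_idx, k=20):
    """lectures: list of token lists, one per lecture."""
    n = len(lectures)
    df = Counter()
    for lec in lectures:
        df.update(set(lec))                       # document frequency per word
    tf = Counter(lectures[target_idx])            # term frequency in the target lecture
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```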

Table 4 summarizes the coverage achieved by each variant of our model as well as the two spoken term discovery systems. Note that these two systems differ only in the acoustic features used to represent the input: MFCC features versus more robust Gaussian posteriorgrams, respectively. Both the FullDP and Full50 systems consistently outperform the baseline. Both systems also exceed or perform as well as the state-of-the-art system on most lectures. Furthermore, they do so using the less robust MFCC feature-based acoustic representation.

The results in Table 4 also illustrate the synergistic interactions that occur between the acoustic and symbolic model components. As described in Sec. 5, the -AM systems are identical to the full systems, except that they do not include the acoustic model and, therefore, do not re-segment or relabel the speech after initialization. As Table 4 shows, the Full50 and FullDP models both tend to have coverage which is as good as, or better than, the corresponding models without an acoustic component. This indicates that top-down pressure from the symbolic component can refine bottom-layer PLUs, leading, ultimately, to better lexicon discovery.

Finally, the comparison between the full and -AM models and their -NC counterparts suggests the importance of modeling variability in the realization of phonemes.

3 For the procedure used to identify the word label for each lexical unit, we refer the reader to Sec. 3.5.3 of (Lee, 2014) for a detailed explanation.


Figure 3: The bottom-layer PLUs ~v and the top-layer PLUs u, as well as the word-internal structure, that the FullDP system discovered for three instances of the words (a) globalization and (b) collaboration. We also include phoneme transcriptions (derived by manual inspection of spectrograms) for clarity.

As we will discuss in the next section, the full model tends to merge multiple sequences of bottom-layer PLUs into single lexical items which share a single top-layer PLU sequence. The results in Table 4 confirm this: when the model does not have the option of collapsing bottom-layer PLU sequences, word discovery degrades considerably.

Examples and qualitative analyses To provide intuition about the model's behavior, we present several examples and qualitative analyses. Figure 3 illustrates the model's representation of two words which appeared frequently in the economics lecture: globalization and collaboration. The figure shows (i) the bottom-layer PLUs which the model assigned to three instances of each word in the training corpus, (ii) the alignments between these bottom-layer PLUs and the top-layer PLU sequence corresponding to the model's lexical representation of each word, (iii) the decomposition of each word-like unit into sub-word-like units, which are denoted as bracketed sequences of PLUs (e.g., [70 110 3]), and (iv) a hand-annotated phonemic transcription.

The first thing to note is the importance of the noisy-channel component in normalizing variation across word tokens. In Figure 3-(a), the model has inferred a different sequence of bottom-layer PLUs for each spoken instance of globalization. PLUs which vary between the three instances are highlighted in red. The model was able to map these units to the single sequence of top-layer PLUs associated with the lexical item. Similar remarks hold for the word collaboration in Figure 3-(b). This suggests that acoustic variability between segments led the model to infer different bottom-layer PLUs between word tokens, but this variability was correctly normalized by the noisy-channel component.

The second thing to note is the large amount of variability in the granularity of stored sub-word-like units (bracketed PLU sequences). The model allows sub-word-like units to consist of any sequence of PLUs, without further constraint. Figure 3 shows that the model makes use of this flexibility, representing linguistic structure at a variety of different scales. For example, the initial sub-word-like unit of collaboration groups together two PLUs corresponding to a single phoneme /k/. Other sub-word-like units correspond to syllables. Still others capture morphological structure. For example, the final sub-word-like unit in both words (highlighted in green) corresponds to the combination of suffixes -ation—a highly productive unit in English (O'Donnell, 2015).4

4 The reader may notice that the lexical representations are missing a final /n/ phoneme. Manual examination of spectrograms revealed two likely reasons for this. First, the final PLU 30 is likely to be a nasalized variant of /@/, thus encoding some portion of the following /n/ phoneme. Second, across these word instances there is a great deal of variation between the acoustic realizations of the consonant /n/. It is unclear at present whether this variation is systematic (e.g., co-articulation with the following word) or simply noise.


Transcription                                Discovered lexical units                                      |Word|
/iy l iy/ (really, willy, billion)           [35] [31 4]                                                   68
/ey sh ax n/ (innovation, imagination)       [6 7 30] [49]                                                 43
/ax bcl ax l/ (able, cable, incredible)      [34 18] [38 91]                                               18
discovered                                   [26] [70 110 3] [9 99] [31]                                   9
individual                                   [49 146] [34 99] [154] [54 7] [35 48]                         7
powerful                                     [50 57 145] [145] [81 39 38]                                  5
open university                              [48 91] [4 67] [25 8 99 29] [44 22] [103 4]                   4
the arab muslim world                        [28 32] [41] [67] [25 35] [1 27] [13 173] [8 139] [38 91]     2

Table 5: A subset of the lexical units that the FullDP system discovers for the economics lecture. The number of independent speech segments that are associated with each lexical unit is denoted as |Word|.

Lexical units also exhibit variability in granularity. Table 5 shows a subset of lexical units discovered by the FullDP system for the economics lecture. Each entry in the table shows (i) the decomposition of each word-like unit into sub-word-like units, (ii) a phonemic transcription of the unit, and (iii) the number of times each lexical unit was used to label a segment of speech in the lecture (denoted as |Word|). The lexical units displayed in the table correspond, for the most part, to linguistic units. While there are a few cases, such as /iy l iy/, where the model stored a sequence of phones which does not map directly onto a linguistic unit such as a syllable, morpheme, or word, most stored units do correspond to intuitively plausible linguistic constituents.

However, like sub-word-like units, there is variability in the scale of the linguistic structure which they capture. On one hand, the model stores a number of highly reusable smaller-than-word units, which typically correspond to morphemes or highly frequent syllables. For example, the sequences /ax bcl ax l/ and /ey sh ax n/ correspond to the productive suffix -able and the suffix combination -ation (O'Donnell, 2015). On the other hand, the model also stores lexical units which correspond to words (e.g., powerful) and multi-word collocations (e.g., the arab muslim world).



Figure 4: The proportions of the word tokens the FullDP system generates for each lecture that map to sub-words, single words, and multi-words.

Figure 4 shows an analysis of stored lexical units for each lecture, plotting the proportion of stored items which map onto sub-words, words, and multi-word collocations for each.

Why does the model choose to store some units and not others? Like many approaches to lexicon learning (Goldwater, 2006; De Marcken, 1996c; Johnson et al., 2006), our model can be understood as balancing a tradeoff between productivity (i.e., computation) and reuse (i.e., storage). The model attempts to find a set of lexical units which explain the distribution of forms in the input, subject to two opposing simplicity biases. The first favors smaller numbers of stored units. The second favors derivations of observed utterances which use fewer computational steps (i.e., a small number of lexical items). These are opposing biases: storing larger lexical units, like the arab muslim world, leads to simpler derivations of individual utterances, but a larger lexicon; storing smaller lexical units, like the suffix -able, leads to a more compact lexicon, but more complex derivations of individual utterances.


Smaller units are favored when they are used across a large variety of relatively infrequent contexts. For example, -ation appears in a large number of input utterances, but often as part of words which are themselves relatively infrequent (e.g., conversation, reservation, innovation, and foundation, which appear 2, 3, 4, and 2 times, respectively). Larger units will be favored when a combination of smaller units appears more frequently than would be predicted by considering their probabilities in isolation. For example, the model stores the words globalization and collaboration in their entirety, despite also storing the suffix combination -ation. These words occur 25 and 21 times respectively in the lecture, which is a greater number of times than would be expected merely by considering the words' sub-parts. Thus, the fact that the model stores a variety of lexical units at different granularities is expected.

7 Conclusion and Future Work

In this paper, we have presented a probabilistic framework for inferring hierarchical linguistic structures from acoustic signals. Our approach is formulated as an integration of adaptor grammars, a noisy-channel model, and an acoustic model. Comparison of the model with lesioned counterparts suggested that our model takes advantage of synergistic interactions between phonetic and lexical representations. The experimental results also indicate that modeling phonetic variability may play a critical role in inferring lexical units from speech.

While the noisy-channel model has demonstrated an ability to normalize phonetic variation, it has its limitations. In the future, we plan to investigate alternatives that more accurately capture phonetic variation. We also plan to explore grammars that encode other types of linguistic structure, such as collocations of lexical and morphological units.

Acknowledgements

The authors would like to thank the anonymous reviewers and the action editor of this paper, Eric Fosler-Lussier, for helpful comments. Furthermore, the authors would also like to thank Mark Johnson for making his implementation of adaptor grammars publicly available and for answering detailed questions about the model and his implementation.

References

Guillaume Aimetti. 2009. Modelling early language acquisition skills: Towards a general statistical learning mechanism. In Proceedings of EACL: Student Research Workshop, pages 1–9.

Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. 2004. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of the 20th International Conference on Computational Linguistics.

M. Bacchiani and M. Ostendorf. 1999. Joint lexicon, acoustic unit inventory and model design. Speech Communication, 29(2–4):99–114.

Benjamin Borschinger and Mark Johnson. 2014. Exploring the role of stress in Bayesian word segmentation using adaptor grammars. Transactions of ACL, 2:93–104.

Michael R. Brent. 1999a. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71–105.

Michael R. Brent. 1999b. Speech segmentation and word discovery: A computational perspective. Trends in Cognitive Sciences, 3(8):294–301, August.

Timothy A. Cartwright and Michael R. Brent. 1994. Segmenting speech without a lexicon: Evidence for a bootstrapping model of lexical acquisition. In Proceedings of the 16th Annual Meeting of the Cognitive Science Society.

Siddhartha Chib and Edward Greenberg. 1995. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335.

Cheng-Tao Chung, Chun-an Chan, and Lin-shan Lee. 2013. Unsupervised discovery of linguistic structure including two-level acoustic patterns using three cascaded stages of iterative optimization. In Proceedings of ICASSP, pages 8081–8085.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1).

Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366.

Carl De Marcken. 1996a. Linguistic structure as composition and perturbation. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 335–341. Association for Computational Linguistics.

Carl De Marcken. 1996b. The unsupervised acquisition of a lexicon from continuous speech. Technical Report AI-memo-1558, CBCL-memo-129, Massachusetts Institute of Technology Artificial Intelligence Laboratory.

Carl De Marcken. 1996c. Unsupervised Language Acquisition. Ph.D. thesis, Massachusetts Institute of Technology.

Micha Elsner, Sharon Goldwater, Naomi Feldman, and Frank Wood. 2013. A joint learning model of word segmentation, lexical acquisition, and phonetic variability. In Proceedings of EMNLP, pages 42–54.

T. S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):209–230.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 618–626.

Michael C. Frank, Sharon Goldwater, Thomas L. Griffiths, and Joshua B. Tenenbaum. 2010. Modeling human performance in statistical word segmentation. Cognition, 117(2):107–125.

Michael C. Frank, Joshua B. Tenenbaum, and Edward Gibson. 2013. Learning and long-term retention of large-scale artificial languages. Public Library of Science (PLoS) ONE, 8(4).

James Glass, Timothy J. Hazen, Lee Hetherington, and Chao Wang. 2004. Analysis and processing of lecture audio data: Preliminary investigations. In Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL, pages 9–12.

James Glass. 2003. A probabilistic framework for segment-based speech recognition. Computer Speech and Language, 17:137–152.

John Goldsmith. 2001. Unsupervised learning of the morphology of natural language. Computational Linguistics, 27(2):153–198.

John Goldsmith. 2006. An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4):353–371.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112:21–54.

Sharon Goldwater. 2006. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.

Joshua T. Goodman. 1998. Parsing Inside-Out. Ph.D. thesis, Harvard University.

Zellig Harris. 1955. From phoneme to morpheme. Language, 31(2):190–222.

Jahn Heymann, Oliver Walter, Reinhold Haeb-Umbach, and Bhiksha Raj. 2013. Unsupervised word segmentation from noisy input. In Proceedings of ASRU, pages 458–463. IEEE.

Aren Jansen, Kenneth Church, and Hynek Hermansky. 2010. Towards spoken term discovery at scale with zero resources. In Proceedings of INTERSPEECH, pages 1676–1679.

Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard C. Rose, et al. 2013. A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. In ICASSP, pages 8111–8115.

Frederick Jelinek. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64:532–556.

Mark Johnson and Katherine Demuth. 2010. Unsupervised phonemic Chinese word segmentation using adaptor grammars. In Proceedings of COLING, pages 528–536, August.

Mark Johnson and Sharon Goldwater. 2009a. Improving nonparametric Bayesian inference: Experiments on unsupervised word segmentation with Adaptor Grammars. In Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 317–325.

Mark Johnson and Sharon Goldwater. 2009b. Improving nonparametric Bayesian inference: Experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of NAACL-HLT, pages 317–325.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, pages 641–648.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of NAACL, pages 139–146.

Mark Johnson. 2008a. Unsupervised word segmentation for Sesotho using adaptor grammars. In Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 20–27, Columbus, Ohio, June.

Mark Johnson. 2008b. Using adaptor grammars to identify synergies in the unsupervised acquisition of linguistic structure. In Proceedings of ACL, pages 398–406.

Brenden M. Lake, Chia-ying Lee, James R. Glass, and Joshua B. Tenenbaum. 2014. One-shot learning of generative speech concepts. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society.


Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35–56.

Chia-ying Lee and James Glass. 2012. A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of ACL, pages 40–49.

Chia-ying Lee, Yu Zhang, and James Glass. 2013. Joint learning of phonetic units and word pronunciations for ASR. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP), pages 182–192.

Chia-ying Lee. 2014. Discovering Linguistic Structures in Speech: Models and Applications. Ph.D. thesis, Massachusetts Institute of Technology.

Percy Liang, Slav Petrov, Michael I. Jordan, and Dan Klein. 2007. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of EMNLP, pages 688–697.

Fergus R. McInnes and Sharon Goldwater. 2011. Unsupervised extraction of recurring words from infant-directed speech. In Proceedings of CogSci, pages 2006–2011.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of ACL, pages 100–108.

Graham Neubig, Masato Mimura, and Tatsuya Kawahara. 2012. Bayesian learning of a language model from continuous speech. The IEICE Transactions on Information and Systems, 95(2):614–625.

Timothy J. O'Donnell. 2015. Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. The MIT Press, Cambridge, Massachusetts and London, England.

D. C. Olivier. 1968. Stochastic Grammars and Language Acquisition Mechanisms. Ph.D. thesis, Harvard University.

Alex S. Park and James R. Glass. 2008. Unsupervised pattern discovery in speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 16(1):186–197.

Jim Pitman and Marc Yor. 1997. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, pages 855–900.

Jim Pitman. 1992. The two-parameter generalization of Ewens' random partition structure. Technical report, Department of Statistics, University of California, Berkeley.

Jennifer R. Saffran, Richard N. Aslin, and Elissa L. Newport. 1996a. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928, December.

Jennifer R. Saffran, Elissa L. Newport, and Richard N. Aslin. 1996b. Word segmentation: The role of distributional cues. Journal of Memory and Language, 35:606–621.

Odette Scharenborg, Vincent Wan, and Mirjam Ernestus. 2010. Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries. Journal of the Acoustical Society of America, 127:1084–1095.

Yaodong Zhang and James Glass. 2009. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proceedings of ASRU, pages 398–403.

Yaodong Zhang, Ruslan Salakhutdinov, Hung-An Chang, and James Glass. 2012. Resource configurable spoken query detection using deep Boltzmann machines. In Proceedings of ICASSP, pages 5161–5164.

Yaodong Zhang. 2013. Unsupervised speech processing with applications to query-by-example spoken term detection. Ph.D. thesis, Massachusetts Institute of Technology.
