
SPEECHBOT: An Experimental Speech-Based Search Engine for Multimedia Content in the Web

Jean-Manuel Van Thong, Pedro J. Moreno, Beth Logan, Blair Fidler, Katrina Maffey and Matthew Moores

CRL 2001/06

July 2001



Cambridge Research Laboratory
Compaq Computer Corporation
Cambridge MA 02142-1612

Abstract

As the Web transforms from a text-only medium into a more multimedia-rich medium, the need arises to perform searches based on the multimedia content. In this paper we present an audio and video search engine to tackle this problem. The engine uses speech recognition technology to index spoken audio and video files from the World Wide Web when no transcriptions are available. If transcriptions (even imperfect ones) are available, we can also take advantage of them to improve the indexing process.

Our engine indexes several thousand talk and news radio shows covering a wide range of topics and speaking styles from a selection of public Web sites with multimedia archives. Our Web site is similar in spirit to normal Web search sites; it contains an index, not the actual multimedia content. The audio from these shows suffers in acoustic quality due to bandwidth limitations, coding, compression, and poor acoustic conditions. Our word-error-rate results using appropriately trained acoustic models show remarkable resilience to the high compression, though many factors combine to increase the average word-error rates over standard broadcast news benchmarks. We show that, even if the transcription is inaccurate, we can still achieve good retrieval performance for typical user queries (77.5%).


Author email: [email protected]

© Compaq Computer Corporation, 2001

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of the Cambridge Research Laboratory of Compaq Computer Corporation in Cambridge, Massachusetts; an acknowledgment of the authors and individual contributors to the work; and all applicable portions of the copyright notice. Copying, reproducing, or republishing for any other purpose shall require a license with payment of fee to the Cambridge Research Laboratory. All rights reserved.

CRL Technical reports are available on the CRL's web page at http://crl.research.compaq.com.

Compaq Computer Corporation
Cambridge Research Laboratory
One Cambridge Center
Cambridge, Massachusetts 02142 USA


1 Introduction

As the magnitude and use of multimedia content on the Web grows, in particular large collections of streamed audio and video files, efficient ways to automatically find the relevant segments in these multimedia streams are necessary. Unfortunately, traditional Web search engines are often limited to text and image indexing, and many multimedia documents, video and audio, are thus excluded from classical retrieval systems. Even those systems that do allow searches of multimedia content, like AltaVista multimedia search and Lycos MP3 search, only allow searches based on data such as the multimedia file name, nearby text on the Web page containing the file, and meta-data embedded in the file such as title and author. Clearly these systems do not perform any detailed analysis of the multimedia content.

Except for some carefully archived files, most multimedia archives on the Web are simple lists of links to long audio files, sometimes several hours in length (see for example http://www.broadcast.com). Very often, there is no transcription available and therefore no simple means for indexing their content. Even when a transcription is available, often it is not annotated or linked to the relevant points of the multimedia stream. A searchable index that provides the ability to play the segments of interest within the audio file of these shows would make these archives much more accessible for listeners interested in a particular topic.

A straightforward approach to solve this problem consists of generating the transcription automatically using a large vocabulary speech recognition system. However, speech recognition technology is currently inherently inaccurate, particularly when the audio quality is degraded due to poor recording conditions and compression schemes. Despite this, we show that we can achieve accuracy satisfactory for indexing audio from the Web if the acoustic and language models are properly trained.

SpeechBot is not the first system to offer these capabilities. In fact, there have been several studies which had similar goals [1], [2], [3], [4], [5]. We differ from these projects in several ways. First, we fetch the audio documents from the Web and build an index from that data. Second, we don't serve content, but rather keep a link to the original document, similar to traditional search engines. Third, our system is designed to scale up on demand. Finally, our search index is available and running on the Web (http://www.compaq.com/speechbot).

The content currently indexed is popular talk radio, technical and financial news shows and some conference video recordings. For video streams, since no video processing is needed, we only download the associated audio track, hence saving disk space and bandwidth during transcoding. These shows are almost entirely speech, and very few of them have associated transcriptions, unlike TV shows that are often closed captioned in the U.S. Closed captions (or CC) are the textual transcriptions of the audio track of a TV program, similar to subtitles for a movie. The U.S. Federal Communications Commission (FCC) requires broadcast programs to be closed captioned in order to be accessible to people with disabilities. These closed captions can easily be used for indexing purposes since the transcription is time aligned for display purposes.

Some of the radio shows we index are extremely popular: Dr. Laura is syndicated to 430 stations in the U.S. and is reported to have a weekly audience of 18 million people; Art Bell Coast to Coast/Dreamland is reported to have a weekly audience of 10 million people.



Some of these shows are very long. For example each Art Bell Coast to Coast show is four hours long, and each Dr. Laura show is two hours long. However, none of these shows have existing transcriptions, except in some cases a title for each show. This is clearly inadequate since long shows may cover a variety of different subjects and should not be indexed as a whole. Since these shows all have existing archives on the Internet, SpeechBot provides links to the relevant segments in these archives.

The outline of the paper is as follows. In Section 2 we give an overview of the architecture of the system. In Section 3 we present a performance analysis. In Section 4 we describe our usability studies. Finally, in Section 5 we present our conclusions and suggestions for future work.

2 System Overview

SpeechBot is a public search site similar to AltaVista, which indexes audio from public Web sites such as Broadcast.com, PBS.org, and InternetNews.com. The index is updated daily as new shows are archived in their Web sites. The system consists of the following modules: the transcoders, the speech decoders, the librarian database, and the indexer. Figure 1 presents the system architecture at a conceptual level. Figure 2 presents the overall architecture of the system and gives an idea of the hardware requirements of SpeechBot. We describe each component below.

Figure 1: Overall architecture of the system


Figure 2: Overall hardware view of the system

2.1 Transcoder

The transcoders fetch and decode video and audio files from the Internet. For each item, they extract the meta-data, download the media documents to a temporary local repository and convert the audio into an uncompressed file format (typically Microsoft PCM wav). This format preserves the original sampling rate of the stream and the number of audio channels. The meta-data (if available) contains information about the downloaded file such as the sample rate, copyright, the story title, and possibly a short description. This information is used by the librarian database to identify and track the document while it is being processed by the system, and to display the result of the query to the user.
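A minimal sketch of this step, in Python with ffmpeg standing in for SpeechBot's own transcoding tools (the function name and temporary file are invented for the example):

    import subprocess
    import urllib.request

    def transcode(url: str, wav_path: str) -> None:
        local = "download.tmp"
        urllib.request.urlretrieve(url, local)  # fetch to a temporary local file
        # Decode to uncompressed PCM wav; without -ar/-ac options, ffmpeg
        # keeps the original sampling rate and channel count.
        subprocess.run(["ffmpeg", "-i", local, "-acodec", "pcm_s16le", wav_path],
                       check=True)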

2.2 Speech Recognizer

SpeechBot uses Calista, our in-house speech recognizer. Calista is derived from the Sphinx-3 system [6] and is a large vocabulary continuous speech recognition package which uses state-of-the-art Gaussian mixture, triphone based, Hidden Markov Model (HMM) technology.


The best hypothesis produced by this one-pass recognizer is used to produce a textual transcription, or annotation stream, from the downloaded audio files. The annotation stream consists of the start time and end time of each word in the transcript, and the word itself. The audio files are segmented in such a way that the speech recognition can be performed in parallel on different portions of the document.
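A minimal sketch of such an annotation record, with hypothetical field names:

    from dataclasses import dataclass

    @dataclass
    class WordAnnotation:
        word: str
        start: float  # seconds from the start of the document
        end: float

    # e.g. a hypothesis beginning 81.2 seconds into a show:
    annotations = [WordAnnotation("paul", 81.2, 81.6),
                   WordAnnotation("mccartney", 81.6, 82.3)]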

A farm of workstations tied together in a queuing system recognizes each portion of the document. Thus, even if the speech decoder is not real-time, we can still achieve sub real-time throughput. When all the portions are recognized, the results are assembled to create a fully annotated document. Calista yields an error rate of about 20% on a single pass search on the 1998 ARPA HUB4 evaluation corpora [7] with an average computational load of 6 times real time on a Compaq DS20 EV6 Alpha processor running at 500 MHz. The production system is composed of a farm of 30 Compaq AP500 and Proliant 1850 dual Pentium II/III, 450 MHz and 600 MHz, 256 Mb RAM, running Linux 6.0.

When the transcription is available we replace the speech recognition module with an aligner module. Its role is to provide time marks for each word of the input text. We have developed a method for aligning transcripts to very long audio segments. This method is robust to occasionally inaccurate transcripts, music insertions, or noisy speech. The measured precision is over 99% of words aligned within a 2 second misalignment [8].

In other cases the transcription is only approximate. This may be detected when the aligner module fails. Some words in the transcript might be missing or wrong words might be inserted. Sometimes even complete sentences might be missed. Following the ideas of [9] we build a language model using all the available text in the transcript and then recognize the audio using this model. This training provides a language model that is much more specific than general broadcast news models while allowing some freedom in the search. Because this model is very restrictive and close to the truth, it dramatically limits the search space for the recognizer and yields lower word error rates. However, since we are performing recognition, not alignment, errors are still possible.
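As a simplified illustration of this idea (not the estimation actually used, which would include smoothing), a bigram model can be derived from the approximate transcript roughly as follows:

    from collections import Counter

    def bigram_lm(transcript: str):
        words = transcript.lower().split()
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))

        def prob(w2: str, w1: str) -> float:
            # Maximum-likelihood estimate; a real system would smooth this
            # and combine it with a general broadcast-news model.
            return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

        return prob

    prob = bigram_lm("thanks for listening to the motley fool radio show")
    print(prob("fool", "motley"))  # 1.0 in this tiny example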

2.3 Librarian

The librarian has two main roles. It manages the workflow of tasks carried out by the individual modules, and it stores meta-data and other information required by the user interface.

The component modules often run on remote machines and do not communicate directly. Each process registers itself with the librarian and once registered can make a request for work to perform. This includes such tasks as speech recognition, text to audio alignment or insertion of text into the index. The output of one task is usually the input for another, and the librarian tracks the location of these inputs and outputs. In this sense the librarian is the brain of our system.

In addition to storing the meta-data collected by the transcoder module, the librarian stores 10 second 'clips' of formatted text for each document. It maintains a mapping between word locations from the index, corresponding text clips, and the time the clip occurs in the multimedia document. The UI uses this information to construct the query response pages displayed to the user.
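The paper does not give the schema, so the sqlite3 sketch below, with invented table and column names, only illustrates the kind of mapping involved (the production librarian ran on Oracle, as described below):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE clip (
            doc_id     INTEGER,  -- which multimedia document
            word_pos   INTEGER,  -- word location as stored in the index
            start_secs REAL,     -- where the 10-second clip begins
            text       TEXT      -- formatted transcript of the clip
        )""")
    db.execute("INSERT INTO clip VALUES (42, 1305, 610.0, '... paul mccartney ...')")

    # Map an index match (doc_id, word position) back to a playable clip:
    row = db.execute(
        "SELECT start_secs, text FROM clip "
        "WHERE doc_id = ? AND word_pos <= ? "
        "ORDER BY word_pos DESC LIMIT 1", (42, 1310)).fetchone()
    print(row)  # (610.0, '... paul mccartney ...')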


The librarian is built on an Oracle relational database running on Tru64 Unix 4.0. Clients communicate with the librarian via an in-house, clear text protocol over socket connections. The use of a central repository for shared information allows a robust distributed architecture which can scale on demand.

2.4 Indexer

The indexer provides an efficient catalogue of audio and video documents based on the transcription produced by the speech decoder. As well as supplying the user interface with a list of documents that match a user's query, the indexer also retrieves the location of these matches within the documents. It does this using a modified version of the AltaVista query engine [10].

The indexer sorts the matches according to relevance, as described in Section 3.3. We define relevance using the term frequency inverse document frequency (tf.idf) metric [11], adjusted for the proximity of the terms within the document. This sorting is performed on both the list of documents returned and the list of match locations. In addition to the transcription of the audio, we also index the meta-data if any is available. Both the librarian and the indexer are based on the system described in [12].

2.5 User Interface

The Web server passes the user queries to the indexer, reads back the list of matches, retrieves the associated meta-data from the librarian, and formats and displays the results. As part of this process, it also performs the advanced functions described in Section 4, such as highlighting the matches within the transcript and expanding and normalizing acronyms, abbreviations and numbers.

The Web server returns up to 20 pages of up to 10 documents each, sorted by relevance. External links to the original audio and video files are displayed, as well as a link for each document to a page with more details. This is shown in Figure 3, where we present the look and feel of a search page with the query Paul McCartney.

3 Performance Analysis

In this section we obtain objective measurements of the main components of the SpeechBot system. We describe the performance of the transcoder, the speech recognition engine, the information retrieval engine (a combination of the speech recognition output and the indexer) and the Web server.

3.1 Transcoder

The current version of the transcoders can handle a variety of different formats, both streaming and non streaming. Most of the documents downloaded however are RealAudio encoded. Transcoders run in parallel on a farm of 8 Compaq AP400 dual 400 MHz Pentium II workstations, 256 Mb RAM, under Windows NT. Each machine can handle 4 streams in real-time, leading to a maximum throughput of 8 × 4 × 24 = 768 downloadable hours of audio per day.


Figure 3: View of the search page

Notice that the streaming formats cannot be downloaded at a faster than real time rate since by design streaming servers feed the data at real time rates, neither faster nor slower.

3.2 Speech Recognition

The speech recognition system has been tested on randomly selected segments from several popular radio shows: Coast to Coast with Art Bell, Dr Laura Schlessinger, Ed Tyll, Rick Emerson, and the Motley Fool Radio Show. Four 15-minute segments were selected from each of five shows for word error rate analysis. The majority of the audio streams are encoded with the 6.5 Kbps RealAudio codec. A few audio streams are encoded at higher rates. After download, the audio is stored in a wav file sampled at 8 kHz. Acoustic conditions vary as shows may have telephone conversations, commercials, several people talking simultaneously, or music in the background.


The selected segments are transcribed manually, and the transcription is used to estimate the Word Error Rate, WER = 100 × (S + I + D) / N, where S, I, D and N are the number of substitutions, insertions, deletions and the total number of words respectively.
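This computation can be sketched as an edit-distance alignment between hypothesis and reference:

    def wer(reference: list, hypothesis: list) -> float:
        n, m = len(reference), len(hypothesis)
        # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i  # i deletions
        for j in range(m + 1):
            d[0][j] = j  # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                              d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1)        # insertion
        return 100.0 * d[n][m] / n

    print(wer("the art bell show".split(), "the art bill show".split()))  # 25.0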

We break each 15-minute test segment into smaller utterances that are more easily handled by the recognizer. Each segment is broken into 30 second long subsegments overlapping by 5 seconds. We use a statistical 3-gram language model trained using the DARPA broadcast news Hub-4 1998 text corpora [7]. An n-gram statistical language model gives the probability of the current word given the previous n-1 words. Our language model has been trained on a corpus of about 100 million words. It contains a vocabulary of 64,000 words which corresponds to 4 million bigrams and 15 million trigrams in the language model.
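The overlapping segmentation can be sketched as follows (the interface is illustrative):

    def subsegments(duration: float, length: float = 30.0, overlap: float = 5.0):
        start, step = 0.0, length - overlap
        while start < duration:
            yield (start, min(start + length, duration))
            start += step

    # A 15-minute (900 s) segment yields 36 overlapping pieces:
    print(list(subsegments(900.0))[:3])  # [(0.0, 30.0), (25.0, 55.0), (50.0, 80.0)]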

Since the audio has been encoded for streaming purposes, we have to adapt the acoustic models to the compression/decompression scheme. We investigated the impact of acoustic adaptation by exploring two acoustic modeling approaches. In our first approach we build acoustic models by training on 100 hours of the Broadcast News corpus provided by the LDC (Linguistic Data Consortium) [13] at its original 16 kHz sampling rate with recording studio acoustic quality. When using these models the test data had to be up-sampled to 16 kHz. The second approach used models trained on the same training corpus but after being encoded using the 6.5 Kbps RealAudio codec, and then decoded to a sampling rate of 8 kHz. This encoding/decoding operation was performed to reduce the acoustic mismatch between the training corpora and the testing corpora.

On the test set we observed an improvement in the average word error rate (WER) from 60.5% for the first set of acoustic models to 49.6% when the 8 kHz RealAudio encoded/decoded models were used. These results are for a system with 16 Gaussian mixture components per HMM state and 6,000 shared states. Table 1 presents full results for the 8 kHz RealAudio system.

As a comparison, the Total Error Rate (TER) of television news and sports captioners is usually between 1.5% and 2.5% [14]. The TER is however measured slightly differently to the WER. The TER calculates the proportion of errors over the total number of words. Errors include mis-typed words and punctuation errors which usually do not occur in transcriptions generated by a speech recognizer.

Table 1: Speech recognition WER. 8 kHz RealAudio models, 16 Gaussian mixture components per clustered HMM state, 6,000 clustered states

Show Name     Chunk 1   Chunk 2   Chunk 3   Chunk 4
Art Bell       50.5%     46.3%     51.2%     44.5%
Dr. Laura      51.2%     47.4%     52.7%     59.2%
Ed Tyll        62.3%     54.0%     49.9%     47.0%
R. Emerson     48.2%     51.5%     53.1%     56.4%
M. Fool        44.1%     38.8%     47.2%     43.5%


The Calista speech recognizer can be adjusted to trade off processing time for accuracy within certain limits. We observe that ranges between 6 and 30 times longer than real time on a Pentium II, 450 MHz processor under Linux give good accuracy/processing time tradeoffs for a production system. The dominant parameter is the beam of the recognition search. During recognition, hypotheses with likelihood less than the beam are discarded, reducing the number of calculations required. Table 2 shows the performance of the recognition system as this factor is adjusted. In production mode we use 8 Gaussian mixture components per state and the narrowest beam setting shown in Table 2. The recognizer then runs at around 13 times real time (xRT). Because the speech recognizer is parallelized when running on archived data, and assuming an average of 13 times real-time, we are able to process approximately 110 hours of audio per day on a farm with 60 processors (60 / 13 ≈ 4.6 times real time in aggregate, i.e. about 4.6 × 24 ≈ 110 audio hours per day).
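The following toy sketch illustrates beam pruning in isolation; the real decoder applies it frame by frame inside the search:

    def prune(hypotheses: dict, beam: float) -> dict:
        """hypotheses maps a partial path to its log-likelihood; beam <= 0."""
        best = max(hypotheses.values())
        # A wider beam keeps more paths (more accurate, slower); a narrower
        # beam trades accuracy for speed, as in Table 2.
        return {h: s for h, s in hypotheses.items() if s >= best + beam}

    paths = {"the art": -10.0, "the heart": -12.0, "the tart": -25.0}
    print(prune(paths, beam=-8.0))  # drops "the tart"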

Table 2: WER and speed (× real time) of the speech recognizer as a function of beam and number of Gaussians. 8 kHz RealAudio trained models. Beam settings are listed from widest to narrowest; the numeric beam values are illegible in the source.

Number of Gaussians   Beam        WER     xRT
16                    widest      49.6%   23.2
16                    narrowest   51.5%   18.5
8                     widest      51.9%   18.0
8                     middle      53.0%   14.5
8                     narrowest   53.9%   13.0

3.3 Information Retrieval Performance

The effectiveness of retrieval depends on a number of factors such as the complexity of the queries, the length of the documents, the size of the collection, the accuracy of the automatic transcriptions, and the information retrieval techniques used.

The retrieval performance was evaluated on the public site with static content since the index needs to be stable during the tests, which last several days. We had independent and unbiased testers evaluate a set of queries selected from the list of the 100 most frequently submitted queries. Each tester performed 50 queries, and assessed the precision of the top shows returned for each. The study was limited to the top 20, 10 and 5 shows, based on the assumption that typical users tend only to look at the first couple of pages of the retrieved results [15]. The query terms were selected from the top 100 queries submitted to the public site from December 1st, 1999 to March 1st, 2000. The words were selected such that they cover a large variety of topics, varying length of words (phoneme-wise), and varying types of words such as acronyms and proper nouns. The queries were either nouns or noun phrases with a maximum length of 3 words. Example queries are Bill Clinton, Internet and Y2K.

Testers judged the relevance of the retrieved documents based on a concept hit approach, i.e. a document was considered relevant if and only if the concept described in the query was present in the retrieved document.


To assess whether a given result was relevant, testers read the transcripts returned with the search results of the detailed page and, for each clip of the document, listened to the corresponding parts of the audio show. To evaluate the retrieval results, we used a standard average precision metric [16]. We did not run the experiment on a standard test set, like TREC-SDR [3], but rather on real data (both audio content and queries) where the diversity of audio conditions and content is very high. Unlike TREC evaluations, we do not know the ground truth of our index, so we cannot measure recall. We therefore measure precision, which still meaningfully reflects how well our index serves our users' information needs.
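The precision at a cutoff k over the testers' judgments reduces to a simple fraction, as in this sketch:

    def precision_at_k(relevant_flags: list, k: int) -> float:
        top = relevant_flags[:k]
        return sum(top) / len(top) if top else 0.0

    # Testers' relevance judgments for the top 10 shows of one query:
    judged = [True, True, False, True, True, True, False, True, True, False]
    print(precision_at_k(judged, 5))   # 0.8
    print(precision_at_k(judged, 10))  # 0.7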

Table 3 presents average retrieval precision numbers for the above described experiments for two different ranking functions.

Table 3: Average retrieval precision for whole documents using two different ranking functions

Ranking    Docs      Duration   Queries   Average
function   (count)   (hours)    (count)   precision
A          4695      4188       50        71.7%
B          4695      4188       50        75.5%

Ranking function A scores documents using a slightly-modified implementation of an algorithm for ranking documents in a large text index. Function A is calculated using an approximate term frequency inverse document frequency (tf.idf) metric [11], combined with a score based on the proximity of query terms to each other and a score based on their location within the document. For speed of search, the term frequency in the approximate tf.idf is calculated as a step function indicating whether the term occurs once or more than once in the document. The proximity bias helps to retrieve documents with a multi-word query string, even if the user didn't choose the "this exact phrase" option. Location within the document is calculated such that terms closer to the beginning of a document receive a higher score.

Ranking function B scores documents using a true tf.idf metric, combined with a score based on the proximity of query terms to each other. The bias towards terms occurring earlier in the document is not included in function B.
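A simplified sketch in the spirit of ranking function B, a tf.idf score plus a proximity bonus; the weighting details of the actual AltaVista-derived engine differ:

    import math

    def score(doc_positions: dict, query: list, doc_freq: dict, n_docs: int) -> float:
        # tf.idf part: term frequency times inverse document frequency.
        s = 0.0
        for term in query:
            positions = doc_positions.get(term, [])
            if positions:
                s += len(positions) * math.log(n_docs / doc_freq[term])
        # Proximity bonus: reward the closest pair of distinct query terms.
        hits = sorted((p, t) for t in query for p in doc_positions.get(t, []))
        gaps = [b[0] - a[0] for a, b in zip(hits, hits[1:]) if a[1] != b[1]]
        if gaps:
            s += 1.0 / (1 + min(gaps))
        return s

    # Word positions of each query term in one document's transcript:
    doc = {"paul": [12, 97], "mccartney": [13]}
    print(score(doc, ["paul", "mccartney"], {"paul": 40, "mccartney": 25}, 4695))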

The retrieval performance of the system is better than expected considering the accuracy of the speech decoder. In many studies, document retrieval has been shown to be remarkably resilient to the speech recognizer word error rate. Recent experiments show that a word error rate of 30% reduces recall by 4% relative, and a word error rate of 50% reduces it by only 10% relative [17], [4]. There are several explanations for these numbers. First, the query words are often repeated several times during a show and are thus more likely to be recognized. Second, the keywords tend to be longer than stop words (e.g. a, the, and ...), so the speech recognition search is more constrained and tends to perform better on these words. Also, if there are many words in the query, missing one or two of them may still permit retrieval of the document.


Retrieval errors were due to several reasons. First, insertion or substitution recognition errors cause query words to appear erroneously in the transcripts. Ranking function A is more sensitive to this type of error: we observed several cases where an insertion at the very beginning caused the document to get an erroneously high score. This type of false alarm error represents roughly half of the cases of the appearance of non-relevant documents. Ranking function B helps to alleviate this problem by weighting terms equally throughout the document without using a locality bias. This makes intuitive sense when you consider that an audio document is less likely than a text document to be about a single topic.

All of the words used for evaluation were in the vocabulary of the speech recognition engine, thus the average precision measured should be viewed as an upper bound on performance. However, our Out Of Vocabulary (OOV) rate is very low (about 1.2% with a static language model built in late 1999) and has a minor impact on our recognition performance. It is likely that other factors such as acoustic and language model mismatch, and pronunciation inaccuracies, are the main contributors to the word recognition error rate. We have found that the OOV rate of user queries (12%) is considerably higher than the speech recognition OOV. This rate clearly has an implication on the performance of the system since a query with a word that is out of vocabulary does not return any valid hit [18].

The second main reason for a non-relevant document to be retrieved is that the query words are mentioned out of context, or are inherently ambiguous. For example, the query aids returned many documents which talked about aids meaning helps rather than the disease. These problems may be overcome by introducing a notion of subject popularity, and introducing a bias in the ranking scheme [19].

Finally, documents may not be retrieved because of the ranking function itself. Although these errors probably degrade the performance of the system, we were not able to evaluate their impact because we don't have a properly labeled test set to measure the recall.

3.4 Web Site Performance

To estimate overall performance in terms of speed as perceived by a user, we used automated test scripts which fire thousands of queries at the site over a certain period of time, and read back the results. The scripts measure page views per second and average latency. The queries used are real user input gathered from the site logs since SpeechBot went live in December 1999. They are submitted in multiple parallel sequences with random variable delays to simulate likely real-world conditions.
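A hypothetical sketch of such a test script; the server URL and query-string format are invented for the example:

    import random, threading, time
    from urllib.parse import quote
    from urllib.request import urlopen

    latencies = []

    def user_session(queries, base_url):
        for q in queries:
            time.sleep(random.uniform(0.0, 30.0))  # random think time
            t0 = time.time()
            urlopen(f"{base_url}/search?q={quote(q)}").read()
            latencies.append(time.time() - t0)  # per-request latency

    # 275 simulated users replaying logged queries:
    threads = [threading.Thread(target=user_session,
                                args=(["paul mccartney", "y2k"],
                                      "http://speechbot.example"))
               for _ in range(275)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(latencies) / len(latencies), "seconds average latency")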

The results were found to be affected by the frequency of the query terms in the database, by database and index size, and by the number of CPUs available for use by the system. It is possible to split components of the SpeechBot system over several single or multi-processor machines in order to increase performance with a very large database. Performance profiling using DCPI [20] and our own logging tools showed that the main bottlenecks were usability features including acronym expansion and ranking of matches within documents. Performance improvements in these areas resulted in an increase from [illegible] to [illegible] page views per second. The tests used [illegible] hours of content with all components running on a single


Compaq DS20 AlphaServer EV6 (2 CPUs) and simulated 275 users waiting an average of 15 seconds between clicks. A site with this performance would support over 1 million hits per day.

The fastest results to date, 27 page views per second with an average latency of [illegible] seconds, were obtained with no acronym expansion or within-document ranking, and a relatively small database (200 hours of content), with the librarian and the indexer running on a Compaq AlphaServer 4100 with 4 processors and 2 Gbytes of memory and the UI running on a separate Compaq AlphaServer 4100 with 4 processors and 2 Gbytes of memory.

We have also conducted performance tests of the database as a standalone component since it is a potential bottleneck for the system when serving the user interface. The test of the database component alone, running on one AlphaServer 4100 with four Alpha processors at 466 MHz, measured a throughput of 29 user sessions per second, equivalent to 88 pages per second or 7.5 million pages over a 24-hour period. This performance is better than that of the presentation server, which generates the HTML pages at a maximum rate of about 30 pages per second [12].

4 Usability Studies

The usability of the site was extensively evaluated and improved. A group of seven experienced Internet users aged 18-50, most (five) with experience watching video content or listening to audio content on the Web, were each observed while they attempted to achieve the same set of goals using a SpeechBot prototype site. The goals were expressed in the form of two scenarios written in terms of what information the user was to discover using the site, with each scenario covering the two universal goals of searching: to make the widest possible search and to narrow down the results of a search. Before acting out the scenarios, each user was asked to perform a free-form search for a subject of their own choice. Users were given no instruction on the use of the site, and each session took a little over an hour. Users were encouraged to talk through their thought processes and expectations prior to each interaction with the site and then comment on the response obtained from the site. Each user also completed a brief questionnaire at the end of the session. The trends in behavior were analyzed to determine the major factors to be improved.

Although the interface is very similar to most text-based search engines, users encountered several difficulties. When performing multiword searches, in general users expected to see all query words in the results, which is not necessarily the case depending on how the search has been performed. Interestingly, even regular AltaVista users did not understand the query syntax. For instance, when looking for divorce in Dr. Laura's interviews, most first tried laura divorce, clever ones then tried laura+divorce, but nobody tried +laura +divorce. This problem was addressed by exposing the possible relationships between search words (all/any/phrase/Boolean) in a drop-down menu and making all the default. Second, the syntax for query terms is sometimes non-intuitive. For example, users may type acronyms in different forms: USA or U.S.A., but without special processing not all written forms will match their spoken counterpart. To alleviate this problem, terms like abbreviations, acronyms, and numbers are now expanded and normalized by the system to a common form for both querying and indexing.


For instance, the number 250 will be trained and recognized as two hundred and fifty, will probably be searched as 250, and will be indexed in its normalized form as 250. Finally, the advanced search was confusing and provided no additional functionality other than Boolean search. It was therefore dropped after adding the Boolean capability as a choice on the simple search page.
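A sketch of this normalization idea, with invented rule tables standing in for the system's actual expansion rules:

    ACRONYMS = {"u.s.a.": "usa"}                         # invented lookup table
    SPOKEN_NUMBERS = {"two hundred and fifty": "250"}    # stand-in converter

    def normalize(term: str) -> str:
        term = term.lower().strip()
        term = ACRONYMS.get(term, term.replace(".", ""))
        # Spoken number words are folded back to digits so that a query
        # for "250" matches "two hundred and fifty" in the transcript.
        return SPOKEN_NUMBERS.get(term, term)

    print(normalize("U.S.A."))                 # usa
    print(normalize("two hundred and fifty"))  # 250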

Some improvements to the display of results were indicated by user testing. Highlighting the query words was essential, and gave the user strong feedback on the relevance of the document even if the speech recognition output was sometimes hard to read and understand. Figure 4 shows a detailed search result where the search word "clinton" is highlighted.

Users wanted guidance on what topics could be searched. To address this, we added a list of indexed shows grouped by topic to the home page.

Figure 4: View of the Show Me More page

The accuracy of the transcription appeared to be a problem too. For example, on some content such as noisy or poorly articulated speech segments, the output of the speech recognition can be very poor.


A solution to this problem is the subject of ongoing research. Lastly, users were often frustrated by the reliability of the RealAudio proxies and servers. Our query servers are stable and responsive, but the audio has to be streamed in from a variety of content-hosting sites, some of which are less reliable.

A smaller in-house study on the public site was later conducted on a group of four university students to verify that the changes made prior to launch had resolved the original usability problems. This study showed that most of the problems had been fixed, but that there was a residual problem when searching with multiple keywords. Depending on how closely together the words were spoken in the original audio document, users still occasionally found that the results didn't contain all the search words they expected. To address this, we are currently investigating a modification to make the default search option behave more like "all these words nearby".

5 Conclusions and Future Work

SpeechBot is the first audio indexing and retrieval system for the Web. It incorporates speech recognition technology and so is able to operate both with and without transcriptions. The current version of the system is capable of indexing up to 800 hours of audio per week, and is configured in such a way that it can easily scale up on demand. It has given us a unique opportunity to test a spoken data retrieval system on a large scale. Our experiments show acceptable retrieval accuracy despite high recognition error rates. This suggests that indexing of audio documents on the Web is feasible given the current level of recognition and retrieval technologies.

There is still room for many improvements. The OOV rate for user queries could be reduced by performing better text normalization. We are also planning to investigate indexing word categories and/or performing subword based retrieval. Speech recognition accuracy can be increased by providing specialized audio models, vocabulary, pronunciation dictionaries, and language models. Acoustic models could be adapted to specific shows. Language models could be trained for specific domains, like sports or financial news. Finally, we plan to update the recognition vocabulary, and improve word pronunciation for proper nouns.

To improve IR accuracy, we are investigating different relevance ranking algorithms and alternative ways of indexing the output of the speech recognizer, as well as query and document expansion methods. We also plan to improve the user interface since random access to multimedia documents raises interesting new questions about the efficiency of the current design. A better segmentation of shows could allow search and retrieval of small sections of audio recordings that could be listened to as a whole, thus minimizing the navigation within a document. Finally, we plan to port the speech recognition engine to other languages, such as French and Spanish.


6 Acknowledgments

The SpeechBot project has benefited from the insights and work of many people. We would like to thank first the Compaq Software Engineering Australia group, Greg McCane, Liam Davies, Ian Walters, Shazina Hashmi, Sue Salamacha, and Robyn Hancock, without whom the site would not exist. We also thank Frank Bomba and his engineering group including Scott Blackwell, Gene Preble, and Ben Jones. They provided talented engineering and testing support. Scott has been in charge of running the site, keeping the index fresh, and performing numerous analyses on our Web site. Bhiksha Raj and Ed Whittaker helped us by giving their insight on language modeling. We also would like to acknowledge the past contributions of Frederic Dufaux, Brian Eberman, Bob Iannucci, Chris Joerg, Ron Gentile, Dave Goddeau, R. Paul Johnson, Leonidas Kontothanassis, Dave Kovalcin, Andrew Leonard, Anna Litvinova, Gabe Mahoney, Mike Schexnaydre, Mike Sokolov, Mike Swain, and Chris Weikart. Finally, we would like to thank the Hiser Group and APL Digital, both in Australia, for their help in the usability studies and Web site graphic design respectively.

References

[1] H. D. Wactlar, A. G. Hauptmann, and M. J. Witbrock, "Informedia: News-on-demand experiments in speech recognition," in DARPA Speech Recognition Workshop, 1996.

[2] S. E. Johnson, P. Jourlin, G. L. Moore, K. Sparck Jones, and P. C. Woodland, "The Cambridge University spoken document retrieval system," in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999.

[3] D. Abberley, G. Cook, S. Renals, and T. Robinson, "Retrieval of broadcast news documents with the THISL system," in Proceedings of the 8th Text Retrieval Conference (TREC-8), 1999.

[4] J. Garofolo, E. Voorhees, C. Auzanne, V. Stanford, and B. Lund, "Spoken document retrieval track overview and results," in Proceedings of the 7th Text Retrieval Conference (TREC-7), 1998.

[5] P. C. Woodland, T. Hain, S. E. Johnson, T. R. Niesler, A. Tuerk, and S. J. Young, "Experiments in broadcast news transcription," in Proc. ICASSP 98, Vol. II, pp. 909-912, Seattle, Washington, 1998.

[6] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler, R. Stern, and E. Thayer, "The 1996 Hub-4 Sphinx-3 system," in Proceedings of the 1997 DARPA Speech Recognition Workshop, Chantilly, Virginia, 1997.

[7] D. S. Pallet, J. G. Fiscus, J. S. Garofolo, A. Martin, and M. A. Przybocki, "1998 broadcast news benchmark test results," in DARPA Speech Recognition Workshop, 1999.


[8] P. J. Moreno, C. Joerg, J.-M. Van Thong, and O. Glickman, "A recursive algorithm for the forced alignment of very long audio segments," in International Conference on Spoken Language Processing, 1998.

[9] P. Placeway and J. Lafferty, "Cheating with imperfect transcripts," in International Conference on Spoken Language Processing (ICSLP), 1996.

[10] M. Burrows, "Method for indexing information of a database," U.S. Patent 5,745,899, 1998.

[11] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

[12] B. Eberman, B. Fidler, R. A. Iannucci, C. Joerg, L. Kontothanassis, D. E. Kovalcin, P. J. Moreno, M. J. Swain, and J.-M. Van Thong, "Indexing multimedia for the Internet," in Visual Information and Information Systems, D. P. Huijsmans and Arnold W. M. Smeulders (Eds.), Springer-Verlag, 1999.

[13] Linguistic Data Consortium (LDC), http://www.ldc.upenn.edu.

[14] G. Robson, Inside Captioning, CyberDawg Publishing, Castro Valley, California, 1997.

[15] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz, "Analysis of a very large AltaVista query log," SRC Technical Note 1998-014, 1998.

[16] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Massachusetts, 1989.

[17] M. Witbrock, "Search engines and beyond," in Search Engine Meeting, Boston, 1999.

[18] B. Logan, P. Moreno, J.-M. Van Thong, and E. Whittaker, "An experimental study of an audio indexing system for the Web," in International Conference on Spoken Language Processing (ICSLP), 2000.

[19] A. Singhal and F. Pereira, "Document expansion for speech retrieval," in ACM SIGIR '99, 1999.

[20] J. Dean, C. A. Waldspurger, and W. E. Weihl, "Transparent, low-overhead profiling on modern processors," in Workshop on Profile and Feedback-Directed Compilation, Paris, France, 1998.
