Text, Speech, and Vision for Video Segmentation: The Informedia™ Project

Alexander G. Hauptmann, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, [email protected]
Michael A. Smith, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, [email protected]

Abstract

We describe three technologies involved in creating a digital video library suitable for full-content search and retrieval. Image processing analyzes scenes, speech processing transcribes the audio signal, and natural language processing determines word relevance. The integration of these technologies enables us to include vast amounts of video data in the library.

1 Introduction

The Informedia Digital Video Library Project at Carnegie Mellon University is creating a digital library of text, images, videos and audio data available for full content retrieval [Stevens94][Christel94]. The initial testbed will be installed in several K-12 schools, and students will use the system to explore multimedia data for educational purposes. The Informedia system for video libraries goes far beyond the current paradigm of video-on-demand by retrieving a short video paragraph in response to the user's query. The project can be divided into two phases: library creation and library exploration (see Figure 1).

1.1 Library creation

The Informedia project is creating intelligent, automatic mechanisms for populating a video library and allowing for its full-content and knowledge-based search and segment retrieval. The material is obtained from video assets of WQED/Pittsburgh as well as the British Open University video courses. The project uses the Sphinx-II speech recognition system to transcribe and align narratives and dialogues automatically. The resulting transcript is then processed through methods of natural language understanding to extract subjective descriptions and mark relevant keywords. Acoustic signal analysis identifies potential segment boundaries of "paragraph" size. Within a paragraph, scenes are isolated and clustered into video segments through the use of various image understanding techniques. These components are described in Figure 2.

1.2 Library exploration

Users are able to explore the Informedia library through an interface that allows them to search using typed or spoken natural language queries, select relevant documents retrieved from the library, and display the material on their PC workstations. The library retrieval system can effectively process spoken queries and deliver relevant video data in a compact format, based on information embedded with the video during library creation. Video and other data may be explored in depth for related content. During retrieval based on keyword searches by a user, only the relevant video segments are displayed. Prototype exploration systems have been implemented on both Macintosh and PC platforms.

In this paper we focus on the library creation aspect of the Informedia Project.

Figure 1: Overview of the Informedia Digital Video Library System. Offline library creation (video segmentation and description; speech and language interpretation and indexing) turns raw video and audio footage into segmented, described video and an indexed transcript of the text, stored in the video library for distribution to users. Online library exploration supports visual, spoken, and natural language queries and interactive video search over the indexed store.



In particular, we describe how to segment a video meaningfully using the integration of different technologies. Through the combined efforts of Carnegie Mellon's speech, image and natural language processing groups, this system provides a robust tool for segmenting many types of video data in order to utilize them within a digital video library.

2 Video Segmentation

Generally, the videos in the Informedia Library are full one-hour feature broadcast videos based on educational documentaries. To allow efficient access to the relevant content of the videos, we need to separate them into small pieces: answering a user query by showing an hour-long video is rarely a reasonable response.

The Informedia library creation phase uses three different levels of segmentation for a video. The first and generally largest segment is a "video paragraph", which consists of a series of related scenes with a common content. The second level of segmentation identifies a single scene within the video paragraph. Finally, within a single scene we also need to be able to select a representative frame icon for static displays.

2.1 Video paragraphs

When a user receives the response to a query, the system needs to determine how much content and context to display. Where should the video clip start, and where does it end? The answer is partly determined by the content of the user query, but it also depends on natural segments within the video, which we call "video paragraphs". In the ideal case, a video paragraph starts at the natural boundary of the relevant content and ends wherever the video moves to a different context.

2.2 Individual scenes

Segment breaks produced by image processing are examined along with the boundaries identified by the speech and natural language processing of the transcript, and an improved set of segment boundaries is heuristically derived to partition the video paragraphs into scenes.

All frames from each new scene will be used to select the frame icon. This technique allows for the inclusion of all relevant image information in the video and the elimination of redundant data.

Figure 2: Combined technology to select the representative frame (icon). The figure shows video paragraph detection (speech SNR), speech transcript keywords, scene isolation, and image analysis with keyword search feeding the choice of representative frame icon.



2.3 Frame icons

For the purposes of static displays, the most characteristic frame of a scene is included in static (non-animated) representations of the user's selection. A single frame is displayed as the representative for the whole video segment. This is used in an outlined display showing the results of a user query.

Showing frame icons allows the user to simultaneously look at a static representation of multiple video paragraphs and to obtain some information about their content and possible relevance to the user's query, before selecting any one paragraph for playback. Frame icons are also important as encapsulations of the video paragraph for printed reports and viewgraphs.

In order to create these various levels of segmentation, we integrate a number of different technologies, which will be described in the next section.

3 Component Technologies

There are three broad categories of technologies we can bring to bear on the problem of identifying video segments from broadcast video materials:

a. Text processing looks at the textual (ASCII) representation of the words that were spoken, as well as other annotations derived from the transcript, the production notes or the closed-captioning that may be available.

b. Speech signal analysis provides the basis for analyzing the audio component of the material.

c. Image analysis looks at the images in the video-only portion.

Currently, in the library creation phase of the Informedia Digital Video Library, the following specific approaches are used to create segmentation information.

3.1 Text Analysis

Text analysis can work on an existing ASCII transcript to help segment the text into paragraphs. An analysis of keyword prominence allows us to identify important sections in the transcript [Mauldin89]. Other, more sophisticated language-based criteria are under investigation; the notion of semantic connections between text portions might be exploited for segmentation as well. Currently we use two main techniques in natural language analysis:

a. If we have a complete time-aligned transcript available from the closed-captioning or through a human-generated transcription, we can exploit natural "structural" text markers such as punctuation to identify segments of video paragraph granularity.

b. To identify and rank the contents of various segments, we use the well-known technique of TF/IDF (term frequency/inverse document frequency) to identify critical keywords and their relative importance for the video document [Salton83].
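As an illustration, the following is a minimal sketch of TF/IDF weighting over transcript segments, assuming each segment is already tokenized into lowercase words; the function name and data layout are illustrative and not the system's actual implementation.

```python
import math
from collections import Counter

def tfidf_weights(segments):
    """Compute TF/IDF weights for each word in each transcript segment.

    segments: list of token lists, one per video document/segment.
    Returns a list of {word: weight} dicts, one per segment.
    """
    n_docs = len(segments)
    # Document frequency: number of segments containing each word.
    df = Counter()
    for tokens in segments:
        df.update(set(tokens))

    weights = []
    for tokens in segments:
        tf = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

# The highest-weighted words in a segment serve as its critical keywords.
segs = [["toy", "manufacturer", "toy", "jury"],
        ["species", "extinct", "dinosaurs", "species"]]
for w in tfidf_weights(segs):
    print(sorted(w, key=w.get, reverse=True)[:2])
```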

3.2 Speech Analysis

Speech analysis operates only on the audio portion of the video. Using speech recognition we can obtain a transcript, although it may contain errors. We can also detect transitions between speakers and topics, which are usually marked by silence or low-energy areas in the acoustic signal.

Recognition

To transcribe the content of the video material, we recognize spoken words with the Sphinx-II speech recognizer. The CMU Sphinx-II system uses semi-continuous Hidden Markov Models to model context-dependent phones (triphones), including between-word context [Hwang94]. The recognizer processes an utterance in three steps. It first makes a forward time-synchronous pass using full between-word models, Viterbi scoring and a trigram language model. This produces a word lattice where words may have only one begin time but several end times. The recognizer then makes a backward pass which uses the end times from the words in the first pass and produces a second lattice which contains multiple begin times for words. Finally, an A* algorithm is used to generate the best hypothesis from these two lattices. The language model consists of words (with probabilities) and bigrams/trigrams, which are word pairs/triplets with conditional probabilities for the last word given the previous word(s). The language model was constructed from a corpus of news stories from the Wall Street Journal from 1989 to 1994 and the Associated Press news service stories from 1988 to 1990. Only trigrams that were encountered more than once were included in the model, but all bigrams and the most frequent 58,800 words in the corpus were included [Rudnicky95].
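The pruning policy described above (keep the most frequent words, all bigrams, and only repeated trigrams) can be pictured with the hedged sketch below; it assumes a tokenized corpus and is not the actual Sphinx-II language model build.

```python
from collections import Counter

def collect_ngram_counts(sentences, vocab_size=58800):
    """Collect n-gram counts with the pruning policy in the text:
    keep the most frequent words, all bigrams, and only trigrams
    observed more than once."""
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        trigrams.update(zip(words, words[1:], words[2:]))

    vocab = {w for w, _ in unigrams.most_common(vocab_size)}
    kept_trigrams = {t: c for t, c in trigrams.items() if c > 1}
    return vocab, dict(bigrams), kept_trigrams
```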

Processing the video tape using the speech recognition system gives us a transcript. This transcript contains errors which, depending on the quality of the tape and the subject matter, currently range from 20% to 70% word error rate.

Acoustic Segmentation

To detect breaks between utterances, we use a modification of signal-to-noise ratio (SNR) techniques which compute signal power. This algorithm computes the power of digitized speech samples, where S_i is a pre-emphasized sample of speech within a frame of 20 milliseconds:

Power = log( (1/n) · Σ_i S_i² )


A low power level indicates that there is little active speech occurring in this frame (low energy). Segmentation breaks between utterances are set at the minimum power as averaged over a one-second window. To prevent unusually long segments, we force the system to place at least one break within 30 seconds.
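A minimal sketch of this power-based segmentation follows, assuming mono samples already loaded as a NumPy array; the frame length, smoothing window, and forced-break interval follow the values in the text, but the pre-emphasis constant and the choice of exactly one break per 30-second stretch are simplifying assumptions.

```python
import numpy as np

def acoustic_breaks(samples, rate=8000, frame_ms=20, window_s=1.0, max_gap_s=30.0):
    """Place utterance breaks at power minima, at most max_gap_s apart."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len).astype(float)

    # Pre-emphasis, then log power per 20 ms frame: log((1/n) * sum(S_i^2)).
    shifted = np.concatenate([frames[:, :1], frames[:, :-1]], axis=1)
    pre = frames - 0.97 * shifted
    power = np.log(np.mean(pre ** 2, axis=1) + 1e-10)

    # Average the power over a 1 second window.
    win = int(window_s * 1000 / frame_ms)
    smoothed = np.convolve(power, np.ones(win) / win, mode="same")

    # Within each 30 second stretch, put a break at the minimum smoothed power
    # (a simplification of "at least one break within 30 seconds").
    frames_per_gap = int(max_gap_s * 1000 / frame_ms)
    breaks = []
    for start in range(0, n_frames, frames_per_gap):
        chunk = smoothed[start:start + frames_per_gap]
        breaks.append((start + int(np.argmin(chunk))) * frame_len / rate)  # seconds
    return breaks
```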

3.3 Image Analysis

Image analysis is primarily used for the identification of breaks between scenes and the identification of a single static frame icon that is representative of a scene.

Histogram Analysis

Video is segmented into scenes through the use of comparative difference measures [Zhang93]. Images with small histogram disparity are considered to be relatively equivalent. By detecting significant changes in the weighted color histogram of each successive frame, image sequences can be separated into individual scenes. A comparison between the cumulative distributions is used as a difference measure. The histogram difference plot is shown in the bottom graph of Figure 3. This result is passed through a high-pass filter to further isolate peaks, and an empirical threshold is used to select only those regions where scene breaks occur. To make the analysis more robust, we examine the individual images in tiled subwindows. This reduces the noise in our difference data and compensates for motion between frames. The images are initially subsampled to provide an efficient means of computation. Using only the histogram difference, we have achieved 90% accuracy on a test set of roughly 200,000 video images (2 hours).
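The sketch below illustrates the tiled cumulative-histogram difference test, assuming frames arrive as subsampled RGB arrays; the tile grid, bin count, and threshold are illustrative placeholders rather than the values used in the Informedia system, and the high-pass filtering step is omitted.

```python
import numpy as np

def tile_histograms(frame, tiles=4, bins=8):
    """Cumulative color histograms over a tiles x tiles grid of subwindows."""
    h, w, _ = frame.shape
    hists = []
    for i in range(tiles):
        for j in range(tiles):
            tile = frame[i*h//tiles:(i+1)*h//tiles, j*w//tiles:(j+1)*w//tiles]
            hist, _ = np.histogramdd(tile.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            hists.append(np.cumsum(hist.ravel()) / max(hist.sum(), 1))
    return np.array(hists)

def scene_breaks(frames, threshold=0.15):
    """Mark frame indices where the averaged per-tile cumulative-histogram
    difference between successive frames exceeds an empirical threshold."""
    breaks, prev = [], None
    for idx, frame in enumerate(frames):
        cur = tile_histograms(frame)
        if prev is not None:
            diff = np.mean(np.abs(cur - prev))  # difference of cumulative distributions
            if diff > threshold:
                breaks.append(idx)
        prev = cur
    return breaks
```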

Optical Flow

One important method of visual segmentation and description is based on interpreting camera motion [Akutsu94].

Figure 3: Scene segmentation and motion vector error. The top graph plots the motion vector confidence measure and the bottom graph plots the histogram difference analysis, both against frame number.

We can interpret camera motion as a pan or zoom by examining the geometric properties of the optical flow vectors. Using the Lucas-Kanade gradient descent method for optical flow, we can track individual regions from one frame to the next [Lucas81]. By measuring the velocity that individual regions show over time, a motion representation of the scene is created. Figure 4 shows examples of the optical flow analysis for different types of camera motion. Drastic changes in this flow indicate random motion and, therefore, new scenes. These changes will also occur during gradual transitions between images such as fades or special effects.

Only regions of low ambiguity are selected for tracking. Trackable regions are found by searching the entire image for subwindows whose gradient derivatives exhibit relatively similar eigenvalues. In order to accurately track a region over large areas, a multi-resolution structure is used. With this structure we can track regions across many pixels and reduce the time needed for computation. When optical flow is minimal, the frames are suitable for an iconic frame representation. Since we are primarily interested in distinguishing static frames from motion frames, it was sufficient to track only the top 30 regions.
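As a rough modern stand-in for the region tracking described above, the sketch below uses OpenCV's corner detector and pyramidal Lucas-Kanade tracker (a multi-resolution structure); OpenCV is an assumption of this sketch, not the tool used in the original system, and the parameters and the "static frame" criterion are placeholders.

```python
import cv2
import numpy as np

def flow_magnitude(prev_gray, next_gray, max_regions=30):
    """Track up to 30 low-ambiguity regions between two grayscale frames
    and return the mean flow magnitude (near zero for static frames)."""
    # Corners with strong, similar gradient eigenvalues are the trackable regions.
    points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_regions,
                                     qualityLevel=0.01, minDistance=10)
    if points is None:
        return 0.0
    # Pyramidal Lucas-Kanade: multi-resolution tracking across large motions.
    moved, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, points, None)
    good = status.ravel() == 1
    if not good.any():
        return 0.0
    return float(np.mean(np.linalg.norm(moved[good] - points[good], axis=-1)))
```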

These techniques work well when scene changes are abrupt; however, camera motion and gradual changes can severely affect the accuracy of the system. The first graph in Figure 3 shows the optical flow error for a given sequence. When changes are gradual, we combine the optical flow results with histogram analysis. This allows for segmentation under conditions that do not involve drastic changes in image content, with detection accuracy as high as 95%.

Figure 4: Camera motion analysis using optical flow (frames I0, I1, I2 and the resulting flow field). Flow vectors are amplified for visibility.


4 Technology Synthesis

We now describe how we integrate the different component technologies. In our early work on the Informedia digital video library, all segmentation was done by hand. We have now moved to a procedure where the segmentation boundaries are suggested by the system, but adjusted and verified by a person supervising the digital video library creation process. Eventually we will transition from computer-assisted procedures to fully automatic video segmentation, as the algorithms described above become better tested and more robust.

Our current library creation process starts with a raw digitized video tape. The audio portion is fed through the speech analysis routines, which produce a transcript of the spoken text. The speech signal is also analyzed for low-energy sections that indicate acoustic "paragraphs" through silence. This is the first pass at segmentation. If a closed-caption transcript is available, we use that instead of the speech recognition output, since it contains fewer errors.

The transcript is processed by the natural language system and important keywords are identified. Using the results returned from the image analysis, we then match the acoustic paragraph to the nearest scene break. This gives us an appropriate video paragraph clip in response to a user's request.
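A minimal sketch of this matching step, assuming both the acoustic paragraph boundaries and the image-derived scene breaks are available as lists of times in seconds; the names are illustrative.

```python
def snap_to_scene_breaks(acoustic_boundaries, scene_breaks):
    """Move each acoustic 'paragraph' boundary to the nearest scene break,
    so video paragraphs begin and end on visual scene boundaries."""
    return [min(scene_breaks, key=lambda s: abs(s - t))
            for t in acoustic_boundaries]

# e.g. snap_to_scene_breaks([12.4, 58.9], [0.0, 11.8, 31.2, 60.1]) -> [11.8, 60.1]
```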

The keywords and their corresponding paragraph locations in the video are indexed in the Informedia library catalogue. To obtain video clips suitable for viewers, we first search for keywords from the user query in the recognition transcript. When we find a match, the surrounding video paragraph is returned.
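This retrieval step can be pictured as an inverted index from keywords to video paragraph locations. The sketch below is a hedged illustration of that lookup, not the format of the Informedia library catalogue.

```python
from collections import defaultdict

def build_catalogue(paragraphs):
    """paragraphs: list of (start_s, end_s, keywords) tuples.
    Returns keyword -> list of (start_s, end_s) paragraph locations."""
    index = defaultdict(list)
    for start, end, keywords in paragraphs:
        for word in keywords:
            index[word].append((start, end))
    return index

def retrieve(index, query_words):
    """Return the video paragraphs surrounding any matched query keyword."""
    hits = []
    for word in query_words:
        hits.extend(index.get(word, []))
    return sorted(set(hits))
```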

For a static icon representative of a video clip, we place most emphasis on the image data. The paragraph is determined by the transcript and keywords. Within the paragraph, the most prominent keywords identify the most prominent scene. The scene boundaries are determined by the image analysis of color histogram differences and optical flow analysis. Figure 5 shows the integration of technologies used by the system.

5 Conclusion

We are currently using these techniques to create digital video library collections suitable for full content retrieval.

Figure 5: Analysis of scene changes in video and audio signal. The figure shows the detected scenes and histogram scene analysis, the audio segments with their aligned transcript text, and the raw audio signal plotted against sample number.



While some of the steps are not yet fully integrated, each one has been shown to work independently, and several of the techniques are fully integrated within the Informedia system.

Another use of the combined technologies will be the development of the video skim [Smith95]. By presenting only significant regions, a short synopsis of the video paragraph can be used as a preview for the actual segment.

The Informedia Project will establish an on-line digital video library consisting of over 1,000 hours of video material. In order to be able to process this volume of data, practical, effective and efficient tools are essential. We have outlined a practical set of techniques for video segmentation that allows us to automatically process the volume of data required.

6 References

[Akutsu94] Akutsu, A. and Tonomura, Y. "Video Tomography: An Efficient Method for Camerawork Extraction and Motion Analysis," Proc. of ACM Multimedia '94, Oct. 15-20, 1994, San Francisco, CA, pp. 349-356.

[Christel94] Christel, M., Stevens, S., and Wactlar, H. "Informedia Digital Video Library," Proceedings of the Second ACM International Conference on Multimedia, Video Program. New York: ACM, October 1994, pp. 480-481.

[Hwang94] Hwang, M., Rosenfeld, R., Thayer, E., Mosur, R., Chase, L., Weide, R., Huang, X., and Alleva, F. "Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in SPHINX-II," ICASSP-94, vol. I, pp. 549-552.

[Lucas81] Lucas, B.D. and Kanade, T. "An Iterative Technique of Image Registration and Its Application to Stereo," Proc. 7th International Joint Conference on Artificial Intelligence, pp. 674-679, August 1981.

[Mauldin89] Mauldin, M. "Information Retrieval by Text Skimming," PhD Thesis, Carnegie Mellon University, August 1989. Revised edition published as "Conceptual Information Retrieval: A Case Study in Adaptive Partial Parsing," Kluwer Press, September 1991.

[Rudnicky95] Rudnicky, A. "Language Modeling with Limited Domain Data," Proceedings of the 1995 ARPA Workshop on Spoken Language Technology, in press.

[Salton83] Salton, G. and McGill, M.J. Introduction to Modern Information Retrieval, McGraw-Hill Computer Science Series, McGraw-Hill, New York, 1983.

[Stevens94] Stevens, S., Christel, M., and Wactlar, H. "Informedia: Improving Access to Digital Video," Interactions 1 (October 1994), pp. 67-71.

[Zhang93] Zhang, H., Kankanhalli, A., and Smoliar, S. "Automatic Partitioning of Full-Motion Video," Multimedia Systems 1 (1993), pp. 10-28.

[Smith95] Smith, M. and Kanade, T. "Video Skimming for Quick Browsing Based on Audio and Image Characterization," CS Technical Report, Carnegie Mellon University, Summer 1995.

7 Acknowledgment

The authors would like to thank Howard Wactlar and the other members of the Informedia Project for their valuable discussions and contributions. This work is partially funded by the National Science Foundation, the National Aeronautics and Space Administration, and the Advanced Research Projects Agency.