SEGMENTATION OF TV SHOWS INTO SCENES
USING SPEAKER DIARIZATION AND SPEECH RECOGNITION

Context
• Exponential growth of video content
  – Mostly TV and web
• Content-based video indexing
  – Query/browse by people
    • REPERE challenge
  – Query/browse by semantic concept
    • GdR ISIS IRIM project
    • NIST TRECVid Semantic Indexing task
• Make content consumption easier
  – Automatic summarization
    • PhD thesis with IRIT (Ph. Ercolessi)

(More) Context
• Automatic summarization of TV series
  – At the episode level
    • Shot segmentation
    • Scene segmentation
    • Plot (or substory) deinterlacing
    • Episode summarization
      – « Previously, on Lost… »
      – Browse by plot
  – At the collection level
    • Cross-episode plot
    • Episode summarization wrt. whole collection

Shot, scene, sequence and plot
• Shot
  – a part of a film between two camera cuts
• Scene
  – (each author working on scene segmentation uses their own definition)
  – group of consecutive shots
  – temporal continuity / unity of time
  – semantic coherence / unity of action
• Sequence
  – group of consecutive scenes
  – semantic coherence
  – aka story or topic in TV news
    • made of multiple stories introduced by the anchor
• Plot
  – group of (not necessarily consecutive) sequences
  – modern TV show episodes usually have multiple interlaced plots
  – two plots can overlap

Outline
• Context
• Definitions & notations
• Principle
  – Scene transition graph with color histograms
  – Generalized STG
• Multimodal extension
  – Speaker diarization & speech recognition
  – Multimodal fusion

Scene segmentation
• Input: shot boundaries
• Output: scene boundaries
• Classification problem on shot boundaries
  – precision, recall, F1-measure
[Figure: timeline of shots 1 to 11, shown for the input and for the output; legend: k = kth shot]
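Since scene segmentation is evaluated as a classification problem on shot boundaries, the three scores can be computed from two sets of boundary indices. The following is a minimal sketch; the function name `boundary_scores` and the convention that index k denotes the boundary between shot k and shot k + 1 are assumptions made for illustration, not taken from the slides.

```python
def boundary_scores(reference, hypothesis):
    """Precision, recall and F1 for detected scene boundaries.

    Both arguments are collections of shot-boundary indices, where index k
    denotes the boundary between shot k and shot k + 1 (assumed convention).
    """
    reference, hypothesis = set(reference), set(hypothesis)
    correct = len(reference & hypothesis)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of the 3 detected boundaries are correct,
# and 2 of the 4 reference boundaries are found.
p, r, f = boundary_scores(reference={3, 7, 12, 20}, hypothesis={3, 8, 12})
```

In this toy example a detected boundary counts as correct only when it matches a reference boundary exactly; any tolerance window would be a further design choice.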

Color
[Figure: shot boundaries aligned with three descriptor streams: automatic speech recognition (ASR), automatic speaker diarization (SD) and color histograms (HSV, one frame per second)]
• Multiple frames per shot, one color histogram per frame
• $d^{HSV}_{ij}$: minimum distance between all possible pairs of histograms from shots i and j
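The color dissimilarity between two shots, defined above as the minimum over all pairs of per-frame histograms, can be sketched as follows. The slides do not say which histogram distance is used, so the L1 distance here is an assumption, and the names `l1_distance` and `shot_color_distance` are hypothetical.

```python
def l1_distance(h1, h2):
    """L1 distance between two histograms of equal length (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def shot_color_distance(shot_i, shot_j, distance=l1_distance):
    """Minimum distance over all pairs of histograms, one taken from each shot."""
    return min(distance(hi, hj) for hi in shot_i for hj in shot_j)


# Each shot is a list of normalized histograms (one per sampled frame).
shot_a = [[0.5, 0.5, 0.0], [0.4, 0.4, 0.2]]
shot_b = [[0.0, 0.5, 0.5], [0.4, 0.5, 0.1]]
d_ab = shot_color_distance(shot_a, shot_b)
```

Taking the minimum means two shots are considered close as soon as any one frame of each looks alike, which suits shots that return to the same camera setup.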

Scene transition graph
[Figure: timeline of shots 1 to 11; legend: k = kth shot]
Segmentation of Video by Clustering and Graph Analysis / Yeung (1998)

Notations
• $d_{ij}$: dissimilarity between shots i and j
• $t_{ij}$: temporal distance between shots i and j
• $D_{ij}$: combined distance between shots i and j
• $\Delta_t$: temporal distance threshold
• $\Delta_d$: combined distance threshold

$$D_{ij} = \begin{cases} d_{ij} & \text{if } t_{ij} < \Delta_t \\ +\infty & \text{otherwise} \end{cases}$$
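The combined distance defined above keeps the dissimilarity of two shots only when they are close in time. A minimal sketch, assuming the distances are given as square nested lists and using the hypothetical names `combined_distance` and `delta_t`:

```python
import math


def combined_distance(d, t, delta_t):
    """Return D where D[i][j] = d[i][j] if t[i][j] < delta_t, else +infinity."""
    n = len(d)
    return [
        [d[i][j] if t[i][j] < delta_t else math.inf for j in range(n)]
        for i in range(n)
    ]


# Toy example with three shots; the temporal distance is taken
# as |i - j| purely for illustration.
d = [[0.0, 0.2, 0.1],
     [0.2, 0.0, 0.3],
     [0.1, 0.3, 0.0]]
t = [[abs(i - j) for j in range(3)] for i in range(3)]
D = combined_distance(d, t, delta_t=2)
```

Here shots 0 and 2 look alike (dissimilarity 0.1) but are too far apart in time, so their combined distance becomes infinite and they can never be clustered together.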

Scene transition graph
[Figure: shots 1 to 11 grouped into shot clusters; legend: k = kth shot]
• Step 1: complete-link agglomerative clustering (threshold $\Delta_d$)
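Step 1 can be illustrated with a small, unoptimized complete-link clustering written in plain Python. This is a sketch under assumptions: the function name `complete_link_clusters` is hypothetical, and stopping as soon as the smallest complete-link distance reaches `delta_d` is one reasonable reading of the threshold, not a detail confirmed by the slides.

```python
import math


def complete_link_clusters(D, delta_d):
    """Greedy complete-link agglomerative clustering of shots.

    D is a symmetric matrix of combined distances (math.inf allowed).
    Two clusters are merged while the smallest complete-link distance,
    i.e. the largest pairwise distance between their members,
    stays below delta_d. Returns a list of sorted lists of shot indices.
    """
    clusters = [[i] for i in range(len(D))]

    def link(a, b):
        return max(D[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, pair = math.inf, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                dist = link(clusters[x], clusters[y])
                if dist < best:
                    best, pair = dist, (x, y)
        if pair is None or best >= delta_d:
            break
        x, y = pair
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


# Shots 0 and 2 look alike, shots 1 and 3 look alike, shot 4 is different.
D = [[0.0, 0.9, 0.1, 0.8, math.inf],
     [0.9, 0.0, 0.9, 0.2, 0.9],
     [0.1, 0.9, 0.0, 0.9, 0.9],
     [0.8, 0.2, 0.9, 0.0, 0.9],
     [math.inf, 0.9, 0.9, 0.9, 0.0]]
clusters = complete_link_clusters(D, delta_d=0.5)
```

With complete linkage, a single infinite entry between two members is enough to keep two clusters apart, which is how the temporal threshold limits clusters to shots that are close in time.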

Scene transition graph
[Figure: shot clusters over shots 1 to 11, connected by edges; legend: k = kth shot]
• Step 2: scene transition graph
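Step 2 links shot clusters that follow each other in time. A minimal sketch, assuming each shot has already been assigned a cluster label and that an edge is drawn whenever a shot of one cluster is immediately followed by a shot of another; the name `scene_transition_graph` is hypothetical.

```python
def scene_transition_graph(labels):
    """Directed edges between shot clusters.

    labels[k] is the cluster of the k-th shot, in temporal order.
    An edge (u, v) exists when a shot of cluster u is immediately
    followed by a shot of a different cluster v.
    """
    edges = set()
    for current, following in zip(labels, labels[1:]):
        if current != following:
            edges.add((current, following))
    return edges


# Shots alternate between clusters "A" and "B", then between "C" and "D".
labels = ["A", "B", "A", "B", "C", "D", "C"]
edges = scene_transition_graph(labels)
```

In this toy example the only edge joining the {A, B} group to the {C, D} group is ("B", "C").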

Scene transition graph
[Figure: scene transition graph over shots 1 to 11; a cut edge (||) separates scene 1 from scene 2; legend: k = kth shot]
• Step 3: cut-edge detection
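Step 3 can be sketched without building the graph explicitly. The code below relies on an observation that is an assumption of this sketch rather than a statement from the slides: the transition from shot k to shot k + 1 is a cut edge exactly when no cluster has shots on both sides of it. The name `scene_boundaries` is hypothetical.

```python
def scene_boundaries(labels):
    """Shot indices k such that a scene boundary lies between shots k and k + 1.

    The transition k -> k + 1 is treated as a cut edge when no cluster has
    shots on both sides of it, in which case removing that edge splits
    the graph in two.
    """
    last_seen = {label: k for k, label in enumerate(labels)}
    boundaries = []
    reach = 0  # furthest shot index reached by any cluster seen so far
    for k, label in enumerate(labels[:-1]):
        reach = max(reach, last_seen[label])
        if reach == k:
            boundaries.append(k)
    return boundaries


labels = ["A", "B", "A", "B", "C", "D", "C"]
cuts = scene_boundaries(labels)
```

With these labels the single boundary falls after shot 3, giving one scene for shots 0 to 3 and another for shots 4 to 6.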

HSV/STG Results
• Corpus
  – First eight episodes of the Ally McBeal TV show
  – 5 hours of video, 5564 shots and 306 scenes
• Evaluation
  – Leave-one-episode-out cross-validation
  – Two thresholds: $\Delta_t$ and $\Delta_d$

| | Precision | Recall | F-Measure | # Scenes |
|---|---|---|---|---|
| HSV (STG) | 0.256 | 0.533 | 0.449 | 461 |

Scene transition graph
• Limitation:
  – Every pair $(\Delta_t, \Delta_d)$ leads to a different set of detected scene boundaries
  – The optimal values depend heavily on the video
• Proposal (Sidiropoulos, 2011):
  – Generalized STG

Generalized STG
[Figure: scene boundary probability p (vertical axis, 0.0 to 0.7), with threshold $\theta$]
– Large set of STGs by selecting random $\Delta_t$ and $\Delta_d$
– Scene boundary probability $p$
– Unique threshold $\theta$
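The generalized STG idea, as summarized above, can be sketched as a vote over many randomly parameterized STGs. Several details below are assumptions made for illustration: thresholds are drawn uniformly from given ranges, a boundary is kept when its probability is at least theta, and `segment` is a hypothetical callable that runs one STG for a given pair of thresholds.

```python
import random
from collections import Counter


def boundary_probabilities(segment, t_range, d_range, n_graphs=1000, seed=0):
    """Fraction of randomly parameterized STGs that detect each boundary.

    `segment(delta_t, delta_d)` is a hypothetical callable that runs one STG
    and returns the detected scene boundaries as an iterable of shot indices.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_graphs):
        delta_t = rng.uniform(*t_range)  # uniform sampling is an assumption
        delta_d = rng.uniform(*d_range)
        counts.update(set(segment(delta_t, delta_d)))
    return {k: c / n_graphs for k, c in counts.items()}


def threshold_boundaries(probabilities, theta):
    """Keep the boundaries whose probability is at least theta."""
    return sorted(k for k, p in probabilities.items() if p >= theta)


# Toy stand-in for a real STG: boundary 3 is always detected,
# boundary 5 only when delta_d is small.
def toy_segment(delta_t, delta_d):
    return [3, 5] if delta_d < 0.25 else [3]


probs = boundary_probabilities(toy_segment, t_range=(10, 100), d_range=(0.0, 1.0))
detected = threshold_boundaries(probs, theta=0.5)
```

The two video-dependent thresholds are replaced by a single threshold on the probability, which is the point made on the slide.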

HSV/GSTG Results
• Corpus
  – First eight episodes of the Ally McBeal TV show
  – 5 hours of video, 5564 shots and 306 scenes
• Evaluation
  – Leave-one-episode-out cross-validation

| | Precision | Recall | F-Measure | # Scenes |
|---|---|---|---|---|
| HSV/STG | 0.256 | 0.533 | 0.449 | 461 |
| HSV/GSTG | 0.447 | 0.566 | 0.487 | 403 |

Multiple modalities, multiple $d_{ij}$
[Figure: shot boundaries aligned with three descriptor streams: automatic speech recognition (ASR), automatic speaker diarization (SD) and color histograms (HSV, one frame per second)]

Speaker diarization
[Figure: shot boundaries aligned with three descriptor streams: automatic speech recognition (ASR), automatic speaker diarization (SD) and color histograms (HSV, one frame per second)]
• Multiple speakers per shot, but only one descriptor
  – Term Frequency / Inverse Document Frequency
  – Speaker Frequency / Inverse Shot Frequency
• $d^{SD}_{ij}$: cosine distance between TF-IDF vectors
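The speaker descriptor transposes TF-IDF to speakers and shots. A minimal sketch follows; the exact weighting (relative speaker frequency times the log of the inverse shot frequency) and the names `tf_idf_vectors` and `cosine_distance` are assumptions, since the slides only name the scheme.

```python
import math
from collections import Counter


def tf_idf_vectors(shots):
    """One weight dictionary per shot.

    `shots` is a list of lists of speaker labels (one label per speech turn).
    Weight = speaker frequency in the shot * log(N / number of shots
    containing that speaker); this particular weighting is an assumption.
    """
    n = len(shots)
    shot_freq = Counter(s for shot in shots for s in set(shot))
    vectors = []
    for shot in shots:
        counts = Counter(shot)
        total = sum(counts.values())
        vectors.append({
            s: (c / total) * math.log(n / shot_freq[s])
            for s, c in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(w * v.get(s, 0.0) for s, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


shots = [["ally", "billy"], ["ally", "billy", "ally"],
         ["georgia"], ["richard", "georgia"]]
vectors = tf_idf_vectors(shots)
d_01 = cosine_distance(vectors[0], vectors[1])  # same two speakers: small distance
d_02 = cosine_distance(vectors[0], vectors[2])  # no speaker in common: distance 1
```

The same pair of functions applies to the speech recognition output if speaker labels are replaced by lemmatized words.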

Automatic Speech Recognition
[Figure: shot boundaries aligned with three descriptor streams: automatic speech recognition (ASR), automatic speaker diarization (SD) and color histograms (HSV, one frame per second)]
• Lemmatization (TreeTagger)
• $d^{ASR}_{ij}$: cosine distance between TF-IDF vectors

Monomodal GSTG Results
• Corpus
  – First eight episodes of the Ally McBeal TV show
  – 5 hours of video, 5564 shots and 306 scenes
• Evaluation
  – Leave-one-episode-out cross-validation

| | Precision | Recall | F-Measure | # Scenes |
|---|---|---|---|---|
| HSV (STG) | 0.256 | 0.533 | 0.449 | 461 |
| HSV | 0.447 | 0.566 | 0.487 | 403 |
| SD | 0.157 | 0.562 | 0.240 | 1136 |
| ASR | 0.105 | 0.572 | 0.175 | 1751 |

Monomodal GSTG approaches
• Limitation:
  – One modality cannot solve the problem on its own
• Proposal:
  – Multimodal fusion

Multimodal Fusion
[Figure: processing pipeline: video → front-end → shots → distance between shots → GSTG → scene boundary probabilities → threshold → scene boundaries; early, intermediate and late fusion are marked along the pipeline]
• Late fusion
  – intersection ($\cap$) or union ($\cup$)
• Early fusion
  – $d_{ij} = w_{HSV} \cdot d^{HSV}_{ij} + w_{SD} \cdot d^{SD}_{ij} + w_{ASR} \cdot d^{ASR}_{ij}$
• Intermediate fusion
  – $p = w_{HSV} \cdot p^{HSV} + w_{SD} \cdot p^{SD} + w_{ASR} \cdot p^{ASR}$
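The three fusion strategies differ only in what is combined: distances, probabilities or final decisions. A minimal sketch follows; the weights and all numeric values are illustrative and not taken from the slides, and the names `late_fusion` and `weighted_sum` are hypothetical.

```python
def late_fusion(boundary_sets, mode="intersection"):
    """Combine per-modality boundary sets by intersection or union."""
    sets = [set(b) for b in boundary_sets]
    if mode == "intersection":
        return set.intersection(*sets)
    if mode == "union":
        return set.union(*sets)
    raise ValueError(f"unknown mode: {mode}")


def weighted_sum(values, weights):
    """Weighted sum over modalities, e.g. w_HSV * x_HSV + w_SD * x_SD + ..."""
    return sum(weights[m] * values[m] for m in weights)


# Illustrative weights only; the slides do not give their values.
weights = {"HSV": 0.6, "SD": 0.3, "ASR": 0.1}

# Early fusion: combine the distances of one shot pair (i, j)
# before building the GSTG.
d_ij = weighted_sum({"HSV": 0.2, "SD": 0.5, "ASR": 0.9}, weights)

# Intermediate fusion: combine the boundary probabilities
# of one shot boundary before thresholding.
p = weighted_sum({"HSV": 0.7, "SD": 0.4, "ASR": 0.1}, weights)

# Late fusion: combine the final decisions of two modalities.
both = late_fusion([{3, 7, 12}, {3, 12, 15}], mode="intersection")
either = late_fusion([{3, 7, 12}, {3, 12, 15}], mode="union")
```

Intersection can only remove boundaries and union can only add them, which matches the precision and recall trade-off one would expect from late fusion.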

Multimodal Results

| Fusion | Precision | Recall | F-Measure |
|---|---|---|---|
| HSV (baseline) | 0.447 | 0.566 | 0.487 |
| HSV ∩ SD | 0.598 | 0.357 | 0.438 |
| HSV ∩ SD ∩ ASR | 0.606 | 0.242 | 0.341 |
| HSV ∪ SD | 0.180 | 0.770 | 0.288 |
| HSV ∪ SD ∪ ASR | 0.121 | 0.851 | 0.210 |
| d(HSV) + d(SD) | 0.445 | 0.599 | 0.499 |
| d(HSV) + d(SD) + d(ASR) | 0.445 | 0.599 | 0.499 |
| p(HSV) + p(SD) | 0.484 | 0.555 | 0.510 |
| p(HSV) + p(SD) + p(ASR) | 0.488 | 0.622 | 0.539 |

Conclusion & future work
• Graph-based approach to segmentation
• Using more modalities is better
  – Better use of ASR output
  – Add visual semantic concept detection

What’s next?
• At the episode level
  – Shot segmentation
  – Scene segmentation
  – Plot (or story) deinterlacing
  – Episode summarization
    • « Previously, on Lost… »
    • Browse by plot
• At the collection level
  – Cross-episode plot
  – Episode summarization wrt. whole collection