EE 6850, F'02, Chang, Columbia U.
EE 6850 Lecture #5 (Oct. 2, 2002)
• Syntactic-level browsing and visualization
  - Keyframe selection and browsing
  - Visualization and skimming
• References
  - D. Zhong, H. Zhang and S.-F. Chang, "Clustering Methods for Video Browsing and Annotation," IS&T/SPIE Symposium on Storage and Retrieval for Image and Video Database, San Jose, February 1996.
  - H.J. Zhang, C.Y. Low, S.W. Smoliar and J.H. Wu, "Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution," in Proc. of the ACM Multimedia Conference, pages 15-24, 1995.
  - M. Christel, A. Hauptmann, A. Warmack and S. Crosby, "Adjustable Filmstrips and Skims as Abstractions for a Digital Video Library," IEEE Advances in Digital Libraries Conference, Baltimore, MD, May 1999.
  - B.-L. Yeo and M.M. Yeung, "Retrieving and Visualizing Video," Communications of the ACM, 40(12), Dec. 1997, pp. 43-52.
  - H. Sundaram and S.-F. Chang, "Constrained Utility Maximization for Generating Visual Skims," IEEE Workshop on Content-based Access of Image and Video Libraries (CBAIVL 2001), Dec. 2001, Kauai, HI, USA.
Issues
• Shot summary: play each video clip?
• Keyframe selection and browsing
• Visualize a large collection of clips?
Examples:
  - WebClip and WebSeek video visualization
  - Different types of video: consumer, sports, news, etc.
Approaches:
• Keyframes, mosaicing
• Hierarchical clustering
• Spatial summary and visualization
Keyframe Selection
• Considerations
  - Flexibility (number and level)
  - Fidelity (content comprehension)
• Approaches
  - Fixed number, fixed spacing
  - First/last frame, clean frame, cluster centroids
  - Difference, motion
  - Clustering
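Two of the approaches listed above, fixed spacing and frame difference, can be sketched as follows. This is only an illustration: the per-frame feature vectors (for example, color histograms), the L1 distance, and the function names are assumptions, not part of the lecture.

```python
from typing import List, Sequence


def fixed_spacing_keyframes(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices at roughly even spacing across a shot."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the middle frame of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def difference_keyframes(features: Sequence[Sequence[float]],
                         threshold: float) -> List[int]:
    """Keep the first frame, then add a keyframe whenever the current frame
    differs from the last selected keyframe by more than `threshold`."""
    if not features:
        return []
    selected = [0]
    for i in range(1, len(features)):
        if l1_distance(features[i], features[selected[-1]]) > threshold:
            selected.append(i)
    return selected
```

The fixed-spacing variant gives direct control over the number of keyframes (flexibility), while the difference variant adapts the number to how much the content changes (fidelity).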
Structure Parsing (Zhong’96)
• Automatic Layout of KeyFrames [Uchihashi & Foote ’99]
  - Given 2D space constraints, a keyframe set, and their importance measures, what’s the best display layout?
  - Comic book concept
• Issues:
  - Time order vs. layout order
  - Preserve high-level structures
  - Importance measures
Packing Keyframes to 2D space [Uchihashi ’99]
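To make the layout question concrete, here is a minimal greedy sketch that sizes each keyframe by its importance and packs the frames row by row in time order. It is a simple stand-in for illustration, not the method of the cited paper; the importance scale [0, 1], the size mapping, and all names are assumptions.

```python
from typing import List, Tuple


def frame_size(importance: float, max_size: int = 3) -> int:
    """Map an importance score in [0, 1] to a square of 1..max_size grid cells."""
    clamped = min(max(importance, 0.0), 1.0)
    return min(max_size, 1 + int(clamped * max_size))


def shelf_pack(importances: List[float], row_width: int,
               max_size: int = 3) -> List[Tuple[int, int, int]]:
    """Place keyframes left to right, top to bottom, preserving time order.

    Returns one (x, y, size) tuple per keyframe, in grid cells.
    """
    if row_width < max_size:
        raise ValueError("row_width must fit the largest frame")
    placements = []
    x = y = row_height = 0
    for importance in importances:
        size = frame_size(importance, max_size)
        if x + size > row_width:
            # Current row is full: start a new row below the tallest frame.
            y += row_height
            x = row_height = 0
        placements.append((x, y, size))
        x += size
        row_height = max(row_height, size)
    return placements
```

This keeps time order equal to reading order but leaves gaps next to small frames, which is exactly the trade-off between time order and layout order raised in the issues above.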
Information Visualization
• Shneiderman ‘96: “Overview first, zoom and filter, then details-on-demand”
• Problem:
  - Result overload and confusing search interfaces
  - Browsing 100-1000 video segments
  - Search using multiple features
• Approaches:
  - Text items + scores
  - Thumbnail key frames + scores
  - Map digest
  - VIBE concept map
  - Distance map
  - Feature-space browser
• CMU Informedia Project
  - > 1200 hrs news, 400 hrs documentary
• Keyword search thumbnails
• Communicate more information than just a “top 10 textual list”
• Color bar: word match score; text headline
• Issues:
  - Keyframe selection, number of keyframes
  - Context
  - Temporal information
Filmstrip
• Spatial static abstraction of multiple shots
• Issues:
  - Use film border to indicate association
  - Time relation between shot and whole segment
• Too many key frames: use match shots only? → query-based filmstrip
Timeline Digest (e.g., impeachment)
• Explore temporal trend
• Can be combined with VIBE and Map Digest
• Issue: context, a/v content
Map Digest (e.g., Pope visit)
• Combine visuals and time clusters
Skim
• Temporal abstraction: motivate viewers
• Time compression: preserving essential data
• Segments with match words are combined
• Each segment is extended based on the “goodness scores” of the ending point, until the time budget is reached
• Issues:
  - Choppy presentation
  - Temporal syntax (e.g., dialog)
  - Early cutout of sentence, scene, audio
[Figure labels: included segment; word/phrase match]
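The extension step (grow each matched segment toward a good ending point until the time budget is spent) might be sketched as below. The data layout, the greedy choice of the highest-scoring next ending point, and all names are assumptions for illustration; the slide does not specify how goodness scores are computed or how ties and overlaps are handled.

```python
from typing import List, Tuple

Segment = Tuple[float, float]            # (start, end) in seconds
Candidates = List[Tuple[float, float]]   # (end_time, goodness), increasing end_time


def extend_segments(segments: List[Segment],
                    candidates: List[Candidates],
                    budget: float) -> List[Segment]:
    """Greedily extend matched segments to later ending points.

    At each step, pick the segment whose next candidate ending point has the
    highest goodness score and still fits the total time budget.
    Overlap between neighbouring segments is not handled in this sketch.
    """
    ends = [end for _, end in segments]
    next_idx = [0] * len(segments)
    used = sum(end - start for start, end in segments)
    while True:
        best = None  # (goodness, segment index, new end, extra seconds)
        for s, cands in enumerate(candidates):
            if next_idx[s] >= len(cands):
                continue  # no further ending points for this segment
            new_end, goodness = cands[next_idx[s]]
            extra = new_end - ends[s]
            if used + extra <= budget and (best is None or goodness > best[0]):
                best = (goodness, s, new_end, extra)
        if best is None:
            break  # nothing fits any more
        _, s, new_end, extra = best
        ends[s] = new_end
        used += extra
        next_idx[s] += 1
    return [(start, end) for (start, _), end in zip(segments, ends)]
```

For example, with two matched segments of 2 s and 3 s and a 10 s budget, the sketch extends the first segment to its best ending point, then spends the remaining time on the second.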
VIBE
• Provide relevance to each concept
  - Number of related concepts
  - Relative weights
• Drag the anchors to see related data
• Manipulate the concept combinations, e.g., and, or, not
• Zoom in on specific areas
• Filter by time, location…
• Issues:
  - Location ambiguity
  - Context beyond word matching
(example: Clinton, Andrew, Johnson, Impeachment)
VIBE: Concept Map
• Active query elements mapped in a 2D display.
• Each query element visualized as a concept.
• Location of return results is a function of position and relative distances to each concept.
• Allow users to explore concept relationships.
• Allow users to zoom in to particular return results.
[Figure: query elements Q0 and Q1 in the x-y display plane]
$$p = \frac{d_0 p_0 + d_1 p_1}{d_0 + d_1}$$
Variations of Concept Map
[Figure: concept map with anchors such as sunset, low complexity, skiing, high activity, camera]
$$P = \frac{d_0 p_0 + d_1 p_1 + d_2 p_2}{d_0 + d_1 + d_2}$$
• Points on the edges are covered by two concepts only
• Points at the center are equally distant from all concepts
• Issues: multiple concepts, different modalities
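The placement rule P = (d0 p0 + d1 p1 + d2 p2) / (d0 + d1 + d2) generalizes to any number of concept anchors. The sketch below reads each d_i as a nonnegative weight attached to anchor p_i; that reading, and the function name, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def concept_map_position(anchors: Sequence[Point],
                         weights: Sequence[float]) -> Point:
    """Weighted average of the anchor positions: sum(d_i * p_i) / sum(d_i)."""
    if len(anchors) != len(weights):
        raise ValueError("need one weight per anchor")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)


# Three hypothetical anchors laid out as a triangle.
anchors: List[Point] = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
on_edge = concept_map_position(anchors, [1.0, 1.0, 0.0])   # only two concepts
at_center = concept_map_position(anchors, [1.0, 1.0, 1.0])  # all three equally
```

A result with zero weight for one concept lands on the edge between the other two anchors, and equal weights land at the centroid of the anchors, matching the two observations in the bullets above.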
Video Skim Generation (Sundaram/Chang 01)
Skim: drastically condensed audio-video clips
1. What is the appropriate problem formulation?
2. What are important types of skims?
3. Possible operations: shot selection and trimming.
4. What is the data unit for transformation?
5. How is the “quality” affected? Aesthetic effects, information comprehension.
Utility framework to model the relation between operations and user comprehension → optimal skim generation
Examples:
• News story: 100 sec → 16 sec (shot removal)
• Action scene: 190 sec → 38 sec (proportional)
[Figure label: dropped frames]
The entities preserved in skims
• Video shot (duration altered)
  - The fundamental video entity; we shall maximize the coherence of each retained video shot
• Segment beginnings (SBEGs), significant phrases
  - This is an element of the speech discourse
• Synchronous multimedia segments
  - Ensures maximum skim coherence
• Elements of visual syntax
  - Dialogs, regular anchors, shot phrases
• Film rhythm
  - Preserves the “pace” of the film
Modeling Utility of Shots
• How much time is required for generic comprehension (who, what, where, when)?
• Is comprehension time related to the visual spatio-temporal complexity of the shot?
[Figure panels (a) and (b)]
• Explore viewer perceptual model
Estimate Utility Function from Subjective Study
[Figure: required time vs. complexity (0 to 1), with upper bound Ub and lower bound Lb; an original shot is reduced to the upper bound]
$$U_b(c) = 2.40\,c + 1.11, \qquad L_b(c) = 0.61\,c + 0.68$$
• Plot of average required time vs. complexity shows two bounds
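The two linear bounds, Ub(c) = 2.40c + 1.11 and Lb(c) = 0.61c + 0.68, can be used directly to decide how far a shot may be trimmed. The sketch below assumes complexity c lies in [0, 1] and that times are in seconds; the units and the helper names are assumptions, not stated on the slide.

```python
def upper_bound(c: float) -> float:
    """Ub(c): beyond this duration, extra viewing time adds little comprehension."""
    return 2.40 * c + 1.11


def lower_bound(c: float) -> float:
    """Lb(c): below this duration, the shot is assumed too short to comprehend."""
    return 0.61 * c + 0.68


def reduce_to_upper_bound(duration: float, c: float) -> float:
    """Trim a shot to Ub(c) if it is longer; leave shorter shots unchanged."""
    return min(duration, upper_bound(c))


def is_comprehensible(duration: float, c: float) -> bool:
    """True if the (possibly trimmed) duration is still at least Lb(c)."""
    return duration >= lower_bound(c)
```

For a shot of complexity 0.5, this gives an admissible duration range of roughly 0.99 to 2.31 time units, so a 10-unit shot could be reduced to about 2.31 without dropping below the lower bound.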
Shot utility function
t: duration, c: complexity
φ: selection indicator sequence
Utility of shot sequence
Preserving syntax
• Minimum number of shots in a scene
• The particular ordering of the shots (cut)
• The specific duration of the shots, to direct viewer attention
• Changing the scale of the shots
The specific arrangement of shots so as to bring out their mutual relationship. [Sharff 82]
Film makers think in terms of phrases of shots and not individual shots.
The progressive phrase
“Two well chosen shots will create expectations of the development of narrative; the third well-chosen shot will resolve those expectations.” [Sharff 82]
Hence, a phrase (a group of shots) must have at least three shots.
Maximal shot removal: eliminate all the dark shots.
Structure (dialog)
“Depicting a conversation between m people requires 3m shots.” [Sharff 82]
Hence, a dialog between two people (m = 2) must have at least six shots.
Maximal adaptation: eliminate all the dark shots.
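The two syntax rules (at least three shots per phrase, at least 3m shots for a dialog between m people) bound how many shots can be dropped from a structure. A minimal sketch, where the structure labels "phrase" and "dialog" and the function names are hypothetical:

```python
def min_shots(kind: str, num_shots: int, speakers: int = 2) -> int:
    """Minimum number of shots to keep so the visual syntax survives.

    "phrase": at least 3 shots; "dialog": at least 3 * speakers shots.
    A structure that is already shorter than the minimum keeps all its shots.
    """
    if kind == "phrase":
        floor = 3
    elif kind == "dialog":
        floor = 3 * speakers
    else:
        raise ValueError(f"unknown structure kind: {kind!r}")
    return min(floor, num_shots)


def max_removable(kind: str, num_shots: int, speakers: int = 2) -> int:
    """Largest number of shots that can be removed from the structure."""
    return num_shots - min_shots(kind, num_shots, speakers)
```

For a ten-shot dialog between two people, at most four shots can be removed; a seven-shot progressive phrase can lose at most four as well.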
Mid-level audio content analysis
• Understand types and importance of audio by prosody analysis (pitch, pause, energy) [details in ACM MM2002]
[Figure labels: audio-scenes; silence removal; SVM classifier; silence, non-speech, clean speech, noisy speech; time-dependent Viterbi decoder (temporal consistency); significant phrases, beginning of topical segments]
Synchronous Entities
• Synchronous segments:
  - Include all significant speech phrases and both opening and ending syntactic segments
  - Audio and video boundaries are fully synchronized
  - Not condensed or de-synchronized
  - Such tied segments allow viewers to “catch up” when viewing skims
• Untied segments:
  - Audio-video can be dropped, condensed, reduced
  - Audio-video segments do not have to synchronize
[Figure labels: opening syntax; closing; significant phrase; dialog syntax. Example phrase: “Do you make up these questions, Mr. Holden?”]
The Constrained Search Problem
[Equation garbled in the scan. Recoverable form: the optimal audio durations t_a, video durations t_v, and n_c are the arg min of an objective function O_f(t_a, t_v, n_c, ...), subject to:]
• Duration constraints
• Target time constraints
• Multimedia tie constraints
Skim generation framework
• Entity analysis: shot detection / auditory analysis
• Video utility model and audio utility model → objective function
• Constraints:
  - Audio / video duration constraints
  - Target skim time
  - Tie constraints
  - Visual syntax constraints
• Iterative maximization → skim generation
[Figure labels: proportional; optimal]
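The "iterative maximization under constraints" idea can be illustrated for video only with a greedy loop: start every retained shot at its lower bound Lb(c) = 0.61c + 0.68, then repeatedly give a small time increment to the shot with the largest marginal utility until the target skim time or every upper bound Ub(c) = 2.40c + 1.11 is reached. The utility function below is a hypothetical concave stand-in, not the one used in the lecture, and the greedy loop is not claimed to be the authors' optimizer; audio, tie, and syntax constraints are omitted.

```python
import math
from typing import List


def upper_bound(c: float) -> float:
    return 2.40 * c + 1.11


def lower_bound(c: float) -> float:
    return 0.61 * c + 0.68


def shot_utility(t: float, c: float) -> float:
    """Hypothetical utility: increasing and concave in t, saturating near Ub(c)."""
    return 1.0 - math.exp(-t / upper_bound(c))


def allocate_durations(durations: List[float], complexities: List[float],
                       target: float, step: float = 0.1) -> List[float]:
    """Greedy allocation of skim time to shots within [Lb, Ub] per shot."""
    lo = [min(d, lower_bound(c)) for d, c in zip(durations, complexities)]
    hi = [min(d, upper_bound(c)) for d, c in zip(durations, complexities)]
    alloc = list(lo)
    if sum(alloc) > target:
        raise ValueError("target too short: remove shots before trimming")
    while True:
        remaining = target - sum(alloc)
        best_gain, best_i, best_inc = 0.0, None, 0.0
        for i, c in enumerate(complexities):
            inc = min(step, hi[i] - alloc[i], remaining)
            if inc <= 1e-9:
                continue  # shot is at its upper bound or the budget is spent
            gain = (shot_utility(alloc[i] + inc, c)
                    - shot_utility(alloc[i], c)) / inc
            if gain > best_gain:
                best_gain, best_i, best_inc = gain, i, inc
        if best_i is None:
            break
        alloc[best_i] += best_inc
    return alloc
```

With three 5-second shots of complexity 0.2, 0.5, and 0.8 and a 6-second target, the loop returns durations that sum to the target while each stays between its lower and upper bound; a "proportional" skim would instead scale every shot by the same factor regardless of complexity.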
Potential Projects
• New video shot visualization tools
• New skim generation technique
• Improved integrated a/v browsing systems