Video Rewrite Driving Visual Speech with Audio
![Page 1: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/1.jpg)
Video Rewrite: Driving Visual Speech with Audio
Christoph Bregler, Michele Covell, Malcolm Slaney
Presenter: Jack jeryes
3/3/2008
![Page 2: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/2.jpg)
What is video rewrite?
Use existing footage to create new video of a person mouthing words that he did not speak in the original footage.
![Page 3: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/3.jpg)
Example:
![Page 4: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/4.jpg)
Why video rewrite?
- Movie dubbing: to sync the actors’ lip motions to the new soundtrack
- Teleconferencing
- Special effects
![Page 5: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/5.jpg)
Approach
Learn from example footage how a person’s face changes during speech (dynamics and idiosyncrasies).
![Page 6: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/6.jpg)
Stages
Video Rewrite has two stages:
- Analysis stage
- Synthesis stage
![Page 7: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/7.jpg)
Analysis stage:
![Page 8: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/8.jpg)
Synthesis stage:
Segments new audio and uses it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face.
![Page 9: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/9.jpg)
Analysis for video modeling
The analysis stage creates an annotated database of example video clips (the video model), derived from unconstrained footage.
- Annotation Using Image Analysis
- Annotation Using Audio Analysis
![Page 10: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/10.jpg)
Annotation Using Image Analysis
As the face moves within the frame, we need to know, at all times:
- mouth position
- lip shapes
Using eigenpoints (good for low resolution)
![Page 11: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/11.jpg)
Eigenpoints:
A small set of hand-labeled facial images is used to train subspace models. Given a new image, the eigenpoint models tell us the positions of points on the lips and jaw.
![Page 12: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/12.jpg)
Eigenpoints (cont.)
54 eigenpoints for each image: 34 on the mouth and 20 on the chin and jaw line.
Only 26 images were hand-labeled: 26 / 14,218, about 0.2%.
The hand-annotated dataset was extended by morphing pairs of images to form intermediate images.
![Page 13: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/13.jpg)
Eigenpoints (cont.)
Eigenpoints does not allow a wide variety of motions. Thus, each face image is warped into a standard reference plane prior to eigenpoints labeling.
An affine transform is used to minimize the mean-squared error between a large portion of the face image and a facial template.
![Page 14: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/14.jpg)
Mask to estimate global warp
Each image is warped to account for changes in the head’s position, size, and rotation. The transform minimizes the difference between the transformed images and the face template. The mask (left) forces the minimization to consider only the upper face (right).
![Page 15: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/15.jpg)
Global mapping…
Once the best global mapping is found, it is inverted and applied to the image, putting that face into the standard coordinate frame. We then perform eigenpoints analysis on this pre-warped image to find the fiduciary points. Finally, we back-project the fiduciary points through the global warp to place them on the original face image.
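The invert, label, and back-project steps can be illustrated with a plain 2D affine transform. This is a minimal sketch, assuming the global warp is stored as a 2x3 matrix that maps standard-frame coordinates to original-image coordinates; the function names and the sample numbers are illustrative and do not come from the slides.

```python
def apply_affine(m, point):
    """Map an (x, y) point through a 2x3 affine matrix ((a, b, tx), (c, d, ty))."""
    (a, b, tx), (c, d, ty) = m
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def invert_affine(m):
    """Return the 2x3 matrix of the inverse affine transform."""
    (a, b, tx), (c, d, ty) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transform is not invertible")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # The inverse translation is -(A^-1 * t).
    return ((ia, ib, -(ia * tx + ib * ty)), (ic, id_, -(ic * tx + id_ * ty)))


# Hypothetical global warp: scale by 2, then shift by (10, 5).
warp = ((2.0, 0.0, 10.0), (0.0, 2.0, 5.0))

# The inverted warp puts an image location into the standard frame ...
to_standard = invert_affine(warp)
standard_point = apply_affine(to_standard, (30.0, 25.0))

# ... and a point labeled in the standard frame is back-projected
# through the warp onto the original image.
image_point = apply_affine(warp, standard_point)
```

In the real system the labeled points would come from eigenpoints analysis of the pre-warped image; here a single point is pushed through both directions to show that the round trip is consistent.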
![Page 16: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/16.jpg)
![Page 17: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/17.jpg)
Annotation Using Audio Analysis
All the speech is segmented into sequences of phonemes.
The /T/ in “beet” looks different from the /T/ in “boot.”
So we must consider coarticulation.
![Page 18: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/18.jpg)
Annotation Using Audio Analysis
Use triphones: collections of three sequential phonemes
“teapot” is split into: /SIL-T-IY/, /T-IY-P/, /IY-P-AA/, /P-AA-T/, and /AA-T-SIL/
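The split can be reproduced with a sliding window of three over the phoneme sequence, padded with silence at both ends. This is a minimal sketch; the function name `to_triphones` and the `silence` parameter are illustrative, and the phoneme sequence for “teapot” is taken from the triphones listed above.

```python
def to_triphones(phonemes, silence="SIL"):
    """Return overlapping triphones for a phoneme sequence padded with silence."""
    padded = [silence, *phonemes, silence]
    return ["-".join(padded[i:i + 3]) for i in range(len(padded) - 2)]


teapot = to_triphones(["T", "IY", "P", "AA", "T"])
print(teapot)
# ['SIL-T-IY', 'T-IY-P', 'IY-P-AA', 'P-AA-T', 'AA-T-SIL']
```

Each neighboring pair of triphones shares two phonemes, which is what makes the later overlap and cross-fade possible.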
![Page 19: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/19.jpg)
Annotation Using Audio Analysis
While synthesizing a video:
- Emphasize the middle of each triphone.
- Cross-fade the overlapping regions of neighboring triphones.
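One way to picture the cross-fade is a linear blend over the frames that two neighboring clips share. This is only a sketch under assumptions: the slides do not give the actual weighting, frames are stood in for by single numbers (for example one lip measurement per frame), and `cross_fade` and `overlap` are hypothetical names.

```python
def cross_fade(clip_a, clip_b, overlap):
    """Blend the last `overlap` frames of clip_a into the first `overlap` frames of clip_b."""
    if not 0 < overlap <= min(len(clip_a), len(clip_b)):
        raise ValueError("overlap must be positive and no longer than either clip")
    head = list(clip_a[:-overlap])   # frames that belong to clip_a only
    tail = list(clip_b[overlap:])    # frames that belong to clip_b only
    blended = []
    for i in range(overlap):
        # Weight of clip_b rises linearly across the overlap (assumed linear ramp).
        w = (i + 1) / (overlap + 1)
        a = clip_a[len(clip_a) - overlap + i]
        b = clip_b[i]
        blended.append((1 - w) * a + w * b)
    return head + blended + tail


faded = cross_fade([0.0, 0.0, 0.0, 0.0], [4.0, 4.0, 4.0, 4.0], overlap=3)
```

With this ramp each clip dominates near its own middle and gives way to its neighbor toward the shared edge, which is the intent of emphasizing the middle of each triphone.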
![Page 20: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/20.jpg)
Synthesis using a video model
Segments new audio and uses it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face.
![Page 21: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/21.jpg)
Synthesis using a video model
- The background, head tilts, and eye blinks are taken from the source footage in the same order as they were shot.
- The triphone images include the mouth, chin, and part of the cheeks.
- Illumination-matching techniques are used to avoid visible seams.
![Page 22: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/22.jpg)
Selection of Triphone Videos
Choosing a sequence of clips that approximates the desired transitions and shape continuity.
![Page 23: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/23.jpg)
Selection of Triphone Videos
Given a triphone in the new speech utterance, we compute a matching distance to each triphone in the video database:
$$\text{error} = \alpha D_p + (1 - \alpha) D_s$$
Dp = phoneme-context distance
Ds = lip-shape distance
α = constant that trades off the two distances
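As a rough illustration, the total error can be treated as a blend of the two distances, error = α·Dp + (1 − α)·Ds, and used to rank candidate clips. This is a simplified sketch: the candidate names and distance values are made up, α = 0.5 is an arbitrary choice, and both distances are taken as precomputed per candidate, whereas Ds really depends on which neighboring clips are chosen.

```python
def matching_error(d_p, d_s, alpha=0.5):
    """Blend phoneme-context distance d_p and lip-shape distance d_s."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * d_p + (1.0 - alpha) * d_s


def best_candidate(candidates, alpha=0.5):
    """Pick the (name, d_p, d_s) candidate with the smallest blended error."""
    return min(candidates, key=lambda c: matching_error(c[1], c[2], alpha))


# Hypothetical database entries: (clip id, Dp, Ds).
candidates = [
    ("clip_017", 0.0, 0.75),   # exact phoneme context, poor lip-shape match
    ("clip_102", 0.25, 0.25),  # close context, good lip-shape match
    ("clip_230", 1.0, 0.0),    # wrong context, perfect lip-shape match
]
chosen = best_candidate(candidates, alpha=0.5)
```

Raising `alpha` favors clips whose phoneme context matches; lowering it favors clips whose lip shapes line up with their neighbors.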
![Page 24: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/24.jpg)
Dp = phoneme-context distance
Dp is based on categorical distances between phoneme categories and between viseme classes
Dp = weighted sum (viseme-distance, phonemic-distance)
![Page 25: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/25.jpg)
26 viseme classes:
1- /CH/ /JH/ /SH/ /ZH/
2- /K/ /G/ /N/ /L/ /T/ /D/
3- /P/ /B/ /M/
..
![Page 26: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/26.jpg)
Dp = phoneme-context distance
- Phonemic-distance ( /P/ , /P/ ) = 0: same phonemic category
- Viseme-distance ( /P/ , /IY/ ) = 1: different viseme classes
- Dp ( /P/ , /B/ ) = between 0 and 1: same viseme class, different phonemic category
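The three cases above fall out of a weighted sum of two 0/1 distances. The following is a hedged sketch, not the authors' implementation: only the three viseme classes shown in the slides are included, any phoneme outside them is treated as belonging to its own class, and the weight `viseme_weight=0.5` is an arbitrary assumption.

```python
# Only the three classes listed in the slides; the full table has 26.
VISEME_CLASSES = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L", "T", "D"},
    {"P", "B", "M"},
]


def viseme_class(phoneme):
    """Return the index of the phoneme's viseme class, or None if it is not listed."""
    for index, members in enumerate(VISEME_CLASSES):
        if phoneme in members:
            return index
    return None


def phonemic_distance(a, b):
    return 0.0 if a == b else 1.0


def viseme_distance(a, b):
    if a == b:
        return 0.0
    class_a, class_b = viseme_class(a), viseme_class(b)
    # Assumption: unlisted phonemes never share a class with a different phoneme.
    same_class = class_a is not None and class_a == class_b
    return 0.0 if same_class else 1.0


def phoneme_context_distance(a, b, viseme_weight=0.5):
    """Weighted sum of viseme distance and phonemic distance for one phoneme pair."""
    return (viseme_weight * viseme_distance(a, b)
            + (1.0 - viseme_weight) * phonemic_distance(a, b))
```

With these definitions, `phoneme_context_distance("P", "P")` is 0, `phoneme_context_distance("P", "IY")` is 1, and `phoneme_context_distance("P", "B")` lands strictly between the two.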
![Page 27: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/27.jpg)
Ds = lip-shape distance
Ds measures how closely the mouth contours match in overlapping segments of adjacent triphone videos.
In “teapot”: the contours for /IY/ and /P/ in /T-IY-P/ should match the contours for /IY/ and /P/ in /IY-P-AA/.
![Page 28: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/28.jpg)
Ds = lip-shape distance
Euclidean distance, frame by frame, between 4-element feature vectors
(overall lip width, overall lip height, inner lip height, height of visible teeth)
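A frame-by-frame Euclidean distance over these four features might look like the sketch below. The slides do not say how the per-frame distances are combined, so averaging them is an assumption, as are the function name and the sample measurements.

```python
import math


def lip_shape_distance(frames_a, frames_b):
    """Mean Euclidean distance between paired 4-element lip feature vectors.

    Each frame is (lip width, lip height, inner lip height, visible teeth height).
    """
    if len(frames_a) != len(frames_b) or not frames_a:
        raise ValueError("need two non-empty sequences of equal length")
    per_frame = [math.dist(a, b) for a, b in zip(frames_a, frames_b)]
    return sum(per_frame) / len(per_frame)  # assumed aggregation: mean over frames


# Hypothetical overlapping frames from two adjacent triphone clips.
clip_1 = [(3.0, 4.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
clip_2 = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
d_s = lip_shape_distance(clip_1, clip_2)
```

`math.dist` requires Python 3.8 or later; on older versions the same value can be computed with `math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))`.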
![Page 29: Video Rewrite Driving Visual Speech with Audio](https://reader036.fdocuments.net/reader036/viewer/2022062316/56816834550346895ddde98a/html5/thumbnails/29.jpg)
Stitching all Together
The remaining task is to stitch the triphone videos into the background sequence