
Very Low Bitrate Video Communication

A Principal Component Analysis Approach

Ulrik Söderström

Department of Applied Physics and Electronics, Umeå University

Umeå 2008


Digital Media Lab, Department of Applied Physics and Electronics, Umeå University, SE-901 87 Umeå, Sweden

Thesis for the degree of Doctor of Technology in Applied Electronics at Umeå University

ISSN 1652-6295:11
ISBN 978-91-7264-644-5

Author's email: [email protected]
Typeset by Ulrik Söderström in LaTeX 2ε
Printed by Arkitektkopia, Umeå University, Umeå, September 2008


To Leif.

You are the best.


Acknowledgement

I want to thank Professor Haibo Li for all the advice and guidance he has provided me with over the years. Without his help and support I would not have been able to pursue my research.

Thanks to everyone who has been associated with the Digital Media Lab, past and present, and who has helped me in one way or another. Special gratitude goes to Zhengrong Yao, who helped me several times in the beginning of my research career.

Thanks to Sara, Ville, Peter, Shafiq and Nazanin. Our breaks and lunches together have made my life feel better on many occasions.

A big thank you to all the staff at TFE. An extra thank you to Annemaj, Mona-Lisa and Valentina, who always have time and a helping piece of advice.

Thanks to Leif and Sven, who built a piece of equipment for me for my Master's project in the spring of 2001. "The helmet" was only supposed to be used for my Master's thesis, but it has followed me throughout my research career.

Thanks to everyone I have collaborated with during my time as a researcher, especially Greger Wikstrand and Anders Broberg.

Thanks to Tomas Andersson Wij. Now I know that I am worth so much more.

Thanks to my friends, who are there for me in both good times and bad.

Thanks to my family, who have supported and helped me through many things in life. I will never forget that.

Thank you for everything, Elin and Smilla.


Abstract

A large amount of the information in conversations comes from non-verbal cues such as facial expressions and body gestures. These cues are lost when we do not communicate face-to-face. But face-to-face communication does not have to happen in person. With video communication we can at least deliver information about the facial mimic and some gestures. This thesis is about video communication over distances; communication that can be made available over networks with low capacity, since the bitrate needed for the video is low.

A visual image needs high quality and resolution to be semantically meaningful for communication. Delivering such video over networks requires that the video is compressed. The standard way to compress video images, used by H.264 and MPEG-4, is to divide each image into blocks and represent each block with mathematical waveforms, usually frequency features. These waveforms are quite good at representing any kind of video precisely because they do not resemble anything in particular; they are just frequency features. But since they are completely generic they cannot compress video enough to enable use over networks with limited capacity, such as GSM and GPRS.

Another issue is that such codecs have high complexity, because redundancy is removed through positional shifts of the blocks. High complexity and bitrate mean that a device has to consume a large amount of energy for encoding, decoding and transmission of such video, and energy is a very important factor for battery-driven devices.

These drawbacks of standard video coding mean that it is not possible to deliver video anywhere and anytime when it is compressed with such codecs. To resolve these issues we have developed a completely new type of video coding. Instead of using mathematical waveforms for representation we use faces to represent faces. This makes the compression much more efficient than if waveforms are used, even though the face models are person-dependent.

By building a model of the changes in the face, the facial mimic, this model can be used to encode the images. The model consists of representative facial images, and we use a powerful mathematical tool to extract it: principal component analysis (PCA). This coding has very low complexity since encoding and decoding consist only of multiplication operations. The faces are treated as single encoding entities and all operations are performed on full images; no block processing is needed. These features mean that PCA coding can deliver high quality video at very low bitrates with low complexity for encoding and decoding.

With the use of asymmetrical PCA (aPCA) it is possible to use only semantically important areas for encoding, while decoding full frames or a different part of the frames.

We show that a codec based on PCA can compress facial video to a bitrate below 5 kbps and still provide high quality; such a bitrate can be delivered over a GSM network. We also show the possibility of extending PCA coding to the encoding of high definition video.


List of publications

Peer-reviewed papers

Ulrik Söderström and Haibo Li, Very low bitrate full-frame facial video coding based on principal component analysis, Signal and Image Processing Conference (SIP'05), Honolulu, August 2005

Ulrik Söderström and Haibo Li, Full-frame video coding for facial video sequences based on principal component analysis, Proc. of Irish Machine Vision and Image Processing Conference 2005 (IMVIP 2005), pp. 25-32, Belfast, August 2005, Paper I

Ulrik Söderström and Haibo Li, Representation bound for human facial mimic with the aid of principal component analysis, under review, submitted December 2007, Paper II

Ulrik Söderström and Haibo Li, Eigenspace compression for very low bitrate transmission of facial video, IASTED International Conference on Signal Processing, Pattern Recognition and Applications (SPPRA'07), Innsbruck, February 2007, Paper III

Hung-Son Le, Ulrik Söderström and Haibo Li, Ultra low bit-rate video communication, video coding = facial recognition, Proc. of 25th Picture Coding Symposium, Beijing, April 2006, Paper IV

Ulrik Söderström and Haibo Li, Asymmetrical principal component analysis for video coding, Electronics Letters, Volume 44 (4), February 2008, pp. 276-277

Ulrik Söderström and Haibo Li, Asymmetrical Principal Component Analysis Theory and Its Applications to Facial Video Coding, submitted, August 2008, Paper V


Ulrik Söderström and Haibo Li, Side view driven facial video coding, submitted, September 2008, Paper VI

Ulrik Söderström and Haibo Li, High definition wearable video communication, submitted, September 2008, Paper VII

Other papers

Ulrik Söderström and Haibo Li, Emotion recognition and estimation from tracked lip features, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'04), pp. 182-185, Uppsala, March 2004

Ulrik Söderström and Haibo Li, Emotion recognition and estimation from tracked lip features, Technical report DML-TR-2004:05

Ulrik Söderström and Haibo Li, Customizing lip video into animation for wireless emotional communication, Technical report DML-TR-2004:06

Greger Wikstrand and Ulrik Söderström, Internet card play with video conferencing, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'06), pp. 93-96, Umeå, March 2006

Ulrik Söderström, Very low bitrate facial video coding based on principal component analysis, Licentiate thesis, September 2006

Ulrik Söderström and Haibo Li, Principal component video coding for simple decoding on mobile devices, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'07), pp. 149-152, Linköping, March 2007

Ulrik Söderström and Haibo Li, Asymmetrical principal component analysis for encoding and decoding of video sequences, Proc. of Swedish Symposium for Automated Image Analysis (SSBA'08), Lund, March 2008


Contents

Acknowledgement
Abstract
Publications
List of Figures
Abbreviations and mathematical notations

1 Introduction and motivation
  1.1 Motivation
  1.2 Vision
  1.3 Research goals
  1.4 Research strategy
  1.5 Research process

2 Related work and technical solutions used
  2.1 Related Work
    2.1.1 Video compression
      2.1.1.1 Discrete Cosine Transform
      2.1.1.2 Block matching
      2.1.1.3 Chroma subsampling
    2.1.2 Facial representation with standard techniques
    2.1.3 Other implementations of PCA for video coding
    2.1.4 Scalable video coding
  2.2 Technical solutions
    2.2.1 Hands-free video equipment
      2.2.1.1 Modeling efficiency with the helmet
    2.2.2 Basic emotions
    2.2.3 Video sequences
    2.2.4 Locally linear embedding
    2.2.5 High definition (HD) video
    2.2.6 The face space and personal mimic space
    2.2.7 Quality evaluation

3 Principal component analysis video coding
  3.1 Principal component analysis
    3.1.1 Singular Value Decomposition
  3.2 Principal component analysis video coding
    3.2.1 Comparison with H.264
  3.3 Encoding and decoding time
  3.4 Theoretical bounds
    3.4.1 Distortion bound
    3.4.2 Rate-Distortion bound
    3.4.3 Comparison of the distortion bounds
  3.5 Eigenimage compression
    3.5.1 Quantization - uniform or pdf-optimized?
    3.5.2 Compression of the mean image
    3.5.3 Loss of orthogonality
    3.5.4 Compression methods
  3.6 Eigenspace re-use
    3.6.1 Sensitivity to lighting and positional shift

4 Ultra low bitrate video coding
  4.1 LLE smoothing

5 Asymmetrical Principal Component Analysis video coding
  5.1 Asymmetrical principal component analysis video coding
    5.1.1 Reduction of complexity for the encoder
    5.1.2 Reduction of complexity for the decoder
    5.1.3 Variance of areas and reconstruction change for asymmetrical principal component analysis video coding
    5.1.4 Experiments with asymmetrical principal component analysis video coding
      5.1.4.1 Case 1: Encoding with mouth, decoding with entire frame
      5.1.4.2 Case 2: Encoding with mouth and eyes, decoding with entire frame
      5.1.4.3 Case 3: Encoding with extracted features, decoding with entire frame
      5.1.4.4 Case 4: Find all edges for encoding, decoding with the entire frame
    5.1.5 Side view driven video coding
    5.1.6 Profile driven video coding

6 High definition wearable video communication
  6.1 High definition wearable video equipment
  6.2 Wearable video communication

7 Contributions, conclusions and future work
  7.1 Contributions and conclusions
  7.2 Future work

Bibliography


List of Figures

2.1 DCT coefficients.
2.2 Encoding and decoding with DCT.
2.3 A motion field generated from matching of frames A and B.
2.4 YUV 4:1:1 subsampling.
2.5 Assembled YUV frame.
2.6 Hands-free video equipment.
2.7 Example frames from video recorded without hands-free equipment.
2.8 Difference between using the hands-free helmet and using a fixed camera. (– Hands-free helmet, - - Fixed camera)
2.9 The six basic emotions.
2.10 Example frames from three video sequences.
2.11 Example frame from a HD video sequence.
2.12 The tilted mirror.
2.13 Frame from a video sequence shot in front of the mirror.
2.14 The first three dimensions of the personal mimic space for two facial mimics.

3.1 Comparison between a codec based on PCA and a H.264 codec at very low bitrates. (– PCA coding, - - H.264)
3.2 Example frames from H.264 encoding at 2.4 kbps.
3.3 Reverse water-filling for independent Gaussian principal components. Only the components whose variance is larger than γ are allocated bits in the quantization process.
3.4 Mean rate-distortion bound for all 10 video sequences of facial mimic.
3.5 Comparison of the different coding schemes. (– Compressed Eigenspace at both encoder and decoder, - - Compressed Eigenspace at decoder & original mean image, -· Compressed Eigenspace at decoder & compressed mean image, ·· LS & GS result, – Loss of orthogonality)

4.1 Ultra low bitrate video coding scheme.
4.2 Personal mimic face images mapped into the embedding space described by the first two coordinates of LLE.
4.3 The reconstructed frames shown in a two-dimensional LLE space.
4.4 The reconstructed frames shown in a two-dimensional LLE space with smoothing.

5.1 Top left: P=0, M=15. Top right: P=10, M=15. Bottom: P=15, M=15.
5.2 Variance image of the individual pixels. White = zero variance; color indicates variance in a heat map (yellow = high variance).
5.3 Individual pixel PSNR. Red = improved PSNR; blue = reduced PSNR.
5.4 The entire video frame and the foreground I^f used in Case 1.
5.5 Foreground with the eyes and the mouth.
5.6 Area decided with edge detection and dilation.
5.7 Area decided with edge detection on all frames.
5.8 Example frames of profiles relating to the side view.
5.9 The mean profile.
5.10 The first three Eigenprofiles φ^pr_j (scaled for visualization).

6.1 Frame reconstructed with aPCA (25 φ^f_j and 5 φ^pbg_j are used).
6.2 Frame encoded with H.264.


Abbreviations and mathematical notations

Abbreviations

PCA   Principal component analysis
aPCA  Asymmetrical principal component analysis
PSNR  Peak signal-to-noise ratio
MSE   Mean square error
SVD   Singular value decomposition
HD    High definition
LLE   Locally linear embedding
DCT   Discrete cosine transform
DP    Dynamic programming


Mathematical notations

Φ      Eigenspace
φ      Eigenvector, principal component
I      Original data as vectors
C      Original data as images
I_0    Mean of data
G      A different video sequence than I
I^f    Foreground image
I^s    Side view
I^fr   Frontal view
X^pr   Side view profile
{α}    Projection coefficients
N      Total number of principal components in a sequence
M      The number of principal components used for decoding
P      The number of principal components used for background reconstruction
K      Total number of pixels in a frame (YUV subsampled)
λ_j    Eigenvalues
b_ij   Eigenvectors
Φ^p    Pseudo Eigenspace
Φ^pbg  Pseudo Eigenspace for the background
h      Number of pixels in horizontal direction
v      Number of pixels in vertical direction
D_i    Displacement in block-based motion estimation


1 Introduction and motivation

1.1 Motivation

The motivation for this work dates back to the beginning of 2003, when I started my employment as a Ph.D. student. Visual communication can provide a great improvement in communication, but it is not available in many cases (in 2003 it was rarely available). Most distance communication consists of audio only, and we wanted to extend audio communication to audio-visual communication, since a large part of the information in communication comes from non-verbal cues. Video requires a minimum quality to improve communication; with poor video quality, or when it is used incorrectly, it will actually degrade the communication quality. So the motivation was to provide high quality video communication for everyday distance communication. This communication often takes place over low capacity networks, meaning that the video needs to be compressed to very low bitrates to be usable on these networks. The video compression standards that are available today (and were available in 2003) cannot provide video with sufficient quality at these bitrates. The need for a new compression scheme is obvious.

For any kind of data it is preferable to use as low a bitrate as possible for transmission. For communication data there are especially high demands on transmission time; communication requires online functionality. A video codec that uses low bitrates is much less sensitive to delay and other errors than video that needs a higher bitrate. The cost of transmission is also lowered when fewer bits are needed. This cost might be the price the user has to pay for a service, but it might also be the transmission cost measured in bits. Both of these can be lowered if less data needs to be transmitted.

So the original motivation for this work is to create a video codec that can function at low bitrates, at low cost and with low sensitivity to errors. Over the course of the work we also wanted to use high definition (HD) resolution for the video; always maintaining an extremely low bitrate for this kind of video is beyond the scope of this thesis. It is possible to encode HD video at very low bitrates, but this will be examined in the future.


1.2 Vision

My vision for the future is that high quality video can be provided anywhere and anytime, regardless of the network and environment.

1.3 Research goals

This work addresses a very fundamental problem: video compression, and more specifically, video compression at very low bitrates. The techniques that are available are not sufficiently effective at encoding high quality video at very low bitrates, and there is a need for a new technique. We propose principal component analysis as this technique. We have specified more concrete goals for low bitrate video coding.

• The bitrate for the video transmission should be lower than 5 kbps.

• The objective quality of the reconstructed video should be approximately 35 dB (measured in PSNR).

• The resolution of the video should be approximately 320×240 pixels (close to CIF resolution).

• The video codec should be able to encode and decode video in real-time; a requirement for online usage.

• The video codec should be able to compress facial video sequences, not arbitrary video sequences.

These are the goals we have for most of our work, but since we also want to encode HD video the resolution goal is not valid for all research projects; the bitrate and objective quality goals are also altered for HD video.

1.4 Research strategy

The goals we have set for video compression are quite challenging to reach. To succeed we need a good strategy, and we have used the following one.

1. Instead of using waveforms to represent images, we use facial images for representation.

2. We model the changes within a face: the facial mimic.

3. We make use of hands-free video equipment to remove the global motion in the video.


4. Our first implementation was made to verify that the idea actually works. Before continuing further we wanted to see the theoretical potential of the coding.

5. We wanted to focus on the main problem: facial mimic modelling and facial encoding with this model. We do not use advanced image analysis techniques, such as warping for facial alignment, on the videos. We use the hands-free video equipment to circumvent several issues and concentrate on the main one.

6. We decided to build a model of the mimic that contains all the possible facial changes directly. It is also possible to evaluate the performance of the model continuously and update it.

7. We are not interested in creating an efficient system for the coding, but rather in evaluating the idea. We have therefore implemented most of the research in Matlab, an environment which is good for matrix (image) operations but quite slow.

1.5 Research process

All research presented in this thesis has evolved through discussions with my supervisor, Prof. Haibo Li. Our first implementations dealt with facial animations, since we thought that we could not achieve real video transmission at such low bitrates [1,2]. After we found that natural video is superior to animations, we tried to create low bitrate video instead of low bitrate facial representation through animations. Haibo Li had tested low bitrate representation of facial video as a student, back in 1993 [3]. We continued to develop his idea since it built on real video. We found that the American psychologist Paul Ekman had shown that facial expressions can be modelled in a very efficient way [4,5]. We wanted to incorporate this modelling in a coding scheme and chose to make use of principal component analysis [6], since it is the most efficient modelling technique available.

We developed a coding scheme based on principal components. We extract a model of the facial mimic through principal component analysis and then use this model for encoding and decoding of video (Paper I).

When we evaluated the implementation of principal component video coding we found that we had reached our goals for encoding and decoding. The coding scheme can encode facial video with a resolution of 240×176 pixels in real-time, and the video can also be reconstructed in real-time with a quality higher than 35 dB. When we implemented the coding scheme we made use of uniform quantization. We wanted to know exactly how low a bitrate could be achieved, and what quality could be reached, when an error-free representation is used. So we examined boundaries for coding through optimal bit allotment, and boundaries for an unlimited representation (Paper II).

Together with Greger Wikstrand I examined the use of video for conversational multimedia [7]. The results were disappointing at first, but they implied that video has such high impact that it must be used wisely; otherwise video might actually degrade the user experience. Low bitrate video coding has a major role to play here, since low bitrates can be used to improve the quality of visual communication.

A problem that needed to be solved is the practical usage of the principal components: they are needed by both the encoder and the decoder to enable video coding. They can be transmitted between the encoder and decoder, but this requires a high bitrate since the components are images. We have examined how to update the principal components and how to transmit new components (Paper III). Together with Hung-Son Le we investigated how images can be re-used through face recognition. In this work we show how locally linear embedding (LLE) can be used to give decoded video a more natural appearance (Paper IV).

To avoid transmission of new or updated Eigenspaces we examined how existing Eigenspaces can be reused. There are two difficulties in achieving this:

1. The facial features are positioned at different pixel positions in new video frames compared to the Eigenspace.

2. The facial features have different pixel intensities due to different lighting and shading of the face in the videos.

The lighting problem is not as severe as the positional problem; it can be handled by normalizing the light. The positional problem can be solved either by aligning the new video frame so that it matches the existing Eigenspace, or by aligning the Eigenspace so that it matches the new video frame. We have evaluated alignment of video frames to match existing Eigenspaces through affine transformation [8]; affine transformation alone is not enough to overcome the positional differences between videos.
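The two normalization steps can be illustrated with a short sketch, in Python/NumPy rather than the Matlab used for the thesis work; the function names are our own, and the affine parameters A and t are assumed to come from elsewhere, e.g., tracked facial features:

```python
import numpy as np
from scipy.ndimage import affine_transform

def normalize_light(frame, reference):
    """Adjust global intensity so that the frame matches the reference
    frame in mean and standard deviation (a simple lighting fix)."""
    f = frame.astype(np.float64)
    r = reference.astype(np.float64)
    return (f - f.mean()) / f.std() * r.std() + r.mean()

def align_to_eigenspace(frame, A, t):
    """Warp the frame with a 2x2 affine matrix A and translation t so that
    its facial features land on the pixel positions of the Eigenspace."""
    return affine_transform(frame.astype(np.float64), A, offset=t)
```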

We used PCA for encoding of arbitrary video [9]. The idea was to make use of the simple encoding and decoding associated with PCA. Simple decoding can save energy on a battery-powered device, and a simple coding scheme can be used without device-specific implementations. The very low bitrates are lost with this implementation because new Eigenimages must be transmitted for each clip; almost the entire bitrate consists of Eigenimage transmission. Even though the idea was to create high-quality video, the results show that this kind of video can be used as a low-quality preview. A reconstruction with few Eigenimages does not yield a high-quality version of a video, but the motion in the video is apparent. The content of the video is visible to a user at a very low bitrate cost.

We wanted to improve the modelling of the facial mimic and at the same time reduce the complexity of encoding and decoding. To do this we introduced a coding scheme that can use a part of the frame for encoding and decode the entire frame or a different part of the frame. This scheme is called asymmetrical principal component analysis (aPCA) and it is described in Paper V. aPCA allows faster updating of Eigenspaces since it reduces the amount of data that needs to be updated. We have also shown how it is possible to use the side view of a face for encoding and decode the frontal view of this face. This is a totally new way of treating video coding, since the part that is decoded is not even used for encoding (Paper VI). In this work we also use only the profile of the side view for encoding, but still decode the entire frontal view.

In the end, all these contributions result in a video compression scheme that works in real-time for facial video sequences and provides high quality at extremely low bitrates. We have also extended the coding scheme to high definition video material. It is still facial video but the resolution is much higher, e.g., 1080i (1080 interlaced lines). The coding scheme then works at a bitrate which is low for HD content, but very high compared to our other work (Paper VII).


2 Related work and technical solutions used

2.1 Related Work

This work is about video compression that can be decoded in layers, so related work spans video coding and scalable video coding. General video coding with standard methods is described in section 2.1.1 and video compression based on PCA is described in section 3.2. Scalable video coding is discussed in section 2.1.4.

Harashima et al. [10] provide a division of video coding into generations; this division is shown in Table 2.1. Coding based on the discrete cosine transform (DCT) is regarded as 1st generation video coding, where the main idea behind the compression is to remove redundancy. H.264 uses several other techniques but still compresses the video by removing redundancy, both in the information itself and in information which is redundant for human observers. 4th generation video compression techniques treat parts of the video as individual objects, e.g., background, house, car or face. Compression of each individual object can then be specialized with techniques that work well for the different object types. Torres and Delp [11] explain how such coding can function, and they also explain how PCA can be used for compression of facial images in video sequences. Encoding of faces with PCA can be regarded as 4th generation video coding.

Coding generation   Approach                         Technique
0th generation      Direct waveform coding           PCM
1st generation      Redundancy removal               DPCM, DCT, DWT, VQ
2nd generation      Coding by structure              Image segmentation
3rd generation      Analysis and synthesis           Model-based coding
4th generation      Recognition and reconstruction   Knowledge-based coding
5th generation      Intelligent coding               Semantic coding

Table 2.1: Classification of image and video coding.


2.1.1 Video compression

The purpose of video compression is to reduce the quantity of information needed to represent a sequence of images. Video compression can be lossless or lossy, where lossy compression of course provides a larger data reduction. Lossy video compression achieves good visual results by working on the premise that much of the information available before compression is redundant for a human observer and can be removed without loss in visual appearance; lossy compression can be perceived just as well by a human observer as lossless compression. Few video codecs are lossless. There is no point in retaining all information, since some information has no semantic meaning; it does not improve the perceptual quality.

Most standard video codecs rely on transform coding and motion estimation to encode video. The reigning transform technique is the discrete cosine transform (DCT) (section 2.1.1.1). Motion estimation between frames adjacent or close in time is performed through block matching (section 2.1.1.2).

Images in a video sequence are encoded differently; there are intracoded frames and intercoded frames. An intracoded frame depends only on itself; it is coded based only on its own content and is compressed as the image it is. An intercoded frame depends on a previous, and perhaps subsequent, intracoded frame; it encodes the differences from the previous frame. Since frames which are adjacent in time usually share large similarities in appearance, it is very efficient to store only one frame and the differences between this frame and the others. Only the first frame in a sequence is encoded purely with DCT; for the following frames only the changes between the current and first frame are encoded. The number of frames between intracoded frames is called the group of pictures (GOP). A large GOP size means fewer intracoded frames and a lower bitrate.

Standard video codecs include the MPEG standards and the H.26x codecs, where MPEG-4/AVC is the same as H.264 [12,13,14,15,16]. The main parts of these standards, DCT, motion estimation and chroma subsampling, are described in the following sections. There are two major differences between the way we use PCA and traditional video coding with the discrete cosine transform (DCT); a sketch of full-frame PCA coding follows the list:

1. PCA is used for encoding of full frames while DCT employs block-based processing. (DCT can be used for full-frame encoding and PCA can be used for block-based encoding as well.)

2. PCA is signal-dependent while DCT is independent of the input signal. This means that PCA requires different bases depending on the data it should model, while DCT models every kind of data with the same bases.
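A minimal sketch of full-frame PCA coding, in Python/NumPy rather than the Matlab used for the thesis work (the function names and the SVD route are our own illustration): training frames are flattened to vectors, a mean image I_0 and M Eigenimages are extracted, and every frame is then encoded as M projection coefficients {α}.

```python
import numpy as np

def train_eigenspace(frames, M):
    """Extract the facial mimic model: the mean image I0 and the first M
    principal components (Eigenimages) of the flattened training frames."""
    X = frames.reshape(len(frames), -1).astype(np.float64)  # one row per frame
    I0 = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal components
    _, _, Vt = np.linalg.svd(X - I0, full_matrices=False)
    return I0, Vt[:M]                        # Phi: an M x K Eigenspace

def encode(frame, I0, Phi):
    """Encoding is a projection onto the Eigenspace: only multiplications
    and additions, and only M coefficients per frame are transmitted."""
    return Phi @ (frame.ravel() - I0)        # the coefficients {alpha}

def decode(alpha, I0, Phi, shape):
    """Decoding is the matching linear combination of Eigenimages."""
    return (I0 + alpha @ Phi).reshape(shape)
```

Once encoder and decoder share the Eigenspace, only the handful of coefficients per frame crosses the network; this is the property the thesis exploits.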

2.1.1.1 Discrete Cosine Transform

The discrete cosine transform (DCT) describes a sequence of data as a sum of cosine functions with different frequencies. Cosine functions are much more efficient for signal approximation than sine functions, and they also have desirable boundary conditions when it comes to differential equations. The DCT is similar to the Fourier transform, but uses only real numbers and cosine functions; the DFT involves complex numbers and sine functions as well as cosine functions.

The DCT expresses the signal in terms of sinusoids with different frequencies and amplitudes. With a DCT it is possible to evaluate the signal even at places where there was no input to begin with; the DCT implies an extension of the signal.

Two-dimensional DCTs, which are used for image compression, are separable products of two one-dimensional transforms, one in the horizontal direction and one in the vertical direction, but the transform can also be performed in a single step. The coefficients for the two-dimensional DCT are shown in Figure 2.1. Each step from left to right and top to bottom is an increase in frequency by 1/2 cycle.

Figure 2.1: DCT coefficients.

The DCT works on blocks; the input image is divided into 8×8 blocks. Each block is then transformed into a linear combination of the 64 frequency squares; the DCT, just like PCA, is a linear transform. This transform can be inverted without error. Compression through DCT comes from quantization of the frequencies, with different quantization steps for each frequency square. In this way it is possible to weight the importance of the different frequencies; low frequencies are always more important than high frequencies for image content. The encoding and decoding steps of DCT are shown in Figure 2.2.
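As an illustration, here is a hedged NumPy sketch of the 8×8 block transform with uniform quantization (the DCT matrix is built from the orthonormal DCT-II definition; the single quantization step stands in for the per-frequency step sizes a real codec would use):

```python
import numpy as np

N = 8
n = np.arange(N)
# Orthonormal DCT-II matrix: entry (k, m) is cos(pi*(2m+1)*k / (2N)), scaled
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
D[0, :] = np.sqrt(1.0 / N)

def dct2(block):                   # 2-D DCT as two separable 1-D transforms
    return D @ block @ D.T

def idct2(coeffs):                 # D is orthonormal, so transposing inverts it
    return D.T @ coeffs @ D

def quantize(coeffs, step=16.0):   # compression: coarse rounding of frequencies
    return np.round(coeffs / step) * step

block = np.random.randint(0, 256, (N, N)).astype(np.float64)
reconstructed = idct2(quantize(dct2(block)))   # lossy round trip
```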

Figure 2.2: Encoding and decoding with DCT.

The coefficients in the DCT are completely generic; the DCT can be used to encode any kind of signal. The transform coefficients are therefore not dependent on the signal, but the energy compactness is degraded compared to a signal-dependent transform like PCA.

2.1.1.2 Block matching

Figure 2.3: A motion field generated from matching of frames A and B.

Block matching is a technique for motion estimation. The motion between two frames is calculated and motion vectors that describe the motion are stored. This generates a motion field for the displacements between the frames; how a motion field relates two frames to each other is visualized in Figure 2.3. The term block matching comes from the fact that motion vectors are calculated for blocks, usually 8×8 pixels large. Block matching is performed based on pixel intensities; the matching score between two blocks is the sum of pixel differences between the blocks, usually the squared sum:

\[
\mathrm{cost}_{AB} = \sum_{i=1}^{S}\sum_{j=1}^{S}\left(\mathrm{block}^{A}_{ij} - \mathrm{block}^{B}_{ij}\right)^{2}
\tag{2.1}
\]

where S is the size of the blocks A and B.

It is possible to represent frame B with the pixel intensities from frame A and the motion vectors between them. The changes in each block are also stored to enable inter-block changes; without these changes a block could only change position between frames, not appearance. Block matching provides high compression since adjacent frames often share large similarities; when there is a large change between frames, block matching is not useful.
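A hedged sketch of an exhaustive block-matching search using the squared-difference cost of equation (2.1); the block size and search range are illustrative choices, not values from the thesis:

```python
import numpy as np

def best_match(frame_a, frame_b, top, left, S=8, search=7):
    """Find the displacement D_i of the SxS block at (top, left) in frame B
    that best matches frame A, by exhaustive search within +/- `search`
    pixels; the cost is the squared sum of pixel differences (eq. 2.1)."""
    block_b = frame_b[top:top + S, left:left + S].astype(np.float64)
    best_cost, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + S > frame_a.shape[0] or x + S > frame_a.shape[1]:
                continue                     # candidate falls outside frame A
            block_a = frame_a[y:y + S, x:x + S].astype(np.float64)
            cost = np.sum((block_a - block_b) ** 2)
            if cost < best_cost:
                best_cost, best_d = cost, (dy, dx)
    return best_d, best_cost                 # motion vector and residual cost
```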

2.1.1.3 Chroma subsampling

A step which is used in virtually every video compression scheme is chroma subsampling. The original video frames are represented in RGB color space, which can be converted without loss into YUV color space. The conversion is performed with the following equations:

\[
\begin{aligned}
Y &= 0.299R + 0.587G + 0.114B \\
U &= -0.147R - 0.289G + 0.436B \\
V &= 0.615R - 0.515G - 0.100B
\end{aligned}
\tag{2.2}
\]

Conversion between the YUV and RGB color spaces is possible in both directions without loss of information. The YUV color space has a very important feature: it is based on the perception of the human eye. It has three components, where Y represents luminance while U and V represent chrominance. The human eye is much less sensitive to changes in color (chrominance) than in grayscale (luminance) [17], so U and V can be compressed without any visible effect for a human observer. This procedure is called subsampling, since only a few values of U and V are retained: U and V are first low-pass filtered and then subsampled. There are many different ways to perform chroma subsampling; we have chosen YUV 4:1:1 subsampling, in which only every fourth value of U and V is stored together with all Y values. The horizontal color resolution is reduced with this method, but the vertical chroma resolution is not affected. The pixel information that is retained is organized into a single frame (Figure 2.4). In this way the frame size is reduced to half the original size, almost without any loss in quality visible to a human observer; the decrease in quality is still measurable with peak signal-to-noise ratio.

Figure 2.4: YUV 4:1:1 subsampling.

Since we use YUV 4:1:1 subsampling, the size in the horizontal direction is reduced. A video with a horizontal resolution of 240 pixels actually has 720 values in the horizontal direction, since a three-channel color space is used. After chroma subsampling the number of values is reduced to 360.

The organization of the Y, U and V components has no importance for modeling of the mimic with PCA or for compression of the model with PCA. It does, however, have a large impact on compression of the Eigenimages with JPEG and H.264; the compression efficiency of these methods is affected by how the pixels are assembled. JPEG and H.264 are more efficient at compressing similar areas, where similarity is measured as difference in pixel intensity. By organizing all Y pixels adjacent to each other the change in intensity is minimized. The same holds for U and V, so the compression of the images is made as efficient as possible. An example of how the pixels are organized is shown in Figure 2.5.

Figure 2.5: Assembled YUV frame.

The subsampled color information can be used to recreate a full YUV frame by duplicating the U and V components. From this frame a reconstructed video frame in RGB format can be obtained.
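A hedged NumPy sketch of the color conversion in equation (2.2) followed by 4:1:1 chroma subsampling; the assembly of the subsampled planes into a single frame, as in Figure 2.5, is left out:

```python
import numpy as np

# Rows of the matrix are the Y, U and V equations from (2.2)
RGB2YUV = np.array([[ 0.299,  0.587,  0.114],
                    [-0.147, -0.289,  0.436],
                    [ 0.615, -0.515, -0.100]])

def yuv411(rgb):
    """Convert an (h, w, 3) RGB frame to YUV and subsample chroma 4:1:1:
    all Y values are kept, U and V keep every fourth column."""
    yuv = rgb.astype(np.float64) @ RGB2YUV.T   # per-pixel matrix multiply
    y = yuv[:, :, 0]
    u = yuv[:, :, 1][:, ::4]                   # horizontal chroma resolution / 4
    v = yuv[:, :, 2][:, ::4]
    return y, u, v

# A 240-pixel-wide RGB row carries 3 x 240 = 720 values; after 4:1:1
# subsampling only 240 + 60 + 60 = 360 remain, i.e. half the data.
```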

2.1.2 Facial representation with standard techniques

Images of faces can be represented as a combination of bases. Most modern video codecs are based on the discrete cosine transform (DCT) and block matching; video coding based on these techniques is described in sections 2.1.1.1 and 2.1.1.2. This kind of coding does however require bitrates higher than most low bandwidth networks can provide, at least when the quality of the video transmission is high. The bitrate for video encoding through DCT and block matching is highly dependent on the number of frames between each frame encoded only through DCT, the Group Of Pictures (GOP) size. Video of a face shares large similarities between all frames, so the GOP size can be kept very large and the bitrate low. Still, DCT and block matching require several DCT coefficients to encode the first frame in a GOP, and several possible movements of the blocks between the frames. Consequently, even the best codec available today does not provide high quality video at very low bitrates, even when the video is suitable for high compression. A comparison of H.264 and PCA video coding is found in section 3.2.1.

Representation of facial images through DCT coefficients is clearly not sufficient when very low bitrates are wanted, so there must be other ways to represent facial images. One way is to represent the images as a collection of features from an alphabet. The idea can easily be visualized by the letters of the Latin alphabet: 26 letters, or features, are sufficient to model all the words in the English language. By building an alphabet of video features it should be possible to model all video frames as a combination of these features. A technique that uses such an alphabet is Matching Pursuit (MP) [18]. The encoder divides the original video image into features from an alphabet, and very low bitrate is achieved by transmitting only information about which features are used to the decoder, which uses the features to reconstruct the video frame. This technique uses the same idea for representation of facial images as DCT, namely that the image is represented as a combination of features; the difference lies in what these features are.

Images of a face can also be represented in other ways than by a combination of features. Several techniques make use of a wireframe to model faces. The wireframe has the same shape as a face, and to make it look more natural it can be texture-mapped with a real image of a face. To make the face move and change appearance between video frames it is enough to transmit the changes in the wireframe; consequently these techniques achieve facial representation at very low bitrates. Techniques that make use of a wireframe to model facial images include MPEG-4 facial animation [19] and model based coding [20,21]. These techniques reach very low bitrates while retaining high spatial resolution and framerate.

A method that relies on a statistical model of the shape and gray-levels of a face is the Active Appearance Model (AAM) [22]. AAMs are statistical models of the shape of an object. The models are iteratively deformed so that they fit the object. For facial coding, a model of the facial features is mapped onto a facial image. The model cannot vary in any possible way; it is constrained by the changes which occur within a training set. Depending on the statistics that are used with the training set it is possible to allow the model to vary more or less than the training set does. To make the fitting more robust the model is combined with models of the appearance around all the points in the model; these models work on the statistical variation of gradients or lines. AAM fitting always starts with an initial estimate of the model and ends when the fit is considered good enough, often decided with an error threshold.

All of these techniques have drawbacks that are critical for efficient usage in visual communication. Pighin et al. provide a good explanation of why high visual quality is important and why video is superior to animations [23]: the face simply exhibits so many tiny creases and wrinkles that it is impossible to model with animations or low spatial resolution. Therefore any technique based on animation or texture-mapping onto a model is not sufficient. Some approaches have focused on retaining the spatial quality of the video frames at the expense of frame rate. Wang and Cohen presented a solution where high quality images are used for teleconferencing over low bandwidth networks with a framerate of one frame every 2-3 seconds [24]. The idea of using a low framerate is however not acceptable, since both high framerate and high spatial resolution are important for many visual tasks. According to Lee and Eleftheriadis, different facial parts have different encoding needs [25]: the sense of having eye contact is sensitive to low resolution, while lip synchronization is sensitive to low framerate. Therefore it is not sufficient to provide either high framerate or high spatial resolution; both are important. Any technique that wants to provide video at very low bitrates must be able to provide video with high spatial resolution, high framerate and a natural-looking appearance.


2.1.3 Other implementations of PCA for video coding

There are some previous implementations that use principal component analysis for video coding; they aim at encoding facial sequences at very low bitrates with high quality. Torres et al. have published several articles about using PCA for encoding the facial parts of video frames [11,26,27,28,29]. Their implementation is conveyed within the MPEG-7 standard [30], where the video frame is divided into objects which can be encoded separately. They encode the facial parts of the frames through PCA and the rest of the objects with other techniques. Their implementation of PCA is called Adaptive PCA, and they use it in combination with DCT coding. The first image in a sequence is coded with DCT intracoding. Each following frame is projected onto an Eigenspace for encoding. If the reconstruction from the Eigenspace satisfies a certain quality, this is the only coding of the facial image that is used. If the reconstruction is not good enough, the frame is intercoded with DCT using the first frame as reference, and the DCT encoded image is added to the Eigenspace so that the representation of the images is improved.
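Schematically, the Adaptive PCA decision logic reads as follows (a hedged sketch of our reading of the cited papers: the names are our own, and a PSNR test stands in for whatever quality criterion is actually used):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def adaptive_pca_encode(face, I0, Phi, threshold=32.0):
    """Project the facial object onto the Eigenspace; fall back to DCT
    intercoding when the reconstruction quality is insufficient."""
    alpha = Phi @ (face.ravel() - I0)
    rec = (I0 + alpha @ Phi).reshape(face.shape)
    if psnr(face, rec) >= threshold:
        return ('pca', alpha)     # coefficients only: very low bitrate
    return ('dct', face)          # placeholder for DCT intercoding; the frame
                                  # would then be added to the Eigenspace
```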

Crowley et al. call their implementation orthonormal basis coding [31,32,33,34], but it is PCA, since it is based on a combination of orthonormal bases. They use a threshold method, rather than a full PCA calculation, to extract the bases used for encoding, and their results are achieved for video sequences of 400 frames. The normalized cross-correlation is computed between the first image and each new frame. When the cross-correlation falls below a threshold, the new image is added to the Eigenspace, and the cross-correlation is then calculated between this image and the subsequent images. When the cross-correlation again drops below the threshold the new image is added to the bases, and so on. To do this more efficiently they compare the cross-correlation between the first image and all the other images. Very similar images are placed in the same set as the first frame, reasonably similar images are placed in a new set, and images which are not similar are placed in a third set. The third set is then compared to an image in the second set and grouped as either very similar, reasonably similar or completely different. This procedure continues until all images are grouped in sets, and a representative image from each set is chosen as a basis for the Eigenspace.
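The sequential variant of this threshold method can be sketched as follows (a hedged illustration: whole-frame normalized cross-correlation with an arbitrary threshold):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two frames."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return np.mean(a * b)

def select_basis_frames(frames, threshold=0.98):
    """Keep a frame as a new basis candidate whenever its correlation
    with the most recently kept frame drops below the threshold."""
    kept = [frames[0]]
    for f in frames[1:]:
        if ncc(kept[-1], f) < threshold:
            kept.append(f)
    return kept
```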

2.1.4 Scalable video coding

Scalable video coding (SVC) enables encoding and transmission of a video sequence divided into partial bitstreams. The partial bitstreams are designed to provide video with lower temporal resolution, spatial resolution and/or visual quality when fewer bitstreams than available are used; as more bitstreams are added, the quality and resolutions improve. This enables heterogeneous decoders to receive as many bitstreams as the network allows or the device can make use of. Scalability refers to a meaningful removal of certain parts of the video. Scalable video can provide a solution for heterogeneous networks and devices, since the video can be optimized for transmission with several different parameters. The goal for SVC is to encode video into several layers which can be decoded with the same complexity and reconstruction quality as if the video were encoded into a single layer, without using any more bits. This is not really possible, so a complexity, quality and bitrate similar to one-layer encoded video is a more realistic goal. Degradation of the video can be needed because of restrictions in bitrate, video format or power usage, or large transmission losses in the network. The encoder must encode the video into layers for a decoder to be able to decode the video in a scalable manner; the decoder can only choose from the layers that the encoder has created, it cannot select the reconstruction entirely by itself.

Scalable video encoding has been standardized for quite some time; it is part of several older standards [13,14]. But the scalable features of these standards are seldom used, since scalability is achieved only with large losses in complexity and quality, even if the bitrate for scalable video is similar to single-layer encoded video. The current state-of-the-art standard for video coding, H.264, has scalable extensions, meaning that video can be layer-encoded with this technique. Schwarz et al. provide an extensive introduction to scalable video within the H.264 standard [35]. They compile a list of requirements for scalable video:

• Similar coding efficiency for layered encoding as for single-layer encoding. This should be valid for each subset of the scalable bit stream.

• Small increase in decoding complexity compared to decoding single-layer encoded video.

• Possibility of scaling in quality, resolution and framerate.

• Support of a backward compatible base layer; the layer-encoded video shall function with single-layer decoders.

• Simple bitstream adaptations of the encoded video. It should be possible to design the bitstream based on the layers that the decoder wants to receive.

A scalable video compression scheme should meet all of these requirements to be able to provide scalable video over different networks and to devices with different needs and/or functionality.

2.2 Technical solutions

2.2.1 Hands-free video equipment

The fact that it is mostly the facial mimic that changes when a person communicates verbally can be exploited in compression algorithms to reach very low bitrates. By removing the global motion of the head we can concentrate on modeling the local motion, i.e., the facial mimic. Consequently, we have used hands-free video equipment both to allow the user to use both hands and to make encoding more efficient. We have used two different systems (Figure 2.6); they will from now on be referred to as the helmet and the backpack. The helmet emulates a hands-free video system with automatic normalization. Video sequences recorded with the helmet consist of the user's face or head and shoulders. To emulate such a solution for video communication we use a construction helmet and a web camera. A light metal arm is attached to the helmet and the web camera is attached to the metal arm. The camera is positioned so that it films the face of the person wearing the helmet. The helmet ensures that the facial features are positioned at approximately the same pixel position from the beginning.

The backpack consists of a backpack, an aluminium arm and a mounting for a camera at the tip of the arm. This equipment is built by the company Easyrig AB [36]. A major difference between the helmet and the backpack is that with the helmet the camera follows the head motion and ensures that the face is filmed at a near frontal view. The backpack follows the motion of the wearer's back, and normalization of the facial features is not automatic. The face can move independently of the camera and is not always filmed at a near frontal view. But the backpack allows the user to have free hands and to move quickly without placing the camera on a tripod or on the ground.

2.2.1.1 Modeling efficiency with the helmet

To show how hands-free equipment improves the modeling and coding efficiency we have performed a test with the helmet. We have recorded a video sequence where the user displays Ekman's six basic emotions, but without the use of the hands-free helmet. The video sequence is recorded with a fixed camera, so when the user moves and rotates, the position and angle of the person's head change in the video. Example frames are shown in Figure 2.7.

The contents of the video sequences are not exactly the same, so they cannot be compared objectively, e.g., with PSNR. We use the bound from section 3.4.1 for comparison. This bound describes the efficiency of a facial mimic model extracted from a video sequence by stating the highest quality that can be achieved with a certain number of model dimensions. The bound is still measured objectively, but the different video sequences can be compared. It is clear from Figure 2.8 that the theoretical quality of sequences recorded with the hands-free video equipment is much higher than that of the sequences recorded without it.

2.2.2 Basic emotions

The creation of a model for facial mimic requires that it is possible to create a space which is spanned by a set of representative facial expressions. This idea has been addressed within the psychological community. According to the American psychologist Paul Ekman, all facial emotions consist of blended versions of only six basic emotions [4,5].


Figure 2.6: Hands-free video equipment: (a) the helmet; (b) the backpack.


Figure 2.7: Example frames from video recorded without hands-free equipment.

Figure 2.8: Difference between using the hands-free helmet (solid line) and using a fixed camera (dashed line).


Figure 2.9: The six basic emotions: (a) happiness, (b) sadness, (c) surprise, (d) fear, (e) anger, (f) disgust.

The six basic emotions are visualized in Figure 2.9. Ekman later added a neutral expression to his definition, so that 7 emotions are needed. The neutral expression is, however, contained in the transitions between the other emotions, so there is no need to model it. By modelling the 6 basic emotions you actually model all possible facial emotions.

2.2.3 Video sequences

Several experiments in this work are based on the same 10 video sequences recorded with the helmet (Section 2.2.1). Each of these video sequences shows the face or head-and-shoulders of a person. The person displays the six basic emotions which Ekman has proposed. Between the emotions the person in the video returns to a neutral facial expression. Each video sequence is approximately 30 seconds long and a new emotion is displayed every fifth second. Each emotion is displayed for 2-4 seconds, consisting of a change from a neutral expression to a basic emotion and a return to a neutral expression. The spatial resolution of the video sequences is 240x176 pixels and the framerate is 15 fps. The values for framerate and resolution are chosen to approximately match the settings used in 3G video telephony from 2003. Two different backgrounds are used for the video sequences: eight of the sequences are shot against a white background and two against a non-homogeneous background. Example frames from some of the video sequences are shown in Figure 2.10.


Figure 2.10: Example frames from three video sequences.

These sequences are used for calculation of theoretical bounds, evaluation of PCA video coding and Eigenspace updating. They are also used for asymmetrical principal component analysis. Four of these sequences show the same person recorded on different occasions, and they are used for the examination of Eigenspace re-use.

We also make use of video sequences that show one person while he is communicating verbally. These video sequences are 2 minutes long. The facial mimic in these sequences is less expressive than the mimic in the previously described video sequences, but they have the same framerate and spatial resolution (15 fps, 240x176 pixels).

We furthermore use 5 video sequences recorded with the backpack, which have a high definition (HD) resolution of 1440x1080 pixels (Figure 2.11). These video sequences have a framerate of 25 fps and show one person while he is communicating or displaying the basic emotions.

For side view and profile encoding we have recorded video sequences without any hands-free video equipment. These video sequences are shot in front of a mirror which is tilted 90 degrees (Figure 2.12), which allows us to record both the frontal and side view in the same video with a correspondence between the two views (Figure 2.13).

To verify the usability of the hands-free video helmet (section 2.2.1) we use two video sequences recorded without any hands-free equipment. These sequences show a person displaying the basic emotions, but where the head is free to move independently of the camera (example frames can be found in section 2.2.1.1). The sequences have the same framerate and resolution as the video sequences recorded with the helmet: 15 fps and 240x176 pixels.

2.2.4 Locally linear embedding

When sequential data is evaluated individually, an unnatural scene transition might occur: each result is good on its own, but as a sequence the result is not satisfactory. For video, the result will be frames with high individual quality, but as a sequence it will be jerky and unnatural. To generate natural-looking video the transitions between the frames also have to be considered, as a cost for smoothness.


Figure 2.11: Example frame from an HD video sequence.


Figure 2.12: The tilted mirror.

Figure 2.13: Frame from a video sequence shot in front of the mirror.


A dimension reduction tool can measure the similarity between two adjacent frames. The non-linear dimension reduction tool called locally linear embedding (LLE) [37] is well suited for this task. LLE is an unsupervised learning algorithm that computes low-dimensional embeddings of high-dimensional data. The low-dimensional embedding preserves the neighborhood relations in the data, and LLE can learn the global structure of nonlinear manifolds. We use LLE to embed the changes in the face, i.e., the facial mimic, in a very low-dimensional space. This space should reflect the structure of the facial mimic.

The approach used is to project two frames into an LLE space and use the distance there to measure the similarity between the frames:

S(I_{i-1}(x), J_k(x)) = D(L(I_{i-1}(x)), L(J_k(x)))   (2.3)

where S() is a cost function, L() is the LLE projection operation and D() is a distance measure on the locally linear embedding space. I is an input frame and J is an image from the gallery.

By enforcing smooth transitions we reduce the unnatural jumps between frames.
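To make this concrete, here is a minimal sketch of such a similarity measure built on scikit-learn's LocallyLinearEmbedding; the function and parameter names are our own illustration, under the assumption that the gallery frames are available as vectorized grayscale images, and are not taken from the thesis.

# Sketch of the LLE-based similarity in equation (2.3); illustrative only.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

def lle_similarity(gallery, frame_a, frame_b, n_dims=2, n_neighbors=10):
    """Embed the gallery with LLE, then measure the Euclidean distance
    between two frames in the low-dimensional embedding space."""
    # gallery: (n_frames, n_pixels) array of vectorized face images
    lle = LocallyLinearEmbedding(n_components=n_dims, n_neighbors=n_neighbors)
    lle.fit(gallery)
    # L(): project both frames into the learned embedding
    a, b = lle.transform(np.vstack([frame_a, frame_b]))
    # D(): Euclidean distance in the embedding space
    return np.linalg.norm(a - b)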

2.2.5 High definition (HD) video

High-definition (HD) video refers to any video system of higher resolution than the standard-definition video used in regular TV broadcasts and DVD movies. The display resolutions for HD video are called 720p (1280x720), 1080i and 1080p (both 1920x1080). A resolution of 1440x1080 is called 1080 anamorphic. The difference between i and p is that interlaced (i) uses interlaced fields with half the stated resolution, while progressive (p) uses frames with full resolution. As of today the TV transmissions that are labelled HDTV use either 720p or 1080i; in Sweden it is mostly 1080i. The video that we use as HD video has a resolution of 1440x1080. It is originally recorded as interlaced video with 50 interlace fields per second, but it is transformed into progressive video with 25 frames per second.

2.2.6 The face space and personal mimic space

A space which contains all possible human faces can be called a face space. It is high-dimensional and faces are distributed within the space. This space is mostly used for facial recognition, where the goal is to find differences between faces. PCA is an important technique for the creation of this space, since PCA reduces the dimensionality of a space but still retains the distance, the difference, between the contents of the space. A face space extracted through PCA is compact enough to enable real-time usage, and it is possible to find differences between the faces within the space.

The Euclidean distance between the faces in the face space represents the difference in appearance that exists between the faces, the appearance difference between people. This space contains no changes within the face; no facial mimic. A personal mimic space contains the same person's face but with different facial expressions. This space models all the possible expressions a person can exhibit through her face. A good explanation of the face space, or personal mimic space, is given by Ohba et al. [38,39].

Just as the face space can be reduced in dimensionality through PCA, so can the personal mimic space. The separation of the different faces in the personal mimic space corresponds to the separation of the facial expressions. The space is high-dimensional and cannot be visualized in full, so the first three dimensions of two mimic spaces are shown in Figure 2.14. Each point within this space is a certain expression, and every position contained by, and close to, the points corresponds to other expressions. By modeling this space you actually model a person's facial mimic.

2.2.7 Quality evaluation

In most experiments we make use of objective quality assessment through peak signal-to-noise ratio (PSNR). This quality measure depends on the mean square error (mse) between the original and reconstructed images. mse and PSNR are calculated according to:

mse = \frac{1}{h*v} \sum_{j=1}^{h*v} (I_j - \hat{I}_j)^2   (2.4)

where h and v are the horizontal and vertical resolution of the frames, respectively, I is the original image and \hat{I} is the reconstructed image.

PSNR = 20 * \log_{10} \left( \frac{255}{\sqrt{mse}} \right)   (2.5)

where 255 is the maximum value for the pixel intensity, both for the RGB and YUV color spaces. A higher PSNR value means that there is a smaller difference in pixel intensity between the two images (original and reconstructed). It does not necessarily mean that the reconstruction is visually better, but a low objective difference usually means a visually better result.
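As a minimal sketch, equations (2.4) and (2.5) can be computed as follows for 8-bit images stored as numpy arrays; the function name is our own.

# Sketch of the mse/PSNR computation in equations (2.4)-(2.5).
import numpy as np

def psnr(original, reconstructed):
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)  # average over all h*v pixels
    return 20.0 * np.log10(255.0 / np.sqrt(mse))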

In the thesis there are two different PSNR calculations:

• PSNR calculated for the R, G and B color channels. For these calculations the PSNR is presented as a single value.

• PSNR calculated for the Y, U and V channels of the YUV color space. For these calculations the PSNR is presented as three separate values: one each for Y, U and V.


Figure 2.14: The first three dimensions of the personal mimic space for two facial mimics.


Principal component analysis video coding

3.1 Principal component analysis

Principal component analysis (PCA) [6] is a transform for vector spaces that can be used to reduce the dimensionality of a data set. It is sometimes called the most valuable result from applied linear algebra. PCA is a linear transform that creates a new coordinate system for data. The new coordinates are selected based on the variance of the data: the first coordinate spans the direction of highest variance in the data, the second spans the highest variance orthogonal to the first coordinate, the third spans the highest variance orthogonal to all previous coordinates, and so on. The technique works by calculating the Eigenvalues of a matrix and deciding the importance of each part of the matrix, i.e., each row or column. The coordinates can be called bases, and the goal of PCA is to find the most meaningful bases for the given data set. The new bases are a linear combination of the old bases, so PCA is a linear transform.

If the original data is multi-dimensional it is first transformed into one-dimensional data, i.e., into vectors. The length of each vector is equal to the product of the lengths of all dimensions. In this work we deal with images, so the original data is two-dimensional and the length of each vector is h*v, the number of elements in the images:

I = \begin{pmatrix} C_{1,1} & C_{2,1} & \cdots & C_{N,1} \\ \vdots & \vdots & & \vdots \\ C_{1,h*v} & C_{2,h*v} & \cdots & C_{N,h*v} \end{pmatrix}   (3.1)

where I is the original data as vectors and C is the original data as images. N is the number of images (frames) in the original data.

The mean of the data is removed so that the remaining data is centered on zero. The mean I_0 is calculated as:

I_0 = \frac{1}{N} \sum_{j=1}^{N} I_j   (3.2)

The mean is then subtracted from each vector in the original data, ensuring that the data is zero-centered.

\bar{I}_j = I_j - I_0   (3.3)

In mathematical terms PCA is a transform of the data, where the new data is a linear combination of the original data. The combination is decided by a matrix F.

\tilde{I} = F \bar{I}   (3.4)

PCA is theoretically the optimal transformation of data in a least square error sense, meaning that there is no other linear representation that outperforms a representation yielded through PCA.

We want to find an orthonormal matrix F, where \tilde{I} = F \bar{I}, such that the covariance matrix of \tilde{I} is diagonal; the rows of F are then the principal components of the original data I. An orthonormal matrix is a set of bases which are orthogonal to each other and have norm 1. This matrix can be found from the covariance matrix of the mean-centered data. One way to find it is to use singular value decomposition (SVD) (section 3.1.1).

The covariance matrix \{\bar{I}\bar{I}^T\} for \bar{I} will be very large, since it is a square matrix of size h*v. To reduce the computational complexity of calculating the SVD or the covariance matrix, it is possible to create \{\bar{I}^T\bar{I}\} instead. This matrix is a square matrix of size N. The Eigenvectors of this matrix can be multiplied with the original data I to create the Eigenvectors φ_j for I.

φ_j = \sum_i b_{ij} (I_i - I_0)   (3.5)

where I is the original data and I_0 is the mean of all the original data. b_{ij} are the Eigenvectors from the covariance matrix \{(I_i - I_0)^T (I_j - I_0)\}. The Eigenvectors with the highest Eigenvalues correspond to the principal components with the highest variance. When PCA has been performed for a data set the Eigenvectors can be arranged based on their importance, and the Eigenvectors with low Eigenvalues can be discarded without any significant loss in data accuracy. An Eigenspace Φ can consist of N Eigenvectors φ_j:

Φ = \begin{pmatrix} φ_{1,1} & φ_{2,1} & \cdots & φ_{N,1} \\ \vdots & \vdots & & \vdots \\ φ_{1,h*v} & φ_{2,h*v} & \cdots & φ_{N,h*v} \end{pmatrix}   (3.6)


Only a chosen number M of principal components in Φ, j = (1, 2, ..., M), with the largest Eigenvalues Λ has to be stored as the Eigenspace.

Φ = \begin{pmatrix} φ_{1,1} & φ_{2,1} & \cdots & φ_{M,1} \\ \vdots & \vdots & & \vdots \\ φ_{1,h*v} & φ_{2,h*v} & \cdots & φ_{M,h*v} \end{pmatrix}   (3.7)

Coefficients {α_j} are extracted for any vector of the same size as the data by removing the mean I_0 from the data and multiplying with the Eigenvectors:

α_j = (I - I_0)^T φ_j   (3.8)

A coefficient α_j is available for every vector φ_j in the Eigenspace Φ. The coefficients can be used together with the Eigenvectors φ_j and the mean I_0 to approximate the original data. If all coefficients are used the representation is error-free:

\hat{I} = I_0 + \sum_{j=1}^{N} α_j φ_j   (3.9)

If only a chosen number of coefficients M are used to reconstruct the data thiswill incur an error in the representation.

\hat{I} = I_0 + \sum_{j=1}^{M} α_j φ_j   (3.10)

Because PCA is very efficient at compacting the important data, the number M can be kept very small while the representation error (I - \hat{I}) remains low.

An important assumption for PCA in this work is that the data is Gaussian distributed (normally distributed). A Gaussian distribution ensures that the extracted principal components are not only orthogonal but also independent.

3.1.1 Singular Value Decomposition

Singular value decomposition (SVD) is a factorization method from linear algebra. It divides a matrix into three matrices:

I = U Σ V^T   (3.11)

where U and V are unitary matrices and Σ is a diagonal matrix containing the singular values. Together the matrices contain the Eigenvectors and Eigenvalues of the original matrix I.

SVD is used in PCA to extract the Eigenvectors (principal components) and Eigenvalues of the matrix I. Only the Eigenvectors are used for further processing in PCA; the Eigenvalues are used for calculation of model efficiency.
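A minimal sketch of Eigenspace extraction along these lines, using numpy and the small N x N matrix trick of equation (3.5); the names and the normalization of the Eigenimages are our own illustrative choices, not the thesis implementation.

# Sketch of Eigenspace extraction via the N x N covariance trick.
import numpy as np

def extract_eigenspace(I, M):
    """I: (n_pixels, N) matrix with one vectorized frame per column.
    Returns the mean I0 and the M leading Eigenimages (n_pixels, M)."""
    I0 = I.mean(axis=1, keepdims=True)            # eq. (3.2)
    Ic = I - I0                                   # eq. (3.3), mean-centered
    # Eigen-decompose the small N x N matrix instead of the huge
    # (h*v) x (h*v) covariance matrix.
    eigvals, B = np.linalg.eigh(Ic.T @ Ic)
    order = np.argsort(eigvals)[::-1]             # largest Eigenvalues first
    B = B[:, order[:M]]
    Phi = Ic @ B                                  # eq. (3.5), back to pixel space
    Phi /= np.linalg.norm(Phi, axis=0)            # normalize each Eigenimage
    return I0, Phi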


3.2 Principal component analysis video coding

Video coding through principal component analysis actually consists of two parts:

1. Creation of model for facial mimic.

2. Encoding and decoding (reconstruction) of video frames.

PCA is only used for the first part, model extraction. Both encoding and decoding of video are done through vector multiplications.

Assume that there is an original video sequence C with a given number of frames N. The video shows a person displaying the basic emotions and is normalized regarding the pixel positions of the facial features. The most prominent information in this video is the changes in the face, i.e., the facial mimic. A model of this video extracted through PCA is a personal mimic space for this person. According to section 3.1 a model, i.e., an Eigenspace, is extracted from the original video frames. Before the original data is transformed to vectors it is YUV subsampled (section 2.1.1.3), so the length of each vector is halved compared to using all available RGB pixels.

I = \begin{pmatrix} C_{1,1} & C_{2,1} & \cdots & C_{N,1} \\ \vdots & \vdots & & \vdots \\ C_{1,(x/2)*y} & C_{2,(x/2)*y} & \cdots & C_{N,(x/2)*y} \end{pmatrix}   (3.12)

where I is the original data as vectors and C is the original data as images. N is the number of frames, x is the horizontal resolution in RGB pixels and y is the vertical resolution.

An Eigenspace Φ = [φ_1 φ_2 ... φ_N] is extracted from I, with the mean I_0 subtracted from I.

φ_j = \sum_i b_{ij} (I_i - I_0)   (3.13)

where b_{ij} are the Eigenvectors from the covariance matrix \{(I_i - I_0)^T (I_j - I_0)\}. Encoding amounts to extracting projection coefficients {α_j} for each video frame that should be encoded:

α_j = φ_j^T (I - I_0)   (3.14)

where I is the video frame that is encoded. Decoding is performed by combining the mean I_0, the coefficients {α_j} and the Eigenvectors φ_j:

\hat{I} = I_0 + \sum_{j=1}^{M} α_j φ_j   (3.15)


where M is the selected number of principal components used for reconstruction (M < N). The extent of the error incurred by using fewer components than possible (M < N) is examined in section 3.4. With the model it is possible to encode entire video frames into only a few coefficients {α_j} and reconstruct the frames with high quality. Only these coefficients {α_j} need to be transmitted between encoder and decoder when they both have access to the model, i.e., access to the Eigenspace Φ. This allows video coding to be performed at extremely low bitrates. Exactly how low is also discussed in section 3.4.
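A minimal sketch of the encode and decode steps in equations (3.14) and (3.15), assuming the Eigenspace Phi (one Eigenimage per column) and the mean I0 from the previous sketch are available at both encoder and decoder; the names are illustrative.

# Sketch of PCA video encoding/decoding; only alpha is transmitted.
import numpy as np

def encode(frame, I0, Phi):
    """Project one vectorized frame onto the Eigenimages, eq. (3.14)."""
    return Phi.T @ (frame - I0.ravel())           # M coefficients

def decode(alpha, I0, Phi):
    """Combine mean and weighted Eigenimages, eq. (3.15)."""
    return I0.ravel() + Phi @ alpha               # reconstructed frame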

PCA allows the decoder to scale the resulting video based on reconstruction quality. A high number M of principal components will increase the quality of the reconstructed video compared to a low number of principal components. The encoder does not have to do anything differently for decoding with different video quality. Heterogeneous decoding of the same encoding is thus achieved with PCA; usually the encoder has to encode the video into layers and the decoder scales the quality based on the number of layers used for decoding.

A detailed explanation of PCA video coding is found in Paper I.

3.2.1 Comparison with H.264

The state-of-the-art in video coding today is a standard called H.264/AVC [15,16]. Any new video coding technique should be compared with the best available today. To show the benefits of our PCA video codec we compress the same video sequences with H.264 and compare the results. We have used ffdshow v. 20051129 to perform the H.264 encoding. The I-frame interval is set to 500, so the video sequences, which are 450 frames long, are encoded with an initial I-frame and then P-frames (IPPPP...). We use one-pass encoding based on quantization steps, and we use the quantization steps to encode the video at different bitrates. It is actually not possible to reach bitrates lower than 2.4 kbps with H.264: even when the highest quantization steps and only one I-frame are used, the bitrate never goes below 2.4 kbps. The results measured in PSNR for two video sequences are shown in Figure 3.1.

Some example frames for H.264 encoding at 2.4 kbps are shown in Figure 3.2. It can clearly be seen that the visual quality at this rate is far from satisfying: it is possible to discern that this is a face, but nothing more. When we use coding based on PCA the result is vastly better, even at lower bitrates.

3.3 Encoding and decoding time

The encoding and decoding of this video is very simple compared to other video coding schemes. But to ensure that the coding scheme can work in real-time on mobile devices, we measured the encoding and decoding time on a computer with limited computational and memory capacity (a Pentium III workstation with 256 MB RAM).


Figure 3.1: Comparison between a codec based on PCA (solid line) and an H.264 codec (dashed line) at very low bitrates.

Figure 3.2: Example frames from H.264 encoding at 2.4 kbps.


The average time for encoding one video frame was 10.9 ms for an Eigenspace of 10 Eigenimages and 26.6 ms for 25 Eigenimages. On average it took 13.8 ms to decode one frame with 10 Eigenimages and 34.9 ms with 25 Eigenimages. The combined encoding and decoding time corresponds to a framerate of 40 fps for 10 Eigenimages and 16 fps for 25 Eigenimages. The video sequences have a framerate of 15 fps, so it is possible to encode them with up to 25 Eigenimages in real-time. The capacity of mobile devices today is higher than the capacity of our test setup.

3.4 Theoretical bounds

The performance of PCA video coding depends on the facial mimic model extracted through PCA. The efficiency of this model directly decides how many dimensions are needed to reach a certain representation quality. By modelling the 6 basic emotions it is possible to model all possible facial emotions. It is however not so straightforward to say that you can use 6 images of the basic expressions to model all images of the facial expressions; there is a difference between the psychological representation of expressions and the representation in digital images. The representation in digital images, and video, is usually measured objectively and not subjectively. So, the result will be a certain quality for a given number of model dimensions, not a claim that all possible emotions are modelled. We have examined how compact a representation needs to be to reach a certain reconstruction quality for images approximated by the model. Kirby and Sirovich have previously stated that 100 male Caucasian faces are enough to model all possible male Caucasian faces [40,41]. This means that a space of 100 dimensions is enough to model millions of faces. All these faces do however have the same facial expression, usually a neutral expression where the teeth are not showing. This does not model any facial expression, and the modelling of facial mimic is not considered.

There are implementations that take the personal mimic space into consideration. These implementations use a personal mimic space for encoding of facial mimic, either in still images or video sequences. The results show that models of the facial mimic can represent facial images at very low bitrates with high quality, but there is no bound that describes how good the modelling can be. Torres et al. report an average reconstruction quality of 29.6 dB when they use 16 basis images [26]. Crowley et al. show that they can produce a reconstruction quality of almost 35 dB (PSNR) when they use 15 basis images for reconstruction [31]. We have shown that when 10 basis images are used an average reconstruction quality above 31 dB can be achieved [42]. More information about these implementations is found in section 2.1.3 and section 3.2. We will describe two different boundaries for facial mimic modelling. The bounds are affected by several factors, so we have limited the boundaries to the following circumstances:

• A spatial resolution of 240x176 pixels.


• A color depth of 8 bits per pixel (0-255).

• RGB color space in the original video (24 bits).

• Theoretical bounds are calculated individually for each person.

• The objective quality is measured for the R, G and B channels together.

The spatial resolution is chosen to match the resolution of mobile phones. The standard with the highest quality in use when these boundaries were calculated was QVGA, with a total of approximately 77000 pixels [43]. A resolution of 240x176 pixels is equal to 42240 pixels. The resolution can be different while the rate is still unchanged; the quality might be affected.

We have calculated two bounds for the modelling efficiency: a distortion bound and a rate-distortion bound. The distortion bound shows the minimum distortion that can be achieved for a specific source when there are no restrictions on the rate. The rate-distortion bound describes the minimum rate that can be used at a given distortion; the source cannot be modelled correctly at a rate lower than the rate described by the function. The bounds are measured objectively, since it is impossible to calculate bounds for something that is subjective; the achieved subjective quality depends on the subjects.

3.4.1 Distortion bound

The representation quality of facial mimic is affected by the number of dimensions used for modelling of the mimic. The mean value for each pixel position, I_0, also affects the modeling to a large extent. In section 3.1 we describe how data is modelled through PCA; this representation is error-free when all N principal components are used. An error in representation occurs when fewer features, or components (M < N), are used. The error is calculated as:

mse(opt) = \sum_{j=M+1}^{N} λ_j   (3.16)

where λ_j are the Eigenvalues of the principal components. The modeling efficiency is calculated as the sum of the error incurred by not using a number of Eigenvectors.

How the mean square error can be calculated for a high-dimensional source when the mean has been subtracted is explained by Fukunaga [44]. This only explains how much information is removed from mean-centered data; to calculate the quality of the total representation you also need to include the mean.

A mean square error bound can be calculated for the number of Eigenvectors φ_j used for image representation (equation 3.16) by varying the number of model dimensions M used for representation of the facial mimic. This is the distortion bound for representing the signal with the selected number of Eigenvectors.

Number of Eigenvectors φ_j    5      10     15     20     25
PSNR [dB]                     34.56  36.82  38.13  39.07  39.75

Table 3.1: Average PSNR values for 10 facial mimic video sequences.

If the mean square error (mse) is reduced, the representation error is reduced. Peak signal-to-noise ratio (PSNR) is calculated from the mse, so a higher value means better representation.

Even though the distortion bound is calculated individually for each person, the average results for facial mimic from 10 video sequences (6 different persons) are shown in Table 3.1. This table shows the average maximum quality that can be reached for facial mimic representation.

The bound starts at 0 Eigenvectors and continues above 25 as well; we have chosen to calculate the boundary from 5 to 25 Eigenvectors.
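As a rough illustration, the distortion bound of equation (3.16) can be evaluated as below, under the assumption that the Eigenvalues are given on a per-pixel scale so that their sum is directly comparable to the mse of equation (2.4); this is a sketch, not the thesis implementation.

# Sketch of the distortion bound, eq. (3.16), and the sweep in Table 3.1.
import numpy as np

def distortion_bound_psnr(eigvals, M):
    """PSNR bound when only the M largest principal components are kept;
    eigvals is assumed sorted in descending order."""
    mse_opt = np.sum(eigvals[M:])                 # eq. (3.16)
    return 20.0 * np.log10(255.0 / np.sqrt(mse_opt))

# for M in range(5, 26, 5):                      # as in Table 3.1
#     print(M, distortion_bound_psnr(eigvals, M))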

3.4.2 Rate-Distortion bound

With the distortion bound we only examined the maximum quality that can be reached when modelling facial mimic with a certain number of dimensions. For the rate-distortion bound we also calculate the minimum bitrate needed to reach a certain representation quality. In many low bandwidth applications (almost all) the limiting factor is bandwidth, so the bitrate becomes an important factor, often more important than the quality. The quality will be maximized based on the available bitrate, or the available bandwidth.

If quantization is used, one important question is how to assign the bits to model the source in the most efficient way, assuming that the bitrate is fixed. The actual assignment of bits is not discussed here; the bound describes how many bits you can use with an optimal assignment. The theoretical bound can be calculated with a rate-distortion function. Rate-distortion functions give the minimum rate for representing a source at a given distortion, or the lowest possible distortion at a given rate. High-dimensional sources can be assumed to consist of several variables. When PCA is used to create the model, these variables are Gaussian distributed and independent of each other. The rate-distortion function for such a source is calculated through reverse "water-filling", where the rate and distortion are controlled by a variable γ [45]. The variable γ controls how many dimensions, i.e., principal components, are used for source representation; the number of dimensions relates to both the rate and the distortion. An example of reverse "water-filling" when six principal components are used is shown in Figure 3.3. Only the principal components that have a variance higher than γ are represented with bits; all others are ignored.

The rate distortion function controlled by γ is given by


Figure 3.3: Reverse water-filling for independent Gaussian principal components. Only the components that have a variance larger than γ are allocated bits in the quantization process.

Figure 3.4: Mean rate-distortion bound for all 10 video sequences of facial mimic.

R(D) = \sum_{j=1}^{M} \frac{1}{2} \log \frac{σ_j^2}{D_j}   (3.17)

where

D_j = \begin{cases} γ, & \text{if } γ < σ_j^2 \\ σ_j^2, & \text{if } γ ≥ σ_j^2 \end{cases}   (3.18)

and γ is chosen so that \sum_{j=1}^{M} D_j = D.

The average results for 10 video sequences are shown in Figure 3.4.
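A minimal sketch of the reverse water-filling computation in equations (3.17)-(3.18); the choice of log base (bits) and the idea of sweeping γ to trace the bound are our own illustrative assumptions.

# Sketch of reverse water-filling for independent Gaussian components.
import numpy as np

def reverse_water_filling(variances, gamma):
    """variances: Eigenvalues sigma_j^2 of the principal components.
    Returns total distortion D and rate R(D) for water level gamma."""
    variances = np.asarray(variances, dtype=np.float64)
    D_j = np.minimum(variances, gamma)            # eq. (3.18)
    D = D_j.sum()
    active = variances > gamma                    # only these get bits
    R = 0.5 * np.sum(np.log2(variances[active] / gamma))  # eq. (3.17)
    return D, R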


3.4.3 Comparison of the distortion bounds

The two bounds are calculated for the same video sequences, so they are directly comparable. The distortion bound describes the representation quality achievable with an unlimited bitrate. The rate-distortion bound describes the quality achievable with a given number of bits. As the bitrate is increased, the two bounds come closer, and if the rate in the rate-distortion bound is high enough they are exactly equal. The difference between the two bounds can be measured as:

M ∗ γ (3.19)

where M is the number of Eigenvectors φ_j used for reconstruction and γ is the level from equation 3.18.

A more detailed description of the boundaries can be found in Paper II.

3.5 Eigenimage compression

Video coding based on PCA requires that principal images of the facial mimic are available at both the encoder and the decoder side. These images take up a lot of space: originally, an Eigenspace of 10 Eigenimages needs 5 MB of storage. The video sequences used in this section have a spatial resolution of 240x176 pixels. After chroma subsampling the storage need is reduced to 2.5 MB. Storage is not the main concern, since mobile devices (any device, in fact) have a storage capacity much higher than this. But transmitting 2.5 MB over any network can be costly, and over low capacity networks it is extremely costly. It would for example take more than 35 minutes to transmit 2.5 MB over a GSM network, and approximately 3 minutes over a GPRS network.

To enable transmission of the Eigenimages over a low capacity network, the Eigenspace needs to be compressed. The Eigenspace consists of images, so a straightforward way to compress it is to use image compression; we use the JPEG image compression standard [46,47]. The mean image I_0 is also needed for both encoding and decoding, and it is compressed with JPEG as well.

Compression of the Eigenimages φ_j and the mean image I_0 will affect the reconstruction quality of video encoded and decoded with them.

The images are first quantized and the reconstruction levels are stored. The quantized images are then JPEG encoded, and the encoded images are used together with the reconstruction levels from the quantization to reconstruct the Eigenimages. The three steps, sketched in code below, are:

1. Quantization of the Eigenspace. The quantization values are stored in an image. The reconstruction values are stored without loss.


2. JPEG-encoding of the quantization values.

3. Inverse quantization mapping of the JPEG-encoded values with the quantization reconstruction values.
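As a rough sketch, the three steps could look as follows, using Pillow for the JPEG stage; the uniform 8-bit quantization, the quality factor and the file handling are illustrative assumptions, not the exact thesis implementation.

# Sketch of Eigenimage compression: quantize, JPEG, inverse quantize.
import numpy as np
from PIL import Image

def compress_eigenimage(phi, h, v, quality=75):
    # 1. Uniform 8-bit quantization; min/max act as reconstruction levels
    lo, hi = phi.min(), phi.max()
    q = np.round(255 * (phi - lo) / (hi - lo)).astype(np.uint8)
    # 2. JPEG-encode the quantization values as a grayscale image
    Image.fromarray(q.reshape(v, h)).save("eigenimage.jpg", quality=quality)
    return lo, hi

def decompress_eigenimage(lo, hi):
    # 3. Inverse quantization mapping of the JPEG-decoded values
    q = np.asarray(Image.open("eigenimage.jpg"), dtype=np.float64)
    return lo + (q.ravel() / 255.0) * (hi - lo)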

3.5.1 Quantization - uniform or pdf-optimized?

The quantization of the Eigenspace Φ and the mean image I_0 can be performed in many different ways. We have found that when ≈ 7 bits are used for quantization there is no difference between uniform and pdf-optimized quantization. 8-bit quantization is used since it is a suitable number for computers (1 byte). We therefore use uniform quantization in our work, since it also reduces the amount of data needed for inverse quantization.

3.5.2 Compression of the mean image

The mean image is compressed in the same way as the Eigenimages. Since a compressed mean is subtracted before the model is extracted with PCA, and the same compressed mean can be added in the decoding process, it can be compressed more heavily than the Eigenimages. The mean is compressed with a JPEG quality factor of 25, resulting in a size of 6 kB for the mean image.

3.5.3 Loss of orthogonality

The principal components in the original, uncompressed Eigenspace are orthogonal, meaning that they follow

φ_i^T φ_j = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i ≠ j \end{cases}   (3.20)

The orthogonality between the principal components ensures that the information in one principal component is independent from the information in another. This increases the compactness and efficiency of the model, since no information is described in more than one component. Compression of the Eigenimages will result in a loss of orthogonality and model degradation. The loss of orthogonality is measured as the average of the sum of the inner products between all principal components, \sum φ_i^T φ_j. When there is perfect orthogonality this sum is zero.

Even though the orthogonality between the principal components is lost in compression, it can be regained. We have examined two different methods to ensure orthogonality between the Eigenimages:

• Least-square calculation of projection coefficients

• Re-orthogonalization of the Eigenspace through modified Gram-Schmidt projection


Method               Storage need [kB]
Original             256
Quantized [8 bits]   64
JPEG-compressed      1-37

Table 3.2: Storage need for one Eigenimage.

Both of these methods ensure orthogonality between the principal components, but in different ways.
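A minimal sketch of the modified Gram-Schmidt alternative, assuming the Eigenimages are stored as the columns of a numpy matrix; illustrative only.

# Sketch of re-orthogonalization through modified Gram-Schmidt.
import numpy as np

def modified_gram_schmidt(Phi):
    Q = Phi.astype(np.float64).copy()
    n = Q.shape[1]
    for j in range(n):
        Q[:, j] /= np.linalg.norm(Q[:, j])        # normalize component j
        for k in range(j + 1, n):
            # Remove the projection onto component j from later components
            Q[:, k] -= (Q[:, j] @ Q[:, k]) * Q[:, j]
    return Q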

3.5.4 Compression methods

To compare the different methods we have evaluated their performance. As reference we have used the original Eigenimages and mean images, to see how much the different methods degrade the reconstruction performance. Table 3.2 shows the storage need for each Eigenimage when it is compressed. The methods also require different numbers of operations for encoding and decoding; they yield different complexities.

Orthogonality loss is only an important factor when the compression ratio is high. Figure 3.5 shows the orthogonality loss for the different methods together with the chosen compression level. The size of the compressed Eigenimages is depicted along the horizontal axis, and the left side of the y-axis shows the reconstruction quality in PSNR. The right side of the y-axis depicts the loss of orthogonality, and the vertical line in the figure marks the compression ratio we have chosen to use. It can clearly be seen in this figure that the orthogonality loss only has an effect at very high compression. There is no need to use any kind of re-orthogonalization: the best method is a compressed Eigenspace and mean image at both encoder and decoder, without any re-orthogonalization technique.

If 10 Eigenimages are used and compressed to 8 kB each, and the mean image is compressed to 6 kB, they can be transmitted over a GSM network in ≈ 1 minute and over a GPRS network in 6 seconds. The average reconstruction quality for video encoded with the compressed Eigenspace is still above 34 dB, a reduction of ≈ 2 dB.

3.6 Eigenspace re-use

An existing Eigenspace can be used for a new communication event instead of creating a new Eigenspace and transmitting it. Since both encoder and decoder already have access to this Eigenspace, it can be used directly.

Figure 3.5: Comparison of the different coding schemes (solid: compressed Eigenspace at both encoder and decoder; dashed: compressed Eigenspace at decoder and original mean image; dash-dotted: compressed Eigenspace at decoder and compressed mean image; dotted: LS and GS result; the remaining curve shows the loss of orthogonality).

There is an existing Eigenspace Φ which is extracted from video sequence I with the mean I_0. There is another video sequence G, and this video is encoded with Eigenspace Φ and mean I_0. Both I and G show the same person and are recorded with the helmet, but the facial images differ in viewing angle, facial size and illumination.

Encoding of G with Φ and I_0 is performed in the same way as if video sequence I were encoded. The extracted projection coefficients are denoted α^G_j, since they come from projection of video sequence G:

α^G_j = (G - I_0)^T φ_j   (3.21)

Decoding is performed by combining the coefficients {α^G_j}, the Eigenvectors φ_j and the mean image I_0: the projection coefficients {α^G_j} are multiplied with the Eigenvectors φ_j and the mean image I_0 is added to reconstruct the image \hat{G}.

\hat{G} = I_0 + \sum_{j=1}^{M} α^G_j φ_j   (3.22)

The resulting video \hat{G} will resemble I instead of G, since the mean image and Eigenimages used for reconstruction are extracted from I.

The result is not good, since the facial features are not positioned at the same pixel positions in I and G. When the mean image I_0 is subtracted from G, the result is an image that actually is not an image of a face.

3.6.1 Sensitivity to lighting and positional shift

The result is not satisfactory, and this may depend on two factors:

1. Different pixel positions for the facial features in I and G.


2. Different pixel intensities for the facial features in I and G.

The effect of different lighting is negligible, but it can still be adjusted for. The mean intensity of the mean image I_0 is calculated. The same value is calculated for each frame in G, and the difference between them can be subtracted from the new frames in G.

The difference in pixel positions for the facial features between I and G needs to be addressed. We have used affine transformation to normalize between them, i.e., align one face to the other.

Affine, or any, transformation of the Eigenimages φ_j would reduce the orthogonality of the space and consequently the encoding efficiency. So, it is more convenient to transform the frames in G instead. Affine transformation is performed through multiplication with a rotation matrix A and addition of a translation vector b.

\begin{pmatrix} \tilde{x} \\ \tilde{y} \end{pmatrix} = A \begin{pmatrix} x \\ y \end{pmatrix} + b   (3.23)

where matrix A refers to a rotation and/or scaling and vector b refers to a translation. (x, y) and (\tilde{x}, \tilde{y}) are the horizontal and vertical positions in the original and transformed image, respectively.

Semantically important feature points from the eyes and the mouth are collected, along with points from the nostrils, for both I and G. The points are selected from the mean of the respective video sequence. The rotation matrix A and translation vector b are calculated from these points. The frames of video sequence G are then transformed into \tilde{G}.

\tilde{G} = A G + b   (3.24)
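As an illustration, A and b can be estimated from the collected feature-point correspondences by least squares; this sketch and its names are our own, not the thesis code.

# Sketch of estimating the affine transform in eq. (3.23) from points.
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """src_pts, dst_pts: (n_points, 2) arrays of (x, y) correspondences.
    Returns A (2x2 rotation/scaling) and b (translation)."""
    n = src_pts.shape[0]
    X = np.hstack([src_pts, np.ones((n, 1))])     # rows [x y 1]
    params, *_ = np.linalg.lstsq(X, dst_pts, rcond=None)
    A = params[:2].T                              # matrix A in eq. (3.23)
    b = params[2]                                 # vector b in eq. (3.23)
    return A, b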

This video sequence is encoded with Eigenspace Φ and mean image I_0, yielding projection coefficients α^{\tilde{G}}_j:

α^{\tilde{G}}_j = (\tilde{G} - I_0)^T φ_j   (3.25)

These coefficients are then used with Φ to reconstruct the frames of \tilde{G}:

\hat{\tilde{G}} = I_0 + \sum_{j=1}^{M} α^{\tilde{G}}_j φ_j   (3.26)

The result is much better than without affine transformation, but it is still not acceptable. A detailed description of this experiment is found in Paper III.

Affine transformation normalization is not enough to solve the problem of different positions for the facial features. One issue might be that a global affine transformation is performed, so that all features are transformed according to a single transform matrix. There are still both visually poor frames and emotions which are mapped incorrectly between I and G. A better normalization (alignment) is needed.


Ultra low bitrate video coding

With ultra low bitrate we mean a bitrate close to 100 bits/s (0.1 kbps). The basic idea here is not principal component analysis but facial recognition. The encoder and decoder have a personal mimic gallery for a person, which contains several images of the person with different facial mimic; it should cover all the possible facial expressions a person can exhibit through her face. Encoding is done by matching an input frame against the personal mimic gallery based on facial recognition. The only information that needs to be transferred between the encoder and decoder is the index number of the selected frame from the gallery. The decoder simply has to retrieve the image with the correct number from the gallery and display it. The idea behind the coding scheme is visualized in Figure 4.1.

The personal mimic gallery may contain many images, but all the images are highly correlated since it is the same person in all of them. Highly correlated images require very few bits for indexing; our experiments show that on average we need 4 bits for indexing a specific mimic. With a framerate of 25 fps this requires only 100 bits per second for transmission. The personal mimic gallery can be downloaded, or pre-installed, for a specific user, and high quality can be used for the images in the gallery. The displayed images are taken from the gallery, so the image quality in the gallery directly decides the quality of the displayed frames.
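A minimal sketch of the index-based scheme; plain nearest-neighbour matching stands in here for the facial recognition step, and all names are our own illustration.

# Sketch of ultra low bitrate coding: transmit only a gallery index.
import numpy as np

def encode_frame(frame, gallery):
    """gallery: (n_images, n_pixels); returns a small integer index."""
    errors = np.sum((gallery - frame) ** 2, axis=1)
    return int(np.argmin(errors))                 # ~4 bits for a small gallery

def decode_frame(index, gallery):
    return gallery[index]                         # display the gallery image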

This coding gives high quality video at extremely low bitrates, but each frame is treated individually, not as the part of a sequence that each frame actually is.

Figure 4.1: Ultra low bitrate video coding scheme.

The dynamic behavior of a reconstructed video sequence is not considered. To study this we create two-dimensional locally linear embedding (LLE) [37] spaces for the original images and the images in the reconstructed video. Figure 4.2 shows a two-dimensional LLE space for the original images. The point which links the two parts of the space corresponds to a neutral face.

Figure 4.2: Personal mimic face images mapped into the embedding space described by the first two coordinates of LLE.

When the frames from the reconstructed video sequence are plotted in the LLE space (Figure 4.3), it is clear that many jumps occur between the two emotion branches. Each jump corresponds to a jerky movement in the sequence.

4.1 LLE smoothing

After smoothing in the LLE space the number of jumps is reduced significantly. Dynamic programming (DP) is used to optimize the index sequence of the reconstructed frames: instead of using the similarity cost with smoothing for each individual image, the globally optimal matching is selected. The smoothness cost is calculated from the LLE space, so the optimization is done in the LLE space.
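A minimal sketch of such a dynamic-programming (Viterbi-style) optimization, assuming the per-frame matching costs and the gallery's LLE coordinates are precomputed; the smoothness weighting and all names are illustrative assumptions.

# Sketch of globally optimal index smoothing in the LLE space.
import numpy as np

def smooth_indices(match_cost, lle, w=1.0):
    """match_cost: (n_frames, n_gallery); lle: (n_gallery, d) embedding.
    Returns the index sequence minimizing matching + smoothness cost."""
    T, G = match_cost.shape
    # Smoothness cost: distance between gallery images in the LLE space
    trans = np.linalg.norm(lle[:, None, :] - lle[None, :, :], axis=2)
    cost = match_cost[0].copy()
    back = np.zeros((T, G), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + w * trans         # (prev, curr) combinations
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(G)] + match_cost[t]
    path = [int(np.argmin(cost))]                 # backtrack optimal path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]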

Figure 4.3: The reconstructed frames shown in a two-dimensional LLE space.

Figure 4.4: The reconstructed frames shown in a two-dimensional LLE space with smoothing.

With this smoothing the resulting video will get slightly improper expressions for individual frames, but the sequence of images will appear much more natural (Figure 4.4). Since each image has high individual quality, the resulting video will still have the same high quality. Since PSNR is measured as the difference between the original and reconstructed image, an incorrect expression will have a lower PSNR even if the visual quality is the same. To evaluate the improvement LLE smoothing can give we performed a subjective evaluation. A subject is shown two frames, the original and the reconstructed image, and is asked about the match between these frames regarding the facial expression. The subject can grade the match as good, acceptable or unacceptable. The result of this evaluation is shown in Table 4.1.

Table 4.1: Subjective evaluation of reconstructed sequences

           Good    Acceptable   Unacceptable
HMM        73.7%   19.6%        6.7%
HMM+LLE    80.3%   16.2%        3.5%

A more detailed description of this coding is found in Paper IV.


Asymmetrical Principal Component Analysis video coding

5.1 Asymmetrical principal component analysis video coding

Video coding based on PCA offers high quality video at very low bitrates. The video coding is very good, but it can be improved further. There are essentially two major issues with PCA video coding. The first regards the model efficiency. PCA extracts the most important information based on the variance of the information throughout a data set. If the facial mimic changes, the underlying pixels will have a large variance and are given high importance. But the background, or the edges between the face and the background, may also shift rapidly and exhibit large variance. These are pixels without semantically important information but with high variance. Such pixels will degrade the efficiency of a PCA model, since a changing background may occur without any change in facial expression. The model should only depend on semantically important information. The second issue regards the complexity of encoding and decoding the video. The complexity is much lower than for conventional video coding based on DCT (section 2.1.1), but it can still be reduced. All operations in encoding and decoding are linearly dependent on the number of elements (pixels) in the frames and the Eigenspace Φ. A video with high spatial resolution requires more computations than a video with low resolution. But when the frame is decoded it is a benefit to have a large spatial resolution (frame size), since this provides better visual quality. A small frame should be used for encoding to reduce the complexity, and a large frame should be used for decoding to optimize the quality. This is possible to achieve through the use of pseudo principal components: information where not the entire frame is a principal component. Parts of the video frames are considered to be important; they are regarded as foreground I^f.

I^f = \mathrm{crop}(I) \qquad (5.1)

The Eigenspace for the foreground \Phi^f = \{\phi^f_1\ \phi^f_2\ \ldots\ \phi^f_N\} is constructed according to the following formula:


\phi^f_j = \sum_i b^f_{ij}(I^f_i - I^f_0) \qquad (5.2)

where b^f_{ij} are the Eigenvectors from the covariance matrix of the foreground (I^f - I^f_0)^T(I^f - I^f_0) and I^f_0 is the mean of the foreground. Encoding and decoding are performed as:

\alpha^f_j = \phi^f_j (I^f - I^f_0)^T \qquad (5.3)

\hat{I}^f = I^f_0 + \sum_{j=1}^{M} \alpha^f_j \phi^f_j \qquad (5.4)

where \{\alpha^f_j\} are coefficients extracted using information from the foreground I^f.

The reconstructed frame \hat{I}^f has a smaller size and contains less information than a full-size frame. A space spanned by components where only the foreground is orthogonal can be created. The components spanning this space are called pseudo principal components, and this space has the same size as a full frame:

\phi^p_j = \sum_i b^f_{ij}(I_i - I_0) \qquad (5.5)

From the coefficients \{\alpha^f_j\} it is possible to reconstruct the entire frame:

\hat{I} = I_0 + \sum_{j=1}^{M} \alpha^f_j \phi^p_j \qquad (5.6)

where M is the selected number of pseudo components used for reconstruction. A full-frame video can be reconstructed (Eq. 5.6) using the projection coefficients from only the foreground of the video (Eq. 5.3), so the foreground is used for encoding while the entire frame is decoded.

It is easy to prove that

\hat{I}^f = \mathrm{crop}(\hat{I}) \qquad (5.7)

since \phi^f_j = \mathrm{crop}(\phi^p_j) and I^f_0 = \mathrm{crop}(I_0).

aPCA adds something very interesting to video coding based on PCA, namely spatial scalability. We have already explained that PCA offers scalability in quality for the decoder without the encoder having to do anything extra. With the use of pseudo principal components a decoder can also decide how much of the frame it wants to decode, without any decision from the encoder. The encoder encodes the video with I^f and can produce several different versions of \Phi^p, giving the decoder freedom of spatial scalability. This spatial scalability doesn't simply function as a downsampling in size; the parts that are decoded can be decoded at full resolution. Reduction in spatial resolution is not a size reduction of the entire frame, since parts of the frame can be decoded. No quality is lost in the decoded parts; it is up to the decoder to choose how much and which parts of the frame it wants to decode.

5.1.1 Reduction of complexity for the encoder

The complexity for encoding is directly dependent on the spatial resolution of the frame that should be encoded. The important factor for complexity is K * M, where K is the number of pixels and M is the chosen number of Eigenvectors. When aPCA is used, the number of pixels k in the selected area gives a factor of n = K/k in complexity reduction.

5.1.2 Reduction of complexity for the decoder

The complexity for decoding can be reduced when a part of the frame is used for both encoding and decoding. In the formulas above we only use the pseudo principal components for the full frame \phi^p_j for decoding, but if both \Phi^p and \Phi^f are used for decoding the complexity can be reduced. Only a few principal components of \Phi^p are used to reconstruct the entire frame, while more principal components from \Phi^f are used to add details to the foreground.

\hat{I} = I_0 + \sum_{j=1}^{P} \alpha^f_j \phi^p_j + \sum_{j=P+1}^{M} \alpha^f_j \phi^f_j \qquad (5.8)
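As an illustration of Eq. 5.8, a hypothetical decoder reusing the arrays from the sketch above could mix the two spaces as follows (the split parameter P, like everything else in the sketch, is part of the illustration):

def decode_mixed(alpha, I0, phi_p, phi_f, mask, P):
    # Eq. 5.8: P pseudo components rebuild the whole frame cheaply,
    # the remaining M - P foreground components refine only the foreground.
    frame = I0 + alpha[:P] @ phi_p[:P]      # coarse full-frame reconstruction
    frame[mask] += alpha[P:] @ phi_f[P:]    # foreground detail at full quality
    return frame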

The result is reconstructed frames with slightly lower quality for the background but with the same quality for the foreground I^f as if only \Phi^p had been used for reconstruction. The quality of the background is decided by the parameter P: a high P-value will increase the information used for background reconstruction and increase the decoder complexity, while a low P-value has the opposite effect. The reduction in complexity (the complexity reduction factor CR) is calculated as:

CR = \frac{K(M+1)}{(1+P)K + (M-P)k} \qquad (5.9)

When k \ll K the complexity reduction factor can be approximated as CR \approx \frac{M+1}{P+1}.

Figure 5.1 shows example frames for different P-values when M is 15. The reduction in reconstruction quality for the entire frame is shown in Table 5.1. This quality is compared to the reconstruction quality when the entire frame I is used for both encoding and decoding (Table 5.2). The PSNR for the foreground is the same regardless of the value of P, since M is always 15 for the results in the table. The complexity reduction is also noted in Table 5.1. The complexity is calculated for a mouth area of 80x64 pixels and a total frame size of 240x176 pixels with M=15.
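As a hedged check of Eq. 5.9 against these numbers (reading the pixel counts as including 4:2:0 chroma, K = 240 * 176 * 1.5 = 63360 and k = 80 * 64 * 1.5 = 7680, which matches the counts used in section 5.1.4.1): for P = 10 and M = 15, CR = 63360 * 16 / (11 * 63360 + 5 * 7680) = 1013760 / 735360 ≈ 1,4, in agreement with the corresponding row of Table 5.1.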


Figure 5.1: Top left: P=0, M=15. Top right: P=10, M=15. Bottom: P=15, M=15.

     Lowered rec. qual. (PSNR) [dB]
P    Y     U     V     CR factor
1    1,5   0,7   0,5   5,6
5    1,4   0,5   0,4   2,4
10   1,3   0,3   0,2   1,4
15   1,2   0,3   0,2   1

Table 5.1: Average lowered reconstruction quality for 10 video sequences when the decoding complexity is reduced by combining \Phi^f and \Phi^p. The complexity reduction (CR) factor is also shown and it is calculated with M=15.


5.1.3 Variance of areas and reconstruction change for asymmetrical principal component analysis video coding

Figure 5.2: Variance image of the individual pixels. White = zero variance; color indicates variance in a heat map (yellow = high variance).

Figure 5.3: Individual pixel PSNR. Red = improved PSNR; blue = reduced PSNR.

PCA extracts the most important information based on the variance of the information. Pixels which have high variance but no semantic importance, e.g., pixels from the nostrils, forehead or background, can affect the model. This is especially the case when the entire image is used for model creation.


     Reconstructed quality (PSNR) [dB]
Φ    Y      U      V
1    28,0   33,6   40,2
5    32,7   35,8   42,2
10   35,2   36,2   42,7
15   36,6   36,4   43,0
20   37,7   36,5   43,1
25   38,4   36,5   43,2

Table 5.2: Reference results. Video encoded with PCA.

With aPCA it is possible to use specified areas or pixels for encoding so that semantically unimportant areas are excluded from the modelling. In Figure 5.2 the variance of the individual pixels in a video sequence is shown. Pixels from the background and the nostrils have a high variance and will be considered important. The background can change regardless of the facial mimic, so this information should be ignored. By choosing only semantically important pixels the efficiency of the model is increased, and with aPCA it is still possible to decode the entire frames so no spatial resolution is lost.

Figure 5.3 shows the change in quality for each individual pixel when the mouth area is used for encoding instead of the entire frames. Improved quality (red pixels) appears within the semantically important areas while decreased quality (blue pixels) is visible for the background. The figure shows the advantage aPCA has over PCA when it comes to reconstruction quality for important areas within the frames.

5.1.4 Experiments with asymmetrical principal component analysis video coding

We have tried several different approaches for extracting the semantically meaningful foreground I^f from video sequences. In the following subsections we describe the methods and show the reconstruction quality and complexity reduction for each of them.

As a reference we present the quality for encoding the video sequences with regular PCA in Table 5.2. This is equal to using the Eigenspace for the entire frame \Phi for encoding of the video sequences.

5.1.4.1 Case 1: Encoding with mouth, decoding with entire frame

The most prominent facial part for facial mimic is the mouth, and by representing the shape of the mouth it is possible to model all facial mimic reasonably well. For this experiment we use the area around the mouth as foreground I^f and only consider the mouth as semantically important for the mimic. I^f has a spatial resolution of 80x64 pixels and the entire frame (I) is 240x176 pixels (Figure 5.4).


Figure 5.4: The entire video frame and the foreground I^f used in Case 1.

     Lowered rec. qual. (PSNR) [dB]
Φ    Y     U     V
5    1,4   0,3   0,2
10   1,8   0,3   0,2
15   2,0   0,2   0,2
20   2,0   0,1   0,2
25   2,1   0,1   0,2

Table 5.3: Average lowered reconstruction quality for 10 video sequences using the setup for Case 1.

The reconstruction quality is compared to the quality when the entire frame I is used for encoding (Table 5.2). The results in Table 5.3 show the reduction in quality when the foreground I^f is used for encoding instead of I. This is the average reconstruction result for all pixels in the frame; pixels from the foreground actually have increased reconstruction quality. The complexity of the two methods is also compared. With 7680 pixels in the foreground and 63360 pixels in the entire frame, the complexity reduction is 63360/7680 = 8,25 times.

5.1.4.2 Case 2: Encoding with mouth and eyes, decoding with entire frame

The mouth area is the most important for facial mimic, but there are other areas which also affect the mimic. By using the mouth and the eyes, the mimic can be modelled quite accurately. Figure 5.5 shows the two areas which are chosen as foreground I^f. This area has a size of 176x64 pixels and I still has a size of 240x176 pixels. Using the area around the eyes as well as the mouth area for encoding increases the complexity by 55 % compared to using only the mouth area as foreground.


Figure 5.5: Foreground with the eyes and the mouth.

The complexity is still vastly reduced compared to using the entire frame for encoding; it is approximately 3,75 times lower. The reconstruction quality is at the same time increased compared to encoding with only the mouth area.

5.1.4.3 Case 3: Encoding with extracted features, decoding with entire frame

Instead of choosing the foreground area based on knowledge about the facial mimic, the important area can be extracted automatically with low-level processing (a sketch of the pipeline is given after this list).

1. A representative frame is selected. This frame should contain a highly expressive appearance of the face.

2. Face detection according to the method in [48] detects the face in the selected frame.

3. Edge detection extracts all the edges in the face for the selected frame.

4. Dilation is used on the edge image to make the edges thicker. This image decides which pixels are used for encoding.
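The following is a hypothetical OpenCV sketch of this pipeline. The thesis detects the face with the method in [48]; a stock Haar cascade is substituted here purely for illustration, and the Canny thresholds and kernel size are assumptions of the sketch.

import cv2
import numpy as np

def extract_foreground_mask(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, 1.3, 5)       # stand-in for [48]
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for (x, y, w, h) in faces[:1]:                       # use the first face found
        edges = cv2.Canny(gray[y:y+h, x:x+w], 100, 200)  # edges inside the face
        kernel = np.ones((5, 5), np.uint8)
        # Dilation thickens the edges; white pixels become the foreground I^f.
        mask[y:y+h, x:x+w] = cv2.dilate(edges, kernel, iterations=1)
    return mask > 0   # boolean mask of the pixels used for encoding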

The resulting area is similar to the chosen area from the previous experiment. Since this area is extracted automatically, its size varies between the video sequences. The average size is 11364 pixels, which would correspond to an area of ≈ 118x64. This corresponds to a reduction in encoding complexity of ≈ 5,6 times compared to using the entire frame.


Figure 5.6: Area decided with edge detection and dilation. (a) Original. (b) Detected edges. (c) Dilated edges (white pixels are the foreground I^f).


Figure 5.7: Area decided with edge detection on all frames. (a) Detected edges, frame 1. (b) Detected edges, frame 2. (c) Detected edges, frame 3. (d) Detected edges in all frames.

5.1.4.4 Case 4: Find all edges for encoding, decoding with the entire frame

Another way to automatically extract the foreground is to use feature detection without dilation. This is a fully automatic procedure since no key frame is selected manually (a sketch follows the list).

1. Face detection [48] detects the face in each frame.

2. Edge detection extracts all the edges in the face for each frame.

3. All edges are gathered in one edge image: a pixel where there is an edge in any frame is selected in the total edge image. This image decides which pixels are used for encoding.
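A hypothetical sketch of the Case 4 accumulation (the per-frame face detection step is omitted for brevity; the Canny thresholds are assumptions):

import cv2
import numpy as np

def case4_mask(frames_gray):
    # OR together the edges of every frame: a pixel with an edge in any
    # frame is selected; no dilation and no manually chosen key frame.
    total = np.zeros(frames_gray[0].shape, dtype=bool)
    for gray in frames_gray:
        total |= cv2.Canny(gray, 100, 200) > 0
    return total   # pixels used for encoding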

The complexity is on average reduced more than 11 times when the area extracted in this way is used for encoding compared to using the entire frame, and ≈ 3 times compared to using the area around the mouth and eyes in Case 2. The reconstruction quality is shown in Table 5.4.


     Lowered rec. qual. (PSNR) [dB]
Φ    Y     U     V
5    1,2   0,3   0,3
10   1,6   0,3   0,1
15   1,7   0,2   0,2
20   1,7   0,2   0,1
25   1,6   0,1   0,2

Table 5.4: Average lowered reconstruction quality for 10 video sequences using the setup for Case 4.

The reconstruction quality is almost the same when information from the eyes and the mouth is used for encoding (Cases 2, 3 and 4). When only the mouth is used for encoding the quality is lower. At the same time the complexity is reduced for all the different aPCA implementations, and it is reduced the most when the area is extracted from edges (Case 4).

5.1.5 Side view driven video coding

With the formulas described in section 5.1 it is possible to decode an entire frame while only using a part of the frame for encoding. In that case the part which is used for encoding is part of the decoded frame. A part of the frame which isn't used for encoding can also be decoded correctly if there is a correspondence between the features in the different parts of the frame; the frame will be correctly decoded when there is a correspondence between the spaces used for encoding and decoding. Video sequences with both the front and the side of the face are used. The parts are named side view I^s and frontal view I^{fr}. These video sequences are described in section 2.2.1.

The side view I^s is used for encoding of the video and an Eigenspace for the side view is constructed as:

\phi^s_j = \sum_i b^s_{ij}(I^s_i - I^s_0) \qquad (5.10)

where b^s_{ij} are the Eigenvectors from the covariance matrix (I^s - I^s_0)^T(I^s - I^s_0) and I^s_0 is the mean of the side view. The frontal view is used for decoding and a pseudo principal component space is created as:

\phi^{p_{fr}}_j = \sum_i b^s_{ij}(I^{fr}_i - I^{fr}_0) \qquad (5.11)

where I^{fr} is the frontal view of the frames and I^{fr}_0 is the mean image of the frontal view. The side view of the video (I^s) is then used for encoding through projection:

\alpha^s_j = (I^s - I^s_0)^T \phi^s_j \qquad (5.12)


     Reconstruction quality (PSNR) [dB]
φ    Y      U      V
5    34,8   39,3   41,0
10   36,9   39,3   41,1
15   38,1   39,3   41,2
20   39,0   39,3   41,2
25   39,7   39,3   41,2

Table 5.5: Result for video encoded with the frontal view.

     Red. of rec. qual. (PSNR) [dB]
φ    Y     U     V
5    0,8   0,1   0,1
10   1,3   0,0   0,1
15   1,3   0,0   0,1
20   1,5   0,0   0,1
25   1,3   0,0   0,1

Table 5.6: Reduction of reconstruction quality for encoding with the side view compared to encoding with the frontal view.

where \{\alpha^s_j\} are coefficients extracted using information from the side view I^s. Decoding is performed by combining the coefficients extracted from the side view \{\alpha^s_j\} and the pseudo principal components from the frontal view \phi^{p_{fr}}_j:

\hat{I}^{fr} = I^{fr}_0 + \sum_{j=1}^{M} \alpha^s_j \phi^{p_{fr}}_j \qquad (5.13)
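A minimal sketch of side view driven coding under Eqs. 5.10-5.13, in the same hypothetical NumPy style as before (side and front are assumed to be (N, K_s) and (N, K_fr) arrays of flattened training views):

import numpy as np

def train_side_to_front(side, front, M):
    Is0, Ifr0 = side.mean(axis=0), front.mean(axis=0)
    Sc, Fc = side - Is0, front - Ifr0
    evals, b = np.linalg.eigh(Sc @ Sc.T)        # eigenvectors b^s (Eq. 5.10)
    b = b[:, np.argsort(evals)[::-1][:M]]
    phi_s = (Sc.T @ b).T                        # side view Eigenspace
    norms = np.linalg.norm(phi_s, axis=1, keepdims=True)
    phi_s /= norms
    phi_pfr = (Fc.T @ b).T / norms              # frontal pseudo space (Eq. 5.11)
    return Is0, Ifr0, phi_s, phi_pfr

# Encoding (Eq. 5.12) and decoding (Eq. 5.13):
#   alpha = phi_s @ (side_frame - Is0)
#   front_hat = Ifr0 + alpha @ phi_pfr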

The decoded video shows the frontal view of the face. This is a new way of compressing video since the decoded information isn't used for encoding. The reconstruction quality is measured for the frontal view, and as a reference we use the quality when the frontal view is used for encoding as well (Table 5.5). The reconstruction quality is slightly reduced when the side view is used for encoding (Table 5.6) but the complexity is reduced vastly (55%).

The usefulness of this encoding is not found in quality or complexity; rather, it provides a new technique for video coding. Different hands-free equipment where the front of the face doesn't have to be filmed can be used; it is enough to film the side of the face. Since the frontal view is fundamental for communication, it is very useful to be able to decode this view from another view which is easier to record. For communication through web cameras the sense of having eye contact is often lost since the camera isn't positioned in the screen, where the image(s) of the other(s) are shown. With a model where the eyes are looking at a screen this sense can be preserved, since the information which is used for encoding doesn't affect the position of the decoded eyes.


Figure 5.8: Example frames of profiles relating to the side view. (a) Anger. (b) Fear.

5.1.6 Profile driven video coding

The profile of the side view is the edge between the face and the background in the side view. For each vertical position the edge is at one horizontal position. The profile X^{pr} consists only of the positions of the edges, while the side view I^s consists of the pixel intensities in the image:

I^s = [\,I(x_1, y_1)\ \ I(x_2, y_1)\ \ I(x_1, y_2)\ \ \ldots\ \ I(x_h, y_v)\,] \qquad (5.14)

where I(x, y) is the intensity of the specific pixel and h and v are the horizontal and vertical size of the images, respectively. The profile only consists of the positions of the edges:

X^{pr} = [\,x_{e_1}, y_{e_1}\ \ x_{e_2}, y_{e_2}\ \ \ldots\ \ x_{e_T}, y_{e_T}\,] \qquad (5.15)

where T is the number of points in the profile. Examples of how the profiles relate to the side views are shown in Figure 5.8.
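A hypothetical sketch of profile extraction in the sense of Eq. 5.15, assuming an edge image of the side view and a face boundary approached from the left (the scan direction is an assumption about the camera setup):

import numpy as np

def extract_profile(edge_image):
    points = []
    for y, row in enumerate(edge_image):
        cols = np.flatnonzero(row)
        if cols.size:                     # rows without an edge are skipped
            points.extend((cols[0], y))   # first edge pixel: x_e, y_e
    return np.asarray(points, dtype=float)   # [x_e1, y_e1, x_e2, y_e2, ...]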

The principal components of the profile are called Eigenprofiles and are calculated according to:

\phi^{pr}_j = \sum_i b^{pr}_{ij}(X^{pr}_i - X^{pr}_0) \qquad (5.16)


Figure 5.9: The mean profile.

where X^{pr} is the profile of the side view, b^{pr}_{ij} are the Eigenvectors from the covariance matrix (X^{pr} - X^{pr}_0)^T(X^{pr} - X^{pr}_0) and X^{pr}_0 is the mean of the profile positions. X^{pr}_0 is shown in Figure 5.9 and the first three Eigenprofiles \phi^{pr}_j are shown in Figure 5.10.

Figure 5.10: The first three Eigenprofiles \phi^{pr}_1, \phi^{pr}_2 and \phi^{pr}_3 (scaled for visualization).

A space for the frontal view is calculated with the Eigenvectors from (X^{pr} - X^{pr}_0)^T(X^{pr} - X^{pr}_0) instead of the Eigenvectors from (I^s - I^s_0)^T(I^s - I^s_0):

\phi^{p_{fr}}_j = \sum_i b^{pr}_{ij}(I^{fr}_i - I^{fr}_0) \qquad (5.17)

Encoding and decoding are performed as:

\alpha^{pr}_j = (X^{pr} - X^{pr}_0)^T \phi^{pr}_j \qquad (5.18)


     Red. of rec. qual. (PSNR) [dB]
φ    Y     U     V
5    1,5   0,2   0,2
10   3,0   0,2   0,2
15   3,4   0,1   0,3
20   3,9   0,1   0,3
25   4,2   0,1   0,3

Table 5.7: Reduction of reconstruction quality for encoding with the profile compared to encoding with the frontal view.

\hat{I}^{fr} = I^{fr}_0 + \sum_{j=1}^{M} \alpha^{pr}_j \phi^{p_{fr}}_j \qquad (5.19)

The profile consists of 98 points (T), so the encoding information is only 196 values (98 x- and 98 y-coordinates). The complexity reduction compared to encoding with the frontal view is 99%.

The objective quality when the profile is used for encoding is significantly lower than when the side or frontal view is used for encoding. Manual evaluation of the video shows that the reconstructed video is quite jerky and doesn't have natural transitions between the frames. With locally linear embedding (LLE) and dynamic programming (DP) the video can be given a smooth, natural appearance. A drawback of the profile that cannot be handled without adding more information is that it contains no information about the eyes. The pixel intensities of the eye region can be used to make the model more correct.

More detailed descriptions of aPCA and the experiments are found in Paper V and Paper VI.


High definition wearable video communication

Communication between humans is up to 93 % determined by non-verbal cues. These cues consist of body language and facial expressions. Face-to-face meetings are seen as more productive, easier to understand than phone or email, and more personal, and the content of the meeting is easier to remember. But face-to-face does not need to be in person; it can be achieved through distance communication. In any form, distance communication allows people to collaborate without actually having to meet or travel. The most common form of distance communication is a telephone conversation, but there are several other ways to communicate over distances. Video conferencing with high definition (HD) resolution can give the impression of face-to-face communication even over networks. The HD resolution is an important factor since it offers a communication improvement over standard resolution. It is reasonable to think that the low quality of many video conference systems is a major cause of why such systems aren't used very often.

Video communication should be available anywhere and anytime, meaning that communication can occur at any location and any time regardless of the available network and surrounding network traffic. To achieve this there are several issues that need to be resolved: high compression is needed, a low bitrate is needed and mobility for the users must be realized.

A solution for HD video conferencing is to use the H.264 [15,16] video compression standard. This standard can deliver high quality video but there are two major problems with H.264: the complexity and bitrate for H.264 encoded video are both too high to fulfill the requirements for providing video anywhere and anytime. Can our video coding based on PCA solve these problems? PCA video coding can reach very low bitrates, but even if its complexity is lower than for H.264 it is still too high. The complexity is linearly dependent on the spatial resolution, and when aPCA is used the number of pixels needed for encoding is reduced. This lowers the complexity and the power consumption.


6.1 High definition wearable video equipment

High definition (HD) video refers to a video system with a resolution higher than regular standard-definition video. The video that we use as HD video has a resolution of 1440x1080 (HD anamorphic). It is originally recorded as interlaced video with 50 interlace fields per second, but it is transformed into progressive video with 25 frames per second. Not the entire frame is semantically important for communication, so even if the spatial resolution is very high the important area is smaller.

Wearable video equipment allows the user to move freely and have both hands free while the camera follows the movements of the user. PCA needs the facial features to be positioned at approximately the same pixel position, so wearable video equipment solves two issues at once. We have used the backpack in this experiment (section 2.2.1).

6.2 Wearable video communication

A space for the foreground in a video sequence is created and the video frames are encoded with this space. A pseudo principal component space for only the background is created:

\phi^{p_{bg}}_j = \sum_i b^f_{ij}(I^{bg}_i - I^{bg}_0) \qquad (6.1)

where b^f_{ij} are the Eigenvectors from the covariance matrix (I^f - I^f_0)^T(I^f - I^f_0), I^{bg} are the frames I minus the foreground I^f, and I^{bg}_0 is the mean of the same area.

Encoding is performed using only the foreground:

\alpha^f_j = (I^f - I^f_0)^T \phi^f_j \qquad (6.2)

where \{\alpha^f_j\} are coefficients extracted using information from the foreground I^f.

By combining the Eigenspaces from the background \Phi^{p_{bg}} and the foreground \Phi^f it is possible to reconstruct frames with full frame size and still reduce the decoding complexity. Few components from \Phi^{p_{bg}} are used to reconstruct the background, while many components from \Phi^f are used for foreground reconstruction.

I' = I_0 + \sum_{j=1}^{P} \alpha_j \phi^{p_{bg}}_j + \sum_{j=1}^{M} \alpha_j \phi^f_j \qquad (6.3)

where P is the selected number of components used for background reconstruction. By adjusting this parameter it is possible to control the bitrate needed for Eigenimage transmission, since P decides how many components from \Phi^{p_{bg}} need to be transmitted.


The Eigenimages need to be compressed for transmission; they are too large to be transmitted uncompressed. The compression and decompression are performed in the following steps.

• Quantization of the Eigenimages: \Phi_Q = Q(\Phi)

• Eigenimage compression: \Phi_{Comp} = C(\Phi_Q)

• Reconstruction of the Eigenimages from the compressed versions: \Phi_Q = C'(\Phi_{Comp})

• Inverse quantization, mapping the quantization values to the reconstruction values: \Phi = Q'(\Phi_Q)

The Eigenimages \Phi and the mean image I_0 are compressed in this manner. For the mean image we use still image compression (JPEG) and for the Eigenimages we use sequence encoding with DCT. Since most of the complexity is associated with block matching, we don't use any motion estimation at all. A sketch of the quantization steps is given below.
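A minimal sketch of the quantization steps Q and Q' (the codec steps C and C' would be an off-the-shelf JPEG/DCT coder; the 8-bit depth is an assumption of the sketch):

import numpy as np

def quantize(phi, bits=8):
    # Uniform scalar quantization of an Eigenimage to 2^bits levels: Phi_Q = Q(Phi).
    lo, hi = float(phi.min()), float(phi.max())
    step = (hi - lo) / (2**bits - 1)
    return np.round((phi - lo) / step).astype(np.uint16), lo, step

def dequantize(q, lo, step):
    # Inverse quantization mapping back to Eigenimage values: Phi = Q'(Phi_Q).
    return lo + q.astype(np.float64) * step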

The bitrate that we select as a target for video transmission is 300 kbps. The Eigenimages (\Phi^{p_{bg}} and \Phi^f) and the coefficients for the frames \{\alpha^f_j\} need to be transmitted below this bitrate at all times.

The bitrate for the compressed foreground Eigenimages \phi^{f_{Comp}}_j is 13 kbps, but the bitrate for the first Eigenimage is higher since it is intracoded. The bitrate for the background Eigenimages \phi^{p_{Comp}_{bg}}_j is 42 kbps. Transmission of 10 Eigenimages for the foreground, 1 pseudo Eigenimage for the background plus the means for both areas can be done within 1 second. The mean of the foreground I^f_0 is compressed to a size of 5 kB and I^{p_{bg}}_0 is compressed to 15 kB.

After ≈ 220 ms the first Eigenimage and the mean for the foreground are available and decoding of the video can start. All the other Eigenimages for \phi^{f_{Comp}} are intercoded and a new image arrives every 34 ms. After ≈ 520 ms (220 ms + 9 x 34 ms) the decoder has 10 Eigenimages for the foreground. The mean and the first Eigenimage for the background need ≈ 460 ms for transmission, and a new Eigenimage for the background can be transmitted every 87 ms. The quality of the reconstructed video increases as more Eigenimages arrive.

To compare HD video encoded with aPCA with standard video encoding we encode the video sequence with H.264 as well. The entire video is encoded with H.264 with a target bitrate of 300 kbps. The complexity for H.264 encoding is linearly dependent on the frame size. Most of the complexity for H.264 encoding comes from motion estimation through block matching. The complexity for block matching depends on the block displacement D_i, which grows quadratically (D_i^2) and increases fast for high spatial resolution.

We compare the quality of the foreground and background separately, since they have different qualities when aPCA is used. With standard H.264 encoding the quality for the background and foreground are approximately equal.


        Rec. qual. PSNR [dB]
        Y      U      V
H.264   36.4   36.5   36.5
aPCA    44.2   44.3   44.3

Table 6.1: Reconstruction quality for the foreground.

        Rec. qual. PSNR [dB]
        Y      U      V
H.264   36.3   36.5   36.6
aPCA    29.6   29.7   29.7

Table 6.2: Reconstruction quality for the background.

The average results measured in PSNR for 5 video sequences are shown in Table 6.1 and Table 6.2. The results in the tables are for a decoding quality where 25 \phi^f_j and 5 \phi^{p_{bg}}_j are used. An example of a frame reconstructed with aPCA is shown in Figure 6.1, and a reconstructed frame from H.264 encoding is shown in Figure 6.2.

As can be seen from the tables and the figures, the background quality is always lower for aPCA compared with H.264. The foreground quality for aPCA is better than for H.264 already when 10 Eigenimages (after ≈ 1 second) are used for reconstruction, and it only improves after that.

The complexity for both encoding and decoding is reduced vastly when aPCA is used compared to DCT encoding with motion estimation. This can be an extremely important factor since the power consumption is reduced. Any device that is driven by batteries will have a longer operating time when the complexity of any operation is reduced. Since the bitrate can also be reduced, the devices can save power on lower transmission costs as well.

There are possibilities of combining PCA or aPCA with DCT encoding such as H.264, resulting in a hybrid codec. For an initial period the frames can be encoded with H.264 and transmitted between the encoder and decoder. The frames are then available at both the encoder and the decoder, so both can perform PCA on the images and produce the same Eigenimages. All other frames can then be encoded with the Eigenimages at very low bitrates with low encoding and decoding complexity.

High definition wearable communication is described more thoroughly in Paper VII.


Figure 6.1: Frame reconstructed with aPCA (25 \phi^f_j and 5 \phi^{p_{bg}}_j are used).

Figure 6.2: Frame encoded with H.264.


Contributions, conclusions and future work

7.1 Contributions and conclusions

• We have introduced a full-frame video coding scheme based on principal component analysis. This coding scheme is able to encode video at very low bitrates and still achieve a high reconstruction quality. Only the complexity and the encoding and decoding time increase when a higher spatial resolution is encoded; the bitrate is not affected. For any standard video coding technique a higher spatial quality results in a higher bitrate or lowered quality.

• We have calculated two different bounds that describe how much information is at least needed for facial mimic representation. These bounds are valid for facial mimic representation in general, so they can be useful in many more fields than video compression.

• We have examined how the Eigenimages can be compressed so that fewer bits are needed for their transmission and storage, and how long it will take to update or transmit a completely new Eigenspace over low-capacity networks. It is possible to compress the Eigenimages so that they can be transmitted in 6 seconds over a GPRS network and still maintain good reconstruction quality. We have at the same time shown that the orthogonality loss incurred through compression is not an important factor.

• We have shown that it is not sufficient to use a global affine transformation between the means of different videos to re-use an Eigenspace. Better alignment methods will be used in the future.

• We have introduced a coding scheme based on facial recognition which reaches extremely low bitrates (≈ 100 bits/s). This coding scheme uses a personal mimic gallery of facial images and it can be combined with PCA video coding. Instead of having an entire database of facial images it is possible to use a number of Eigenimages to create all the different faces. The coefficients that are needed for Eigenimage combination can be stored in a table and only the table index is transmitted between encoder and decoder. The same low bitrate as with a facial mimic gallery can be achieved with an Eigenspace gallery.

• Asymmetrical PCA is a technique where it is possible to use a part of the frame for encoding while another part, or the entire frame, is decoded. This technique, and examples of how only semantically important information can be used for encoding, is introduced in this thesis. We furthermore show how a completely different view of the face than the decoded view can be used for encoding if there is a correspondence between the features in the different views. The side view of a face can be used for encoding while the frontal view is decoded. We also show that it is possible to use only the profile of the side view for encoding. The amount of data and the encoding complexity decrease significantly when the profile is used instead of the side view or frontal view. The use of the side view or profile also enables a different camera placement, since it is much easier to film the face from the side than from the front.

• Decoding through aPCA can reduce the complexity since the importance of different areas can be weighted against the complexity. We have used this for regular video sequences, but we have also shown how it can be used for HD resolution video. An area of the video frames can be chosen so that the important information in the video is protected.

• We have shown how aPCA can encode video with HD resolution at low bitrates; after an initial starting phase the bitrate can be lowered below 5 kbps.

• Scalable video is incorporated in PCA and aPCA without the encoder having to encode the video into layers. For DCT-encoded video there is a loss of quality and/or complexity when the video is encoded into layers. With PCA it is easy to scale the video based on quality, since a decoder can choose how many Eigenspace dimensions it wants to use for decoding. With aPCA it is also possible to scale the video regarding the spatial area that should be decoded. The encoder still doesn't have to do anything differently from single-layer encoding.

7.2 Future work

An important future issue is to enable re-use of Eigenspaces with high reconstruction quality. We are currently working on frame alignment through block matching of the faces within the frames. Block matching by itself produces decent alignment but creates blocky effects within the frames, so we are implementing block matching with dynamic programming. The sole purpose of dynamic programming is to restrict interframe movements and generate a natural movement of the blocks without blocky effects. We are working on several different ways to enable alignment of frames through block matching. This alignment is for the frontal views of the frames, either for aPCA or PCA video coding. For such implementations it is possible to transmit an Eigenspace in the beginning of each communication event instead of re-using an existing Eigenspace. But functional re-use is a must for some applications of aPCA. If a different part than the frontal view is used for encoding and the frontal view isn't available, it is impossible to transmit a new model; an existing model has to be used. If the side view is used it is possible to use block matching between existing and new views in the same manner as we are examining right now. For the profile of the side view it is possible to make an easier matching by using templates of the profiles. The profile is a collection of positions which can be seen as a template. By matching new templates to existing ones it is possible to align existing Eigenspaces and new video frames. Functional alignment of faces within frames also enables us to allow more global motion within the frames and to display the result with the global motion visible. Most future work regards the handling of global motion, which is an important issue that needs to be handled to reach very low bitrates. If existing Eigenspaces for HD resolution frames can be re-used it will be possible to transmit HD video at a bitrate below 5 kbps; an achievement that we believe is possible for facial video when PCA is used.


Bibliography

[1] U. Soderstrom and H. Li, "Emotion recognition and estimation from tracked lip features," Tech. Rep. DML-TR-2004:05, 2004.

[2] U. Soderstrom and H. Li, "Customizing lip video into animation for wireless emotional communication," Tech. Rep. DML-TR-2004:06, 2004.

[3] H. Li, Low Bitrate Image Sequence Coding, Ph.D. thesis, Linkoping University, 1993. Linkoping Studies in Science and Technology, Dissertation No. 318, pp. 239-252.

[4] P. Ekman and W.V. Friesen, Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues, Prentice-Hall, Englewood Cliffs, New Jersey, 1975.

[5] P. Ekman, Emotion in the Human Face, Cambridge University Press, New York, 1982.

[6] I. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.

[7] G. Wikstrand and U. Soderstrom, "Internet card play with video conferencing," in Proc. of Swedish Symposium for Automated Image Analysis (SSBA'06), Umea, March 2006, pp. 93–96.

[8] U. Soderstrom, "Very low bitrate facial video coding based on principal component analysis," Licentiate thesis, September 2006.

[9] U. Soderstrom and H. Li, "Principal component video coding for simple decoding on mobile devices," in Proc. of Swedish Symposium for Automated Image Analysis (SSBA'07), Linkoping, March 2007, pp. 149–152.


[10] H. Harashima, K. Aizawa, and T. Saito, "Model-based analysis synthesis coding of videotelephone images–conception and basic study of intelligent image coding," IEICE Transactions, vol. E72, pp. 452–458, 1989.

[11] L. Torres and E. Delp, "New trends in image and video compression," in Proceedings of the European Signal Processing Conference (EUSIPCO), Tampere, Finland, September 5-8, 2000.

[12] "Video codec for audiovisual services at p*64 kbit/s," ITU-T Recommendation H.261, 1993.

[13] "ISO/IEC 13818-2 (MPEG-2 Video), Information technology – generic coding of moving pictures and associated audio information: Video," 1995.

[14] K. Rijkse, "H.263: Video coding for low-bit-rate communication," IEEE Communications Magazine, vol. 34, pp. 42–45, 1996.

[15] R. Schafer et al., "The emerging H.264/AVC standard," EBU Technical Review, vol. 293, 2003.

[16] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, 2003.

[17] G.J.C. van der Horst and M.A. Bouman, "Spatiotemporal chromaticity discrimination," Journal of the Optical Society of America, vol. 59, pp. 1482–1488, 1969.

[18] L. Torres and D. Prado, "A proposal for high compression of faces in video sequences using adaptive eigenspaces," in Proceedings of the 2002 International Conference on Image Processing, 2002, vol. 1, pp. I-189–I-192.

[19] L. Torres and D. Prado, "High compression of faces in video sequences for multimedia applications," in Proceedings of ICME '02, Lausanne, Switzerland, 2002, vol. 1, pp. 481–484.

[20] R. Pique and L. Torres, "Efficient face coding in video sequences combining adaptive principal component analysis and a hybrid codec approach," in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, vol. 3, pp. III-629–632.

[21] R. Pique and L. Torres, "Combining adaptive principal component analysis and a hybrid codec approach for face coding in video sequences," in Picture Coding Symposium, Saint Malo, France, April 23-25, 2003.

[22] "ISO MPEG-7. Text of ISO/IEC CD 15938-2 Information technology – Multimedia content description interface – Part 5: Multimedia description schemes, ISO/IEC JTC 1/SC 29/WG," 2001.


[23] W. E. Vieux, K. Schwerdt, and J. L. Crowley, "Face-tracking and coding for video compression," in ICVS, 1999, pp. 151–160.

[24] K. Schwerdt and J. L. Crowley, "Robust face tracking using color," in Proc. 4th Int. Conf. on Automatic Face and Gesture Recognition, 2000, pp. 90–95.

[25] J. L. Crowley and K. Schwerdt, "Robust tracking and compression for video communication," 1999.

[26] K. Schwerdt and J. L. Crowley, "Contributions of computer vision to the coding of video sequences," 1998.

[27] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1103–1120, September 2007.

[28] Easyrig AB, www.easyrig.se.

[29] S.T. Roweis and L.K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.

[30] K. Ohba, G. Clary, T. Tsukada, T. Kotoku, and K. Tanie, "Facial expression communication with FES," in International Conference on Pattern Recognition, 1998, pp. 1378–1378.

[31] K. Ohba, T. Tsukada, T. Kotoku, and K. Tanie, "Facial expression space for smooth tele-communications," in Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 378–383.

[32] R. Neff and A. Zakhor, "Very low bit-rate video coding based on matching pursuits," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 1, pp. 158–171, 1997.

[33] J. Ostermann, "Animation of synthetic faces in MPEG-4," in Proc. of Computer Animation, IEEE Computer Society, June 1998, pp. 49–55.

[34] K. Aizawa and T.S. Huang, "Model-based image coding: Advanced video coding techniques for very low bit-rate applications," Proc. of the IEEE, vol. 83, no. 2, pp. 259–271, 1995.

[35] R. Forchheimer, O. Fahlander, and T. Kronander, "Low bit-rate coding through animation," in Proc. International Picture Coding Symposium (PCS83), 1983, pp. 113–114.

[36] T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," in Proc. European Conference on Computer Vision (ECCV), 1998, vol. 2, pp. 484–498.


[37] F. Pighin, J. Hecker, D. Lishchinski, R. Szeliski, and D. H. Salesin, "Synthesizing realistic facial expression from photographs," in SIGGRAPH Proceedings, 1998, pp. 75–84.

[38] J. Wang and M. F. Cohen, "Very low frame-rate video streaming for face-to-face teleconference," in DCC '05: Proceedings of the Data Compression Conference, 2005, pp. 309–318.

[39] J. Lee and A. Eleftheriadis, "Spatio-temporal model-assisted compatible coding for low and very low bitrate video telephony," in Proceedings, 3rd IEEE International Conference on Image Processing (ICIP 96), Lausanne, Switzerland, 1996, pp. II.429–II.432.

[40] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the characterization of human faces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 1, pp. 103–108, 1990.

[41] L. Sirovich and M. Kirby, "Low-dimensional procedure for the characterization of human faces," Journal of the Optical Society of America, vol. 4, pp. 519–524, 1987.

[42] U. Soderstrom and H. Li, "Very low bitrate full-frame facial video coding based on principal component analysis," Signal and Image Processing Conference (SIP'05), August 15-17, 2005, online: www.medialab.tfe.umu.se.

[43] J. Rasmusson et al., "Multimedia in mobile phones – the ongoing revolution," Ericsson Review, vol. 02, 2004.

[44] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd ed.), Academic Press Professional, Inc., San Diego, CA, USA, 1990.

[45] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley Series in Telecommunications, John Wiley & Sons, New York, NY, USA, 1991.

[46] W.B. Pennebaker and J.L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.

[47] G. K. Wallace, "The JPEG still picture compression standard," Communications of the ACM, vol. 34, no. 4, pp. 30–44, 1991.

[48] H-S. Le and H. Li, "Face identification from one single sample face image," in Proc. of the IEEE Int. Conf. on Image Processing (ICIP), 2004.