Machine Learning for Multimedia Content Analysis


MULTIMEDIA SYSTEMS AND APPLICATIONS SERIES

Consulting Editor
Borko Furht, Florida Atlantic University

Recently Published Titles:

DISTRIBUTED MULTIMEDIA RETRIEVAL STRATEGIES FOR LARGE SCALE NETWORKED SYSTEMS by Bharadwaj Veeravalli and Gerassimos Barlas; ISBN: 978-0-387-28873-4

MULTIMEDIA ENCRYPTION AND WATERMARKING by Borko Furht, Edin Muharemagic, Daniel Socek; ISBN: 0-387-24425-5

SIGNAL PROCESSING FOR TELECOMMUNICATIONS AND MULTIMEDIA edited by T.A. Wysocki, B. Honary, B.J. Wysocki; ISBN: 0-387-22847-0

ADVANCED WIRED AND WIRELESS NETWORKS by T.A. Wysocki, A. Dadej, B.J. Wysocki; ISBN: 0-387-22781-4

CONTENT-BASED VIDEO RETRIEVAL: A Database Perspective by Milan Petkovic and Willem Jonker; ISBN: 1-4020-7617-7

MASTERING E-BUSINESS INFRASTRUCTURE edited by Veljko Milutinović, Frédéric Patricelli; ISBN: 1-4020-7413-1

SHAPE ANALYSIS AND RETRIEVAL OF MULTIMEDIA OBJECTS by Maytham H. Safar and Cyrus Shahabi; ISBN: 1-4020-7252-X

MULTIMEDIA MINING: A Highway to Intelligent Multimedia Documents edited by Chabane Djeraba; ISBN: 1-4020-7247-3

CONTENT-BASED IMAGE AND VIDEO RETRIEVAL by Oge Marques and Borko Furht; ISBN: 1-4020-7004-7

ELECTRONIC BUSINESS AND EDUCATION: Recent Advances in Internet Infrastructures edited by Wendy Chin, Frédéric Patricelli, Veljko Milutinović; ISBN: 0-7923-7508-4

INFRASTRUCTURE FOR ELECTRONIC BUSINESS ON THE INTERNET by Veljko Milutinović; ISBN: 0-7923-7384-7

DELIVERING MPEG-4 BASED AUDIO-VISUAL SERVICES by Hari Kalva; ISBN: 0-7923-7255-7

CODING AND MODULATION FOR DIGITAL TELEVISION by Gordon Drury, Garegin Markarian, Keith Pickavance; ISBN: 0-7923-7969-1

CELLULAR AUTOMATA TRANSFORMS: Theory and Applications in Multimedia Compression, Encryption, and Modeling by Olu Lafe; ISBN: 0-7923-7857-1

COMPUTED SYNCHRONIZATION FOR MULTIMEDIA APPLICATIONS by Charles B. Owen and Fillia Makedon; ISBN: 0-7923-8565-9

Visit the series on our website: www.springer.com

Machine Learning for Multimedia Content Analysis

by

Yihong Gong
and
Wei Xu

NEC Laboratories America, Inc.
USA

© 2007 Springer Science+Business Media, LLC. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

9 8 7 6 5 4 3 2 1

springer.com

Yihong Gong
NEC Laboratories America, Inc.
10080 N. Wolfe Road SW3-350
Cupertino, CA 95014

Wei Xu
NEC Laboratories America, Inc.
10080 N. Wolfe Road SW3-350
Cupertino, CA 95014

[email protected]

Library of Congress Control Number: 2007927060

Machine Learning for Multimedia Content Analysis by Yihong Gong and Wei Xu

ISBN 978-0-387-69938-7    e-ISBN 978-0-387-69942-4

Printed on acid-free paper.

[email protected]

Preface

Nowadays, huge amounts of multimedia data are being constantly generated in various forms from various places around the world. With the ever increasing complexity and variability of multimedia data, traditional rule-based approaches, where humans have to discover the domain knowledge and encode it into a set of programming rules, are too costly and ineffective for analyzing the contents and gaining intelligence from this glut of multimedia data.

The challenges in data complexity and variability have led to revolutions in machine learning techniques. In the past decade, we have seen many new developments in machine learning theories and algorithms, such as boosting, regressions, Support Vector Machines, graphical models, etc. These developments have achieved great successes in a variety of applications, both in improving data classification accuracies and in modeling complex, structured data sets. Such notable successes in a wide range of areas have aroused people's enthusiasm for machine learning, and have led to a spate of new machine learning textbooks. Notably, among the ever growing list of machine learning books, many attempt to encompass most of the entire spectrum of machine learning techniques, resulting in a shallow, incomplete coverage of many important topics, whereas many others choose to dig deeply into a specific branch of machine learning in all its aspects, resulting in excessive theoretical analysis and mathematical rigor at the expense of losing the overall picture and the usability of the books. Furthermore, despite the large number of machine learning books, there is not yet a textbook dedicated to the audience of the multimedia community that addresses the unique problems and interesting applications of machine learning techniques in this area.


The objectives we set for this book are two-fold: (1) bring together those important machine learning techniques that are particularly powerful and effective for modeling multimedia data; and (2) showcase their applications to common tasks of multimedia content analysis. Multimedia data, such as digital images, audio streams, motion video programs, etc., exhibit much richer structures than simple, isolated data items. For example, a digital image is composed of a number of pixels that collectively convey certain visual content to viewers. A TV video program consists of both audio and image streams that complementarily unfold the underlying story and information. To recognize the visual content of a digital image, or to understand the underlying story of a video program, we may need to label sets of pixels or groups of image and audio frames jointly, because the label of each element is strongly correlated with the labels of the neighboring elements. In the machine learning field, there are certain techniques that are able to explicitly exploit such spatial and temporal structures, and to model the correlations among different elements of the target problems. In this book, we strive to provide a systematic coverage of this class of machine learning techniques in an intuitive fashion, and demonstrate their applications through various case studies.

There are different ways to categorize machine learning techniques. Chapter 1 presents an overview of machine learning methods through four different categorizations: (1) unsupervised versus supervised; (2) generative versus discriminative; (3) models for i.i.d. data versus models for structured data; and (4) model-based versus modeless. Each of the above four categorizations represents a specific branch of machine learning methodologies that stem from different assumptions/philosophies and aim at different problems. These categorizations are not mutually exclusive, and many machine learning techniques can be labeled with multiple categories simultaneously. In describing these categorizations, we strive to incorporate some of the latest developments in machine learning philosophies and paradigms.

The main body of this book is composed of three parts: I. Unsupervised Learning, II. Generative Models, and III. Discriminative Models. In Part I, we present two important branches of unsupervised learning techniques: dimension reduction and data clustering, which are generic enabling tools for many multimedia content analysis tasks. Dimension reduction techniques are commonly used for exploratory data analysis, visualization, pattern recognition, etc. Such techniques are particularly useful for multimedia content analysis because multimedia data are usually represented by feature vectors of extremely high dimensions. The curse of dimensionality usually results in deteriorated performance for content analysis and classification tasks. Dimension reduction techniques are able to transform the high dimensional raw feature space into a new space with much lower dimensions where noise and irrelevant information are diminished. In Chapter 2, we describe three representative techniques: Singular Value Decomposition (SVD), Independent Component Analysis (ICA), and Dimension Reduction by Locally Linear Embedding (LLE). We also apply the three techniques to a subset of handwritten digits, and reveal their characteristics by comparing the subspaces generated by these techniques.

Data clustering can be considered as unsupervised data classification that is able to partition a given data set into a predefined number of clusters based on the intrinsic distribution of the data set. There exist a variety of data clustering techniques in the literature. In Chapter 3, instead of providing a comprehensive coverage of all kinds of data clustering methods, we focus on two state-of-the-art methodologies in this field: spectral clustering, and clustering based on non-negative matrix factorization (NMF). Spectral clustering evolves from spectral graph partitioning theory, which aims to find the best cuts of a graph that optimize certain predefined objective functions. The solution is usually obtained by computing the eigenvectors of a graph affinity matrix defined on the given problem, which possess many interesting and preferable algebraic properties. On the other hand, NMF-based data clustering strives to generate semantically meaningful data partitions by exploiting the desirable properties of non-negative matrix factorization. Theoretically speaking, because non-negative matrix factorization does not require the derived factor space to be orthogonal, it is more likely to generate a set of factor vectors that capture the main distributions of the given data set.

In the first half of Chapter 3, we provide a systematic coverage of four representative spectral clustering techniques from the aspects of problem formulation, objective functions, and solution computation. We also reveal the characteristics of these spectral clustering techniques through analytical examinations of their objective functions. In the second half of Chapter 3, we describe two NMF-based data clustering techniques, which stem from our original works in recent years. At the end of this chapter, we provide a case study where the spectral and NMF clustering techniques are applied to the text clustering task, and their performance comparisons are conducted through experimental evaluations.


In Parts II and III, we focus on various graphical models that aim to explicitly model the spatial and temporal structures of the given data set, and are therefore particularly effective for modeling multimedia data. Graphical models can be further categorized as either generative or discriminative. In Part II, we provide a comprehensive coverage of generative graphical models. We start by introducing basic concepts, frameworks, and terminologies of graphical models in Chapter 4, followed by in-depth coverage of the most basic graphical models: Markov Chains and Markov Random Fields in Chapters 5 and 6, respectively. In these two chapters, we also describe two important applications of Markov Chains and Markov Random Fields, namely Markov Chain Monte Carlo Simulation (MCMC) and Gibbs Sampling. MCMC and Gibbs Sampling are two powerful data sampling techniques that enable us to conduct inference for complex problems for which one cannot obtain closed-form descriptions of their probability distributions. In Chapter 7, we present the Hidden Markov Model (HMM), one of the most commonly used graphical models in speech and video content analysis, with detailed descriptions of the forward-backward and the Viterbi algorithms for training and finding solutions of the HMM. In Chapter 8, we introduce more general graphical models and popular algorithms such as sum-product, max-product, etc., that can effectively carry out inference and training on graphical models.

In recent years, there have been research works that strive to overcome the drawbacks of generative graphical models by extending the models into discriminative ones. In Part III, we begin with the introduction of the Conditional Random Field (CRF) in Chapter 9, a pioneering work in this field. In the last chapter of this book, we present an innovative work, Max-Margin Markov Networks (M3-nets), which strives to combine the advantages of both graphical models and Support Vector Machines (SVMs). SVMs are known for their ability to use high-dimensional feature spaces and for their strong theoretical generalization guarantees, while graphical models have the advantage of effectively exploiting problem structures and modeling correlations among inter-dependent variables. By incorporating kernels and introducing a margin-based objective function, which are the core ingredients of SVMs, M3-nets successfully inherit the advantages of the two frameworks. In Chapter 10, we first describe the concepts and algorithms of SVMs and kernel methods, and then provide an in-depth coverage of M3-nets. At the end of the chapter, we also provide our insights into why discriminative graphical models generally outperform generative models, and why M3-nets are generally better than other discriminative models.

This book is devoted to students and researchers who want to apply machine learning techniques to multimedia content analysis. We assume that the reader has basic knowledge in statistics, linear algebra, and calculus. We do not attempt to write a comprehensive catalog covering the entire spectrum of machine learning techniques, but rather to focus on the learning methods that are powerful and effective for modeling multimedia data. We strive to write this book in an intuitive fashion, emphasizing concepts and algorithms rather than mathematical completeness. We also provide comments and discussions on the characteristics of the various methods described in this book to help the reader gain insights into the essence of the methods. To further increase the usability of this book, we include case studies in many chapters to demonstrate example applications of the respective techniques to real multimedia problems, and to illustrate factors to be considered in real implementations.

California, U.S.A.                                    Yihong Gong
May 2007                                              Wei Xu

Contents

1 Introduction
   1.1 Basic Statistical Learning Problems
   1.2 Categorizations of Machine Learning Techniques
      1.2.1 Unsupervised vs. Supervised
      1.2.2 Generative Models vs. Discriminative Models
      1.2.3 Models for Simple Data vs. Models for Complex Data
      1.2.4 Model Identification vs. Model Prediction
   1.3 Multimedia Content Analysis

Part I Unsupervised Learning

2 Dimension Reduction
   2.1 Objectives
   2.2 Singular Value Decomposition
   2.3 Independent Component Analysis
      2.3.1 Preprocessing
      2.3.2 Why Gaussian is Forbidden
   2.4 Dimension Reduction by Locally Linear Embedding
   2.5 Case Study
   Problems

3 Data Clustering Techniques
   3.1 Introduction
   3.2 Spectral Clustering
      3.2.1 Problem Formulation and Criterion Functions
      3.2.2 Solution Computation
      3.2.3 Example
      3.2.4 Discussions
   3.3 Data Clustering by Non-Negative Matrix Factorization
      3.3.1 Single Linear NMF Model
      3.3.2 Bilinear NMF Model
   3.4 Spectral vs. NMF
   3.5 Case Study: Document Clustering Using Spectral and NMF Clustering Techniques
      3.5.1 Document Clustering Basics
      3.5.2 Document Corpora
      3.5.3 Evaluation Metrics
      3.5.4 Performance Evaluations and Comparisons
   Problems

Part II Generative Graphical Models

4 Introduction of Graphical Models
   4.1 Directed Graphical Model
   4.2 Undirected Graphical Model
   4.3 Generative vs. Discriminative
   4.4 Content of Part II

5 Markov Chains and Monte Carlo Simulation
   5.1 Discrete-Time Markov Chain
   5.2 Canonical Representation
   5.3 Definitions and Terminologies
   5.4 Stationary Distribution
   5.5 Long Run Behavior and Convergence Rate
   5.6 Markov Chain Monte Carlo Simulation
      5.6.1 Objectives and Applications
      5.6.2 Rejection Sampling
      5.6.3 Markov Chain Monte Carlo
      5.6.4 Rejection Sampling vs. MCMC
   Problems

6 Markov Random Fields and Gibbs Sampling
   6.1 Markov Random Fields
   6.2 Gibbs Distributions
   6.3 Gibbs–Markov Equivalence
   6.4 Gibbs Sampling
   6.5 Simulated Annealing
   6.6 Case Study: Video Foreground Object Segmentation by MRF
      6.6.1 Objective
      6.6.2 Related Works
      6.6.3 Method Outline
      6.6.4 Overview of Sparse Motion Layer Computation
      6.6.5 Dense Motion Layer Computation Using MRF
      6.6.6 Bayesian Inference
      6.6.7 Solution Computation by Gibbs Sampling
      6.6.8 Experimental Results
   Problems

7 Hidden Markov Models
   7.1 Markov Chains vs. Hidden Markov Models
   7.2 Three Basic Problems for HMMs
   7.3 Solution to Likelihood Computation
   7.4 Solution to Finding Likeliest State Sequence
   7.5 Solution to HMM Training
   7.6 Expectation-Maximization Algorithm and its Variances
      7.6.1 Expectation-Maximization Algorithm
      7.6.2 Baum-Welch Algorithm
   7.7 Case Study: Baseball Highlight Detection Using HMMs
      7.7.1 Objective
      7.7.2 Overview
      7.7.3 Camera Shot Classification
      7.7.4 Feature Extraction
      7.7.5 Highlight Detection
      7.7.6 Experimental Evaluation
   Problems

8 Inference and Learning for General Graphical Models
   8.1 Introduction
   8.2 Sum-product algorithm
   8.3 Max-product algorithm
   8.4 Approximate inference
   8.5 Learning
   Problems

Part III Discriminative Graphical Models

9 Maximum Entropy Model and Conditional Random Field
   9.1 Overview of Maximum Entropy Model
   9.2 Maximum Entropy Framework
      9.2.1 Feature Function
      9.2.2 Maximum Entropy Model Construction
      9.2.3 Parameter Computation
   9.3 Comparison to Generative Models
   9.4 Relation to Conditional Random Field
   9.5 Feature Selection
   9.6 Case Study: Baseball Highlight Detection Using Maximum Entropy Model
      9.6.1 System Overview
      9.6.2 Highlight Detection Based on Maximum Entropy Model
      9.6.3 Multimedia Feature Extraction
      9.6.4 Multimedia Feature Vector Construction
      9.6.5 Experiments
   Problems

10 Max-Margin Classifications
   10.1 Support Vector Machines (SVMs)
      10.1.1 Loss Function and Risk
      10.1.2 Structural Risk Minimization
      10.1.3 Support Vector Machines
      10.1.4 Theoretical Justification
      10.1.5 SVM Dual
      10.1.6 Kernel Trick
      10.1.7 SVM Training
      10.1.8 Further Discussions
   10.2 Maximum Margin Markov Networks
      10.2.1 Primal and Dual Problems
      10.2.2 Factorizing Dual Problem
      10.2.3 General Graphs and Learning Algorithm
      10.2.4 Max-Margin Networks vs. Other Graphical Models
   Problems

A Appendix

References

Index

1 Introduction

The term machine learning covers a broad range of computer programs. In general, any computer program that can improve its performance at some task through experience (or training) can be called a learning program [1]. There are two general types of learning: inductive and deductive. Inductive learning aims to obtain or discover general rules/facts from particular training examples, while deductive learning attempts to use a set of known rules/facts to derive hypotheses that fit the observed training data. Because of its commercial value and variety of applications, inductive machine learning has been the focus of considerable research for decades, and most machine learning techniques in the literature fall into the inductive learning category. In this book, unless otherwise noted, the term machine learning will be used to denote inductive learning.

During the early days of machine learning research, computer scientists developed learning algorithms based on heuristics and insights into human reasoning mechanisms. Many early works modeled the learning problem as a hypothesis search problem where the hypothesis space is searched through to find the hypothesis that best fits the training examples. Representative works include concept learning, decision trees, etc. On the other hand, neuroscientists attempted to devise learning methods by imitating the structure of human brains. Various types of neural networks are the most famous achievements from such endeavors.

Along the course of machine learning research, there have been several major developments that have brought significant impacts on, and accelerated the evolution of, the machine learning field. The first such development is the merging of research activities between statisticians and computer scientists. This has resulted in mathematical formulations of machine learning techniques using statistical and probabilistic theories. A second development is the significant progress in linear and nonlinear programming algorithms, which has dramatically enhanced our ability to optimize complex and large-scale problems. A third development, less relevant but still important, is the dramatic increase in computing power, which has made many complex, heavyweight training/optimization algorithms computationally possible and feasible. Compared to early-stage machine learning techniques, recent methods are more theoretic instead of heuristic, rely more on modern numerical optimization algorithms instead of ad hoc search, and consequently produce more accurate and powerful inference results.

As most modern machine learning methods are either formulated using, or can be explained by, statistical/probabilistic theories, in this book our main focus will be devoted to statistical learning techniques and relevant theories. This chapter provides an overview of machine learning techniques and shows the strong relevance between typical multimedia content analysis tasks and machine learning tasks. The overview of machine learning techniques is presented through four different categorizations, each of which characterizes the machine learning techniques from a different point of view.

1.1 Basic Statistical Learning Problems

Statistical learning techniques generally deal with random variables and their probabilities. In this book, we will use uppercase letters such as X, Y, or Z to denote random variables, and use lowercase letters to denote observed values of random variables. For example, the i'th observed value of the variable X is denoted as x_i. If X is a vector, we will use the bold lowercase letter x to denote its values. Bold uppercase letters (i.e., A, B, C) are used to represent matrices.

In real applications, most learning tasks can be formulated as one of the following two problems.

Regression: Assume that X is an input (or independent) variable, and that Y is an output (or dependent) variable. Infer a function f(X) so that, given a value x of the input variable X, ŷ = f(x) is a good prediction of the true value y of the output variable Y.

Classification: Assume that a random variable X can belong to one of a finite set of classes C = {1, 2, . . . , K}. Given the value x of variable X, infer its class label l = g(x), where l ∈ C. It is also of great interest to estimate the probability P(k|x) that X belongs to class k, k ∈ C.

In fact, both the regression and classification problems in the above list can be formulated using the same framework. For example, in the classification problem, if we treat the random variable X as an independent variable, use a variable L (a dependent variable) to represent X's class label, L ∈ C, and think of the function g(X) as a regression function, then it becomes equivalent to the regression problem. The only difference is that in regression Y takes continuous, real values, while in classification L takes discrete, categorical values.

Despite the above equivalence, quite different loss functions and learning algorithms have been employed/devised to tackle each of the two problems. Therefore, in this book, to make the descriptions less confusing, we choose to clearly distinguish the two problems, and treat the learning algorithms for the two problems separately.

In real applications, regression techniques can be applied to a variety of problems such as:

• Predict a person's age given one or more face images of the person.
• Predict a company's stock price in one month from now, given both the company's performance measures and the macroeconomic data.
• Estimate tomorrow's high and low temperatures of a particular city, given various meteorological sensor data of the city.

On the other hand, classification techniques are useful for solving the following problems:

• Detect human faces from a given image.
• Predict the category of the object contained in a given image.
• Detect all the home run events from a given baseball video program.
• Predict the category of a given video shot (news, sport, talk show, etc.).
• Predict whether a cancer patient will die or survive based on demographic, living habit, and clinical measurements of that patient.

Besides the above two typical learning problems, other problems, such as confidence interval computation and hypothesis testing, have also been among the main topics in the statistical learning literature. However, as we will not cover these topics in this book, we omit their descriptions here, and refer interested readers to additional reading materials in [1, 2].


1.2 Categorizations of Machine Learning Techniques

In this section, we present an overview of machine learning techniques through four different categorizations. Each categorization represents a specific branch of machine learning methodologies that stem from different assumptions/philosophies and aim at different problems. These categorizations are not mutually exclusive, and many machine learning techniques can be labeled with multiple categories simultaneously.

1.2.1 Unsupervised vs. Supervised

In Sect. 1.1, we described two basic learning problems: regression and classification. Regression aims to infer a function y = f(x) that is a good prediction of the true value y of the output variable Y given a value x of the input variable X, while classification attempts to infer a function l = g(x) that predicts the class label l of the variable X given its value x. For inferring the functions f(x) and g(x), if pairs of training data (x_i, y_i) or (x_i, l_i), i = 1, . . . , N, are available, where y_i is the observed value of the output variable Y given the value x_i of the input variable X, and l_i is the true class label of the variable X given its value x_i, then the inference process is called a supervised learning process; otherwise, it is called an unsupervised learning process.

Most regression methods are supervised learning methods. In contrast, there are many supervised as well as unsupervised classification methods in the literature. Unsupervised classification methods strive to automatically partition a given data set into a predefined number of clusters based on the analysis of the intrinsic data distribution of the data set. Normally no training data are required by such methods to conduct the data partitioning task, and some methods are even able to automatically guess the optimal number of clusters into which the given data set should be partitioned. In the machine learning field, we use a special name, clustering, to refer to unsupervised classification methods. In Chap. 3, we will present two types of data clustering techniques that are the state of the art in this field.

1.2.2 Generative Models vs. Discriminative Models

This categorization is more related to statistical classification techniques that involve various probability computations.

Given a finite set of classes C = {1, 2, . . . , K} and an input data point x, probabilistic classification methods typically compute the probabilities P(k|x) that x belongs to class k, where k ∈ C, and then classify x into the class l that has the highest conditional probability, l = arg max_k P(k|x). In general, there are two ways of learning P(k|x): generative and discriminative. Discriminative models strive to learn P(k|x) directly from the training set without attempting to model the observation x. Generative models, on the other hand, compute P(k|x) by first modeling the class-conditional probabilities P(x|k) as well as the class probabilities P(k), and then applying Bayes' rule as follows:

P(k|\mathbf{x}) \propto P(\mathbf{x}|k)\,P(k) .   (1.1)

Because P(x|k) can be interpreted as the probability of generating the observation x by class k, classifiers exploring P(x|k) can be viewed as modeling how the observation x is generated, which explains the name "generative model".

Popular generative models include Naive Bayes, Bayesian Networks, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), etc., while representative discriminative models include Neural Networks, Support Vector Machines (SVM), Maximum Entropy Models (MEM), Conditional Random Fields (CRF), etc. Generative models have been traditionally popular for data classification tasks because modeling P(x|k) is often easier than modeling P(k|x), and there exist well-established, easy-to-implement algorithms such as the EM algorithm [3] and the Baum-Welch algorithm [4] to efficiently estimate the model through a learning process. The ease of use and the theoretical beauty of generative models, however, do come with a cost. Many complex data entities, such as a beach scene, a home run event, etc., need to be represented by a vector x of many features that depend on each other. To make the model estimation process tractable, generative models commonly assume conditional independence among all the features comprising the feature vector x. Because this assumption is made for the sake of mathematical convenience rather than as a reflection of reality, generative models often have limited performance accuracies for classifying complex data sets. Discriminative models, on the other hand, typically make very few assumptions about the data and the features, and in a sense let the data speak for themselves. Recent research studies have shown that discriminative models outperform generative models in many applications such as natural language processing, webpage classification, baseball highlight detection, etc.
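To make the generative route concrete, the following toy sketch (our own illustration, not an example from the book) models each class-conditional density P(x|k) as an axis-aligned Gaussian, estimates the class priors P(k) from class frequencies, and labels a new observation with arg max_k P(x|k)P(k) as in (1.1):

```python
import numpy as np

# Toy generative classifier: per-class Gaussian P(x|k) with diagonal
# covariance, class priors P(k) from frequencies, prediction via Bayes' rule.

def fit(X, y):
    """Estimate per-class means, variances, and priors from training data."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0) + 1e-6, len(Xk) / len(X))
    return params

def predict(params, x):
    """Return arg max_k P(x|k)P(k), computed in log space for stability."""
    best_k, best_score = None, -np.inf
    for k, (mean, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        score = log_lik + np.log(prior)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Example: two well-separated 2-D classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit(X, y)
print(predict(model, np.array([3.8, 4.2])))   # expected label: 1
```

A discriminative counterpart would instead fit P(k|x) directly, for example by logistic regression, without modeling how the observation x itself is generated.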

In this book, Parts II and III will be devoted to covering representative generative and discriminative models, respectively, that are particularly powerful and effective for modeling multimedia data.


1.2.3 Models for Simple Data vs. Models for Complex Data

Many data entities have simple, flat structures that do not depend on other data entities. The outcome of each coin toss, the weight of each apple, the age of each person, etc., are examples of such simple data entities. In contrast, there exist complex data entities that consist of sub-entities that are strongly related to one another. For example, a beach scene is usually composed of a blue sky on top, an ocean in the middle, and a sand beach at the bottom. In other words, a beach scene is a complex entity that is composed of three sub-entities with certain spatial relations. On the other hand, in TV-broadcast baseball game videos, a typical home run event usually consists of four or more shots, which start from a pitcher's view, followed by a panning outfield and audience view in which the video camera tracks the flying ball, and end with a global or closeup view of the player running to home base. Obviously, a home run event is a complex data entity that is composed of a unique sequence of sub-entities.

Popular classifiers for simple data entities include Naive Bayes, Gaussian Mixture Models (GMM), Neural Networks, Support Vector Machines (SVM), etc. These classifiers all take the form k = g(x) to independently classify the input data x into one of the predefined classes k, without looking at other spatially or temporally related data entities.

For modeling complex data entities, popular classifiers include Bayesian Networks, Hidden Markov Models (HMM), Maximum Entropy Models (MEM), Conditional Random Fields (CRF), Maximum Margin Markov Networks (M3-nets), etc. A common characteristic of these classifiers is that, instead of determining the class label l_i of each input data point x_i independently, a joint probability function P(. . . , l_{i-1}, l_i, l_{i+1}, . . . | . . . , x_{i-1}, x_i, x_{i+1}, . . .) is inferred so that all spatially or temporally related data . . . , x_{i-1}, x_i, x_{i+1}, . . . are examined together, and the class labels . . . , l_{i-1}, l_i, l_{i+1}, . . . of these related data are determined jointly. As illustrated in the preceding paragraph, complex data entities are usually formed by sub-entities that possess specific spatio-temporal relationships; modeling complex data entities using the above joint probability is a very natural yet powerful way of capturing the intrinsic structures of the given problems.
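For instance (a standard formulation sketched here for concreteness, not quoted from this section), the linear-chain models covered later in the book realize this joint labeling by chaining local terms along a sequence of length n, where the λ_j are feature weights and the f_j are feature functions:

P(l_1, \ldots, l_n, \mathbf{x}_1, \ldots, \mathbf{x}_n) = P(l_1)\,P(\mathbf{x}_1 | l_1) \prod_{i=2}^{n} P(l_i | l_{i-1})\,P(\mathbf{x}_i | l_i)   (HMM, generative)

P(l_1, \ldots, l_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_n) \propto \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(l_{i-1}, l_i, \mathbf{x}, i) \Big)   (linear-chain CRF, discriminative)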

Among the classifiers for modeling complex data entities, the HMM has been commonly used for speech recognition, and has become a de facto standard for modeling sequential data over the last decade. CRFs and M3-nets are relatively new methods that are quickly gaining popularity for classifying sequential, or interrelated, data entities. These classifiers are the ones that are particularly powerful and effective for modeling multimedia data, and will be the main focus of this book.

1.2.4 Model Identification vs. Model Prediction

Research on modern statistics has been profoundly influenced by R.A. Fisher's pioneering works conducted during the decade 1915–1925 [5]. Since then, and even now, most researchers have been following his framework for the development of statistical learning techniques. Fisher's framework models any signal Y as the sum of two components, deterministic and random:

Y = f(X) + ε . (1.2)

The deterministic part f(X) is defined by the values of a known family of functions determined by a limited number of parameters. The random part ε corresponds to the noise added to the signal, which is defined by a known density function. Fisher considered the estimation of the parameters of the function f(X) as the goal of statistical analysis. To find these parameters, he introduced the maximum likelihood method.
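As a brief illustration of how maximum likelihood interacts with (1.2), assume (purely for this sketch; the assumption is ours, not stated in this passage) that ε is zero-mean Gaussian noise with variance σ² and that f is parameterized by θ. Maximizing the likelihood of the observations (x_i, y_i), i = 1, . . . , n, then reduces to ordinary least squares:

L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - f(x_i; \theta))^2}{2\sigma^2} \right),
\qquad
\hat{\theta} = \arg\max_{\theta} L(\theta) = \arg\min_{\theta} \sum_{i=1}^{n} \big( y_i - f(x_i; \theta) \big)^2 .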

Since the main goal of Fisher's statistical framework is to estimate the model that generates the observed signal, his paradigm in statistics can be called Model Identification (or inductive inference). The idea of estimating the model reflects the traditional goal of science: to discover an existing Law of Nature. Indeed, Fisher's philosophy has attracted numerous followers, and most statistical learning methods, including many methods to be covered in this book, are formulated based on his model identification paradigm.

Despite Fisher's monumental works on modern statistics, there have been bitter controversies over his philosophy which still continue nowadays. It has been argued that Fisher's model identification paradigm belongs to the category of ill-posed problems, and is not an appropriate tool for solving high dimensional problems since it suffers from the "curse of dimensionality".

From the late 1960s, Vapnik and Chervonenkis started a new paradigm called Model Prediction (or predictive inference). The goal of model prediction is to predict events well, but not necessarily through the identification of the model of events. The rationale behind the model prediction paradigm is that the problem of estimating a model of events is hard (ill-posed) while the problem of finding a rule for good prediction is much easier (better-posed). It could happen that there are many different rules that predict the events well yet are very different from the model; nonetheless, these rules can still be very useful predictive tools.

To go one step beyond the model prediction paradigm, Vapnik introduced the Transductive Inference paradigm in the 1980s [6]. The goal of transductive inference is to estimate the values of an unknown predictive function at given points of interest, but not over the whole domain of its definition. Again, the rationale here is that, by solving less demanding problems, one can achieve more accurate solutions. In general, the philosophy behind the paradigms of model prediction and transductive inference can be summarized by the following Imperative [7]:

Imperative: While solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you need, but not a more general one. It is quite possible that you have enough information to solve a particular problem of interest well, but not enough information to solve a general problem.

The Imperative constitutes the main methodological difference between the philosophy of science for the simple world and that for the complex world. The classical philosophy of science has an ambitious goal: discovering the universal laws of nature. This is feasible in a simple world, such as physics, a world that can be described with only a few variables, but might not be practical in a complex world whose description requires many variables, such as the worlds of pattern recognition and machine intelligence. The essential problem in dealing with a complex world is to specify less demanding problems whose solutions are well-posed, and to find methods for solving them.

Table 1.1 summarizes the discussions on the three types of inferences, and compares their pros and cons from various viewpoints. The development of statistical learning techniques based on the paradigms of model prediction and transductive inference (the complex world philosophy) has a relatively short history. Representative methods include neural networks, SVMs, M3-nets, etc. In this book, we will cover SVMs and M3-nets in Chap. 10.

Table 1.1. Summary of three types of inferences

                        inductive inference      predictive inference     transductive inference
 goal                   identify a model         discover a rule for      estimate values of an
                        of events                good prediction          unknown predictive
                                                 of events                function at some points
 complexity             most difficult           easier                   easiest
 applicability          simple world with        complex world with       complex world with
                        a few variables          numerous variables       numerous variables
 computation cost       low                      high                     highest
 generalization power   low                      high                     highest

1.3 Multimedia Content Analysis

During the 1990s, the field of multimedia content analysis was dominated by research on content-based image and video retrieval. The motivation behind such research is that traditional keyword-based information retrieval techniques are no longer applicable to images and videos, due to the following reasons. First, the prerequisite for applying keyword-based search techniques is that we have a comprehensive content description for each image/video stored in the database. Given the state of the art of computer vision and pattern recognition techniques, by no means can such content descriptions be generated automatically by computers. Second, manual annotation of image/video contents is extremely time consuming and cost prohibitive; therefore, it can be justified only when the searched materials have very high value. Third but not least, as there are many different ways of annotating the same image/video content, manual annotation tends to be very subjective and diverse, making keyword-based content search even more difficult.

Given the above problems associated with keyword-based search, content-based image/video retrieval techniques strive to enable users to retrieve desired images/videos based on similarities among low level features, such as colors, textures, shapes, motions, etc. [8, 9, 10]. The assumption here is that visually similar images/videos consist of similar image/motion features, which can be measured by appropriate metrics. In the past decade, great efforts have been devoted to many fundamental problems such as features, similarity measures, indexing schemes, relevance feedback, etc. Despite the great amount of research effort, the success of content-based image/video retrieval systems is quite limited, mainly due to the poor performance of these systems. More often than not, the use of a red car image as a query will bring back more images with irrelevant objects than images with red cars. A main reason for this poor performance is that big semantic gaps exist between the low level features used by content-based image/video retrieval systems and the high level semantics expressed by the query images/videos. Users tend to judge the similarity between two images based more on semantics than on the appearance of colors and textures of the images. Therefore, a conclusion that can be drawn here is that the key to the success of content-based image/video retrieval systems lies in the degree to which we can bridge, or reduce, the semantic gaps.

A straightforward yet effective way of bridging the semantic gaps is to deepen our analysis and understanding of image/video contents. While understanding the contents of general images/videos is still unachievable now, recognizing certain classes of objects/events under certain environment settings is already within our reach. From 2003, the TREC Conference, sponsored by the National Institute of Standards and Technology (NIST) and other U.S. government agencies, started the video retrieval evaluation track (TRECVID)¹ to promote research on deeper image/video content analysis. To date, TRECVID has established the following four main tasks that are open for competition:

• Shot boundary determination: Identify the shot boundaries by their locations and types (cut or gradual) in the given video sequences.

• Story segmentation: Identify the boundary of each story by its location and type (news or miscellaneous) in the given video sequences. A story is defined as a segment of video with a coherent content focus which can be composed of multiple shots.

• High-level feature extraction: Detect the shots that contain various high-level semantic concepts such as "Indoor/Outdoor", "People", "Vegetation", etc.

• Search: Given a multimedia statement of the information need (topic), return all the shots from the collection that best satisfy the information need.

¹ The official homepage of TRECVID is located at http://www-nlpir.nist.gov/projects/trecvid.

Comparing the above tasks to the typical machine learning tasks described in Sect. 1.1, we can find many analogies and equivalences. Indeed, with the ever increasing complexity and variability of multimedia data, machine learning techniques have become the most powerful modeling tools for analyzing the contents and gaining intelligence from this kind of complex data. Traditional rule-based approaches, where humans have to discover the domain knowledge and encode it into a set of programming rules, are too costly and ineffective for multimedia content analysis because the knowledge for recognizing high-level concepts/events can be very complex, vague, or difficult to define.

In the following chapters of this book, we intend to bring together those important machine learning techniques that are particularly powerful and effective for modeling multimedia data. We do not attempt to write a comprehensive catalog covering the entire spectrum of machine learning techniques, but rather to focus on the learning methods effective for multimedia data. To further increase the usability of this book, we include case studies in many chapters to demonstrate example applications of the respective techniques to real multimedia problems, and to illustrate factors to be considered in real implementations.

Part I Unsupervised Learning

2 Dimension Reduction

Dimension reduction is an important research topic in the area of unsupervised learning. Dimension reduction techniques aim to find a low-dimensional subspace that best represents a given set of data points. These techniques have a broad range of applications including data compression, visualization, exploratory data analysis, pattern recognition, etc.

In this chapter, we present three representative dimension reduction techniques: Singular Value Decomposition (SVD), Independent Component Analysis (ICA), and Locally Linear Embedding (LLE). Dimension reduction based on singular value decomposition is also referred to as principal component analysis (PCA) in many papers in the literature. We start the chapter by discussing the goals and objectives of dimension reduction techniques, followed by detailed descriptions of SVD, ICA, and LLE. In the last section of the chapter, we provide a case study where the three techniques are applied to the same data set, and the subspaces generated by these techniques are compared to reveal their characteristics.

2.1 Objectives

The ultimate goal of statistical machine learning is to create a model that is able to explain a given phenomenon, or to model the behavior of a given system. An observation x ∈ R^p obtained from the phenomenon/system can be considered as a set of indirect measurements of an underlying source s ∈ R^q. Since we generally have no idea which measurements will be useful for modeling the given phenomenon/system, we usually attempt to measure all we can get from the target, resulting in a p that is often larger than q.


Since an observation x is a set of indirect measurements of a latent source s, its elements may be distorted by noise, and may contain strong correlations or redundancies. Using x in analysis will not only result in poor performance accuracies, but also incur excessive modeling costs for estimating an excessive number of model parameters, some of which are redundant.

The primary goal of dimension reduction is to find a low-dimensional subspace R^{p'} ⊂ R^p that is optimal for representing the given data set with respect to a certain criterion function. The use of different criterion functions leads to different types of dimension reduction techniques.

Besides the above primary goal, one is often interested in inferring the latent source s itself from the set of observations x_1, . . . , x_n ∈ R^p. Consider a meeting room with two microphones and two people talking simultaneously. The two microphones pick up two different mixtures x_1, x_2 of the two independent sources s_1, s_2. It will be very useful if we can estimate the two original speech signals s_1 and s_2 using the recorded (observed) signals x_1 and x_2. This is an example of the classical cocktail party problem, and independent component analysis is intended to provide solutions to blind source separation.
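A minimal simulation of this two-microphone setting, assuming NumPy and scikit-learn's FastICA are available (the mixing matrix and source signals below are made up purely for illustration), might look like:

```python
import numpy as np
from sklearn.decomposition import FastICA  # assumes scikit-learn is installed

# Toy cocktail-party setup: two independent sources s1, s2 are mixed by an
# unknown 2x2 matrix A into two "microphone" signals; ICA then tries to
# recover the sources from the observed mixtures alone.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                        # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))               # source 2: square wave
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((2000, 2))

A = np.array([[1.0, 0.5],                 # unknown mixing matrix (illustrative)
              [0.4, 1.0]])
X = S @ A.T                               # observed microphone signals x1, x2

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)              # estimated sources (up to scale/order)

# Correlations between true and estimated sources; each true source should
# correlate strongly with exactly one estimated component.
print(np.round(np.corrcoef(S.T, S_est.T)[:2, 2:], 2))
```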

2.2 Singular Value Decomposition

Assume that x_1, . . . , x_n ∈ R^p are a set of centered data points¹, and that we want to find a k-dimensional subspace to represent these data points with the least loss of information. Standard PCA strives to find a p × k linear projection matrix V_k so that the sum of squared distances from the data points x_i to their projections is minimized:

L(\mathbf{V}_k) = \sum_{i=1}^{n} \left\| \mathbf{x}_i - \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i \right\|^2 .   (2.1)

In (2.1), V_k^T x_i is the projection of x_i onto the k-dimensional subspace spanned by the column vectors of V_k, and V_k V_k^T x_i is the representation of the projected vector V_k^T x_i in the original p-dimensional space. It can be easily verified that (2.1) can be rewritten as (see Problem 2.2 at the end of the chapter):

\sum_{i=1}^{n} \left\| \mathbf{x}_i - \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i \right\|^2 = \sum_{i=1}^{n} \|\mathbf{x}_i\|^2 - \sum_{i=1}^{n} \left\| \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i \right\|^2 .   (2.2)

¹ A centered vector x_c is generated by subtracting the mean vector m from the original vector x: x_c = x − m, so that x_c is a zero-mean vector.


This means that minimizing L(V_k) is equivalent to maximizing the term \sum_{i=1}^{n} \|\mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i\|^2, which is the empirical variance of these projections. Therefore, the projection matrix V_k that minimizes L(V_k) is the one that maximizes the variance in the projected space.
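For completeness, one way to verify (2.2) (the book leaves this as Problem 2.2) uses the orthonormality of the columns of V_k, i.e. V_k^T V_k = I:

\sum_{i=1}^{n} \left\| \mathbf{x}_i - \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i \right\|^2
= \sum_{i=1}^{n} \left( \mathbf{x}_i^T \mathbf{x}_i - 2\,\mathbf{x}_i^T \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i + \mathbf{x}_i^T \mathbf{V}_k \mathbf{V}_k^T \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i \right)
= \sum_{i=1}^{n} \left( \|\mathbf{x}_i\|^2 - \mathbf{x}_i^T \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i \right)
= \sum_{i=1}^{n} \|\mathbf{x}_i\|^2 - \sum_{i=1}^{n} \left\| \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i \right\|^2 ,

since \mathbf{V}_k^T \mathbf{V}_k = \mathbf{I} implies \mathbf{x}_i^T \mathbf{V}_k \mathbf{V}_k^T \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i = \mathbf{x}_i^T \mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i = \|\mathbf{V}_k \mathbf{V}_k^T \mathbf{x}_i\|^2.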

The solution V_k can be computed by Singular Value Decomposition (SVD). Denote by X the n × p matrix whose i'th row corresponds to the observation x_i. The singular value decomposition of the matrix X is defined as:

\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T ,   (2.3)

where U is an n × p orthogonal matrix (U^T U = I) whose column vectors u_i are called the left singular vectors, V is a p × p orthogonal matrix (V^T V = I) whose column vectors v_j are called the right singular vectors, and D is a p × p diagonal matrix with the singular values d_1 ≥ d_2 ≥ · · · ≥ d_p ≥ 0 as its diagonal elements.

For a given number k, the matrix V_k that is composed of the first k columns of V constitutes the rank-k solution to (2.1). This result stems from the following famous theorem [11].

Theorem 2.1. Let the SVD of matrix X be given by (2.3), U = [u_1 u_2 · · · u_p], D = diag(d_1, d_2, . . . , d_p), V = [v_1 v_2 · · · v_p], and rank(X) = r. The matrix X_τ defined below is the closest rank-τ matrix to X in terms of the Euclidean and Frobenius norms:

\mathbf{X}_\tau = \sum_{i=1}^{\tau} \mathbf{u}_i d_i \mathbf{v}_i^T .   (2.4)

The use of the τ largest singular values to approximate the original matrix with (2.4) has more implications than just dimension reduction. Discarding small singular values is equivalent to discarding linearly semi-dependent or practically nonessential axes of the original feature space. Axes with small singular values usually represent either non-essential features or noise within the data set. The truncated SVD, in one sense, captures the most salient underlying structure, yet at the same time removes the noise or trivial variations in the data set. Minor differences between data points will be ignored, and data points with similar features will be mapped near to each other in the τ-dimensional partial singular vector space. Similarity comparison between data points in this partial singular vector space will certainly yield better results than in the raw feature space.
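A small NumPy sketch of this procedure (synthetic data, our own illustration rather than the book's case study) centers the data, computes the SVD, and keeps the first k right singular vectors as the projection matrix V_k of (2.1):

```python
import numpy as np

# Toy PCA-by-SVD: generate nearly rank-k data, center it (footnote 1), take
# the SVD of the n x p data matrix, and project onto the top-k right
# singular vectors.
rng = np.random.default_rng(0)
n, p, k = 200, 10, 2
low_dim = rng.standard_normal((n, k))                    # hidden low-dimensional structure
X = low_dim @ rng.standard_normal((k, p)) + 0.01 * rng.standard_normal((n, p))

X_centered = X - X.mean(axis=0)                          # centered data points x_i
U, d, Vt = np.linalg.svd(X_centered, full_matrices=False)

V_k = Vt[:k].T                                           # p x k projection matrix V_k
Z = X_centered @ V_k                                     # k-dimensional representations V_k^T x_i
X_recon = Z @ V_k.T                                      # back-projections V_k V_k^T x_i

# Value of the loss (2.1); small here because the data are nearly rank k.
print(np.sum((X_centered - X_recon) ** 2))
```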

The singular value decomposition in (2.3) has the following interpretations: