
LEARNING WITH RECURSIVE PERCEPTUAL REPRESENTATIONS
Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
UC Berkeley EECS & ICSI & Microsoft Research

1. CONTRIBUTION
The key contributions of our work are:

• We propose a new method based on linear SVMs, random projections, and deep architectures.

• The method enriches linear SVMs without forming explicit kernels.

• The learning only involves training linear SVMs, which is very efficient. No fine-tuning is needed in training the deep structure.

• The training could be easily parallelized.

[Figure: Input Space, Code 1, and Code 2 shown across Layer 1 (linear SVM), Layer 2, and Layer 3.]

• Based on the success of sparse coding + linear SVMs, we stacked linear SVMs, introducing a non-linear discriminative bias to achieve nonlinear separation of the data.

2. THE PIPELINE

• o_l = θ_l^T x_l

• x_{l+1} = σ(d + β W_{l+1} [o_1^T, o_2^T, ··· , o_l^T]^T)

• θ_l are the linear SVM parameters trained with x_l

• W_{l+1} is the concatenation of l random projection matrices [W_{l+1,1}, W_{l+1,2}, ··· , W_{l+1,l}]

• Each W_l is a random matrix sampled from N(0, 1)

[Figure: the whole R²SVM pipeline, in which the input d passes through a chain of RSVM blocks that produce the transformed data and the final prediction.]

[Figure: each RSVM component; a linear SVM maps x_{l-1} to o_l, and the accumulated outputs o_{1···l} are projected by the random matrix W and added to d to form x_l.]
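To make the recursion concrete, here is a minimal sketch of the training loop implied by the formulas above. It assumes scikit-learn's LinearSVC, a problem with more than two classes (so that decision_function returns one score per class), and the logistic sigmoid for σ; the values of β and the number of layers, as well as the function name train_r2svm, are illustrative choices rather than details taken from the poster.

```python
# A sketch of the R^2SVM training loop defined by the formulas above.
# Assumptions (not from the poster): scikit-learn's LinearSVC, C > 2 classes so that
# decision_function returns an (N, C) score matrix, sigma = logistic sigmoid, and
# illustrative values for beta and n_layers.
import numpy as np
from scipy.special import expit  # logistic sigmoid
from sklearn.svm import LinearSVC

def train_r2svm(d, y, n_layers=5, beta=0.1, seed=0):
    """d: (N, D) input codes, y: (N,) integer labels."""
    rng = np.random.RandomState(seed)
    D = d.shape[1]
    svms, projections, outputs = [], [], []
    x = d                                              # x_1 = d
    for l in range(n_layers):
        svm = LinearSVC().fit(x, y)                    # theta_l, trained on x_l only
        svms.append(svm)
        outputs.append(svm.decision_function(x))       # o_l = theta_l^T x_l
        # W_{l+1} = [W_{l+1,1}, ..., W_{l+1,l}], each block sampled from N(0, 1)
        W = [rng.randn(D, o.shape[1]) for o in outputs]
        projections.append(W)
        # x_{l+1} = sigma(d + beta * W_{l+1} [o_1^T, ..., o_l^T]^T)
        x = expit(d + beta * sum(o @ w.T for o, w in zip(outputs, W)))
    return svms, projections
```

At test time, the stored θ_l and projection blocks are replayed on the test codes in the same order, and, as the pipeline figure indicates, the prediction is read off the final RSVM block.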

3. ANALYSIS
• We would like to “pull apart” data from different classes.

• Quasi-orthogonality: two random vectors in a high-dimensional space are very likely to be approximately orthogonal (see the numerical sketch at the end of this section).

• In the perfect label case, we can prove that

Lemma 3.1.
– T: a set of N tuples (d^(i), y^(i))
– θ ∈ R^{D×C}: the corresponding linear SVM solution, with objective function value f_{T,θ}
– There exist w_i s.t. T′ = (d^(i) + w_{y^(i)}, y^(i)) has a linear SVM solution θ′ with f_{T′,θ′} < f_{T,θ}.

• With imperfect prediction, each layer incrementally “improves” the separability of the original data.

• Randomness helps avoid over-fitting (as will be shown in the experiments).

[Figures: visualizations of the data at layer 1, layer 2, and the final layer.]
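The following sketch (not from the poster) illustrates the quasi-orthogonality and “pull apart” points numerically: it draws one random offset per class in a moderately high-dimensional space, checks that the two offsets are nearly orthogonal, and shows that shifting every point by its class offset, i.e. the perfect-label construction of Lemma 3.1, turns two completely overlapping classes into a linearly separable set. All sizes and scales are arbitrary demo choices.

```python
# Illustration of quasi-orthogonality and the "pull apart" effect of Lemma 3.1
# in the perfect-label case. All dimensions, scales, and names are demo choices.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
N, D = 2000, 200
y = rng.randint(0, 2, size=N)
d = rng.randn(N, D)                        # two classes drawn from the SAME distribution

w = rng.randn(2, D)                        # one random offset per class, entries ~ N(0, 1)
cos = w[0] @ w[1] / (np.linalg.norm(w[0]) * np.linalg.norm(w[1]))
print(f"cosine between the two class offsets: {cos:+.3f}")   # near 0: quasi-orthogonal

d_shifted = d + w[y]                       # T' = {(d^(i) + w_{y^(i)}, y^(i))}
for name, X in [("original", d), ("shifted", d_shifted)]:
    acc = LinearSVC(max_iter=5000).fit(X, y).score(X, y)
    print(f"linear SVM training accuracy, {name} data: {acc:.3f}")
# The shifted set is essentially perfectly separable, while the original is not.
```

At test time the true labels are of course unavailable, which is why R²SVM relies on each layer's (imperfect) predictions instead; as noted above, these still improve separability incrementally.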

5. CIFAR-10
• Going deeper with randomness helps, while naive combination does not.

[Plot: Accuracy vs. Number of Layers on CIFAR-10 (accuracy roughly 64–69% over layer indices 0–60), comparing the random, concatenation, and deterministic variants.]

• Performance under different feature sizes:

– Small codebook size: R²SVM improves performance without much additional complexity.

– Large codebook size: R²SVM avoids the over-fitting issue of nonlinear SVMs.

[Plot: Accuracy vs. Codebook Size on CIFAR-10 (accuracy roughly 64–80% for codebook sizes up to 1600) for Linear SVM, R²SVM, and RBF-SVM.]

6. RESULTS
Experimental results on both vision (CIFAR-10, MNIST) and speech (TIMIT) data.

CIFAR-10

Method       Training Size   Codebook Size   Accuracy
Linear SVM   25/class        50              41.3%
RBF SVM      25/class        50              42.2%
R²SVM        25/class        50              42.8%
DCN          25/class        50              40.7%
Linear SVM   25/class        1600            44.1%
RBF SVM      25/class        1600            41.6%
R²SVM        25/class        1600            45.1%
DCN          25/class        1600            42.7%

TIMIT

Method                    Phone state accuracy
Linear SVM                50.1% (2000 codes), 53.5% (8000 codes)
R²SVM                     53.5% (2000 codes), 55.1% (8000 codes)
DCN, learned per-layer    48.5%
DCN, jointly fine-tuned   54.3%

MNIST

Method       Error
Linear SVM   1.02%
RBF SVM      0.86%
R²SVM        0.71%
DCN          0.83%
NCA w/ DAE   1.0%
Conv NN      0.53%

7. SUMMARY AND DISCUSSIONS
Comparison over Different Models

Method       Tr   Te   Sca   Rep
Deep NN      ×    ✓    ?     ✓
Linear SVM   ✓    ✓    ✓     ×
Kernel SVM   ?    ?    ×     ✓
DCN          ×    ✓    ?     ✓
R²SVM        ✓    ✓    ✓     ✓

• Tr: ease of training the model.

• Te: testing time complexity.

• Sca: scalability (does the model handle large-scale data well?).

• Rep: the representation power of the model.

Final Remarks

1. Non-sparse coded features: we applied the method to several UCI datasets and observed similar performance to kernel SVMs.

2. Number of layers: ∼5 (TIMIT / MNIST), ∼10-20 (CIFAR), depending on the nonlinear nature of the data.

8. REFERENCES
• J. Yang, K. Yu, and Y. Gong. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.

• A. Coates and A. Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.

• L. Deng and D. Yu. Deep convex network: A scalable architecture for deep learning. In Interspeech, 2011.