Transcript of lecture notes for Stat 231: Pattern Recognition and Machine Learning. A.L. Yuille. Fall 2004.

1. Stat 231. A.L. Yuille. Fall 2004

Practical Issues with SVM. Handwritten Digits: US Post Office, MNIST Datasets.

No handout. For people seriously interested in this material, see Learning with Kernels by B. Schoelkopf and A.J. Smola, MIT Press, 2002.

2. Practical SVM

Support Vector Machines first showed major success on handwritten digit/character recognition: the US Post Office database and the MNIST database. Issues that arise in real problems:

(1) Multiclassification – not just yes/no.
(2) Large datasets – quadratic programming becomes impractical.
(3) Invariances in the data. Prior knowledge.
(4) Which kernels? When do kernels generalize?

3. Multiclassification.

Two solutions for M classes.

(A): One versus Rest. For each class i = 1, ..., M, construct a binary classifier f_i trained with the class-i examples labeled +1 and all remaining examples labeled -1 (n = number of data samples). Classify a new input by the class whose classifier gives the largest output. Comment: simple, but heuristic.
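
A minimal code sketch of this one-versus-rest scheme, using scikit-learn's SVC as a stand-in for a generic binary SVM (the function names, kernel, and C value are illustrative assumptions, not from the lecture):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, C=1.0, kernel="rbf"):
    """Train one binary SVM per class: examples of class c are +1, all others -1."""
    X, y = np.asarray(X), np.asarray(y)
    classifiers = {}
    for c in classes:
        labels = np.where(y == c, 1, -1)
        classifiers[c] = SVC(C=C, kernel=kernel).fit(X, labels)
    return classifiers

def predict_one_vs_rest(classifiers, X):
    """Assign each input to the class whose binary classifier gives the largest output."""
    classes = list(classifiers)
    scores = np.column_stack([classifiers[c].decision_function(X) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```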

4. Multiclassification

(B): Hyperplanes for each class label, solved for jointly.

Data and slack variables, and the resulting quadratic programming problem, are written over all M classes at once; a standard way to set this up is sketched below.
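
The slide's equations are not reproduced in the transcript. One standard joint formulation (in the style of Weston and Watkins, as discussed in Schoelkopf and Smola; the notation and the margin constant 2 are assumptions, not copied from the slide) is:

```latex
% One hyperplane (w_m, b_m) per class; slack variables xi_i^m for each
% training point i and each competing class m != y_i.
\min_{\{w_m, b_m\},\, \{\xi_i^m\}} \;
  \tfrac{1}{2}\sum_{m=1}^{M} \|w_m\|^2
  \;+\; C \sum_{i=1}^{n} \sum_{m \neq y_i} \xi_i^m
\quad\text{subject to}\quad
\langle w_{y_i}, x_i\rangle + b_{y_i}
  \;\geq\; \langle w_m, x_i\rangle + b_m + 2 - \xi_i^m,
\qquad
\xi_i^m \geq 0, \quad \text{for all } i \text{ and all } m \neq y_i .
```

A new input x is then assigned to the class \arg\max_m \left( \langle w_m, x\rangle + b_m \right).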

5. Multiclass and Data Size

Empirically, methods (A) and (B) give results of similar quality. Method (B) is more attractive, but its solution is more computationally intensive. This leads to the next issue:

(2) Large Datasets.

The Quadratic Programming problem is most easily formulated in terms of the dual.
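
The dual itself is not reproduced in the transcript; for the standard two-class soft-margin SVM with kernel K it takes the textbook form (an assumption that this matches the slide):

```latex
% Dual QP: one variable alpha_i per training point, kernel K(x_i, x_j).
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad\text{subject to}\quad
0 \leq \alpha_i \leq C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .
```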

For large datasets n is enormous. Quadratic Programming is computationally expensive.

6. Large Datasets

Chunking is the favored solution. Observe that the dual variables (the Lagrange multipliers α_i) will be non-zero only for the support vectors. “Chunk” the training data into k sets of size n/k. Train on these k sets and keep the support vectors from each set. Then train on the combined support vectors of the k sets.

Note: one then needs to check the original data to make sure that it is correctly classified. If not, add more support vectors.
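
A minimal sketch of this chunking loop, again using scikit-learn's SVC as a stand-in solver (the helper name, kernel, and C value are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, k, C=1.0, kernel="rbf", seed=0):
    """Chunking: solve k small QPs, then one QP on the pooled support vectors."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(y)), k)   # k sets of size ~n/k
    sv_idx = []
    for idx in chunks:
        clf = SVC(C=C, kernel=kernel).fit(X[idx], y[idx])
        sv_idx.extend(idx[clf.support_])                   # keep this chunk's support vectors
    sv_idx = np.unique(sv_idx)
    clf = SVC(C=C, kernel=kernel).fit(X[sv_idx], y[sv_idx])
    # Check the full training set; misclassified points get added and the QP is re-solved.
    wrong = np.flatnonzero(clf.predict(X) != y)
    if wrong.size:
        sv_idx = np.union1d(sv_idx, wrong)
        clf = SVC(C=C, kernel=kernel).fit(X[sv_idx], y[sv_idx])
    return clf
```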

7. Large Datasets.

Chunking is successful and computationally efficient provided the number of support vectors is small.

This happens if there is a hyperplane/hypersurface with large margin separating the classes.

It is harder if data from the classes overlap – e.g. when there are a large number of data points which need non-zero slack variables (i.e. support vectors).

In either case, more support vectors are needed for the combined multiclass approach, case (B), than for the heuristic (A).

Note: there are other approximate methods for the cases where chunking fails.

8. Invariances and Priors

(3) Invariances in the Data. Recognizing handwritten digits. The classifier should be insensitive to small changes to the data: for example, small rotations and small translations.

9. Invariances and Priors

Virtual Support Vectors (VSV). Strategy:
(i) Train on the original dataset to get the support vectors.
(ii) Generate artificial examples by applying the transformations to the support vectors.
(iii) Train again on the “virtual examples” generated by (ii).
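
A minimal sketch of this three-step recipe for image data, assuming flattened side-by-side square images and using scikit-learn's SVC with a polynomial kernel as a stand-in for the lecture's SVM (the shift set, kernel, and C value are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def shift_image(img, dx, dy):
    """Translate a 2-D image by (dx, dy) pixels, padding with zeros (background)."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h - max(dy, 0), max(-dx, 0):w - max(dx, 0)]
    return out

def virtual_sv_training(X, y, side=28, shifts=((1, 0), (-1, 0), (0, 1), (0, -1)), C=10.0):
    """VSV: (i) train, (ii) transform the support vectors, (iii) retrain on the enlarged set."""
    X, y = np.asarray(X), np.asarray(y)
    base = SVC(C=C, kernel="poly", degree=3).fit(X, y)           # step (i)
    sv_X, sv_y = X[base.support_], y[base.support_]
    virt_X, virt_y = [sv_X], [sv_y]
    imgs = sv_X.reshape(-1, side, side)                          # assumes flattened side x side images
    for dx, dy in shifts:                                        # step (ii): 1-pixel translations
        shifted = np.stack([shift_image(im, dx, dy) for im in imgs])
        virt_X.append(shifted.reshape(len(sv_y), -1))
        virt_y.append(sv_y)
    X_vsv, y_vsv = np.vstack(virt_X), np.hstack(virt_y)
    return SVC(C=C, kernel="poly", degree=3).fit(X_vsv, y_vsv)   # step (iii)
```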

10. Virtual Support Vectors

11. Invariances and Priors

Other methods include:
(i) Hand-designing features which are invariant to the relevant transformations.
(ii) Training on virtual examples before constructing support vectors (computationally expensive).
(iii) Designing criteria allowing for data transformations.
(iv) Learning features which are invariants (TPA).

In general, it is best to select your input features using as much prior knowledge as you have about the problem.

12. MNIST Results

MNIST dataset of handwritten digits. Summary of results: page 341 of Schoelkopf and Smola. The best classifier uses a polynomial kernel.

“8 VSV” means 8 invariance-transformed samples per data point (1-pixel translations, plus rotation).

The MNIST dataset has 60,000 training digits (plus 10,000 test digits). LeNet is a multilayer network with special training plus boosting.

13. MNIST Results

14. Summary.

Applying SVMs to real problems requires:
Multiclass: method (A) One-versus-Rest, or (B) the full joint solution.
Computational practicality: chunking, by dividing the dataset into subsets and using the support vectors from each subset.
Invariance: generate new samples by applying translations to the support vectors, producing virtual support vectors.

Very successful on the MNIST and US Post Office datasets. Simpler than the LeNet approach (the closest rival).