
Deep Learning for Computer Vision, Fall 2020

http://vllab.ee.ntu.edu.tw/dlcv.html (Public website)

https://cool.ntu.edu.tw/courses/3368 (NTU COOL; for grade, etc.)

Yu-Chiang Frank Wang 王鈺強, Professor

Dept. Electrical Engineering, National Taiwan University

2020/09/22

Tight (yet tentative) Schedule

Week | Date    | Topic                                                                                             | Remarks
1    | 9/15    | Course Logistics                                                                                  |
2    | 9/22    | Machine Learning 101                                                                              |
3    | 9/29    | Convolutional Neural Network (I)                                                                  | HW #1 out
4    | 10/6    | Convolutional Neural Network (II): Visualization & Extensions of CNN                              |
5    | 10/13   | Tutorials on Python, Github, etc. (by TAs); Invited Talk by Prof. Philipp Krähenbühl (UT Austin)  | HW #1 due
6    | 10/20   | Object Detection & Segmentation                                                                   | HW #2 out
7    | 10/27   | Generative Models & Generative Adversarial Network (GAN)                                          |
8    | 11/3    | Transfer Learning for Visual Classification & Synthesis (I)                                       |
9    | 11/10   | Transfer Learning for Visual Classification & Synthesis (II); Representation Disentanglement      | HW #2 due; HW #3 out
10   | 11/17   | TBD (CVPR Week)                                                                                   |
11   | 11/24   | Recurrent Neural Networks & Transformer                                                           |
12   | 12/1    | Meta-Learning; Few-Shot and Zero-Shot Classification (I)                                          | HW #3 due
13   | 12/8    | Meta-Learning; Few-Shot and Zero-Shot Classification (II)                                         | HW #4 out
14   | 12/15   | From Domain Adaptation to Domain Generalization                                                   | Team-up for Final Projects
15   | 12/22   | Beyond 2D Vision (3D and Depth)                                                                   |
16   | 12/29   | Image Inpainting and Outpainting; Guest Lecture                                                   | HW #4 due
17   | 1/5     | Guest Lectures                                                                                    |
     | 1/18-22 | Presentation for Final Projects                                                                   | TBD

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Example: Testing/Screening of COVID-19

[Figure: two score distributions plotted on a common axis, one for "Positive for COVID" and one for "Negative for COVID".]

• Distributions of positive/negative test results (e.g., PCR, antibody, etc.):
  the further away they are from each other, the more accurate the COVID diagnosis.

Bayesian Decision Theory

• Fundamental statistical approach to classification/detection tasks

• Take a 2-class classification/detection task as an example:
  • Let's see if a student would pass or fail the course of DLCV.
  • Define a probabilistic variable ω to describe the case of pass or fail.
  • That is, ω = ω1 for pass, and ω = ω2 for fail.

• Prior Probability
  • The a priori or prior probability reflects our knowledge of how likely we expect a certain state of nature to be before any observation.
  • P(ω = ω1), or simply P(ω1), is the prior that the next student would pass DLCV.
  • The priors must exhibit exclusivity and exhaustivity, i.e., $P(\omega_1) + P(\omega_2) = 1$.
• Equal priors
  • If we have equal numbers of students pass/fail DLCV, then the priors are equal; in other words, the priors are uniform.

Prior Probability (cont’d)

• Decision rule based on priors only
  • If the only available info is the prior, and the cost of any type of incorrect classification is equal, what would be a reasonable decision rule?
  • Decide ω1 if $P(\omega_1) > P(\omega_2)$; otherwise decide ω2.
  • What's the incorrect classification rate (or error rate) Pe? With this rule, $P_e = \min\{P(\omega_1),\, P(\omega_2)\}$.

Class-Conditional Probability Density (or Likelihood)

• The probability density function (PDF) for an input/observation x given a state of nature ω is written as $p(\mathbf{x}|\omega)$.
• Here are (hopefully) the hypothetical class-conditional densities reflecting the time spent on DLCV by the students who eventually pass/fail this course.

Posterior Probability & Bayes Formula

• If we know the prior distribution and the class-conditional density, can we come up with a better decision rule?
• Yes We Can! By calculating the posterior probability.
• Posterior probability $P(\omega|\mathbf{x})$:
  • The probability of a certain state of nature ω given an observable x.
• Bayes formula:
  $$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)\, P(\omega_j)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{C} p(\mathbf{x}|\omega_j)\, P(\omega_j).$$
  And we have $\sum_{j=1}^{C} P(\omega_j|\mathbf{x}) = 1$.
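Below is a minimal Python sketch of the Bayes formula for this 2-class pass/fail example. The priors and the Gaussian class-conditional densities (over hypothetical hours/week spent on DLCV) are made-up numbers for illustration, not course data.

```python
from scipy.stats import norm

# Hypothetical priors and class-conditional densities (illustration only).
priors = {"pass": 0.6, "fail": 0.4}
likelihoods = {
    "pass": norm(loc=10.0, scale=2.0),  # p(x | pass): hours/week given "pass"
    "fail": norm(loc=4.0, scale=2.0),   # p(x | fail): hours/week given "fail"
}

def posteriors(x):
    """Bayes formula: P(w_j | x) = p(x | w_j) P(w_j) / p(x)."""
    joint = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(joint.values())      # p(x) = sum_j p(x | w_j) P(w_j)
    return {w: joint[w] / evidence for w in joint}

post = posteriors(7.0)
print(post)                             # the posteriors sum to 1
print(max(post, key=post.get))          # decide the class with larger posterior
```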

Decision Rule & Probability of Error

• For a given observable x (e.g., # of GPUs), the decision rule (to take DLCV or not) will now be based on the posteriors: decide ω1 if $P(\omega_1|\mathbf{x}) > P(\omega_2|\mathbf{x})$; otherwise decide ω2.
• What's the probability of error P(error) (or Pe)? For a particular x, $P(\text{error}|\mathbf{x}) = \min\{P(\omega_1|\mathbf{x}),\, P(\omega_2|\mathbf{x})\}$; averaging over x gives $P_e = \int P(\text{error}|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$.

From Bayes Decision Rule to Detection Theory

• Hit (detection, TP), false alarm (FA, FP), miss (false reject, FN), correct rejection (TN)
• Receiver Operating Characteristic (ROC)
  • To assess the effectiveness of the designed features/classifiers
  • False-alarm rate (PFA, FP rate) vs. detection rate (PD, TP rate)
  • Which curve/line makes sense? (a), (b), or (c)?

[Figure: three candidate ROC curves (a), (b), and (c), plotting the detection rate PD against the false-alarm rate PFA.]
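To see why only one of the candidate ROC curves makes sense, here is a hedged sketch that traces an ROC curve by sweeping a decision threshold over detector scores; the Gaussian score distributions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic detector scores: positives score higher on average (assumed).
pos = rng.normal(2.0, 1.0, 1000)   # scores for ground-truth positives
neg = rng.normal(0.0, 1.0, 1000)   # scores for ground-truth negatives

# Sweep a decision threshold t: predict "positive" whenever score > t.
thresholds = np.linspace(-4.0, 6.0, 101)
p_d  = np.array([(pos > t).mean() for t in thresholds])  # detection rate P_D
p_fa = np.array([(neg > t).mean() for t in thresholds])  # false-alarm rate P_FA

# A useful detector keeps P_D >= P_FA at every threshold, i.e., its ROC curve
# lies above the diagonal P_D = P_FA (the chance-level line).
for t, d, f in zip(thresholds[::20], p_d[::20], p_fa[::20]):
    print(f"t={t:+.1f}  P_FA={f:.2f}  P_D={d:.2f}")
```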

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Clustering

• Clustering is an unsupervised algorithm.
  • Given: a set of N unlabeled instances {x1, …, xN}; # of clusters K
  • Goal: group the samples into K partitions
• Remarks:
  • High within-cluster (intra-cluster) similarity
  • Low between-cluster (inter-cluster) similarity
  • But… how to determine a proper similarity measure?

Similarity is NOT Always Objective…


Clustering (cont’d)

• Similarity:
  • A key component/measure to perform data clustering
  • Inversely proportional to distance
  • Example distance metrics:
    • Euclidean distance (L2 norm): $d(\mathbf{x},\mathbf{z}) = \|\mathbf{x}-\mathbf{z}\|_2 = \sqrt{\sum_{i=1}^{D}(x_i - z_i)^2}$
    • Manhattan distance (L1 norm): $d(\mathbf{x},\mathbf{z}) = \|\mathbf{x}-\mathbf{z}\|_1 = \sum_{i=1}^{D}|x_i - z_i|$
  • Note that the p-norm of x is denoted as $\|\mathbf{x}\|_p = \big(\sum_{i=1}^{D}|x_i|^p\big)^{1/p}$.
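A quick NumPy check of these metrics; `p_norm` is a hypothetical helper implementing the p-norm definition above, not a library call.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 3.0])

# Euclidean (L2) and Manhattan (L1) distances via NumPy's built-in norm.
d2 = np.linalg.norm(x - z, ord=2)   # sqrt(sum_i (x_i - z_i)^2)
d1 = np.linalg.norm(x - z, ord=1)   # sum_i |x_i - z_i|

def p_norm(v, p):
    """General p-norm: ||v||_p = (sum_i |v_i|^p)^(1/p)."""
    return (np.abs(v) ** p).sum() ** (1.0 / p)

assert np.isclose(d2, p_norm(x - z, 2))
assert np.isclose(d1, p_norm(x - z, 1))
print(d1, d2)                       # 3.0, ~2.236
```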

Clustering (cont’d)

• Similarity:
  • A key component/measure to perform data clustering
  • Inversely proportional to distance
  • Example distance metrics:
    • Kernelized (non-linear) distance:
      $$d(\mathbf{x},\mathbf{z}) = \|\Phi(\mathbf{x})-\Phi(\mathbf{z})\|_2^2 = \|\Phi(\mathbf{x})\|_2^2 + \|\Phi(\mathbf{z})\|_2^2 - 2\,\Phi(\mathbf{x})^T\Phi(\mathbf{z})$$
    • Taking the Gaussian kernel as an example, $K(\mathbf{x},\mathbf{z}) = \Phi(\mathbf{x})^T\Phi(\mathbf{z}) = \exp\!\big(-\tfrac{\|\mathbf{x}-\mathbf{z}\|_2^2}{2\sigma^2}\big)$, and since $K(\mathbf{x},\mathbf{x}) = 1$, we have
      $$d(\mathbf{x},\mathbf{z}) = 2 - 2\exp\!\big(-\tfrac{\|\mathbf{x}-\mathbf{z}\|_2^2}{2\sigma^2}\big).$$
      And the distance is more sensitive to larger/smaller σ. Why? (See the sketch below.)
    • For example, L2 or kernelized distance metrics for the following two cases?
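The σ-sensitivity question can be checked numerically. A minimal sketch, assuming the Gaussian kernel above: since K(x, x) = 1, the kernelized distance reduces to 2 − 2K(x, z).

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||_2^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernelized_distance(x, z, sigma):
    """||Phi(x) - Phi(z)||_2^2 = K(x,x) + K(z,z) - 2 K(x,z) = 2 - 2 K(x,z)."""
    return 2.0 - 2.0 * gaussian_kernel(x, z, sigma)

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])
# Small sigma pushes all distinct points toward the maximum distance 2;
# large sigma pushes all distances toward 0.
for sigma in (0.1, 1.0, 10.0):
    print(sigma, kernelized_distance(x, z, sigma))
```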

K-Means Clustering

• Input: N examples {x1, . . . , xN} (xn ∈ RD); number of partitions K
• Initialize: K cluster centers µ1, . . . , µK. Several initialization options:
  • Randomly initialize µ1, . . . , µK anywhere in RD
  • Or, simply choose any K examples as the cluster centers
• Iterate:
  • Assign each example xn to its closest cluster center
  • Recompute the new cluster centers µk (mean/centroid of the set Ck)
  • Repeat until convergence
• Possible convergence criteria:
  • Cluster centers do not change anymore
  • Max. number of iterations reached
• Output: K clusters (with centers/means of each cluster); see the sketch below.
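Below is a minimal NumPy sketch of this procedure (written for this note, not course-provided code), run on synthetic 2D blobs; the empty-cluster guard is a pragmatic addition beyond the slide's algorithm.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means on an N x D data matrix X; returns centers and assignments."""
    rng = np.random.default_rng(seed)
    # Initialize: simply choose K examples as the cluster centers.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iters):
        # Assignment step: each example goes to its closest center (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each center as the mean/centroid of its cluster,
        # keeping the old center if a cluster happens to go empty (simple guard).
        new_centers = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):  # converged: centers unchanged
            break
        centers = new_centers
    return centers, assign

# Two well-separated synthetic blobs, so K = 2 should recover the generating means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers, assign = kmeans(X, K=2)
print(centers)
```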

K-Means Clustering

• Example (K = 2), step by step:
  • Iteration #1: pick the cluster centers
  • Iteration #1-2: assign data to each cluster
  • Iteration #2-1: update the cluster centers
  • Iteration #2-2: assign data to each cluster
  • Iterations #3-#4: repeat the update/assign steps
  • Iteration #5: the cluster means no longer change, so the algorithm stops

[Figure sequence: 2D data colored by cluster assignment at each of the steps above.]

K-Means Clustering (cont’d)

• Limitations
  • Preferable for round-shaped clusters with similar sizes
  • Sensitive to initialization; how to alleviate this problem?
  • Sensitive to outliers; possible change from K-means to…
  • Hard assignment only. Mathematically, we have…
• Remarks
  • Expectation-maximization (EM) algorithm
  • Speed-up possible by hierarchical clustering (e.g., 100 = 10² clusters: 10 coarse clusters, each refined into 10)

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Dimension Reduction

• Principal Component Analysis (PCA)
  • Unsupervised & linear dimension reduction
  • Related to Eigenfaces and similar feature extraction/classification techniques
  • Still very popular due to its simplicity and effectiveness
  • Goal: determine the projection so that the variation of the projected data is maximized.

[Figure: 2D data in the x-y plane, with the axis that describes the largest variation for data projected onto it.]

Formulation & Derivation for PCA

• Input: a set of instances x without label info
• Output: a projection vector ω maximizing the variance of the projected data

Formulation & Derivation for PCA (cont’d)

• We need to maximize $\operatorname{var}(\boldsymbol\omega^T\mathbf{x})$ subject to $\|\boldsymbol\omega\| = 1$. Since $\operatorname{var}(\boldsymbol\omega^T\mathbf{x}) = \boldsymbol\omega^T\Sigma\,\boldsymbol\omega$ for the data covariance Σ, this is a constrained optimization problem.
• Lagrangian optimization for PCA:
  $$\max_{\boldsymbol\omega}\; \boldsymbol\omega^T\Sigma\,\boldsymbol\omega - \lambda\,(\boldsymbol\omega^T\boldsymbol\omega - 1).$$
  Setting the derivative w.r.t. ω to zero yields $\Sigma\boldsymbol\omega = \lambda\boldsymbol\omega$: the optimal projection is an eigenvector of Σ, and the variance it captures is the corresponding eigenvalue λ.

Eigenanalysis & PCA

• Eigenanalysis for PCA = find the eigenvectors ωi and the corresponding eigenvalues λi.
  • In other words, the direction ωi captures a variance of λi.
  • But which eigenvectors to use? All of them?
• A d × d covariance matrix Σ contains a maximum of d eigenvector/eigenvalue pairs.
  • Do we need to compute all of them? Which ωi and λi pairs should we use?
  • Assuming you have N images of size M × M pixels, we have dimension d = M².
  • What is the rank of Σ? Is it N, M, or d?
  • Thus, at most N (in fact N − 1, after subtracting the mean) non-zero eigenvalues can be obtained. Why?

Eigenanalysis & PCA (cont’d)

• A d × d covariance matrix contains a maximum of d eigenvector/eigenvalue pairs.
• How is dimension reduction realized? How do we reconstruct the input data?
• Expanding a signal via eigenvectors as bases:
  • With symmetric matrices (e.g., the covariance matrix), eigenvectors are orthogonal.
  • They can be regarded as unit basis vectors spanning any instance in the d-dim space.
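Putting the last three slides together, here is a hedged NumPy sketch of eigenanalysis-based PCA with projection and reconstruction; the random matrix stands in for N vectorized images, and `pca` is a helper written for this note, not a library routine.

```python
import numpy as np

def pca(X, num_components):
    """PCA via eigendecomposition of the d x d covariance of X (N x d)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)                 # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort by descending variance
    W = eigvecs[:, order[:num_components]]   # d x k projection matrix
    return mean, W

rng = np.random.default_rng(0)
X = rng.normal(size=(65, 50))                # stand-in for 65 vectorized images
mean, W = pca(X, num_components=10)

# Orthonormal eigenvectors: project to k dims, then expand back as bases.
codes = (X - mean) @ W                       # reduced k-dim representation
X_hat = mean + codes @ W.T                   # reconstruction from k components
print(((X - X_hat) ** 2).mean())             # reconstruction MSE (drops as k grows)
```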

Practical Issues in PCA (if time permits…)

• Assume we have N = 100 images of size 200 × 200 pixels (i.e., d = 40000).
  • What is the size of the covariance matrix? What's its rank?
• What can we do? The Gram matrix trick! (See the sketch below.)
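A sketch of the Gram matrix trick under the stated assumptions (N = 100, d = 40000): instead of eigendecomposing the 40000 × 40000 covariance matrix (whose rank is at most N − 1), we decompose the small 100 × 100 Gram matrix and map its eigenvectors back.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40000))     # N = 100 images, d = 200*200 = 40000
Xc = X - X.mean(axis=0)               # mean-centered data, one image per row

# Direct route would need the d x d covariance Xc.T @ Xc / N (40000 x 40000).
# Gram trick: eigendecompose the N x N Gram matrix G = Xc @ Xc.T instead.
# If G v = lam v, then (Xc.T Xc)(Xc.T v) = lam (Xc.T v), so Xc.T v is an
# eigenvector of the (unnormalized) covariance with the same eigenvalue.
G = Xc @ Xc.T                         # 100 x 100: cheap to build and decompose
eigvals, V = np.linalg.eigh(G)
keep = eigvals > 1e-8                 # drop (near-)zero eigenvalues (rank <= N-1)
U = Xc.T @ V[:, keep]                 # map back to d-dimensional eigenvectors
U /= np.linalg.norm(U, axis=0)        # re-normalize columns to unit length
print(U.shape)                        # (40000, 99): at most N - 1 components
```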

Let’s See an Example (CMU AMP Face Database)

• Let's take 5 face images × 13 people = 65 images, each of size 64 × 64 = 4096 pixels.
• The # of eigenvectors expected for perfectly reconstructing the input is 64 (= 65 − 1, after subtracting the mean).
• Let's check it out!

What Do the Eigenvectors/Eigenfaces Look Like?

[Figure: the mean face and the first 15 eigenfaces, labeled Mean and V1-V15.]

All 64 eigenvectors: do we need them all?

Reconstruction error (MSE) vs. number of eigenvectors used:

# eigenvectors | MSE
1              | 1233.16
2              | 1027.63
3              | 758.13
4              | 634.54
8              | 285.08
20             | 87.93
30             | 20.55
50             | 2.14
60             | 0.06
64             | 0.00

[Figure sequence: the reconstructed faces become visually indistinguishable from the inputs as more eigenvectors are used.]

Final Remarks

• Linear & unsupervised dimension reduction
• PCA can be applied as a feature extraction/preprocessing technique.
  • E.g., use the top 3 eigenvectors to project data into a 3D space for classification.

[Figure: data from three classes (Class 1, Class 2, Class 3) projected onto the top 3 eigenvectors and visualized as a 3D scatter plot.]

Final Remarks (cont’d)

• How do we perform classification? For example…
  • Given a test face input, project it into the same 3D space (by the same 3 eigenvectors).
  • The resulting 3D vector is the feature vector for this test input.
  • We can do a simple Nearest Neighbor (NN) classification (with Euclidean distance), which calculates the distance between the input and all training data in this space.
  • If NN (1-NN), the label of the closest training instance determines the classification output.
  • If k-nearest neighbors (k-NN), the k nearest neighbors vote for the decision. (See the sketch below.)
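A minimal sketch of the NN/k-NN rule described above; the random 3D features stand in for training images already projected by the same 3 eigenvectors.

```python
import numpy as np

def knn_predict(x, train_feats, train_labels, k=1):
    """k-NN with Euclidean distance in the (PCA-)reduced feature space."""
    dists = np.linalg.norm(train_feats - x, axis=1)  # distance to all training data
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    values, counts = np.unique(train_labels[nearest], return_counts=True)
    return values[counts.argmax()]                   # majority vote (1-NN if k=1)

# Stand-in data: 3D features as if projected by the top-3 eigenvectors.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(60, 3))
train_labels = rng.integers(0, 3, size=60)           # 3 hypothetical classes
test_feat = rng.normal(size=3)                       # projected test input
print(knn_predict(test_feat, train_feats, train_labels, k=5))
```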

[Figure: k-NN decision boundaries for k = 1, 3, 5. Image credit: Stanford CS231n.]

Demo available at http://vision.stanford.edu/teaching/cs231n-demos/knn/

Final Remarks (cont’d)

• If labels for the data are provided → Linear Discriminant Analysis (LDA)
  • LDA is also known as Fisher's discriminant analysis.
  • Eigenface vs. Fisherface (IEEE Trans. PAMI 1997)
• If linear DR is not sufficient, and non-linear DR is of interest…
  • Isomap, locally linear embedding (LLE), etc.
  • t-distributed stochastic neighbor embedding (t-SNE) (by G. Hinton & L. van der Maaten)

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Hyperparameters in ML

• Recall that for k-NN, we need to determine the k value in advance.
  • What is the best k value?
  • Or, take PCA for example: what is the best reduced dimension number?
• Hyperparameters: parameter choices for the learning model/algorithm
  • We need to determine such hyperparameters instead of guessing.
  • Let's see what we can and cannot do…

[Figure: k-NN decision boundaries for k = 1, 3, 5. Image credit: Stanford CS231n.]

How to Determine Hyperparameters?

• Idea #1
  • Let's say you are working on face recognition.
  • You come up with your very own feature extraction/learning algorithm.
  • You take a dataset to train your model, and select your hyperparameters (e.g., k of k-NN) based on the resulting performance.
  • Might not generalize well.

[Diagram: the entire dataset is used for both training and hyperparameter selection.]

How to Determine Hyperparameters? (cont’d)

• Idea #2
  • Let's say you are working on face recognition.
  • You come up with your very own feature extraction/learning algorithm.
  • For a dataset of interest, you split it into training and test sets.
  • You train your model with possible hyperparameter choices (e.g., k in k-NN), and select those that work best on test set data.
  • That's called cheating…

[Diagram: the dataset split into a training set and a test set.]

How to Determine Hyperparameters? (cont’d)

• Idea #3
  • Let's say you are working on face recognition.
  • You come up with your very own feature extraction/learning algorithm.
  • For the dataset of interest, you split it into training, validation, and test sets.
  • You train your model with possible hyperparameter choices (k in k-NN), and select those that work best on the validation set.
  • OK, but…

[Diagram: the dataset split into training, validation, and test sets.]

How to Determine Hyperparameters? (cont’d)

• Idea #3.5
  • What if only training and test sets are given, not the validation set?
  • Cross-validation (or k-fold cross-validation):
    • For each hyperparameter choice, split the training set into k folds.
    • Keep 1 fold as the validation set and the remaining k − 1 folds for training.
    • After each of the k folds has served as the validation set, report the average validation performance.
    • Choose the hyperparameter(s) which result in the highest average validation performance.
  • Take 4-fold cross-validation as an example… (a code sketch follows the diagram below)

[Diagram: the training set is split into Fold 1-Fold 4, with the test set held out; in each of 4 rounds, a different fold serves as the validation set while the remaining three folds are used for training.]
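Here is a hedged sketch of k-fold cross-validation matching the recipe above; `train_fn` and `eval_fn` are hypothetical callables (your training routine and your validation metric), not part of any library.

```python
import numpy as np

def cross_validate(train_fn, eval_fn, X, y, hyperparam_choices, k=4, seed=0):
    """k-fold CV: average validation score per hyperparameter choice."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = {}
    for h in hyperparam_choices:
        fold_scores = []
        for i in range(k):
            val = folds[i]                                  # 1 fold for validation
            trn = np.concatenate(folds[:i] + folds[i+1:])   # k-1 folds for training
            model = train_fn(X[trn], y[trn], h)
            fold_scores.append(eval_fn(model, X[val], y[val]))
        scores[h] = float(np.mean(fold_scores))             # average over k folds
    best = max(scores, key=scores.get)                      # highest average score
    return best, scores

# Usage sketch (train_knn and accuracy are hypothetical callables you supply):
# best_k, scores = cross_validate(train_knn, accuracy, X_train, y_train,
#                                 hyperparam_choices=[1, 3, 5, 7], k=4)
```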

Minor Remarks on NN-based Methods

• k-NN is easy to implement but not of much interest in practice. Why?
  • The choice of distance metric might be an issue (see the example below).
  • Measuring distances in high-dimensional spaces might not be a good idea.
  • Moreover, NN-based methods require lots of memory and computation at test time!
    (NN-based methods are viewed as data-driven approaches.)

[Figure: an original image next to three altered versions; all three altered images have the same Euclidean distance to the original one. Image credit: Stanford CS231n.]

What We’ve Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification (next lecture)