Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w1.pdf · TAs &...
Transcript of Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w1.pdf · TAs &...
Deep Learning for Computer VisionFall 2020
http://vllab.ee.ntu.edu.tw/dlcv.html (Public website)
https://cool.ntu.edu.tw/courses/3368 (NTU COOL; for grade, etc.)
Yu-Chiang Frank Wang 王鈺強, Professor
Dept. Electrical Engineering, National Taiwan University
2020/09/22
Tight (yet tentative) Schedule
2
Week Date Topic Remarks
1 9/15 Course Logistics
2 9/22 Machine Learning 101
3 9/29 Convolutional Neural Network (I) HW #1 out
4 10/6 Convolutional Neural Network (II): Visualization & Extensions of CNN
5 10/13 Tutorials on Python, Github, etc. (by TAs);Invited Talk by Prof. Philipp Krähenbühl (UT Austin)
HW #1 due
6 10/20 Object Detection & Segmentation HW #2 out
7 10/27 Generative Models & Generative Adversarial Network (GAN)
8 11/3 Transfer Learning for Visual Classification & Synthesis (I)
9 11/10 Transfer Learning for Visual Classification & Synthesis (II); Representation Disentanglement
HW #2 due;HW #3 out
10 11/17 TBD (CVPR Week)
11 11/24 Recurrent Neural Networks & Transformer
12 12/1 Meta-Learning; Few-Shot and Zero-Shot Classification (I) HW #3 due
13 12/8 Meta-Learning; Few-Shot and Zero-Shot Classification (II) HW #4 out
14 12/15 From Domain Adaptation to Domain Generalization Team-up for Final Projects
15 12/22 Beyond 2D vision (3D and Depth)
16 12/29 Image Inpainting and Outpainting; Guest Lecture HW #4 due
17 1/5 Guest Lectures
1/18-22 Presentation for Final Projects TBD
3 weeks
2 weeks
3 weeks
3 weeks
What’s to Be Covered Today…
• From Probability to Bayes Decision Rule• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction • Training, testing, & validation• Linear Classification
3
Example: Testing/Screening of COVID-19
4
Positivefor COVID
Negativefor COVID
Distributions between positive/negative test results (e.g., PCR, antibody, etc.)- further away from each other- more accurate COVID diagnosis
Bayesian Decision Theory
• Fundamental statistical approach to classification/detection tasks
• Take a 2-class classification/detection task as an example:• Let’s see if a student would pass or fail the course of DLCV.• Define a probabilistic variable ω describe the case of pass or fail.• That is, ω = ω1 for pass, and ω = ω2 for fail.
• Prior Probability• The a priori or prior probability reflects the knowledge of
how likely we expect a certain state of nature before observation.• P(ω = ω1) or simply P(ω1) as the prior that the next student would pass DLCV.• The priors must exhibit exclusivity and exhaustivity, i.e.,
• Equal priors• If we have equal numbers of students pass/fail DLCV, then the priors are equal;
in other words, the priors are uniform.
5
Prior Probability (cont’d)
• Decision rule based on priors only• If the only available info is the prior,
and the cost of any type of incorrect classification is equal, what would be a reasonable decision rule?
• Decide ω1 if
otherwise decide ω2 .• What’s the incorrect classification rate (or error rate) Pe?
6
Class-Conditional Probability Density (or Likelihood)
• The probability density function (PDF) for input/observation x given a state of nature ωis written as:
• Here’s (hopefully) the hypothetical class-conditional densities reflecting the time of the students spending on DLCV who eventually pass/fail this course.
7
Posterior Probability & Bayes Formula
• If we know the prior distribution and the class-conditional density, can we come up with a better decision rule?
• Yes We Can! • By calculating the posterior probability.
• Posterior probability 𝑃𝑃(𝜔𝜔|𝒙𝒙) :• The probability of a certain state of nature ω given an observable x.
• Bayes formula:𝑃𝑃 𝜔𝜔𝑗𝑗 ,𝒙𝒙
𝑃𝑃(𝜔𝜔𝑗𝑗|𝒙𝒙)
And, we have ∑𝑗𝑗=1𝐶𝐶 𝑃𝑃(𝜔𝜔𝑗𝑗|𝒙𝒙) = 1.8
Decision Rule & Probability of Error
• For a given observable x (e.g., # of GPUs), the decision rule (to take DLCV or not) will be now based on:
• What’s the probability of error P(error) (or Pe)?
9
From Bayes Decision Rule to Detection Theory
• Hit (detection, TP), false alarm (FA, FP), miss (false reject, FN), rejection (TN)
• Receiver Operating Characteristics (ROC)• To assess the effectiveness of the designed features/classifiers• False alarm (PFA or FP) vs. detection (Pd or TP) rates• Which curve/line makes sense? (a), (b), or (c)?
10
PD
PFA
100
1000
What’s to Be Covered Today…
• From Probability to Bayes Decision Rule• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction • Training, testing, & validation• Linear Classification
11
Clustering
• Clustering is an unsupervised algorithm.• Given:
a set of N unlabeled instances {x1, …, xN}; # of clusters K• Goal: group the samples into K partitions
• Remarks:• High within-cluster (intra-cluster) similarity• Low between-cluster (inter-cluster) similarity• But…how to determine a proper similarity measure?
12
Similarity is NOT Always Objective…
13
Clustering (cont’d)
• Similarity:• A key component/measure to perform data clustering• Inversely proportional to distance• Example distance metrics:
• Euclidean distance (L2 norm): 𝑑𝑑 𝑥𝑥, 𝑧𝑧 = 𝑥𝑥 − 𝑧𝑧 2 = ∑𝑖𝑖=1𝐷𝐷 𝑥𝑥𝑖𝑖 − 𝑧𝑧𝑖𝑖 2
• Manhattan distance (L1 norm): 𝑑𝑑 𝑥𝑥, 𝑧𝑧 = 𝑥𝑥 − 𝑧𝑧 1 = ∑𝑖𝑖=1𝐷𝐷 𝑥𝑥𝑖𝑖 − 𝑧𝑧𝑖𝑖
• Note that p-norm of x is denoted as:
14
Clustering (cont’d) • Similarity:
• A key component/measure to perform data clustering• Inversely proportional to distance• Example distance metrics:
• Kernelized (non-linear) distance:
𝑑𝑑 𝑥𝑥, 𝑧𝑧 = Φ(𝑥𝑥) − Φ 𝑧𝑧 22 = Φ(𝑥𝑥) 2
2 + Φ(𝑧𝑧) 22 − 2Φ(𝑥𝑥)𝑇𝑇Φ(𝑧𝑧)
• Taking Gaussian kernel for example: 𝐾𝐾 𝑥𝑥, 𝑧𝑧 = Φ 𝑥𝑥 𝑇𝑇Φ 𝑧𝑧 = 𝑒𝑒𝑥𝑥𝑒𝑒 − 𝑥𝑥−𝑧𝑧 22
2𝜎𝜎2,
we have
And, distance is more sensitive to larger/smaller σ. Why?• For example, L2 or kernelized distance metrics for the following two cases?
15
K-Means Clustering
• Input: N examples {x1, . . . , xN } (xn ∈RD ); number of partitions K• Initialize: K cluster centers µ1, . . . , µK . Several initialization options:
• Randomly initialize µ1, . . . , µK anywhere in RD
• Or, simply choose any K examples as the cluster centers• Iterate:
• Assign each of example xn to its closest cluster center• Recompute the new cluster centers µk (mean/centroid of the set Ck )• Repeat while not converge
• Possible convergence criteria:• Cluster centers do not change anymore• Max. number of iterations reached
• Output:• K clusters (with centers/means of each cluster)
16
K-Means Clustering
• Example (K = 2): Initialization, iteration #1: pick cluster centers
17
K-Means Clustering
• Example (K = 2): iteration #1-2, assign data to each cluster
18
K-Means Clustering
• Example (K = 2): iteration #2-1, update cluster centers
19
K-Means Clustering
• Example (K = 2): iteration #2, assign data to each cluster
20
K-Means Clustering
• Example (K = 2): iteration #3-1
21
K-Means Clustering
• Example (K = 2): iteration #3-2
22
K-Means Clustering
• Example (K = 2): iteration #4-1
23
K-Means Clustering
• Example (K = 2): iteration #4-2
24
K-Means Clustering
• Example (K = 2): iteration #5, cluster means are not changed.
25
K-Means Clustering (cont’d)• Limitation
• Preferable for round shaped clusters with similar sizes
• Sensitive to initialization; how to alleviate this problem?• Sensitive to outliers; possible change from K-means to…• Hard assignment only. Mathematically, we have…
• Remarks• Expectation-maximization (EM) algorithm• Speed-up possible by hierarchical clustering (e.g., 100 = 102 clusters)
26
What’s to Be Covered Today…
• From Probability to Bayes Decision Rule• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction • Training, testing, & validation• Linear Classification
27
Dimension Reduction• Principal Component Analysis (PCA)
• Unsupervised & linear dimension reduction• Related to Eigenfaces, etc. feature extraction and classification techniques• Still very popular despite of its simplicity and effectiveness.• Goal:
• Determine the projection, so that the variation of projected data is maximized.
28x
y
axis that describes the largest variation for data projected onto it.
Formulation & Derivation for PCA
• Input: a set of instances x without label info
• Output: a projection vector ω maximizing the variance of the projected data
29
x
y
Formulation & Derivation for PCA (cont’d)
• We need to maximize var 𝝎𝝎𝑇𝑇𝒙𝒙 with 𝝎𝝎 = 1.
• Lagrangian optimization for PCA
30
Eigenanalysis & PCA• Eigenanalysis for PCA = find the eigenvectors ωi and the corresponding eigenvalues λi
• In other words, the direction ωi captures the variance of λi.• But, which eigenvectors to use though? All of them?
• A d x d covariance matrix ∑ contains a maximum of d eigenvector/eigenvalue pairs. • Do we need to compute all of them? Which ωi and λi pairs to use?• Assuming you have N images of size M x M pixels, we have dimension d = M2.• What is the rank of ∑? It’s N, M, or d?
• Thus, at most non-zero eigenvalues can be obtained. Why?
31
Eigenanalysis & PCA (cont’d)• A d x d covariance matrix contains a maximum of d eigenvector/eigenvalue pairs.
• How dimension reduction is realized? how to reconstruct the input data?
• Expanding a signal via eigenvectors as bases• With symmetric matrices (e.g., covariance matrix), eigenvectors are orthogonal.• They can be regarded as unit basis vectors to span any instance in the d-dim space.
32
Practical Issues in PCA (if time permits…)
• Assume we have N = 100 images of size 200 x 200 pixels (i.e., d = 40000).• What is the size of the covariance matrix? What’s its rank?
• What can we do? Gram Matrix Trick!
33
Let’s See an Example (CMU AMP Face Database)
• Let’s take 5 face images x 13 people = 65 images, each is of size 64 x 64 = 4096 pixels.
• # of eigenvectors are expected to use for perfectly reconstructing the input = 64.
• Let’s check it out!
34
What Do the Eigenvectors/Eigenfaces Look Like?
35
V4 V5 V6 V7
V8 V9 V10 V11
V12 V13 V14 V15
Mean V1 V2 V3
All 64 Eigenvectors, do we need them all?
36
Use only 1 eigenvector, MSE = 1233
37
MSE=1233.16
Use 2 eigenvectors, MSE = 1027
38
MSE=1027.63
Use 3 eigenvectors, MSE = 758
39
MSE=758.13
Use 4 eigenvectors, MSE = 634
40
MSE=634.54
Use 8 eigenvectors, MSE = 285
41
MSE=285.08
With 20 eigenvectors, MSE = 87
42
MSE=87.93
With 30 eigenvectors, MSE = 20
43
MSE=20.55
With 50 eigenvectors, MSE = 2.14
44
MSE=2.14
With 60 eigenvectors, MSE = 0.06
45
MSE=0.06
All 64 eigenvectors, MSE = 0
46
MSE=0.00
Final Remarks
• Linear & unsupervised dimension reduction
• PCA can be applied as a feature extraction/preprocessing technique.• E.g,, Use the top 3 eigenvectors to project data into a 3D space for classification.
470 500 1000 1500 2000 2500-2000
0-1000
1000
02000
200
400
600
800
1000
1200 Class1Class2Class3
Final Remarks (cont’d)
• How do we perform classification? For example…• Given a test face input, project into the same 3D space (by the same 3 eigenvectors).• The resulting 3D vector is the feature vector for this test input.• We can do a simple Nearest Neighbor (NN) classification (with Euclidean distance),
which calculates the distance between the input to all training data in this space.• If NN (1-NN), the label of the closest training instance determines the classification output.• If k-nearest neighbors (k-NN), then k-nearest neighbors need to vote for the decision.
48
k = 1 k = 3 k = 5
Image credit: Stanford CS231n
Demo available at http://vision.stanford.edu/teaching/cs231n-demos/knn/
Final Remarks (cont’d)
• If labels for each data is provided → Linear Discriminant Analysis (LDA)• LDA is also known as Fisher’s discriminant analysis.• Eigenface vs. Fisherface (IEEE Trans. PAMI 1997)
• If linear DR is not sufficient, and non-linear DR is of interest…• lsomap, locally linear embedding (LLE), etc.
• t-distributed stochastic neighbor embedding (t-SNE) (by G. Hinton & L. van der Maaten)
49
What’s to Be Covered Today…
• From Probability to Bayes Decision Rule• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction • Training, testing, & validation• Linear Classification
50
Hyperparameters in ML
• Recall that for k-NN, we need to determine the k value in advance. • What is the best k value?• Or, take PCA for example, what is the best reduced dimension number?
• Hyperparameters: parameter choices for the learning model/algorithm• We need to determine such hyperparameters instead of guessing.• Let’s see what we can and cannot do…
51
k = 1 k = 3 k = 5
Image credit: Stanford CS231n
How to Determine Hyperparameters?
• Idea #1• Let’s say you are working on face recognition.• You come up with your very own feature extraction/learning algorithm.• You take a dataset to train your model, and select your hyperparameters
(e.g., k of k-NN) based on the resulting performance.
•
• Might not generalize well.
52
Dataset
How to Determine Hyperparameters? (cont’d)
• Idea #2• Let’s say you are working on face recognition.• You come up with your very own feature extraction/learning algorithm.• For a dataset of interest, you split it into training and test sets.• You train your model with possible hyperparameter choices (e.g., k in k-NN),
and select those work best on test set data.
•
• That’s called cheating…
53
Training set Test set
How to Determine Hyperparameters? (cont’d)
• Idea #3• Let’s say you are working on face recognition.• You come up with your very own feature extraction/learning algorithm.• For the dataset of interest, it is split it into training, validation, and test sets.• You train your model with possible hyperparameter choices (k in k-NN),
and select those work best on the validation set.
•
• OK, but…
54
Training set Test setValidation set
Training set Test setValidation set
How to Determine Hyperparameters? (cont’d)
• Idea #3.5• What if only training and test sets are given, not the validation set?• Cross-validation (or k-fold cross validation)
• Split the training set into k folds with a hyperparameter choice• Keep 1 fold as validation set and the remaining k-1 folds for training• After each of k folds is evaluated, report the average validation performance.• Choose the hyperparameter(s) which result in the highest average validation
performance.
• Take a 4-fold cross-validation as an example…
55
Training set Test set
Fold 1 Test setFold 2 Fold 3 Fold 4
Fold 1 Test setFold 2 Fold 3 Fold 4
Fold 1 Test setFold 2 Fold 3 Fold 4
Fold 1 Test setFold 2 Fold 3 Fold 4
Minor Remarks on NN-based Methods
• k-NN is easy to implement but not of much interest in practice. Why?• Choice of distance metrics might be an issue (see example below)• Measuring distances in high-dimensional spaces might not be a good idea.• Moreover, NN-based methods require lots of and !
(NN-based methods are viewed as data-driven approaches.)
56Image credit: Stanford CS231n
All three images have the same Euclidean distance to the original one.
What We’ve Covered Today…
• From Probability to Bayes Decision Rule• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction • Training, testing, & validation• Linear Classification (next lecture)
57