
Deep Learning for Computer Vision, Fall 2020

http://vllab.ee.ntu.edu.tw/dlcv.html (Public website)

https://cool.ntu.edu.tw/courses/3368 (NTU COOL; for grade, etc.)

Yu-Chiang Frank Wang 王鈺強, Professor

Dept. Electrical Engineering, National Taiwan University

2020/09/22

Tight (yet tentative) Schedule

Week | Date    | Topic                                                                                             | Remarks
1    | 9/15    | Course Logistics                                                                                  |
2    | 9/22    | Machine Learning 101                                                                              |
3    | 9/29    | Convolutional Neural Network (I)                                                                  | HW #1 out
4    | 10/6    | Convolutional Neural Network (II): Visualization & Extensions of CNN                              |
5    | 10/13   | Tutorials on Python, Github, etc. (by TAs); Invited Talk by Prof. Philipp Krähenbühl (UT Austin)  | HW #1 due
6    | 10/20   | Object Detection & Segmentation                                                                   | HW #2 out
7    | 10/27   | Generative Models & Generative Adversarial Network (GAN)                                          |
8    | 11/3    | Transfer Learning for Visual Classification & Synthesis (I)                                       |
9    | 11/10   | Transfer Learning for Visual Classification & Synthesis (II); Representation Disentanglement      | HW #2 due; HW #3 out
10   | 11/17   | TBD (CVPR Week)                                                                                   |
11   | 11/24   | Recurrent Neural Networks & Transformer                                                           |
12   | 12/1    | Meta-Learning; Few-Shot and Zero-Shot Classification (I)                                          | HW #3 due
13   | 12/8    | Meta-Learning; Few-Shot and Zero-Shot Classification (II)                                         | HW #4 out
14   | 12/15   | From Domain Adaptation to Domain Generalization                                                   | Team-up for Final Projects
15   | 12/22   | Beyond 2D Vision (3D and Depth)                                                                   |
16   | 12/29   | Image Inpainting and Outpainting; Guest Lecture                                                   | HW #4 due
17   | 1/5     | Guest Lectures                                                                                    |
     | 1/18-22 | Presentation for Final Projects                                                                   | TBD

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Example: Testing/Screening of COVID-19

[Figure: two score distributions plotted on a common axis, one for "Positive for COVID" and one for "Negative for COVID".]

• Distributions of positive/negative test results (e.g., PCR, antibody, etc.):
  the further away they are from each other, the more accurate the COVID diagnosis.

Bayesian Decision Theory

• Fundamental statistical approach to classification/detection tasks

• Take a 2-class classification/detection task as an example:
  • Let's see if a student would pass or fail the course of DLCV.
  • Define a probabilistic variable ω to describe the case of pass or fail.
  • That is, ω = ω1 for pass, and ω = ω2 for fail.

• Prior Probability
  • The a priori or prior probability reflects our knowledge of how likely we expect a certain state of nature to be before any observation.
  • P(ω = ω1), or simply P(ω1), is the prior that the next student would pass DLCV.
  • The priors must exhibit exclusivity and exhaustivity, i.e., $P(\omega_1) + P(\omega_2) = 1$.
• Equal priors
  • If we have equal numbers of students pass/fail DLCV, then the priors are equal; in other words, the priors are uniform.

Prior Probability (cont’d)

• Decision rule based on priors only
  • If the only available info is the prior, and the cost of any type of incorrect classification is equal, what would be a reasonable decision rule?
  • Decide ω1 if $P(\omega_1) > P(\omega_2)$; otherwise decide ω2.
  • What's the incorrect classification rate (or error rate) Pe? With this rule, $P_e = \min\{P(\omega_1),\, P(\omega_2)\}$.

Class-Conditional Probability Density (or Likelihood)

• The probability density function (PDF) for an input/observation x given a state of nature ω is written as $p(\mathbf{x}|\omega)$.
• Here are (hopefully) the hypothetical class-conditional densities reflecting the time spent on DLCV by the students who eventually pass/fail this course.

Posterior Probability & Bayes Formula

• If we know the prior distribution and the class-conditional density, can we come up with a better decision rule?
• Yes We Can! By calculating the posterior probability.
• Posterior probability $P(\omega|\mathbf{x})$:
  • The probability of a certain state of nature ω given an observable x.
• Bayes formula:
  $$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)\, P(\omega_j)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{C} p(\mathbf{x}|\omega_j)\, P(\omega_j).$$
  And we have $\sum_{j=1}^{C} P(\omega_j|\mathbf{x}) = 1$.
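Below is a minimal Python sketch of the Bayes formula for this 2-class pass/fail example. The priors and the Gaussian class-conditional densities (over hypothetical hours/week spent on DLCV) are made-up numbers for illustration, not course data.

```python
from scipy.stats import norm

# Hypothetical priors and class-conditional densities (illustration only).
priors = {"pass": 0.6, "fail": 0.4}
likelihoods = {
    "pass": norm(loc=10.0, scale=2.0),  # p(x | pass): hours/week given "pass"
    "fail": norm(loc=4.0, scale=2.0),   # p(x | fail): hours/week given "fail"
}

def posteriors(x):
    """Bayes formula: P(w_j | x) = p(x | w_j) P(w_j) / p(x)."""
    joint = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(joint.values())      # p(x) = sum_j p(x | w_j) P(w_j)
    return {w: joint[w] / evidence for w in joint}

post = posteriors(7.0)
print(post)                             # the posteriors sum to 1
print(max(post, key=post.get))          # decide the class with larger posterior
```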

Decision Rule & Probability of Error

• For a given observable x (e.g., # of GPUs), the decision rule (to take DLCV or not) will now be based on the posteriors: decide ω1 if $P(\omega_1|\mathbf{x}) > P(\omega_2|\mathbf{x})$; otherwise decide ω2.
• What's the probability of error P(error) (or Pe)? For a particular x, $P(\text{error}|\mathbf{x}) = \min\{P(\omega_1|\mathbf{x}),\, P(\omega_2|\mathbf{x})\}$; averaging over x gives $P_e = \int P(\text{error}|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$.

From Bayes Decision Rule to Detection Theory

• Hit (detection, TP), false alarm (FA, FP), miss (false reject, FN), correct rejection (TN)
• Receiver Operating Characteristic (ROC)
  • To assess the effectiveness of the designed features/classifiers
  • False-alarm rate (PFA, FP rate) vs. detection rate (PD, TP rate)
  • Which curve/line makes sense? (a), (b), or (c)?

[Figure: three candidate ROC curves (a), (b), and (c), plotting the detection rate PD against the false-alarm rate PFA.]
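To see why only one of the candidate ROC curves makes sense, here is a hedged sketch that traces an ROC curve by sweeping a decision threshold over detector scores; the Gaussian score distributions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic detector scores: positives score higher on average (assumed).
pos = rng.normal(2.0, 1.0, 1000)   # scores for ground-truth positives
neg = rng.normal(0.0, 1.0, 1000)   # scores for ground-truth negatives

# Sweep a decision threshold t: predict "positive" whenever score > t.
thresholds = np.linspace(-4.0, 6.0, 101)
p_d  = np.array([(pos > t).mean() for t in thresholds])  # detection rate P_D
p_fa = np.array([(neg > t).mean() for t in thresholds])  # false-alarm rate P_FA

# A useful detector keeps P_D >= P_FA at every threshold, i.e., its ROC curve
# lies above the diagonal P_D = P_FA (the chance-level line).
for t, d, f in zip(thresholds[::20], p_d[::20], p_fa[::20]):
    print(f"t={t:+.1f}  P_FA={f:.2f}  P_D={d:.2f}")
```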

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Clustering

• Clustering is an unsupervised algorithm.
  • Given: a set of N unlabeled instances {x1, …, xN}; # of clusters K
  • Goal: group the samples into K partitions
• Remarks:
  • High within-cluster (intra-cluster) similarity
  • Low between-cluster (inter-cluster) similarity
  • But… how to determine a proper similarity measure?

Similarity is NOT Always Objective…


Clustering (cont’d)

• Similarity:
  • A key component/measure to perform data clustering
  • Inversely proportional to distance
  • Example distance metrics:
    • Euclidean distance (L2 norm): $d(\mathbf{x},\mathbf{z}) = \|\mathbf{x}-\mathbf{z}\|_2 = \sqrt{\sum_{i=1}^{D}(x_i - z_i)^2}$
    • Manhattan distance (L1 norm): $d(\mathbf{x},\mathbf{z}) = \|\mathbf{x}-\mathbf{z}\|_1 = \sum_{i=1}^{D}|x_i - z_i|$
  • Note that the p-norm of x is denoted as $\|\mathbf{x}\|_p = \big(\sum_{i=1}^{D}|x_i|^p\big)^{1/p}$.
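A quick NumPy check of these metrics; `p_norm` is a hypothetical helper implementing the p-norm definition above, not a library call.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 3.0])

# Euclidean (L2) and Manhattan (L1) distances via NumPy's built-in norm.
d2 = np.linalg.norm(x - z, ord=2)   # sqrt(sum_i (x_i - z_i)^2)
d1 = np.linalg.norm(x - z, ord=1)   # sum_i |x_i - z_i|

def p_norm(v, p):
    """General p-norm: ||v||_p = (sum_i |v_i|^p)^(1/p)."""
    return (np.abs(v) ** p).sum() ** (1.0 / p)

assert np.isclose(d2, p_norm(x - z, 2))
assert np.isclose(d1, p_norm(x - z, 1))
print(d1, d2)                       # 3.0, ~2.236
```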

Clustering (cont’d)

• Similarity:
  • A key component/measure to perform data clustering
  • Inversely proportional to distance
  • Example distance metrics:
    • Kernelized (non-linear) distance:
      $$d(\mathbf{x},\mathbf{z}) = \|\Phi(\mathbf{x})-\Phi(\mathbf{z})\|_2^2 = \|\Phi(\mathbf{x})\|_2^2 + \|\Phi(\mathbf{z})\|_2^2 - 2\,\Phi(\mathbf{x})^T\Phi(\mathbf{z})$$
    • Taking the Gaussian kernel as an example, $K(\mathbf{x},\mathbf{z}) = \Phi(\mathbf{x})^T\Phi(\mathbf{z}) = \exp\!\big(-\tfrac{\|\mathbf{x}-\mathbf{z}\|_2^2}{2\sigma^2}\big)$, and since $K(\mathbf{x},\mathbf{x}) = 1$, we have
      $$d(\mathbf{x},\mathbf{z}) = 2 - 2\exp\!\big(-\tfrac{\|\mathbf{x}-\mathbf{z}\|_2^2}{2\sigma^2}\big).$$
      And the distance is more sensitive to larger/smaller σ. Why? (See the sketch below.)
    • For example, L2 or kernelized distance metrics for the following two cases?
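The σ-sensitivity question can be checked numerically. A minimal sketch, assuming the Gaussian kernel above: since K(x, x) = 1, the kernelized distance reduces to 2 − 2K(x, z).

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||_2^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernelized_distance(x, z, sigma):
    """||Phi(x) - Phi(z)||_2^2 = K(x,x) + K(z,z) - 2 K(x,z) = 2 - 2 K(x,z)."""
    return 2.0 - 2.0 * gaussian_kernel(x, z, sigma)

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])
# Small sigma pushes all distinct points toward the maximum distance 2;
# large sigma pushes all distances toward 0.
for sigma in (0.1, 1.0, 10.0):
    print(sigma, kernelized_distance(x, z, sigma))
```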

K-Means Clustering

• Input: N examples {x1, . . . , xN} (xn ∈ RD); number of partitions K
• Initialize: K cluster centers µ1, . . . , µK. Several initialization options:
  • Randomly initialize µ1, . . . , µK anywhere in RD
  • Or, simply choose any K examples as the cluster centers
• Iterate:
  • Assign each example xn to its closest cluster center
  • Recompute the new cluster centers µk (mean/centroid of the set Ck)
  • Repeat until convergence
• Possible convergence criteria:
  • Cluster centers do not change anymore
  • Max. number of iterations reached
• Output: K clusters (with centers/means of each cluster); see the sketch below.
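Below is a minimal NumPy sketch of this procedure (written for this note, not course-provided code), run on synthetic 2D blobs; the empty-cluster guard is a pragmatic addition beyond the slide's algorithm.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means on an N x D data matrix X; returns centers and assignments."""
    rng = np.random.default_rng(seed)
    # Initialize: simply choose K examples as the cluster centers.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iters):
        # Assignment step: each example goes to its closest center (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each center as the mean/centroid of its cluster,
        # keeping the old center if a cluster happens to go empty (simple guard).
        new_centers = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):  # converged: centers unchanged
            break
        centers = new_centers
    return centers, assign

# Two well-separated synthetic blobs, so K = 2 should recover the generating means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers, assign = kmeans(X, K=2)
print(centers)
```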

K-Means Clustering

• Example (K = 2), step by step:
  • Iteration #1: pick the cluster centers
  • Iteration #1-2: assign data to each cluster
  • Iteration #2-1: update the cluster centers
  • Iteration #2-2: assign data to each cluster
  • Iterations #3-#4: repeat the update/assign steps
  • Iteration #5: the cluster means no longer change, so the algorithm stops

[Figure sequence: 2D data colored by cluster assignment at each of the steps above.]

K-Means Clustering (cont’d)

• Limitations
  • Preferable for round-shaped clusters with similar sizes
  • Sensitive to initialization; how to alleviate this problem?
  • Sensitive to outliers; possible change from K-means to…
  • Hard assignment only. Mathematically, we have…
• Remarks
  • Expectation-maximization (EM) algorithm
  • Speed-up possible by hierarchical clustering (e.g., 100 = 10² clusters: 10 coarse clusters, each refined into 10)

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Dimension Reduction

• Principal Component Analysis (PCA)
  • Unsupervised & linear dimension reduction
  • Related to Eigenfaces and similar feature extraction/classification techniques
  • Still very popular due to its simplicity and effectiveness
  • Goal: determine the projection so that the variation of the projected data is maximized.

[Figure: 2D data in the x-y plane, with the axis that describes the largest variation for data projected onto it.]

Formulation & Derivation for PCA

• Input: a set of instances x without label info
• Output: a projection vector ω maximizing the variance of the projected data

Formulation & Derivation for PCA (cont’d)

• We need to maximize $\operatorname{var}(\boldsymbol\omega^T\mathbf{x})$ subject to $\|\boldsymbol\omega\| = 1$. Since $\operatorname{var}(\boldsymbol\omega^T\mathbf{x}) = \boldsymbol\omega^T\Sigma\,\boldsymbol\omega$ for the data covariance Σ, this is a constrained optimization problem.
• Lagrangian optimization for PCA:
  $$\max_{\boldsymbol\omega}\; \boldsymbol\omega^T\Sigma\,\boldsymbol\omega - \lambda\,(\boldsymbol\omega^T\boldsymbol\omega - 1).$$
  Setting the derivative w.r.t. ω to zero yields $\Sigma\boldsymbol\omega = \lambda\boldsymbol\omega$: the optimal projection is an eigenvector of Σ, and the variance it captures is the corresponding eigenvalue λ.

Eigenanalysis & PCA

• Eigenanalysis for PCA = find the eigenvectors ωi and the corresponding eigenvalues λi.
  • In other words, the direction ωi captures a variance of λi.
  • But which eigenvectors to use? All of them?
• A d × d covariance matrix Σ contains a maximum of d eigenvector/eigenvalue pairs.
  • Do we need to compute all of them? Which ωi and λi pairs should we use?
  • Assuming you have N images of size M × M pixels, we have dimension d = M².
  • What is the rank of Σ? Is it N, M, or d?
  • Thus, at most N (in fact N − 1, after subtracting the mean) non-zero eigenvalues can be obtained. Why?

Eigenanalysis & PCA (cont’d)

• A d × d covariance matrix contains a maximum of d eigenvector/eigenvalue pairs.
• How is dimension reduction realized? How do we reconstruct the input data?
• Expanding a signal via eigenvectors as bases:
  • With symmetric matrices (e.g., the covariance matrix), eigenvectors are orthogonal.
  • They can be regarded as unit basis vectors spanning any instance in the d-dim space.
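Putting the last three slides together, here is a hedged NumPy sketch of eigenanalysis-based PCA with projection and reconstruction; the random matrix stands in for N vectorized images, and `pca` is a helper written for this note, not a library routine.

```python
import numpy as np

def pca(X, num_components):
    """PCA via eigendecomposition of the d x d covariance of X (N x d)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)                 # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort by descending variance
    W = eigvecs[:, order[:num_components]]   # d x k projection matrix
    return mean, W

rng = np.random.default_rng(0)
X = rng.normal(size=(65, 50))                # stand-in for 65 vectorized images
mean, W = pca(X, num_components=10)

# Orthonormal eigenvectors: project to k dims, then expand back as bases.
codes = (X - mean) @ W                       # reduced k-dim representation
X_hat = mean + codes @ W.T                   # reconstruction from k components
print(((X - X_hat) ** 2).mean())             # reconstruction MSE (drops as k grows)
```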

Practical Issues in PCA (if time permits…)

• Assume we have N = 100 images of size 200 × 200 pixels (i.e., d = 40000).
  • What is the size of the covariance matrix? What's its rank?
• What can we do? The Gram matrix trick! (See the sketch below.)
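A sketch of the Gram matrix trick under the stated assumptions (N = 100, d = 40000): instead of eigendecomposing the 40000 × 40000 covariance matrix (whose rank is at most N − 1), we decompose the small 100 × 100 Gram matrix and map its eigenvectors back.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40000))     # N = 100 images, d = 200*200 = 40000
Xc = X - X.mean(axis=0)               # mean-centered data, one image per row

# Direct route would need the d x d covariance Xc.T @ Xc / N (40000 x 40000).
# Gram trick: eigendecompose the N x N Gram matrix G = Xc @ Xc.T instead.
# If G v = lam v, then (Xc.T Xc)(Xc.T v) = lam (Xc.T v), so Xc.T v is an
# eigenvector of the (unnormalized) covariance with the same eigenvalue.
G = Xc @ Xc.T                         # 100 x 100: cheap to build and decompose
eigvals, V = np.linalg.eigh(G)
keep = eigvals > 1e-8                 # drop (near-)zero eigenvalues (rank <= N-1)
U = Xc.T @ V[:, keep]                 # map back to d-dimensional eigenvectors
U /= np.linalg.norm(U, axis=0)        # re-normalize columns to unit length
print(U.shape)                        # (40000, 99): at most N - 1 components
```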

Let’s See an Example (CMU AMP Face Database)

• Let's take 5 face images × 13 people = 65 images, each of size 64 × 64 = 4096 pixels.
• The # of eigenvectors expected for perfectly reconstructing the input is 64 (= 65 − 1, after subtracting the mean).
• Let's check it out!

What Do the Eigenvectors/Eigenfaces Look Like?

[Figure: the mean face and the first 15 eigenfaces, labeled Mean and V1-V15.]

All 64 eigenvectors: do we need them all?

Reconstruction error (MSE) vs. number of eigenvectors used:

# eigenvectors | MSE
1              | 1233.16
2              | 1027.63
3              | 758.13
4              | 634.54
8              | 285.08
20             | 87.93
30             | 20.55
50             | 2.14
60             | 0.06
64             | 0.00

[Figure sequence: the reconstructed faces become visually indistinguishable from the inputs as more eigenvectors are used.]

Final Remarks

• Linear & unsupervised dimension reduction
• PCA can be applied as a feature extraction/preprocessing technique.
  • E.g., use the top 3 eigenvectors to project data into a 3D space for classification.

[Figure: data from three classes (Class 1, Class 2, Class 3) projected onto the top 3 eigenvectors and visualized as a 3D scatter plot.]

Final Remarks (cont’d)

• How do we perform classification? For example…
  • Given a test face input, project it into the same 3D space (by the same 3 eigenvectors).
  • The resulting 3D vector is the feature vector for this test input.
  • We can do a simple Nearest Neighbor (NN) classification (with Euclidean distance), which calculates the distance between the input and all training data in this space.
  • If NN (1-NN), the label of the closest training instance determines the classification output.
  • If k-nearest neighbors (k-NN), the k nearest neighbors vote for the decision. (See the sketch below.)
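A minimal sketch of the NN/k-NN rule described above; the random 3D features stand in for training images already projected by the same 3 eigenvectors.

```python
import numpy as np

def knn_predict(x, train_feats, train_labels, k=1):
    """k-NN with Euclidean distance in the (PCA-)reduced feature space."""
    dists = np.linalg.norm(train_feats - x, axis=1)  # distance to all training data
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    values, counts = np.unique(train_labels[nearest], return_counts=True)
    return values[counts.argmax()]                   # majority vote (1-NN if k=1)

# Stand-in data: 3D features as if projected by the top-3 eigenvectors.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(60, 3))
train_labels = rng.integers(0, 3, size=60)           # 3 hypothetical classes
test_feat = rng.normal(size=3)                       # projected test input
print(knn_predict(test_feat, train_feats, train_labels, k=5))
```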

[Figure: k-NN decision boundaries for k = 1, 3, 5. Image credit: Stanford CS231n.]

Demo available at http://vision.stanford.edu/teaching/cs231n-demos/knn/

Final Remarks (cont’d)

• If labels for the data are provided → Linear Discriminant Analysis (LDA)
  • LDA is also known as Fisher's discriminant analysis.
  • Eigenface vs. Fisherface (IEEE Trans. PAMI 1997)
• If linear DR is not sufficient, and non-linear DR is of interest…
  • Isomap, locally linear embedding (LLE), etc.
  • t-distributed stochastic neighbor embedding (t-SNE) (by G. Hinton & L. van der Maaten)

What’s to Be Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification

Hyperparameters in ML

• Recall that for k-NN, we need to determine the k value in advance.
  • What is the best k value?
  • Or, take PCA for example: what is the best reduced dimension number?
• Hyperparameters: parameter choices for the learning model/algorithm
  • We need to determine such hyperparameters instead of guessing.
  • Let's see what we can and cannot do…

[Figure: k-NN decision boundaries for k = 1, 3, 5. Image credit: Stanford CS231n.]

How to Determine Hyperparameters?

• Idea #1
  • Let's say you are working on face recognition.
  • You come up with your very own feature extraction/learning algorithm.
  • You take a dataset to train your model, and select your hyperparameters (e.g., k of k-NN) based on the resulting performance.
  • Might not generalize well.

[Diagram: the entire dataset is used for both training and hyperparameter selection.]

How to Determine Hyperparameters? (cont’d)

• Idea #2
  • Let's say you are working on face recognition.
  • You come up with your very own feature extraction/learning algorithm.
  • For a dataset of interest, you split it into training and test sets.
  • You train your model with possible hyperparameter choices (e.g., k in k-NN), and select those that work best on test set data.
  • That's called cheating…

[Diagram: the dataset split into a training set and a test set.]

How to Determine Hyperparameters? (cont’d)

• Idea #3
  • Let's say you are working on face recognition.
  • You come up with your very own feature extraction/learning algorithm.
  • For the dataset of interest, you split it into training, validation, and test sets.
  • You train your model with possible hyperparameter choices (k in k-NN), and select those that work best on the validation set.
  • OK, but…

[Diagram: the dataset split into training, validation, and test sets.]

How to Determine Hyperparameters? (cont’d)

• Idea #3.5
  • What if only training and test sets are given, not the validation set?
  • Cross-validation (or k-fold cross-validation):
    • For each hyperparameter choice, split the training set into k folds.
    • Keep 1 fold as the validation set and the remaining k − 1 folds for training.
    • After each of the k folds has served as the validation set, report the average validation performance.
    • Choose the hyperparameter(s) which result in the highest average validation performance.
  • Take 4-fold cross-validation as an example… (a code sketch follows the diagram below)

[Diagram: the training set is split into Fold 1-Fold 4, with the test set held out; in each of 4 rounds, a different fold serves as the validation set while the remaining three folds are used for training.]
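Here is a hedged sketch of k-fold cross-validation matching the recipe above; `train_fn` and `eval_fn` are hypothetical callables (your training routine and your validation metric), not part of any library.

```python
import numpy as np

def cross_validate(train_fn, eval_fn, X, y, hyperparam_choices, k=4, seed=0):
    """k-fold CV: average validation score per hyperparameter choice."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = {}
    for h in hyperparam_choices:
        fold_scores = []
        for i in range(k):
            val = folds[i]                                  # 1 fold for validation
            trn = np.concatenate(folds[:i] + folds[i+1:])   # k-1 folds for training
            model = train_fn(X[trn], y[trn], h)
            fold_scores.append(eval_fn(model, X[val], y[val]))
        scores[h] = float(np.mean(fold_scores))             # average over k folds
    best = max(scores, key=scores.get)                      # highest average score
    return best, scores

# Usage sketch (train_knn and accuracy are hypothetical callables you supply):
# best_k, scores = cross_validate(train_knn, accuracy, X_train, y_train,
#                                 hyperparam_choices=[1, 3, 5, 7], k=4)
```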

Minor Remarks on NN-based Methods

• k-NN is easy to implement but not of much interest in practice. Why?
  • The choice of distance metric might be an issue (see the example below).
  • Measuring distances in high-dimensional spaces might not be a good idea.
  • Moreover, NN-based methods require lots of memory and computation at test time!
    (NN-based methods are viewed as data-driven approaches.)

[Figure: an original image next to three altered versions; all three altered images have the same Euclidean distance to the original one. Image credit: Stanford CS231n.]

What We’ve Covered Today…

• From Probability to Bayes Decision Rule
• Unsupervised vs. Supervised Learning
• Clustering & Dimension Reduction
• Training, testing, & validation
• Linear Classification (next lecture)