[PR12] understanding deep learning requires rethinking generalization


Transcript of [PR12] understanding deep learning requires rethinking generalization

Page 1: [PR12] understanding deep learning requires rethinking generalization

Understanding Deep Learning Requires Rethinking Generalization

Understanding the paper together with PR12

Jaejun Yoo, Ph.D. Candidate @KAIST

PR12

20th Jan, 2018

Page 2: [PR12] understanding deep learning requires rethinking generalization

Today’s contents

Understanding deep learning requires rethinking generalization, by C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals

Nov. 2016: https://arxiv.org/abs/1611.03530

ICLR 2017 Best Paper


Page 3: [PR12] understanding deep learning requires rethinking generalization

Questions

Why do large neural networks generalize well in practice?

Can traditional theories of generalization actually explain the results we are seeing these days?

What is it then that distinguishes neural networks that generalize well from those that don’t?


Page 6: [PR12] understanding deep learning requires rethinking generalization

Questions

Why do large neural networks generalize well in practice?

Can traditional theories of generalization actually explain the results we are seeing these days?

What is it then that distinguishes neural networks that generalize well from those that don’t?

“Deep neural networks TOO easily fit random labels.”

Page 7: [PR12] understanding deep learning requires rethinking generalization

Conventional wisdom

Small generalization error due to
• Model family
• Various regularization techniques

“Generalization error”

Strategy: find a model that has as few parameters as possible while achieving the minimum training error.

Page 8: [PR12] understanding deep learning requires rethinking generalization

Conventional wisdom

???

Small generalization error due to
• Model family
• Various regularization techniques

“Generalization error”

Page 9: [PR12] understanding deep learning requires rethinking generalization

Effective capacity of neural networks

[Figure: parameter count vs. number of training samples, with test error, for MLP 1×512 (p/n = 24), AlexNet (p/n = 28), Inception (p/n = 33), and Wide ResNet (p/n = 179).]

Page 10: [PR12] understanding deep learning requires rethinking generalization

Effective capacity of neural networks


If counting the number of parameters is not a useful way to measure model complexity, then how can we measure the effective capacity of the model?
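For concreteness, these p/n ratios come from simply counting trainable parameters and dividing by the number of training samples. Below is a minimal PyTorch sketch of that count (my own illustration; the 28 × 28 × 3 input size and the count_params helper are assumptions, chosen because they reproduce the slide's p/n ≈ 24 for the MLP 1×512):

import torch.nn as nn

# Hypothetical helper (not from the paper's code): count trainable parameters.
def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# MLP 1x512 for 10 classes; the 28x28x3 input size is an assumption that matches the
# slide's p/n = 24 (with full 32x32x3 inputs the ratio comes out a bit higher, ~32).
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28 * 3, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

n_train = 50_000                 # CIFAR-10 training set size
p = count_params(mlp)
print(p, round(p / n_train))     # 1,209,866 parameters -> p/n ~ 24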

Page 11: [PR12] understanding deep learning requires rethinking generalization

Randomization test

Fitting random labels and pixels

Page 12: [PR12] understanding deep learning requires rethinking generalization

Randomization test

Naïve intuition: learning is impossible, e.g., training not converging or slowing down substantially.

Fitting random labels and pixels
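A minimal sketch of the label-randomization experiment (my own illustration with illustrative hyperparameters, not the authors' code; their reference implementation is listed in the references, https://github.com/pluskid/fitting-random-labels): replace every training label with a uniformly random class, change nothing else, and watch a standard network still drive the training error toward zero.

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Replace every CIFAR-10 training label with a uniformly random class; everything else
# (model, optimizer, data pipeline) is unchanged. Hyperparameters are illustrative only.
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=T.ToTensor())
rng = np.random.RandomState(0)
train_set.targets = rng.randint(0, 10, size=len(train_set.targets)).tolist()

loader = DataLoader(train_set, batch_size=128, shuffle=True)
model = torchvision.models.resnet18(num_classes=10)   # any sufficiently large net will do
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):   # takes longer than with true labels, but only by a small constant factor
    correct, total = 0, 0
    for x, y in loader:
        opt.zero_grad()
        out = model(x)
        loss_fn(out, y).backward()
        opt.step()
        correct += (out.argmax(1) == y).sum().item()
        total += y.numel()
    print(f"epoch {epoch}: training accuracy on random labels = {correct / total:.3f}")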


Page 14: [PR12] understanding deep learning requires rethinking generalization

Implications

Rademacher complexity and VC-dimension

Uniform stability

“How sensitive the algorithm is to the replacement of a single example”


Page 16: [PR12] understanding deep learning requires rethinking generalization

Implications

Rademacher complexity and VC-dimension

Uniform stability

“How sensitive the algorithm is to the replacement of a single example”

Solely a property of the algorithm (nothing to do with the data)
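For reference (a standard definition not spelled out on the slide), the empirical Rademacher complexity the paper appeals to is

$\hat{\mathfrak{R}}_n(\mathcal{H}) \;=\; \mathbb{E}_{\sigma}\!\left[\,\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, h(x_i)\right], \qquad \sigma_1,\dots,\sigma_n \ \text{i.i.d. uniform on } \{\pm 1\}.$

Because the experiments show these networks can fit essentially arbitrary random labelings, they realize (almost) every sign pattern on the training set, so $\hat{\mathfrak{R}}_n(\mathcal{H}) \approx 1$ and the resulting generalization bound is vacuous; VC-dimension and uniform-stability arguments run into analogous problems.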

Page 17: [PR12] understanding deep learning requires rethinking generalization

Implications

• The effective capacity of neural networks is sufficient for memorizing the entire data set.

• Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels

• Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged

Summary

Page 18: [PR12] understanding deep learning requires rethinking generalization

The role of regularization

Shall we throw in some regularization or something?

What about regularization techniques?

When your model keeps overfitting, this is how you can survive.

위기탈출 넘버원 (Crisis Escape No. 1)

Page 19: [PR12] understanding deep learning requires rethinking generalization

The role of regularization

What about regularization techniques?

Page 20: [PR12] understanding deep learning requires rethinking generalization

The role of regularization

What about regularization techniques?

Even without them, it works quite well?

Page 21: [PR12] understanding deep learning requires rethinking generalization

The role of regularization

What about regularization techniques?

Even with them, it still overfits (fits random labels) just fine?

Page 22: [PR12] understanding deep learning requires rethinking generalization

The role of regularization

What about regularization techniques?

Regularization certainly helps generalization but it is NOT the fundamental reason for generalization.

Even with them, it still overfits (fits random labels) just fine?
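A minimal sketch of the kind of ablation behind this observation (my own illustration; model choice and hyperparameters are assumptions, not the paper's exact setup): train the same network twice, once with explicit regularizers such as weight decay and data augmentation, once without, and compare test accuracy.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Hypothetical helpers for a two-run comparison: same model, same optimizer, with the
# explicit regularizers (weight decay + data augmentation) switched on or off.
def make_loader(train, augment):
    tfs = [T.ToTensor()]
    if augment:
        tfs = [T.RandomCrop(32, padding=4), T.RandomHorizontalFlip()] + tfs
    ds = torchvision.datasets.CIFAR10("./data", train=train, download=True,
                                      transform=T.Compose(tfs))
    return DataLoader(ds, batch_size=128, shuffle=train)

def train_and_eval(weight_decay, augment, epochs=30):
    model = torchvision.models.resnet18(num_classes=10)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                          weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    train_loader, test_loader = make_loader(True, augment), make_loader(False, False)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    model.eval()
    with torch.no_grad():
        correct = sum((model(x).argmax(1) == y).sum().item() for x, y in test_loader)
    return correct / 10_000

print("with explicit regularizers:   ", train_and_eval(weight_decay=5e-4, augment=True))
print("without explicit regularizers:", train_and_eval(weight_decay=0.0, augment=False))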

Page 23: [PR12] understanding deep learning requires rethinking generalization

Finite sample expressivity

Much effort has gone into studying the expressivity of NNs

However, almost all of these results are at the “population level”

Showing what functions of the entire domain can and cannot be represented by certain classes of NNs with the same number of parameters

What is more relevant in practice is the expressive power of NNs on a finite sample of size n

Page 24: [PR12] understanding deep learning requires rethinking generalization

Finite sample expressivity

The expressive power of NNs on a finite sample of size n?

“As soon as the number of parameters p of a network is greater than n, even simple two-layer neural networks can represent any function of the input sample.”

"(2n + d)의 weight를 가지고, 활성화 함수로 ReLU를 사용하는 2층 뉴럴 네트워크는d차원의 n개의 샘플에 대한 어떠한 함수든지 표현할 수 있다."

Page 25: [PR12] understanding deep learning requires rethinking generalization

Finite sample expressivity

Proof)

Lower triangular matrices…

• are invertible if and only if the diagonal elements are nonzero
• have their eigenvalues taken directly from the diagonal elements

$\exists\, \mathbf{A}^{-1} \ \because\ \operatorname{rank}(\mathbf{A}) = n$

Page 26: [PR12] understanding deep learning requires rethinking generalization

Finite sample expressivity

Proof)

"(2n + d)의 weight를 가지고, 활성화 함수로 ReLU를 사용하는 2층 뉴럴 네트워크는d차원의 n개의 샘플에 대한 어떠한 함수든지 표현할 수 있다."


Page 28: [PR12] understanding deep learning requires rethinking generalization

Finite sample expressivity

Proof)

"(2n + d)의 weight를 가지고, 활성화 함수로 ReLU를 사용하는 2층 뉴럴 네트워크는d차원의 n개의 샘플에 대한 어떠한 함수든지 표현할 수 있다."

$y = \mathbf{A}w, \quad \exists\, \mathbf{A}^{-1} \ \because\ \operatorname{rank}(\mathbf{A}) = n$ (by Lemma 1)
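A small numerical sketch of this construction (my own illustration, assuming the samples are in general position): project the samples onto a random direction $a$, place the ReLU offsets $b_j$ between consecutive projections so that $\mathbf{A}_{ij} = \max(a^{\mathsf{T}}x_i - b_j,\,0)$ is lower triangular with a positive diagonal, and solve $\mathbf{A}w = y$. The weights are $a \in \mathbb{R}^d$, $b \in \mathbb{R}^n$, $w \in \mathbb{R}^n$, i.e., $2n + d$ in total.

import numpy as np

# My own numerical illustration of the Theorem-1 construction (assumes the samples are
# in general position, which holds with probability 1 for a random projection).
rng = np.random.RandomState(0)
n, d = 20, 5
X = rng.randn(n, d)                    # n samples in d dimensions
y = rng.randn(n)                       # arbitrary real targets

a = rng.randn(d)                       # random projection direction (d weights)
order = np.argsort(X @ a)
X, y = X[order], y[order]              # sort samples by their projection
z = X @ a                              # now z[0] < z[1] < ... < z[n-1]

b = np.empty(n)                        # ReLU offsets (n weights)
b[0] = z[0] - 1.0
b[1:] = (z[:-1] + z[1:]) / 2.0         # b_j lies strictly between z_{j-1} and z_j

A = np.maximum(z[:, None] - b[None, :], 0.0)   # A_ij = ReLU(a.x_i - b_j): lower triangular,
w = np.linalg.solve(A, y)                      # positive diagonal => invertible (Lemma 1)

def f(Xq):
    """Two-layer ReLU net with 2n + d weights: f(x) = sum_j w_j * ReLU(a.x - b_j)."""
    return np.maximum((Xq @ a)[:, None] - b[None, :], 0.0) @ w

print(np.max(np.abs(f(X) - y)))        # ~0 up to numerical precision: exact fit of all n points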

Page 29: [PR12] understanding deep learning requires rethinking generalization

Implicit regularization

We do not quite know why neural nets work so well, but can we get some insight from linear models?

An appeal to linear models

Page 30: [PR12] understanding deep learning requires rethinking generalization

Implicit regularization

An appeal to linear models

But do all global minima generalize equally well? How do we distinguish one from another?

“Curvature”

Page 31: [PR12] understanding deep learning requires rethinking generalization

Implicit regularization

An appeal to linear models

But do all global minima generalize equally well? How do we distinguish one from another?

“Curvature”

In linear case, curvature of all optimal solutions is the same

∵ the Hessian of the loss function does not depend on the choice of $w$

Page 32: [PR12] understanding deep learning requires rethinking generalization

Implicit regularization

The algorithm itself acts as a kind of constraint.

Here, implicit regularization = SGD algorithm

SGD converges to the minimum-norm solution.

If curvature doesn’t distinguish global minima, what does?

Page 33: [PR12] understanding deep learning requires rethinking generalization

The algorithm itself acts as a kind of constraint.

Here, implicit regularization = SGD algorithm

SGD converges to the minimum-norm solution.

Implicit regularization

If curvature doesn’t distinguish global minima, what does?

$w_{\mathrm{mn}} = \mathbf{X}^{\mathsf{T}} (\mathbf{X}\mathbf{X}^{\mathsf{T}})^{-1} y$

$\mathbf{X}\mathbf{X}^{\mathsf{T}} \in \mathbb{R}^{n \times n}$ — a kind of kernel, the Gram matrix $K$
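A minimal sketch (my own illustration) of the minimum-norm solution computed through the Gram matrix, matching the formula above. Under gradient descent or SGD initialized at zero, every update lies in the row space of $\mathbf{X}$, which is why the algorithm converges to exactly this interpolant among the infinitely many that fit the data.

import numpy as np

# My own illustration: compute the minimum-norm interpolating solution via the Gram matrix
# in the over-parameterized linear case (p > n, so infinitely many w fit the data exactly).
rng = np.random.RandomState(0)
n, p = 100, 1000                       # far more parameters than samples
X = rng.randn(n, p)
y = rng.randn(n)

K = X @ X.T                            # n x n Gram ("kernel") matrix  X X^T
alpha = np.linalg.solve(K, y)          # solve  X X^T alpha = y
w_mn = X.T @ alpha                     # w_mn = X^T (X X^T)^{-1} y

print(np.max(np.abs(X @ w_mn - y)))    # ~0: zero training error
print(np.linalg.norm(w_mn))            # smallest L2 norm among all interpolating solutions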

Page 34: [PR12] understanding deep learning requires rethinking generalization

Implicit regularization

Quite surprisingly…

The solution obtained in this simple way has a remarkably low error!

Page 35: [PR12] understanding deep learning requires rethinking generalization

Implicit regularization

Quite surprisingly…

The solution obtained in this simple way has a remarkably low error!

RBF kernel

$\mathbf{X}\mathbf{X}^{\mathsf{T}}\alpha = y$
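A small sketch of the same recipe with a Gaussian (RBF) kernel (my own illustration on synthetic data; the paper runs the analogous experiment on image pixels): build the kernel matrix, solve $K\alpha = y$ with no explicit regularization, and predict with the kernel expansion $f(x) = \sum_i \alpha_i\, k(x, x_i)$.

import numpy as np

# My own illustration: kernel "ridgeless" regression with a Gaussian (RBF) kernel,
# i.e., solve K alpha = y exactly, with no explicit regularization term.
def rbf_kernel(A, B, gamma=0.02):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.RandomState(0)
n, d = 500, 50
X_train, y_train = rng.randn(n, d), rng.randn(n)

K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K, y_train)          # interpolates the training data exactly

X_test = rng.randn(10, d)
pred = rbf_kernel(X_test, X_train) @ alpha   # kernel prediction on new points
print(np.max(np.abs(K @ alpha - y_train)))   # ~0 on the training set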


Page 37: [PR12] understanding deep learning requires rethinking generalization

• Simple experimental framework for understanding the effective capacity of deep learning models

• Successful deep nets are able to overfit the training set

• Other formal measures of complexity for the models / algorithms / data distributions are needed to precisely explain the over-parameterized regime

Conclusion

Page 38: [PR12] understanding deep learning requires rethinking generalization

We believe that …

“understanding neural networks requires rethinking generalization.”

Conclusion

Page 39: [PR12] understanding deep learning requires rethinking generalization

References

• https://arxiv.org/pdf/1611.03530.pdf (paper)

• https://openreview.net/forum?id=Sy8gdB9xx (open review comments)

• https://github.com/pluskid/fitting-random-labels (code)

• http://pluskid.org/slides/ICLR2017-Poster.pdf (poster)

• https://www.youtube.com/watch?v=kCj51pTQPKI (presentation, YouTube)

• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12 (SlideShare: Kor)

• https://danieltakeshi.github.io/2017/05/19/understanding-deep-learning-requires-rethinking-generalization-my-thoughts-and-notes (blog)

Page 40: [PR12] understanding deep learning requires rethinking generalization

Things to discuss…

• The effective capacity of neural networks is sufficient for memorizing the entire data set.

• Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels

• Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged

Summary

Really???

Page 41: [PR12] understanding deep learning requires rethinking generalization
Page 42: [PR12] understanding deep learning requires rethinking generalization

The standard supervised learning setting: the training and test domains are assumed to be the same.

Statistical Learning Theory: SVM, …

Page 43: [PR12] understanding deep learning requires rethinking generalization
Page 44: [PR12] understanding deep learning requires rethinking generalization
Page 45: [PR12] understanding deep learning requires rethinking generalization
Page 46: [PR12] understanding deep learning requires rethinking generalization
Page 47: [PR12] understanding deep learning requires rethinking generalization

Electronics customer reviews (X) / positive or negative labels (Y)

Page 48: [PR12] understanding deep learning requires rethinking generalization

Electronics customer reviews (X) / positive or negative labels (Y)

Video game customer reviews (X)

Page 49: [PR12] understanding deep learning requires rethinking generalization

Electronics customer reviews (X) / positive or negative labels (Y)

Video game customer reviews (X)
From the hypothesis space of functions H represented by NNs…

Page 50: [PR12] understanding deep learning requires rethinking generalization

Electronics customer reviews (X) / positive or negative labels (Y)

Video game customer reviews (X)

We train a classifier h: the target labels are unknown, but we want an h that assigns labels well on both the source (X, Y) and the target (X) domains.

From the hypothesis space of functions H represented by NNs…

Page 51: [PR12] understanding deep learning requires rethinking generalization

Previous strategy: find a model that has as few parameters as possible while achieving the minimum training error.

Page 52: [PR12] understanding deep learning requires rethinking generalization

Now the training domain (source) and the testing domain (target) are different.

In addition to the previous strategy, we need an extra strategy.

Page 53: [PR12] understanding deep learning requires rethinking generalization

A Computable Adaptation Bound

Divergence estimation complexity

Dependent on number of unlabeled samples

Page 54: [PR12] understanding deep learning requires rethinking generalization

The optimal joint hypothesis $h^{*}$ is the hypothesis with minimal combined (source + target) error, and $\lambda$ is that error.
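For reference, the bound these closing slides appear to build on is the standard domain-adaptation result of Ben-David et al. (2010): for every $h \in \mathcal{H}$,

$\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda, \qquad \lambda \;=\; \min_{h'\in\mathcal{H}} \big(\epsilon_S(h') + \epsilon_T(h')\big),$

where the $\mathcal{H}\Delta\mathcal{H}$-divergence term can be estimated from finite unlabeled samples of the two domains (this is what makes the bound computable), and $\lambda$ is the combined error of the optimal joint hypothesis $h^{*}$ mentioned above.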