BERT and its family (speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/BERT train (v8).pdf)

BERT and its family, Hung-yi Lee (李宏毅)

Transcript of "BERT and its family" (speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/BERT train (v8).pdf)

Page 1

BERT and its family, Hung-yi Lee (李宏毅)

Page 2

Pre-train: from text without annotation, obtain a model that can read text.

Fine-tune: with task-specific data with annotation, turn the pre-trained model into a task-specific model (one per downstream task).

Page 3

Page 4

ELMo (Embeddings from Language Models)

BERT (Bidirectional Encoder Representations from Transformers)

ERNIE (Enhanced Representation through Knowledge Integration)

Big Bird (Big Binary Recursive Decoder?)

Grover (Generating aRticles by Only Viewing mEtadata Records)

BERT & PALs (Projected Attention Layers)

Page 5

Outline

• What is a pre-trained model?

• How to fine-tune

• How to pre-train

Page 6

Pre-trained Model

Page 7

Pre-trained Model

Example input: 養 隻 狗 ("raise a dog")

Represent each token by an embedding vector. Tokens of the same type get the same embedding, so this is simply a table look-up over the vocabulary (狗, 貓, 雞, 羊, 豬, ...).

GloVe [Pennington, et al., EMNLP'14], Word2vec [Mikolov, et al., NIPS'13]

Page 8

Pre-trained Model

Example input: 養 隻 狗 ("raise a dog")

Represent each token by an embedding vector; tokens of the same type share the same embedding.

With English words as tokens, a word's embedding can instead be composed from its pieces (e.g. the characters of "reuse"): FastText [Bojanowski, et al., TACL'17]

Page 9

Pre-trained Model

Example input: 養 隻 狗 ("raise a dog")

Represent each token by an embedding vector; tokens of the same type share the same embedding.

With Chinese characters as tokens, the character glyph can be treated as an image and encoded with a CNN [Su, et al., EMNLP'17].

Page 10

Pre-trained Model

In 養隻狗 ("raise a dog") and 單身狗 ("single dog", slang for a single person), the token 狗 gets the same embedding under these methods.

Page 11

Pre-trained Model

But the meaning of 狗 in 養隻狗 and in 單身狗 is different, so the two occurrences should get different embeddings.

Page 12

Pre-trained Model: Contextualized Word Embedding

The model reads the whole sentence before producing each token's embedding, so the 狗 in 養隻狗 and the 狗 in 單身狗 get different vectors.

Page 13

Pre-trained Model

Contextualized word embeddings are produced by a deep model with many layers, built from:

• LSTM

• Self-attention layers

• Tree-based models (?) (Ref: https://youtu.be/z0uOq2wEGcc)
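As a concrete illustration (not from the slides), here is a minimal sketch of pulling contextualized embeddings from a pre-trained BERT, assuming the Hugging Face transformers and torch packages are available; the bert-base-chinese checkpoint and the token-index arithmetic are illustrative choices.

```python
# Sketch: contextualized embeddings via the Hugging Face `transformers` library.
import torch
from transformers import AutoModel, AutoTokenizer

# `bert-base-chinese` is just an illustrative checkpoint choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
model.eval()

def token_embeddings(sentence: str) -> torch.Tensor:
    """Return one contextualized vector per input token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0]          # shape: (num_tokens, hidden_size)

# The same character 狗 gets a different vector in different sentences.
emb_a = token_embeddings("養隻狗")   # "raise a dog"
emb_b = token_embeddings("單身狗")   # "single dog" (slang)
dog_a = emb_a[3]   # tokens: [CLS] 養 隻 狗 [SEP] -> index 3
dog_b = emb_b[3]   # tokens: [CLS] 單 身 狗 [SEP] -> index 3
print(torch.cosine_similarity(dog_a, dog_b, dim=0))   # below 1.0: context-dependent
```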

Page 14

Page 15

Bigger Model (source of image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)

Megatron [Shoeybi, et al., arXiv'19]

Turing NLG

Page 16

Smaller Model

DistilBERT [Sanh, et al., NeurIPS workshop'19]

TinyBERT [Jiao, et al., arXiv'19]

MobileBERT [Sun, et al., ACL'20]

Q8BERT [Zafrir, et al., NeurIPS workshop'19]

ALBERT [Lan, et al., ICLR'20]

Page 17

Smaller Model

Network Compression:

• Network Pruning

• Knowledge Distillation

• Parameter Quantization

• Architecture Design

All of them have been tried.

Ref: https://youtu.be/dPp8rCAnU_A

Excellent reference: http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html

Page 18

Network Architecture

• Transformer-XL: Segment-Level Recurrence with State Reuse [Dai, et al., ACL'19]

• Reformer [Kitaev, et al., ICLR'20]

• Longformer [Beltagy, et al., arXiv'20]

Reformer and Longformer reduce the complexity of self-attention.

Page 19

How to fine-tune

Stack a task-specific layer on top of the pre-trained model (which reads w1 w2 w3 w4) for a specific NLP task.

Page 20

NLP tasks

Input: one sentence, or multiple sentences.

Output: one class, one class for each token, copy from input, or a general sequence.

Page 21

Input

One sentence: feed it to the model directly.

Multiple sentences: concatenate them with a [SEP] token between Sentence 1 (w1 w2) and Sentence 2 (w3 w4 w5), e.g. Query + Document, or Premise + Hypothesis (see the tokenizer sketch below).
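For illustration, a sketch of how a tokenizer packs two sentences with [CLS]/[SEP], assuming the Hugging Face transformers package; bert-base-uncased and the example sentences are arbitrary choices.

```python
# Sketch: packing two sentences into one input with [CLS]/[SEP].
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

premise = "A man is playing a guitar."
hypothesis = "A person is making music."

# Passing two texts produces: [CLS] premise tokens [SEP] hypothesis tokens [SEP]
encoded = tokenizer(premise, hypothesis)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# token_type_ids mark which segment (sentence 1 vs. sentence 2) each token belongs to.
print(encoded["token_type_ids"])
```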

Page 22

Output: one class

Prepend a [CLS] token to the input ([CLS] w1 w2 w3) and feed its output embedding to a task-specific layer that predicts the class; alternatively, feed all the output embeddings to the task-specific layer, which then produces the class. (A minimal classification head is sketched below.)
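A minimal sketch (in PyTorch, not from the lecture) of such a task-specific layer: a linear classifier over the [CLS] output embedding. The hidden size and class count are placeholder values.

```python
# Sketch: a task-specific layer for "one class" output, reading the [CLS] position.
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    def __init__(self, hidden_size: int = 768, num_classes: int = 3):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the pre-trained model
        cls_vector = hidden_states[:, 0]          # the [CLS] position
        return self.classifier(cls_vector)        # (batch, num_classes)

# Usage with dummy encoder outputs:
logits = ClsHead()(torch.randn(2, 10, 768))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))
```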

Page 23

Output: class for each token

Feed the input ([CLS] w1 w2 w3) to the model; a task-specific layer maps each token's output embedding to a class, so there is one class per token.

Page 24

Output: copy from input

• Extraction-based QA

Document: D = {d1, d2, ..., dN}

Query: Q = {q1, q2, ..., qM}

The QA model takes D and Q and outputs two integers (s, e); the answer is the span copied from the document, A = {ds, ..., de}. For example, s = 77 and e = 79.

Page 25

Copy from Input (BERT)

Input: [CLS] q1 q2 [SEP] d1 d2 d3 (question, then document).

A task-specific vector is dot-producted with each document token's output embedding; a softmax over the document positions (e.g. 0.3, 0.5, 0.2) selects the start position, here s = 2.

Page 26

Copy from Input (BERT)

A second task-specific vector is dot-producted with the document token embeddings; its softmax (e.g. 0.1, 0.2, 0.7) selects the end position, here e = 3, so the answer is "d2 d3". (A sketch of this span-extraction head follows.)
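A minimal PyTorch sketch of this head, under the assumption that the start and end scores come from two learned vectors dot-producted with the document-token embeddings; names and shapes are illustrative.

```python
# Sketch: span-extraction head with two learned vectors and a softmax over positions.
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.start_vec = nn.Parameter(torch.randn(hidden_size))
        self.end_vec = nn.Parameter(torch.randn(hidden_size))

    def forward(self, doc_states: torch.Tensor):
        # doc_states: (batch, doc_len, hidden) — embeddings of document tokens only
        start_scores = doc_states @ self.start_vec      # (batch, doc_len)
        end_scores = doc_states @ self.end_vec          # (batch, doc_len)
        return start_scores, end_scores

head = SpanHead()
start_scores, end_scores = head(torch.randn(1, 3, 768))
s = start_scores.softmax(dim=-1).argmax(dim=-1)   # predicted start position
e = end_scores.softmax(dim=-1).argmax(dim=-1)     # predicted end position
# Training: cross entropy between the scores and the gold start/end positions.
```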

Page 27

Output - General Sequence (v1)

• Seq2seq model: use the pre-trained model as the encoder, which reads the input sequence ([CLS] w1 w2 w3); a task-specific decoder attends to the encoder and generates the output sequence (w4 w5 ... <EOS>).

Page 28

Output - General Sequence (v2)

Feed the input sequence (w1 w2) followed by [SEP] to the pre-trained model itself, then keep generating: at each step a task-specific layer maps the current output embedding to the next token (w3, w4, w5, ..., <EOS>), which is appended to the input for the next step.

Page 29

How to fine-tune

Option 1: keep the pre-trained model fixed as a feature extractor and fine-tune only the task-specific layer.

Option 2: fine-tune the pre-trained model together with the task-specific layer, treating them as one gigantic model for the downstream task.

Page 30

[Houlsby, et al., ICML'19] [Stickland, et al., ICML'19]

If the whole pre-trained model is fine-tuned, every downstream task keeps its own full-size copy of the model plus a task-specific layer, so a large amount of memory is needed.

Page 31

Adaptor [Houlsby, et al., ICML'19] [Stickland, et al., ICML'19]

Insert small adaptor (Apt) layers into the pre-trained model and fine-tune only the adaptors and the task-specific layers. All tasks then share a single copy of the pre-trained model and store only their own small adaptors. (A minimal adaptor module is sketched below.)
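A minimal sketch of such an adaptor module in PyTorch, in the spirit of [Houlsby, et al., ICML'19] (a bottleneck with a residual connection); the sizes and the freezing recipe in the comment are illustrative, not the authors' exact design.

```python
# Sketch: a bottleneck adaptor with a residual connection, inserted inside a
# frozen pre-trained model; only these few parameters are fine-tuned per task.
import torch
import torch.nn as nn

class Adaptor(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, hidden_size)     # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))     # residual keeps the original path

# During fine-tuning, freeze the backbone and train only adaptors + task head, e.g.:
# for p in backbone.parameters(): p.requires_grad = False
```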

Page 32

[Houlsby, et al., ICML'19] Source of image: https://arxiv.org/abs/1902.00751

Page 33

[Houlsby, et al., ICML'19] Source of image: https://arxiv.org/abs/1902.00751

Page 34

Weighted Features

Each layer of the pre-trained model produces a feature for every input token (w1 w2 w3): 𝑥1 from Layer 1, 𝑥2 from Layer 2, and so on. The task-specific layer receives the weighted sum 𝑤1𝑥1 + 𝑤2𝑥2, where the scalar weights 𝑤1 and 𝑤2 are learned in the downstream task. (A sketch follows.)
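A minimal PyTorch sketch of the weighted combination; the layer count, shapes, and softmax normalization of the weights are illustrative assumptions.

```python
# Sketch: combine the features from every layer with learned scalar weights.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer

    def forward(self, layer_features: torch.Tensor) -> torch.Tensor:
        # layer_features: (num_layers, batch, seq_len, hidden)
        w = self.weights.softmax(dim=0)                        # normalized weights
        return (w[:, None, None, None] * layer_features).sum(dim=0)

features = torch.randn(12, 2, 10, 768)        # e.g. 12 layers of a BERT-like model
mixed = WeightedLayerSum(12)(features)        # (2, 10, 768), fed to the task layer
```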

Page 35

Why Pre-train Models?

• GLUE scores (source of image: https://arxiv.org/abs/1905.00537)

Page 36

Why Fine-tune?

Source of image: https://arxiv.org/abs/1908.05620 [Hao, et al., EMNLP'19]

Page 37

Why Fine-tune? (How to generate the figures below: https://youtu.be/XysGHdNOTbg)

Source of image: https://arxiv.org/abs/1908.05620 [Hao, et al., EMNLP'19]

Page 38

How to Pre-train

Pre-train on text without annotation to obtain a model that can read text.

Page 39

Pre-training by Translation

• Context Vector (CoVe)

Input: language A; output: language B. The encoder (the model being pre-trained) reads w1 w2 w3 w4, and a decoder, attending to it, produces the translation w5 w6 w7.

This needs sentence pairs for languages A and B.

Page 40

Self-supervised Learning

Supervised: the model maps x to a human-provided label y.

Self-supervised: split the unannotated text x into x' and x''; feed x' to the model and use x'' as the label, so no annotation is needed.

Page 41

Predict Next Token

Given w1 w2 w3 w4, the model produces hidden states h1 h2 h3 h4 and is trained to predict the next token at every position: w2 from h1, w3 from h2, w4 from h3, w5 from h4.

To predict wt+1 from ht: apply a linear transform to ht, take a softmax over the vocabulary, and minimize the cross entropy against the actual wt+1. (A minimal training-step sketch follows.)
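A minimal PyTorch sketch of one training step of this objective, with an LSTM standing in for the pre-trained model; the vocabulary and hidden sizes are placeholders.

```python
# Sketch: next-token prediction = linear transform of h_t, softmax over the
# vocabulary, cross entropy against w_{t+1}.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 256
embed = nn.Embedding(vocab_size, hidden)
encoder = nn.LSTM(hidden, hidden, batch_first=True)   # left-to-right context only
to_vocab = nn.Linear(hidden, vocab_size)              # the linear transform

tokens = torch.randint(0, vocab_size, (1, 8))         # w1 ... w8
h, _ = encoder(embed(tokens[:, :-1]))                 # h_t built from w1..wt
logits = to_vocab(h)                                  # (1, 7, vocab_size)
targets = tokens[:, 1:]                               # w_{t+1} at each position
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```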

Page 42

Predict Next Token

This is exactly how we train language models (LM). Early pre-trained models of this kind use LSTMs:

ELMo [Peters, et al., NAACL'18]

Universal Language Model Fine-tuning (ULMFiT) [Howard, et al., ACL'18]

Page 43

Predict Next Token

Later models use self-attention instead, with a constraint: when predicting w4, each position may only attend to w1 w2 w3 (the tokens before it).

GPT [Alec, et al., 2018]

GPT-2 [Alec, et al., 2019]

Megatron [Shoeybi, et al., arXiv'19]

Turing NLG

Page 44

Predict Next Token: they can do generation. (Demo: https://talktotransformer.com/)

Page 45

Predict Next Token: they can do generation.

Page 46

律師 ("lawyer")

Page 47

混亂 ("chaos")

Page 48

"I forced a bot to watch over 1,000 hours of XXX" is a meme: a human imitating a machine imitating humans!

Page 49

Predicting w5 from w1 w2 w3 w4 means h4 encodes w4 and its left context only. How about the right context!?

"You shall know a word by the company it keeps" (John Rupert Firth)

Page 50

Predict Next Token - Bidirectional

ELMo also runs an LSTM in the reverse direction over w1 ... w7, so both the left and the right context are covered.

Page 51

Masking Input: BERT [Devlin, et al., NAACL'19]

Replace a token (e.g. w2) with the special MASK token or a random token, feed the sequence to a Transformer (with no limitation on self-attention), and train the model to predict the original w2 at that position.

Page 52

Masking Input

The model uses the context on both sides to predict the missing token w2. (A minimal masked-LM sketch follows.)
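A minimal sketch of preparing masked-LM training data in PyTorch; the 15% masking rate and the fraction of random-token replacements are the commonly used values, and the model itself is omitted.

```python
# Sketch: mask random positions and score the model only where tokens were masked.
import torch

def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15):
    labels = tokens.clone()
    masked = torch.rand(tokens.shape) < mask_prob       # positions to corrupt
    labels[~masked] = -100                              # only masked positions are scored
    corrupted = tokens.clone()
    corrupted[masked] = mask_id                         # special MASK token ...
    random_ids = torch.randint(0, vocab_size, tokens.shape)
    use_random = masked & (torch.rand(tokens.shape) < 0.1)
    corrupted[use_random] = random_ids[use_random]      # ... or a random token
    return corrupted, labels

tokens = torch.randint(5, 1000, (1, 12))
corrupted, labels = mask_tokens(tokens, mask_id=4, vocab_size=1000)
# Training: nn.CrossEntropyLoss(ignore_index=-100) on the model's token logits vs. labels.
```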

Page 53

Masking Input

Is random masking good enough?

• Whole Word Masking (WWM) [Cui, et al., arXiv'19]

• Phrase-level & Entity-level masking: Enhanced Representation through Knowledge Integration (ERNIE) [Sun, et al., ACL'19]

Source of image: https://arxiv.org/abs/1906.08101

Page 54: BERT and its familyspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/BERT train (v8).pdf · MASS / BART •The pre-train model is a typical seq2seq model. w 1 2 w 3 w 5 w 6 w 7 w 4 Attention

SpanBert[Joshi, et al., TACL’20]

Source of image: https://arxiv.org/abs/1907.10529

Page 55

SpanBERT - Span Boundary Objective (SBO)

Some consecutive tokens are masked; the SBO uses the embeddings at the left and right boundaries of the masked span, plus a position index (here 3), to predict a token inside the span (here w6).

Page 56

SpanBERT - Span Boundary Objective (SBO)

Another example: from the boundary embeddings and position index 2, the SBO predicts w5. Useful in coreference? (A small SBO-style head is sketched below.)
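A minimal PyTorch sketch of an SBO-style head that predicts a span token from the two boundary embeddings plus a position embedding; the layer sizes and exact architecture are simplifications, not SpanBERT's implementation.

```python
# Sketch: predict a masked span token from boundary embeddings + position index.
import torch
import torch.nn as nn

class SBOHead(nn.Module):
    def __init__(self, hidden: int = 768, vocab_size: int = 30522, max_span: int = 10):
        super().__init__()
        self.pos_embed = nn.Embedding(max_span, hidden)     # position inside the span
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.GELU(), nn.Linear(hidden, vocab_size)
        )

    def forward(self, left_boundary, right_boundary, position):
        # left/right_boundary: (batch, hidden) embeddings at the span boundaries
        # position: (batch,) index of the target token within the span
        p = self.pos_embed(position)
        return self.mlp(torch.cat([left_boundary, right_boundary, p], dim=-1))

logits = SBOHead()(torch.randn(1, 768), torch.randn(1, 768), torch.tensor([2]))
```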

Page 57

XLNet (based on Transformer-XL) [Yang, et al., NeurIPS'19]

Example: 深1 度2 學3 習4 ("deep learning", with position indices). From the language-model view, a standard LM always predicts a token from the tokens before it in the original order.

Page 58

XLNet (based on Transformer-XL) [Yang, et al., NeurIPS'19]

XLNet instead permutes 深1 度2 學3 習4, so a token is predicted from a different, randomly ordered subset of the other tokens in each training example.

Page 59

XLNet (based on Transformer-XL) [Yang, et al., NeurIPS'19]

From the BERT view: the MASKed position in 深1 度2 學3 習4 is predicted while attending to only a random subset of the other tokens rather than the whole sentence.

Page 60

BERT cannot talk?

LM-style: given the partial sequence w1 w2 w3, predict the next token w4. This is what LMs are born for, but it is limited to autoregressive models (non-autoregressive models next time).

BERT-style: given w1 w2 w3 MASK, predict w4. BERT has never seen a partial sequence during pre-training.

Page 61

MASS / BART

• The pre-trained model is a typical seq2seq model: the encoder reads a corrupted version of the input w1 w2 w3 w4, and the decoder, with attention over the encoder, is trained to reconstruct the original input.

MAsked Sequence to Sequence pre-training (MASS) [Song, et al., ICML'19]

Bidirectional and Auto-Regressive Transformers (BART) [Lewis, et al., arXiv'19]

Page 62

Input Corruption

Starting from the input A B [SEP] C D E, BART explores several corruption schemes (MASS only masks tokens):

• Masking (as in MASS): replace random tokens with a mask token.

• Deletion: A B [SEP] C E (delete "D").

• Text Infilling: A B [SEP] [MASK] E (a single mask replaces a span).

• Permutation: C D E [SEP] A B.

• Rotation: D E A B [SEP] C.

Permutation / Rotation do not perform well; Text Infilling is consistently good. (A toy corruption sketch follows.)
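A toy Python sketch of the corruption schemes above on a token list; real BART samples span lengths and operates on full text, so this only illustrates the transformations.

```python
# Sketch: toy versions of the corruption operations on A B [SEP] C D E.
import random

tokens = ["A", "B", "[SEP]", "C", "D", "E"]

def masking(t, prob=0.3):                  # MASS-style: mask random tokens
    return [w if random.random() > prob else "[MASK]" for w in t]

def deletion(t, i=4):                      # delete "D"
    return t[:i] + t[i + 1:]               # -> A B [SEP] C E

def text_infilling(t, start=3, length=2):  # one [MASK] replaces the span "C D"
    return t[:start] + ["[MASK]"] + t[start + length:]   # -> A B [SEP] [MASK] E

def permutation(t):                        # swap the two sentences around [SEP]
    return t[3:] + [t[2]] + t[:2]          # -> C D E [SEP] A B

def rotation(t, k=4):                      # rotate the document to start at "D"
    return t[k:] + t[:k]                   # -> D E A B [SEP] C

print(text_infilling(tokens))              # ['A', 'B', '[SEP]', '[MASK]', 'E']
```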

Page 63

BART/MASS use a separate encoder and decoder: the encoder reads the corrupted input (w1 w2 w3 w4) and the decoder, attending to it, reconstructs the original.

UniLM [Dong, et al., NeurIPS'19] is a single model that is trained simultaneously as an encoder (BERT-style), a decoder (LM-style), and a seq2seq model.

Page 64: BERT and its familyspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/BERT train (v8).pdf · MASS / BART •The pre-train model is a typical seq2seq model. w 1 2 w 3 w 5 w 6 w 7 w 4 Attention

Source of image:https://arxiv.org/pdf/1905.03197.pdf

BERT

GPT

BARTMASS

UniLM

Page 65

Replace or Not?

Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA)

Some tokens of the input are replaced (e.g. one word of "the chef cooked the meal" is swapped for another word such as "ate"); for every position the model predicts YES/NO: was this token replaced? Predicting yes/no is easier than reconstruction, and every output position is used.

Page 66

Where do the replacements come from? A small BERT-style masked LM fills in masked positions; its predictions (which may differ from the original tokens) become the replacements that the main model must detect. Note: this is not a GAN. (A minimal replaced-token-detection sketch follows.)
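A minimal PyTorch sketch of replaced-token detection; the "generator" here is just random replacement at masked positions rather than a small masked LM, and the discriminator is a stand-in, so this illustrates the objective only.

```python
# Sketch: every position gets a YES/NO "was this token replaced?" label and a BCE loss.
import torch
import torch.nn as nn

vocab, hidden, seq_len = 1000, 128, 8
tokens = torch.randint(5, vocab, (1, seq_len))

# Stand-in for the small masked-LM generator: random proposals at masked positions.
masked = torch.rand(tokens.shape) < 0.15
proposals = torch.randint(5, vocab, tokens.shape)
corrupted = torch.where(masked, proposals, tokens)

# Labels: YES (1) where the token differs from the original, NO (0) elsewhere.
replaced = (corrupted != tokens).float()

# Stand-in discriminator: embed tokens and score each position as replaced / not.
discriminator = nn.Sequential(nn.Embedding(vocab, hidden), nn.Linear(hidden, 1))
scores = discriminator(corrupted).squeeze(-1)        # (1, seq_len)
loss = nn.BCEWithLogitsLoss()(scores, replaced)      # every position contributes
loss.backward()
```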

Page 67

Source of image: https://arxiv.org/abs/2003.10555

Page 68

Sentence Level

Besides a representation for each token (w1 w2 w3 w4), we sometimes need a representation for the whole sequence.

Page 69

You shall know a sentence by the company it keeps?

Skip Thought: an encoder reads a sentence (w1 w2 w3) and a decoder is trained to generate the next sentence (w4 w5 w6 w7).

Quick Thought: two sentences are encoded separately, and the similarity of their embeddings is used to predict whether they are consecutive (is the second one the next sentence? Yes).

Page 70

NSP: next sentence prediction. Feed [CLS] w1 w2 [SEP] w3 w4 w5 to the model and predict Yes/No from [CLS]: do the two sentences follow each other? This is used in the original BERT; Robustly optimized BERT approach (RoBERTa) [Liu, et al., arXiv'19] finds it not very helpful.

SOP: sentence order prediction, used in ALBERT. The two sentences are always consecutive; the model predicts whether they appear in the correct order. structBERT (Alice) [Wang, et al., ICLR'20] uses a related objective. (A sketch of building NSP/SOP training pairs follows.)
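A toy Python sketch of how NSP and SOP training pairs could be constructed; the sentence list and 50/50 sampling are illustrative.

```python
# Sketch: NSP pairs mix real next sentences with random ones; SOP pairs are always
# consecutive but possibly swapped.
import random

sentences = ["s1", "s2", "s3", "s4"]   # consecutive sentences from a document

def nsp_pair(i):
    """NSP: 50% actual next sentence (Yes), 50% a random sentence (No)."""
    if random.random() < 0.5:
        return sentences[i], sentences[i + 1], "Yes"
    return sentences[i], random.choice(sentences), "No"

def sop_pair(i):
    """SOP: always consecutive sentences; predict whether the order is correct."""
    a, b = sentences[i], sentences[i + 1]
    if random.random() < 0.5:
        return a, b, "correct order"
    return b, a, "swapped"

print(nsp_pair(0), sop_pair(0))
```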

Page 71

T5 - Comparison

• Text-to-Text Transfer Transformer (T5) [Raffel, et al., arXiv'19]

• Colossal Clean Crawled Corpus (C4)

Page 72

Knowledge

• Enhanced Language RepresentatioN with Informative Entities (ERNIE): pre-trained model + knowledge. This is another story ……

Page 73

Audio BERT: a BERT for speech (e.g. the audio of 深 度 學 習, "deep learning"). This is another story ……

Page 74

Reference

• [Lewis, et al., arXiv'19] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, arXiv, 2019

• [Raffel, et al., arXiv'19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, arXiv, 2019

• [Joshi, et al., TACL'20] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy, SpanBERT: Improving Pre-training by Representing and Predicting Spans, TACL, 2020

• [Song, et al., ICML'19] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, MASS: Masked Sequence to Sequence Pre-training for Language Generation, ICML, 2019

• [Zafrir, et al., NeurIPS workshop'19] Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat, Q8BERT: Quantized 8Bit BERT, NeurIPS workshop, 2019

Page 75

Reference

• [Houlsby, et al., ICML'19] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly, Parameter-Efficient Transfer Learning for NLP, ICML, 2019

• [Hao, et al., EMNLP'19] Yaru Hao, Li Dong, Furu Wei, Ke Xu, Visualizing and Understanding the Effectiveness of BERT, EMNLP, 2019

• [Liu, et al., arXiv'19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv, 2019

• [Sanh, et al., NeurIPS workshop'19] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, NeurIPS workshop, 2019

• [Jiao, et al., arXiv'19] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv, 2019

Page 76

Reference

• [Shoeybi, et al., arXiv'19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, arXiv, 2019

• [Lan, et al., ICLR'20] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR, 2020

• [Kitaev, et al., ICLR'20] Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya, Reformer: The Efficient Transformer, ICLR, 2020

• [Beltagy, et al., arXiv'20] Iz Beltagy, Matthew E. Peters, Arman Cohan, Longformer: The Long-Document Transformer, arXiv, 2020

• [Dai, et al., ACL'19] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, ACL, 2019

• [Peters, et al., NAACL'18] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, Deep contextualized word representations, NAACL, 2018

Page 77

Reference

• [Sun, et al., ACL'20] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, ACL, 2020

Page 78

Reference

• [Pennington, et al., EMNLP'14] Jeffrey Pennington, Richard Socher, Christopher Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014

• [Mikolov, et al., NIPS'13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean, Distributed Representations of Words and Phrases and their Compositionality, NIPS, 2013

• [Bojanowski, et al., TACL'17] Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, TACL, 2017

• [Su, et al., EMNLP'17] Tzu-Ray Su, Hung-Yi Lee, Learning Chinese Word Representations From Glyphs Of Characters, EMNLP, 2017

• [Liu, et al., ACL'19] Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Multi-Task Deep Neural Networks for Natural Language Understanding, ACL, 2019

• [Stickland, et al., ICML'19] Asa Cooper Stickland, Iain Murray, BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning, ICML, 2019

Page 79

Reference

• [Howard, et al., ACL'18] Jeremy Howard, Sebastian Ruder, Universal Language Model Fine-tuning for Text Classification, ACL, 2018

• [Alec, et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, Improving Language Understanding by Generative Pre-Training, 2018

• [Devlin, et al., NAACL'19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL, 2019

• [Alec, et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, Language Models are Unsupervised Multitask Learners, 2019

• [Wang, et al., ICLR'20] Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, Luo Si, StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding, ICLR, 2020

• [Yang, et al., NeurIPS'19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le, XLNet: Generalized Autoregressive Pretraining for Language Understanding, NeurIPS, 2019

Page 80

Reference

• [Cui, et al., arXiv'19] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu, Pre-Training with Whole Word Masking for Chinese BERT, arXiv, 2019

• [Sun, et al., ACL'19] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu, ERNIE: Enhanced Representation through Knowledge Integration, ACL, 2019

• [Dong, et al., NeurIPS'19] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS, 2019