
BERT and its family
Hung-yi Lee 李宏毅
Source: speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/BERT train (v8).pdf

Pre-train: a model that can read text is first trained on text without annotation.
Fine-tune: the pre-trained model is then adapted with task-specific annotated data; each downstream task adds its own task-specific model on top.

• ELMo (Embeddings from Language Models)
• BERT (Bidirectional Encoder Representations from Transformers)
• ERNIE (Enhanced Representation through Knowledge Integration)
• Big Bird (Big Binary Recursive Decoder?)
• Grover (Generating aRticles by Only Viewing mEtadata Records)
• BERT & PALs (Projected Attention Layers)

Outline
• What is a pre-trained model
• How to fine-tune
• How to pre-train

Pre-trained Model

Represent each token by an embedding vector, e.g., for the sentence 養 隻 狗 ("raise a dog"). Tokens of the same type always get the same embedding, so this is simply a table look-up over the vocabulary (狗 dog, 貓 cat, 雞 chicken, 羊 sheep, 豬 pig, ……). Examples: GloVe [Pennington, et al., EMNLP'14] and Word2vec [Mikolov, et al., NIPS'13].
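The table look-up can be made concrete with a tiny sketch; the vocabulary, dimensions, and random vectors below are made up purely for illustration.

```python
import numpy as np

# Hypothetical vocabulary and embedding table: one fixed vector per token type.
vocab = {"養": 0, "隻": 1, "狗": 2, "貓": 3, "雞": 4}
embedding_table = np.random.randn(len(vocab), 4)  # 5 types, 4-dim embeddings

def embed(sentence):
    """Classic (non-contextual) embedding: simply look up each token's row."""
    return np.stack([embedding_table[vocab[tok]] for tok in sentence])

e1 = embed(["養", "隻", "狗"])   # "raise a dog"
e2 = embed(["貓", "隻", "狗"])
# The same token type always gets the same vector, regardless of context.
assert np.allclose(e1[2], e2[2])
```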

With English words as tokens, the vocabulary becomes enormous and rare or unseen words get no embedding. FastText [Bojanowski, et al., TACL'17] therefore builds a word's embedding from its character pieces (e.g., r e u s e), so subword information is shared across words.
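A minimal sketch of the FastText idea: sum the vectors of a word's character n-grams. The bucket count, n-gram range, and hashing here are simplified placeholders, not the library's actual configuration.

```python
import numpy as np

DIM, BUCKETS = 8, 1000
ngram_table = np.random.randn(BUCKETS, DIM)  # one vector per (hashed) n-gram

def char_ngrams(word, n_min=3, n_max=5):
    w = f"<{word}>"  # boundary markers, as in FastText
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    """Sum the vectors of the word's character n-grams (hashed into buckets)."""
    grams = char_ngrams(word)
    idx = [hash(g) % BUCKETS for g in grams]
    return ngram_table[idx].sum(axis=0)

# Even an unseen word like "reuse" gets a vector built from its subwords.
print(char_ngrams("reuse")[:5])
print(word_vector("reuse").shape)
```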

With Chinese characters as tokens, the glyph of a character can be treated as an image: a CNN reads the character image and produces its embedding [Su, et al., EMNLP'17].

A limitation of such embeddings: the token 狗 ("dog") in 養 隻 狗 ("raise a dog") and in 單 身 狗 ("single dog", slang for a single person) receives the same vector even though its meaning differs. Contextualized Word Embedding fixes this: the model reads the whole sentence before assigning each token a vector, so the same token type gets different embeddings in different contexts.
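One quick way to observe contextualized embeddings, assuming the Hugging Face transformers package is installed; the checkpoint name is an assumption for illustration, not something the slides prescribe.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any BERT-like Chinese model would illustrate the same point.
name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def token_vector(sentence, char):
    """Return the contextual embedding of `char` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, hidden)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(char)]

v1 = token_vector("養隻狗", "狗")   # "raise a dog"
v2 = token_vector("單身狗", "狗")   # "single dog" (slang for a single person)
# Same character, different contexts -> different vectors.
print(torch.cosine_similarity(v1, v2, dim=0))
```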

The contextualized encoder usually has many layers and can be built from:
• LSTM
• Self-attention layers
• Tree-based models (?) (Ref: https://youtu.be/z0uOq2wEGcc)

Bigger Model: Megatron [Shoeybi, et al., arXiv'19], Turing NLG (source of image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/).

Smaller Model: DistilBERT [Sanh, et al., NeurIPS workshop'19], TinyBERT [Jiao, et al., arXiv'19], MobileBERT [Sun, et al., ACL'20], Q8BERT [Zafrir, et al., NeurIPS workshop'19], ALBERT [Lan, et al., ICLR'20].

How do we get a smaller model? Network compression:
• Network Pruning
• Knowledge Distillation
• Parameter Quantization
• Architecture Design
All of them have been tried on BERT (a distillation sketch follows below). Ref: https://youtu.be/dPp8rCAnU_A. Excellent reference: http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html
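As an example of one compression route, here is a minimal sketch of a knowledge-distillation loss. The temperature, the mixing weight, and the toy logits are placeholder choices, not the recipe of any particular paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix hard-label cross entropy with KL to the teacher's softened outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft

# Toy usage with random logits for a 3-class problem.
s = torch.randn(4, 3, requires_grad=True)
t = torch.randn(4, 3)
y = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(s, t, y)
loss.backward()
```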

Network Architecture
• Transformer-XL: Segment-Level Recurrence with State Reuse [Dai, et al., ACL'19]
• Reformer [Kitaev, et al., ICLR'20] and Longformer [Beltagy, et al., arXiv'20]: reduce the complexity of self-attention

How to fine-tune

For a specific NLP task, a task-specific layer is stacked on top of the pre-trained model, which reads the input tokens w1 w2 w3 w4. NLP tasks can be categorized by their input and output:
• Input: one sentence, or multiple sentences
• Output: one class, a class for each token, copy from input, or a general sequence

Input
• One sentence: feed it to the model directly.
• Multiple sentences: concatenate them with a [SEP] token in between (Sentence 1 w1 w2 [SEP] Sentence 2 w3 w4 w5), e.g., Query and Document, or Premise and Hypothesis.

Output: one class. Prepend a special [CLS] token to the input ([CLS] w1 w2 w3); the model's embedding at the [CLS] position is fed into the task-specific layer, which outputs the class.
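A minimal sketch of this "one class" setup: a linear task-specific layer on top of the [CLS] embedding. The encoder output is faked with random tensors here; in practice it would come from the pre-trained model, and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden, num_classes = 768, 3

class ClsHead(nn.Module):
    """Task-specific layer: map the [CLS] vector to class logits."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(hidden, num_classes)

    def forward(self, encoder_output):          # (batch, seq_len, hidden)
        cls_vec = encoder_output[:, 0]           # position 0 is [CLS]
        return self.linear(cls_vec)

encoder_output = torch.randn(2, 5, hidden)       # stand-in for BERT's output
logits = ClsHead()(encoder_output)
print(logits.shape)                              # (2, num_classes)
```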

Output: class for each token. Each input token ([CLS] w1 w2 w3) gets its own embedding from the model, and the task-specific layer maps every token embedding to its own class (class class class), as in sequence labeling.

Output: copy from input, e.g., Extraction-based QA.
Document: D = d1, d2, ⋯, dN
Query: Q = q1, q2, ⋯, qM
The QA model takes D and Q as input and outputs two integers (s, e); the answer is the span copied from the document, A = ds, ⋯, de (e.g., s = 77, e = 79).

Copy from Input (BERT): the question q1 q2 and the document d1 d2 d3 are concatenated as [CLS] q1 q2 [SEP] d1 d2 d3 and fed to the model. A task-specific "start" vector is dot-multiplied with each document-token embedding, and a softmax over the scores (e.g., 0.3, 0.5, 0.2) selects the start position s = 2. A second "end" vector does the same (e.g., 0.2, 0.1, 0.7) and selects the end position e = 3. The answer is "d2 d3".
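A minimal sketch of that span-extraction head: two learned vectors are dotted with every document-token embedding, and softmaxes over the scores pick the start and end positions. Dimensions, names, and the random document embeddings are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden = 768

class SpanHead(nn.Module):
    """Task-specific start/end vectors for extraction-based QA."""
    def __init__(self):
        super().__init__()
        self.start_vec = nn.Parameter(torch.randn(hidden))
        self.end_vec = nn.Parameter(torch.randn(hidden))

    def forward(self, doc_embeddings):             # (num_doc_tokens, hidden)
        start_scores = doc_embeddings @ self.start_vec
        end_scores = doc_embeddings @ self.end_vec
        s = torch.softmax(start_scores, dim=0).argmax().item()
        e = torch.softmax(end_scores, dim=0).argmax().item()
        return s, e                                  # answer = document[s : e + 1]

doc = torch.randn(3, hidden)                         # embeddings of d1 d2 d3
print(SpanHead()(doc))
```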

Output – General Sequence (v1): seq2seq. Use the pre-trained model as the encoder: it reads the input sequence [CLS] w1 w2 w3, and a task-specific decoder attends to the encoder outputs and generates the output sequence w4 w5 <EOS>. The drawback is that this decoder gets no benefit from pre-training.

Output – General Sequence (v2): use the pre-trained model itself as the decoder. Feed the input sequence w1 w2 followed by [SEP]; the task-specific layer turns the model's output at that position into the first output token w3, which is appended to the input so the model can produce w4, then w5, and so on until <EOS>.

How should the combined network (pre-trained model plus task-specific layer, reading w1 w2 w3) be trained on the downstream task?
• Feature Extractor: fix (freeze) the pre-trained model and train only the task-specific layer.
• Fine-tune: update the whole thing, pre-trained model included; the result is a gigantic model for the down-stream task.

Adapters [Houlsby, et al., ICML'19] [Stickland, et al., ICML'19]: if the whole model is fine-tuned, every task ends up with its own copy of the huge pre-trained model, so a large amount of memory is needed. With adapters, small adapter (Apt) layers are inserted into the pre-trained model, and during fine-tuning only the adapters and the task-specific layer are updated. All tasks then share one set of pre-trained parameters and differ only in their small adapters. (Source of image: https://arxiv.org/abs/1902.00751)
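A minimal sketch of an adapter block in the spirit of [Houlsby, et al., ICML'19]: a bottleneck with a residual connection. The hidden and bottleneck sizes are arbitrary assumptions, and the frozen-parameter line is only indicative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus residual."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

x = torch.randn(2, 5, 768)          # output of some pre-trained layer
print(Adapter()(x).shape)

# During fine-tuning the pre-trained weights would be frozen, e.g.:
#   for p in pretrained_model.parameters(): p.requires_grad = False
# and only the adapters plus the task-specific layer are updated.
```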

Weighted Features: the pre-trained model produces a representation at every layer (x1 from Layer 1, x2 from Layer 2, for input w1 w2 w3). The feature passed to the task-specific layer is the weighted sum w1·x1 + w2·x2, where the scalar weights w1 and w2 are learned in the down-stream task.
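A minimal sketch of this weighted-feature idea: learnable scalars combine the per-layer outputs and are trained with the downstream task. Normalizing the weights with a softmax, as done here, is one common choice rather than something the slide mandates.

```python
import torch
import torch.nn as nn

class WeightedLayers(nn.Module):
    """Learn w1, w2, ... and return the weighted sum of layer outputs."""
    def __init__(self, num_layers):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))   # learned downstream

    def forward(self, layer_outputs):           # list of (batch, seq, hidden)
        stacked = torch.stack(layer_outputs)     # (num_layers, batch, seq, hidden)
        weights = torch.softmax(self.w, dim=0)
        return (weights[:, None, None, None] * stacked).sum(dim=0)

layers = [torch.randn(2, 5, 768) for _ in range(2)]      # x1, x2 from two layers
print(WeightedLayers(num_layers=2)(layers).shape)
```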

Why Pre-trained Models?
• GLUE scores improved substantially once pre-trained models were adopted (source of image: https://arxiv.org/abs/1905.00537).

Why Fine-tune? The figures in [Hao, et al., EMNLP'19] visualize training loss and the loss landscape, showing that starting from a pre-trained model gives faster optimization and flatter, better-generalizing minima (source of image: https://arxiv.org/abs/1908.05620; how to generate the figures: https://youtu.be/XysGHdNOTbg).

How to Pre-train

Recall the goal: use text without annotation to pre-train a model that can read text.

Pre-training by Translation
• Context Vector (CoVe): train a translation model whose encoder (the model we want to pre-train) reads w1 w2 w3 w4 in language A and whose decoder outputs w5 w6 w7 in language B. The drawback: this needs sentence pairs for languages A and B.

Self-supervised Learning: in supervised learning, the model maps an input x to a label y that comes from annotation. In self-supervised learning the supervision comes from the data itself: x is split into x′ and x′′, the model takes x′ as input, and x′′ serves as the label. This is how text without annotation can be used for pre-training.

Predict Next Token: the model reads w1 w2 w3 w4 and produces hidden states h1 h2 h3 h4. From each ht, a linear transform followed by a softmax predicts the next token wt+1, trained with cross entropy against the actual next tokens w2 w3 w4 w5.

This is exactly how we train language models (LM). Early pre-trained models of this kind are built on LSTMs: ELMo [Peters, et al., NAACL'18] and Universal Language Model Fine-tuning (ULMFiT) [Howard, et al., ACL'18].
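A minimal sketch of the "predict next token" objective: hidden states from positions 1..T-1 go through a linear transform and a softmax (via cross entropy) against tokens 2..T. The tiny LSTM, vocabulary size, and random data are placeholders.

```python
import torch
import torch.nn as nn

vocab, hidden = 100, 32
embed = nn.Embedding(vocab, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)
out_proj = nn.Linear(hidden, vocab)               # the "linear transform"

tokens = torch.randint(0, vocab, (4, 10))         # a batch of token-id sequences
h, _ = lstm(embed(tokens[:, :-1]))                # h_t encodes w_1 .. w_t
logits = out_proj(h)                              # predict w_{t+1} from h_t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
print(loss.item())
```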

Later next-token predictors replace the LSTM with self-attention: GPT [Alec, et al., 2018], GPT-2 [Alec, et al., 2019], Megatron [Shoeybi, et al., arXiv'19], Turing NLG. When predicting w4 from w1 w2 w3, the self-attention must be used with a constraint: each position may only attend to the tokens before it (see the sketch below).
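The "constraint" is a causal mask: position t may attend only to positions up to t. A minimal sketch with arbitrary sizes and random scores:

```python
import torch

seq_len = 4
# True above the diagonal = positions that must NOT be attended to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.randn(seq_len, seq_len)                     # raw attention scores
scores = scores.masked_fill(causal_mask, float("-inf"))    # block future positions
attn = torch.softmax(scores, dim=-1)
print(attn)   # row t has non-zero weights only on positions 0..t
```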

Because they are trained to predict the next token, these models can do generation: https://talktotransformer.com/. (Examples shown in the lecture: 律師 "lawyer", 混亂 "chaos".) And the "I forced a bot to watch over 1,000 hours of XXX" posts are themselves a meme: humans imitating machines imitating humans!!!

"You shall know a word by the company it keeps" (John Rupert Firth). When predicting w5, the hidden state h4 encodes w4 and its left context w1 w2 w3 only. How about the right context!? ELMo (Predict Next Token, Bidirectional) also trains a right-to-left LSTM over w1 … w7 and combines the forward and backward hidden states.

Masking Input: BERT [Devlin, et al., NAACL'19]. In the input w1 w2 w3 w4, a token such as w2 is replaced either by a special MASK token or by a random token, and the model, a Transformer with no limitation on self-attention, uses the context on both sides to predict the missing token.
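A minimal sketch of how masking the input can be implemented: some positions are chosen at random, replaced by a MASK id or a random token, and only those positions contribute to the loss. BERT's exact 15% / 80-10-10 scheme is simplified here, and the vocabulary is a toy assumption.

```python
import torch

VOCAB, MASK_ID = 100, 99

def corrupt(tokens, p=0.15):
    """Replace ~p of the positions with MASK or a random token; return targets."""
    tokens = tokens.clone()
    chosen = torch.rand(tokens.shape) < p
    targets = torch.where(chosen, tokens, torch.full_like(tokens, -100))  # -100 = ignore
    use_random = torch.rand(tokens.shape) < 0.1
    replacement = torch.where(use_random,
                              torch.randint(0, VOCAB, tokens.shape),
                              torch.full_like(tokens, MASK_ID))
    tokens[chosen] = replacement[chosen]
    return tokens, targets

x = torch.randint(0, VOCAB - 1, (1, 12))
corrupted, targets = corrupt(x)
# The model would read `corrupted` and be trained (cross entropy) to predict
# `targets` wherever targets != -100, i.e., only at the masked positions.
print(corrupted, targets)
```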

Masking Input: is random masking good enough? Variants mask larger units:
• Whole Word Masking (WWM) [Cui, et al., arXiv'19] (source of image: https://arxiv.org/abs/1906.08101)
• Phrase-level & Entity-level masking: Enhanced Representation through Knowledge Integration (ERNIE) [Sun, et al., ACL'19]
• SpanBERT [Joshi, et al., TACL'20]: mask contiguous spans (source of image: https://arxiv.org/abs/1907.10529)

SpanBERT's Span Boundary Objective (SBO): after a contiguous span of tokens is masked, the SBO takes the embeddings of the tokens at the two boundaries of the masked span plus a position index inside the span, and predicts the masked token at that position (in the figure, index 3 recovers w6 and index 2 recovers w5). Recovering a span from its boundaries is expected to be useful for tasks such as coreference.

XLNet [Yang, et al., NeurIPS'19], built on Transformer-XL, shuffles the prediction order. For the sentence 深1 度2 學3 習4 ("deep learning", with position indices attached), tokens are predicted in a random permutation, so each prediction conditions on a random subset of the other tokens and their positions. Seen from BERT's perspective, the MASKed token in 深1 度2 學3 習4 is predicted from differently sampled parts of the context rather than from the whole remaining sentence.

BERT cannot talk? Generation means: given a partial sequence w1 w2 w3, predict the next token w4.
• LM-style models: this is exactly what the LM was born for.
• BERT-style models: they have never seen a partial sequence during training; putting MASK at the end (w1 w2 w3 MASK → w4) is a mismatch with pre-training.
(This discussion is limited to autoregressive models; non-autoregressive models next time.)

MASS / BART
• The pre-trained model is a typical seq2seq model: the encoder reads a corrupted version of the input w1 w2 w3 w4, and the decoder, attending to the encoder, must reconstruct the original input. Without corruption the model could simply copy its input and would learn nothing.
• MAsked Sequence to Sequence pre-training (MASS) [Song, et al., ICML'19]
• Bidirectional and Auto-Regressive Transformers (BART) [Lewis, et al., arXiv'19]

Input Corruption (BART): starting from the original input A B [SEP] C D E, BART tries several corruptions:
• MASS-style masking: A B [SEP] C D E with some tokens masked out
• Token deletion: A B [SEP] C E (delete "D")
• Text Infilling: a span is replaced by a single mask that must be filled back in (e.g., A B [SEP] E with the span "C D" masked out)
• Permutation: C D E [SEP] A B
• Rotation: D E A B [SEP] C
Permutation and rotation do not perform well; Text Infilling is consistently good.
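A minimal sketch of text infilling on a token list: one contiguous span is replaced by a single mask symbol, and the seq2seq model would be trained to reconstruct the original sequence. The fixed span length and the mask symbol are simplifying assumptions (BART samples span lengths from a Poisson distribution).

```python
import random

MASK = "[MASK]"

def text_infilling(tokens, span_len=2):
    """Replace one random contiguous span with a single [MASK] token."""
    start = random.randrange(0, len(tokens) - span_len + 1)
    return tokens[:start] + [MASK] + tokens[start + span_len:]

original = ["A", "B", "[SEP]", "C", "D", "E"]
corrupted = text_infilling(original)
print(corrupted)   # e.g. ['A', 'B', '[SEP]', '[MASK]', 'E']
# Encoder input: corrupted; decoder target: the original sequence.
```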


UniLM [Dong, et al., NeurIPS'19]: a single model that is trained to act as an encoder (like BERT), as a decoder (like GPT), and as a seq2seq model (like BART/MASS), by switching the self-attention pattern. (Source of image: https://arxiv.org/pdf/1905.03197.pdf)

Replace or Not? ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). In "the chef cooked the meal", one token is swapped for a plausible alternative (cooked → ate), and the model predicts, for every position, whether the token has been replaced (NO NO YES NO NO). Predicting yes/no is easier than reconstructing the token, and every output position is used. The replacement comes from a small BERT that fills in a masked position, which keeps the replacements plausible. Note: this is not a GAN; the small BERT is trained as an ordinary masked LM, not to fool the larger model. (Source of image: https://arxiv.org/abs/2003.10555)
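A minimal sketch of the replaced-or-not objective: a stand-in "small BERT" proposes a token for one position, and each position gets a binary label. Both networks are faked here; only the labeling logic mirrors ELECTRA, and the candidate list is an invented placeholder.

```python
import random

sentence = ["the", "chef", "cooked", "the", "meal"]

def make_electra_example(tokens, candidates=("ate", "cooked", "saw")):
    """Mask one position, let a 'small BERT' guess it, label each position."""
    pos = random.randrange(len(tokens))
    guess = random.choice(candidates)        # stand-in for the small BERT's sample
    corrupted = tokens.copy()
    corrupted[pos] = guess
    # Label is "replaced" only if the guess differs from the original token.
    labels = ["YES" if i == pos and guess != tokens[i] else "NO"
              for i in range(len(tokens))]
    return corrupted, labels

print(make_electra_example(sentence))
# The discriminator reads `corrupted` and predicts YES/NO at every position,
# so every output position is used for training.
```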

Sentence Level: besides a representation for each token (w1 w2 w3 w4), we often want a representation for the whole sequence. "You shall know a sentence by the company it keeps"?

• Skip Thought: an encoder reads a sentence w1 w2 w3 and a decoder is trained to generate the next sentence w4 w5 w6 w7.
• Quick Thought: drop the decoder; encode the two sentences separately and train the representations of truly consecutive sentences to have high similarity.

Next Sentence Prediction (NSP), used in the original BERT: two sentences w1 w2 and w3 w4 w5 are packed as [CLS] w1 w2 [SEP] w3 w4 w5, and from the [CLS] position the model answers Yes/No: are they consecutive? RoBERTa (Robustly optimized BERT approach) [Liu, et al., arXiv'19] finds NSP unhelpful and drops it.
Sentence Order Prediction (SOP), used in ALBERT: the two sentences are always adjacent, and the model predicts whether they appear in the correct or the swapped order. A similar idea appears in structBERT (Alice) [Wang, et al., ICLR'20].

T5 – Comparison
• Text-to-Text Transfer Transformer (T5) [Raffel, et al., arXiv'19]: a large-scale comparison of pre-training choices.
• Trained on the Colossal Clean Crawled Corpus (C4).

Knowledge
• Enhanced Language RepresentatioN with Informative Entities (ERNIE): BERT combined with external knowledge. This is another story ……

Audio BERT: BERT-style pre-training applied to speech (e.g., the utterance 深 度 學 習, "deep learning"). This is another story ……

Reference

• [Lewis, et al., arXiv’19] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, arXiv, 2019

• [Raffel, et al., arXiv’19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, arXiv, 2019

• [Joshi, et al., TACL’20] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy, SpanBERT: Improving Pre-training by Representing and Predicting Spans, TACL, 2020

• [Song, et al., ICML’19] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, MASS: Masked Sequence to Sequence Pre-training for Language Generation, ICML, 2019

• [Zafrir, et al., NeurIPS workshop 2019] Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat, Q8BERT: Quantized 8Bit BERT, NeurIPS workshop 2019


• [Houlsby, et al., ICML’19] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly, Parameter-Efficient Transfer Learning for NLP, ICML, 2019

• [Hao, et al., EMNLP’19] Yaru Hao, Li Dong, Furu Wei, Ke Xu, Visualizing and Understanding the Effectiveness of BERT, EMNLP, 2019

• [Liu, et al., arXiv'19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv, 2019

• [Sanh, et al., NeurIPS workshop'19] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, NeurIPS workshop, 2019

• [Jiao, et al., arXiv'19] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv, 2019


• [Shoeybi, et al., arXiv'19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, arXiv, 2019

• [Lan, et al., ICLR'20] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR, 2020

• [Kitaev, et al., ICLR’20] Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya, Reformer: The Efficient Transformer, ICLR, 2020

• [Beltagy, et al., arXiv’20] Iz Beltagy, Matthew E. Peters, Arman Cohan, Longformer: The Long-Document Transformer, arXiv, 2020

• [Dai, et al., ACL’19] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, ACL, 2019

• [Peters, et al., NAACL’18] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, Deep contextualized word representations, NAACL, 2018


• [Sun, et al., ACL'20] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, ACL, 2020


• [Pennington, et al., EMNLP’14] Jeffrey Pennington, Richard Socher, Christopher Manning, Glove: Global Vectors for Word Representation, EMNLP, 2014

• [Mikolov, et al., NIPS’13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean, Distributed Representations of Words and Phrases and their Compositionality, NIPS, 2013

• [Bojanowski, et al., TACL’17] Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, TACL, 2017

• [Su, et al., EMNLP’17] Tzu-Ray Su, Hung-Yi Lee, Learning Chinese Word Representations From Glyphs Of Characters, EMNLP, 2017

• [Liu, et al., ACL’19] Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Multi-Task Deep Neural Networks for Natural Language Understanding, ACL, 2019

• [Stickland, et al., ICML’19] Asa Cooper Stickland, Iain Murray, BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning, ICML, 2019


• [Howard, et al., ACL’18] Jeremy Howard, Sebastian Ruder, Universal Language Model Fine-tuning for Text Classification, ACL, 2018

• [Alec, et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, Improving Language Understanding by Generative Pre-Training, 2018

• [Devlin, et al., NAACL’19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL, 2019

• [Alec, et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, Language Models are Unsupervised Multitask Learners, 2019

• [Wang, et al., ICLR'20] Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, Luo Si, StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding, ICLR, 2020

• [Yang, et al., NeurIPS’19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le, XLNet: Generalized Autoregressive Pretraining for Language Understanding, NeurIPS, 2019


• [Cui, et al., arXiv'19] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu, Pre-Training with Whole Word Masking for Chinese BERT, arXiv, 2019

• [Sun, et al., ACL'19] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu, ERNIE: Enhanced Representation through Knowledge Integration, ACL, 2019

• [Dong, et al., NeurIPS'19] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS, 2019