Slides credited from Jacob Devlin
BERT: Bidirectional Encoder Representations from Transformers
Idea: contextualized word representations
◦ Learn word vectors from long contexts using a Transformer instead of an LSTM
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
BERT #1 – Masked Language Model
Idea: language understanding is bidirectional, while a standard LM uses only left or right context
◦ This is not a generation task
Randomly mask 15% of tokens
• Too little masking: too expensive to train
• Too much masking: not enough context
http://jalammar.github.io/illustrated-bert/
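To make the masking recipe concrete, here is a minimal Python sketch of the procedure; the 80%/10%/10% replacement split follows the BERT paper, while `mask_tokens` and its arguments are illustrative names, not from the slides:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Illustrative BERT-style masking: pick ~15% of positions as
    prediction targets. Per the BERT paper, a chosen position becomes
    [MASK] 80% of the time, a random token 10% of the time, and is
    left unchanged the remaining 10%."""
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token
    return masked, targets

tokens = "the man went to the store to buy milk".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
print(masked, targets)
```

The loss is computed only over the positions in `targets`, which is why too aggressive masking leaves the model too little surrounding context to condition on.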
BERT #1 – Masked Language Model
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
BERT #2 – Next Sentence Prediction
Idea: modeling the relationship between sentences
◦ QA, NLI, etc. are based on understanding inter-sentence relationships
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
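A minimal sketch of how NSP training pairs might be assembled from a corpus of documents (50% true continuations, 50% random sentences, as in the BERT paper; `make_nsp_pair` is an illustrative name):

```python
import random

def make_nsp_pair(docs):
    """Illustrative next-sentence-prediction example: half the time
    sentence B actually follows sentence A (label IsNext); otherwise
    B is drawn from a randomly chosen document (label NotNext)."""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"
    other = random.choice(docs)  # a sketch; real code would avoid re-picking doc
    return sent_a, random.choice(other), "NotNext"

docs = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the antarctic"],
]
print(make_nsp_pair(docs))
```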
BERT #2 – Next Sentence Prediction
Idea: modeling the relationship between sentences
http://jalammar.github.io/illustrated-bert/
BERT – Input Representation
Input embeddings are the sum of
◦ Word-level token embeddings
◦ Sentence-level segment embeddings
◦ Position embeddings
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
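A hedged PyTorch sketch of this input layer, assuming BERT-Base sizes; the class name is illustrative (the real model also applies LayerNorm and dropout after the sum):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sketch of the BERT input representation: the three embeddings
    are simply summed element-wise (sizes are BERT-Base's)."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)       # sentence A vs. B
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))  # broadcasts over the batch

emb = BertInputEmbeddings()
token_ids = torch.tensor([[101, 2023, 2003, 102]])  # e.g. [CLS] ... [SEP]
segment_ids = torch.zeros_like(token_ids)           # all sentence A
print(emb(token_ids, segment_ids).shape)            # torch.Size([1, 4, 768])
```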
BERT – Training
Training data: Wikipedia + BookCorpus
Two BERT models
◦ BERT-Base: 12-layer, 768-hidden, 12-head
◦ BERT-Large: 24-layer, 1024-hidden, 16-head
http://jalammar.github.io/illustrated-bert/
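For reference, the two sizes written as configurations; this assumes the HuggingFace `transformers` library, which is not part of the slides:

```python
from transformers import BertConfig

# BERT-Base: 12 layers, hidden size 768, 12 attention heads (~110M parameters)
base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=3072)

# BERT-Large: 24 layers, hidden size 1024, 16 attention heads (~340M parameters)
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4096)
```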
BERT for Fine-Tuning Understanding Tasks
Idea: simply learn a classifier/tagger on top of the final layer for each target task
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
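One hedged way to realize this for sentence classification, again assuming the HuggingFace `transformers` library: a classification head over the [CLS] representation, with all parameters updated during fine-tuning.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # only the head is newly initialized

batch = tokenizer(["a delightful movie", "a tedious mess"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

loss = model(**batch, labels=labels).loss  # standard cross-entropy
loss.backward()  # then step an optimizer over *all* parameters
```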
BERT Overview
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
BERT Fine-Tuning Results
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
BERT Results on SQuAD 2.0
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
BERT Results on NER
| Model | Description | CoNLL-2003 F1 |
|---|---|---|
| TagLM (Peters+, 2017) | LSTM BiLM in a BiLSTM tagger | 91.93 |
| ELMo (Peters+, 2018) | ELMo in a BiLSTM | 92.22 |
| BERT-Base (Devlin+, 2019) | Bidirectional Transformer LM + fine-tuning | 92.4 |
| CVT (Clark+, 2018) | Cross-view training + multitask learning | 92.61 |
| BERT-Large (Devlin+, 2019) | Bidirectional Transformer LM + fine-tuning | 92.8 |
| Flair (Akbik+, 2018) | Character-level language model | 93.09 |

Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
BERT Results with Different Model Sizes
Improving performance by increasing model size
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in NAACL-HLT, 2019.
BERT for Contextualized Word Embeddings
Idea: use pre-trained BERT to obtain contextualized word embeddings, then feed them into task-specific models
http://jalammar.github.io/illustrated-bert/
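A sketch of this feature-based use, assuming the HuggingFace `transformers` library: run frozen BERT once and hand its hidden states to a separate task model.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")
with torch.no_grad():  # no fine-tuning: BERT stays frozen
    out = bert(**inputs, output_hidden_states=True)

# One common recipe: concatenate the last four layers per token
features = torch.cat(out.hidden_states[-4:], dim=-1)  # (1, seq_len, 4*768)
# `features` would then feed a task-specific tagger, e.g. a BiLSTM-CRF
```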
BERT Embeddings Results on NER
http://jalammar.github.io/illustrated-bert/
(For reference, fully fine-tuning BERT reaches 96.4 dev F1 on this task.)
Concluding Remarks
Contextualized embeddings learned by a masked LM with Transformers provide informative cues for transfer learning
BERT – a general approach to learning contextual representations with Transformers that benefits language understanding
◦ Pre-trained BERT: https://github.com/google-research/bert