Deep Learning for Natural Language Processing

Pre-trained Transformer models

Richard Johansson

[email protected]


BERT

▶ BERT (Bidirectional Encoder Representations from Transformers) is an architecture for transfer learning (Devlin et al., 2019)
▶ pushed the state of the art for several tasks
▶ uses the encoder part of the Transformer (originally 12 or 24 layers; see the quick check below)

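The original slides contain no code here; as a quick check of the layer counts, a minimal sketch using the transformers library (introduced later in this deck), assuming the standard bert-base-uncased and bert-large-uncased checkpoints:

    from transformers import BertConfig

    # download only the configuration files and compare the encoder depth
    for name in ["bert-base-uncased", "bert-large-uncased"]:
        config = BertConfig.from_pretrained(name)
        print(name, config.num_hidden_layers)  # 12 for base, 24 for large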

transfer learning with BERT


pre-training tasks for BERT (1): masked language modeling

original:    He bought two cans of fish soup .
masked:      He [MASK] two cans of fish [MASK] .
to predict:  bought, soup

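The masked-word prediction task shown above can be tried out directly; a minimal sketch (not part of the original slides) using the transformers fill-mask pipeline with a standard pre-trained BERT checkpoint, masking one word of the example sentence:

    from transformers import pipeline

    # BERT's first pre-training task: recover the token hidden behind [MASK]
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    for prediction in fill_mask("He bought two cans of fish [MASK] ."):
        print(prediction["token_str"], round(prediction["score"], 3))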

pre-training tasks for BERT (2): next sentence prediction

The man went to the store. He bought a bottle of milk.
  ⇒ Adjacent

The man went to the store. Penguins are flightless birds.
  ⇒ NonAdjacent

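The adjacent/non-adjacent decision is exposed through BertForNextSentencePrediction in the transformers library; a minimal sketch (not part of the original slides) scoring the two sentence pairs above, assuming the bert-base-uncased checkpoint:

    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    first = "The man went to the store."
    for second in ["He bought a bottle of milk.",
                   "Penguins are flightless birds."]:
        inputs = tokenizer(first, second, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # label 0 = "second sentence follows the first", label 1 = "random sentence"
        prob_adjacent = torch.softmax(logits, dim=-1)[0, 0].item()
        print(f"{second}  ->  P(adjacent) = {prob_adjacent:.3f}")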

how to use BERT in different types of tasks


working with BERT in PyTorch

▶ the transformers library implements several types of pre-trained Transformer-based models (BERT and derivatives)
  ▶ https://github.com/huggingface/transformers
  ▶ needs to be installed separately
▶ it implements classes for the standard use cases
  ▶ such as BertForSequenceClassification: just a linear layer on top of the final Transformer layer (see the sketch below)

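A minimal sketch of that standard use case (not part of the original slides), assuming the bert-base-uncased checkpoint and a two-class task; note that the classification head is randomly initialized until the model is fine-tuned:

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # num_labels sets the size of the linear layer on top of the final layer
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    inputs = tokenizer("A thoroughly enjoyable film.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits   # shape: (1, num_labels)
    print(logits)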

tokenization in BERT

▶ BERT comes with a built-in tokenizer
▶ it uses WordPiece tokenization to avoid out-of-vocabulary situations (illustrated below)
▶ BERT can handle documents of up to 512 WordPiece tokens

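A minimal sketch (not part of the original slides) of the WordPiece behaviour; the exact subword split depends on the vocabulary of the chosen checkpoint, here assumed to be bert-base-uncased:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # rare words are broken into subword pieces (marked with ##),
    # so nothing ends up out of vocabulary
    print(tokenizer.tokenize("He bought two cans of bouillabaisse ."))

    # the 512-token limit is stored on the tokenizer
    print(tokenizer.model_max_length)   # 512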

(figure: overview of the family of pre-trained language models, from https://github.com/thunlp/PLMpapers/)


improved and specialized BERT variants (small sample)

▶ RoBERTa (Liu et al., 2019) uses more robust optimization
▶ DistilBERT (Sanh et al., 2019) “distils” BERT into a slightly less enormous model
▶ domain-specific BERT models:
  ▶ BioBERT: https://github.com/dmis-lab/biobert
  ▶ SciBERT: https://github.com/allenai/scibert
▶ Swedish BERT by the Royal Library: https://huggingface.co/KB/bert-base-swedish-cased
▶ Multilingual BERT: https://huggingface.co/bert-base-multilingual-cased
▶ many of these can be loaded through the same transformers interface (see the sketch below)

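A minimal sketch (not part of the original slides) of loading such variants with the generic Auto classes, using the model names from the list above plus the standard RoBERTa and DistilBERT checkpoints on the Hugging Face hub:

    from transformers import AutoModel, AutoTokenizer

    # the Auto classes pick the right architecture from the model name
    for name in ["roberta-base",
                 "distilbert-base-uncased",
                 "KB/bert-base-swedish-cased",
                 "bert-base-multilingual-cased"]:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModel.from_pretrained(name)
        n_params = sum(p.numel() for p in model.parameters())
        print(f"{name}: {n_params / 1e6:.0f}M parameters")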

introduction to BERTology

▶ there is an increasing interest in trying to understand complex models such as BERT
▶ several recent papers try to interpret parts of BERT’s Transformer layers:
  ▶ “BERT Rediscovers the Classical NLP Pipeline” (Tenney et al., 2019)
  ▶ “What Does BERT Look At? An Analysis of BERT’s Attention” (Clark et al., 2019)
▶ the BlackboxNLP workshop publishes many interesting contributions: https://blackboxnlp.github.io/

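The attention analyses mentioned above start from the raw attention weights, which the transformers models can return; a minimal sketch (not part of the original slides), assuming bert-base-uncased:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

    inputs = tokenizer("He bought two cans of fish soup .", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # one attention tensor per layer, each of shape (batch, heads, tokens, tokens)
    print(len(outputs.attentions), outputs.attentions[0].shape)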

code example

▶ in a notebook, we’ll see how to use BERT as the encoder for classifying reviews (a rough sketch of the main steps follows below)

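The notebook itself is not reproduced in this transcript; as a rough, hedged sketch of one fine-tuning step for such a review classifier (assumptions: the bert-base-uncased checkpoint, two made-up reviews, and a single optimizer step rather than a full training loop):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    reviews = ["A thoroughly enjoyable film.", "A complete waste of time."]
    labels = torch.tensor([1, 0])            # 1 = positive, 0 = negative

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    batch = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)  # passing labels also returns the loss
    outputs.loss.backward()
    optimizer.step()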

reading

▶ tutorials, overviews:
  ▶ Alammar: “The illustrated BERT, ELMo”
  ▶ NAACL 2019 tutorial on transfer learning in NLP
▶ the BERT paper (Devlin et al., 2019) is also readable


references

K. Clark, U. Khandelwal, O. Levy, and C. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. arXiv:1906.04341.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

V. Sanh, L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.

I. Tenney, D. Das, and E. Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv:1905.05950.