ALPS Winter School 2021

Natural Language Generation

Transfer Learning

Claire Gardent

Advanced Language Processing Winter School, 17-22 January 2021

1 / 68


Part 2: Topics in NLG

Selecting from the Input

Summarisation, Simplification, D2T

Modeling input structure

Graph encoders for D2T, MR2T
Hierarchical encoders for long form text (Summarisation)

Transfer Learning

D2T, Dialog, ...

2 / 68


What is Transfer Learning?

Key Idea: Transfer knowledge acquired by one model to another model

How?

Find a task (e.g., language modelling) for which it is easy to generate labels and for which you can get large quantities of training data

Train a model on this large data and adapt it to a task for which labelled data is harder to get

Two main approaches

Transfer word or sentence embeddings ("feature-based")

Transfer model parameters ("fine-tuning")

3 / 68


What is Transfer Learning?

Transfer prior knowledge

Two main approaches

Transfer word or sentence embeddings ("feature-based")

Transfer model parameters ("fine-tuning")

Why Transfer Learning?

Data

Efficiency

Quality

4 / 68


Feature-Based Transfer

1-hot vector + learned word embeddings

Learned on labelled data
Specific to the task

Pretrained word embeddings (word2vec, GloVe, ...)

Learned on unlabelled data
Use these for different tasks
Represent each word type with a single representation
Context is a limited window

Contextual Word Representations (ELMo)

Learned on a very large amount of unlabelled data
Depend on the sentence in which a word is used: play area (ADJ), the play was from Shakespeare (NOUN), I play violin (VERB)
Improve NLP tasks

Image credit: Manning

5 / 68

http://web.stanford.edu/class/cs224n/index.html#schedule


ELMo

Learn a LM on large quantities of unlabelled data (1B words)

For a specific task (e.g., NER)

Run this LM on the input text

This creates a contextual representation for each token in the input text

Combine this contextual representation with a pretrained word embedding (e.g., GloVe or word2vec)

Peters et al. NAACL 2018, Image Credit: Manning

6 / 68

https://www.aclweb.org/anthology/N18-1202.pdf
http://web.stanford.edu/class/cs224n/index.html#schedule


ELMo

7 / 68


LM = 2 BiLSTM layers (use all layers, residual connection)

Trained on 800M words

Long context, not context window (whole sentence)

Provides good gains for multiple NLP tasks

ELMo

8 / 68


Train deep BiLSTM LM on a large quantity of textual data

Fine-tune LM on target domain text

Fine-tune LM as classifier
Use same network for pretraining (LM) and fine-tuning (classification)

Improves on the SOTA

Needs less labelled data

Pre-training and Fine-Tuning

ULMFiT
Universal Language Model Fine-tuning
Howard and Ruder, ACL 2018

9 / 68

https://arxiv.org/abs/1801.06146


Pre-training and Fine-Tuning for NLU

Transformer-Based Architectures

Use Transformer encoder to compute sentence representations that can be used for all NLP tasks

GPT (OpenAI)

June 2018
Training time: 240 GPU days

BERT (Google AI)

October 2018
Training time: 256 TPU days (~320-560 GPU days)

GPT-2 (OpenAI)

February 2019
Training time: 2048 TPU days

10 / 68


Unsupervised Pretraining
Train a LM on a large corpus of text (BookCorpus, 7K books)

Supervised Fine-tuning
Adapt the LM parameters to the supervised target task

Input passed through pretrained LM
Feed final LM activation to added linear + softmax output layers to predict output
Task-aware input transformations
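
The supervised fine-tuning step above (final LM activation fed to an added linear + softmax layer) can be sketched as follows. This is a minimal illustration in plain Python, not the GPT implementation: the activation vector, weight matrix and bias are hypothetical toy values, and in practice the LM parameters are updated together with the new layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(final_activation, weights, bias):
    """Added output layer: linear projection of the final LM activation, then softmax."""
    logits = [
        sum(w * h for w, h in zip(row, final_activation)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Hypothetical toy values: a 3-dimensional activation and 2 target classes.
h = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
b = [0.0, 0.0]
probs = classify(h, W, b)
```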

Significantly improves upon the SOTA in 9 out of 12 NLU tasks

Input Transformations for task adaptation

GPT
Generative Pre-Trained Transformer

Radford et al. 2018

11 / 68

https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf


BERT
Bidirectional Encoder Representations from Transformers

GPT: Standard LM, decodes left-to-right

ELMo: Left and right representations are concatenated, not jointly learned

BERT: Replace standard LM objective with Masked Language Model (MLM) objective. Jointly learns both sides of context

Devlin et al. NAACL 2019.

12 / 68

https://www.aclweb.org/anthology/N19-1423.pdf


BERT Pre-training

Pretrain on BooksCorpus (800M words) and English Wikipedia (2,500M words).

Two loss functions

Predict masked tokens

Next sentence prediction classification (true if next sentence is the correct continuation)

The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood
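
One way to write this combined loss is given below. The notation is an assumption made for illustration: M is the set of masked positions, x-tilde the corrupted input, y_n the is-next label of example n, and N the number of examples; the negative log-likelihood sign convention (a quantity to minimise) is also assumed.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{MLM}}(\theta) + \mathcal{L}_{\mathrm{NSP}}(\theta),
\qquad
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta(x_i \mid \tilde{x}),
\qquad
\mathcal{L}_{\mathrm{NSP}}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p_\theta\bigl(y_n \mid \tilde{x}^{(n)}\bigr)
```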

13 / 68


Masked Language Modelling Objective

Mask 1 word in 7

Too little masking: too expensive to train

Too much masking: not enough context
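
A minimal sketch of the masking step, assuming a whitespace-tokenised input and a fixed rate of 1 token in 7. The function name and the `[MASK]` string are illustrative; BERT itself additionally replaces some selected tokens with random tokens or leaves them unchanged, which this sketch omits.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=1 / 7, seed=0):
    """Replace roughly `rate` of the tokens with MASK; return (masked, targets)."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the token the model must predict
        masked[pos] = MASK
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog near the old river bank".split()
masked, targets = mask_tokens(tokens)
```

The model is trained to recover `targets` from `masked`; only the masked positions contribute to the loss, which is why a very low masking rate makes training expensive.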

14 / 68


Next Sentence Prediction

To learn relationship between sentences

Beneficial for QA and NLI.

Has since been shown to be unimportant (not used in, e.g., RoBERTa)

15 / 68


BERT Encoding

16 / 68


Run pretrained model on input

Sentence representation = final hidden state output by the Transformer (= [CLS] word embedding)

Add a classification layer

Fine-tune all BERT parameters and the classification layer jointly to maximize the log-probability of the correct label

BERT Fine-tuning

Image credit: Devlin et al. 2018

17 / 68


18 / 68


Pre-training and Fine-Tuning for NLG

Pre-trained Encoder-Decoders

BART, T5

Generating into Multiple Languages

XLM embeddings

Generating Dialog Turns

ConveRT

Knowledge-rich NLG

REALM, ...

19 / 68



BART, T5


XLM embeddings


ConveRT

Knowledge-rich NLG

REALM, ...

20 / 68


Bidirectional encoder + Autoregressive Transformer Decoder

Denoising AutoEncoder

Corrupt text with a noising function
S2S model learns to reconstruct original text

Experiment with different noising functions

Token masking (reconstruct missing tokens)
Token deletion (which positions are missing inputs?)
Text infilling (reconstruct spans)
Sentence permutation (reorder shuffled sentences)
Document rotation (identify the start of the document)
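
The five noising functions listed above can be sketched as simple list operations. This is an illustrative simplification, not the BART code: function names and the `<mask>` string are assumptions, and positions and span boundaries are passed in explicitly instead of being sampled.

```python
import random

MASK = "<mask>"

def token_masking(tokens, positions):
    """Replace the tokens at `positions` with a mask symbol."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def token_deletion(tokens, positions):
    """Drop the tokens at `positions`; the model must infer where input is missing."""
    return [t for i, t in enumerate(tokens) if i not in positions]

def text_infilling(tokens, start, length):
    """Replace a whole span with a single mask symbol."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Shuffle the sentence order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, start):
    """Rotate the document so that it begins at position `start`."""
    return tokens[start:] + tokens[:start]

tokens = "a b c d e".split()
noised = {
    "masking": token_masking(tokens, {1, 3}),
    "deletion": token_deletion(tokens, {1, 3}),
    "infilling": text_infilling(tokens, 1, 3),
    "rotation": document_rotation(tokens, 2),
    "permutation": sentence_permutation(["s1", "s2", "s3"], random.Random(0)),
}
```

In every case the training target is the original, uncorrupted text, so the same sequence-to-sequence model can be trained with any of these corruptions or a combination of them.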

BART
Lewis et al. ACL 2020

Image credit: Lewis et al. 2019

21 / 68

https://www.aclweb.org/anthology/2020.acl-main.703.pdf


BART

Image credit: Kelvin Han 2020, Lewis et al. 2019

22 / 68


Achieves new state-of-the-art results on a number of text generation tasks

Text infilling (reconstruct spans) demonstrates the most consistently strong performance

Token masking (reconstruct missing tokens) is crucial

Document rotation and sentence shuffling perform poorly in isolation

BART Results

23 / 68


BART Summary

Coreference

Sacoolas, who has immunity as a diplomat’s wife, was involved in a traffic collision ... Prime Minister Johnson was questioned about the case while speaking to the press at a hospital in Watford. He said, “I hope that Anne Sacoolas will come back ... if we can’t resolve it then of course I will be raising it myself personally with the White House.”

Boris Johnson has said he will raise the issue of US diplomat Anne Sacoolas’ diplomatic immunity with the White House.

He said → Boris Johnson said

He said ... I will be raising ... → Boris Johnson said he will raise the issue

24 / 68


BART Summary

Abstract Anaphora

Sacoolas, who has immunity as a diplomat’s wife, was involved in a traffic collision ... Prime Minister Johnson was questioned about the case while speaking to the press at a hospital in Watford. He said, “I hope that Anne Sacoolas will come back ... if we can’t resolve it then of course I will be raising it myself personally with the White House.”


25 / 68


World Knowledge

Sacoolas, who has immunity as a diplomat’s wife, was involved in a traffic collision ... Prime Minister Johnson was questioned about the case while speaking to the press at a hospital in Watford. He said, “I hope that Anne Sacoolas will come back ... if we can’t resolve it then of course I will be raising it myself personally with the White House.”


26 / 68


Standard Encoder-decoder Transformer
12 blocks, 12 heads, 220M parameters, 32K wordpieces

Pretraining

Masked span LM objective
On C4 (Colossal Clean Crawled Corpus, 750 GB of text data extracted from web pages)
Multitask learning: MT, translation, text classification ...

Treat every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output

Add a task-specific prefix to indicate the task to be performed
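
The text-to-text framing with a task prefix can be illustrated with a small formatting helper. The helper name and the dictionary keys are hypothetical, and the prefix wording is only an example of the convention, not a fixed list of supported tasks.

```python
def to_text_to_text(task_prefix, source, target):
    """Cast any task as (input text, output text) by prepending a task prefix."""
    return {"input": f"{task_prefix}: {source}", "target": target}

# Two different tasks, one input/output format (illustrative examples).
examples = [
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    to_text_to_text("summarize", "A long news article about a storm.", "Storm hits."),
]
```

Because every task shares this format, one model with one decoding procedure and one loss can be trained on all of them; only the prefix tells the model which task to perform.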

T5-Base

27 / 68


T5
Text-to-Text Transfer Transformer

Treat every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output

Raffel et al. JMLR 2020

28 / 68

https://jmlr.org/papers/volume21/20-074/20-074.pdf


Pretraining and Fine-Tuning for D2T Generation (1)

Fine-tune T5 on the WebNLG, MultiWoz and ToTTo benchmarks.

Small number of steps (5K for MultiWoz and WebNLG, 10K for ToTTo)
All the model parameters are updated in the fine-tuning process

Achieves state-of-the-art results

T5 pretraining also enables greater generalization, as evidenced by large improvements on out-of-domain test sets

Kale et al., INLG 2020.

29 / 68

https://www.aclweb.org/anthology/2020.inlg-1.14.pdf


WebNLG
Gardent et al., ACL 2017

RDF-to-Text

MultiWoz
Budzianowski et al. EMNLP 2018

10K human-human dialogs
Task-based
Input = dialog act (inform, request etc.) and list of slot key-value pairs

ToTTo
Parikh et al. EMNLP 2020

Wikipedia tables + text
A subset of cells is highlighted.
A model must generate text that describes the highlighted cells.
Here: use only the highlighted cells and metadata as input

D2T Benchmarks

30 / 68

https://www.aclweb.org/anthology/P17-1017.pdf
https://www.aclweb.org/anthology/D18-1547.pdf
https://www.aclweb.org/anthology/2020.emnlp-main.89.pdf


WebNLG
Unseen: +14 BLEU

ToTTo
Non-overlap test set: +6.6 BLEU, +7.5 PARENT

MultiWoz
+45 BLEU without in-domain pretraining

31 / 68


Pretraining and Fine-Tuning for D2T Generation (2)

WebNLG+ 2020
D2T Generation: RDF → English, Russian
Semantic Parsing: English, Russian → RDF

Pretraining
T5 jointly pretrained on English WKP, Russian WKP and the WMT (en,ru) parallel corpus for 800K steps

Fine-tuning
Semantic parsing and NLG are handled by separate models

Monolingual: a model for each language
Bilingual: multitasks both languages and fine-tunes a single model for each language
Bilingual+WPC: add WebNLG (WPC) parallel corpus in both directions (en,ru) and (ru,en)
WPC: parallel sentences and entities in English and Russian extracted from the WebNLG+ parallel corpus
Each task is weighted by the size of its training corpus

Agarwal et al., WebNLG+ 2020
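
Weighting each task by the size of its training corpus can be read as sampling tasks in proportion to their number of examples. The sketch below shows that reading only; it is an assumption about how the weighting is applied, and the task names and corpus sizes are hypothetical placeholders, not figures from the paper.

```python
def task_weights(corpus_sizes):
    """Weight each task proportionally to the size of its training corpus."""
    total = sum(corpus_sizes.values())
    return {task: size / total for task, size in corpus_sizes.items()}

# Hypothetical corpus sizes, for illustration only.
sizes = {"rdf_to_en": 3000, "rdf_to_ru": 1000}
weights = task_weights(sizes)
```

With these toy sizes the English task would be drawn three times as often as the Russian one during multitask fine-tuning.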

32 / 68

https://webnlg-challenge.loria.fr/files/2020.webnlg-papers.13.pdf


Generation
Bilingual+WPC: +6.84 BLEU on unseen Russian

Semantic Parsing
Bilingual and Bilingual+WPC outperform Monolingual

Results

33 / 68


Wikibio Dataset

Input: Wiki infobox
Output: first sentence of the corresponding WKP article

Pretraining and Fine-Tuning for D2T Generation (3)
Few-shot D2T NLG: Learn D2T model from 50-200 training instances

34 / 68


Field-gated dual attention LSTM model, Liu et al., 2018

Latent switch to choose between copying from table and generating from softmax distribution

Additional loss term to maximise copy probability for target tokens which match an input token

Pretrained Transformer-based LM (GPT-2)
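
The latent switch between copying and generating can be sketched as a mixture of two distributions: attention weights over the input table tokens and the softmax distribution over the vocabulary. The function below is an illustrative formulation under that assumption, with hypothetical toy numbers; it is not the authors' implementation.

```python
def mix_copy_and_generate(p_copy, attention, source_tokens, vocab_probs):
    """Output distribution = p_copy * copy distribution + (1 - p_copy) * vocabulary softmax."""
    mixed = {word: (1.0 - p_copy) * p for word, p in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mass goes to the input token, even if it is out of vocabulary.
        mixed[token] = mixed.get(token, 0.0) + p_copy * weight
    return mixed

# Hypothetical toy values: two table tokens and a three-word vocabulary.
mixed = mix_copy_and_generate(
    p_copy=0.6,
    attention=[0.7, 0.3],
    source_tokens=["Walter", "Extra"],
    vocab_probs={"is": 0.5, "a": 0.4, "Walter": 0.1},
)
```

The additional loss term mentioned above would push `p_copy` up at steps where the target token also appears in the input, so that rare names and values are copied and the pretrained LM handles the surrounding fluent text.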

Model

Chen et al. ACL 2020.

35 / 68

https://arxiv.org/pdf/1711.09724.pdf
https://www.aclweb.org/anthology/2020.acl-main.18/


Loss term: +4.0 BLEU points

Results

Base-original (Liu et al., 2018), which obtains the state-of-the-art result on the WIKIBIO full set, performs very poorly under the few-shot setting.

Base+switch: +10.0 BLEU on avg
Learns to copy but output is not fluent

Base+switch+pretrained LM (GPT-2): +8.0 BLEU on avg

36 / 68



BART, T5

Generating into Multiple Languages
XLM embeddings


ConveRT

Knowledge-rich NLG

REALM, ...

37 / 68


Generating from AMRs into 21 Languages

Fan et al. EMNLP 2020.

38 / 68

https://www.aclweb.org/anthology/2020.emnlp-main.231.pdf


Graph Encoding

39 / 68


Graph Encoding

40 / 68


Remove variable names and instance-of relation

No anonymisation

SentencePiece model with 32K operations

Pre-processing

41 / 68


Pretraining on silver AMRs

30M sentences from CCNET

Using JAMR

Pretraining

42 / 68


XLM SentencePiece model and vocabulary

XLM cross-lingual embeddings

Language model pretraining on 30M sentences (for each language)

Decoding into Multiple Languages

43 / 68


XLM Cross-Lingual Embeddings

Lample et al. NeurIPS 2019.

44 / 68

https://arxiv.org/pdf/1901.07291v1.pdf


Multilingual AMR-to-NL Model

45 / 68


Training Data

Europarl: 21 Languages

Input AMR: create AMR structure with JAMR parser

46 / 68


Takeaways

Human evaluation shows that multilingual techniques generalize across languages

A multilingual model benefits from increased training data and performs better than monolingual models, particularly when training data is scarce

Using English-centric AMR, we can decode into many different target-side languages

The model generates good paraphrases

47 / 68


Example Paraphrases

48 / 68


Pre-training and Fine-Tuning for NLG

Pre-trained Encoder-Decoders

BART, T5

Generating into Multiple Languages
XLM embeddings

Generating Dialog Turns
DialoGPT, ConveRT

Knowledge-rich NLG
REALM, ...

49 / 68


DialoGPT
Large-Scale Generative Pre-training for Conversational Response Generation

Language Model: GPT-2 architecture, 12-to-24 layer Transformer

Pretrained on Reddit (Dialog Context, Dialog Turn) pairs
147M dialogs, 1.8B words

Zhang et al. ACL 2020.

50 / 68

https://www.aclweb.org/anthology/2020.acl-demos.30.pdf


DSTC-7 Dialogue Generation Challenge
Generate conversation responses that go beyond chitchat by injecting information that is grounded in external knowledge

DialoGPT fine-tuned on DSTC source-target pairs

Does not leverage the grounding information from the DSTC dataset

Outperforms the winning system of the DSTC-7 Challenge and Personality Chat

51 / 68


Dialogs demonstrate some ability

to address commonsense questions

and to handle multi-turn dialog

Results

52 / 68


ConveRT
Efficient and Accurate Conversational Representations from Transformers

Retrieval-based dialog model

Treats each input utterance as a query and retrieves the most relevant response from a large response collection by computing semantic similarity between the query representation and the encoding of each response in the collection.
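
Response retrieval by semantic similarity can be sketched as a scoring loop over pre-encoded responses. The sketch assumes dot-product similarity and uses hypothetical 2-dimensional vectors in place of the real encoder outputs; the function names are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, response_vecs, responses):
    """Return the response whose encoding is most similar to the query encoding."""
    scores = [dot(query_vec, r) for r in response_vecs]
    best = max(range(len(responses)), key=scores.__getitem__)
    return responses[best], scores[best]

# Hypothetical encodings; a real system would obtain them from the trained encoders.
responses = ["It opens at nine.", "Yes, a table for two is free."]
response_vecs = [[0.2, 0.9], [0.8, 0.1]]
best_response, best_score = retrieve([1.0, 0.0], response_vecs, responses)
```

Since the response encodings do not depend on the query, they can be computed once and stored, which is what makes this kind of dual-encoder retrieval fast at inference time.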

Henderson et al. arXiv 2019 and Findings EMNLP 2020

53 / 68

https://arxiv.org/pdf/1911.03688.pdf
https://www.aclweb.org/anthology/P19-1536.pdf


Model
Response selection is the task of selecting the most appropriate system response given the dialogue history and the input user utterance (i.e., the full dialogue context).

Transformer-style architectures for input-response encodings

Pretraining and Fine-tuning
Pretrain the response selection model on large general-domain conversational corpora
Reddit: 727M (input, response) pairs

Fine-tune for target dialogue domains using small in-domain dataset

ConveRT

54 / 68


Multi-Context Dual-Encoder Model

Combines the immediate context with previous dialog history (up to 10 more previous messages in a Reddit thread)

Linear combination of three training objectives

ranking responses given the immediate context
ranking responses given only the non-immediate contexts
ranking responses given the averaged representation of the immediate and non-immediate context

ConveRT

55 / 68


Baselines

Universal Sentence Encoder
PolyAI-Dual: best-performing dual-encoder model from Henderson et al. (2019b), pretrained on Reddit response selection
MAP: learns to (linearly) map the response vectors to the input vector space

AMAZONQA

3.6M (single-context) QA pairs, 300K pairs are reserved for testing.

DSTC7 UBUNTU

1M+ conversations

ConveRT

Results
Significant gains over the previous state-of-the-art

56 / 68


And also

Poly-encoders, Humeau et al. ICLR 2020
Architectures and pretraining strategies for fast and accurate multi-sentence scoring

Meena, Adiwardana et al., arXiv 2020
A Transformer-based model trained on 341 GB of text, that was shown to be superior to variants of DialoGPT

Open-domain chatbots that perform well in human evaluation, Roller et al. arXiv 2020
Pretrain on Reddit and fine-tune on the ConvAI2, Wizard of Wikipedia, Empathetic Dialogues and Blended Skill Talk datasets
Compares generative, retrieval and retrieve-and-refine models

...

57 / 68

https://openreview.net/pdf?id=SkxgnnNFvH
https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html
https://parl.ai/projects/recipes/



BART, T5


XLM embeddings


ConveRT

Knowledge-rich NLG
REALM, RAG

58 / 68


Joint MLM and Document Retrieval

Unsupervised MLM objective

Retrieval

Inner product between input text and document BERT embedding
MIPS for efficient retrieval (nearest-neighbour style search)
Precomputed index updated every few hundred steps
Extract top-k documents

LM

Predict masked tokens based on input text and retrieved documents
Signal from the language modeling objective backpropagates through the retriever
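
The retrieve-then-predict computation can be sketched as follows: score documents by inner product, keep the top k, turn the scores into a distribution over documents, and marginalise the prediction over them. This is a simplified illustration with hypothetical toy embeddings and probabilities; it uses brute-force search where the slide mentions MIPS, and the function names are assumptions.

```python
import math

def top_k(query, doc_embeddings, k):
    """Brute-force stand-in for MIPS: the k documents with the largest inner product."""
    scored = [
        (sum(q * d for q, d in zip(query, emb)), idx)
        for idx, emb in enumerate(doc_embeddings)
    ]
    return sorted(scored, reverse=True)[:k]

def retrieval_distribution(scored):
    """Softmax over the retrieved documents' scores, giving p(z | x)."""
    m = max(score for score, _ in scored)
    exps = [math.exp(score - m) for score, _ in scored]
    total = sum(exps)
    return {idx: e / total for (_, idx), e in zip(scored, exps)}

def marginal_likelihood(p_doc, p_token_given_doc):
    """p(y | x) = sum over retrieved documents z of p(y | x, z) * p(z | x)."""
    return sum(p_doc[idx] * p_token_given_doc[idx] for idx in p_doc)

# Hypothetical toy embeddings and per-document prediction probabilities.
docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scored = top_k([1.0, 0.0], docs, k=2)
p_doc = retrieval_distribution(scored)
p_y = marginal_likelihood(p_doc, {0: 0.9, 2: 0.1})
```

Because `p_doc` depends on the query and document embeddings, the gradient of the masked-token loss reaches the retriever through this sum, which is how the language modelling signal trains retrieval.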

REALM
Retrieval-Augmented Language Model Pre-Training

Guu et al. arXiv 2020

59 / 68

https://arxiv.org/pdf/2002.08909v1.pdf


Injecting inductive biases into pre-training

Salient span masking

Focus on examples that require world knowledge to predict the masked token
Mask named entities

Null document

Add an empty null document to the top-k retrieved documents for cases when no retrieval is necessary (not all masked tokens require world knowledge to predict).

Trivial retrievals

Forbid retrieval of documents containing the masked input sentence

Initialisation

To improve quality of document and input embeddings, start using the Inverse Cloze Task objective (given a sentence x, the model is trained to retrieve the document where that sentence comes from)

60 / 68


Pretrain retriever and encoder on Wikipedia and CC-News

Fine-tune on QA data
Real user queries

NaturalQuestions-Open

WebQuestions

CuratedTrec

Fine-tuning for QA

61 / 68


T5: no explicit knowledge retrieval

DrQA etc.: IR and LM trained separately

Ablation

Joint IR and LM training helps
Salient span masking is crucial
Refreshing the index is important

Results

62 / 68


RAG
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Pretrained Retriever + Pretrained Encoder-Decoder

Retriever

DPR (Dense Passage Retrieval) bi-encoder pretrained to retrieve documents which contain answers to TriviaQA questions and Natural Questions

Generator: BART-large

Generates tokens based on a context of the previous tokens, the original input and a retrieved passage

Jointly trained on target task

Keep document encoder and index fixed
Fine-tune the query encoder and the generator

63 / 68

https://arxiv.org/pdf/2004.04906.pdf


Knowledge-intensive Tasks

Wikipedia dump as knowledge source

DPR document encoder to compute document embeddings for each document
Use a single MIPS index using FAISS
During training, retrieve the top 5-10 documents for each query

Extractive Open-domain QA

Natural Questions (NQ)
TriviaQA (TQA)
WebQuestions (WQ)
CuratedTrec (CT)

Abstractive Question Answering

MSMARCO Natural Language Generation task v2.1 (NL questions + snippets, sentence answer)
Jeopardy Question Generation

Fact verification

FEVER

64 / 68


Closed-Book: Generate answers relying purely on parametric knowledge

Open-Book: Answers are extracted as spans from retrieved documents

New SOTA on all 4 tasks

No reranking, no index update

Results
Extractive QA

65 / 68


Abstractive QA and Classification

Jeopardy

Outperforms BART, more factual
RAG-Token is able to synthesize a response by combining disparate information from different retrieved documents

MSMARCO

Outperforms BART by 2.6 BLEU points and 2.6 ROUGE-L points
Approaches SOTA performance w/o access to passages that contain the specific information required to generate the reference answer

FEVER

For 3-way classification, RAG accuracies are within 4.3% of SOTA models, which have domain-specific architectures and are trained using intermediate supervision, which RAG does not require.

Results

66 / 68


MARGE: Pretraining via Paraphrasing Lewis et al. arXiv 2020

67 / 68

https://arxiv.org/pdf/2006.15020.pdf


The End

68 / 68
