ALPS Winter School 2021
Natural Language Generation: Transfer Learning
Claire Gardent
Advanced Language Processing Winter School, 17-22 January 2021
1 / 68
ALPS Winter School 2021
Part 2: Topics in NLG
Selecting from the Input
Summarisation, Simplification, D2T
Modeling input structure
Graph encoders for D2T, MR2T
Hierarchical encoders for long-form text (Summarisation)
Transfer Learning
D2T, Dialog, ...
2 / 68
ALPS Winter School 2021
What is Transfer Learning?
Key Idea: Transfer knowledge acquired by one model to another model
How?
Find a task (e.g., Language Modelling) for which it is easy to generate labels and for which you can get large quantities of training data
Train a model on this large data and adapt it to a task for which labelled data is harder to get
Two main approaches
Transfer word or sentence embeddings ("feature-based")
Transfer model parameters ("fine-tuning")
3 / 68
ALPS Winter School 2021
What is Transfer Learning?
Transfer prior knowledge
Two main approaches
Transfer word or sentence embeddings ("feature-based")
Transfer model parameters ("fine-tuning")
Why Transfer Learning?
Data
Efficiency
Quality
4 / 68
ALPS Winter School 2021
Feature-Based Transfer
1-hot vector + Learned word embeddings
Learned on labelled data
Specific to the task
Pretrained word embeddings (word2vec, GloVe, ...)
Learned on unlabeled data
Use these for different tasks
Represent each word type with a single representation
Context is a limited window
Contextual Word Representations (ELMo)
Learned on a very large amount of unlabeled data
Depend on the sentence in which a word is used: play area (ADJ), the play was from Shakespeare (NOUN), I play violin (VERB)
Improve NLP tasks
Image credit: Manning
5 / 68
http://web.stanford.edu/class/cs224n/index.html#schedule
ALPS Winter School 2021
ELMo
Learn a LM on large quantities of unlabelled data (1B words)
For a specific task (e.g., NER)
Run this LM on the input text
This creates a contextual representation for each token in the input text
Combine this contextual representation with a pretrained word embedding (e.g., GloVe or word2vec)
Peters et al. NAACL 2018, Image Credit: Manning
6 / 68
https://www.aclweb.org/anthology/N18-1202.pdf
http://web.stanford.edu/class/cs224n/index.html#schedule
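To make the feature-based transfer idea concrete, here is a minimal PyTorch sketch (not the actual ELMo weights or architecture): a frozen static embedding stands in for GloVe/word2vec, a BiLSTM stands in for the pretrained biLM, and their outputs are concatenated as features for a downstream tagger. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; real ELMo uses 2 BiLSTM layers over a character CNN.
VOCAB, STATIC_DIM, CTX_DIM = 10_000, 300, 256

static_emb = nn.Embedding(VOCAB, STATIC_DIM)   # stands in for GloVe/word2vec (frozen)
static_emb.weight.requires_grad = False

bilstm = nn.LSTM(STATIC_DIM, CTX_DIM, num_layers=2,
                 bidirectional=True, batch_first=True)  # stands in for the pretrained biLM

token_ids = torch.randint(0, VOCAB, (1, 7))             # one 7-token sentence
static = static_emb(token_ids)                           # (1, 7, 300)
contextual, _ = bilstm(static)                           # (1, 7, 512), depends on the sentence

# Feature-based transfer: concatenate and feed to a task-specific classifier (e.g., NER).
features = torch.cat([static, contextual], dim=-1)       # (1, 7, 812)
ner_head = nn.Linear(features.size(-1), 9)               # 9 illustrative NER tags
logits = ner_head(features)
print(logits.shape)                                       # torch.Size([1, 7, 9])
```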
ALPS Winter School 2021
ELMo
7 / 68
ALPS Winter School 2021
LM = 2 BiLSTM layers (use all layers, residual connection)
Trained on 800M words
Long context, not a context window (whole sentence)
Provides good gains for multiple NLP tasks
ELMo
8 / 68
ALPS Winter School 2021
Train a deep BiLSTM LM on a large quantity of textual data
Fine-tune the LM on target-domain text
Fine-tune the LM as a classifier: use the same network for pretraining (LM) and fine-tuning (classification)
Improves on the SOTA
Needs less labelled data
Pre-training and Fine-Tuning
ULMFiT: Universal Language Model Fine-tuning (Howard & Ruder, ACL 2018)
9 / 68
https://arxiv.org/abs/1801.06146
ALPS Winter School 2021
Pre-training and Fine-Tuning for NLU
Transformer-Based Architectures
Use a Transformer encoder to compute sentence representations that can be used for all NLP tasks
GPT (OpenAI)
June 2018
Training time: 240 GPU days
BERT (Google AI)
October 2018
Training time: 256 TPU days (~320-560 GPU days)
GPT-2 (OpenAI)
February 2019
Training time: 2048 TPU days
10 / 68
ALPS Winter School 2021
Unsupervised Pretraining: Train a LM on a large corpus of text (BookCorpus, 7K books)
Supervised Fine-tuning: Adapt the LM parameters to the supervised target task
Input is passed through the pretrained LM
Feed the final LM activation to added linear + softmax output layers to predict the output
Task-aware input transformations
Significantly improves upon the SOTA in 9 out of 12 NLU tasks
Input Transformations for task adaptation
GPT: Generative Pre-trained Transformer
Radford et al. 2018
11 / 68
https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
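As a side note, left-to-right LM generation of the kind GPT-style models are pretrained with can be tried directly with the Hugging Face transformers library; the snippet below assumes that library and the public gpt2 checkpoint (GPT-2 small), used here purely for illustration.

```python
# Assumes: pip install transformers torch; public checkpoint name "gpt2".
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Transfer learning in NLP means", max_length=30, num_return_sequences=1)
print(out[0]["generated_text"])
```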
ALPS Winter School 2021
BERT: Bidirectional Encoder Representations from Transformers
GPT: Standard LM, decodes left-to-right
ELMo: Left and right representations are concatenated, not jointly learned
BERT: Replace standard LM objective with Masked Language Model (MLM) objective. Jointly learns both sides of context
Devlin et al. NAACL 2019.
12 / 68
https://www.aclweb.org/anthology/N19-1423.pdf
ALPS Winter School 2021
BERT Pre-training
Pretrain on BooksCorpus (800M words) and English Wikipedia (2,500M words).
Two loss functions
Predict masked tokens
Next sentence prediction classification (true if the next sentence is the correct continuation)
The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood
13 / 68
ALPS Winter School 2021
Masked Language Modelling Objective
Mask 1 word in 7
Too little masking: too expensive to train
Too much masking: not enough context
14 / 68
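A rough sketch of BERT-style token masking. The paper selects 15% of tokens; of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The toy vocabulary and whitespace tokenisation below are illustrative only.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "play", "was", "great", "violin", "area"]

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style masking: 15% of positions are selected; of these, 80% -> [MASK],
    10% -> random token, 10% -> kept unchanged. Labels hold the original tokens."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # the model must predict this token
            r = random.random()
            if r < 0.8:
                inputs.append(MASK)
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # random replacement
            else:
                inputs.append(tok)                   # left as-is
        else:
            inputs.append(tok)
            labels.append(None)                      # not predicted
    return inputs, labels

print(mask_tokens("the play was from Shakespeare".split()))
```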
ALPS Winter School 2021
Next Sentence Prediction
To learn relationship between sentences
Beneficial for QA and NLI.
Has since been shown to be unimportant (not used in, e.g., RoBERTa)
15 / 68
ALPS Winter School 2021
BERT Encoding
16 / 68
ALPS Winter School 2021
Run pretrained model on input
Sentence representation = final hidden state output by the Transformer (= [CLS] word embedding)
Add a classification layer
Fine-tune all BERT parameters and the classification layer jointly to maximize the log-probability of the correct label
BERT Fine-tuning
Image credit: Devlin et al. 2018
17 / 68
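A hedged example of this fine-tuning recipe, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the two-class task, sentences and labels are made up, and a real run would add an optimizer loop and a dev set.

```python
# Assumes: pip install transformers torch; checkpoint "bert-base-uncased".
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["the play was great", "I hated it"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# The classification layer sits on top of the [CLS] representation; fine-tuning updates
# all BERT parameters plus this layer by minimising the cross-entropy loss.
outputs = model(**batch, labels=labels)
outputs.loss.backward()          # an optimizer.step() would follow in a real training loop
print(outputs.logits.shape)      # torch.Size([2, 2])
```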
ALPS Winter School 2021
18 / 68
ALPS Winter School 2021
Pre-training and Fine-Tuning for NLG
Pre-Trained Encoder-Decoders
BART, T5
Generating into Multiple Languages
XLM embeddings
Generating Dialog Turns
ConveRT
Knowledge-rich NLG
REALM, ...
19 / 68
ALPS Winter School 2021
Pre-training and Fine-Tuning for NLG
Pre-Trained Encoder-Decoders
BART, T5
Generating into Multiple Languages
XLM embeddings
Generating Dialog Turns
ConveRT
Knowledge-rich NLG
REALM, ...
20 / 68
ALPS Winter School 2021
Bidirectional encoder + Autoregressive Transformer Decoder
Denoising Autoencoder
Corrupt text with a noising function
S2S model learns to reconstruct the original text
Experiment with different noising functions
Token masking (reconstruct missing tokens)
Token deletion (which positions are missing inputs?)
Text infilling (reconstruct spans)
Sentence permutation (reorder shuffled sentences)
Document rotation (identify the start of the document)
BART (Lewis et al., ACL 2020)
Image credit: Lewis et al. 2019
21 / 68
https://www.aclweb.org/anthology/2020.acl-main.703.pdf
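Two of the noising functions can be sketched in a few lines of plain Python; this is an illustration of the idea, not the authors' implementation (which operates on subword tokens and, for text infilling, masks contiguous spans with Poisson-sampled lengths).

```python
import random

def token_mask(tokens, p=0.3, mask="<mask>"):
    """Token masking: replace a fraction of tokens; the decoder must reconstruct them."""
    return [mask if random.random() < p else t for t in tokens]

def sentence_permutation(text):
    """Sentence permutation: shuffle sentences; the decoder must restore the order."""
    sents = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sents)
    return ". ".join(sents) + "."

doc = "The chicken crossed the road. Nobody knows why. It was raining."
print(token_mask(doc.split()))
print(sentence_permutation(doc))
```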
ALPS Winter School 2021
BART
Image credit: Kelvin Han 2020, Lewis et al. 2019
22 / 68
ALPS Winter School 2021
Achieves new state-of-the-art results on a number of text generation tasks
Text infilling (reconstruct spans) demonstrates the most consistently strong performance
Token masking (reconstruct missing tokens) is crucial
Document Rotation and Sentence Shuffling perform poorly in isolation
BART Results
23 / 68
ALPS Winter School 2021
BART Summary
Coreference
Sacoolas, who has immunity as a diplomat’s wife, was involved in a traffic collision ... Prime Minister Johnson was questioned about the case while speaking to the press at a hospital in Watford. He said, “I hope that Anne Sacoolas will come back ... if we can’t resolve it then of course I will be raising it myself personally with the White House.”
Boris Johnson has said he will raise the issue of US diplomat Anne Sacoolas’ diplomatic immunity with the White House.
He said → Boris Johnson said
He said ... I will be raising ... → Boris Johnson said he will raise the issue
24 / 68
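A summary like the one above can be reproduced approximately with a BART checkpoint fine-tuned for news summarisation; the snippet assumes the Hugging Face transformers library and the public facebook/bart-large-cnn checkpoint, which is not necessarily the exact model behind this slide.

```python
# Assumes: pip install transformers torch; checkpoint "facebook/bart-large-cnn".
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "Sacoolas, who has immunity as a diplomat's wife, was involved in a traffic collision. "
    "Prime Minister Johnson was questioned about the case while speaking to the press at a "
    "hospital in Watford. He said he hoped Anne Sacoolas would come back and that he would "
    "otherwise raise it personally with the White House."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```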
ALPS Winter School 2021
BART Summary
Abstract Anaphora
Sacoolas, who has immunity as a diplomat’s wife, was involved in a traffic collision ... Prime Minister Johnson was questioned about the case while speaking to the press at a hospital in Watford. He said, “I hope that Anne Sacoolas will come back ... if we can’t resolve it then of course I will be raising it myself personally with the White House.”
Boris Johnson has said he will raise the issue of US diplomat Anne Sacoolas’ diplomatic immunity with the White House.
25 / 68
ALPS Winter School 2021
World Knowledge
Sacoolas, who has immunity as a diplomat’s wife, was involved in a traffic collision ... Prime Minister Johnson was questioned about the case while speaking to the press at a hospital in Watford. He said, “I hope that Anne Sacoolas will come back ... if we can’t resolve it then of course I will be raising it myself personally with the White House.”
Boris Johnson has said he will raise the issue of US diplomat Anne Sacoolas’ diplomatic immunity with the White House.
26 / 68
ALPS Winter School 2021
Standard Encoder-Decoder Transformer: 12 blocks, 12 heads, 220M parameters, 32K wordpieces
Pretraining
Masked span LM objective
On C4 (Colossal Clean Crawled Corpus, 750 GB of text data extracted from web pages)
Multitask learning: machine translation, text classification, ...
Treat every text processing problem as a "text-to-text" problem, i.e. taking text as input and producing new text as output
Add a task-specific prefix to indicate the task to be performed
T5-Base
27 / 68
ALPS Winter School 2021
T5: Text-to-Text Transfer Transformer
Treat every text processing problem as a "text-to-text" problem, i.e. taking text as input and producing new text as output
Raffel et al. JMLR 2020
28 / 68
https://jmlr.org/papers/volume21/20-074/20-074.pdf
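The task prefix mechanism can be tried with the public T5 checkpoints; this sketch assumes the Hugging Face transformers library and the t5-small checkpoint, and uses one of the prefixes the model was pretrained with.

```python
# Assumes: pip install transformers sentencepiece torch; checkpoint "t5-small".
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is text-to-text; the prefix tells the model which task to perform.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```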
ALPS Winter School 2021
Pretraining and Fine-Tuning for D2T Generation (1)
Fine-tune T5 on the WebNLG, MultiWoz and ToTTo benchmarks.
Small number of steps (5K for MultiWoz and WebNLG, 10K for ToTTo)
All the model parameters are updated in the fine-tuning process
Achieves state-of-the-art results
T5 pretraining also enables greater generalization, as evidenced by large improvements on out-of-domain test sets
Kale et al., INLG 2020.
29 / 68
https://www.aclweb.org/anthology/2020.inlg-1.14.pdf
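Fine-tuning T5 on a D2T benchmark requires flattening the structured input into text. The sketch below shows one plausible linearisation of WebNLG-style RDF triples; the field markers and task prefix are assumptions for illustration, not necessarily the format used by Kale et al.

```python
def linearize_triples(triples):
    """Turn (subject, property, object) RDF triples into a flat string input.
    The <S>/<P>/<O> markers and the prefix are illustrative assumptions."""
    parts = [f"<S> {subj} <P> {prop} <O> {obj}" for subj, prop, obj in triples]
    return "generate text: " + " ".join(parts)

triples = [("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
           ("Alan_Bean", "occupation", "Test_pilot")]
print(linearize_triples(triples))
# A fine-tuning target would be a verbalisation such as:
# "Alan Bean was born in Wheeler, Texas and worked as a test pilot."
```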
ALPS Winter School 2021
WebNLG Gardent et al., ACL 2017
RDF-to-Text
MultiWoz Budzianowski et al. EMNLP 2018
10K human-human dialogs
Task-based
Input = dialog act (inform, request, etc.) and a list of slot key-value pairs
ToTTo Parikh et al. EMNLP 2020
Wikipedia tables + Text
A subset of cells is highlighted.
A model must generate text that describes the highlighted cells.
Here: use only the highlighted cells and metadata as input
D2T Benchmarks
30 / 68
https://www.aclweb.org/anthology/P17-1017.pdf
https://www.aclweb.org/anthology/D18-1547.pdf
https://www.aclweb.org/anthology/2020.emnlp-main.89.pdf
ALPS Winter School 2021
WebNLG Unseen: +14 BLEU
ToTTo Non-Overlap test set: +6.6 BLEU, +7.5 PARENT
MultiWoz: +45 BLEU without in-domain pretraining
31 / 68
ALPS Winter School 2021
Pretraining and Fine-Tuning for D2T Generation (2)
WebNLG+ 2020
D2T Generation: RDF → English, Russian
Semantic Parsing: English, Russian → RDF
Pretraining
T5 jointly pretrained on English WKP, Russian WKP and the WMT (en, ru) parallel corpus for 800K steps
Fine-tuning
Semantic parsing and NLG are handled by separate models
Monolingual: a model for each language
Bilingual: multitask both languages and fine-tune a single model for each language
Bilingual+WPC: add the WebNLG Parallel Corpus (WPC) in both directions (en, ru) and (ru, en)
WPC: parallel sentences and entities in English and Russian extracted from the WebNLG+ parallel corpus
Each task is weighted by the size of its training corpus
Agarwal et al., WebNLG+ 2020
32 / 68
https://webnlg-challenge.loria.fr/files/2020.webnlg-papers.13.pdf
ALPS Winter School 2021
Generation
Bilingual+WPC: +6.84 BLEU on unseen Russian
Semantic Parsing
Bilingual and Bilingual+WPC outperform Monolingual
Results
33 / 68
ALPS Winter School 2021
Wikibio Dataset
Input: Wiki infobox
Output: first sentence of the corresponding WKP article
Pretraining and Fine-Tuning for D2T Generation (3)
Few-shot D2T NLG: Learn a D2T model from 50-200 training instances
34 / 68
ALPS Winter School 2021
Field-gated dual attention LSTM model (Liu et al., 2018)
Latent switch to choose between copying from the table and generating from the softmax distribution
Additional loss term to maximise copy probability for target tokens which match an input token
Pretrained Transformer-based LM (GPT-2)
Model
Chen et al. ACL 2020.
35 / 68
https://arxiv.org/pdf/1711.09724.pdf
https://www.aclweb.org/anthology/2020.acl-main.18/
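A rough PyTorch sketch of the copy switch described above: a gate computed from the decoder state mixes a copy distribution (attention projected onto the vocabulary) with the ordinary softmax. Dimensions are illustrative and this is not Chen et al.'s implementation.

```python
import torch
import torch.nn.functional as F

VOCAB, SRC_LEN, HID = 1000, 12, 64
hidden = torch.randn(1, HID)                     # decoder state at one time step
gen_logits = torch.randn(1, VOCAB)               # ordinary generation logits
attn = F.softmax(torch.randn(1, SRC_LEN), -1)    # attention over input table tokens
src_ids = torch.randint(0, VOCAB, (1, SRC_LEN))  # vocabulary ids of the input tokens

p_copy = torch.sigmoid(torch.nn.Linear(HID, 1)(hidden))          # latent switch in [0, 1]
copy_dist = torch.zeros(1, VOCAB).scatter_add_(1, src_ids, attn)  # attention projected onto vocab

# Final word distribution mixes copying from the table with generating from the softmax.
p_word = p_copy * copy_dist + (1 - p_copy) * F.softmax(gen_logits, -1)
print(p_word.sum())   # ~1.0, a valid distribution over the vocabulary
```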
ALPS Winter School 2021
Loss term: +4.0 BLEU points
Results
Base-original (Liu et al., 2018), which obtains the state-of-the-art result on the WikiBio full set, performs very poorly in the few-shot setting.
Base+switch: +10.0 BLEU on avg. Learns to copy but the output is not fluent
Base+switch+pretrained LM (GPT-2): +8.0 BLEU on avg
36 / 68
ALPS Winter School 2021
Pre-training and Fine-Tuning for NLG
Pre-Trained Encoder-Decoders
BART, T5
Generating into Multiple Languages
XLM embeddings
Generating Dialog Turns
ConveRT
Knowledge-rich NLG
REALM, ...
37 / 68
ALPS Winter School 2021
Generating from AMRs into 21 Languages
Fan et al. EMNLP 2020.
38 / 68
https://www.aclweb.org/anthology/2020.emnlp-main.231.pdf
ALPS Winter School 2021
Graph Encoding
39 / 68
ALPS Winter School 2021
Graph Encoding
40 / 68
ALPS Winter School 2021
Remove variable names and the instance-of relation
No anonymisation
SentencePiece model with 32K operations
Pre-processing
41 / 68
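A toy illustration of this pre-processing step: stripping variable names (and with them the instance-of relation) from a PENMAN-style AMR string before subword tokenisation. The regex is a simplification, not the authors' pipeline; re-entrant variables would need extra handling.

```python
import re

amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"

# Drop "variable /" prefixes, i.e. the instance-of relation, keeping only concepts.
simplified = re.sub(r"\b\w+\s*/\s*", "", amr)
# Collapse whitespace so the string can be fed to a sentence-piece model.
simplified = re.sub(r"\s+", " ", simplified).strip()

# Note: the re-entrant variable "b" in the last argument is left untouched by this toy regex.
print(simplified)   # "(want-01 :ARG0 (boy) :ARG1 (go-02 :ARG0 b))"
```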
ALPS Winter School 2021
Pretraining on silver AMRs
30M sentences from CCNET
Using JAMR
Pretraining
42 / 68
ALPS Winter School 2021
XLM SentencePiece model and vocabulary
XLM cross-lingual embeddings
Language Model pretraining on 30M sentences (for each language)
Decoding into Multiple Languages
43 / 68
ALPS Winter School 2021
XLM Cross-Lingual Embeddings
Lample et al. NeurIPS 2019.
44 / 68
https://arxiv.org/pdf/1901.07291v1.pdf
ALPS Winter School 2021
Multilingual AMR-to-NL Model
45 / 68
ALPS Winter School 2021
Training Data
Europarl: 21 Languages
Input AMR: create AMR structure with JAMR parser
46 / 68
ALPS Winter School 2021
Takeaways
Human evaluation shows that multilingual techniques generalize across languages
A multilingual model benefits from increased training data and performs better than a monolingual one, particularly when training data is scarce
Using English-centric AMR, we can decode into many different target-side languages
The model generates good paraphrases
47 / 68
ALPS Winter School 2021
Example Paraphrases
48 / 68
ALPS Winter School 2021
Pre-training and Fine-Tuning for NLG
Pre-Trained Encoder-Decoders
BART, T5
Generating into Multiple Languages
XLM embeddings
Generating Dialog Turns
DialoGPT, ConveRT
Knowledge-rich NLG
REALM, ...
49 / 68
ALPS Winter School 2021
DialoGPT: Large-Scale Generative Pretraining for Conversational Response Generation
Language Model: GPT-2 architecture, 12-to-24 layer transformer
Pretrained on Reddit (Dialog Context, Dialog Turn) pairs: 147M dialogs, 1.8B words
Zhang et al. ACL 2020.
50 / 68
https://www.aclweb.org/anthology/2020.acl-demos.30.pdf
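Generating a response with a released DialoGPT checkpoint; this assumes the Hugging Face transformers library and the microsoft/DialoGPT-medium checkpoint, where dialog turns are concatenated into a single token stream separated by the end-of-sequence token.

```python
# Assumes: pip install transformers torch; checkpoint "microsoft/DialoGPT-medium".
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# Dialog context and response form one token stream separated by the EOS token.
context = tokenizer.encode("Does money buy happiness?" + tokenizer.eos_token,
                           return_tensors="pt")
reply_ids = model.generate(context, max_length=60, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, context.shape[-1]:], skip_special_tokens=True))
```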
ALPS Winter School 2021
DSTC-7 Dialogue Generation Challenge
Generate conversation responses that go beyond chitchat by injecting information that is grounded in external knowledge
DialoGPT fine-tuned on DSTC source-target pairs
Does not leverage the grounding information from the DSTC dataset
Outperforms the winning system of the DSTC-7 Challenge and Personality Chat
51 / 68
ALPS Winter School 2021
Dialogs demonstrate some ability to address commonsense questions and to handle multi-turn dialog
Results
52 / 68
ALPS Winter School 2021
ConveRT: Efficient and Accurate Conversational Representations from Transformers
Retrieval-based dialog model
Treats each input utterance as a query and retrieves the most relevant response from a large response collection by computing semantic similarity between the query representation and the encoding of each response in the collection.
Henderson et al. arXiv 2019 and Findings EMNLP 2020
53 / 68
https://arxiv.org/pdf/1911.03688.pdf
https://www.aclweb.org/anthology/P19-1536.pdf
ALPS Winter School 2021
Model
Response selection is the task of selecting the most appropriate system response given the dialogue history and the input user utterance (i.e., the full dialogue context).
Transformer-style architectures for input and response encodings
Pretraining and Fine-tuning
Pretrain the response selection model on large general-domain conversational corpora: Reddit, 727M (input, response) pairs
Fine-tune for target dialogue domains using a small in-domain dataset
ConveRT
54 / 68
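A schematic sketch of dual-encoder response selection: a query encoder and a response encoder map text to vectors, and candidates are ranked by similarity. The encoders below are random stand-ins (ConveRT uses shared subword embeddings and light transformer blocks), so only the ranking mechanics are illustrated.

```python
import torch
import torch.nn as nn

# Stand-in encoders; each EmbeddingBag averages token embeddings into one vector per text.
query_enc = nn.EmbeddingBag(5000, 128)
resp_enc = nn.EmbeddingBag(5000, 128)

def encode(enc, token_ids):
    return nn.functional.normalize(enc(token_ids), dim=-1)

query = torch.randint(0, 5000, (1, 8))          # one tokenised user utterance
candidates = torch.randint(0, 5000, (100, 12))  # 100 tokenised candidate responses

scores = encode(query_enc, query) @ encode(resp_enc, candidates).T   # (1, 100) similarities
print("selected response index:", scores.argmax(dim=-1).item())
```

During pretraining the objective is contrastive: the other responses in the batch act as negatives, so the encoders learn to score the true (input, response) pair above the rest.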
ALPS Winter School 2021
Multi-Context Dual-Encoder Model
Combines the immediate context with previous dialog history (up to 10 more previous messages in a Reddit thread)
Linear combination of three training objectives
Ranking responses given the immediate context
Ranking responses given only the non-immediate contexts
Ranking responses given the averaged representation of the immediate and non-immediate context
ConveRT
55 / 68
ALPS Winter School 2021
Baselines
Universal Sentence Encoder
PolyAI-Dual: best-performing dual-encoder model from Henderson et al. (2019b), pretrained on Reddit response selection.
MAP: learns to (linearly) map the response vectors to the input vector space
AMAZONQA
3.6M (single-context) QA pairs, 300K pairs are reserved for testing.
DSTC7 UBUNTU
1M+ conversations
ConveRT
Results
Significant gains over the previous state-of-the-art
56 / 68
ALPS Winter School 2021
And also
Poly-Encoders (Humeau et al., ICLR 2020): Architectures and pretraining strategies for fast and accurate multi-sentence scoring
Meena (Adiwardana et al., arXiv 2020): A Transformer-based model trained on 341 GB of text, shown to be superior to variants of DialoGPT
Open-domain chatbots that perform well in human evaluation (Roller et al., arXiv 2020): Pretrain on Reddit and fine-tune on the ConvAI2, Wizard of Wikipedia, Empathetic Dialogues and Blended Skill Talk datasets; compares generative, retrieval and retrieve-and-refine models
...
57 / 68
https://openreview.net/pdf?id=SkxgnnNFvH
https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html
https://parl.ai/projects/recipes/
ALPS Winter School 2021
Pre-training and Fine-Tuning for NLG
Pre-Trained Encoder-Decoders
BART, T5
Generating into Multiple Languages
XLM embeddings
Generating Dialog Turns
ConveRT
Knowledge-rich NLG
REALM, RAG
58 / 68
ALPS Winter School 2021
Joint MLM and Document Retrieval
Unsupervised MLM objective
Retrieval
Inner product between the input text and document BERT embeddings
MIPS for efficient retrieval (nearest neighbour style search)
Precomputed index updated every few hundred steps
Extract top-k documents
LM
Predict masked tokens based on the input text and the retrieved documents
Signal from the language modeling objective backpropagates through the retriever
REALM: Retrieval-Augmented Language Model Pre-Training
Guu et al. arXiv 2020
59 / 68
https://arxiv.org/pdf/2002.08909v1.pdf
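A small NumPy sketch of the retrieval step: the masked input and every document are embedded (randomly here, standing in for the BERT-style embedders), the top-k documents are found by inner product, and the retrieval distribution p(z|x) is a softmax over their scores. A real system would use an approximate MIPS index rather than this brute-force search.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(10_000, 128))   # precomputed document embeddings (the index)
query_emb = rng.normal(size=(128,))         # embedding of the masked input text

scores = doc_embs @ query_emb               # inner products, as in MIPS
topk = np.argsort(-scores)[:5]              # brute-force top-5; FAISS/ScaNN would approximate this
print("retrieved document ids:", topk)

# The retrieval distribution p(z|x) is a softmax over these scores, so the LM loss on the
# masked tokens can backpropagate into both the query and document encoders.
p_retrieve = np.exp(scores[topk] - scores[topk].max())
p_retrieve /= p_retrieve.sum()
print("p(z|x) over the top-5:", np.round(p_retrieve, 3))
```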
ALPS Winter School 2021
Injecting inductive biases into pre-training
Salient span masking
Focus on examples that require world knowledge to predict the masked token
Mask named entities
Null document
Add an empty null document to the top-k retrieved documents for cases when no retrieval is necessary (not all masked tokens require world knowledge to predict).
Trivial retrievals
Forbid retrieval of documents containing the masked input sentence
Initialisation
To improve the quality of document and input embeddings, start with the Inverse Cloze Task objective (given a sentence x, the model is trained to retrieve the document that the sentence comes from)
60 / 68
ALPS Winter School 2021
Pretrain retriever and encoder on Wikipedia and CCNews
Fine-tune on QA data (real user queries)
NaturalQuestions-Open
WebQuestions
CuratedTrec
Fine-tuning for QA
61 / 68
ALPS Winter School 2021
T5: no explicit knowledge retrieval
DrQA etc.: IR and LM trained separately
Ablation
Joint IR and LM training helps
Salient span masking is crucial
Refreshing the index is important
Results
62 / 68
ALPS Winter School 2021
RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Pretrained Retriever + Pretrained Encoder-Decoder
Retriever
DPR (Dense Passage Retrieval): bi-encoder pretrained to retrieve documents which contain answers to TriviaQA questions and Natural Questions
Generator: BART-large
Generates tokens based on a context of the previous tokens, the original input and a retrieved passage
Jointly trained on target task
Keep the document encoder and index fixed
Fine-tune the query encoder and the generator
63 / 68
https://arxiv.org/pdf/2004.04906.pdf
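A schematic sketch of how RAG combines the two components: the generator scores the output y given the input x and each retrieved passage z, and the final probability marginalises over passages weighted by the retrieval distribution (RAG-Sequence style, p(y|x) = sum_z p(z|x) p(y|x,z)). The numbers are made up; this is not the Lewis et al. implementation.

```python
import numpy as np

# Retrieval distribution over k=3 retrieved passages (from DPR inner-product scores).
p_z_given_x = np.array([0.6, 0.3, 0.1])

# Generator probability of the same answer y given the input x and each passage z
# (in the real model these come from BART-large conditioned on the concatenation of x and z).
p_y_given_xz = np.array([0.20, 0.05, 0.01])

# RAG-Sequence marginalisation: p(y|x) = sum_z p(z|x) * p(y|x,z)
p_y_given_x = float(np.sum(p_z_given_x * p_y_given_xz))
print(f"p(y|x) = {p_y_given_x:.4f}")   # 0.6*0.20 + 0.3*0.05 + 0.1*0.01 = 0.136
```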
ALPS Winter School 2021
Knowledge-intensive Tasks
Wikipedia dump as knowledge source
DPR document encoder to compute document embeddings for each document
Use a single MIPS index using FAISS
During training, retrieve the top 5-10 documents for each query
Extractive Open-domain QA
Natural Questions (NQ)
TriviaQA (TQA)
WebQuestions (WQ)
CuratedTrec (CT)
Abstractive Question Answering
MSMARCO Natural Language Generation task v2.1 (NL questions + snippets, sentence answer)
Jeopardy Question Generation
Fact verification
FEVER
64 / 68
ALPS Winter School 2021
Closed-Book: Generate answers relying purely on parametric knowledge
Open-Book: Answers are extracted as spans from retrieved documents
New SOTA on all 4 tasks
No reranking, no index update
Results
Extractive QA
65 / 68
ALPS Winter School 2021
Abstractive QA and Classification
Jeopardy
Outperforms BART, more factual
RAG-Token is able to synthesize a response by combining disparate information from different retrieved documents
MSMARCO
Outperforms BART by 2.6 BLEU points and 2.6 Rouge-L points
Approaches SOTA performance w/o access to passages that contain the specific information required to generate the reference answer
FEVER
For 3-way classification, RAG accuracies are within 4.3% of SOTA models that use domain-specific architectures and are trained using intermediate supervision, which RAG does not require.
Results
66 / 68
ALPS Winter School 2021
MARGE: Pre-training via Paraphrasing (Lewis et al., arXiv 2020)
67 / 68
https://arxiv.org/pdf/2006.15020.pdf
ALPS Winter School 2021
The End
68 / 68