arXiv:1810.04805v1 [cs.CL] 11 Oct 2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin  Ming-Wei Chang  Kenton Lee  Kristina Toutanova
Google AI Language

{jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), outperforming human performance by 2.0.

1 Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Howard and Ruder, 2018). These tasks include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token level.

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be devastating when applying fine-tuning based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where it is crucial to incorporate context from both directions.

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT addresses the previously mentioned unidirectional constraints by proposing a new pre-training objective: the “masked language model” (MLM), inspired by the Cloze task (Taylor, 1953).


The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.

The contributions of our paper are as follows:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• We show that pre-trained representations eliminate the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.

• BERT advances the state-of-the-art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained model will be available at goo.gl/language/bert.1

2 Related Work

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.

2.1 Feature-based Approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).

1 Will be released before the end of October 2018.

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension. They propose to extract context-sensitive features from a language model. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state-of-the-art for several major NLP benchmarks (Peters et al., 2018) including question answering (Rajpurkar et al., 2016) on SQuAD, sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).

2.2 Fine-tuning Approaches

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on a LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).

2.3 Transfer Learning from Supervised Data

While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).


[Figure 1 diagram omitted: stacks of Transformer (Trm) or LSTM blocks mapping inputs E1 ... EN to outputs T1 ... TN for the three models BERT (Ours), OpenAI GPT, and ELMo.]

Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.

3 BERT

We introduce BERT and its detailed implementation in this section. We first cover the model architecture and the input representation for BERT. We then introduce the pre-training tasks, the core innovation in this paper, in Section 3.3. The pre-training and fine-tuning procedures are detailed in Sections 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.

3.1 Model Architecture

BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.2 Because the use of Transformers has become ubiquitous recently and our implementation is effectively identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”3

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for H = 768 and 4096 for H = 1024. We primarily report results on two model sizes (a small configuration sketch follows the list):

• BERTBASE: L=12, H=768, A=12, Total Parameters=110M

• BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M

2 https://github.com/tensorflow/tensor2tensor
3 http://nlp.seas.harvard.edu/2018/04/03/attention.html
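The two configurations and the 4H feed-forward rule can be written down compactly. A minimal Python sketch, assuming nothing beyond the numbers above (the class and field names are illustrative, not from the released code):

```python
from dataclasses import dataclass

@dataclass
class BertSize:
    num_layers: int    # L, the number of Transformer blocks
    hidden_size: int   # H
    num_heads: int     # A, self-attention heads

    @property
    def ff_size(self) -> int:
        # The feed-forward/filter size is always 4H: 3072 for H=768, 4096 for H=1024.
        return 4 * self.hidden_size

BERT_BASE = BertSize(num_layers=12, hidden_size=768, num_heads=12)    # ~110M parameters
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_heads=16)  # ~340M parameters
```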

BERTBASE was chosen to have an identical model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation. The comparisons between BERT, OpenAI GPT and ELMo are shown visually in Figure 1.

3.2 Input Representation

Our input representation is able to unambiguously represent both a single text sentence or a pair of text sentences (e.g., [Question, Answer]) in one token sequence.4 For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings (a minimal sketch of this sum follows the list below). A visual representation of our input representation is given in Figure 2.

The specifics are:

• We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.

• We use learned positional embeddings with supported sequence lengths up to 512 tokens.

4 Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.


Input                [CLS]   my   dog   is   cute   [SEP]   he    likes   play   ##ing   [SEP]
Token Embeddings     E[CLS]  Emy  Edog  Eis  Ecute  E[SEP]  Ehe   Elikes  Eplay  E##ing  E[SEP]
Segment Embeddings   EA      EA   EA    EA   EA     EA      EB    EB      EB     EB      EB
Position Embeddings  E0      E1   E2    E3   E4     E5      E6    E7      E8     E9      E10

Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

• The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state (i.e., output of Transformer) corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.

• Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned sentence A embedding to every token of the first sentence and a sentence B embedding to every token of the second sentence.

• For single-sentence inputs we only use the sentence A embeddings.
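As a concrete illustration of the points above, the following is a minimal numpy sketch of the embedding sum; the lookup tables are randomly initialized stand-ins and the token ids are hypothetical, not the real WordPiece vocabulary:

```python
import numpy as np

H, VOCAB_SIZE, MAX_LEN = 768, 30000, 512
rng = np.random.default_rng(0)
token_table = rng.normal(size=(VOCAB_SIZE, H))   # WordPiece embeddings
segment_table = rng.normal(size=(2, H))          # row 0 = sentence A, row 1 = sentence B
position_table = rng.normal(size=(MAX_LEN, H))   # learned positional embeddings

def input_embeddings(token_ids, segment_ids):
    """Each token's input vector is the sum of its token, segment and position embeddings."""
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])          # shape: (seq_len, H)

# [CLS] my dog is cute [SEP] he likes play ##ing [SEP]  (hypothetical ids)
token_ids = [0, 11, 22, 33, 44, 1, 55, 66, 77, 88, 1]
segment_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
embeddings = input_embeddings(token_ids, segment_ids)  # (11, 768)
```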

3.3 Pre-training Tasks

Unlike Peters et al. (2018) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two novel unsupervised prediction tasks, described in this section.

3.3.1 Task #1: Masked LM

Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself” in a multi-layered context.

In order to train a deep bidirectional representation, we take a straightforward approach of masking some percentage of the input tokens at random, and then predicting only those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Although this does allow us to obtain a bidirectional pre-trained model, there are two downsides to such an approach. The first is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my dog is hairy it chooses hairy. It then performs the following procedure:

• Rather than always replacing the chosen words with [MASK], the data generator will do the following (a minimal sketch of this 80/10/10 rule follows the list):

• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]

• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple

• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
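A minimal sketch of this masking rule as a data-generator step, assuming a plain Python list of WordPiece tokens and an illustrative vocabulary (all names are hypothetical):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Choose ~15% of tokens and apply the 80/10/10 replacement rule.
    Returns the corrupted tokens and the positions to be predicted."""
    rng = random.Random(seed)
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    num_to_mask = max(1, round(len(candidates) * mask_prob))
    targets = rng.sample(candidates, num_to_mask)
    for i in targets:
        r = rng.random()
        if r < 0.8:
            tokens[i] = "[MASK]"           # 80%: replace with the [MASK] token
        elif r < 0.9:
            tokens[i] = rng.choice(vocab)  # 10%: replace with a random word
        # else: 10% of the time keep the word unchanged
    return tokens, targets

masked, targets = mask_tokens(["my", "dog", "is", "hairy"], vocab=["apple", "the", "cat"])
```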


The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model’s language understanding capability.

The second downside of using an MLM is that only 15% of tokens are predicted in each batch, which suggests that more pre-training steps may be required for the model to converge. In Section 5.3 we demonstrate that MLM does converge marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.

3.3.2 Task #2: Next Sentence Prediction

Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two text sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train a binarized next sentence prediction task that can be trivially generated from any monolingual corpus (a minimal generation sketch follows the examples below). Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus. For example:

Input = [CLS] the man went to [MASK] store [SEP]

he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]

penguin [MASK] are flight ##less birds [SEP]

Label = NotNext
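A minimal sketch of how such pairs can be generated from a monolingual corpus, assuming each document is represented as a list of sentences (all names are illustrative):

```python
import random

def make_nsp_example(documents, doc_index, sent_index, rng=random.Random(0)):
    """Return (sentence_a, sentence_b, label) for next sentence prediction."""
    doc = documents[doc_index]
    sentence_a = doc[sent_index]
    if rng.random() < 0.5 and sent_index + 1 < len(doc):
        sentence_b, label = doc[sent_index + 1], "IsNext"   # the actual next sentence
    else:
        other_doc = rng.choice(documents)                   # a random sentence from the corpus
        sentence_b, label = rng.choice(other_doc), "NotNext"
    return sentence_a, sentence_b, label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds"]]
print(make_nsp_example(docs, doc_index=0, sent_index=0))
```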

We choose the NotNext sentences completely at random, and the final pre-trained model achieves 97%-98% accuracy at this task. Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.

3.4 Pre-training Procedure

The pre-training procedure largely follows the existing literature on language model pre-training.

For the pre-training corpus we use the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.

We train with a batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate (a minimal sketch of this schedule follows). We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood.
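A minimal sketch of the warmup-then-linear-decay learning rate schedule described above (the function name is illustrative):

```python
def pretraining_learning_rate(step, base_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup over the first 10,000 steps, then linear decay of the learning rate."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# e.g. pretraining_learning_rate(5_000) == 5e-5; pretraining_learning_rate(1_000_000) == 0.0
```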

Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total).5 Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.

3.5 Fine-tuning Procedure

For sequence-level classification tasks, BERT fine-tuning is straightforward. In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding.

5 https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-global-availability.html


We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are for a classification layer W ∈ R^{K×H}, where K is the number of classifier labels. The label probabilities P ∈ R^K are computed with a standard softmax, P = softmax(C W^T). All of the parameters of BERT and W are fine-tuned jointly to maximize the log-probability of the correct label (a minimal sketch of this classification head follows). For span-level and token-level prediction tasks, the above procedure must be modified slightly in a task-specific manner. Details are given in the corresponding subsections of Section 4.
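A minimal numpy sketch of this classification head, assuming C and W as defined above:

```python
import numpy as np

def label_probabilities(C, W):
    """C: (H,) final hidden state of [CLS]; W: (K, H) classification layer.
    Returns P = softmax(C W^T), a distribution over the K labels."""
    logits = W @ C                  # shape (K,), i.e. C W^T
    logits = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

H, K = 768, 3
C = np.random.randn(H)             # pooled representation from BERT
W = np.zeros((K, H))               # the only new parameters added for fine-tuning
print(label_probabilities(C, W))   # uniform over the K labels before any training
```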

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

• Batch size: 16, 32

• Learning rate (Adam): 5e-5, 3e-5, 2e-5

• Number of epochs: 3, 4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set (a minimal search sketch follows).
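A minimal sketch of that exhaustive search; `dev_accuracy` is a hypothetical stand-in for fine-tuning BERT with a given configuration and scoring it on the development set:

```python
import random
from itertools import product

def dev_accuracy(batch_size, learning_rate, num_epochs):
    # Hypothetical placeholder: fine-tune with these settings and return Dev accuracy.
    return random.random()

grid = product([16, 32],            # batch size
               [5e-5, 3e-5, 2e-5],  # learning rate (Adam)
               [3, 4])              # number of epochs
best_config = max(grid, key=lambda cfg: dev_accuracy(*cfg))
print(best_config)
```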

3.6 Comparison of BERT and OpenAI GPT

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally chosen to be as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the two novel pre-training tasks presented in Section 3.3 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:

• GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

• GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

• GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the new pre-training tasks.

4 Experiments

In this section, we present BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE Datasets

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have already existed for a number of years, but the purpose of GLUE is to (1) distribute these datasets with canonical Train, Dev, and Test splits, and (2) set up an evaluation server to mitigate issues with evaluation inconsistencies and Test set overfitting. GLUE does not distribute labels for the Test set and users must upload their predictions to the GLUE server for evaluation, with limits on the number of submissions.

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018):

MNLI Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent (Chen et al., 2018).


[Figure 3 diagrams omitted: four task-specific BERT configurations — (a) sentence-pair classification (Sentence 1, Sentence 2 → Class Label over [CLS]), (b) single-sentence classification (→ Class Label), (c) question answering (Question, Paragraph → Start/End Span), and (d) single-sentence tagging (→ per-token labels such as B-PER, O).]

Figure 3: Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, Ti represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph which do not contain the answer.

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).


System               MNLI-(m/mm)  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Average
(training examples)  392k         363k  108k  67k    8.5k  5.7k   3.5k  2.5k  -
Pre-OpenAI SOTA      80.6/80.1    66.1  82.3  93.2   35.0  81.0   86.0  61.7  74.0
BiLSTM+ELMo+Attn     76.4/76.1    64.8  79.9  90.4   36.0  73.3   84.9  56.8  71.0
OpenAI GPT           82.1/81.4    70.3  88.1  91.3   45.4  80.0   82.3  56.0  75.2
BERTBASE             84.6/83.4    71.2  90.1  93.5   52.1  85.8   88.9  66.4  79.6
BERTLARGE            86.7/85.9    72.1  91.1  94.9   60.5  86.5   89.3  70.1  81.9

Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12, H=768, A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single task. All results obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised/.

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).6

WNLI Winograd NLI is a small natural language inference dataset deriving from (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset,7 and every trained system that has been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set out of fairness to OpenAI GPT. For our GLUE submission, we always predicted the majority class.

4.1.1 GLUE Results

To fine-tune on GLUE, we represent the input sequence or sequence pair as described in Section 3, and use the final hidden vector C ∈ R^H corresponding to the first input token ([CLS]) as the aggregate representation. This is demonstrated visually in Figure 3 (a) and (b). The only new parameters introduced during fine-tuning are a classification layer W ∈ R^{K×H}, where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(C W^T)).

We use a batch size of 32 and 3 epochs over the data for all GLUE tasks. For each task, we ran fine-tunings with learning rates of 5e-5, 4e-5, 3e-5, and 2e-5 and selected the one that performed best on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small data sets (i.e., some runs would produce degenerate results), so we ran several random restarts and selected the model that performed best on the Dev set.

6 Note that we only report single-task fine-tuning results in this paper. A multitask fine-tuning approach could potentially push the results even further. For example, we did observe substantial improvements on RTE from multi-task training with MNLI.

7 https://gluebenchmark.com/faq

With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization. We note that the GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each of BERTBASE and BERTLARGE.

Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all existing systems on all tasks by a substantial margin, obtaining 4.4% and 6.7% respective average accuracy improvement over the state-of-the-art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture outside of the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.7% absolute accuracy improvement over the state-of-the-art. On the official GLUE leaderboard,8 BERTLARGE obtains a score of 80.4, compared to the top leaderboard system, OpenAI GPT, which obtains 72.8 as of the date of writing.

It is interesting to observe that BERTLARGE significantly outperforms BERTBASE across all tasks, even those with very little training data. The effect of BERT model size is explored more thoroughly in Section 5.2.

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a paragraph from Wikipedia containing the answer, the task is to predict the answer text span in the paragraph. For example:

8 https://gluebenchmark.com/leaderboard


• Input Question: Where do water droplets collide with ice crystals to form precipitation?

• Input Paragraph: ... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. ...

• Output Answer: within a cloud

This type of span prediction task is quite different from the sequence classification tasks of GLUE, but we are able to adapt BERT to run on SQuAD in a straightforward manner. Just as with GLUE, we represent the input question and paragraph as a single packed sequence, with the question using the A embedding and the paragraph using the B embedding. The only new parameters learned during fine-tuning are a start vector S ∈ R^H and an end vector E ∈ R^H. Let the final hidden vector from BERT for the ith input token be denoted as Ti ∈ R^H. See Figure 3 (c) for a visualization. Then, the probability of word i being the start of the answer span is computed as a dot product between Ti and S followed by a softmax over all of the words in the paragraph:

P_i = e^{S·T_i} / Σ_j e^{S·T_j}

The same formula is used for the end of the answer span, and the maximum scoring span is used as the prediction. The training objective is the log-likelihood of the correct start and end positions.

We train for 3 epochs with a learning rate of 5e-5 and a batch size of 32. At inference time, since the end prediction is not conditioned on the start, we add the constraint that the end must come after the start, but no other heuristics are used (a minimal span-scoring sketch follows). The tokenized labeled span is aligned back to the original untokenized input for evaluation.
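A minimal numpy sketch of the span scoring and the end-after-start constraint used at inference; T, S and E are as defined above:

```python
import numpy as np

def best_answer_span(T, S, E):
    """T: (seq_len, H) final hidden vectors for the paragraph tokens;
    S, E: (H,) start and end vectors. Returns the maximum-scoring (start, end)
    pair under the constraint that the end does not come before the start."""
    start_scores = T @ S                # dot product S . T_i for every token i
    end_scores = T @ E
    best, best_score = (0, 0), -np.inf
    for i in range(len(T)):
        for j in range(i, len(T)):      # end must come after (or equal) the start
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

T = np.random.randn(20, 768); S = np.random.randn(768); E = np.random.randn(768)
print(best_answer_span(T, S, E))
```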

Results are presented in Table 2. SQuAD uses a highly rigorous testing procedure where the submitter must manually contact the SQuAD organizers to run their system on a hidden test set, so we only submitted our best system for testing. The result shown in the table is our first and only Test submission to SQuAD.

System                       Dev EM  Dev F1  Test EM  Test F1
Leaderboard (Oct 8th, 2018)
  Human                      -       -       82.3     91.2
  #1 Ensemble - nlnet        -       -       86.0     91.7
  #2 Ensemble - QANet        -       -       84.5     90.5
  #1 Single - nlnet          -       -       83.5     90.1
  #2 Single - QANet          -       -       82.5     89.3
Published
  BiDAF+ELMo (Single)        -       85.8    -        -
  R.M. Reader (Single)       78.9    86.3    79.5     86.6
  R.M. Reader (Ensemble)     81.2    87.9    82.3     88.5
Ours
  BERTBASE (Single)          80.8    88.5    -        -
  BERTLARGE (Single)         84.1    90.9    -        -
  BERTLARGE (Ensemble)       85.8    91.8    -        -
  BERTLARGE (Sgl.+TriviaQA)  84.2    91.1    85.1     91.8
  BERTLARGE (Ens.+TriviaQA)  86.2    92.2    87.4     93.2

Table 2: SQuAD results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.

We note that the top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, and are allowed to use any public data when training their systems. We therefore use very modest data augmentation in our submitted system by jointly training on SQuAD and TriviaQA (Joshi et al., 2017).

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. If we fine-tune on only SQuAD (without TriviaQA) we lose 0.1-0.4 F1 and still outperform all existing systems by a wide margin.

4.3 Named Entity Recognition

To evaluate performance on a token tagging task, we fine-tune BERT on the CoNLL 2003 Named Entity Recognition (NER) dataset. This dataset consists of 200k training words which have been annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).

For fine-tuning, we feed the final hidden representation Ti ∈ R^H for each token i into a classification layer over the NER label set. The predictions are not conditioned on the surrounding predictions (i.e., non-autoregressive and no CRF). To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier.


System                          Dev F1  Test F1
ELMo+BiLSTM+CRF                 95.7    92.2
CVT+Multi (Clark et al., 2018)  -       92.6
BERTBASE                        96.4    92.4
BERTLARGE                       96.6    92.8

Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were selected using the Dev set, and the reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

For example:

Jim    Hen    ##son  was  a  puppet  ##eer
I-PER  I-PER  X      O    O  O       X

Here, no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test. A visual representation is also given in Figure 3 (d). A cased WordPiece model is used for NER, whereas an uncased model is used for all other tasks (a minimal label-alignment sketch follows).
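A minimal sketch of this label alignment, where only the first sub-token of each word carries the word's label and the remaining sub-tokens are marked X; `wordpiece_tokenize` is a hypothetical callback standing in for the real WordPiece tokenizer:

```python
def align_ner_labels(words, labels, wordpiece_tokenize):
    """Expand word-level NER labels onto WordPiece sub-tokens."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub_tokens = wordpiece_tokenize(word)
        pieces.extend(sub_tokens)
        # Predict only on the first sub-token; no prediction ("X") for the rest.
        piece_labels.extend([label] + ["X"] * (len(sub_tokens) - 1))
    return pieces, piece_labels

toy_splits = {"Henson": ["Hen", "##son"], "puppeteer": ["puppet", "##eer"]}
pieces, piece_labels = align_ner_labels(
    ["Jim", "Henson", "was", "a", "puppeteer"],
    ["I-PER", "I-PER", "O", "O", "O"],
    wordpiece_tokenize=lambda w: toy_splits.get(w, [w]))
print(list(zip(pieces, piece_labels)))
```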

Results are presented in Table 3. BERTLARGE outperforms the existing SOTA, Cross-View Training with multi-task learning (Clark et al., 2018), by +0.2 on CoNLL-2003 NER Test.

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018).

Given a sentence from a video captioning dataset, the task is to decide among four choices the most plausible continuation. For example:

A girl is going across a set of monkey bars. She

(i) jumps up across the monkey bars.

(ii) struggles onto the bars to grab her head.

(iii) gets to the end and stands on a wooden plank.

(iv) jumps up and does a back flip.

Adapting BERT to the SWAG dataset is similar to the adaptation for GLUE. For each example, we construct four input sequences, each of which contains the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter we introduce is a vector V ∈ R^H, whose dot product with the final aggregate representation Ci ∈ R^H denotes a score for each choice i.

System                  Dev   Test
ESIM+GloVe              51.9  52.7
ESIM+ELMo               59.1  59.2
BERTBASE                81.6  -
BERTLARGE               86.6  86.3
Human (expert)†         -     85.0
Human (5 annotations)†  -     88.0

Table 4: SWAG Dev and Test accuracies. Test results were scored against the hidden labels by the SWAG authors. †Human performance is measured with 100 samples, as reported in the SWAG paper.

The probability distribution is the softmax over the four choices (a minimal scoring sketch follows the equation):

P_i = e^{V·C_i} / Σ_{j=1}^{4} e^{V·C_j}
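A minimal numpy sketch of this choice scoring, with C holding the four aggregate [CLS] representations and V the task-specific vector introduced above:

```python
import numpy as np

def choice_probabilities(C, V):
    """C: (4, H) aggregate representations C_i of the four candidate sequences;
    V: (H,) task-specific vector. Returns the softmax over the four choices."""
    scores = C @ V                  # dot product V . C_i for each choice i
    scores = scores - scores.max()  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

C = np.random.randn(4, 768); V = np.random.randn(768)
print(choice_probabilities(C, V))   # four probabilities summing to 1
```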

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by +27.1%.

5 Ablation Studies

Although we have demonstrated extremely strong empirical results, the results presented so far have not isolated the specific contributions from each aspect of the BERT framework. In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance.

5.1 Effect of Pre-training Tasks

One of our core claims is that the deep bidirectionality of BERT, which is enabled by masked LM pre-training, is the single most important improvement of BERT compared to previous work. To give evidence for this claim, we evaluate two new models which use the exact same pre-training data, fine-tuning scheme and Transformer hyperparameters as BERTBASE:

1. No NSP: A model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

2. LTR & No NSP: A model which is trained using a Left-to-Right (LTR) LM, rather than an MLM. In this case, we predict every input word and do not apply any masking. The left-only constraint was also applied at fine-tuning, because we found it is always worse to pre-train with left-only context and fine-tune with bidirectional context. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.

Results are presented in Table 5. We first examine the impact brought by the NSP task. We can see that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD. These results demonstrate that our pre-training method is critical in obtaining the strong empirical results presented previously.

Next, we evaluate the impact of training bidirectional representations by comparing “No NSP” to “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with extremely large drops on MRPC and SQuAD. For SQuAD it is intuitively clear that an LTR model will perform very poorly at span and token prediction, since the token-level hidden states have no right-side context. For MRPC it is unclear whether the poor performance is due to the small data size or the nature of the task, but we found this poor performance to be consistent across a full hyperparameter sweep with many random restarts.

In order to make a good faith attempt at strengthening the LTR system, we tried adding a randomly initialized BiLSTM on top of it for fine-tuning. This does significantly improve results on SQuAD, but the results are still far worse than the pre-trained bidirectional models.

Dev Set
Tasks          MNLI-m  QNLI   MRPC   SST-2  SQuAD
               (Acc)   (Acc)  (Acc)  (Acc)  (F1)
BERTBASE       84.4    88.4   86.7   92.7   88.5
No NSP         83.9    84.9   86.5   92.6   87.9
LTR & No NSP   82.1    84.3   77.5   92.1   77.8
  + BiLSTM     82.1    84.1   75.7   91.6   84.9

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR & No NSP” model during fine-tuning.

It also hurts performance on all four GLUE tasks.

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) it is strictly less powerful than a deep bidirectional model, since a deep bidirectional model could choose to use either left or right context.

5.2 Effect of Model Size

In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously.

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters.

Hyperparams                Dev Set Accuracy
#L   #H    #A   LM (ppl)   MNLI-m  MRPC  SST-2
3    768   12   5.84       77.9    79.8  88.4
6    768   3    5.24       80.6    82.2  90.7
6    768   12   4.68       81.9    84.8  91.3
12   768   12   3.99       84.4    86.7  92.9
12   1024  16   3.54       85.7    86.9  93.3
24   1024  16   3.23       86.6    87.8  93.7

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.



It has been known for many years that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.

5.3 Effect of Number of Training Steps

Figure 4 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for k steps. This allows us to answer the following questions:

1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy? Answer: Yes, BERTBASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps.

2. Question: Does MLM pre-training converge slower than LTR pre-training, since only 15% of words are predicted in each batch rather than every word? Answer: The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately.

[Figure 4 plot omitted: MNLI Dev Accuracy (y-axis, roughly 76–84) versus Pre-training Steps in Thousands (x-axis, 200–1,000) for BERTBASE (Masked LM) and BERTBASE (Left-to-Right).]

Figure 4: Ablation over number of training steps. This shows the MNLI accuracy after fine-tuning, starting from model parameters that have been pre-trained for k steps. The x-axis is the value of k.

5.4 Feature-based Approach with BERT

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages. First, not all NLP tasks can easily be represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to being able to pre-compute an expensive representation of the training data once and then run many experiments with less expensive models on top of this representation.

In this section we evaluate how well BERT performs in the feature-based approach by generating ELMo-like pre-trained contextual representations on the CoNLL-2003 NER task. To do this, we use the same input representation as in Section 4.3, but use the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

Results are shown in Table 7. The best performing method is to concatenate the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model (a minimal feature-extraction sketch follows). This demonstrates that BERT is effective for both the fine-tuning and feature-based approaches.
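A minimal numpy sketch of building those fixed features by concatenating the top four hidden layers of the frozen Transformer (the layer activations here are random stand-ins):

```python
import numpy as np

def concat_top_four_layers(layer_activations):
    """layer_activations: list of per-layer arrays of shape (seq_len, 768) from the
    frozen pre-trained Transformer. Returns (seq_len, 4*768) features, which would
    then feed the randomly initialized two-layer 768-dimensional BiLSTM and classifier."""
    return np.concatenate(layer_activations[-4:], axis=-1)

layers = [np.random.randn(128, 768) for _ in range(12)]   # stand-in activations
features = concat_top_four_layers(layers)                 # shape (128, 3072)
```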

Layers                    Dev F1
Finetune All              96.4
First Layer (Embeddings)  91.0
Second-to-Last Hidden     95.6
Last Hidden               94.9
Sum Last Four Hidden      95.9
Concat Last Four Hidden   96.1
Sum All 12 Layers         95.5

Table 7: Ablation using BERT with a feature-based approach on CoNLL-2003 NER. The activations from the specified layers are combined and fed into a two-layer BiLSTM, without backpropagation to BERT.


6 Conclusion

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from very deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

While the empirical results are strong, in some cases surpassing human performance, important future work is to investigate the linguistic phenomena that may or may not be captured by BERT.

References

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.

Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC. NIST.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP. Association for Computational Linguistics.

Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.

Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. 2018. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.


Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

A. Warstadt, A. Singh, and S. R. Bowman. 2018. Corpus of linguistic acceptability.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.