
Policy Shaping and Generalized Update Equations for Semantic Parsing from Denotations

Dipendra Misra⋆, Ming-Wei Chang†, Xiaodong He♦, Wen-tau Yih‡
⋆Cornell University, †Google AI Language, ♦JD AI Research

‡Allen Institute for Artificial Intelligence
[email protected], [email protected], [email protected], [email protected]

Abstract

Semantic parsing from denotations faces two key challenges in model training: (1) given only the denotations (e.g., answers), search for good candidate semantic parses, and (2) choose the best model update algorithm. We propose effective and general solutions to each of them. Using policy shaping, we bias the search procedure towards semantic parses that are more compatible with the text, which provide better supervision signals for training. In addition, we propose an update equation that generalizes three different families of learning algorithms, which enables fast model exploration. When experimented on a recently proposed sequential question answering dataset, our framework leads to a new state-of-the-art model that outperforms previous work by 5.0% absolute on exact match accuracy.

1 Introduction

Semantic parsing from denotations (SpFD) is the problem of mapping text to executable formal representations (or programs) in a situated environment and executing them to generate denotations (or answers), in the absence of access to correct representations. Several problems have been handled within this framework, including question answering (Berant et al., 2013; Iyyer et al., 2017) and instructions for robots (Artzi and Zettlemoyer, 2013; Misra et al., 2015).

Consider the example in Figure 1. Given the question and a table environment, a semantic parser maps the question to an executable program, in this case a SQL query, and then executes the query on the environment to generate the answer England. In the SpFD setting, the training data does not contain the correct programs. Thus, the existing learning approaches for SpFD perform two steps for every training example: a search step that explores the space of programs and finds suitable candidates, and an update step that uses these programs to update the model.

Question: what nation scored the most points
Environment:
Index | Name               | Nation  | Points | Games | Pts/game
1     | Karen Andrew       | England | 44     | 5     | 8.8
2     | Daniella Waterman  | England | 40     | 5     | 8
3     | Christelle Le Duff | France  | 33     | 5     | 6.6
4     | Charlotte Barras   | England | 30     | 5     | 6
5     | Naomi Thomas       | Wales   | 25     | 5     | 5
Program: Select Nation Where Points is Maximum
Answer: England

Figure 1: An example of semantic parsing from denotations. Given the table environment, map the question to an executable program that evaluates to the answer.

Figure 2 shows the two-step training procedure for the above example.

In this paper, we address two key challenges in model training for SpFD by proposing a novel learning framework, improving both the search and update steps. The first challenge, the existence of spurious programs, lies in the search step. More specifically, while the success of the search step relies on its ability to find programs that are semantically correct, we can only verify whether a program generates the correct answer, given that no gold programs are presented. The search step is complicated by spurious programs, which happen to evaluate to the correct answer but do not accurately represent the meaning of the natural language question. For example, for the environment in Figure 1, the program Select Nation Where Name = Karen Andrew is spurious. Selecting spurious programs as positive examples can greatly affect the performance of semantic parsers, as these programs generally do not generalize to unseen questions and environments.

Question: what nation scored the most points
Environment: (same table as in Figure 1)
Candidate programs (search step):
  Select Nation Where Pts/game is Maximum
  Select Nation Where Index is Minimum
  Select Nation Where Points = 44
  Select Nation Where Points is Maximum
  Select Nation Where Name = Karen Andrew
Answer: England
Update step: Maximum Marginal Likelihood, Policy Gradient, Off-Policy Gradient, Margin Methods

Figure 2: An example of semantic parsing from denotations. Given the question and the table environment, there are several programs which are spurious.

The second challenge, choosing a learning algorithm, lies in the update step.

Because of the unique indirect supervision setting of SpFD, the quality of the learned semantic parser is dictated by the choice of how to update the model parameters, often determined empirically. As a result, several families of learning methods, including maximum marginal likelihood, reinforcement learning, and margin-based methods, have been used. How to effectively explore different model choices could be crucial in practice.

Our contributions in this work are twofold. To address the first challenge, we propose a policy shaping (Griffith et al., 2013) method that incorporates simple, lightweight domain knowledge, such as a small set of lexical pairs of tokens in the question and program, in the form of a critique policy (§3). This helps bias the search towards the correct program, an important step to improve supervision signals, which benefits learning regardless of the choice of algorithm. To address the second challenge, we prove that the parameter update steps in several algorithms are similar and can be viewed as special cases of a generalized update equation (§4). The equation contains two variable terms that govern the update behavior. Changing these two terms effectively defines an infinite class of learning algorithms, where different values lead to significantly different results. We study this effect and propose a novel learning framework that improves over existing methods.

We evaluate our methods using the sequential question answering (SQA) dataset (Iyyer et al., 2017), and show that our proposed improvements to the search and update steps consistently enhance existing approaches. The proposed algorithm achieves a new state of the art and outperforms existing parsers by 5.0%.

2 Background

We give a formal problem definition of the semantic parsing task, followed by the general learning framework for solving it.

2.1 The Semantic Parsing Task

The problem discussed in this paper can be formally defined as follows. Let X be the set of all possible questions, Y programs (e.g., SQL-like queries), T tables (i.e., the structured data in this work) and Z answers. We further assume access to an executor Φ : Y × T → Z, that given a program y ∈ Y and a table t ∈ T, generates an answer Φ(y, t) ∈ Z. We assume that the executor and all tables are deterministic and that the executor can be called as many times as needed. To facilitate discussion in the following sections, we define an environment function e_t : Y → Z, obtained by applying the executor to the program, i.e., e_t(y) = Φ(y, t).
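To make the notation above concrete, here is a minimal Python sketch of the executor Φ and the environment function e_t. The types and the toy executor (which handles only one query shape) are illustrative assumptions, not the actual SQA execution engine.

    # Minimal sketch of the task abstraction: a Table is the environment, a Program is a
    # SQL-like query, and execute() plays the role of the executor Phi : Y x T -> Z.
    from typing import Callable, Dict, List

    Table = List[Dict[str, str]]   # rows, e.g. {"Name": "Karen Andrew", "Nation": "England", "Points": "44"}
    Program = str                  # e.g. "Select Nation Where Points is Maximum"
    Answer = List[str]             # a denotation is a list of cell values

    def execute(program: Program, table: Table) -> Answer:
        """Toy deterministic executor handling only 'Select <col> Where <col2> is Maximum'."""
        parts = program.split()
        target_col, cond_col = parts[1], parts[3]
        best_row = max(table, key=lambda row: float(row[cond_col]))
        return [best_row[target_col]]

    def make_environment(table: Table) -> Callable[[Program], Answer]:
        """The environment function e_t(y) = Phi(y, t), obtained by fixing the table t."""
        return lambda program: execute(program, table)

For the table in Figure 1, make_environment(table)("Select Nation Where Points is Maximum") would return ["England"].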

Given a question x and an environment e_t, our aim is to generate a program y* ∈ Y and then execute it to produce the answer e_t(y*). Assume that for any y ∈ Y, the score of y being a correct program for x is score_θ(y, x, t), parameterized by θ. The inference task is thus:

    y^* = \arg\max_{y \in Y} score_\theta(y, x, t)    (1)

As the size of Y is exponential in the length of the program, a generic search procedure is typically employed for Eq. (1), as efficient dynamic algorithms typically do not exist. These search procedures generally maintain a beam of program states sorted according to some scoring function, where each program state represents an incomplete program. The search then generates a new program state from an existing state by performing an action. Each action adds a set of tokens (e.g., Nation) and keywords (e.g., Select) to a program state.

For example, in order to generate the program in Figure 1, the DynSP parser (Iyyer et al., 2017) will take the first action as adding the SQL expression Select Nation. Notice that score_θ can be used in either probabilistic or non-probabilistic models. For probabilistic models, we assume that it is a Boltzmann policy, meaning that p_θ(y | x, t) ∝ exp{score_θ(y, x, t)}.
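The beam search sketched below illustrates the search procedure described above: it maintains a beam of partial-program states and expands them with actions. The expand and score_state callbacks are generic stand-ins for a parser such as DynSP, not the authors' implementation.

    # Illustrative beam search over partial-program states. `expand` enumerates the actions
    # applicable to a state and `score_state` scores a (partial) program with score_theta.
    from typing import Callable, List, Tuple

    State = Tuple[str, ...]  # a partial program as a sequence of added expressions

    def beam_search(score_state: Callable[[State], float],
                    expand: Callable[[State], List[State]],
                    beam_size: int = 10,
                    max_steps: int = 5) -> List[State]:
        """Keep the `beam_size` highest-scoring program states at every step."""
        beam: List[State] = [()]          # start from the empty program
        finished: List[State] = []
        for _ in range(max_steps):
            candidates = [s for state in beam for s in expand(state)]
            if not candidates:
                break
            candidates.sort(key=score_state, reverse=True)
            beam = candidates[:beam_size]
            finished.extend(beam)         # a full parser would track completed programs separately
        return sorted(finished, key=score_state, reverse=True)

Sampling from a Boltzmann policy instead of taking the top-scoring states amounts to normalizing exp(score) over the candidates at each step.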

2.2 Learning

Learning a semantic parser is equivalent to learning the parameters θ in the scoring function, which is a structured learning problem due to the large, structured output space Y. Structured learning algorithms generally consist of two major components: search and update. When the gold programs are available during training, the search procedure finds a set of high-scoring incorrect programs. These programs are used by the update step to derive a loss for updating parameters. For example, these programs are used for approximating the partition function in the maximum-likelihood objective (Liang et al., 2011) and for finding the set of programs causing margin violations in margin-based methods (Daumé III and Marcu, 2005). Depending on the exact algorithm being used, these two components are not necessarily separated into isolated steps. For instance, parameters can be updated in the middle of search (e.g., Huang et al., 2012).

For learning semantic parsers from denotations, where we assume only answers are available in a training set {(x_i, t_i, z_i)}_{i=1}^N of N examples, the basic construction of the learning algorithms remains the same. However, the problem that search needs to handle in SpFD is more challenging. In addition to finding a set of high-scoring incorrect programs, the search procedure also needs to guess the correct program(s) evaluating to the gold answer z_i. This problem is further complicated by the presence of spurious programs, which generate the correct answer but are semantically incompatible with the question. For example, although all programs in Figure 2 evaluate to the same answer, only one of them is correct. The issue of spurious programs also affects the design of the model update. For instance, maximum marginal likelihood methods treat all the programs that evaluate to the gold answer equally, while maximum margin reward networks use the model score to break ties and pick one of the programs as the correct reference.

3 Addressing Spurious Programs: Policy Shaping

Given a training example (x, t, z), the aim of the search step is to find a set K(x, t, z) of programs consisting of correct programs that evaluate to z and high-scoring incorrect programs. The search step should avoid picking up spurious programs for learning since such programs typically do not generalize. For example, in Figure 2, the spurious program Select Nation Where Index is Minimum will evaluate to an incorrect answer if the indices of the first two rows are swapped¹. This problem is challenging since, among the programs that evaluate to the correct answer, most of them are spurious.

The search step can be viewed as following an exploration policy b_θ(y|x, t, z) to explore the set of programs Y. This exploration is often performed by beam search and, at each step, we either sample from b_θ or take the top scoring programs. The set K(x, t, z) is then used by the update step for the parameter update. Most search strategies use an exploration policy which is based on the score function, for example b_θ(y|x, t, z) ∝ exp{score_θ(y, x, t)}. However, this approach can suffer from a divergence phenomenon whereby the score of spurious programs picked up by the search in the first epoch increases, making it more likely for the search to pick them up in the future. Such divergence issues are common with latent-variable learning and often require careful initialization to overcome (Rose, 1998). Unfortunately, such initialization schemes are not applicable for deep neural networks, which form the model of most successful semantic parsers today (Jia and Liang, 2016; Misra and Artzi, 2016; Iyyer et al., 2017). Prior work, such as ε-greedy exploration (Guu et al., 2017), has reduced the severity of this problem by introducing random noise in the search procedure to avoid saturating the search on high-scoring spurious programs. However, random noise need not bias the search towards the correct program(s). In this paper, we introduce a simple policy-shaping method to guide the search. This approach allows incorporating prior knowledge in the exploration policy and can bias the search away from spurious programs.

¹This transformation preserves the answer of the question.

Algorithm 1 Learning a semantic parser from denotations using generalized updates.

Input: Training set {(x_i, t_i, z_i)}_{i=1}^N (see Section 2), learning rate μ and stopping epoch T (see Section 4).
Definitions: score_θ(y, x, t) is a semantic parsing model parameterized by θ. p_s(y | x, t) is the policy used for exploration and search(θ, x, t, z, p_s) generates candidate programs for updating parameters (see Section 3). Δ is the generalized update (see Section 4).
Output: Model parameters θ.
1: » Iterate over the training data.
2: for t = 1 to T, i = 1 to N do
3:     » Find candidate programs using the shaped policy.
4:     K = search(θ, x_i, t_i, z_i, p_s)
5:     » Compute generalized gradient updates.
6:     θ = θ + μ Δ(K)
7: return θ
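For concreteness, the following is a minimal Python rendering of Algorithm 1. The search and generalized_update callables stand in for the components defined in Sections 3 and 4, and the flat parameter vector is an assumption made for illustration.

    # Minimal sketch of Algorithm 1: alternate a search step (with the shaped policy) and a
    # generalized update step for every training example.
    def train(examples, theta, shaped_policy, search, generalized_update,
              learning_rate=0.1, num_epochs=30):
        """examples: list of (question x, table t, answer z) triples; theta: parameter vector."""
        for _ in range(num_epochs):                     # stopping epoch T
            for x, t, z in examples:                    # iterate over the training data
                # Search step: candidate programs K found using the shaped policy p_s.
                K = search(theta, x, t, z, shaped_policy)
                if not K:
                    continue
                # Update step: generalized gradient update Delta(K) from Eq. (9).
                theta = theta + learning_rate * generalized_update(theta, K, x, t, z)
        return theta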

Policy Shaping  Policy shaping is a method to introduce prior knowledge into a policy (Griffith et al., 2013). Formally, let the current behavior policy be b_θ(y|x, t, z) and a predefined critique policy, the prior knowledge, be p_c(y|x, t). Policy shaping defines a new shaped behavior policy p_b(y|x, t) given by:

    p_b(y|x, t) = \frac{b_\theta(y|x, t, z)\, p_c(y|x, t)}{\sum_{y' \in Y} b_\theta(y'|x, t, z)\, p_c(y'|x, t)}.    (2)

Using the shaped policy for exploration biases the search towards the critique policy's preference. We next describe a simple critique policy that we use in this paper.

Lexical Policy Shaping  We qualitatively observed that correct programs often contain tokens which are also present in the question. For example, the correct program in Figure 2 contains the token Points, which is also present in the question. We therefore define a simple surface-form similarity feature match(x, y) that computes the fraction of non-keyword tokens in the program y that are also present in the question x.

However, surface-form similarity is often not enough. For example, both the first and fourth programs in Figure 2 contain the token Points, but only the fourth program is correct. Therefore, we also use a simple co-occurrence feature that triggers on frequently co-occurring pairs of tokens in the program and instruction. For example, the token most is highly likely to co-occur with a correct program containing the keyword Max. This happens for the example in Figure 2. Similarly, the token not may co-occur with the keyword NotEqual. We assume access to a lexicon Λ = {(w_j, ω_j)}_{j=1}^k containing k lexical pairs of tokens and keywords. Each lexical pair (w, ω) maps the token w in a text to a keyword ω in a program. For a given program y and question x, we define a co-occurrence score as

    co_occur(y, x) = \sum_{(w, \omega) \in \Lambda} \mathbb{1}\{w \in x \wedge \omega \in y\}.

We define the critique score critique(y, x) as the sum of the match and co_occur scores. The critique policy is given by:

    p_c(y|x, t) ∝ \exp(\eta \cdot critique(y, x)),    (3)

where η is a single scalar hyper-parameter denoting the confidence in the critique policy.
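A small sketch of how the critique policy and the shaped exploration distribution of Eqs. (2) and (3) could be computed is given below. The keyword set, lexicon, and whitespace tokenization are simplified placeholders rather than the exact resources used in the experiments, and the normalization is done over the candidate programs in the beam rather than all of Y, as is done in practice.

    # Sketch of the lexical critique policy (Eq. 3) and policy shaping (Eq. 2).
    import math

    SQL_KEYWORDS = {"select", "where", "is", "maximum", "minimum", "=", "!=", ">", "<"}
    LEXICON = [("most", "maximum"), ("highest", "maximum"), ("least", "minimum"), ("not", "!=")]

    def match(question: str, program: str) -> float:
        """Fraction of non-keyword program tokens that also appear in the question."""
        q_tokens = set(question.lower().split())
        p_tokens = [tok for tok in program.lower().split() if tok not in SQL_KEYWORDS]
        return sum(tok in q_tokens for tok in p_tokens) / len(p_tokens) if p_tokens else 0.0

    def co_occur(question: str, program: str) -> float:
        """Number of lexicon pairs (w, omega) with w in the question and omega in the program."""
        q_tokens = set(question.lower().split())
        p_tokens = set(program.lower().split())
        return float(sum(1 for w, om in LEXICON if w in q_tokens and om in p_tokens))

    def critique_score(question: str, program: str) -> float:
        return match(question, program) + co_occur(question, program)

    def shaped_policy(behavior_probs, programs, question, eta=5.0):
        """Eq. (2): renormalize b_theta(y|x,t,z) * p_c(y|x,t) over the candidate programs."""
        weights = [b * math.exp(eta * critique_score(question, y))
                   for b, y in zip(behavior_probs, programs)]
        total = sum(weights)
        return [w / total for w in weights] if total > 0 else [1.0 / len(weights)] * len(weights)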

4 Addressing Update Strategy Selection: Generalized Update Equation

Given the set of programs generated by the search step, one can use many objectives to update the parameters. For example, previous work has utilized maximum marginal likelihood (Krishnamurthy et al., 2017; Guu et al., 2017), reinforcement learning (Zhong et al., 2017; Guu et al., 2017) and margin-based methods (Iyyer et al., 2017). It can be difficult to choose the most suitable algorithm from these options.

In this section, we propose a principled and general update equation such that previous update algorithms can be considered special cases of this equation. Having a general update is important for the following reasons. First, it allows us to understand existing algorithms better by examining their basic properties. Second, the generalized update equation also makes it easy to implement and experiment with various different algorithms. Moreover, it provides a framework that enables the development of new variations or extensions of existing learning methods.

In the following, we describe how the commonly used algorithms are in fact very similar – their update rules can all be viewed as special cases of the proposed generalized update equation. Algorithm 1 shows the meta-learning framework. For every training example, we first find a set of candidates using an exploration policy (line 4). We use the program candidates to update the parameters (line 6).

4.1 Commonly Used Learning Algorithms

We briefly describe three algorithms: maximum marginalized likelihood, policy gradient and maximum margin reward.

Maximum Marginalized Likelihood  The maximum marginalized likelihood method maximizes the log-likelihood of the training data by marginalizing over the set of programs:

    J_{MML} = \log p(z_i|x_i, t_i) = \log \sum_{y \in Y} p(z_i|y, t_i)\, p(y|x_i, t_i)    (4)

Because an answer is deterministically computed given a program and a table, we define p(z | y, t) as 1 or 0 depending on whether y evaluates to z given t. Let Gen(z, t) ⊆ Y be the set of compatible programs that evaluate to z given the table t. The objective can then be expressed as:

    J_{MML} = \log \sum_{y \in Gen(z_i, t_i)} p(y|x_i, t_i)    (5)

In practice, the summation over Gen(·) is approximated by only using the compatible programs in the set K generated by the search step.

Policy Gradient Methods  Most reinforcement learning approaches for semantic parsing assume access to a reward function R : Y × X × Z → R, giving a scalar reward R(y, z) for a given program y and the correct answer z.² We can further assume without loss of generality that the reward is always in [0, 1]. Reinforcement learning approaches maximize the expected reward J_{RL}:

    J_{RL} = \sum_{y \in Y} p(y|x_i, t_i)\, R(y, z_i)    (6)

J_{RL} is hard to approximate using numerical integration since the reward for all programs may not be known a priori. Policy gradient methods solve this by approximating the derivative using a sample from the policy. When the search space is large, the policy may fail to sample a correct program, which can greatly slow down the learning. Therefore, off-policy methods are sometimes introduced to bias the sampling towards high-reward-yielding programs. In those methods, an additional exploration policy u(y|x_i, t_i, z_i) is used to improve sampling. Importance weights are used to make the gradient unbiased (see Appendix for derivation).

²This is essentially a contextual bandit setting. Guu et al. (2017) also used this setting. A general reinforcement learning setting requires taking a sequence of actions and receiving a reward for each action. For example, a program can be viewed as a sequence of parsing actions, where each action can get a reward. We do not consider the general setting here.

Maximum Margin Reward  For every training example (x_i, t_i, z_i), the maximum margin reward method finds the highest scoring program y* that evaluates to z_i, as the reference program, from the set K of programs generated by the search. With a margin function δ : Y × Y × Z → R and reference program y*, the set of programs V that violate the margin constraint can thus be defined as:

    V = \{y' \mid y' \in Y \text{ and } score_\theta(y^*, x, t) \le score_\theta(y', x, t) + \delta(y^*, y', z)\},    (7)

where δ(y*, y', z) = R(y*, z) − R(y', z). Similarly, the program ȳ that most violates the constraint can be written as:

    \bar{y} = \arg\max_{y' \in Y} \{score_\theta(y', x, t) + \delta(y^*, y', z) - score_\theta(y^*, x, t)\}    (8)

The most-violation margin objective (negative margin loss) is thus defined as:

    J_{MMR} = -\max\{0,\; score_\theta(\bar{y}, x_i, t_i) - score_\theta(y^*, x_i, t_i) + \delta(y^*, \bar{y}, z_i)\}

Unlike the previous two learning algorithms, margin methods only update the score of the reference program and the program that violates the margin.
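As a concrete sketch of how the reference and most violating programs can be picked out of the candidate set K (the restriction to K mirrors what is done in practice), consider the following; the score and reward callables are assumed given, and a program is treated as correct when its reward equals 1.

    # Sketch of the maximum margin reward selection (Eqs. 7-8) over the candidate set K.
    def mmr_selection(K, score, reward):
        """Return (reference program y*, most violating program y_bar), or (None, None)."""
        correct = [y for y in K if reward(y) == 1.0]      # programs that evaluate to the gold answer
        if not correct:
            return None, None
        y_star = max(correct, key=score)                  # reference: highest-scoring correct program
        # Most violating program: argmax_y' of score(y') + delta(y*, y') - score(y*); since
        # score(y*) and R(y*) are constants, this reduces to argmax_y' of score(y') - R(y').
        y_bar = max(K, key=lambda y: score(y) - reward(y))
        return y_star, y_bar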

4.2 Generalized Update Equation

Although the algorithms described in §4.1 seem very different on the surface, the gradients of their loss functions can in fact be described in the same generalized form, given in Eq. (9).³ In addition to the gradient of the model scoring function, this equation has two variable terms, w(·) and q(·). We call the first term w(y, x, t, z) the intensity, which is a positive scalar value, and the second term q(y|x, t) the competing distribution, which is a probability distribution over programs. Varying them makes the equation equivalent to the update rules of the algorithms we discussed, as shown in Table 1. We also consider the meritocratic update policy, which uses a hyperparameter β to sharpen or smooth the intensity of maximum marginal likelihood (Guu et al., 2017).

Intuitively, w(y, x, t, z) defines the positive part of the update equation, which defines how aggressively the update favors program y. Likewise, q(y|x, t) defines the negative part of the learning algorithm, namely how aggressively the update penalizes the members of the program set.

³See Appendix for the detailed derivation.

Generalized Update Equation:

    \Delta(K) = \sum_{y \in K} w(y, x, t, z) \Big( \nabla_\theta score_\theta(y, x, t) - \sum_{y' \in Y} q(y'|x, t)\, \nabla_\theta score_\theta(y', x, t) \Big)    (9)

Learning Algorithm | Intensity w(y, x, t, z) | Competing Distribution q(y|x, t)
Maximum Marginal Likelihood | p(z|y)p(y|x) / Σ_{y'} p(z|y')p(y'|x) | p(y|x)
Meritocratic(β) | (p(z|y)p(y|x))^β / Σ_{y'} (p(z|y')p(y'|x))^β | p(y|x)
REINFORCE | 1{y = ŷ} R(y, z) | p(y|x)
Off-Policy Policy Gradient | 1{y = ŷ} R(y, z) p(y|x)/u(y|x, z) | p(y|x)
Maximum Margin Reward (MMR) | 1{y = y*} | 1{y = ȳ}
Maximum Margin Avg. Violation Reward (MAVER) | 1{y = y*} | 1/|V| · 1{y ∈ V}

Table 1: Parameter updates for various learning algorithms are special cases of Eq. (9), with different choices of intensity w and competing distribution q. We do not show the dependence upon the table t for brevity. For off-policy policy gradient, u is the exploration policy. For margin methods, y* is the reference program (see §4.1), V is the set of programs that violate the margin constraint (cf. Eq. (7)) and ȳ is the most violating program (cf. Eq. (8)). For REINFORCE, ŷ is sampled from K using p(·), whereas for Off-Policy Policy Gradient, ŷ is sampled using u(·).

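To make the role of the two terms concrete, the sketch below implements Eq. (9) over the candidate set K with the intensity w and competing distribution q passed in as functions, and instantiates the MML and MMR rows of Table 1. Approximating the competing sum and the probabilities on K instead of all of Y follows the practical approximation mentioned earlier; the gradient of the score function is assumed to be available as a vector.

    # Sketch of the generalized update Delta(K) (Eq. 9) with pluggable intensity and
    # competing distribution; `grad_score(y)` returns the gradient vector of score_theta(y, x, t).
    def generalized_update(K, grad_score, w, q):
        competing = sum(q(y2) * grad_score(y2) for y2 in K)
        return sum(w(y) * (grad_score(y) - competing) for y in K)

    def mml_w_and_q(K, prob, correct):
        """Table 1, MML row: intensity proportional to p(y|x) on correct programs; q(y) = p(y|x)."""
        z_mass = sum(prob(y) for y in K if correct(y))
        w = lambda y: prob(y) / z_mass if (correct(y) and z_mass > 0) else 0.0
        return w, prob

    def mmr_w_and_q(y_star, y_bar):
        """Table 1, MMR row: all intensity on the reference y*, all competition on y_bar."""
        w = lambda y: 1.0 if y == y_star else 0.0
        q = lambda y: 1.0 if y == y_bar else 0.0
        return w, q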

The generalized update equation provides a tool for better understanding individual algorithms, and helps shed some light on when a particular method may perform better.

Intensity versus Search Quality  In SpFD, the effectiveness of a learning algorithm is closely related to the quality of the search results, given that the gold program is not available. Intuitively, if the search quality is good, the update algorithm can be aggressive in updating the model parameters. When the search quality is poor, the algorithm should be conservative.

The intensity w(·) is closely related to the aggressiveness of the algorithm. For example, maximum marginal likelihood is less aggressive given that it produces a non-zero intensity over all programs in the program set K that evaluate to the correct answer. The intensity for a particular correct program y is proportional to its probability p(y|x, t). Further, the meritocratic update becomes more aggressive as β becomes larger.

In contrast, REINFORCE and maximum margin reward both have a non-zero intensity only on a single program in K. This value is 1.0 for maximum margin reward, while for reinforcement learning, this value is the reward. Maximum margin reward therefore updates most aggressively in favor of its selection, while maximum marginal likelihood tends to hedge its bet. Therefore, the maximum margin methods should benefit the most when the search quality improves.

Stability  The general equation also allows us to investigate the stability of a model update algorithm. In general, the variance of the update direction can be high, hence less stable, if the model update algorithm has a peaky competing distribution, or if it puts all of its intensity on a single program. For example, REINFORCE only samples one program and puts non-zero intensity only on that program, so it could be unstable depending on the sampling results.

The competing distribution affects the stability of the algorithm. For example, maximum margin reward penalizes only the most violating program and is benign to other incorrect programs. Therefore, the MMR algorithm could be unstable during training.

New Model Update Algorithm  The general equation provides a framework that enables the development of new variations or extensions of existing learning methods. For example, in order to improve the stability of the MMR algorithm, we propose a simple variant of maximum margin reward, which penalizes all violating programs instead of only the most violating one. We call this approach maximum margin average violation reward (MAVER), which is included in Table 1 as well. Given that MAVER effectively considers more negative examples during each update, we expect it to be more stable compared to the MMR algorithm.

5 Experiments

We describe the setup in §5.1 and results in §5.2.

5.1 Setup

Dataset  We use the sequential question answering (SQA) dataset (Iyyer et al., 2017) for our experiments. SQA contains 6,066 sequences and each sequence contains up to 3 questions, with 17,553 questions in total. The data is partitioned into training (83%) and test (17%) splits. We use 4/5 of the original train split as our training set and the remaining 1/5 as the dev set. We evaluate using exact match on the answer. The previous state-of-the-art result on the SQA dataset is 44.7% accuracy, using maximum margin reward learning.

Semantic Parser  Our semantic parser is based on DynSP (Iyyer et al., 2017), which contains a set of SQL actions, such as adding a clause (e.g., Select Column) or adding an operator (e.g., Max). Each action has an associated neural network module that generates the score for the action based on the instruction, the table and the list of past actions. The score of the entire program is given by the sum of the scores of all actions.

We modified DynSP to improve its representational capacity. We refer to the new parser as DynSP++. Most notably, we included new features and introduced two additional parser actions. See Appendix 8.2 for more details. While these improvements help us achieve state-of-the-art results, the majority of the gain comes from the learning contributions described in this paper.

Hyperparameters  For each experiment, we train the model for 30 epochs. We find the optimal stopping epoch by evaluating the model on the dev set. We then train on the train+dev set until the stopping epoch and evaluate the model on the held-out test set. Model parameters are trained using stochastic gradient descent with a learning rate of 0.1. We set the hyperparameter η for policy shaping to 5. All hyperparameters were tuned on the dev set. We use 40 lexical pairs for defining the co_occur score. We used common English superlatives (e.g., highest, most) and comparators (e.g., more, larger) and did not fit the lexical pairs based on the dataset.

Given the model parameters θ, we use the base exploration policy defined in Iyyer et al. (2017). This exploration policy is given by b_θ(y | x, t, z) ∝ exp(λ · R(y, z) + score_θ(y, x, t)). R(y, z) is the reward function of the incomplete program y, given the answer z. We use a reward function R(y, z) given by the Jaccard similarity of the gold answer z and the answer generated by the program y. The value of λ is set to infinity, which is essentially equivalent to sorting the programs based on the reward and using the current model score for tie breaking. Further, we prune all syntactically invalid programs. For more details, we refer the reader to Iyyer et al. (2017).
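The reward used above is the Jaccard similarity between the gold answer and the answer produced by a program, treating both as sets of cell values; a minimal sketch:

    # Jaccard-similarity reward R(y, z): answers are treated as sets of cell values.
    def jaccard_reward(predicted_answer, gold_answer):
        pred, gold = set(predicted_answer), set(gold_answer)
        if not pred or not gold:
            return 0.0
        return len(pred & gold) / len(pred | gold)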

5.2 Results

Table 2 contains the dev and test results when using our algorithm on the SQA dataset. We observe that margin-based methods perform better than maximum likelihood methods and policy gradient in our experiment. Policy shaping in general improves the performance across different algorithms. Our best test results outperform the previous SOTA by 5.0%.

Policy Gradient vs Off-Policy Gradient  REINFORCE, a simple policy gradient method, achieved extremely poor performance. This is likely due to the problem of exploration and having to sample from a large space of programs. This is further corroborated by the much superior performance of off-policy policy gradient methods. Thus, the sampling policy is an important factor to consider for policy gradient methods.

The Effect of Policy Shaping  We observe that the improvement due to policy shaping is 6.0% on the SQA dataset for MAVER and only 1.3% for maximum marginal likelihood. We also observe that as β increases, the improvement due to policy shaping for the meritocratic update increases. This supports our hypothesis that the aggressive updates of margin-based methods are beneficial when the search method is more accurate, as compared to maximum marginal likelihood, which hedges its bet between all programs that evaluate to the right answer.

Stability of MMR  In Section 4, the general update equation helps us point out that MMR could be unstable due to its peaky competing distribution. MAVER was proposed to increase the stability of the algorithm.

Algorithm | Dev w.o. Shaping | Dev w. Shaping | Test w.o. Shaping | Test w. Shaping
Maximum Marginal Likelihood | 33.2 | 32.5 | 31.0 | 32.3
Meritocratic (β = 0) | 27.1 | 28.1 | 31.3 | 30.1
Meritocratic (β = 0.5) | 28.3 | 28.7 | 31.7 | 32.0
Meritocratic (β = ∞) | 39.3 | 41.6 | 41.6 | 45.2
REINFORCE | 10.2 | 11.8 | 2.4 | 4.0
Off-Policy Policy Gradient | 36.6 | 38.6 | 42.6 | 44.1
MMR | 38.4 | 40.7 | 43.2 | 46.9
MAVER | 39.6 | 44.1 | 43.7 | 49.7

Table 2: Experimental results on different model update algorithms, with and without policy shaping.

Intensity w | Competing Distribution q | Dev
MMR | MML | 41.9
Off-Policy Policy Gradient | MMR | 37.0
MMR | MMR | 40.7

Table 3: The dev set results on the new variations of the update algorithms.

To measure stability, we calculate the mean absolute difference of the development set accuracy between successive epochs during training, as it indicates how much an algorithm's performance fluctuates during training. With this metric, we found that the mean difference for MAVER is 0.57%, whereas the mean difference for MMR is 0.9%. This indicates that MAVER is in fact more stable than MMR.
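A sketch of this fluctuation measure, assuming the per-epoch dev accuracies are recorded in a list:

    # Stability metric: mean absolute change in dev accuracy (in %) between successive epochs.
    def mean_epoch_fluctuation(dev_accuracies):
        diffs = [abs(b - a) for a, b in zip(dev_accuracies, dev_accuracies[1:])]
        return sum(diffs) / len(diffs) if diffs else 0.0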

Other variations  We also analyze other possible novel learning algorithms that are made possible by the generalized update equation. Table 3 reports development results using these algorithms. By mixing intensity scalars and competing distributions from different algorithms, we can create new variations of the model update algorithm. In Table 3, we show that by mixing MMR's intensity and MML's competing distribution, we can create an algorithm that outperforms MMR on the development set.

Policy Shaping helps against Spurious Programs  In order to better understand whether policy shaping helps bias the search away from spurious programs, we analyze 100 training examples. We look at the highest scoring program in the beam at the end of training using MAVER. Without policy shaping, we found that 53 programs were spurious, while with policy shaping this number came down to 23. We list a few examples of spurious program errors corrected by policy shaping in Table 4.

Policy Shaping vs Model Shaping  The critique policy contains useful information that can bias the search away from spurious programs. Therefore, one can also consider making the critique policy part of the model. We call this model shaping. We define our model to be the shaped policy and train and test using the new model. Using MAVER updates, we found that the dev accuracy dropped to 37.1%. We conjecture that the strong prior in the critique policy can hinder generalization in model shaping.

6 Related Work

Semantic Parsing from Denotation  Mapping natural language text to formal meaning representation was first studied by Montague (1970). Early work on learning semantic parsers relies on labeled formal representations as the supervision signals (Zettlemoyer and Collins, 2005, 2007; Zelle and Mooney, 1993). However, because getting access to gold formal representations generally requires expensive annotations by an expert, distant supervision approaches, where semantic parsers are learned from denotations only, have become the main learning paradigm (e.g., Clarke et al., 2010; Liang et al., 2011; Artzi and Zettlemoyer, 2013; Berant et al., 2013; Iyyer et al., 2017; Krishnamurthy et al., 2017). Guu et al. (2017) studied the problem of spurious programs and considered adding noise to diversify the search procedure and introduced meritocratic updates.

Reinforcement Learning Algorithms  Reinforcement learning algorithms have been applied to various NLP problems including dialogue (Li et al., 2016), text-based games (Narasimhan et al., 2015), information extraction (Narasimhan et al., 2016), coreference resolution (Clark and Manning, 2016), semantic parsing (Guu et al., 2017) and instruction following (Misra et al., 2017).

Question | without policy shaping | with policy shaping
"of these teams, which had more than 21 losses?" | SELECT Club WHERE Losses = ROW 15 | SELECT Club WHERE Losses > 21
"of the remaining, which earned the most bronze medals?" | SELECT Nation WHERE Rank = ROW 1 | FollowUp WHERE Bronze is Max
"of those competitors from germany, which was not paul sievert?" | SELECT Name WHERE Time (hand) = ROW 3 | FollowUp WHERE Name != ROW 5

Table 4: Training examples and the highest ranked program in the beam search, scored according to the shaped policy, after training with MAVER. Using policy shaping, we can recover from failures due to spurious programs in the search step for these examples.

Guu et al. (2017) show that policy gradient methods underperform maximum marginal likelihood approaches. Our result on the SQA dataset supports their observation. However, we show that using off-policy sampling, policy gradient methods can provide superior performance to maximum marginal likelihood methods.

Margin-based Learning  Margin-based methods have been considered in the context of SVM learning. In the NLP literature, margin-based learning has been applied to parsing (Taskar et al., 2004; McDonald et al., 2005), text classification (Taskar et al., 2003), machine translation (Watanabe et al., 2007) and semantic parsing (Iyyer et al., 2017). Kummerfeld et al. (2015) found that max-margin based methods generally outperform likelihood maximization on a range of tasks. Previous work has studied connections between margin-based methods and likelihood maximization in the supervised learning setting. We show them as special cases of our unified update equation for distant supervision learning. Similar to this work, Lee et al. (2016) also found that, in the context of supervised learning, margin-based algorithms which update all violated examples perform better than those that only update the most violated example.

Latent Variable Modeling  Learning semantic parsers from denotations can be viewed as a latent variable modeling problem, where the program is the latent variable. Probabilistic latent variable models have been studied using the EM algorithm and its variants (Dempster et al., 1977). The graphical model literature has studied latent variable learning for margin-based methods (Yu and Joachims, 2009) and probabilistic models (Quattoni et al., 2007). Samdani et al. (2012) studied various variants of the EM algorithm and showed that all of them are special cases of a unified framework. Our generalized update framework is similar in spirit.

7 Conclusion

In this paper, we propose a general update equation for semantic parsing from denotations and propose a policy shaping method for addressing the spurious program challenge. For the future, we plan to apply the proposed learning framework to more semantic parsing tasks and consider new methods for policy shaping.

8 Acknowledgements

We thank Ryan Benmalek, Alane Suhr, Yoav Artzi, Claire Cardie, Chris Quirk, Michel Galley and members of the Cornell NLP group for their valuable comments. We are also grateful to the Allen Institute for Artificial Intelligence for the computing resource support. This work was initially started when the first author interned at Microsoft Research.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of the Conference on Computational Natural Language Learning.

Hal Daumé III and Daniel Marcu. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 169–176. ACM.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38.

Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Lee Isbell, and Andrea Lockerd Thomaz. 2013. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems.

Kelvin Guu, Panupong Pasupat, Evan Liu, and Percy Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1051–1062. Association for Computational Linguistics.

Liang Huang, Suphan Fayong, and Yang Guo. 2012. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151. Association for Computational Linguistics.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1821–1831. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.

Jonathan K. Kummerfeld, Taylor Berg-Kirkpatrick, and Dan Klein. 2015. An empirical analysis of optimization for max-margin NLP. In EMNLP.

Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2016. Global neural CCG parsing with optimality guarantees. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2366–2376.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the Conference of the Association for Computational Linguistics.

Ryan T. McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In ACL.

Dipendra Misra, John Langford, and Yoav Artzi. 2017. Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. arXiv preprint: https://arxiv.org/abs/1704.08795.

Dipendra K. Misra and Yoav Artzi. 2016. Neural shift-reduce CCG semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kumar Dipendra Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Richard Montague. 1970. English as a formal language.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. 2007. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10).

Kenneth Rose. 1998. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239.

Rajhans Samdani, Ming-Wei Chang, and Dan Roth. 2012. Unified expectation maximization. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 688–698. Association for Computational Linguistics.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-margin Markov networks. In NIPS.

Ben Taskar, Dan Klein, Mike Collins, Daphne Koller, and Christopher Manning. 2004. Max-margin parsing. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In EMNLP-CoNLL.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8.

Chun-Nam John Yu and Thorsten Joachims. 2009. Learning structural SVMs with latent variables. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1169–1176. ACM.

John M. Zelle and Raymond J. Mooney. 1993. Learning semantic grammars with constructive inductive logic programming. In AAAI, pages 817–822.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Appendix

8.1 Deriving the updates of common algorithms

Below we derive the gradients of the various learning algorithms. We assume access to training data {(x_i, t_i, z_i)}_{i=1}^N with N examples. Given an input instruction x and table t, we model the score of a program using a score function score_θ(y, x, t) with parameters θ. When the model is probabilistic, we assume it is a Boltzmann distribution given by p(y | x, t) ∝ exp{score_θ(y, x, t)}.

In our results, we will use:

    \nabla_\theta \log p(y \mid x, t) = \nabla_\theta score_\theta(y, x, t) - \sum_{y' \in Y} p(y' \mid x, t)\, \nabla_\theta score_\theta(y', x, t)    (10)

Maximum Marginal Likelihood  The maximum marginal likelihood objective J_{MML} can be expressed as:

    J_{MML} = \sum_{i=1}^{N} \log \sum_{y \in Gen(z_i, t_i)} p(y \mid x_i, t_i)

where Gen(z, t) is the set of all programs from Y that generate the answer z on table t. Taking the derivative gives us:

    \nabla_\theta J_{MML} = \sum_{i=1}^{N} \nabla_\theta \log \sum_{y \in Gen(z_i, t_i)} p(y \mid x_i, t_i)
                          = \sum_{i=1}^{N} \frac{\sum_{y \in Gen(z_i, t_i)} \nabla_\theta p(y \mid x_i, t_i)}{\sum_{y \in Gen(z_i, t_i)} p(y \mid x_i, t_i)}

Then using Equation 10, we get:

    \nabla_\theta J_{MML} = \sum_{i=1}^{N} \sum_{y \in Gen(z_i, t_i)} w(y, x_i, t_i) \Big( \nabla_\theta score_\theta(y, x_i, t_i) - \sum_{y' \in Y} p(y' \mid x_i, t_i)\, \nabla_\theta score_\theta(y', x_i, t_i) \Big)    (11)

where

    w(y, x, t) = \frac{p(y \mid x, t)}{\sum_{y' \in Gen(z, t)} p(y' \mid x, t)}

Policy Gradient Methods  Reinforcement learning based approaches maximize the expected reward objective:

    J_{RL} = \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i, t_i)\, R(y, z_i)    (12)

We can then compute the derivative of this objective as:

    \nabla_\theta J_{RL} = \sum_{i=1}^{N} \sum_{y \in Y} \nabla_\theta p(y \mid x_i, t_i)\, R(y, z_i)    (13)

The above summation can be expressed as an expectation (Williams, 1992):

    \nabla_\theta J_{RL} = \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i, t_i)\, \nabla_\theta \log p(y \mid x_i, t_i)\, R(y, z_i)    (14)

For every example i, we sample a program y_i from Y using the policy p(\cdot \mid x_i, t_i). In practice this sampling is done over the output programs of the search step.

    \nabla_\theta J_{RL} \approx \sum_{i=1}^{N} \nabla_\theta \log p(y_i \mid x_i, t_i)\, R(y_i, z_i)
                         \approx \sum_{i=1}^{N} R(y_i, z_i) \Big( \nabla_\theta score_\theta(y_i, x_i, t_i) - \sum_{y' \in Y} p(y' \mid x_i, t_i)\, \nabla_\theta score_\theta(y', x_i, t_i) \Big)

where the second step uses the gradient of \log p(\cdot \mid \cdot) from Equation 10.

Off-Policy Policy Gradient Methods  In the off-policy policy gradient method, instead of sampling a program using the current policy p(\cdot \mid \cdot), we use a separate exploration policy u(\cdot \mid \cdot). For the i-th training example, we sample a program y_i from the exploration policy u(\cdot \mid x_i, t_i, z_i). Thus the gradient of the expected reward objective from the previous paragraph can be expressed as:

    \nabla_\theta J_{RL} = \sum_{i=1}^{N} \sum_{y \in Y} \nabla_\theta p(y \mid x_i, t_i)\, R(y, z_i)
                         = \sum_{i=1}^{N} \sum_{y \in Y} u(y \mid x_i, t_i, z_i)\, \frac{p(y \mid x_i, t_i)}{u(y \mid x_i, t_i, z_i)}\, \nabla_\theta \log p(y \mid x_i, t_i)\, R(y, z_i)
                         \approx \sum_{i=1}^{N} \frac{p(y_i \mid x_i, t_i)}{u(y_i \mid x_i, t_i, z_i)}\, \nabla_\theta \log p(y_i \mid x_i, t_i)\, R(y_i, z_i)

using, for every i, y_i ∼ u(\cdot \mid x_i, t_i, z_i). The ratio p(y \mid x, t) / u(y \mid x, t, z) is the importance weight correction. In practice, we sample a program from the output of the search step.

Maximum Margin Reward (MMR)  For the i-th training example, let K(x_i, t_i, z_i) be the set of programs produced by the search step. Then MMR finds the highest scoring program in this set which evaluates to the correct answer. Let this program be y_i. MMR optimizes the parameters to satisfy the following constraints:

    score_\theta(y_i, x_i, t_i) \ge score_\theta(y', x_i, t_i) + \delta(y_i, y', z_i) \quad \forall y' \in Y    (15)

where the margin \delta(y_i, y', z_i) is given by R(y_i, z_i) - R(y', z_i). Let V be the set of violations, given by:

    V = \{ y' \in Y \mid score_\theta(y', x_i, t_i) - score_\theta(y_i, x_i, t_i) + \delta(y_i, y', z_i) > 0 \}.

At each training step, MMR only considers the program which most violates the constraints. When |V| > 0, let \bar{y} be the most violating program, given by:

    \bar{y} = \arg\max_{y' \in Y} \{ score_\theta(y', x_i, t_i) - score_\theta(y_i, x_i, t_i) + R(y_i, z_i) - R(y', z_i) \}
            = \arg\max_{y' \in Y} \{ score_\theta(y', x_i, t_i) - R(y', z_i) \}

Using the most-violation approximation, the objective for MMR can be expressed as the negative of a hinge loss:

    J_{MMR} = -\max\{0,\; score_\theta(\bar{y}, x_i, t_i) - score_\theta(y_i, x_i, t_i) + R(y_i, z_i) - R(\bar{y}, z_i)\}    (16)

Our definition of \bar{y} allows us to write the above objective as:

    J_{MMR} = -\mathbb{1}\{|V| > 0\}\, \{score_\theta(\bar{y}, x_i, t_i) - score_\theta(y_i, x_i, t_i) + R(y_i, z_i) - R(\bar{y}, z_i)\}    (17)

The gradient is then given by:

    \nabla_\theta J_{MMR} = -\mathbb{1}\{|V| > 0\}\, \{\nabla_\theta score_\theta(\bar{y}, x_i, t_i) - \nabla_\theta score_\theta(y_i, x_i, t_i)\}    (18)

Maximum Margin Average Violation Reward (MAVER)  Given a training example, MAVER considers the same constraints and margin as MMR. However, instead of considering only the most violated program, it considers all violations. Formally, for every example (x_i, t_i, z_i) we compute the ideal program y_i as in MMR. We then optimize the average negative hinge-loss error over all violations:

    J_{MAVER} = -\frac{1}{|V|} \sum_{y' \in V} \{score_\theta(y', x_i, t_i) - score_\theta(y_i, x_i, t_i) + R(y_i, z_i) - R(y', z_i)\}    (19)

Taking the derivative, we get:

    \nabla_\theta J_{MAVER} = -\frac{1}{|V|} \sum_{y' \in V} \{\nabla_\theta score_\theta(y', x_i, t_i) - \nabla_\theta score_\theta(y_i, x_i, t_i)\}
                            = \nabla_\theta score_\theta(y_i, x_i, t_i) - \sum_{y' \in V} \frac{1}{|V|} \nabla_\theta score_\theta(y', x_i, t_i)

8.2 Changes to DynSP Parser

We make the following three changes to the DynSP parser to increase its representational power. The new parser is called DynSP++. We describe these three changes below:

1. We add two new actions: disjunction (OR) and follow-up cell (FpCell). The disjunction operation is used to describe multiple conditions together, for example:

   Question: what is the population of USA or China?
   Program: Select Population Where Name = China OR Name = USA

   Follow-up cell is only used for a question which follows another question and whose answer is a single cell in the table. Follow-up cell is used to select values of another column corresponding to this cell.

   Question: and who scored that point?
   Program: Select Name Follow-Up Cell

2. We add surface-form features in the model for columns and cells. These features trigger on token matches between an entity in the table (column name or cell value) and the question. We consider two features: exact match and overlap. Exact match is 1.0 when every token in the entity is present in the question and 0 otherwise. Overlap is 1.0 when at least one token in the entity is present in the question and 0 otherwise. We also consider the related-column features that were considered by Krishnamurthy et al. (2017).

3. We also add recall features, which measure how many tokens in the question that are also present in the table are covered by a given program. To compute this feature, we first compute the set E1 of all tokens in the question that are also present in the table. We then find the set of non-keyword tokens E2 that are present in the program. The recall score is then given by w · |E1 − E2| / |E1|, where w is a learned parameter (see the sketch after this list).
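A sketch of these feature computations is given below; the tokenization and keyword list are simplified stand-ins for the parser internals.

    # Sketch of the DynSP++ surface-form and recall features (Appendix 8.2).
    SQL_KEYWORDS = {"select", "where", "is", "maximum", "minimum", "follow-up", "cell", "or", "=", "!="}

    def exact_match(entity: str, question: str) -> float:
        """1.0 if every token of the table entity (column name or cell value) is in the question."""
        q = set(question.lower().split())
        return 1.0 if all(tok in q for tok in entity.lower().split()) else 0.0

    def overlap(entity: str, question: str) -> float:
        """1.0 if at least one token of the entity is in the question."""
        q = set(question.lower().split())
        return 1.0 if any(tok in q for tok in entity.lower().split()) else 0.0

    def recall_feature(question_tokens, table_tokens, program_tokens, w: float) -> float:
        """w * |E1 - E2| / |E1|: E1 = question tokens also in the table, E2 = non-keyword program tokens."""
        e1 = {tok for tok in question_tokens if tok in table_tokens}
        e2 = {tok for tok in program_tokens if tok not in SQL_KEYWORDS}
        return w * len(e1 - e2) / len(e1) if e1 else 0.0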