
Generating Sentences from Disentangled Syntactic and Semantic Spaces

Yu Bao1∗ Hao Zhou2∗ Shujian Huang1† Lei Li2 Lili Mou3

Olga Vechtomova3 Xin-Yu Dai1 Jiajun Chen1

1 National Key Laboratory for Novel Software Technology, Nanjing University, China
2 ByteDance AI Lab, Beijing, China
3 University of Waterloo, Canada

{baoy,huangsj,dxy,chenjj}@nlp.nju.edu.cn
{zhouhao.nlp,lileilab}@bytedance.com

[email protected], [email protected]

Abstract

Variational auto-encoders (VAEs) are widely used in natural language generation due to the regularization of the latent space. However, generating sentences from the continuous latent space does not explicitly model syntactic information. In this paper, we propose to generate sentences from disentangled syntactic and semantic spaces. Our proposed method explicitly models syntactic information in the VAE's latent space by using the linearized tree sequence, leading to better performance in language generation. Additionally, the advantage of sampling in the disentangled syntactic and semantic latent spaces enables us to perform novel applications, such as unsupervised paraphrase generation and syntax-transfer generation. Experimental results show that our proposed model achieves similar or better performance in various tasks, compared with state-of-the-art related work.‡

1 Introduction

Variational auto-encoders (VAEs, Kingma and Welling, 2014) are widely used in language generation tasks (Serban et al., 2017; Kusner et al., 2017; Semeniuta et al., 2017; Li et al., 2018b). A VAE encodes a sentence into a probabilistic latent space, from which it learns to decode the same sentence. In addition to the traditional reconstruction loss of an autoencoder, VAE employs an extra regularization term, penalizing the Kullback–Leibler (KL) divergence between the encoded posterior distribution and its prior. This property enables us to sample and generate sentences from the continuous latent space.

∗ Equal contributions. † Corresponding author.
‡ We release the implementation and models at https://github.com/baoy-nlp/DSS-VAE

Additionally, we can even manually manipulate the latent space, inspiring various applications such as sentence interpolation (Bowman et al., 2016) and text style transfer (Hu et al., 2017).

However, the continuous latent space of VAE blends syntactic and semantic information together, without modeling the syntax explicitly. We argue that this may not be the best choice in the text generation scenario. Recently, researchers have shown that explicit syntactic modeling improves generation quality in sequence-to-sequence models (Eriguchi et al., 2016; Zhou et al., 2017; Li et al., 2017; Chen et al., 2017). It is straightforward to adopt such an idea in the VAE setting, since a vanilla VAE does not explicitly model syntax. A line of studies (Kusner et al., 2017; Gómez-Bombarelli et al., 2018; Dai et al., 2018) propose to impose context-free grammars (CFGs) as hard constraints on the VAE decoder, so that they can generate syntactically valid outputs of programs, molecules, etc.

However, the above approaches cannot be applied to syntactic modeling in VAE's continuous latent space, and thus we cannot enjoy the two benefits of VAE, namely sampling and manipulation, with respect to the syntax of a sentence.

In this paper, we propose to generate sentences from disentangled syntactic and semantic spaces of VAE (called DSS-VAE). DSS-VAE explicitly models syntax in the continuous latent space of VAE, while retaining the sampling and manipulation benefits. In particular, we introduce two continuous latent variables to capture semantics and syntax, respectively. To separate the semantic and syntactic information from each other, we borrow adversarial approaches from text style-transfer research (Hu et al., 2017; Fu et al., 2018; John et al., 2018), but adapt them to our scenario of syntactic modeling.


We also observe that syntax and semantics are highly interwoven, and therefore further propose an adversarial reconstruction loss to regularize the syntactic and semantic spaces.

Our proposed DSS-VAE has the following advantages:

First, explicit syntactic modeling in VAE's latent space improves the quality of unconditional language generation. Experiments show that, compared with a traditional VAE, DSS-VAE generates more fluent sentences (lower perplexity) while preserving more of the encoded information (higher BLEU scores for reconstruction). Comparisons with a state-of-the-art syntactic language model (Shen et al., 2017) are also included.

Second, the ability to manipulate the syntactic and semantic spaces of DSS-VAE provides a natural way of unsupervised paraphrase generation. If we sample a vector in the syntactic space but perform maximum a posteriori (MAP) inference in the semantic space, we are able to generate a sentence with the same meaning but different syntax. This is known as unsupervised paraphrase generation, as no parallel corpus is needed during training. Experiments show that DSS-VAE outperforms the traditional VAE as well as a state-of-the-art Metropolis-Hastings sampling approach (Miao et al., 2019) on this task.

Additionally, with the disentangled syntactic and semantic latent spaces, we propose an interesting application that transfers the syntax of one sentence to another. Both qualitative and quantitative experimental results show that DSS-VAE can graft the desired syntax onto another sentence under certain circumstances.

2 Related Work

The variational auto-encoder (VAE) was proposed by Kingma and Welling (2014) for image generation. Bowman et al. (2016) successfully applied VAE in the NLP domain, showing that VAE improves recurrent neural network (RNN)-based language modeling (RNN-LM, Mikolov et al., 2010), and that VAE allows sentence sampling and sentence interpolation in the continuous latent space. Later, VAE was widely used in various natural language generation tasks (Gupta et al., 2018; Kusner et al., 2017; Hu et al., 2017; Deriu and Cieliebak, 2018).

Syntactic language modeling, to the best of our knowledge, dates back to Chelba (1997). Charniak (2001) and Clark (2001) propose to utilize a top-down parsing mechanism for language modeling.

Dyer et al. (2016) and Kuncoro et al. (2017) introduce neural networks to this direction. The Parsing-Reading-Predict Network (PRPN, Shen et al., 2017), which reports state-of-the-art results on syntactic language modeling, learns latent syntax by training with a language modeling objective. Different from their work, our approach models syntax in a continuous space, facilitating sampling and manipulation of syntax.

Our work is also related to style-transfer text generation (Fu et al., 2018; Li et al., 2018a; John et al., 2018). In previous work, the style is usually defined by categorical features such as sentiment. We move one step forward, extending their approach to the sequence level and dealing with the more complicated, non-categorical syntactic space. Due to the complexity of syntax, we further design adversarial reconstruction losses to encourage the separation of syntax and semantics.

3 Approach

In this section, we present our proposed DSS-VAE in detail. We first introduce the variational autoencoder in §3.1. Then, we describe the general architecture of DSS-VAE in §3.2, where we explain how we generate sentences from disentangled syntactic and semantic latent spaces and how we disentangle information into the two separate spaces. Model training is discussed in §3.3.

3.1 Variational Autoencoder

A traditional VAE employs a probabilistic latent variable z to encode the information of a sentence x, and then decodes the original x from z. The probability of a sentence x can be computed as:

p(x) = ∫ p(z) p(x|z) dz    (1)

where p(z) is the prior and p(x|z) is given by the decoder. VAE is trained by maximizing the evidence lower bound (ELBO):

log p(x) ≥ ELBO = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ∥ p(z))    (2)
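For illustration, the ELBO in Eqn. (2) for a Gaussian posterior and a standard-normal prior can be sketched in PyTorch as follows. This is a minimal sketch, not the released code: the function signature and the single-sample reconstruction term passed in by the caller are our assumptions.

    import torch

    def vae_elbo(recon_log_prob: torch.Tensor,
                 mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
        # E_q[log p(x|z)] is approximated with one reparameterized sample;
        # the caller passes its log-likelihood in as `recon_log_prob`.
        # KL(N(mu, sigma^2) || N(0, I)) has the closed form below.
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
        return recon_log_prob - kl  # the lower bound to be maximized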

3.2 Proposed Method: DSS-VAE

Our DSS-VAE is built upon the vanilla VAE, but extends Eqn. (1) by adopting two separate latent variables z_sem and z_syn to capture semantic and syntactic information, respectively.


Specifically, we assume that the probability of a sentence x in DSS-VAE can be computed as:

p(x) = ∫ p(z_sem, z_syn) p(x|z_sem, z_syn) dz_sem dz_syn
     = ∫ p(z_sem) p(z_syn) p(x|z_sem, z_syn) dz_sem dz_syn

where p(z_sem) and p(z_syn) are the priors; both are set to independent multivariate Gaussians N(0, I).

Similar to (2), we optimize the evidence lower bound (ELBO) for training:

log p(x) ≥ ELBO = E_{q(z_sem|x) q(z_syn|x)}[log p(x|z_sem, z_syn)]
                  − KL(q(z_sem|x) ∥ p(z_sem)) − KL(q(z_syn|x) ∥ p(z_syn))

where q(z_sem|x) and q(z_syn|x) are the posteriors of the two latent variables. We further assume the variational posterior families q(z_sem|x) and q(z_syn|x) are independent, taking the forms N(µ_sem, σ²_sem) and N(µ_syn, σ²_syn), respectively. We use an RNN to parameterize the posteriors (also called the encoder). Here, µ_sem, σ_sem, µ_syn, and σ_syn are predicted by the encoder network, described as follows.

Encoding  In the encoding phase, we first obtain the sentence representation r_x by an RNN with gated recurrent units (GRUs, Cho et al., 2014); then, r_x is evenly split into two spaces: r_x = [r^sem_x; r^syn_x].

For the semantic encoder, we compute the mean and variance of q(z_sem|x) from r^sem_x as:

[µ_sem; σ_sem] = [W^µ_sem; W^σ_sem] ReLU(W_sem r^sem_x + b_sem)

where the activation function is the rectified linear unit (ReLU, Nair and Hinton, 2010). W^µ_sem, W^σ_sem, W_sem, and b_sem are the parameters of the semantic encoder. Likewise, a syntactic encoder predicts µ_syn and σ_syn for q(z_syn|x) in the same way, with parameters W^µ_syn, W^σ_syn, W_syn, and b_syn.
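A minimal sketch of such a two-space encoder is given below, with the latent dimension of 100 taken from §3.3. The module and variable names, the hidden size, and the GRU configuration are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class TwoSpaceEncoder(nn.Module):
        """Encode a sentence, split r_x evenly into r_x^sem and r_x^syn,
        and map each half to the parameters of a Gaussian posterior."""
        def __init__(self, vocab_size, emb_dim=300, hid_dim=200, latent_dim=100):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
            # one (W, b) plus heads (W^mu, W^sigma) per space, as in the equations
            self.proj = nn.ModuleDict(
                {s: nn.Linear(hid_dim // 2, hid_dim // 2) for s in ("sem", "syn")})
            self.mu = nn.ModuleDict(
                {s: nn.Linear(hid_dim // 2, latent_dim) for s in ("sem", "syn")})
            self.log_var = nn.ModuleDict(
                {s: nn.Linear(hid_dim // 2, latent_dim) for s in ("sem", "syn")})

        def forward(self, tokens):                        # tokens: (batch, seq)
            _, h = self.gru(self.emb(tokens))             # h: (1, batch, hid)
            r_sem, r_syn = h.squeeze(0).chunk(2, dim=-1)  # even split of r_x
            out = {}
            for name, r in (("sem", r_sem), ("syn", r_syn)):
                hidden = torch.relu(self.proj[name](r))
                out[name] = (self.mu[name](hidden), self.log_var[name](hidden))
            return out  # {"sem": (mu, log_var), "syn": (mu, log_var)}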

Figure 1: The parse tree and its linearized tree sequence of the sentence “This is an interesting idea.” The linearized representation is: S NP PRP /NP VP VBZ NP DT JJ NN /NP /VP . /S

Decoding in the Training Phase  We first sample from the posterior distributions using the reparameterization trick (Kingma and Welling, 2014), obtaining sampled semantic and syntactic representations z_sem and z_syn; they are then concatenated as z = [z_sem; z_syn] and fed as the initial state of the decoder for reconstruction.
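The reparameterization trick keeps the sampling step differentiable; a small sketch (building on the hypothetical encoder above):

    import torch

    def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
        # z = mu + sigma * eps is differentiable w.r.t. (mu, sigma)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps

    # training-time decode (decoder omitted): concatenate the two samples
    # z = torch.cat([reparameterize(*post["sem"]),
    #                reparameterize(*post["syn"])], dim=-1)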

Decoding in the Test Phase  The treatment depends on the application. If we would like to synthesize a sentence from scratch, both z_syn and z_sem are sampled from the prior. If we would like to preserve or vary semantics/syntax, maximum a posteriori (MAP) inference or sampling can be applied in the respective spaces. Details are provided in §4.

In the following, we introduce how syntax is modeled in our approach and how syntax and semantics are ensured to be separated.

3.2.1 Modeling Syntax by Predicting the Linearized Tree Sequence

While previous studies have tackled the problem of categorical sentiment modeling in the latent space (Hu et al., 2017; Fu et al., 2018), syntax is much more complicated and not finitely categorical. We propose to adopt the linearized tree sequence to explicitly model syntax in the latent space of VAE.

Figure 1 shows the constituency parse tree of the sentence “This is an interesting idea.” The linearized tree sequence can be obtained by traversing the syntactic tree in a top-down order; if a node is a non-terminal, we add a backtracking node (e.g., /NP) after its child nodes are traversed.
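A small sketch of this linearization, assuming an nltk-style Tree purely for illustration (the paper obtains its parses from ZPar):

    from nltk import Tree

    def linearize(tree):
        """Top-down traversal; emit a backtracking token (/X) after each
        non-terminal's children, as in Figure 1. Words are dropped, keeping
        only POS tags, so the sequence encodes pure syntax."""
        if isinstance(tree, str):                 # a bare word: contributes nothing
            return []
        if len(tree) == 1 and isinstance(tree[0], str):
            return [tree.label()]                 # pre-terminal: emit the POS tag
        seq = [tree.label()]
        for child in tree:
            seq += linearize(child)
        seq.append("/" + tree.label())            # backtracking node
        return seq

    t = Tree.fromstring("(S (NP (PRP This)) (VP (VBZ is) "
                        "(NP (DT an) (JJ interesting) (NN idea))) (. .))")
    print(" ".join(linearize(t)))
    # S NP PRP /NP VP VBZ NP DT JJ NN /NP /VP . /S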

We ensure that z_syn contains syntactic information by predicting the linearized tree sequence. In training, the parse trees of sentences are obtained by the ZPar toolkit1 and serve as the groundtruth training signals; in testing, we do not need external syntactic trees.

1 https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html


Figure 2: Overview of our DSS-VAE. Forward dashed arrows are multi-task losses; backward dashed arrows are adversarial losses.

We build an RNN (independent of the VAE's decoder) to predict such linearized parse trees, where each parsing token is represented by an embedding (similar to a traditional RNN decoder). Notice that a node and its backtracking counterpart, e.g., NP and /NP, have different embeddings.

The linearized tree sequence has achieved promising results in traditional constituency parsing (Vinyals et al., 2015; Liu et al., 2018; Vaswani et al., 2017), which shows its ability to preserve syntactic information. Additionally, the linearized tree sequence works in a sequence-to-sequence fashion, so it can be used to regularize the latent spaces.

3.2.2 Disentangling Syntax and Semantics into Different Latent Spaces

Having solved the problem of syntactic modeling, we now turn to the question: how can we disentangle syntax and semantics from each other?

We are inspired by research in text style transfer and apply auxiliary losses to regularize the latent space (Hu et al., 2017; Fu et al., 2018).

In particular, we adopt the multi-task and adversarial losses of John et al. (2018), but extend them to the sequence level. In §3.2.3, we further propose two adversarial reconstruction losses to discourage the model from reconstructing a sentence from a single subspace.

Multi-Task Loss  Intuitively, a multi-task loss ensures that each space (z_syn or z_sem) captures its respective information.

For the semantic space, we predict the bag-of-words (BoW) distribution of a sentence from z_sem with softmax; the objective is the cross-entropy loss against the groundtruth distribution t, given by:

L^(mul)_sem = − Σ_{w∈V} t_w log p(w|z_sem)    (3)

where p(w|z_sem) is the predicted distribution. BoW has been explored by previous work (Weng et al., 2017; John et al., 2018), showing a good ability to preserve semantics.

For the syntactic space, the multi-task loss trains a model to predict syntax from z_syn. Due to our proposal in §3.2.1, we can build a dedicated RNN predicting the tokens of the linearized parse-tree sequence, whose loss is:

L^(mul)_syn = − Σ_{i=1}^{n} log p(s_i|s_1 ··· s_{i−1}, z_syn)    (4)

where s_i is a token in the linearized parse tree (with a total length of n).
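A sketch of the two multi-task terms in Eqns. (3) and (4); the tensor shapes and the heads producing the logits are our assumptions:

    import torch.nn.functional as F

    def bow_multitask_loss(bow_logits, target_bow):
        # Eqn. (3): cross-entropy between the softmax BoW prediction from
        # z_sem and the groundtruth word distribution t; both (batch, |V|).
        return -(target_bow * F.log_softmax(bow_logits, dim=-1)).sum(-1).mean()

    def tree_multitask_loss(tree_logits, tree_tokens):
        # Eqn. (4): token-level NLL of the linearized parse tree, predicted
        # by a dedicated RNN conditioned on z_syn.
        # tree_logits: (batch, n, |tags|); tree_tokens: (batch, n) gold ids.
        return F.cross_entropy(tree_logits.transpose(1, 2), tree_tokens)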

Adversarial Loss  The adversarial loss is widely used for aligning samples from different distributions. It has various applications, including style transfer (Hu et al., 2017; Fu et al., 2018; John et al., 2018) and domain adaptation (Tzeng et al., 2017). To apply adversarial losses, we add extra model components (known as adversaries) that predict semantic information t_w based on the syntactic space z_syn, and predict syntactic information s_1 ··· s_{n−1} based on the semantic space z_sem. They are denoted by p_adv(w|z_syn) and p_adv(s_i|s_1 ··· s_{i−1}, z_sem).

The training of these adversaries is similar to (3) and (4), except that the gradient only trains the adversaries themselves and does not back-propagate to the VAE.

Then, the VAE is trained to “fool” the adversaries by maximizing their losses, i.e., minimizing the following terms:

L^(adv)_sem = Σ_{w∈V} t_w log p_adv(w|z_syn)    (5)

L^(adv)_syn = Σ_{i=1}^{n} log p_adv(s_i|s_1 ··· s_{i−1}, z_sem)    (6)

In this phase, the adversaries are fixed and their parameters are not updated.
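The two-step scheme for the semantic adversary on z_syn can be sketched as below (the syntactic adversary on z_sem is symmetric). The `adversary` module, which maps a latent vector to BoW logits, is hypothetical:

    import torch.nn.functional as F

    def adversary_step(adversary, z_syn, target_bow, adv_optimizer):
        # Step 1: train the adversary itself. detach() stops gradients from
        # reaching the VAE encoder, so only the adversary is updated.
        logits = adversary(z_syn.detach())
        loss = -(target_bow * F.log_softmax(logits, -1)).sum(-1).mean()
        adv_optimizer.zero_grad()
        loss.backward()
        adv_optimizer.step()

    def fooling_loss(adversary, z_syn, target_bow):
        # Step 2: Eqn. (5). The VAE minimizes this term to "fool" the
        # adversary; the optimizer applying it should update only the VAE's
        # parameters, leaving the (fixed) adversary untouched.
        return (target_bow * F.log_softmax(adversary(z_syn), -1)).sum(-1).mean()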

3.2.3 Adversarial Reconstruction Loss

Our next intuition is that syntax and semantics are more interwoven with each other than other kinds of information, such as style and content.

Suppose, for example, that syntax and semantics have been perfectly separated by the losses in §3.2.2, where z_sem predicts BoW well but does not contain any information about the syntactic tree. Even in this ideal case, the decoder can reconstruct the original sentence from z_sem by simply learning to re-order words (as z_sem does contain BoW). Such word re-ordering knowledge is indeed learnable (Ma et al., 2018), and does not necessarily contain syntactic information. Therefore, the multi-task and adversarial losses for syntax and semantics do not suffice to regularize DSS-VAE.

We now propose an adversarial reconstruction loss to discourage the sentence from being predicted from a single subspace z_syn or z_sem; when combined, however, the two subspaces should provide a holistic view of the entire sentence. Formally, let z_s be a latent variable (z_s = z_syn or z_sem). A decoding adversary is trained to predict the sentence based on z_s, denoted by p_rec(x_i|x_1 ··· x_{i−1}, z_s). Then, the adversarial reconstruction loss is imposed by minimizing

L^(adv)_rec(z_s) = Σ_{i=1}^{M} log p_rec(x_i|x_<i, z_s)    (7)

This adversarial reconstruction loss is applied to both the syntactic and semantic spaces, shown by the black dashed arrows in Figure 2.
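Eqn. (7) can be sketched as below; `rec_adversary` is a hypothetical RNN language model conditioned on a single subspace, and the shapes are assumptions:

    import torch.nn.functional as F

    def adv_reconstruction_loss(rec_adversary, z_s, tokens):
        # `rec_adversary(z_s, prefix)` returns logits of shape (batch, M, |V|).
        logits = rec_adversary(z_s, tokens[:, :-1])
        log_probs = F.log_softmax(logits, dim=-1)
        # sentence log-likelihood under the decoding adversary: the adversary
        # maximizes it, while the VAE minimizes it (Eqn. 7) so that neither
        # z_syn nor z_sem alone suffices for reconstruction.
        gold = tokens[:, 1:].unsqueeze(-1)
        return log_probs.gather(-1, gold).squeeze(-1).sum(-1).mean()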

3.3 Training Details

Overall Training Objective  The overall training loss is a combination of the VAE loss (2), the multi-task and adversarial losses for syntax and semantics (3–6), as well as the adversarial reconstruction losses (7), i.e., minimizing

L = L_vae + L_aux
  = −E_{q(z_sem|x) q(z_syn|x)}[log p(x|z_sem, z_syn)]
    + λ^KL_sem KL(q(z_sem|x) ∥ p(z_sem)) + λ^KL_syn KL(q(z_syn|x) ∥ p(z_syn))
    + λ^mul_sem L^(mul)_sem + λ^adv_sem L^(adv)_sem + λ^rec_sem L^(adv)_rec(z_sem)
    + λ^mul_syn L^(mul)_syn + λ^adv_syn L^(adv)_syn + λ^rec_syn L^(adv)_rec(z_syn)    (8)

where λ^KL_sem, λ^KL_syn, λ^mul_sem, λ^adv_sem, λ^rec_sem, λ^mul_syn, λ^adv_syn, and λ^rec_syn are hyperparameters adjusting the importance of each loss in the overall objective.
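Combining Eqn. (8) is then a weighted sum; the weights below are the PTB values from Appendix A (Table 5), while the dictionary-based API is an illustrative assumption:

    # `terms` maps each loss name to its scalar tensor; "nll" is the
    # reconstruction negative log-likelihood in Eqn. (8).
    LAMBDA = {"kl_sem": 1.0, "kl_syn": 1.0, "mul_sem": 0.5, "mul_syn": 0.5,
              "adv_sem": 0.5, "adv_syn": 0.5, "rec_sem": 0.5, "rec_syn": 0.5}

    def total_loss(terms):
        return terms["nll"] + sum(LAMBDA[k] * terms[k] for k in LAMBDA)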

Hyperparameter Tuning  We select the hyperparameter values with the lowest ELBO on the validation set in all experiments. They are tuned by (grouped) grid search on the validation set; due to the large hyperparameter space, we conduct tuning mostly for sensitive hyperparameters and admit that it is empirical. We choose the VAE as our baseline, and the KL weight of the VAE is tuned in the same way. We list the hyperparameters in Appendix A.

The training objective is optimized by Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.995, and an initial learning rate of 0.001. Word embeddings are 300-dimensional and initialized randomly. The dimension of each latent space (namely, z_syn and z_sem) is 100.

KL Annealing and Word Dropout  We adopt the tricks of KL annealing and word dropout from Bowman et al. (2016) to avoid KL collapse. We anneal λ^KL_sem and λ^KL_syn from zero to predefined values in a sigmoid manner. Besides, the word dropout trick randomly replaces the ground-truth token with <unk> with a fixed probability of 0.50 at each time step of the decoder during training.
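Both tricks are small; a sketch follows. The sigmoid slope and its mid-training center are illustrative assumptions, not the paper's exact schedule:

    import math
    import random

    def kl_weight(step, total_steps, target):
        # sigmoid annealing from ~0 up to `target`
        return target / (1.0 + math.exp(-10.0 * (step / total_steps - 0.5)))

    def word_dropout(tokens, unk_id, p=0.5):
        # replace each ground-truth decoder input with <unk> w.p. p
        return [unk_id if random.random() < p else t for t in tokens]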

4 Experiments

We evaluate our method on reconstruction and unconditional language generation (§4.1). Then, we apply it to two applications, namely, unsupervised paraphrase generation (§4.2) and syntax-transfer generation (§4.3).

4.1 Reconstruction and Unconditional Language Generation

First, we compare our model in reconstruction and unconditional language generation with a traditional VAE and a syntactic language model (PRPN, Shen et al., 2017).

Dataset  We followed previous work (Bowman et al., 2016) and used a standard benchmark, the WSJ sections of the Penn Treebank (PTB) (Marcus et al., 1993). We also followed the standard split: Sections 2–21 for training, Section 24 for validation, and Section 23 for test.

Settings  We trained VAE and DSS-VAE, both with 100-dimensional RNN states. For the vocabulary, we chose the 30k most frequent words. We trained PRPN with the default parameters in the codebase.2

Evaluation  We evaluate model performance with the following metrics:

2 https://github.com/yikangshen/PRPN

Page 6: Generating Sentences from Disentangled Syntactic and ...nlp.nju.edu.cn/homepage/Papers/bao_acl19.pdf · In this paper, we propose to generate sentences from a disentangled syntactic

KL-Weight   BLEU↑   Forward PPL↓
1.3         7.26    34.01
1.2         7.41    35.00
1.0         8.19    36.53
0.7         8.98    42.44
0.5         9.07    44.11
0.3         9.26    48.70
0.1         9.36    49.73

Table 1: BLEU and Forward PPL of VAE with varying KL weights on the PTB test set. The larger↑ (or lower↓), the better.

1. Reconstruction BLEU. The reconstruction task aims to generate the input sentence itself. In this task, both syntactic and semantic vectors are chosen as the predicted mean of the encoded distribution. We evaluate reconstruction performance by the BLEU score (Papineni et al., 2002) with the input as the reference.3 It reflects how well the model preserves input information, and is crucial for representation learning and “goal-oriented” text generation.

2. Forward PPL. We then perform unconditioned generation, where both syntactic and semantic vectors are sampled from the prior. Forward perplexity (PPL) (Zhao et al., 2018) is the generated sentences' perplexity score predicted by a pretrained language model.4 It shows the fluency of sentences generated from VAE's prior. We computed Forward PPL based on 100K sampled sentences.

3. Reverse PPL. Unconditioned generation is further evaluated by Reverse PPL (Zhao et al., 2018). It is obtained by first training a language model5 on 100K sentences sampled from a generation model; then, Reverse PPL is the perplexity of the PTB test set under the trained language model. Reverse PPL evaluates the diversity and fluency of sentences sampled from a language generation model. If the sampled sentences are of low diversity, the language model would be trained only on similar sentences; if the sampled sentences are of low fluency, the language model would be trained on unfluent sentences. Both will lead to a higher Reverse PPL. For comparing VAE and DSS-VAE, we sample latent variables from the prior and feed them to the decoder for generation; for LSTM-LM, we first feed the start-of-sentence token <s> to the decoder and sample the word at each time step by the predicted probabilities (i.e., forward sampling).

3 We evaluate corpus BLEU as implemented in https://www.nltk.org/_modules/nltk/translate/bleu_score.html
4 We used an LSTM language model trained on the One-Billion-Word Corpus (http://www.statmt.org/lm-benchmark).
5 A tied LSTM-LM with 300 dimensions and two layers, implemented in https://github.com/pytorch/examples/tree/master/word_language_model


Figure 3: Comparing DSS-VAE and VAE in language generation with different KL weights. We performed linear regression for each model to show the trend. The upper-left corner (larger BLEU but smaller PPL) indicates better performance.


Results  We see in Table 1 that BLEU and PPL are more or less contradictory. Usually, a smaller KL weight makes the autoencoder less “variational” but more “deterministic,” leading to less fluent sampled sentences but better reconstruction. If this trade-off is not analyzed explicitly, a VAE variant could achieve arbitrary results through KL-weight tuning, which is unfair.

We therefore present the scatter plot in Figure 3, showing the trend of Forward PPL and BLEU scores with different KL weights. Clearly, DSS-VAE outperforms a plain VAE in BLEU if Forward PPL is controlled, and in Forward PPL if BLEU is controlled. The scatter plot shows that our proposed DSS-VAE outperforms its original counterpart in language generation across different KL weights.

In terms of Reverse PPL (Table 2), DSS-VAE also achieves better Reverse PPL than a traditional VAE. Since DSS-VAE leverages syntax to improve sentence generation, we also include a state-of-the-art syntactic language model (PRPN-LM, Shen et al., 2017) for comparison. Results show that DSS-VAE achieves a Reverse PPL comparable to (and slightly better than) PRPN-LM.


Model      Reverse PPL↓
Real data  70.76
LSTM-LM    132.46
PRPN-LM    116.67
VAE        125.86
DSS-VAE    116.23

Table 2: Reverse PPL reflects the diversity and fluency of sampled data; the lower↓, the better. Language models are trained on model samples and evaluated on the real test set. We set the same KL weight (1.0) for DSS-VAE and VAE here.

It is also seen that explicitly modeling syntactic structures does yield better generation results: DSS-VAE and PRPN consistently outperform VAE and LSTM-LM in sentence generation.

We also include the Reverse PPL of the real training sentences. As expected, training a language model on real data outperforms training on sentences sampled from a generation model, showing that there is still much room for improvement for all current sentence generators.

4.2 Unsupervised Paraphrase Generation

Given an input sentence, paraphrase generation aims to synthesize a sentence that appears different from the input but conveys the same meaning. We propose a novel approach to unsupervised paraphrase generation with DSS-VAE. Assuming a DSS-VAE well trained according to §3.3, our approach works at the inference stage.

For a particular input sentence x*, let q(z_syn|x*) and q(z_sem|x*) be the encoded posterior distributions of the syntactic and semantic spaces, respectively. The inferred latent vectors are:

z*_sem = argmax_{z_sem} q(z_sem|x*)    (9)

z*_syn ∼ q(z_syn|x*)    (10)

and they are further combined as:

z* = [z*_syn; z*_sem]    (11)

Finally, z* is fed to the decoder, and greedy decoding is performed for paraphrase generation.

The intuition behind this is that, when generating a paraphrase, the semantics should remain the same, but the syntax of the paraphrase could (and should) vary. Therefore, we sample a z*_syn vector from its probabilistic distribution, while fixing z*_sem.
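The inference procedure of Eqns. (9)–(11) is a few lines on top of the encoder sketch above; `decoder.greedy_decode` is a hypothetical API:

    import torch

    @torch.no_grad()
    def paraphrase(encoder, decoder, tokens):
        post = encoder(tokens)
        z_sem = post["sem"][0]                  # argmax of a Gaussian = its mean
        mu_syn, log_var_syn = post["syn"]
        z_syn = mu_syn + torch.exp(0.5 * log_var_syn) * torch.randn_like(mu_syn)
        z = torch.cat([z_syn, z_sem], dim=-1)   # Eqn. (11)
        return decoder.greedy_decode(z)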

Model                      BLEU-ref↑   BLEU-ori↓
Origin Sentence†           30.49       100
VAE-SVG-eq (supervised)‡   22.90       –
VAE (unsupervised)†        9.25        27.23
CGMH†                      18.85       50.18
DSS-VAE                    20.54       52.77

Table 3: Performance of paraphrase generation. The larger↑ (or lower↓), the better. Some results are quoted from †Miao et al. (2019) and ‡Gupta et al. (2018).

Dataset  We used the established Quora dataset6 to evaluate paraphrase generation, following previous work (Miao et al., 2019). The dataset contains 140k pairs of paraphrase sentences and 260k pairs of non-paraphrase sentences. In the standard dataset split, there are 3k and 30k held-out validation and test sets, respectively. In this experiment, we consider the unsupervised setting of Miao et al. (2019), using all non-paraphrase sentences as training samples. Note that we only validate our model on the non-paraphrase held-out validation set, selecting by the lowest validation ELBO.

Evaluation  Since the test set contains a reference paraphrase for each input, it is straightforward to compute BLEU against the reference, denoted by BLEU-ref. However, this metric alone does not capture whether the generated sentence is different from the input; thus, Miao et al. (2019) propose to also compute BLEU against the original sentence (denoted as BLEU-ori), which ideally should be low. We only consider DSS-VAE configurations that yield a BLEU-ori lower than 55, a threshold empirically suggested by Miao et al. (2019) to ensure that the obtained sentence differs from the original to at least a certain degree.

Results  Table 3 shows the performance of unsupervised paraphrase generation. In the first row of Table 3, simply copying the original sentences yields the highest BLEU-ref, but is meaningless as it has a BLEU-ori score of 100. We see that DSS-VAE outperforms CGMH and the original VAE in BLEU-ref. In particular, DSS-VAE achieves a BLEU-ref closer to that of a supervised paraphrase method (Gupta et al., 2018).

We admit that it is hard to present the trade-off by listing a single score for each model in Table 3. We therefore present the scatter plot in Figure 4 to further compare these methods.

6 https://www.kaggle.com/c/quora-question-pairs/data



Figure 4: Trade-off between BLEU-ori (the lower, the better) and BLEU-ref (the larger, the better) in unsupervised paraphrase generation. Again, the upper-left corner indicates better performance.

As seen, the trade-off is fairly linear and less noisy than in Figure 3. The curve of DSS-VAE is located to the upper left of the competing methods. In other words, the plain VAE and CGMH are “inadmissible”: DSS-VAE simultaneously outperforms them in both BLEU-ori and BLEU-ref, indicating that DSS-VAE outperforms previous state-of-the-art methods in unsupervised paraphrase generation.

4.3 Syntax-Transfer Generation

In this experiment, we propose a novel application of syntax-transfer text generation, inspired by previous sentiment-style transfer studies (Hu et al., 2017; Fu et al., 2018; John et al., 2018).

Consider two sentences:

x1: There is a dog behind the door.

x2: The child is playing in the garden.

If we would like to generate a sentence having the “there is/are” syntax of x1 but conveying the meaning of x2, we can graft the respective syntactic and semantic vectors as:

z*_sem = argmax_{z_sem} q(z_sem|x_2)
z*_syn = argmax_{z_syn} q(z_syn|x_1)
z = [z*_sem; z*_syn]

and then feed z to the decoder to obtain a syntax-transferred sentence.
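Grafting amounts to swapping posterior means between two encoded sentences; a sketch using the hypothetical encoder/decoder APIs above:

    import torch

    @torch.no_grad()
    def syntax_transfer(encoder, decoder, x1_tokens, x2_tokens):
        z_syn = encoder(x1_tokens)["syn"][0]   # mu of q(z_syn | x1): syntax provider
        z_sem = encoder(x2_tokens)["sem"][0]   # mu of q(z_sem | x2): meaning provider
        return decoder.greedy_decode(torch.cat([z_sem, z_syn], dim=-1))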

Dataset and Evaluation  To evaluate this task, we constructed a subset of the Stanford Natural Language Inference (SNLI) corpus, containing 1000 non-paraphrase pairs. SNLI sentences can be thought of as a simple domain-specific corpus, but were all written by humans. In each pair we constructed, one sentence serves as the semantic provider (denoted by Refsem), and the other serves as the syntactic provider (denoted by Refsyn). The goal of syntax-transfer text generation is to synthesize a sentence that resembles Refsem but not Refsyn in semantics, and resembles Refsyn but not Refsem in syntax. For the semantic part, we use traditional word-based BLEU scores to evaluate how close the generated sentence is to Refsem and how different it is from Refsyn. For syntactic similarity, we use the zss package7 to calculate the Tree Edit Distance (TED, Zhang and Shasha, 1989). TED is essentially the minimum-cost sequence of node edit operations (namely, delete, insert, and rename) between two trees, which reflects the difference between two syntactic trees.
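For reference, the zss package computes TED directly between labeled trees; a toy example follows (converting real parse trees into zss.Node objects is left out):

    from zss import Node, simple_distance

    # two toy parse trees differing in one node label
    a = Node("S", [Node("NP"), Node("VP", [Node("VBZ"), Node("NP")])])
    b = Node("S", [Node("NP"), Node("VP", [Node("VBZ"), Node("ADJP")])])
    print(simple_distance(a, b))  # 1: one 'rename' operation (NP -> ADJP)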

Since we hope the generated sentence has a higher word-BLEU score compared with Refsem but a lower word-BLEU score compared with Refsyn, we compute their difference, denoted by ∆word-BLEU, to consider both. Likewise, ∆TED is also computed. We further take the geometric mean of ∆word-BLEU and ∆TED to take both into account.

Results  We see from Table 4 that a traditional VAE cannot accomplish the task of syntax transfer. This is because Refsyn and Refsem, even if we artificially split the latent space into two parts, play the same role in the decoder. With the multi-task and adversarial losses for the syntactic and semantic latent spaces, the total difference increases to 12.09, which shows the success of syntax-transfer sentence generation. This further implies that explicitly modeling syntax is feasible in the latent space of VAE. We then incrementally applied the adversarial reconstruction losses proposed in §3.2.3.

As seen, an adversarial reconstruction loss drastically strengthens the role of the other space. For example, +L^(adv)_rec(z_sem) repels information to the syntactic space and achieves the highest ∆TED.

When applying the adversarial reconstruction losses to both the semantic and syntactic spaces, we achieve a balance between ∆word-BLEU and ∆TED, both ranking second in their respective columns. Eventually, we achieve the highest total difference, showing that our full DSS-VAE model achieves the best performance in syntax-transfer generation.

7 https://github.com/timtadh/zhang-shasha


Model                                              | word-BLEU (corpus)           | Average TED (per sentence)   | Geo Mean ∆↑
                                                   | Refsem↑ Refsyn↓ ∆word-BLEU↑  | Refsem↑ Refsyn↓ ∆TED↑        |
VAE                                                | 6.81    6.68    0.13         | 149.22  148.59  0.63         | 0.29
L^(mul)_sem + L^(mul)_syn + L^(adv)_sem + L^(adv)_syn | 12.14 6.22    5.92         | 159.51  134.80  24.71        | 12.09
  +L^(adv)_rec(z_sem)                              | 11.83   6.60    5.23         | 163.40  131.27  32.13        | 12.96
  +L^(adv)_rec(z_syn)                              | 14.33   6.07    8.26         | 159.20  134.22  24.98        | 14.36
  +L^(adv)_rec(z_syn) + L^(adv)_rec(z_sem)         | 13.74   6.15    7.59         | 161.94  131.09  30.85        | 15.30

Table 4: Performance of syntax-transfer generation. The larger↑ (or lower↓), the better. The results of VAE are obtained by averaging interpolation. ∆word-BLEU = word-BLEU(Refsem) − word-BLEU(Refsyn). We also compute ∆TED = TED(Refsem) − TED(Refsyn) to measure whether the generated sentence is syntactically similar to Refsyn but not Refsem. Due to the difference in scale between BLEU and TED, we compute the geometric mean of ∆word-BLEU and ∆TED to reflect the total difference.


Discussion on syntax transfer between incompatible sentences  We provide a few case studies of syntax-transfer generation in Appendix B. We empirically find that syntactic transfer between “compatible” sentences gives more promising results than transfer between “incompatible” sentences. Intuitively, this is reasonable because it may be hard to transfer a sentence with a length of, say, 5 to a sentence with a length of 50.

5 Conclusion

In this paper, we propose the novel DSS-VAE model, which explicitly models syntax in the distributed latent space of VAE and enjoys the benefits of sampling and manipulation with respect to the syntax of a sentence. Experiments show that DSS-VAE outperforms the VAE baseline in reconstruction and unconditioned language generation. We further make use of the sampling and manipulation advantages of DSS-VAE in two novel applications, namely unsupervised paraphrase generation and syntax-transfer generation. In both experiments, DSS-VAE achieves promising results.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments. This work is supported by the National Science Foundation of China (No. 61772261 and No. 61672277) and the Jiangsu Provincial Research Foundation for Basic Research (No. BK20170074).

References

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL, pages 10–21.
Eugene Charniak. 2001. Immediate-head parsing for language models. In ACL, pages 124–131.
Ciprian Chelba. 1997. A structured language model. In ACL, pages 498–500.
Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In ACL, pages 1936–1945.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.
Alexander Clark. 2001. Unsupervised induction of stochastic context-free grammars using distributional clustering. In Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning.
Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. 2018. Syntax-directed variational autoencoder for structured data. In ICLR.
Jan Milan Deriu and Mark Cieliebak. 2018. Syntactic manipulation for generating more diverse and interesting texts. In Proceedings of the 11th International Conference on Natural Language Generation, pages 22–34.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL, pages 199–209.
Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In ACL, pages 823–833.
Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In AAAI, pages 663–670.
Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276.
Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In AAAI, pages 5149–5156.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML, pages 1587–1596.
Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2018. Disentangled representation learning for text style transfer. arXiv preprint arXiv:1808.04339.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.
Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In EACL, pages 1249–1258.
Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. 2017. Grammar variational autoencoder. In ICML, pages 1945–1954.
Juncen Li, Robin Jia, He He, and Percy Liang. 2018a. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In ACL, pages 1865–1874.
Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling source syntax for neural machine translation. In ACL, pages 688–697.
Juntao Li, Yan Song, Haisong Zhang, Dongmin Chen, Shuming Shi, Dongyan Zhao, and Rui Yan. 2018b. Generating classical Chinese poems via conditional variational autoencoder and adversarial training. In EMNLP, pages 3890–3900.
Lemao Liu, Muhua Zhu, and Shuming Shi. 2018. Improving sequence-to-sequence constituency parsing. In AAAI, pages 4873–4880.
Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018. Bag-of-words as target for neural machine translation. In ACL, pages 332–338.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In AAAI.
Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318.
Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In EMNLP, pages 627–637.
Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.
Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2017. Neural language modeling by jointly learning syntax and lexicon. In ICLR.
Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In CVPR, pages 7167–7176.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.
Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In NIPS, pages 2773–2781.
Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xin-Yu Dai, and Jiajun Chen. 2017. Neural machine translation with word predictions. In EMNLP, pages 136–145.
Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.
Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In ICML, pages 5897–5906.
Hao Zhou, Zhaopeng Tu, Shujian Huang, Xiaohua Liu, Hang Li, and Jiajun Chen. 2017. Chunk-based bi-scale decoder for neural machine translation. In ACL, pages 580–586.


A Hyperparameter Details

We list the hyperparameters in Tables 5 and 6. Every 500 batches, we save the model if it achieves a lower evidence lower bound (ELBO) on the validation set.

B Case Study of Syntax Transfer

We provide a few examples in Table 7. In all cases, a plain VAE “interpolates” two sentences without consideration of syntax and semantics, whereas our DSS-VAE is able to transfer the syntax without changing the meaning much. In the first example, DSS-VAE successfully transfers a “subject-be-predicative” sentence into a “there is/are” sentence. In the second example, the semantic reference has the same syntactic structure as the syntax reference, and as a result, DSS-VAE generates the same sentence as Refsem. In the last example, we transfer a “there is/are” sentence into a “subject-be-predicative” sentence, and DSS-VAE is again able to generate the desired syntax.


Hyper-parameter   Value
λ^KL_sem          1.0
λ^KL_syn          1.0
λ^mul_sem         0.5
λ^mul_syn         0.5
λ^adv_sem         0.5
λ^adv_syn         0.5
λ^rec_sem         0.5
λ^rec_syn         0.5
Batch size        32
GRU dropout       0.1

Table 5: The hyperparameters used for the PTB dataset.

Hyper-parameter   Value
λ^KL_sem          1/3
λ^KL_syn          2/3
λ^mul_sem         5.0
λ^mul_syn         1.0
λ^adv_sem         0.5
λ^adv_syn         0.5
λ^rec_sem         1.0
λ^rec_syn         0.05
Batch size        50
GRU dropout       0.3

Table 6: The hyperparameters used for the Quora dataset.

Semantic and Syntactic Providers              | Syntax-Transfer Output
Refsyn: There is an apple on the table.       | VAE: The man is in the kitchen.
Refsem: The airplane is in the sky.           | DSS-VAE: There is a airplane in the sky.
                                              |
Refsyn: The shellfish was cooked in a wok.    | VAE: The man was filled with people.
Refsem: The stadium was packed with people.   | DSS-VAE: The stadium was packed with people.
                                              |
Refsyn: The child is playing in the garden.   | VAE: There is a person in the garden.
Refsem: There is a dog behind the door.       | DSS-VAE: A dog is walking behind the door.

Table 7: Case studies of syntax-transfer generation.