Dual Learning for Query Generation and Query Selection in Query Feeds Recommendation

Kunxun Qi 1,2, Ruoxu Wang 2, Qikai Lu 3, Xuejiao Wang 2, Ning Jing 2, Di Niu 3, Haolan Chen 2
1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
2 Platform and Content Group, Tencent, Shenzhen, China
3 University of Alberta, Edmonton, Canada

[email protected],[email protected]

ABSTRACT
Query feeds recommendation is a new recommendation paradigm in mobile search applications, where a stream of queries needs to be recommended to improve user engagement. It requires a great quantity of attractive queries for recommendation. A conventional solution is to retrieve queries from a collection of past queries recorded in user search logs. However, these queries usually have poor readability and limited coverage of article content, and are thus not suitable for the query feeds recommendation scenario. Furthermore, to deploy the generated queries for recommendation, human validation, which is costly in practice, is required to filter out unsuitable queries. In this paper, we propose TitIE, a query mining system that generates valuable queries from the titles of documents. We employ both an extractive text generator and an abstractive text generator to generate queries from titles. To improve the acceptance rate during human validation, we further propose a model-based scoring strategy to pre-select the queries that are more likely to be accepted. Finally, we propose a novel dual learning approach to jointly learn the generation model and the selection model by making full use of unlabeled corpora under a semi-supervised scheme, thereby simultaneously improving the performance of both models. Results from both offline and online evaluations demonstrate the superiority of our approach.

CCS CONCEPTS
• Information systems → Query suggestion; Query intent.

KEYWORDS
Query feeds recommendation; Query generation; Query suggestion

ACM Reference Format:
Kunxun Qi, Ruoxu Wang, Qikai Lu, Xuejiao Wang, Ning Jing, Di Niu, Haolan Chen. 2021. Dual Learning for Query Generation and Query Selection in Query Feeds Recommendation. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM '21), November 1–5, 2021, Virtual Event, QLD, Australia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3459637.3481910

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM '21, November 1–5, 2021, Virtual Event, QLD, Australia
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8446-9/21/11 . . . $15.00
https://doi.org/10.1145/3459637.3481910

1 INTRODUCTION
Feeds recommendation aims at recommending content to users through never-ending feeds in mobile applications [35]. In this paper, we study query feeds recommendation (QFR), where a stream of search queries is recommended to users through never-ending feeds as they scroll down below the search box in a mobile application. It has been widely employed in existing search applications to increase user engagement and improve the search experience [7, 17]. Figure 1 shows an example of QFR. Different from related tasks such as query recommendation [8, 18, 30, 42], QFR recommends queries that appeal to users based on their user profiles, without an anchor query, which means that plenty of interesting queries are needed to cover users' interests.

Figure 1: Example of the QFR scenario.

In general, the recommended queries can be easily collected from user search logs [2, 3, 30, 42]. However, these queries are generally unsuited for the QFR scenario, for two reasons. On one hand, the queries issued by users during active search are usually presented as a set of space-separated keywords of unlimited length and poor readability. Such queries are neither concise nor attractive, and are thus unsuited for the QFR scenario. On the other hand, collecting queries from user search logs lags behind new content in time, suffering heavily from limited coverage and generalizability [8, 36].

In this paper, we study the generation of valuable queries from document titles for QFR in Tencent QQ Browser¹, where millions of new content documents are created each day. Query generation from article titles can provide a massive number of interesting and attractive queries for QFR, helping to improve the interest coverage and search experience of users.

¹ Tencent QQ Browser has the largest market share in the Chinese mobile browser market, with more than 100 million daily active users.

Nowadays, neural sequence-to-sequence (seq2seq) models [28] are dominant for text generation tasks, including text summarization [24] and machine translation [9]. However, common seq2seq models suffer from low generalizability in the open domain [8, 18], relying heavily on a manually annotated training corpus. In practice, manual annotation of a training corpus is laborious and error-prone. Thus, an effective strategy for training a query generation model with a limited annotated corpus is necessary. Moreover, the generated queries require human validation before incorporation into online QFR, yet the crowdsourcing resources have a daily usage limit. To use these resources more efficiently, an automatic strategy for pre-selecting high-quality generated queries is necessary.

To tackle the above challenges, we propose TitIE (short for Title Information Extractor), a query mining system that provides valuable queries for QFR at Tencent QQ Browser, as illustrated in Figure 2. To extract queries from article titles, we propose a query extraction model named QEBERT (short for Query Extraction based on the pre-trained language model BERT [5]). It creates a query by extracting a text span (excerpt) from a given title rather than generating a new query from scratch. For some titles, however, no suitable span exists from which a query can be extracted. To fill this gap, we introduce the pre-trained text generation model mT5 [41] for abstractive query generation, which is complementary to the QEBERT model. In practice, the generated queries require human validation before incorporation into online QFR, yet the crowdsourcing resources have a daily usage limit. To improve the acceptance rate during human validation, and thereby make full use of the crowdsourcing resources, we propose a query selection model named QSBERT (short for Query Selection based on the pre-trained language model BERT) to pre-select the extracted and generated queries, such that only queries predicted as acceptable are presented for human validation. The selected queries are then manually validated by crowdsourcing workers, who accept queries judged suitable (measured by informativeness, coherence, etc.) for recommendation to users. To further improve performance, we carefully design a novel dual learning framework to jointly learn the query generation models and the query selection model, making full use of the unlabeled corpus to mutually improve the performance of both models, as illustrated in Figure 3.

We empirically evaluate our framework through both offline and online evaluations. The offline evaluation is two-fold. First, we create a 50K manually annotated dataset collected from document titles in Tencent QQ Browser. It consists of 40K (query, title) tuples for training, 5K for validation and 5K for testing. Experimental results on the test set show that our approach outperforms all baseline models on a number of metrics, including ROUGE-n [16], BLEU-n [19] and exact match score. Second, we carry out a human evaluation to verify whether our approach improves the fluency and relevance of generated queries. Results show that our approach outperforms all baseline models. For online evaluation, we verify the feasibility of our system through both A/B testing and human validation in Tencent QQ Browser. Results reveal that the click-through rate (CTR) increases by 26.89% and the acceptance rate increases by 33.0% when TitIE is employed for query generation in our QFR system.

TitIE has been deployed in Tencent QQ Browser since January 2021. To date, TitIE has generated more than 5 million high-quality queries from daily article titles, and is still growing at a rate of 60K new queries per day. We make the following contributions in the design of TitIE:
(1) We release a manually annotated dataset, collected from the titles of real-world documents in Tencent QQ Browser, to facilitate the training and evaluation of query generation from document titles for the query feeds recommendation scenario.
(2) We propose a simple but effective query extraction model named QEBERT, based on the pre-trained language model BERT, to extract queries from document titles. This overcomes the low-generalizability limitation of existing seq2seq models. We also introduce a pre-trained text generation model, mT5, for abstractive query generation, which is complementary to the QEBERT model.
(3) We propose a query selection model named QSBERT that pre-selects queries predicted as acceptable for human validation before incorporation into QFR, thus maximizing the utilization of crowdsourcing resources.
(4) We propose a novel training framework based on dual learning to jointly learn the query generation model and the query selection model. This allows us to make full use of the unlabeled corpus and mutually improve the performance of both models. To the best of our knowledge, this is the first adaptation of dual training to query generation and selection.
(5) We conduct extensive experiments to verify the effectiveness and superiority of our approach, including offline evaluations, online evaluations and a case study.

2 RELATED WORK
Our work is mainly related to the research fields of query recommendation, text summarization and dual learning in natural language processing (NLP).

2.1 Query recommendation
Query recommendation (QR), also called query suggestion or query expansion, aims at recommending a set of queries that are more expressive than the anchor query issued by the user. Early work on QR can be classified into document-based methods [6, 33, 39] and log-based methods [2, 3, 30, 42]. More recently, direct generation of queries from a given query using seq2seq [28] models has attracted increasing attention. The work [18] proposes to incorporate the pointer network [24] to directly copy words from the input context. The work [15] investigates the problem of feedback-aware query suggestion and proposes an adversarial framework for improving performance. The work [8] proposes a two-stage generation framework by partitioning the QR task into the problems of relevant word discovery and context-dependent query generation.

Different from the QR task, the query feeds recommendation scenario recommends a great number of interesting queries to users based on their user profiles without an anchor query, which means that massive numbers of attractive queries are needed to cover users' interests. Thus, we study generating queries from document titles in this paper.


Figure 2: The overview of the TitIE system. [Figure: user click logs on titles → high-PV titles → query extraction and query generation → generated queries → query selection (queries ranked by score; those above a threshold are selected, the rest filtered) → selected queries → human validation → validated queries → database holding the entire set of queries → recommendation system → recommending queries to users → mobile users, whose search behavior feeds back into the click logs. Example title: 全新丰田汉兰达运动版车型海外实拍,搭载3.5L发动机 ("The brand-new sport-utility Toyota Highlander is photographed overseas, equipped with a 3.5L engine"); extracted query: 全新丰田汉兰达运动版 ("The brand-new sport-utility Toyota Highlander"); generated query: 全新丰田汉兰达运动版实拍图 ("Photos of the brand-new sport-utility Toyota Highlander").]

2.2 Text Summarization
Approaches to text summarization can generally be categorized as extractive or abstractive.

Extractive summarization methods usually summarize the given documents by extracting and merging sentences within the documents. Recently, pre-trained language models like BERT [5] have been widely used across NLP, including for extractive summarization. The work [40] proposes a discourse-aware neural summarization model based on BERT to reduce summarization redundancy. The work [45] proposes an extractive summarization method based on BERT for text matching. Inspired by the promising results of these works, we propose QEBERT, which extracts a text span from a given title as the extracted query via span extraction with an answer pointer network [32]. It does not require a large vocabulary to adapt to open domains.

Abstractive summarization methods usually compose a new summary entirely based on an understanding of the documents. Seq2seq (sequence-to-sequence) [28] models have been a performant paradigm in abstractive summarization. The work [24] proposes an attention-based seq2seq model for abstractive summarization which not only generates words from the vocabulary, but also copies words from the source document. The work [31] proposes the Transformer, which relies solely on the multi-head attention mechanism for text understanding and text generation. More recently, pre-trained text generation models like T5 [22] have become dominant for several NLP tasks, including text summarization. Inspired by the promising results of pre-trained text generation models, we also integrate mT5 [41], a multilingual version of T5, for Chinese query generation.

2.3 Dual Learning in NLP
Dual learning is a learning paradigm that leverages the symmetric structure of learning problems that come in dual forms, such as English↔French translation and speech↔text transformation [37, 38]. It has been widely employed in many NLP tasks.

The work [9] first proposes a dual learning framework for machine translation. The work [20] proposes a dual learning framework for unsupervised text style transfer. The work [26] extends emotion-controllable response generation to a dual task that alternately generates emotional responses and emotional queries within a dual learning framework. Inspired by the promising results of these works, we carefully design a dual learning framework to jointly learn query generation and query selection.

3 THE TITIE SYSTEM
In this section, we describe the details of our proposed TitIE system for query feeds recommendation. The overview of TitIE is illustrated in Figure 2. Given a number of user click logs, TitIE first mines the document titles that are widely clicked by users, namely the high-PV titles. These titles are fed into the query generation module, which consists of a query extraction model, QEBERT, and a query generation model based on the pre-trained mT5, described in detail in Sections 3.1 and 3.2, respectively. The generated queries are filtered by the query selection model QSBERT, described in detail in Section 3.3. The selected queries are manually validated by crowdsourcing workers to filter out queries unsuitable for recommendation, including queries that are confusing or inappropriate to users. The validated queries are stored in a large-scale database along with the query profile features used by the recommendation system. The recommendation system presents attractive and interesting queries to users based on query profiles and user profiles. The search behaviors of the users are recorded in user click logs, thereby forming a closed loop with the recommendation system.

3.1 Formalization of QEBERT
Previous works have raised concerns about the low generalizability of seq2seq models, as the training corpus and the vocabulary size are usually limited in the open domain [8, 36]. In practice, the system suffers heavily from the cold start problem, wherein the annotation of a training corpus is laborious and error-prone. To tackle these limitations, we propose QEBERT, an efficient and effective neural model for query extraction from document titles based on the pre-trained language model BERT [5]. Inspired by the promising results of span extraction [32] for machine reading comprehension [25, 32] and extractive text summarization [40, 45], we adapt the well-known answer pointer network [32] to extract text spans from titles as extracted queries. Given a document title, QEBERT predicts the start and end positions of the potential query within the given title, as formally defined below.

Figure 3: The architecture of our proposed dual learning framework. [Figure: the query generation models are trained on the labeled corpus and predict on the unlabeled corpus; queries extracted from unlabeled data that the query selection model scores highly are added back to the training data with pseudo labels, and evaluation on the validation data provides rewards for both models.]

Lexical Encoder. Given a document title $T$, let $T = (w^t_1, \ldots, w^t_n)$ denote the token sequence within the title. Following [5], we create a new sequence $H = (w^h_1, \ldots, w^h_{n+2})$ by adding a special token [CLS] in front of the first token $w^t_1$ and a [SEP] after the last token $w^t_n$. This new sequence of tokens is fed into a lexical encoder, which is a pre-trained language model such as BERT [5]. By $X \in \mathbb{R}^{(n+2) \times d}$ we denote the output of the lexical encoder, where $n+2$ is the number of tokens in the new sequence and $d$ is the dimension of the token embeddings.

Answer Pointer Layer. This layer computes a probability distribution over start positions and a probability distribution over end positions for a query span. Formally, $p^1_i$, the probability that the $i$-th token is the start position, and $p^2_i$, the probability that the $i$-th token is the end position, are defined by

$$v^1_i = w_1^\top x_i + b_1 \qquad v^2_i = w_2^\top x_i + b_2 \qquad (1)$$

$$p^1_i = \frac{\exp(v^1_i)}{\sum_{j=1}^{n+2} \exp(v^1_j)} \qquad p^2_i = \frac{\exp(v^2_i)}{\sum_{j=1}^{n+2} \exp(v^2_j)} \qquad (2)$$

where $x_i$ denotes the $i$-th vector in $X$, $w_1 \in \mathbb{R}^d$ and $w_2 \in \mathbb{R}^d$ are trainable vectors, and $b_1$ and $b_2$ are trainable biases.

Given a set of training data $\mathcal{D} = \{H_i, Y^1_i, Y^2_i\}_{i=1}^N$, where $H_i$ denotes the input token sequence, $Y^1_i$ the labeled start position of the query span within the title, $Y^2_i$ the labeled end position of the query span within the title, and $N$ the total number of training examples, the entire model is fine-tuned by minimizing the cross-entropy loss:

$$\mathcal{L}_1(\theta) = -\frac{1}{N} \sum_{i=1}^N \left( \log p^1_{Y^1_i} + \log p^2_{Y^2_i} \right) \qquad (3)$$

where $\theta$ denotes all the trainable parameters in the model.
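As a concrete illustration, Equations (1)-(3) can be sketched in a few lines of NumPy; in the real system the matrix X would be the BERT output, and the function names here are our own illustration, not the paper's implementation:

```python
import numpy as np

def answer_pointer(X, w1, b1, w2, b2):
    """Start/end probability distributions over token positions (Eqs. 1-2).
    X: (n+2, d) token representations from the lexical encoder."""
    v1 = X @ w1 + b1                      # start-position logits v^1_i
    v2 = X @ w2 + b2                      # end-position logits v^2_i
    p1 = np.exp(v1 - v1.max())            # numerically stable softmax
    p1 /= p1.sum()
    p2 = np.exp(v2 - v2.max())
    p2 /= p2.sum()
    return p1, p2

def span_loss(p1, p2, y1, y2):
    """Cross-entropy loss for one labeled span (Eq. 3 with N = 1)."""
    return -(np.log(p1[y1]) + np.log(p2[y2]))
```

At inference time, the predicted span is simply the argmax of each distribution (with the constraint that the start precedes the end).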

3.2 Formalization of mT5 for query generation
For some titles, there exists no suitable span from which a query can be extracted. In this case, we need to generate new queries for these titles. As mentioned before, common seq2seq generation models suffer from low generalizability in the open domain. Thus, we propose to use the mT5 model [41] (the multilingual version of the T5 [22] model) for query generation to complement the QEBERT model. T5 (short for Text-to-Text Transfer Transformer) is a large-scale Transformer [31] model pre-trained on a large text corpus [22]. mT5 is its multilingual version, pre-trained on a large-scale 750GB multilingual corpus [41] covering 101 languages, including Chinese. Different from common seq2seq models, mT5 employs SentencePiece [14] tokenization, which means that it does not require a large vocabulary to achieve good generalizability. Moreover, as mT5 is pre-trained on a large corpus, it can be fine-tuned for the downstream task with less training data.

The model details of mT5 are presented in [22, 31, 41]. Given a set of training data $\mathcal{D} = \{H_i, Q_i\}_{i=1}^N$, where $H_i$ denotes the input token sequence, $Q_i = (w^q_1, \ldots, w^q_m)$ the corresponding query for the title, and $N$ the total number of training examples, the fine-tuning objective of mT5 can be formulated as:

$$\theta_{mT5} = \arg\max \sum_{(H,Q) \in \mathcal{D}} \sum_{m=1}^{N_m} \log P(w^q_m \mid w^q_{<m}, H; \theta_{mT5}) \qquad (4)$$

where $\theta_{mT5}$ denotes the parameters of the mT5 model and $N_m$ the length of the corresponding query token sequence.
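Equation (4) is the standard autoregressive (teacher-forcing) log-likelihood. The following sketch, our own illustration rather than mT5 code, computes the inner sum for one (title, query) pair given the model's probabilities for the gold tokens:

```python
import math

def sequence_log_likelihood(gold_token_probs):
    """Sum of log P(w^q_m | w^q_{<m}, H) over the gold query tokens
    (Eq. 4, one training pair). gold_token_probs[m] is the model's
    probability of the m-th gold token, conditioned on the title and
    the preceding gold tokens (teacher forcing)."""
    return sum(math.log(p) for p in gold_token_probs)
```

Fine-tuning maximizes this quantity summed over all training pairs, typically via gradient ascent on mini-batches.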


3.3 Formalization of QSBERT
The generated queries (both from QEBERT and mT5) must then undergo human validation via crowdsourcing to filter out queries that are confusing or inappropriate to users. However, there is a daily limit on crowdsourcing usage. To improve the utilization of crowdsourcing resources, a strategy for pre-selecting only good-quality queries for human validation is necessary. To address this challenge, we propose QSBERT, a query selection model based on the pre-trained language model BERT [5] that pre-selects generated queries that are more likely to be accepted during human validation. It is a 3-way classifier that classifies an input query into one of three categories: normal, confusing, or inappropriate. Note that the context of the title is not utilized in QSBERT in the form of natural language inference (NLI) [1, 34], since the input title is not considered during human validation.

Lexical Encoder. Given a query $Q$, let $Q = (w^q_1, \ldots, w^q_m)$ denote the token sequence within the query. We create a new sequence $P = (w^p_1, \ldots, w^p_{m+2})$ by adding a special token [CLS] in front of the first token $w^q_1$ and a [SEP] behind the last token $w^q_m$. This new sequence of tokens is fed to BERT to calculate its contextual representation. By $C \in \mathbb{R}^{(m+2) \times d}$ we denote the output of the lexical encoder, where $m+2$ is the number of tokens in the new sequence and $d$ is the dimension of the token embeddings. Following [5], we use the contextual embedding of the [CLS] token, namely $C_1$, as the contextualized representation for classification.

Selection Module. This module selects the generated queries that are likely to be accepted during human validation. Given a $d$-dimensional contextualized representation, a 3-way discriminator calculates the predicted probabilities of the three categories, formally defined by

$$y_i = \mathrm{softmax}(W c_i + b) \qquad (5)$$

where $c_i$ denotes the contextualized representation $C_1$ of the $i$-th record, $W \in \mathbb{R}^{L \times d}$ is a trainable matrix, $b \in \mathbb{R}^L$ is a trainable vector, and $L$ denotes the number of labels. Let $l$ denote the label index of the normal-query category; we use the probability of this category, namely $y_{i,l}$, as the predicted score for query selection. Given a threshold $\delta$, the queries satisfying $y_{i,l} > \delta$ are sent for human validation, ranked by $y_{i,l}$ in descending order. If the number of selected queries exceeds the daily crowdsourcing capacity, the top-scoring queries are selected for human validation.

Given a set of training data $\mathcal{D} = \{P_i, Y_i\}_{i=1}^N$, where $P_i$ denotes the input token sequence, $Y_i$ the label of the query, and $N$ the total number of training examples, the entire model is fine-tuned by minimizing the cross-entropy loss:

$$\mathcal{L}_2(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^L I(j = Y_i) \log y_{i,j} \qquad (6)$$

where $y_{i,j}$ denotes the $j$-th element of $y_i$, $\theta$ the trainable parameters of the model, and $I(\cdot)$ an indicator function that returns 1 if the condition is true and 0 otherwise.
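The selection rule described above (threshold $\delta$, descending ranking, daily capacity cap) can be sketched as follows; the function name and data layout are our own illustration:

```python
def select_for_validation(scored_queries, delta, daily_capacity):
    """Pre-select queries for human validation: keep queries whose
    predicted 'normal' probability y_{i,l} exceeds delta, rank them by
    score in descending order, and truncate to the daily crowdsourcing
    capacity. scored_queries: list of (query, score) pairs."""
    passed = [(q, s) for q, s in scored_queries if s > delta]
    passed.sort(key=lambda pair: pair[1], reverse=True)
    return passed[:daily_capacity]
```

Sorting before truncation ensures that when the threshold admits more queries than the daily capacity, only the highest-scoring ones reach the crowdsourcing workers.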

Algorithm 1 Dual learning for jointly training query generation and query selection.

Require: a set of labeled data $\mathcal{D}_a = \{T_i, Q_i\}_{i=1}^{N_1}$; a set of unlabeled data $\mathcal{D}_b = \{T_i\}_{i=1}^{N_2}$; a set of validation data $\mathcal{D}_c = \{T_i, Q_i\}_{i=1}^{N_3}$; the number of epochs $N_T$; the number of episodes $N_E$; a proportion $\beta$ for sampling a subset.
1: Initialize the model parameters $\theta_{gen}$ and $\theta_{sel}$ from the pre-trained language models.
2: for epoch from 1 to $N_T$ do
3:   Train the query generation model on $\mathcal{D}_a$ by:
4:     $\theta_{gen} \leftarrow \theta_{gen} + \eta\, r^{gen} \nabla_{\theta_{gen}} \log P(Q|T; \theta_{gen})$
5:   Predict on $\mathcal{D}_b$, obtaining predicted scores $z$.
6:   Get a set of pseudo examples $\hat{\mathcal{D}}_b = \{T_i, \hat{Q}_i\}_{i=1}^{N_2}$.
7:   Get a $\beta\%$ subset $S$ from $\hat{\mathcal{D}}_b$ by ranking $z$.
8:   for episode from 1 to $N_E$ do
9:     Divide $S$ into a set of mini-batches $B = \{B_1, \ldots, B_{N_k}\}$.
10:    Shuffle $B$.
11:    for $k$ from 1 to $N_k$ do
12:      Select queries from $B_k$, obtaining selected instances $B^s_k$.
13:      Train the query generation model on $B^s_k$.
14:      Evaluate on the validation set $\mathcal{D}_c$, obtaining a sequence of mean ROUGE scores $R = \{R_1, \ldots, R_{N_k}\}$.
15:    end for
16:    Calculate a sequence of rewards $\{r^{sel}_k\}_{k=1}^{N_k}$ by Equation (8).
17:    for $k$ from 1 to $N_k$ do
18:      Train the query selection model by:
19:        $\theta_{sel} \leftarrow \theta_{sel} + \eta \frac{r^{sel}_k}{\|B_k\|} \sum_{j=1}^{\|B_k\|} \nabla_{\theta_{sel}} \log P(y_j | Q_j; \theta_{sel})$
20:    end for
21:  end for
22:  $\mathcal{D}_s \leftarrow \cup_{k=1}^{N_k} B^s_k$
23:  $\mathcal{D}_a \leftarrow \mathcal{D}_a \cup \mathcal{D}_s$
24:  $\mathcal{D}_b \leftarrow \mathcal{D}_b \setminus \mathcal{D}_s$
25: end for
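Algorithm 1 can be sketched in Python with the two models abstracted as callables. The signatures, the batch size, and the interpretation of β as a fraction are our own assumptions for illustration, not the paper's implementation:

```python
import random

def dual_learning_epoch(train_gen, select, evaluate, update_sel,
                        D_a, D_b, n_episodes=1, beta=0.5, batch_size=2):
    """One outer epoch of Algorithm 1. Hypothetical callable interfaces:
      train_gen(pairs, rewards)  -- gradient step on (title, query) pairs
      select(titles) -> list of (title, pseudo_query, score)
      evaluate() -> mean ROUGE score on the validation set
      update_sel(batch, reward)  -- policy-gradient step on one batch
    Returns the augmented labeled set and the reduced unlabeled set."""
    train_gen(D_a, [1.0] * len(D_a))                   # lines 3-4: reward 1.0
    pseudo = sorted(select(D_b), key=lambda x: x[2], reverse=True)
    S = pseudo[: max(1, int(beta * len(pseudo)))]      # line 7: top beta subset
    for _ in range(n_episodes):                        # lines 8-21
        random.shuffle(S)
        batches = [S[i:i + batch_size] for i in range(0, len(S), batch_size)]
        R = []
        for batch in batches:                          # lines 11-15
            train_gen([(t, q) for t, q, s in batch], [s for t, q, s in batch])
            R.append(evaluate())
        mu = sum(R) / len(R)
        sigma = (sum((r - mu) ** 2 for r in R) / len(R)) ** 0.5 or 1.0
        for batch, r in zip(batches, R):               # lines 17-20, Eq. (8)
            update_sel(batch, (r - mu) / sigma)
    moved_titles = {t for t, q, s in S}
    D_a = D_a + [(t, q) for t, q, s in S]              # line 23
    D_b = [t for t in D_b if t not in moved_titles]    # line 24
    return D_a, D_b
```

The `or 1.0` guards against a zero standard deviation when an episode contains a single batch, a degenerate case Algorithm 1 does not need to handle.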

4 THE DUAL LEARNING FRAMEWORK
In this section, we describe our proposed dual learning framework for jointly learning the query generation model and the query selection model, as illustrated in Figure 3. Given a labeled corpus $\mathcal{D}_a = \{T_i, Q_i\}_{i=1}^{N_1}$, an unlabeled corpus $\mathcal{D}_b = \{T_i\}_{i=1}^{N_2}$ for semi-supervised learning, and a validation corpus $\mathcal{D}_c = \{T_i, Q_i\}_{i=1}^{N_3}$, where $T = (w^t_1, \ldots, w^t_n)$ denotes the token sequence within a title, $Q = (w^q_1, \ldots, w^q_m)$ the token sequence within the annotated query, and $N_1$, $N_2$ and $N_3$ the numbers of records in the respective corpora, our framework first trains a query generation model using the labeled data $\mathcal{D}_a$. It then generates queries for the unlabeled titles in $\mathcal{D}_b$. The selection model then computes a score for each generated query. A subset of the predicted data with high predicted scores is selected to augment the existing labeled corpus. Thereafter, we concurrently train the generator and the selector in the subsequent training loop. Inspired by the promising results of reinforcement learning [9, 43] for sequential decision-making, we propose a reward-based strategy to tune the query generation model, and use evaluation scores from the generation model to tune the query selection model. The entire process of our framework is formalized in Algorithm 1. Different from the work [43], our framework employs two agents (i.e., the generation and selection models) with different rewards, as formally described below.

4.1 Reward for Query Generation ModelIn general, the quality of the generated queries varies considerably.This could potentially lead to error accumulation during the trainingprocess. To address this limitation, we propose a reward basedstrategy to guide the learning process of the query generationmodel, such that examples with higher rewards contribute moreduring training. We define the rewards of original examples fromthe labeled corpus as 1.0. For examples generated from the unlabeledcorpus, the rewards are the corresponding scores predicted by thequery selection model. Formally, the reward is defined as:

    r_i^{gen} = \begin{cases} 1.0 & i \in A \\ y_{i,l} & \text{otherwise} \end{cases}    (7)

where y_{i,l} denotes the predicted score of the i-th example and A the set of indices of the original labeled examples in D_a.
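As a concrete sketch, Eq. (7) amounts to the following reward assignment; the function name and list-based interface are illustrative.

```python
# Per-example reward of Eq. (7): original labeled examples (indices in A)
# get reward 1.0; pseudo-labeled examples get the selector score y_{i,l}.

def generation_rewards(selector_scores, labeled_indices):
    """selector_scores: predicted score y_{i,l} for each example i.
       labeled_indices: set A of indices of examples from D_a."""
    return [1.0 if i in labeled_indices else score
            for i, score in enumerate(selector_scores)]
```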

4.2 Reward for Query Selection Model

In our framework, the query selection model aims to augment the training set with generated queries. We observe that the query generation model performs better when high-quality queries are selected for training. Thus, we propose a reward strategy for tuning the training process of the query selection model based on the performance of the query generation model on the validation set D_c. However, the performance of the query generation model is evaluated only after training on the entire dataset, which means that the reward feedback is delayed with respect to the selection process. Following the work [43], we address this limitation by regarding the selection process as a sequential decision-making process, wherein a reward score is calculated for each batch of selected queries. For each batch, we train the query generation model using the selected examples and then evaluate it on the validation set. We use the mean of the ROUGE-n scores (ROUGE-1, ROUGE-2 and ROUGE-l) as the evaluation metric. For each episode, we obtain a sequence of mean ROUGE scores R = {R_1, ..., R_{N_k}}, where N_k denotes the number of batches. For batch k, the reward is formally defined as:

    r_k^{sel} = \frac{R_k - \mu}{\sigma}    (8)

where \mu and \sigma respectively denote the mean and the standard deviation of R.
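Eq. (8) is a per-episode standardization of the batch scores. A minimal sketch follows; the zero-variance guard is our addition for numerical safety, not part of the formula.

```python
# Batch-level selector rewards of Eq. (8): standardize each batch's mean
# ROUGE score R_k by the mean and standard deviation of the episode.
import statistics

def selection_rewards(rouge_scores):
    """rouge_scores: mean ROUGE score R_k for each batch of one episode."""
    mu = statistics.mean(rouge_scores)
    sigma = statistics.pstdev(rouge_scores)  # population std dev of R
    if sigma == 0.0:
        # Degenerate episode: every batch scored the same, so no signal.
        return [0.0] * len(rouge_scores)
    return [(r - mu) / sigma for r in rouge_scores]
```

Batches that improved the generator more than average thus receive positive rewards, and below-average batches receive negative rewards.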

4.3 Optimization

Let P(Q|T; θ_gen) and P(y|Q; θ_sel) denote the query generation model and the query selection model, respectively. Based on the policy gradient theorem [29], the models are trained by maximizing the expected reward E[r]. The gradients of the expected reward E[r] with respect to the parameters θ_gen and θ_sel are formally defined as:

    \nabla_{\theta_{gen}} \mathbb{E}[r^{gen}] = r^{gen} \nabla_{\theta_{gen}} \log P(Q \mid T; \theta_{gen})    (9)

    \nabla_{\theta_{sel}} \mathbb{E}[r^{sel}] = r^{sel} \nabla_{\theta_{sel}} \log P(y \mid Q; \theta_{sel})    (10)

Table 1: Statistical information on datasets for training the query generation models.

                                        Train   Valid  Test
Size                                    40,000  5,000  5,000
Avg # of tokens in titles               23.25   23.84  22.74
Avg # of tokens in queries              9.56    10.01  10.03
Avg # of queries for each title         1.00    1.16   1.16
# of queries that are spans of titles   30,866  3,937  3,944

where r^{gen} and r^{sel} denote the rewards of the query generation model and the query selection model, respectively. The parameters of the models are updated by

    \theta \leftarrow \theta + \eta \nabla_{\theta} \mathbb{E}[r]    (11)

where \eta denotes the learning rate and \theta the parameters of the model. As the reward scores of the query selection model are calculated for each batch of selected queries, following the work [43], the gradient is calculated by

    \nabla_{\theta_{sel}} \mathbb{E}[r^{sel}] = \frac{r_k^{sel}}{\|B_k\|} \sum_{j=1}^{\|B_k\|} \nabla_{\theta_{sel}} \log P(y_j \mid Q_j; \theta_{sel})    (12)

where \|B_k\| denotes the number of examples in a batch and r_k^{sel} the reward of batch B_k computed by Equation (8).

5 EXPERIMENTS

In this section, we describe the details of our experiments². We conducted extensive experiments to verify the effectiveness of our approach, including offline and online evaluations and a case study.

5.1 Dataset and Evaluation Metrics

We created a 50K manually annotated dataset from a collection of document titles in Tencent QQ Browser from September to December 2020. The original dataset is randomly split into a 40K training set, a 5K validation set and a 5K test set. Table 1 reports detailed statistical information on the datasets. In all datasets, a record consists of a title and one or more associated queries, each of which is either a text span within the title or a new query written by a human annotator. Note that for the training set, each title corresponds to only one query, while for the validation and test sets, each title may be associated with up to three queries. Note also that the length of each annotated query is limited to 5-13 characters for better performance in the mobile app. We also collected an unlabeled corpus of 500K document titles from Tencent QQ Browser, which is used in our proposed dual learning framework for data augmentation. For evaluating the performance of query selection, we collected a 10K test set from human evaluation by crowdsourcing.

For evaluation metrics, we used the well-known ROUGE-n [16] scores, BLEU-n [19] scores and the Exact Match (EM) score. If a title in the validation or test set has more than one annotated query, we average the scores over the queries to obtain the final score of the example.

² The data and the code of our implementations are available at: https://github.com/qikunxun/TitIE
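For illustration, the averaging protocol can be sketched with a simple unigram-overlap ROUGE-1 F1 standing in for the full ROUGE/BLEU suite; the function names are ours.

```python
# Sketch of the evaluation protocol: when a title has multiple annotated
# queries, the per-reference scores are averaged into the example score.
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def example_score(candidate, references):
    """Average the metric over all annotated queries for one title."""
    return sum(rouge1_f1(candidate, ref) for ref in references) / len(references)
```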


Table 2: Comparison results for query generation on the test set.

Models           ROUGE-1  ROUGE-2  ROUGE-l  BLEU-1  BLEU-2  BLEU-3  BLEU-4  EM
Seq2seq          55.0     35.2     53.2     49.3    38.7    30.2    24.2    3.9
Transformer      57.1     47.7     56.8     51.7    46.9    42.3    38.1    4.3
PGN              65.0     59.2     64.7     59.1    56.0    53.1    50.3    12.1
QEBERT-1K        72.6     70.7     72.6     69.5    68.6    67.4    66.3    51.0
QEBERT-10K       80.9     78.8     80.9     77.9    76.9    75.6    74.3    58.5
QEBERT-40K       83.0     80.8     82.9     79.8    78.7    77.4    76.2    59.9
QEBERT-40K†      82.8     80.5     82.6     79.6    78.5    77.2    75.9    59.5
mT5-1K           69.6     63.5     69.6     64.7    61.9    58.9    55.8    28.2
mT5-10K          80.0     76.8     80.0     76.0    74.6    72.9    71.1    49.8
mT5-40K          84.2     81.2     84.1     81.0    79.6    77.9    76.2    56.8
mT5-40K†         84.5     81.4     84.4     81.3    79.7    78.1    76.3    56.5
DL-QEBERT-1K     75.3     73.1     75.2     72.1    71.0    69.8    68.5    52.2
DL-QEBERT-10K    82.6     80.3     82.5     79.5    78.4    77.1    75.7    59.3
DL-QEBERT-40K    84.6     82.3     84.5     81.3    80.2    78.8    77.5    60.5
DL-mT5-1K        71.8     66.5     71.8     67.1    64.6    61.8    58.9    31.5
DL-mT5-10K       82.1     78.9     82.0     78.8    77.2    75.4    73.5    53.0
DL-mT5-40K       85.3     82.5     85.2     82.0    80.7    79.2    77.5    58.6

5.2 Experimental Settings

We implemented our models in TensorFlow 1.14.0 and trained all models on a single Tesla P40 GPU. We applied our proposed dual learning framework to both the extractive generation model QEBERT and the abstractive generation model mT5. We call the enhanced models DL-QEBERT and DL-mT5, respectively.

For DL-QEBERT, the lexical encoder was initialized with the Chinese pre-trained BERT-base model with 12 transformer layers, which yields output token embeddings with a dimension of 768. The transformer encoder was built with 12 heads. DL-QEBERT was trained by Adam [13] for 5 training epochs. The initial learning rate was set to 1e-4, the mini-batch size to 128 and the dropout [27] rate to 0.1. Note that only the queries that are text spans of the titles are used for training the QEBERT model.

For DL-mT5, the entire model was initialized with the pre-trained mT5-base model with 12 transformer layers and a hidden dimension of 768. The transformer encoder and decoder were built with 12 heads. DL-mT5 was also trained by Adam [13], for 8 training epochs. The initial learning rate was set to 2e-4, the mini-batch size to 128 and the dropout [27] rate to 0.1.

For the above models, we set the hyper-parameters N_T to 3, N_E to 3 and β to 10 according to the performance on the development set. Every input title is truncated to 64 tokens and every query to 13 tokens. Inspired by the well-known policy-network pre-training strategy [4, 21, 43] widely used in reinforcement learning, we also pre-trained the QSBERT model on 100K online examples annotated by crowdsourcing.

5.3 Baseline Models

We compared our proposed framework against the following baseline models. The hidden dimension was set to 200 and the dropout rate to 0.1 for all baseline models. All baseline models were trained on the entire 40K training set.

Table 3: Results for offline evaluation on query selection.

Models      Precision  Recall  F1-measure
QSBERT      75.1       82.0    78.4
DL-QSBERT   78.3       79.4    78.8

Seq2seq [28] is the original sequence-to-sequence generative model. We implement this baseline with a Bi-LSTM [10] encoder and a uni-directional LSTM decoder.

Transformer [31] is the first sequence transduction model based entirely on the attention mechanism. It is widely used in pre-trained language models [5, 22, 41, 44].

PGN [24] (short for Pointer-Generator Network) is an attention-based seq2seq model which not only generates text from the target vocabulary but can also copy words from the source text.

QEBERT-40K† and mT5-40K† are the QEBERT model and the mT5 model trained under the well-known self-training [23] framework with the same labeled and unlabeled data used in DL-QEBERT-40K and DL-mT5-40K, respectively. They serve as ablation comparisons to help verify the effectiveness of the proposed dual learning framework.

5.4 Offline Evaluation

Following previous work [11] on evaluating text summarization, we carried out two forms of offline evaluation: an automatic evaluation and a human evaluation.

Automatic Evaluation. Table 2 reports the comparison results between the baseline models and the variants of our proposed models on the test set. Every value is a test score in percentage. The models QEBERT-40K and mT5-40K are the QEBERT model and the mT5 model trained on the entire 40K training set, respectively. QEBERT-10K and mT5-10K are the respective models trained on a 10K subset sampled from the full training set, and QEBERT-1K and


Figure 4: Comparison results for human evaluation. Average fluency / relevance scores: Seq2seq 1.19 / 1.02; Transformer 1.43 / 1.21; PGN 1.50 / 1.32; QEBERT 1.60 / 1.52; DL-QEBERT 1.72 / 1.53; mT5 1.62 / 1.47; DL-mT5 1.73 / 1.56.

mT5-1K refer to the models trained on a sampled 1K subset. Results show that our proposed models outperform all the baseline models on all metrics. The performance difference between DL-mT5-40K and mT5-40K is statistically significant with p-value 2.2e-6 < 0.001 by a two-tailed t-test, and the difference between DL-QEBERT-40K and QEBERT-40K is statistically significant with p-value 7.0e-6 < 0.001. These results demonstrate that the proposed dual learning framework improves the performance of the original QEBERT and mT5 models by a significant margin. To verify the effectiveness of our proposed dual learning framework, we also conducted an ablation comparison by training the QEBERT and mT5 models under the self-training framework [23] with the same unlabeled corpus used in the dual learning framework, as shown by QEBERT-40K† and mT5-40K†. Results show that models trained under our proposed dual learning framework achieve better performance than those trained by self-training. The performance difference between DL-mT5-40K and mT5-40K† is statistically significant with p-value 2.0e-4 < 0.001 by a two-tailed t-test, and the difference between DL-QEBERT-40K and QEBERT-40K† is statistically significant with p-value 6.8e-7 < 0.001. These results indicate that our proposed dual learning framework is more effective for enhancing the performance of query generation. In addition, we found that the proposed dual learning framework pushes the original models by a higher absolute gain in performance when the labeled corpus is limited, as shown by the comparisons between DL-QEBERT-1K and QEBERT-1K as well as DL-mT5-1K and mT5-1K. This implies that the proposed framework is especially suitable for scenarios with limited labeled training data.

Table 3 reports the comparison results between the QSBERT model and the DL-QSBERT model on a 10K test set for evaluating query selection. Every value is a test score in percentage. The DL-QSBERT model achieves a precision score of 78.3% on the test set, yielding an absolute gain of 3.2% over QSBERT. As mentioned above, the crowdsourcing resources have a daily usage limit (about 100K queries are evaluated per day), and we want the selected queries to be as likely as possible to be accepted in human evaluation. We therefore focus more on the precision score of the query selection model. The absolute gain in precision indicates that the DL-QSBERT model is better able to identify potentially acceptable queries.

Figure 5: Evaluation on three-day average acceptance rates (%): Seq2seq 11.1; Transformer 23.0; PGN 24.3; Search logs 28.7; QEBERT 31.5; mT5 31.4; DL-QEBERT 33.7; DL-mT5 35.6; DL-QEBERT + DL-QSBERT 56.8; DL-mT5 + DL-QSBERT 58.6; DL-mT5 + DL-QEBERT + DL-QSBERT (TitIE) 61.7.

Human Evaluation. To evaluate the linguistic quality of the generated queries, we carried out a human evaluation, as shown in Figure 4. We focused on two main aspects: fluency and relevance. The fluency indicator focuses on whether the generated queries are well-formed and grammatically correct, while the relevance indicator reflects whether the generated queries cover the salient information in the document titles. Following previous work [11, 12], we sampled 300 records from the test set and employed two crowdsourcing annotators to inspect the generated queries. Each generated query is rated on a scale of 0 (worst) to 2 (best). Each annotator evaluates the generated queries from all of the models, and the final scores are averaged across the two annotators. Results show that our proposed models outperform all baseline models. The proposed dual learning framework pushes mT5 by relative gains of 6.8% in fluency and 6.1% in relevance, and pushes QEBERT by relative gains of 7.5% in fluency and 0.6% in relevance. These results show that our proposed models are better able to generate valuable queries and that the proposed dual learning framework significantly improves the fluency and relevance of both the mT5 and QEBERT models.

5.5 Online Evaluation

We performed large-scale online evaluations to show that the TitIE system helps maximize the utilization of crowdsourcing resources and improves the performance of QFR in real-world applications. The online evaluation is two-fold. First, we evaluated the difference in acceptance rate between TitIE and its variants to verify that TitIE significantly improves the online acceptance rate of the generated queries. Second, we evaluated the performance of the recommendation system through large-scale A/B testing.

Online Acceptance Rate Evaluation. Figure 5 reports the average acceptance rates of different systems on online data collected over three days. For each day, 10K generated queries are manually validated by crowdsourcing. Results show that our proposed models DL-QEBERT and DL-mT5 increase the acceptance rate over the baseline models by more than 12.9% (calculated by (33.7% + 35.6%) / 2 - (11.1% + 23.0% + 24.3% + 28.7%) / 4) on average. Moreover, the acceptance rate increases by a further 23% when the generated queries are pre-selected by the DL-QSBERT model. Further improvement is attained when the DL-QEBERT model and the DL-mT5 model


Table 4: Results for online A/B Testing.

Metrics  Percentage Lift    Metrics  Percentage Lift
IPV      +0.33%             IUV      +0.17%
CPV      +27.18%            CUV      +23.01%
CTR-PV   +26.89%            CTR-UV   +22.86%

are simultaneously deployed in our system, achieving a 61.7% average acceptance rate and yielding an absolute gain of 33.0% over the queries mined from user search logs. These results imply that the proposed query generation models are able to generate queries of higher quality, while the proposed query selection strategy significantly improves the acceptance rate in human validation.

Online A/B Testing. We conducted a large-scale online A/B test to examine whether the TitIE system helps improve the performance of QFR. We split users into several buckets, where each bucket contains about 1,200,000 users. Note that both the users and the queries are labeled with user profiles and query profiles in our recommender system. We ranked candidate queries with a Click-Through Rate (CTR) prediction model [17], where the candidate queries are retrieved by matching users based on their profiles. The ranked queries are recommended to users in the QFR scenario of Tencent QQ Browser. We observed the activities of each bucket for 3 days based on the following metrics:

• Impression Page View (IPV): number of times that recommended queries are viewed by users.
• Impression User View (IUV): number of users who have viewed the recommended queries.
• Click Page View (CPV): number of times that recommended queries are clicked by users.
• Click User View (CUV): number of users who have clicked the recommended queries.
• Click-Through Rate on Page View (CTR-PV): CPV / IPV.
• Click-Through Rate on User View (CTR-UV): CUV / IUV.

The CTR-UV and CTR-PV metrics are the most important metrics, as they show how many queries are attractive to users in the recommendation stream. We selected the two buckets with the most similar activities for A/B testing to ensure that there is no statistical difference between the users in the two buckets. We then ran the A/B test for 7 days and compared the results on the above metrics. For bucket A (i.e., the experiment group), the queries generated by TitIE (about 5 million) are used for recommendation. For bucket B (i.e., the baseline group), the queries mined from user search logs (about 5 million) are used for recommendation.
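The two ratio metrics follow directly from the four counts; a trivial sketch with illustrative numbers (not from our experiments):

```python
# Compute the ratio metrics CTR-PV = CPV/IPV and CTR-UV = CUV/IUV
# from the four raw counts defined above. Counts here are illustrative.

def ctr_metrics(ipv, iuv, cpv, cuv):
    return {"CTR-PV": cpv / ipv, "CTR-UV": cuv / iuv}
```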

Table 4 reports the results of the online A/B test on the above metrics. Results show that the differences in the IPV and IUV metrics are not statistically significant (p-value > 0.05), while the CPV, CUV, CTR-PV and CTR-UV metrics increased significantly (p-value < 0.001) when the queries generated by the TitIE system are employed for recommendation, by 27.18%, 23.01%, 26.89% and 22.86%, respectively. These results reveal that the impression difference between buckets A and B is virtually negligible, while the queries generated by the TitIE system are more attractive and interesting to users, yielding higher CTR scores.

Table 5: Case study for queries from different sources.

Source       | Title                                                | Query
Search logs  | -                                                    | 英菲尼迪 q50l刹车 (Infiniti q50l brake)
QEBERT       | NASA照片显示火星异常结构,轮廓清晰 (Photos from NASA show the abnormal structure of Mars with clear outline) | NASA照片 (Photos from NASA)
DL-QEBERT    | (same title as above)                                | NASA照片显示火星异常结构 (Photos from NASA show the abnormal structure of Mars)
mT5          | 吉利与比亚迪和奇瑞相比,谁的技术更好? (Geely compared with BYD and Chery, whose technology is better?) | 比亚迪和奇瑞哪个好 (Which is better, BYD or Chery?)
DL-mT5       | (same title as above)                                | 吉利与比亚迪和奇瑞哪个好 (Which is better, Geely, BYD or Chery?)

5.6 Case Study

We conducted a simple case study on the generated queries from different sources, as shown in Table 5. The query mined from user search logs is presented as 3 keywords ("英菲尼迪 (Infiniti)", "q50l", "刹车 (brake)") separated by spaces. We speculate that this form of query is less attractive to users, which is supported by the results of the A/B testing. For the input title "NASA照片显示火星异常结构,轮廓清晰" (Photos from NASA show the abnormal structure of Mars with clear outline), the generated query from DL-QEBERT is more fluent and informative than that from QEBERT. For the input title "吉利与比亚迪和奇瑞相比,谁的技术更好?" (Geely compared with BYD and Chery, whose technology is better?), the generated query from DL-mT5 is more relevant to the title than that from mT5. We believe that these enhancements are attributable to the reward feedback in our proposed dual learning framework.

6 CONCLUSIONS

In this paper, we propose TitIE, a query mining system for providing valuable queries for query feeds recommendation at Tencent QQ Browser. A query extraction model based on the pre-trained language model BERT and a query generation model based on the pre-trained text generation model mT5 are proposed in TitIE for generating valuable queries from article titles. The generated queries are then pre-selected by a proposed query selection model that identifies the queries likely to be accepted during human validation. A novel dual learning framework is proposed to jointly learn the query generation models and the query selection model, thereby mutually improving the performance of both. We conducted both offline and online evaluations to assess the effectiveness of our proposed framework. In the offline evaluation, results from the automatic and human evaluations show that the proposed dual learning framework yields significant improvements for both the query extraction and generation models. In the online evaluation, results from the acceptance rate evaluation and the online A/B testing show that the TitIE system helps to improve the acceptance rate in human validation and the performance of query feeds recommendation.

ACKNOWLEDGMENTS

We thank Hai Wan for helpful discussions on the paper. This work is supported by the National Natural Science Foundation of China (No. 61976232 and 61876204).


REFERENCES

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 632–642.

[2] Zhicong Cheng, Bin Gao, and Tie-Yan Liu. 2010. Actively predicting diverse search intent from user browsing behaviors. In Proceedings of the 19th International Conference on World Wide Web. 221–230.

[3] Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. 2002. Probabilistic query expansion using query logs. In Proceedings of the 11th International Conference on World Wide Web. 325–332.

[4] David Silver, Aja Huang, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.

[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. 4171–4186.

[6] Bruno M. Fonseca, Paulo Braz Golgher, Bruno Pôssas, Berthier A. Ribeiro-Neto, and Nivio Ziviani. 2005. Concept-based interactive query expansion. In Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management. 696–703.

[7] Fred X. Han, Di Niu, Haolan Chen, Weidong Guo, Shengli Yan, and Bowei Long. 2020. Meta-Learning for Query Conceptualization at Web Scale. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3064–3073.

[8] Fred X. Han, Di Niu, Haolan Chen, Kunfeng Lai, Yancheng He, and Yu Xu. 2019. A Deep Generative Approach to Search Extrapolation and Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1771–1779.

[9] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual Learning for Machine Translation. In Advances in Neural Information Processing Systems. 820–828.

[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.

[11] Hanqi Jin, Tianming Wang, and Xiaojun Wan. 2020. Multi-Granularity Interaction Network for Extractive and Abstractive Multi-Document Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6244–6254.

[12] Hanqi Jin, Tianming Wang, and Xiaojun Wan. 2020. SemSUM: Semantic Dependency Guided Neural Abstractive Summarization. In The Thirty-Fourth AAAI Conference on Artificial Intelligence. 8026–8033.

[13] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations.

[14] Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 66–71.

[15] Ruirui Li, Liangda Li, Xian Wu, Yunhong Zhou, and Wei Wang. 2019. Click Feedback-Aware Query Recommendation Using Adversarial Examples. In Proceedings of The World Wide Web Conference. 2978–2984.

[16] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.

[17] Bang Liu, Weidong Guo, Di Niu, Chaoyue Wang, Shunnan Xu, Jinghong Lin, Kunfeng Lai, and Yu Xu. 2019. A User-Centered Concept Mining System for Query and Document Understanding at Tencent. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1831–1841.

[18] Xiaoyu Liu, Shunda Pan, Qi Zhang, Yu-Gang Jiang, and Xuanjing Huang. 2018. Generating Keyword Queries for Natural Language Queries to Alleviate Lexical Chasm Problem. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1163–1172.

[19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.

[20] Sheng Qian, Guanyue Li, Wen-Ming Cao, Cheng Liu, Si Wu, and Hau-San Wong. 2019. Improving representation learning in autoencoders via multidimensional interpolation and dual regularizations. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. 3268–3274.

[21] Pengda Qin, Weiran Xu, and William Yang Wang. 2018. Robust Distant Supervision Relation Extraction via Deep Reinforcement Learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2137–2147.

[22] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21 (2020), 140:1–140:67.

[23] H. Scudder. 1965. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory 11, 3 (1965), 363–371.

[24] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1073–1083.

[25] Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine Comprehension. In Proceedings of the 5th International Conference on Learning Representations.

[26] Lei Shen and Yang Feng. 2020. CDL: Curriculum Dual Learning for Emotion-Controllable Response Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 556–566.

[27] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.

[28] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems. 3104–3112.

[29] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems. 1057–1063.

[30] Bilyana Taneva, Tao Cheng, Kaushik Chakrabarti, and Yeye He. 2013. Mining acronym expansions and their meanings using query click log. In Proceedings of the 22nd International World Wide Web Conference. 1261–1272.

[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems. 5998–6008.

[32] Shuohang Wang and Jing Jiang. 2017. Machine Comprehension Using Match-LSTM and Answer Pointer. In Proceedings of the 5th International Conference on Learning Representations.

[33] Ryen W. White and Gary Marchionini. 2007. Examining the effectiveness of real-time query expansion. Information Processing & Management 43, 3 (2007), 685–704.

[34] Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics. 1112–1122.

[35] Xinwei Wu, Hechang Chen, Jiashu Zhao, Li He, Dawei Yin, and Yi Chang. 2021. Unbiased Learning to Rank in Feeds Recommendation. In The Fourteenth ACM International Conference on Web Search and Data Mining. 490–498.

[36] Yu Wu, Wei Wu, Dejian Yang, Can Xu, and Zhoujun Li. 2018. Neural Response Generation With Dynamic Vocabularies. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. 5594–5601.

[37] Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. 2017. Dual Supervised Learning. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. 3789–3798.

[38] Yingce Xia, Xu Tan, Fei Tian, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2018. Model-Level Dual Learning. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80. 5379–5388.

[39] Jinxi Xu and W. Bruce Croft. 1996. Query Expansion Using Local and Global Document Analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 4–11.

[40] Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Discourse-Aware Neural Extractive Text Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 5021–5031.

[41] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. CoRR abs/2010.11934 (2020).

[42] Xiaohui Yan, Jiafeng Guo, and Xueqi Cheng. 2011. Context-aware query recommendation by learning high-order relation in query logs. In Proceedings of the 20th ACM Conference on Information and Knowledge Management. ACM, 2073–2076.

[43] Zhiquan Ye, Yuxia Geng, Jiaoyan Chen, Jingmin Chen, Xiaoxiao Xu, Suhang Zheng, Feng Wang, Jun Zhang, and Huajun Chen. 2020. Zero-shot Text Classification via Reinforced Self-training. In Proceedings of ACL. 3014–3024.

[44] Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. 2020. POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 8649–8670.

[45] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive Summarization as Text Matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6197–6208.