
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 242–253,

June 6–11, 2021. ©2021 Association for Computational Linguistics


Answering Product-Questions by Utilizing Questions from Other Contextually Similar Products

Ohad Rozen¹*, David Carmel², Avihai Mejer², Vitaly Mirkis², and Yftah Ziser³†

¹Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel
²Amazon   ³Facebook

[email protected], {dacarmel, amejer, vitamin}@amazon.com, [email protected]

Abstract

Predicting the answer to a product-related question is an emerging field of research that recently attracted a lot of attention. Answering subjective and opinion-based questions is most challenging, due to the dependency on customer-generated content. Previous works mostly focused on review-aware answer prediction; however, these approaches fail for new or unpopular products, having no (or only a few) reviews at hand. In this work, we propose a novel and complementary approach for predicting the answer for such questions, based on the answers for similar questions asked on similar products. We measure the contextual similarity between products based on the answers they provide for the same question. A mixture-of-experts framework is used to predict the answer by aggregating the answers from contextually similar products. Empirical results demonstrate that our model outperforms strong baselines on some segments of questions, namely those that have roughly ten or more similar resolved questions in the corpus. We additionally publish two large-scale datasets¹ used in this work: one of similar product-question pairs, and one of product question-answer pairs.

1 Introduction

Product-related Question Answering (PQA) is a popular and essential service provided by many e-commerce websites, letting consumers ask product-related questions that are answered by other consumers based on their experience. The large archive of accumulated resolved questions can be further utilized by customers to support their purchase journey, and by automatic product question answering tools (e.g., Jeon et al. (2005); Cui et al. (2017); Carmel et al. (2018)). However, there are many unanswered questions on these websites, either because a newly issued question has not attracted the community's attention yet, or for many other reasons (Park et al., 2015). This may frustrate e-commerce users, in particular when their purchase decision depends on the question's answer. Automatic PQA may assist the customers and the sellers by answering these unanswered questions, based on various diversified resources.

*Work carried out during an internship at Amazon. †Work carried out while working at Amazon.

¹The datasets are freely available at https://registry.opendata.aws under the names Amazon-PQSim and Amazon-PQA.

Previous PQA approaches leverage product specifications and description information (Cui et al., 2017; Lai et al., 2018; Gao et al., 2019), as well as customer reviews (Yu et al., 2012; McAuley and Yang, 2016; Yu and Lam, 2018; Das et al., 2019; Fan et al., 2019; Chen et al., 2019; Deng et al., 2020), for answering product-related questions. However, these two approaches have notable shortcomings. Product information can typically address questions about product features and functionality, but cannot address complex and subjective questions, such as opinion questions (Is it good for a 10 year old?), advice-seeking questions (What is the color that best fits my pink dress?), or unique usage questions (Can I play Fifa 2018 on this laptop?). Customer reviews, on the other hand, can partially address this kind of question (Wan and McAuley, 2016), yet there are many products with few or no reviews available, either because they are new on the site or because they are less popular.

We propose a novel and complementary approach for answering product-related questions based on a large corpus of PQA. Given an unanswered product question, we seek similar resolved questions² about similar products, and leverage their existing answers to predict the answer to the customer's question. We call our method SimBA (Similarity-Based Answer Prediction). For example, the answer to the question "Will these jeans shrink after a wash?", asked about a new pair of jeans on the website, may be predicted based on the answers to similar questions asked about other jeans that share properties such as fabric material, brand, or style. An example is shown in Table 1. The main hypothesis we explore in this work is whether the answer to a product question can be predicted based on the answers to similar questions about similar products, and how reliable this prediction is.

²We consider questions similar if they have the same semantic intent. For example, "can I wash this?", "Is the product washable?", and "Is it ok to clean it with water?" are all considered similar questions when asked in the context of a similar product.

As our method relies on the existing PQA corpus, it addresses the two mentioned shortcomings of the previous approaches. First, it can address a variety of product-related questions that are common in PQA, including subjective and usage questions. Second, our method can provide answers for new or less popular products, as it leverages an existing set of similar questions from other, similar products.

A key element of our proposed method is a novel concept that we refer to as Contextual Product Similarity, which determines whether two products are similar in the context of a specific question. For example, two smartwatches may be similar with regards to their texting capability, but different with regards to sleep monitoring. In Section 3, we formally define this concept and propose a prediction model for measuring contextual similarity between products with respect to a given question. Additionally, we describe an efficient method to train this model by leveraging an existing PQA corpus.

Another appealing property of SimBA is its ability to support the predicted answer by providing the list of highly similar questions upon which the answer was predicted, hence increasing users' confidence and enhancing user engagement.

Our main contributions are: (a) a novel PQA method that overcomes several shortcomings of previous methods; (b) the novel concept of Contextual Product Similarity, and an effective way to automatically collect annotations to train a model for it; (c) two published large-scale datasets, one of question similarity pairs and one of Amazon product questions and answers; details are provided in Section 4.

Empirical evaluation of our method demonstrates that it outperforms a strong baseline on some question segments, and that a hybrid model is effective on the vast majority of the questions.

Product: Dickies Men's Jeans, 100% Cotton
Q: Will these shrink after a wash?
Predicted answer: No

Similar Product 1: Eddie Bauer Men's Jeans, 100% Cotton
Q: Do these shrink when you wash and dry them?
A: No

Similar Product 2: Timberland PRO Men's Jeans, 99% Cotton, 1% Polyurethane
Q: Was there any shrinkage after washing?
A: No shrinkage

Similar Product 3: Levi's Men's Jeans, 98% Cotton, 2% Elastane
Q: Do these shrink at all during washing/drying?
A: They have not shrunk

Table 1: Answer prediction example based on similar questions asked about similar products. The answer for all contextually-similar products is 'no'; therefore, we predict the answer 'no' for the target question.

2 Related Work

Automatically answering product-related questions has become a permanent service provided by many e-commerce websites and services (Cui et al., 2017; Carmel et al., 2018). Questions are typically answered based on product details from the catalog, existing Q&As on the site, and customer reviews. Each of these resources used for answer generation has been studied extensively by the research community recently, probably due to the complexity of this task, the availability of appropriate datasets (McAuley, 2016), and the emergent increase in online shopping usage.

Lai et al. (2018) built a question answering system based on product facts and specifications. They trained the system by transfer learning from a large-scale Amazon dataset to the Home Depot domain. Gao et al. (2019) generated an answer from product attributes and reviews using an adversarial learning model composed of three components: a question-aware review representation module, a key-value attribute graph, and a seq2seq model for answer generation. Yu et al. (2012) answered opinion questions by exploiting the hierarchical organization of consumer reviews, where reviews are organized according to the product aspects.

The publication of Amazon datasets of reviews³ and Q&As (McAuley, 2016) triggered a flood of studies on review-aware answer prediction and generation. McAuley and Yang (2016) formulated the review-based question answering task as a mixture-of-experts framework: each review is an "expert" that votes on the answer to a yes/no question. Their model learns to identify 'relevant' reviews based on those that vote correctly. In a following work, Wan and McAuley (2016) observed that questions have multiple, often divergent, answers, and that the full spectrum of answers should be further utilized to train the answering system.

³https://nijianmo.github.io/amazon/index.html

Chen et al. (2019) described a multi-task attention mechanism which exploits large amounts of Q&As and a few manually labeled reviews for answer prediction. Fan et al. (2019) proposed a neural architecture, directly fed by the raw text of the question and reviews, that marks a review segment as the final answer, in a reading-comprehension fashion. Das et al. (2019) learned an adversarial network for inferring reviews which best answer a question or augment a given answer. Deng et al. (2020) incorporated opinion mining into review-based answer generation. Yu and Lam (2018) generated aspect-specific representations of questions and reviews for answer prediction for yes-no questions. Yu et al. (2018) used transfer learning from a resource-rich source domain to a resource-poor target domain, by simultaneously learning shared representations of questions and reviews in a unified framework over both domains.

All these works assume the existence of a rich set of product reviews to be used for question answering. This solution fails when no reviews are available. The challenge of generating a review for a given product by utilizing similar products' reviews was addressed by Park et al. (2015): for a given product, they extracted useful sentences from the reviews of other, similar products. Similarly, Pourgholamali (2016) mined relevant content for a product from various content resources available for similar products. Both works focused on the extraction of general, useful, product-related information, rather than answering a specific product question, as in our case. Moreover, the product-similarity methods they considered rely on product specifications and descriptions, and do not depend on the question to be answered, while our method considers the specific question at hand when estimating contextual product similarity.

3 Similarity-Based Answer Prediction

In this section, we introduce the Similarity-Based Answer-prediction (SimBA) method for predicting the answer to a product question, based on the answers to other, similar, product questions. We restrict our study to yes/no questions, due to their popularity in the PQA domain (54% in our PQA dataset) and following common practice in answer prediction studies (McAuley and Yang, 2016; Yu and Lam, 2018). Figure 1 presents our prediction framework and its main components.

Figure 1: Overview of the SimBA answer prediction framework. (1) K siblings of the product question are retrieved from the corpus by AKNN; (2) siblings are filtered by the Q2Q model, keeping only twins; (3) twins are scored by the CPS model; (4) a Mixture-of-Experts model uses these votes to predict the answer.

Formally, a question-product-answer tuple is denoted by r_j = (q_j, p_j, a_j), where a_j ∈ {'yes', 'no'}. C = {r_j}_{j=1}^N is the set of N tuples of a given product category. r_t = (q_t, p_t, ·)⁴ is the target record of an unanswered question q_t asked about product p_t. We treat C as the knowledge-base we use for answering q_t.

⁴The answer for the target record is unknown.

Given a target record r_t, in order to predict its answer a_t, we first retrieve a set of records from C with the most similar questions to q_t (Figure 1, stage 1). We denote the retrieved records as siblings of r_t. We then filter the siblings by applying a Question-to-Question similarity (Q2Q) model, keeping only records with highly similar questions, which are expected to have the same question intent as q_t (Figure 1, stage 2). We denote these records as twins of r_t. We then apply our Contextual Product Similarity (CPS) model to measure the contextual similarity between r_t and its twins (Figure 1, stage 3). The CPS similarity score is used to weight the twins, considering them as voters, and a mixture-of-experts model is applied over their answers for the final answer prediction (Figure 1, stage 4). More details about the model's components, the training processes, and other specifications are described in the following.
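To make the four-stage flow concrete, here is a minimal Python sketch of the pipeline. It is not the authors' code: the component functions are passed in as placeholders, the twin weighting of stage 3 is defined later in Section 3.4, and γ = 0.9 and w_min = 0.38 are the tuned values reported in Section 5.

```python
from typing import Callable, List, NamedTuple

class Record(NamedTuple):
    question: str
    product: str
    answer: str        # 'yes' or 'no'; empty for the target record

def simba_predict(r_t: Record,
                  retrieve_siblings: Callable[[Record, int], List[Record]],
                  q2q_score: Callable[[str, str], float],
                  cps_score: Callable[[Record, Record], float],
                  gamma: float = 0.9, w_min: float = 0.38, k: int = 500) -> str:
    # Stage 1: retrieve the K records with the most similar questions (siblings).
    siblings = retrieve_siblings(r_t, k)
    # Stage 2: keep only siblings whose question shares the intent of q_t (twins).
    twins = [r for r in siblings if q2q_score(r_t.question, r.question) > gamma]
    # Stage 3: weight each twin by its (squared, floored) contextual similarity.
    weights = [max(cps_score(r_t, r) ** 2, w_min) for r in twins]
    # Stage 4: mixture-of-experts vote; a positive sum means 'yes'.
    vote = sum(w * (1 if r.answer == 'yes' else -1)
               for w, r in zip(weights, twins))
    return 'yes' if vote > 0 else 'no'
```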

3.1 Sibling Retrieval

Given a target record r_t and a corpus of product-question-answer records C, our first goal is to retrieve all records with a question having the same intent as q_t. As C might be very large, applying a complex neural model to measure the similarity of each question in C to q_t is often infeasible. We therefore apply a two-step retrieval process. In a preliminary offline step, we index the records in C by creating embedding vectors for their questions, using a pre-trained encoder. For retrieval, done both during training and inference, we similarly embed the question q_t into a vector e_t. We then use a fast Approximate K Nearest Neighbors (AKNN) search to retrieve the K records with the most similar questions, based on the cosine similarity between e_t and the embedding vectors of the questions in C. We denote the set of retrieved siblings of r_t by S(r_t).

3.2 Twin Detection

The retrieved sibling records are those with the most similar questions to the target question. In the second step of the retrieval process, we enhance our record selection by applying a highly accurate transformer-based Question-to-Question (Q2Q) classifier (see Section 5.1), which we train over our question-to-question similarity dataset (Section 4.1). The Q2Q(q_t, q_k) classifier predicts the similarity between a target question q_t and each of the questions q_k in S(r_t). A record r_k is considered a twin of r_t if Q2Q(q_t, q_k) > γ, where 0.5 ≤ γ ≤ 1.0 is a hyper-parameter of the system. We denote the set of twins of r_t by T(r_t).

3.3 Contextual Product Similarity (CPS)

We consider products p_1 and p_2 to be contextually similar with respect to a yes/no question q if the answer to q is the same for both products.⁵ Given a pair of twin records (r_1, r_2), our CPS model aims to predict the contextual similarity between them, i.e., whether their (highly similar) questions have the same answer.

Since r_1 and r_2 are twins, their questions are expected to have the same intent, yet they might be phrased differently. To avoid losing any information, we provide both questions as input to the CPS model, during training and at inference time.

Figure 2: The Contextual Product Similarity (CPS) model. The target question-product pair (q_t, p_t) and the twin question-product pair (q_j, p_j) are encoded using a transformer encoder, while the questions attend to the product text. The texts of both products are coupled and also encoded, allowing the two product texts to attend to each other. The three output vectors are then concatenated and classified using an MLP classifier.

⁵By design, both products belong to the same product category C, which prevents comparing unrelated products. For example, comparing an airhorn and a computer fan in the context of the question "is it loud?" is therefore prevented.

CPS Model Architecture. Figure 2 depicts the CPS model for predicting the contextual similarity between a target record r_t and one of its twins, record r_j. For each record, the question-product pair is embedded using a pre-trained transformer encoder, letting the question text and the product textual content attend to each other:⁶

H_t = Encoder(q_t, p_t),   H_j = Encoder(q_j, p_j)

The two models share weights, to avoid over-fitting and for more efficient learning. A second encoder embeds the textual content of both products, encapsulating the similarity between them:

H_tj = Encoder(p_t, p_j)

Then, an MLP with one hidden layer takes the concatenation of the three embedding vectors to predict the probability of a_t = a_j:

ψ_tj = CPS(r_t, r_j) = P(a_t = a_j | r_t, r_j) = MLP(H_t ⊕ H_j ⊕ H_tj)    (1)
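A sketch of this architecture in PyTorch follows, under assumed details the paper does not specify: RoBERTa-base encoders, the [CLS] vector as each encoder's summary, the hidden size of the MLP, and a sigmoid output for the probability in Equation 1.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

class CPSModel(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        # One encoder for (question, product) pairs, shared between the
        # target and the twin; a second encoder for the product-pair text.
        self.qp_encoder = RobertaModel.from_pretrained("roberta-base")
        self.pp_encoder = RobertaModel.from_pretrained("roberta-base")
        # One-hidden-layer MLP over the concatenated summary vectors (Eq. 1).
        self.mlp = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, qp_t, qp_j, pp_tj):
        h_t = self.qp_encoder(**qp_t).last_hidden_state[:, 0]    # H_t
        h_j = self.qp_encoder(**qp_j).last_hidden_state[:, 0]    # H_j (shared)
        h_tj = self.pp_encoder(**pp_tj).last_hidden_state[:, 0]  # H_tj
        return torch.sigmoid(self.mlp(torch.cat([h_t, h_j, h_tj], dim=-1)))

# The tokenizer packs each question with its product text (and the two
# product texts together), joined by the model's separator token.
tok = RobertaTokenizerFast.from_pretrained("roberta-base")
qp_t = tok("Will these shrink?", "Dickies Men's Jeans; 100% Cotton",
           return_tensors="pt", truncation=True, max_length=256)
```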

Another key advantage of the CPS model is its ability to be trained at large scale without human annotations, by simply yielding the training labels directly from the polarity between the answers of twin pairs extracted from our training data. For any pair of twins (r_i, r_j):

label(r_i, r_j) = similar if a_i = a_j; different if a_i ≠ a_j    (2)

⁶The product textual content can be accumulated from several resources. In our experiments, we restrict the product content to its title and bullet points.
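Since the labels of Equation 2 come directly from the data, generating CPS training examples reduces to a few lines. A sketch, assuming records expose an `answer` field:

```python
def cps_label(r_i, r_j) -> int:
    # 1 = 'similar' (answers agree), 0 = 'different' (answers disagree)
    return int(r_i.answer == r_j.answer)

def build_cps_training_pairs(twin_pairs):
    # twin_pairs: iterable of (r_i, r_j) records with .answer in {'yes', 'no'}
    return [(r_i, r_j, cps_label(r_i, r_j)) for r_i, r_j in twin_pairs]
```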


Question 1 | Question 2 | Label
Can this be used with regular light bulbs? | Can i put a regular light bulb in this? | Similar
Can i use these labels in a laser printer? | Can this be used in laser printer? | Similar
Does the hat have an adjustable strap inside? | Does the hat adjust inside? | Similar
Can this organizer hold sleeved cards? | Can it hold cards with sleeves on them? | Similar
Does this phone have bluetooth? | Does the phone have gps? | Different
Can just two player play this game? | Whats the length of play of this game? | Different
Is there a diffuser for this dryer? | Can this go in the dryer? | Different
What material is the neck strap made of? | Does this come with a neck strap? | Different

Table 2: Examples from the Amazon-PQSim dataset. Each example consists of a user-generated question pair and a human-annotated label for their similarity.

3.4 Mixture of Experts

A mixture of experts is a widely-used method to combine the outputs of several classifiers by associating a weighted confidence score with each classifier (Jacobs et al., 1991). In our setting, experts are individual twins that lend support for or against a particular answer for a question. Each twin is weighted by its contextual similarity to the target record r_t, as predicted by the CPS model.

Given a target record r_t, the weight of each of its twins r_j ∈ T(r_t) is determined by:

λ(r_j) = max(ψ_tj², w_min)

where ψ_tj = CPS(r_t, r_j), and 0 ≤ w_min ≤ 0.5 is a lower weight-limit, a hyper-parameter that we tune on the development set.⁷

The predicted class of a_t is therefore derived by:

Pred(a_t | r_t) = sign( Σ_{r_j ∈ T(r_t)} λ(r_j) · δ(a_j) )    (3)

where a positive/negative Pred indicates 'yes'/'no', respectively, and δ(a) = +1 if a = 'yes', and −1 if a = 'no'.

Our methodology can be easily expanded to incorporate additional answer predictors (voters) of different types into SimBA. An example of such an expansion is described in Section 5.3.
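Read as code, the weight λ and Equation 3 amount to the following sketch; the `psi` attribute, holding each twin's CPS score, is an illustrative name, and w_min = 0.38 is the tuned value reported in Section 5.3.

```python
def delta(answer: str) -> int:
    # delta(a) = +1 for 'yes', -1 for 'no'
    return 1 if answer == 'yes' else -1

def moe_predict(twins, w_min: float = 0.38) -> str:
    # lambda(r_j) = max(psi_tj^2, w_min); the sign of the weighted sum
    # of twin votes gives the predicted answer (Equation 3).
    score = sum(max(t.psi ** 2, w_min) * delta(t.answer) for t in twins)
    return 'yes' if score > 0 else 'no'
```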

4 Datasets

We introduce two new datasets to experiment with our answer prediction approach: 1) the Amazon Product Question Similarity (Amazon-PQSim) dataset, which is used to train our Q2Q model; 2) the Amazon Product Question Answers (Amazon-PQA) dataset of product-related Q&As, used for training the SimBA model.

⁷We tried using the raw CPS score for all twins, i.e., w_min = 0; however, using a fine-tuned minimal weight yielded better results.

4.1 Amazon-PQSim Dataset

We collected a first-of-its-kind question-to-question similarity dataset of product-question pairs from the Amazon website (Amazon-PQSim; see Table 2 for examples). Unlike the Quora dataset of general question pairs⁸, product questions are asked in the context of a designated product page. This makes them unique and different from questions asked in other domains. For example, the question "Is it waterproof?", when it appears on the Fitbit Flyer detail page, should implicitly be interpreted as "Is the Fitbit Flyer waterproof?".

The following steps were taken for the data collection: (a) randomly sampling product questions from the Amazon website; (b) filtering out some of these questions (e.g., non-English questions; for more details see Appendix A); (c) for each of the remaining questions, retrieving up to three candidate similar questions from the collection, where a question is paired with the original question if the Jaccard similarity between them is in the range [0.3, 0.5] (sketched below); we ignore highly similar questions (> 0.5), since we don't want nearly verbatim pairs in our dataset, as well as dissimilar pairs (< 0.3); (d) finally, using the Appen crowd-sourcing platform⁹ for manual annotation of question-pair similarity.¹⁰ Each question pair was labeled by at least three judges, and by up to seven, until reaching an agreement of 70% or more.
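A sketch of the Jaccard band filter of step (c), over whitespace tokens (the exact tokenization is not specified in the paper):

```python
def jaccard(q1: str, q2: str) -> float:
    # Token-set overlap between the two questions.
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def keep_candidate_pair(q1: str, q2: str) -> bool:
    # Drop near-verbatim (> 0.5) and dissimilar (< 0.3) pairs.
    return 0.3 <= jaccard(q1, q2) <= 0.5
```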

The above steps resulted in a nearly balanced dataset (1.08 positive-negative ratio) of more than 180K product question pairs with judges' agreement of 70% or more; among them, about 90K question pairs have perfect judges' agreement (1.14 positive-negative ratio).

⁸https://www.kaggle.com/c/quora-question-pairs
⁹https://appen.com
¹⁰As the questions are asked in the context of a specific product, they are often written in an anaphoric form (e.g., "Is it waterproof?"). To keep our dataset general, we instructed the judges to treat such questions as if they included the actual related product name. For example, the pair "Is it waterproof?" and "Is this Fitbit waterproof?" was labeled as similar.

4.2 Amazon-PQA Dataset

We collected a large corpus of product questions and answers from the Amazon website, similar to the popular Amazon Q&A dataset (McAuley, 2016). Since our answer prediction method directly utilizes an existing corpus of resolved questions, we aim to collect all available questions per narrow sub-category, instead of a sample of questions across broad categories as in the popular Amazon Q&A dataset. For example, instead of sampling from the broad Electronics category, we collect all questions under the narrower Monitors and Receivers categories.

Raw Data Extraction. We collected all product questions with their answers from 100 sub-categories available on the Amazon website in August 2020. Overall, 10M questions were collected, with 20.7M answers, on 1.5M products. For full statistics of the raw data, see Table 7 in Appendix A.

Yes/No Question Classification. We followed He and Dai (2011) for detecting yes/no questions using simple heuristics; see Appendix A for details.

Yes/No Answer Labeling. Questions are typically answered by free-text answers, posted independently by multiple users. In order to convert these answers into a single yes/no answer, we first classified each answer into one of three classes, yes, no, and maybe, and then used a majority vote among the classified answers. We used a pre-trained RoBERTa-based classifier, trained on McAuley's dataset (McAuley, 2016), taking only yes/no questions. See Appendix A for details.

5 Experiments

We experiment with eleven product categories covered by our Amazon-PQA dataset (Section 4.2), training a SimBA answer prediction model for each of the categories independently. Next, we describe the data preparation steps for each of the SimBA components.

5.1 Data Preparation

Sibling Retrieval Using AKNN. For each record r ∈ C (where C is the category dataset), we use AKNN to retrieve the top-K similar siblings from C, while making sure that none of them shares a product with r. We collect training example pairs by coupling each record r with each of its siblings:

D′(C) = ⋃_{r_i ∈ C} { (r_i, r_j) | r_j ∈ S(r_i) }

Majority vote | Jaccard similarity | USE cosine | RoBERTa cosine | Q2Q
53.1 | 62.0 | 69.6 | 70.7 | 83.2

Table 3: Classification accuracy of question similarity models.

For retrieval, we use the Universal Sentence Encoder (USE) (Cer et al., 2018) to embed each question q_i into a 512-dimensional vector e_i. We use the Annoy¹¹ Python library for the implementation of efficient AKNN retrieval. In all experiments, for each record we retrieve the top-K (K = 500) similar records, based on the cosine similarity between the embedding vectors.
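A sketch of the offline index build and online lookup with Annoy; `embed_fn` stands in for the 512-dimensional USE encoder, and the number of trees is an assumed setting, not reported in the paper.

```python
from annoy import AnnoyIndex

def build_question_index(questions, embed_fn, n_trees=50):
    index = AnnoyIndex(512, "angular")   # angular distance ~ cosine similarity
    for i, q in enumerate(questions):
        index.add_item(i, embed_fn(q))
    index.build(n_trees)
    return index

def retrieve_sibling_ids(index, q_t, embed_fn, k=500):
    # Used both at training and inference time (Section 3.1).
    return index.get_nns_by_vector(embed_fn(q_t), k)
```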

Twin Detection Using the Q2Q Model. For each sibling pair (r_i, r_j) ∈ D′(C), we use our Q2Q model to score their question similarity, and keep only those with Q2Q(q_i, q_j) > γ, yielding a collection of twin pairs D(C). We use γ = 0.9 to ensure only highly similar question pairs.

For our Q2Q model, we apply a standard pre-trained RoBERTa (Liu et al., 2019) classifier. Specifically, we use the Hugging Face base-uncased pre-trained model¹² and fine-tune¹³ it for the classification task on our Q2Q dataset¹⁴, while splitting the data into train, dev, and test sets with an 80-10-10 partition, respectively. For γ = 0.5 (its minimal value), the model achieves a test accuracy of 83.2%, with a precision of 81.3% and a recall of 87.7%. When setting the twin confidence-level threshold to γ = 0.9, the precision of the Q2Q model rises to 89.9%, with a recall of 69.5%.
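For concreteness, a sketch of this fine-tuning setup with the Hugging Face Trainer, using the hyper-parameters of footnote 13; `train_dataset` and `dev_dataset` are assumed to hold question pairs already tokenized with `encode` (construction elided).

```python
from transformers import (RobertaForSequenceClassification,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                         num_labels=2)

def encode(q1: str, q2: str):
    # The two questions are packed into one sequence with RoBERTa's separator.
    return tokenizer(q1, q2, truncation=True, max_length=128)

args = TrainingArguments(output_dir="q2q", per_device_train_batch_size=32,
                         learning_rate=5e-5, num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset,  # assumed: encoded Q2Q pairs
                  eval_dataset=dev_dataset)
trainer.train()
```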

We compare the performance of the Q2Q similarity classifier with several unsupervised baselines, namely: (a) Jaccard similarity; (b) cosine similarity over USE embeddings; and (c) cosine similarity over RoBERTa¹⁵ embeddings. The results are summarized in Table 3, showing that the Q2Q model significantly outperforms these baselines.

¹¹https://github.com/spotify/annoy
¹²https://github.com/huggingface/transformers
¹³We use a batch size of 32, a maximum sequence length of 128, a learning rate of 5e-5, and 3 epochs.
¹⁴We only used the examples with full agreement.
¹⁵The Hugging Face sentence-transformers roberta-large-nli-stsb-mean-tokens model.


Category | Acc. | Majority | ∆
Light Bars | 73.9 | 61.1 | +12.8
Monitors | 78.2 | 68.2 | +9.9
Smartwatches | 80.0 | 65.6 | +14.4
Receivers | 77.5 | 67.6 | +9.9
Backpacks | 83.9 | 76.0 | +7.9
Jeans | 71.3 | 59.3 | +11.9
Beds | 84.6 | 72.0 | +12.6
Home Office Desks | 73.9 | 63.4 | +10.5
Masks | 75.1 | 66.9 | +8.2
Posters & Prints | 72.3 | 60.9 | +11.5
Accessories | 79.1 | 72.4 | +6.6
Macro Average | 77.2 | 66.7 | +9.7

Table 4: CPS model test set results on the CPS auxiliary datasets, and the majority baseline of each category.

5.2 CPS Model

Training. The CPS model predicts the contextual similarity between a pair of twin records. In our experiments, the textual content of a product consists of the product title concatenated with the product bullet points, separated by semicolons. The question text is the original query, as it appears in the Amazon-PQA dataset. For the encoding modules of the CPS model, we use a standard pre-trained RoBERTa-based model as well, using the [SEP] token to separate the two inputs to each encoder. For training, twin pairs are labeled according to their contextual similarity, using Equation 2.

We train, fine-tune, and test an independent CPS model for each category set C, using D(C), D_dev(C), and D_test(C) (details of the data split are described in Appendix A). The training set D(C) is created as described in Section 5.1. D_dev(C) and D_test(C) are created in the same way, with one modification: rather than retrieving the siblings for a record from the dataset it belongs to, the siblings are retrieved from D(C), for both D_dev(C) and D_test(C). This represents a real-world scenario, where existing products, with their related questions, are used as a corpus for predicting the answer to a question about a new product. Each product, with all its related questions, appears in only one of these sets.

Evaluation. We evaluate the CPS model by measuring the accuracy of its contextual similarity prediction over D_test(C). The accuracy per category is presented in Table 4. The model achieves a relatively high accuracy, with a macro average of 77.2% over all categories, presenting a significant lift of 9.7% over the majority-decision baseline. This is an encouraging result, considering the fact that the answers for many questions cannot be directly inferred from the product textual information. We conjecture that the model is able to learn the affinity between different products in the context of a given question for predicting their contextual similarity. For example, the two backpacks Ranvoo Laptop Backpack and Swiss Gear Bungee Backpack were correctly classified by the CPS model as similar (ψ ≥ 0.5) in the context of the question "Will this fit under a plane seat?", and classified as different (ψ < 0.5) in the context of the question "Does it have a separate laptop sleeve?".

5.3 Answer Prediction Methods

We experiment with our SimBA model and with a few baselines over the test set of all categories. The first one is Majority, which returns the majority answer among all records in the category. The other methods are described next.

SimBA. Given a target record r_t, SimBA scores each of its twins with the CPS model, and predicts the answer for q_t using Equation 3. w_min was fine-tuned on the combined dev set of all categories, and was set to 0.38.

Question Similarity Only (QSO). We modify the SimBA model to ignore the CPS classification score when implementing the mixture-of-experts model (Eq. 3), by setting an equal weight of 1.0 for all twin votes:

Pred(a_t | r_t) = sign( Σ_{r_j ∈ T(r_t)} δ(a_j) )

Product Similarity Only (PSO). We modify the SimBA model by setting q_t and q_j to empty strings at the input of the CPS model, both during training and during inference, forcing it to rely on the products' textual content alone. The twin retrieval process remains untouched.

Answer Prediction Classifier (APC). We experiment with a direct prediction approach that only considers the product textual content and the question for answer prediction. For each category C, we fine-tune a pre-trained RoBERTa-based classifier over all records r_j ∈ C, using q_j and p_j (separated by the [SEP] token) as input, and δ(a_j) as the training label.

SimBA+APC. The experimental results show that different answer-prediction methods (e.g., SimBA vs. APC) may be preferable for different product categories. Therefore, we combine both methods to achieve optimal results, by mixing the vote of APC with the twin votes using the mixture-of-experts approach:

Pred(a_t | r_t) = sign( η(r_t) · δ(α_t) + Σ_{r_j ∈ T(r_t)} λ(r_j) · δ(a_j) )

where α_t is the APC-predicted answer, and η(r_t) = η₁, η₂, and η₃ for |T(r_t)| ≤ 10, 10 < |T(r_t)| < 50, and |T(r_t)| ≥ 50, respectively.¹⁶ All η values (η > 0) are fine-tuned on the development set for each category separately; the values we used are detailed in Table 10 in Appendix A.

Question | # Twins | % | Answer
(Monitors) Does this require WiFi? | 51 | 91 | No
(Backpacks) Will it fit under a plane seat? | 213 | 90 | Yes
(Smartwatches) Can it measure blood sugar level? | 34 | 97 | No
(Smartwatches) Does it come with a charger? | 269 | 96 | Yes

Table 5: Examples of popular questions with high answer agreement over different products.

Figure 3: Macro-average test accuracy over all categories. The horizontal axis indicates the minimal number of twins, and the percentage of test questions each subset represents.
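A sketch of this combined vote; `psi` again names each twin's CPS score, and η₁-η₃ are the per-category values of Table 10.

```python
def simba_apc_predict(twins, apc_answer: str, eta1: float, eta2: float,
                      eta3: float, w_min: float = 0.38) -> str:
    delta = lambda a: 1 if a == 'yes' else -1
    # Pick the eta weight by the twin-count bucket of the target record.
    n = len(twins)
    eta = eta1 if n <= 10 else (eta2 if n < 50 else eta3)
    # APC contributes one weighted vote alongside the twin votes.
    score = eta * delta(apc_answer) + sum(
        max(t.psi ** 2, w_min) * delta(t.answer) for t in twins)
    return 'yes' if score > 0 else 'no'
```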

5.4 Answer Prediction Evaluation

The answer prediction accuracy results of all tested predictors, macro-averaged over D_test(C) of all categories, are presented in Figure 3. We inspect the performance of the methods on different subsets of the test data, where each subset is determined by all records having at least x twins, x ∈ [0, 130]. The horizontal axis indicates the minimal number of twins in the subset, and the percentage of the data each subset represents. For example, the results at x = 0 represent the entire test set, while the results at x = 10 represent the subset of questions with at least 10 twins, accounting for 40.2% of the test set.

The performance of Majority begins at 66% (the percentage of 'yes' questions in the entire population), and drops for questions with many twins. We hypothesize that obvious questions, for which the answer is the same across many products, are rarely asked, and hence have fewer twins. In contrast, informative questions, for which the answer varies across products, are frequently asked with respect to many products, and hence have many twins. Therefore, we see a drop in the accuracy of the Majority baseline as the number of twins grows.

¹⁶We also tried a few different splits on the development set.

The accuracy of QSO is significantly higher than the majority-vote baseline. This demonstrates an interesting phenomenon in the data: similar questions tend to have the same answer over a variety of products, typically of the same type. A few examples are presented in Table 5. The QSO method successfully detects these groups of questions and predicts the majority answer for each such group. We find that the PSO method generally doesn't improve over QSO. This is somewhat surprising, as we expected that using product-similarity information, such as brand, model, or key features, would increase the prediction accuracy. This demonstrates the importance of the question context, as used in SimBA, in addition to the product information alone.

Moving to SimBA, we can see a large performance improvement over the QSO and PSO methods, which we attribute directly to the CPS model. We also see a consistent improvement in accuracy with the number of twins, likely due to the larger support the model has for predicting the answer.

The APC method, despite its relative simplicity, performs very well, and greatly outperforms the majority-vote, QSO, and PSO baselines. For the segment of questions with fewer than 10 twins, APC outperforms the SimBA method. This segment represents roughly 60% of the questions. However, for the segment of questions with 60 or more twins, which accounts for 13.6% of the questions, the SimBA method consistently outperforms the inductive baseline by 1-2%. When inspecting the results by category, as shown in Table 6, we can see that, considering all questions with at least 1 twin, the APC method dominates in 7 out of the 11 categories, while for questions with at least 60 twins, the SimBA method dominates in 6 out of the 11 categories.

Finally, we see that the two approaches complement each other and can be effectively joined, as the SimBA+APC method outperforms both of them over all subsets.


Category | Questions with 1+ Twins: % of data | SimBA | APC | SimBA+APC | Questions with 60+ Twins: % of data | SimBA | APC | SimBA+APC
Light bars | 62.5 | 75.3 | 74.8 | 77.7 | 9.0 | 74.1 | 68.1 | 75.0
Monitors | 79.2 | 76.0 | 76.4 | 76.5 | 27.0 | 78.5 | 78.5 | 78.5
Smartwatches | 88.3 | 77.3 | 79.3 | 79.0 | 31.9 | 79.9 | 81.7 | 80.8
Receivers | 57.5 | 70.1 | 70.3 | 72.0 | 4.8 | 83.2 | 77.9 | 83.2
Backpacks | 74.7 | 80.7 | 82.7 | 82.3 | 21.5 | 82.7 | 83.2 | 82.9
Jeans | 63.3 | 67.4 | 65.9 | 67.4 | 13.4 | 74.8 | 70.9 | 75.7
Beds | 70.4 | 77.7 | 76.4 | 77.9 | 16.8 | 82.5 | 80.4 | 82.5
Home & office desks | 65.0 | 71.8 | 76.2 | 75.8 | 4.7 | 80.0 | 84.4 | 84.4
Masks | 76.5 | 70.9 | 74.2 | 73.0 | 4.2 | 80.4 | 82.6 | 80.4
Posters & prints | 60.1 | 73.4 | 73.4 | 74.4 | 9.3 | 75.9 | 69.6 | 75.9
Accessories | 71.7 | 78.1 | 79.0 | 79.2 | 7.2 | 82.3 | 81.6 | 82.7
Macro Average | 69.9 | 74.4 | 75.3 | 75.9 | 13.6 | 79.5 | 78.1 | 80.2

Table 6: Answer prediction accuracy by category. Left: accuracy over records with at least one twin, representing 69.9% of the records on average. Right: accuracy over records with at least 60 twins, representing 13.6% of the records. The highest accuracy between SimBA and APC is in bold.

6 Conclusions

We presented SimBA, a novel answer prediction approach in the PQA domain, which directly leverages similar questions answered with respect to other products. Our empirical evaluation shows that on some segments of questions, namely those with roughly ten or more similar questions in the corpus, our method can outperform a strong inductive method that directly utilizes the question and the textual product content. We further show that the two approaches are complementary, and can be integrated to increase the overall answer prediction accuracy.

For future work, we plan to explore how SimBA can be extended and applied beyond yes-no questions, e.g., to questions with numerical answers or open-ended questions. Another interesting research direction is combining additional voters into the Mixture-of-Experts model, such as a review-aware answer predictor or a product-details-based predictor. Additionally, our current evaluation considered a static view of the answered product-question corpus; we plan to explore temporal aspects of our method, for example, considering a question's age, or ignoring answers for obsolete products that might be irrelevant.

Acknowledgments

Ohad Rozen would like to express his gratitude to Yanai Elazar, Vered Shwartz, and Ido Dagan for providing him valuable advice while he was conducting this research during his internship at Amazon.

References

David Carmel, Liane Lewin-Eytan, and Yoelle Maarek. 2018. Product question answering using customer generated content - research challenges. SIGIR '18, pages 1349–1350. Association for Computing Machinery.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Long Chen, Ziyu Guan, Wei Zhao, Wanqing Zhao, Xiaopeng Wang, Zhou Zhao, and Huan Sun. 2019. Answer identification from product reviews for user questions by multi-task attentive networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 45–52.

Lei Cui, Shaohan Huang, Furu Wei, Chuanqi Tan, Chaoqun Duan, and Ming Zhou. 2017. SuperAgent: A customer service chatbot for e-commerce websites. In Proceedings of ACL 2017, System Demonstrations, pages 97–102.

Manirupa Das, Zhen Wang, Evan Jaffe, Madhuja Chattopadhyay, Eric Fosler-Lussier, and Rajiv Ramnath. 2019. Learning to answer subjective, specific product-related queries using customer reviews by adversarial domain adaptation. CoRR, abs/1910.08270.

Yang Deng, Wenxuan Zhang, and Wai Lam. 2020. Opinion-aware answer generation for review-driven question answering in e-commerce. In Proceedings of CIKM 2020, pages 255–264. Association for Computing Machinery.

Miao Fan, Chao Feng, Mingming Sun, Ping Li, and Haifeng Wang. 2019. Reading customer reviews to answer product-related questions. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 567–575. SIAM.

Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-aware answer generation in e-commerce question-answering. In Proceedings of WSDM 2019, pages 429–437. ACM.

Jing He and Decheng Dai. 2011. Summarization of yes/no questions using a feature function model. Volume 20 of Proceedings of Machine Learning Research, pages 351–366, South Garden Hotels and Resorts, Taoyuan, Taiwan. PMLR.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Comput., 3(1):79–87.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of CIKM 2005, pages 84–90.

T. M. Lai, T. Bui, N. Lipka, and S. Li. 2018. Supervised transfer learning for product information question answering. In Proceedings of ICMLA 2018, pages 1109–1114.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Julian McAuley. 2016. Amazon question/answer data. https://jmcauley.ucsd.edu/data/amazon/qa/.

Julian J. McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of WWW 2016, pages 625–635. ACM.

Dae Hoon Park, Hyun Duk Kim, ChengXiang Zhai, and Lifan Guo. 2015. Retrieval of relevant opinion sentences for new products. In Proceedings of SIGIR 2015, pages 393–402. ACM.

Fatemeh Pourgholamali. 2016. Mining information for the cold-item problem. In Proceedings of RecSys 2016, pages 451–454. ACM.

Mengting Wan and Julian McAuley. 2016. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Proceedings of ICDM 2016, pages 489–498. IEEE.

Jianfei Yu, Minghui Qiu, Jing Jiang, Jun Huang, Shuangyong Song, Wei Chu, and Haiqing Chen. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In Proceedings of WSDM 2018, pages 682–690. ACM.

Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua. 2012. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In Proceedings of EMNLP 2012, pages 391–401. ACL.

Qian Yu and Wai Lam. 2018. Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of WSDM 2018, pages 691–699. ACM.


A Supplemental Material

A.1 Amazon-PQSim Dataset

The Amazon-PQSim dataset includes question pairs, where all questions are published on the Amazon website. Each pair has a corresponding label: 1 for similar, else 0. The labels were collected via the Appen crowd-sourcing service. We took the following filtering steps (step (b) in Section 4.1) for each question (a code sketch follows the list):

• Removed any question with fewer than five words.

• Removed any question with more than 15 words.

• Removed any non-English question.

• Removed any question with multiple question marks (which may indicate multiple questions).

• Removed questions with rare words (any word which is not among the top 2,000 most frequent words).
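A sketch of these filters, assuming `top_words` is the precomputed set of the 2,000 most frequent words and `is_english` is a language-identification check (both placeholders):

```python
def keep_question(q: str, top_words: set, is_english) -> bool:
    tokens = q.lower().replace("?", " ").split()
    return (5 <= len(tokens) <= 15                     # length bounds
            and q.count("?") <= 1                      # a single question only
            and all(t in top_words for t in tokens)    # no rare words
            and is_english(q))
```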

A.2 Amazon-PQA Dataset

The Amazon-PQA dataset includes questions and their answers that are published on the Amazon website, along with the public product information and category (Amazon Browse Node name). The data includes the following fields:

• `question_id`

• `asin_id`

• `question_text`

• `answer_text`

• `bullet_points`

• `product_description`

• `brand_name`

• `item_name`

In addition, we augment this data with fields derived from our current work:

• `is_yes-no_question`

• `yes-no_answer` (yes / no / maybe)

Yes/No Question Classification. We followed He and Dai (2011) for detecting yes/no questions, using simple heuristics such as checking whether the question starts with a be verb (am, is, are, been, being, was, were), a modal verb (can, could, shall, should, will, would, may, might), or an auxiliary verb (do, did, does, have, had, has), and additionally ends with a question mark. We tested the classifier on McAuley's dataset (McAuley, 2016); it identified yes/no questions with 98.4% precision at 96.5% recall, while considering McAuley as ground truth.¹⁷
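A sketch of this heuristic; the verb lists are taken verbatim from the text above, and whitespace tokenization is an assumption.

```python
LEADING_VERBS = {
    "am", "is", "are", "been", "being", "was", "were",                   # be
    "can", "could", "shall", "should", "will", "would", "may", "might",  # modal
    "do", "did", "does", "have", "had", "has",                           # auxiliary
}

def is_yes_no_question(question: str) -> bool:
    # Yes/no if the question opens with a be/modal/auxiliary verb
    # and ends with a question mark.
    q = question.strip()
    tokens = q.lower().split()
    return bool(tokens) and tokens[0] in LEADING_VERBS and q.endswith("?")
```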

Yes/No Answer Labeling. As described in Section A.2, we used the McAuley dataset (McAuley, 2016) to train a RoBERTa-based classifier, taking only yes/no questions. For testing, we used five annotators to annotate 583 question-answer pairs randomly sampled from our raw data. The model achieved 97% and 88% precision for the yes and no labels, respectively, and a recall of 65% and 51%, on the entire test set.¹⁸

Next, to determine each question's final yes/no answer, we first omitted answers classified as maybe. When a question is answered by a verified seller, we considered that answer the most reliable, and used it as the final label. Otherwise, we used the majority vote among the remaining answers. In our experiments, we ignore questions with an equal number of yes and no answers.
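A sketch of this aggregation rule; the input format (an answer label plus a verified-seller flag) is an illustrative assumption.

```python
def aggregate_answers(answers):
    # answers: list of (label, is_verified_seller), label in {'yes','no','maybe'}
    kept = [(lbl, seller) for lbl, seller in answers if lbl != "maybe"]
    for lbl, seller in kept:
        if seller:
            return lbl                  # a verified-seller answer wins
    yes = sum(1 for lbl, _ in kept if lbl == "yes")
    no = len(kept) - yes
    if yes == no:
        return None                     # ties are ignored in the experiments
    return "yes" if yes > no else "no"
```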

Dataset Split. Each item in our dataset is a (product, question, answer) triplet. We split the labeled triplets into train (80%), dev (10%), and test (10%) sets for each category, with respect to the number of products. Each product, with all its related questions, appears in only one of these sets. The statistics for this dataset are given in Table 8.

A.3 CPS Model Details

The CPS model has a total of 254.6M parameters. For all incorporated RoBERTa models, we use a maximum sequence length of 256, a dropout of 0.1, and a batch size of 32 for training. We applied different learning rates and numbers of epochs for each product category; the specific values we used, after tuning, are shown in Table 9.

¹⁷McAuley and Yang reported identifying yes/no questions with 97% precision at 82% recall on their dataset.

¹⁸McAuley and Yang reported 98% accuracy after keeping only the 50% of instances about which their classifier was the most confident.


Category | Products | Questions | Y/N Questions | Answers | Q. Answered Yes | Q. Answered No
Light Bars | 6,151 | 48,736 | 23,956 | 95,853 | 10,146 | 5,243
Monitors | 6,246 | 170,529 | 86,837 | 316,126 | 35,353 | 22,947
Smartwatches | 8,611 | 166,400 | 94,936 | 289,945 | 41,683 | 22,033
Receivers | 3,442 | 58,618 | 33,511 | 135,700 | 14,488 | 7,364
Backpacks | 12,096 | 68,598 | 38,914 | 138,996 | 19,902 | 6,090
Jeans | 38,008 | 61,908 | 17,518 | 129,346 | 7,708 | 5,548
Beds | 17,202 | 108,723 | 46,722 | 238,786 | 17,757 | 13,917
Home Office Desks | 6,986 | 55,303 | 23,202 | 112,958 | 9,523 | 5,971
Masks | 13,384 | 51,295 | 24,989 | 100,789 | 9,757 | 5,759
Posters & Prints | 33,803 | 53,939 | 20,737 | 99,926 | 8,171 | 5,450
Accessories | 38,825 | 238,603 | 159,272 | 438,447 | 60,990 | 23,772
Rest of 89 Categories | 1,288,754 | 8,906,362 | 4,833,639 | 18,565,933 | 2,219,022 | 1,055,816
Total | 1,473,508 | 9,989,014 | 5,404,233 | 20,662,805 | 2,454,500 | 1,179,910

Table 7: Statistics of the Amazon-PQA dataset, extracted from Amazon.com. Y/N Questions: as detected by our Yes/No Question detector. Answers: the total number of answers, before any filtering. Q. Answered Yes (No): the number of yes/no questions with answers labeled as Yes (No). (See Section 4.2.)

Category | Train P | Train Q | Train %Yes | Dev P | Dev Q | Dev %Yes | Test P | Test Q | Test %Yes | Total P | Total Q | Total %Yes
Light bars | 2,552 | 8,675 | 68.1 | 319 | 1,080 | 68.7 | 319 | 1,296 | 69.2 | 3,190 | 11,051 | 68.3
Monitors | 3,421 | 29,886 | 63.3 | 427 | 3,890 | 64.7 | 427 | 4,260 | 63.0 | 4,275 | 38,036 | 63.4
Smartwatches | 4,128 | 34,734 | 68.5 | 516 | 3,730 | 66.4 | 516 | 3,778 | 67.8 | 5,160 | 42,242 | 68.3
Receivers | 1,725 | 11,991 | 69.2 | 215 | 1,827 | 68.1 | 215 | 2,356 | 65.7 | 2,155 | 16,174 | 68.5
Backpacks | 4,834 | 14,740 | 78.4 | 604 | 1,397 | 75.9 | 604 | 1,908 | 77.3 | 6,042 | 18,045 | 78.0
Jeans | 5,365 | 6,056 | 61.3 | 670 | 773 | 59.8 | 670 | 769 | 58.1 | 6,705 | 7,598 | 60.8
Beds | 5,912 | 16,792 | 59.1 | 739 | 2,017 | 58.3 | 739 | 2,276 | 58.3 | 7,390 | 21,085 | 58.9
Home Office Desks | 2,576 | 8,637 | 62.7 | 322 | 1,059 | 64.3 | 322 | 962 | 62.9 | 3,220 | 10,658 | 62.9
Masks | 4,332 | 8,541 | 64.8 | 541 | 1,180 | 64.0 | 541 | 1,099 | 63.1 | 5,414 | 10,820 | 64.5
Posters & Prints | 5,739 | 7,226 | 62.7 | 717 | 868 | 62.4 | 717 | 850 | 66.0 | 7,173 | 8,944 | 63.0
Accessories | 14,422 | 54,125 | 73.5 | 1,802 | 6,038 | 73.7 | 1,802 | 6,706 | 74.5 | 18,026 | 66,869 | 73.6
Total | 55,006 | 201,403 | 66.5* | 6,872 | 23,859 | 66.0* | 6,872 | 26,260 | 66.0* | 68,750 | 251,522 | 66.4*

Table 8: Statistics of the yes/no-question subset of the Amazon-PQA dataset, and the train/dev/test splits used in our experiments. Only the categories used in our experiments are displayed. *Macro average.

Category | Epochs | Learning Rate
Light Bars | 3 | 3.0e-5
Monitors | 4 | 3.0e-5
Smartwatches | 3 | 3.0e-5
Receivers | 4 | 3.0e-5
Backpacks | 4 | 3.0e-5
Jeans | 3 | 2.0e-5
Beds | 4 | 4.0e-5
Home Office Desks | 3 | 2.0e-5
Masks | 3 | 3.0e-5
Posters & Prints | 3 | 2.0e-5
Accessories | 3 | 2.0e-5

Table 9: Number of epochs and learning rates used for training the CPS model (Section 5.2) on each category.

Category | η₁ | η₂ | η₃
Light Bars | 3 | 8 | 2
Monitors | 6 | 2 | 63
Smartwatches | 2 | 11 | 49
Receivers | 2 | 0 | 0
Backpacks | 1 | 4 | 17
Jeans | 7 | 8 | 22
Beds | 1 | 0 | 0
Home Office Desks | 4 | 1 | 38
Masks | 4 | 6 | 3
Posters & Prints | 5 | 0 | 18
Accessories | 1 | 2 | 16

Table 10: η₁, η₂, and η₃ values used after fine-tuning on our development set (Section 5.3). Larger η values give more weight to the APC vote.

Page 2: Answering Product-Questions by Utilizing Questions from ...

243

(Similarity Based Answer Prediction) For exam-ple the answer for the question ldquoWill these jeansshrink after a washrdquo asked about a new pair ofjeans on the website may be predicted based onthe answers for similar questions asked about otherjeans that share properties such as fabric materialbrand or style An example is shown in Table 1The main hypothesis we explore in this work iswhether the answer to a product question can bepredicted based on the answers for similar ques-tions about similar products and how reliable thisprediction is

As our method relies on the existing PQA corpusit addresses the two mentioned shortcomings of theprevious approaches First it can address a vari-ety of product-related questions that are commonin PQA including subjective and usage questionsSecond our method can provide answers to new orless popular products as it leverages an existing setof similar questions from other similar products

A key element of our proposed method is a novelconcept that we refer to as Contextual Product Sim-ilarity which determines whether two products aresimilar in the context of a specific question Forexample two smart-watches may be similar withregards to their texting capability but different withregards to sleep monitoring In Section 3 we for-mally define this concept and propose a predictionmodel for measuring contextual similarity betweenproducts with respect to a given question Addi-tionally we describe an efficient method to trainthis model by leveraging an existing PQA corpus

Another appealing property of SimBA is its abil-ity to support the predicted answer by providingthe list of highly similar questions upon which theanswer was predicted hence increasing usersrsquo con-fidence and enhancing user engagement

Our main contributions are (a) A novel PQAmethod that overcomes several shortcomings ofprevious methods (b) A novel concept of Contex-tual Product Similarity and an effective way to au-tomatically collect annotations to train this model(c) Finally publishing two large scale datasets oneis a question similarity data set and the second is alarge-scale Amazon product questions and answersdataset details are provided in Section 4

Empirical evaluation of our method demon-strates that it outperforms a strong baseline in somequestion segments and that a hybrid model is ef-fective in all the vast majority of the questions

Product Dickies Menrsquos Jeans 100 CottonQ Will these shrink after a washPredicted answer No

Similar Product 1 Eddie Bauer Menrsquos Jeans 100Cotton

Q Do these shrink when you wash and dry themA No

Similar Product 2 Timberland PRO Menrsquos Jean 99Cotton 1 Polyurethane

Q Was there any shrinkage after washingA No shrinkage

Similar Product 3 Levirsquos Menrsquos Jeans 98 Cotton2 Elastane

Q Do these shrink at all during washingdryingA They have not shrunk

Table 1 Answer prediction example based on similar ques-tions asked about similar products The answer for allcontextually-similar products is lsquonorsquo therefore we predict theanswer lsquonorsquo for the target question

2 Related Work

Automatic aswering product related questions hasbecome a permanent service provided by many e-commerce websites and services (Cui et al 2017Carmel et al 2018) Questions are typically an-swered based on product details from the cata-log existing QampArsquos on the site and customer re-views Each of these resources used for answergeneration has been studied extensively by the re-search community recently probably due to thecomplexity of this task the availability of appro-priate datasets (McAuley 2016) and the emergentincrease in on-line shopping usage

Lai et al (2018) built a question answering sys-tem based on product facts and specifications Theytrained a question answering system by transferlearning from a large-scale Amazon dataset to theHome Depot domain Gao et al (2019) generatedan answer from product attributes and reviews us-ing adversarial learning model which is composedof three components a question-aware review rep-resentation module a key-value attribute graphand a seq2seq model for answer generation Yuet al (2012) answered opinion questions by ex-ploiting hierarchical organization of consumer re-views where reviews were organized according tothe product aspects

The publication of Amazon datasets of reviews3

and QampAs (McAuley 2016) triggered a flood ofstudies on review-aware answer prediction and gen-eration McAuley and Yang (2016) formulated thereview based question answering task as a mixture-of-experts framework mdash each review is an ldquoexpertrdquo

3httpsnijianmogithubioamazonindexhtml

244

that votes on the answer to a yesno question Theirmodel learns to identify lsquorelevantrsquo reviews basedon those that vote correctly In a following workWan and McAuley (2016) observed that questionshave multiple often divergent answers and thefull spectrum of answers should be further utilizedto train the answering system

Chen et al (2019) described a multi-task atten-tion mechanism which exploits large amounts ofQampAs and a few manually labeled reviews foranswer prediction Fan et al (2019) proposed aneural architecture directly fed by the raw text ofthe question and reviews to mark review segmentas the final answer in a reading comprehensionfashion Das et al (2019) learned an adversarialnetwork for inferring reviews which best answer aquestion or augment a given answer Deng et al(2020) incorporated opinion mining into the review-based answer generation Yu and Lam (2018) gen-erated aspect-specific representation for questionsand reviews for answer prediction for yes-no ques-tions Yu et al (2018) used transfer learning froma resource-rich source domain to a resource-poortarget domain by simultaneously learning sharedrepresentations of questions and reviews in a uni-fied framework of both domains

All this line of works assume the existence ofrich set of product reviews to be used for questionanswering This solution fails when no reviews areavailable The challenge of review generation fora given product while utilizing similar productsrsquoreviews was addressed by Park et al (2015) For agiven product they extracted useful sentences fromthe reviews of other similar products Similarly(Pourgholamali 2016) mined relevant content fora product from various content resources availablefor similar products Both works focused on the ex-traction of general useful product related informa-tion rather than answering a specific product ques-tion as in our case Second the product-similaritymethods they considered rely on product specifi-cations and description and do not depend on thequestion to be answered while our method con-siders a specific question at hand when estimatingcontextual product similarity

3 Similarity-Based Answer Prediction

In this section, we introduce the Similarity-Based Answer-prediction (SimBA) method for predicting the answer for a product question, based on the answers for other, similar product questions. We restrict our study to yes/no questions only, due to their popularity in the PQA domain (54% of our PQA dataset) and following common practice in answer prediction studies (McAuley and Yang, 2016; Yu and Lam, 2018). Figure 1 presents our prediction framework and its main components.

Figure 1: Overview of the SimBA answer prediction framework. (1) K siblings of the product question are retrieved from the corpus by AKNN. (2) Siblings are filtered by the Q2Q model, keeping only twins. (3) Twins are scored by the CPS model. (4) A Mixture-of-Experts model uses these votes to predict the answer.

Formally, a question-product-answer tuple is denoted by $r_j = (q_j, p_j, a_j)$, where $a_j \in \{\text{'yes'}, \text{'no'}\}$. $C = \{r_j\}_{j=1}^{N}$ is the set of $N$ tuples of a given product category. $r_t = (q_t, p_t, \cdot)$⁴ is the target record of an unanswered question $q_t$, asked about product $p_t$. We treat $C$ as the knowledge base we use for answering $q_t$.

⁴ The answer for the target record is unknown.

Given a target record $r_t$, in order to predict its answer $a_t$, we first retrieve a set of records from $C$ with the most similar questions to $q_t$ (Figure 1, stage 1). We denote the retrieved records as siblings of $r_t$. We then filter the siblings by applying a Question-to-Question similarity (Q2Q) model, keeping only records with highly similar questions, which are expected to have the same question intent as $q_t$ (Figure 1, stage 2). We denote these records as twins of $r_t$. We then apply our Contextual Product Similarity (CPS) model to measure the contextual similarity between $r_t$ and its twins (Figure 1, stage 3). The CPS similarity score is used to weight the twins, considering them as voters, applying a mixture-of-experts model over their answers for the final answer prediction (Figure 1, stage 4). More details about the model's components, the training processes, and other specifications are described in the following.
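The four stages can be summarized in a short sketch. This is a minimal illustration of the pipeline logic, not the released implementation; `retrieve_siblings`, `q2q`, and `cps` are assumed callables standing in for the components described below, and the record fields (`.q`, `.a`) are hypothetical.

```python
def simba_predict(r_t, corpus, retrieve_siblings, q2q, cps,
                  gamma=0.9, w_min=0.38):
    """Minimal sketch of the four SimBA stages for a target record r_t."""
    siblings = retrieve_siblings(r_t, corpus)                 # stage 1: AKNN search
    twins = [r for r in siblings if q2q(r_t.q, r.q) > gamma]  # stage 2: Q2Q filter
    score = 0.0
    for r_j in twins:
        psi = cps(r_t, r_j)                                   # stage 3: CPS similarity
        weight = max(psi ** 2, w_min)                         # twin weight (Section 3.4)
        score += weight * (1.0 if r_j.a == 'yes' else -1.0)   # stage 4: weighted vote
    return 'yes' if score > 0 else 'no'
```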

3.1 Sibling Retrieval

Given a target record $r_t$ and a corpus of product-question-answer records $C$, our first goal is to retrieve all records with a question having the same intent as $q_t$. As $C$ might be very large, applying a complex neural model to measure the similarity of each question in $C$ to $q_t$ is often infeasible. We therefore apply a two-step retrieval process. In a preliminary offline step, we index the records in $C$ by creating embedding vectors for their questions, using a pre-trained encoder. For retrieval, done both during training and inference, we similarly embed the question $q_t$ into a vector $e_t$. We then use a fast Approximate K Nearest Neighbors (AKNN) search to retrieve the $K$ records with the most similar questions, based on the cosine similarity between $e_t$ and the embedding vectors of the questions in $C$. We denote the set of retrieved siblings of $r_t$ by $S(r_t)$.

3.2 Twin Detection

The retrieved sibling records are those with the most similar questions to the target question. In the second step of the retrieval process, we enhance our record selection by applying a highly accurate transformer-based Question-to-Question (Q2Q) classifier (see Section 5.1), which we train over our question-to-question similarity dataset (Section 4.1). The $Q2Q(q_t, q_k)$ classifier predicts the similarity between a target question $q_t$ and each of the questions $q_k$ in $S(r_t)$. A record $r_k$ is considered a twin of $r_t$ if $Q2Q(q_t, q_k) > \gamma$, where $0.5 \le \gamma \le 1.0$ is a hyper-parameter of the system. We denote the set of twins of $r_t$ by $T(r_t)$.

3.3 Contextual Product Similarity (CPS)

We consider products $p_1$ and $p_2$ to be contextually similar with respect to a yes/no question $q$ if the answer to $q$ on both products is the same.⁵ Given a pair of twin records $(r_1, r_2)$, our CPS model aims to predict the contextual similarity between them, i.e., whether their (highly similar) questions have the same answer.

⁵ By design, both products belong to the same product category $C$, which prevents comparing unrelated products. For example, comparing an airhorn and a computer fan in the context of the question "is it loud?" is therefore prevented.

Since $r_1$ and $r_2$ are twins, their questions are expected to have the same intent, yet they might be phrased differently. To avoid losing any information, we provide both questions as input to the CPS model, during training and during inference time.

Figure 2: The Contextual Product Similarity (CPS) model. The target question-product pair $(q_t, p_t)$ and the twin question-product pair $(q_j, p_j)$ are encoded using a transformer encoder, while the questions attend the product text. The texts of both products are coupled and also encoded, allowing the two product texts to attend each other. The three output vectors are then concatenated and classified using an MLP classifier.

CPS Model Architecture. Figure 2 depicts the CPS model for predicting the contextual similarity between a target record $r_t$ and one of its twins, record $r_j$. For each record, the question-product pair is embedded using a pre-trained transformer encoder, allowing the product textual content and the question text to attend each other:⁶

$$H_t = \text{Encoder}(q_t, p_t), \qquad H_j = \text{Encoder}(q_j, p_j)$$

The two models share weights, to avoid over-fitting and for more efficient learning. A second encoder embeds the textual content of both products, encapsulating the similarity between them:

$$H_{tj} = \text{Encoder}(p_t, p_j)$$

Then, a one-hidden-layer MLP takes the concatenation of the three embedding vectors to predict the probability of $a_t = a_j$:

$$\psi_{tj} = CPS(r_t, r_j) = P(a_t = a_j \mid r_t, r_j) = \text{MLP}(H_t \oplus H_j \oplus H_{tj}) \qquad (1)$$
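A minimal PyTorch sketch of this architecture follows. It assumes `roberta-base` encoders with a hidden size of 768; the exact MLP dimensions are not reported in the paper and are assumptions here.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class CPSModel(nn.Module):
    """Sketch of the CPS architecture: a shared question-product encoder,
    a product-product encoder, and an MLP over the three pooled vectors."""
    def __init__(self, hidden=768):
        super().__init__()
        self.qp_encoder = RobertaModel.from_pretrained('roberta-base')  # shared: encodes (q_t, p_t) and (q_j, p_j)
        self.pp_encoder = RobertaModel.from_pretrained('roberta-base')  # encodes (p_t, p_j)
        self.mlp = nn.Sequential(nn.Linear(3 * hidden, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))  # one hidden layer, as in Eq. 1

    def forward(self, qt_pt, qj_pj, pt_pj):
        # each argument is a dict of tokenized text-pair inputs
        h_t = self.qp_encoder(**qt_pt).pooler_output
        h_j = self.qp_encoder(**qj_pj).pooler_output   # same weights as h_t
        h_tj = self.pp_encoder(**pt_pj).pooler_output
        psi = torch.sigmoid(self.mlp(torch.cat([h_t, h_j, h_tj], dim=-1)))
        return psi  # estimate of P(a_t = a_j | r_t, r_j)
```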

Another key advantage of the CPS model is its ability to be trained at large scale without human annotations, by deriving the training labels directly from the polarity between the answers of twin pairs extracted from our training data. For any pair of twins $(r_i, r_j)$:

$$\text{label}(r_i, r_j) = \begin{cases} \text{similar} & a_i = a_j \\ \text{different} & a_i \neq a_j \end{cases} \qquad (2)$$

⁶ The product textual content can be accumulated from several resources. In our experiments, we restrict the product content to its title and bullet points.


Question 1 | Question 2 | Label
Can this be used with regular light bulbs? | Can i put a regular light bulb in this? | Similar
Can i use these labels in a laser printer? | Can this be used in laser printer? | Similar
Does the hat have an adjustable strap inside? | Does the hat adjust inside? | Similar
Can this organizer hold sleeved cards? | Can it hold cards with sleeves on them? | Similar
Does this phone have bluetooth? | Does the phone have gps? | Different
Can just two player play this game? | Whats the length of play of this game? | Different
Is there a diffuser for this dryer? | Can this go in the dryer? | Different
What material is the neck strap made of? | Does this come with a neck strap? | Different

Table 2: Examples from the Amazon-PQSim dataset. Each example consists of a user-generated question pair and a human-annotated label for their similarity.

3.4 Mixture of Experts

A mixture of experts is a widely-used method to combine the outputs of several classifiers by associating a weighted confidence score with each classifier (Jacobs et al., 1991). In our setting, experts are individual twins that lend support for or against a particular answer for a question. Each twin is weighted by its contextual similarity to the target record $r_t$, as predicted by the CPS model.

Given a target record $r_t$, the weight of each of its twins $r_j \in T(r_t)$ is determined by

$$\lambda(r_j) = \max(\psi_{tj}^2, w_{min})$$

where $\psi_{tj} = CPS(r_t, r_j)$, and $0 \le w_{min} \le 0.5$ is a lower weight-limit, a hyper-parameter that we tune on the development set.⁷

The predicted class of $a_t$ is therefore derived by

$$\text{Pred}(a_t \mid r_t) = \text{sign}\Big(\sum_{r_j \in T(r_t)} \lambda(r_j)\,\delta(a_j)\Big) \qquad (3)$$

where a positive/negative Pred indicates 'yes'/'no', respectively, and

$$\delta(a) = \begin{cases} +1 & a = \text{'yes'} \\ -1 & a = \text{'no'} \end{cases}$$

Our methodology can be easily expanded to incorporate more answer predictors (voters) of different types into SimBA. An example of such an expansion is described in Section 5.3.

4 Datasets

We introduce two new datasets to experiment with our answer prediction approach: 1) the Amazon Product Question Similarity (Amazon-PQSim) dataset, which is used to train our Q2Q model; 2) the Amazon Product Question Answers (Amazon-PQA) dataset of product-related Q&As, used for training the SimBA model.

⁷ We tried using the CPS raw score for all twins, i.e., $w_{min} = 0$; however, using a fine-tuned minimal weight yielded better results.

4.1 Amazon-PQSim Dataset

We collected a first-of-its-kind question-to-question similarity dataset of product-question pairs from the Amazon website (Amazon-PQSim; see Table 2 for examples). Unlike the Quora dataset of general question pairs⁸, product questions are asked in the context of a designated product page. This makes them unique and different from questions asked in other domains. For example, the question "Is it waterproof?", when appearing on the Fitbit Flyer detail page, should implicitly be interpreted as "Is Fitbit Flyer waterproof?".

The following steps were taken for the data collection: (a) randomly sampling product questions from the Amazon website; (b) filtering out some of these questions (e.g., non-English questions; for more details see Appendix A); (c) for each of the remaining questions, retrieving up to three candidate similar questions from the collection, where a question is paired with the original question if the Jaccard similarity between them is in the range [0.3, 0.5]. We ignore highly similar questions (> 0.5), since we don't want nearly verbatim pairs in our dataset, as well as dissimilar pairs (< 0.3). (d) Finally, we used the Appen crowd-sourcing platform⁹ for manual annotation of question-pair similarity.¹⁰ Each question pair was labeled by at least three judges, and by up to seven, until reaching agreement of 70% or more.
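A sketch of the Jaccard filter used in step (c) is shown below; the whitespace tokenization and lowercasing are assumptions, as the paper does not specify them.

```python
def jaccard(q1: str, q2: str) -> float:
    """Token-set Jaccard similarity between two questions."""
    s1, s2 = set(q1.lower().split()), set(q2.lower().split())
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def keep_pair(q1: str, q2: str, lo: float = 0.3, hi: float = 0.5) -> bool:
    """Keep only mid-similarity pairs: not near-verbatim, not dissimilar."""
    return lo <= jaccard(q1, q2) <= hi
```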

The above steps resulted in a nearly balanced dataset (1.08 positive-negative ratio) of more than 180K product question pairs with judge agreement of 70% or more; among them, about 90K question pairs have perfect judge agreement (1.14 positive-negative ratio).

⁸ https://www.kaggle.com/c/quora-question-pairs
⁹ https://appen.com
¹⁰ As the questions are asked in the context of a specific product, they are often written in an anaphoric form (e.g., "Is it waterproof?"). To keep our dataset general, we instructed the judges to treat such questions as if they included the actual related product name. For example, the pair "Is it waterproof?" and "Is this Fitbit waterproof?" were labeled as similar.

4.2 Amazon-PQA Dataset

We collected a large corpus of product questions and answers from the Amazon website, similar to the popular Amazon Q&A dataset (McAuley, 2016). Since our answer prediction method directly utilizes an existing corpus of resolved questions, we aim to collect all available questions per narrow sub-category, instead of a sample of questions across broad categories as in the popular Amazon Q&A dataset. For example, instead of sampling from the broad Electronics category, we collect all questions under the narrower Monitors and Receivers categories.

Raw Data Extraction. We collected all product questions, with their answers, from 100 sub-categories available on the Amazon website in August 2020. Overall, 10M questions were collected, with 20.7M answers, on 1.5M products. For full statistics of the raw data, see Table 7 in Appendix A.

Yes/No Question Classification. We followed He and Dai (2011) for detecting yes/no questions, using simple heuristics. See Appendix A for details.

Yes/No Answer Labeling. Questions are typically answered by free-text answers, posted independently by multiple users. In order to convert these answers into a single yes/no answer, we first classified each answer into one of three classes (yes, no, and maybe), and then used a majority vote among the classified answers. We used a pre-trained RoBERTa-based classifier, and trained the model on McAuley's dataset (McAuley, 2016), taking only yes/no questions. See Appendix A for details.

5 Experiments

We experiment with eleven product categories covered by our Amazon-PQA dataset (Section 4.2), training a SimBA answer prediction model for each of the categories independently. Next, we describe the data preparation steps for each of the SimBA components.

5.1 Data Preparation

Sibling Retrieval Using AKNN. For each record $r \in C$ (where $C$ is the category dataset), we use AKNN to retrieve the top-$K$ similar siblings from $C$, while making sure that none of them shares the same product with $r$. We collect training example pairs by coupling each record $r$ with each of its siblings:

$$D'(C) = \bigcup_{r_i \in C} \{(r_i, r_j) \mid r_j \in S(r_i)\}$$

Majority vote | Jaccard similarity | USE cosine | RoBERTa cosine | Q2Q
53.1 | 62.0 | 69.6 | 70.7 | 83.2

Table 3: Classification accuracy of question similarity models.

For retrieval, we use the Universal Sentence Encoder (USE) (Cer et al., 2018) to embed each question $q_i$ into a 512-dimensional vector $e_i$. We use the Annoy¹¹ Python library for an efficient AKNN retrieval implementation. In all experiments, for each record we retrieve the top-$K$ ($K = 500$) similar records, based on the cosine similarity between the embedding vectors.
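A sketch of the indexing and retrieval steps follows, assuming the TF-Hub USE module; the number of Annoy trees is an assumed setting, as the paper does not report it.

```python
import tensorflow_hub as hub
from annoy import AnnoyIndex

use = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')

def build_question_index(questions, n_trees=100):
    """Embed all corpus questions with USE and index them for AKNN search."""
    index = AnnoyIndex(512, 'angular')  # angular distance corresponds to cosine similarity
    for i, vec in enumerate(use(questions).numpy()):
        index.add_item(i, vec)
    index.build(n_trees)                # n_trees is an assumption
    return index

# sibling retrieval: the K=500 nearest questions to a target question q_t
# sibling_ids = index.get_nns_by_vector(use([q_t]).numpy()[0], 500)
```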

Twin Detection Using the Q2Q Model. For each sibling pair $(r_i, r_j) \in D'(C)$, we use our Q2Q model to score their question similarity, and keep only those with $Q2Q(q_i, q_j) > \gamma$, to yield a collection of twin pairs $D(C)$. We use $\gamma = 0.9$ to ensure only highly similar question pairs.

For our Q2Q model, we apply a standard pre-trained RoBERTa (Liu et al., 2019) classifier. Specifically, we use the Hugging Face base-uncased pre-trained model¹² and fine-tune¹³ it for the classification task on our Q2Q dataset¹⁴, while splitting the data into train, dev, and test sets with an 80-10-10 partition, respectively. For $\gamma = 0.5$ (its minimal value), the model achieves a test accuracy of 83.2%, with a precision of 81.3% and a recall of 87.7%. When setting the twin confidence-level threshold to $\gamma = 0.9$, the precision of the Q2Q model rises to 89.9%, with a recall of 69.5%.
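A sketch of this fine-tuning setup with the transformers library, using the hyper-parameters from footnote 13; the checkpoint name and the dataset field names are assumptions.

```python
from transformers import (RobertaForSequenceClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

def encode(example):
    # a question pair is encoded jointly, truncated to 128 tokens (footnote 13)
    return tokenizer(example['question1'], example['question2'],
                     truncation=True, max_length=128, padding='max_length')

args = TrainingArguments(output_dir='q2q-model',
                         per_device_train_batch_size=32,
                         learning_rate=5e-5,
                         num_train_epochs=3)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()
```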

We compare the performance of the Q2Q similarity classifier with several unsupervised baselines, namely: (a) Jaccard similarity; (b) cosine similarity over USE embeddings; and (c) cosine similarity over RoBERTa¹⁵ embeddings. The results are summarized in Table 3, showing that the Q2Q model significantly outperforms these baselines.

¹¹ https://github.com/spotify/annoy
¹² https://github.com/huggingface/transformers
¹³ We use batch size 32, a maximum sequence length of 128, learning rate 5e-5, and 3 epochs.
¹⁴ We only used the examples with full agreement.
¹⁵ Hugging Face sentence-transformers roberta-large-nli-stsb-mean-tokens model.


Category | Acc. | Majority | Δ
Light Bars | 73.9 | 61.1 | +12.8
Monitors | 78.2 | 68.2 | +9.9
Smartwatches | 80.0 | 65.6 | +14.4
Receivers | 77.5 | 67.6 | +9.9
Backpacks | 83.9 | 76.0 | +7.9
Jeans | 71.3 | 59.3 | +11.9
Beds | 84.6 | 72.0 | +12.6
Home Office Desks | 73.9 | 63.4 | +10.5
Masks | 75.1 | 66.9 | +8.2
Posters & Prints | 72.3 | 60.9 | +11.5
Accessories | 79.1 | 72.4 | +6.6
Macro Average | 77.2 | 66.7 | +9.7

Table 4: CPS model test-set accuracy (%) on the CPS auxiliary datasets, and the majority baseline of each category.

5.2 CPS Model

Training. The CPS model predicts the contextual similarity between a pair of twin records. In our experiments, the textual content of a product consists of the product title concatenated with the product bullet points, separated by semicolons. The question text is the original query as it appears in the Amazon-PQA dataset. For the encoding modules of the CPS model, we use a standard pre-trained RoBERTa-based model as well, using the $[SEP]$ token to separate the two inputs to each encoder. For training, twin pairs are labeled according to their contextual similarity, using Equation 2.
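To make the input format concrete, a small sketch of how the encoder inputs could be assembled is shown below; the helper name is ours, and the 256-token limit comes from Appendix A.3.

```python
def product_text(title: str, bullet_points: list) -> str:
    """Concatenate the product title and bullet points, separated by semicolons."""
    return '; '.join([title] + list(bullet_points))

# each encoder input is a (question, product-text) pair; the tokenizer inserts
# the separator tokens between the two segments
# tokenizer(q_t, product_text(title_t, bullets_t), truncation=True, max_length=256)
```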

We train, fine-tune, and test an independent CPS model for each category set $C$, using $D(C)$, $D_{dev}(C)$, and $D_{test}(C)$ (details of the data split are described in Appendix A). The training set $D(C)$ is created as described in Section 5.1. $D_{dev}(C)$ and $D_{test}(C)$ are created the same way, with one modification: rather than retrieving the siblings for a record from the dataset it belongs to, the siblings are retrieved from $D(C)$, for both $D_{dev}(C)$ and $D_{test}(C)$. This represents a real-world scenario where existing products, with their related questions, are used as a corpus for predicting the answer to a question about a new product. Each product, with all its related questions, appears in only one of these sets.

Evaluation. We evaluate the CPS model by measuring the accuracy of its contextual similarity prediction over $D_{test}(C)$. The accuracy per category is presented in Table 4. The model achieves a relatively high accuracy, with a macro average of 77.2% over all categories, presenting a significant lift of 9.7% over the majority-decision baseline. This is an encouraging result, considering the fact that the answers for many questions cannot be directly inferred from the product textual information. We conjecture that the model is able to learn the affinity between different products, in the context of a given question, for predicting their contextual similarity. For example, the two backpacks Ranvoo Laptop Backpack and Swiss Gear Bungee Backpack were correctly classified by the CPS model as similar ($\psi \ge 0.5$) in the context of the question "Will this fit under a plane seat?", and classified as different ($\psi < 0.5$) in the context of the question "Does it have a separate laptop sleeve?".

5.3 Answer Prediction Methods

We experiment with our SimBA model and with a few baselines over the test set of all categories. The first one is Majority, which returns the majority answer among all records in the category. Other methods are described next.

SimBA. Given a target record $r_t$, SimBA scores each of its twins by the CPS model and predicts the answer for $q_t$ using Equation 3. $w_{min}$ was fine-tuned on the combined dev set of all categories, and was set to 0.38.

Question Similarity Only (QSO). We modify the SimBA model to ignore the CPS classification score when implementing the Mixture-of-Experts model (Eq. 3), by setting an equal weight of 1.0 for all twin votes: $\text{Pred}(a_t \mid r_t) = \text{sign}\big(\sum_{r_j \in T(r_t)} \delta(a_j)\big)$.

Product Similarity Only (PSO). We modify the SimBA model by setting $q_t$ and $q_j$ to empty strings at the input of the CPS model, both during training and during inference, forcing it to rely on the products' textual content alone. The twin retrieval process remains untouched.

Answer Prediction Classifier (APC). We experiment with a direct prediction approach that only considers the product textual content and the question for answer prediction. For each category $C$, we fine-tune a pre-trained RoBERTa-based classifier over all records $r_j \in C$, using $q_j$ and $p_j$ (separated by the $[SEP]$ token) as input, and $\delta(a_j)$ as the training label.

Question | # Twins | Agreement | Answer
(Monitors) Does this require WiFi? | 51 | 91% | No
(Backpacks) Will it fit under a plane seat? | 213 | 90% | Yes
(Smartwatches) Can it measure blood sugar level? | 34 | 97% | No
(Smartwatches) Does it come with a charger? | 269 | 96% | Yes

Table 5: Examples of popular questions with high answer agreement over different products.

Figure 3: Macro-average test accuracy over all categories. The horizontal axis indicates the minimal number of twins and the percentage of test questions each subset represents.

SimBA+APC. The experimental results show that different answer-prediction methods (e.g., SimBA vs. APC) may be preferable for different product categories. Therefore, we combine both methods, for achieving optimal results, by mixing the vote of APC with the twin votes, using the Mixture-of-Experts approach:

$$\text{Pred}(a_t \mid r_t) = \text{sign}\Big(\eta(r_t)\,\delta(\alpha_t) + \sum_{r_j \in T(r_t)} \lambda(r_j)\,\delta(a_j)\Big)$$

where $\alpha_t$ is the APC-predicted answer, and $\eta(r_t) = \eta_1$, $\eta_2$, and $\eta_3$ for $|T(r_t)| \le 10$, $10 < |T(r_t)| < 50$, and $|T(r_t)| \ge 50$, respectively.¹⁶ All $\eta$ values ($\eta > 0$) are fine-tuned on the development set for each category separately. The values we used are detailed in Table 10 in Appendix A.
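A sketch of this combined vote follows; `apc` and `cps` are assumed callables, and the η defaults are placeholders (the tuned per-category values are in Table 10).

```python
def simba_apc_predict(r_t, twins, cps, apc, etas=(1.0, 1.0, 1.0), w_min=0.38):
    """Mixture-of-Experts vote combining APC with the weighted twin votes."""
    n = len(twins)
    eta = etas[0] if n <= 10 else (etas[1] if n < 50 else etas[2])
    delta = lambda a: 1.0 if a == 'yes' else -1.0
    score = eta * delta(apc(r_t))                       # APC expert vote
    for r_j in twins:                                   # twin expert votes
        score += max(cps(r_t, r_j) ** 2, w_min) * delta(r_j.a)
    return 'yes' if score > 0 else 'no'
```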

5.4 Answer Prediction Evaluation

The answer prediction accuracy results of all tested predictors, macro-averaged over $D_{test}(C)$ of all categories, are presented in Figure 3. We inspect the performance of the methods on different subsets of the test data, where each subset is determined by all records having at least $x$ twins, $x \in [0, 130]$. The horizontal axis indicates the minimal number of twins in the subset, and the percentage of the data each subset represents. For example, the results at $x = 0$ represent the entire test set, while the results at $x = 10$ represent the subset of questions with at least 10 twins, accounting for 40.2% of the test set.

The performance of Majority begins at 66% (the percentage of 'yes' questions in the entire population) and drops for questions with many twins. We hypothesize that obvious questions, for which the answer is the same across many products, are rarely asked, hence have fewer twins. In contrast, informative questions, for which the answer varies across products, are frequently asked with respect to many products, hence have many twins. Therefore, we see a drop in accuracy of the Majority baseline as the number of twins grows.

¹⁶ We also tried a few different splits on the development set.

The accuracy of QSO is significantly higher than the majority-vote baseline. This demonstrates an interesting phenomenon in the data: similar questions tend to have the same answer over a variety of products, typically of the same type. A few examples are presented in Table 5. The QSO method successfully detects these groups of questions and predicts the majority answer for each such group. We find that the PSO method generally doesn't improve over QSO. This is somewhat surprising, as we expected that using product-similarity information, such as brand, model, or key features, would increase the prediction accuracy. This demonstrates the importance of the question context, as used in SimBA, in addition to the product information alone.

Moving to SimBA, we can see a large performance improvement over the QSO and PSO methods, which we attribute directly to the CPS model. We also see a consistent improvement in accuracy with the number of twins, likely due to the larger support the model has for predicting the answer.

The APC method, despite its relative simplicity, performs very well, and greatly outperforms the majority-vote, QSO, and PSO baselines. For the segment of questions with fewer than 10 twins, APC outperforms the SimBA method. This segment represents roughly 60% of the questions. However, for the segment of questions with 60 or more twins, which accounts for 13.6% of the questions, the SimBA method consistently outperforms the inductive baseline by 1-2%. When inspecting the results by category, as shown in Table 6, we can see that, considering all questions with at least 1 twin, the APC method dominates in 7 out of the 11 categories, while for questions with at least 60 twins, the SimBA method dominates in 6 out of the 11 categories.

Finally, we see that the two approaches complement each other and can be effectively joined, as the SimBA+APC method outperforms both of them over all subsets.


 | Questions with 1+ Twins | | | | Questions with 60+ Twins | | |
Category | % of data | SimBA | APC | SimBA+APC | % of data | SimBA | APC | SimBA+APC
Light bars | 62.5 | 75.3 | 74.8 | 77.7 | 9.0 | 74.1 | 68.1 | 75.0
Monitors | 79.2 | 76.0 | 76.4 | 76.5 | 27.0 | 78.5 | 78.5 | 78.5
Smartwatches | 88.3 | 77.3 | 79.3 | 79.0 | 31.9 | 79.9 | 81.7 | 80.8
Receivers | 57.5 | 70.1 | 70.3 | 72.0 | 4.8 | 83.2 | 77.9 | 83.2
Backpacks | 74.7 | 80.7 | 82.7 | 82.3 | 21.5 | 82.7 | 83.2 | 82.9
Jeans | 63.3 | 67.4 | 65.9 | 67.4 | 13.4 | 74.8 | 70.9 | 75.7
Beds | 70.4 | 77.7 | 76.4 | 77.9 | 16.8 | 82.5 | 80.4 | 82.5
Home & office desks | 65.0 | 71.8 | 76.2 | 75.8 | 4.7 | 80.0 | 84.4 | 84.4
Masks | 76.5 | 70.9 | 74.2 | 73.0 | 4.2 | 80.4 | 82.6 | 80.4
Posters & prints | 60.1 | 73.4 | 73.4 | 74.4 | 9.3 | 75.9 | 69.6 | 75.9
Accessories | 71.7 | 78.1 | 79.0 | 79.2 | 7.2 | 82.3 | 81.6 | 82.7
Macro Average | 69.9 | 74.4 | 75.3 | 75.9 | 13.6 | 79.5 | 78.1 | 80.2

Table 6: Answer prediction accuracy by category. Left: accuracy over records with at least one twin, representing 69.9% of the records on average. Right: accuracy over records with at least 60 twins, representing 13.6% of the records. The highest accuracy between SimBA and APC is in bold.

6 Conclusions

We presented SimBA, a novel answer prediction approach in the PQA domain, which directly leverages similar questions answered with respect to other products. Our empirical evaluation shows that on some segments of questions, namely those with roughly ten or more similar questions in the corpus, our method can outperform a strong inductive method that directly utilizes the question and the textual product content. We further show that the two approaches are complementary and can be integrated to increase the overall answer prediction accuracy.

For future work, we plan to explore how SimBA can be extended and applied beyond yes-no questions, e.g., for questions with numerical answers or open-ended questions. Another interesting research direction is combining additional voters into the Mixture-of-Experts model, such as a review-aware answer predictor or a product-details-based predictor. Additionally, our current evaluation considered a static view of the answered product-question corpus; we plan to explore temporal aspects of our method, for example, considering a question's age, or ignoring answers of obsolete products that might be irrelevant.

Acknowledgments

Ohad Rozen would like to express his gratitude to Yanai Elazar, Vered Shwartz, and Ido Dagan for providing him valuable advice while he was conducting this research during his internship at Amazon.

References

David Carmel, Liane Lewin-Eytan, and Yoelle Maarek. 2018. Product question answering using customer generated content - research challenges. SIGIR '18, page 1349–1350. Association for Computing Machinery.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Long Chen, Ziyu Guan, Wei Zhao, Wanqing Zhao, Xiaopeng Wang, Zhou Zhao, and Huan Sun. 2019. Answer identification from product reviews for user questions by multi-task attentive networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 45–52.

Lei Cui, Shaohan Huang, Furu Wei, Chuanqi Tan, Chaoqun Duan, and Ming Zhou. 2017. SuperAgent: A customer service chatbot for e-commerce websites. In Proceedings of ACL 2017, System Demonstrations, pages 97–102.

Manirupa Das, Zhen Wang, Evan Jaffe, Madhuja Chattopadhyay, Eric Fosler-Lussier, and Rajiv Ramnath. 2019. Learning to answer subjective, specific product-related queries using customer reviews by adversarial domain adaptation. CoRR, abs/1910.08270.

Yang Deng, Wenxuan Zhang, and Wai Lam. 2020. Opinion-aware answer generation for review-driven question answering in e-commerce. In Proceedings of CIKM 2020, page 255–264. Association for Computing Machinery.

Miao Fan, Chao Feng, Mingming Sun, Ping Li, and Haifeng Wang. 2019. Reading customer reviews to answer product-related questions. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 567–575. SIAM.

Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-aware answer generation in e-commerce question-answering. In Proceedings of WSDM 2019, pages 429–437. ACM.

Jing He and Decheng Dai. 2011. Summarization of yes/no questions using a feature function model. Volume 20 of Proceedings of Machine Learning Research, pages 351–366, South Garden Hotels and Resorts, Taoyuan, Taiwan. PMLR.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Comput., 3(1):79–87.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of CIKM 2005, pages 84–90.

T. M. Lai, T. Bui, N. Lipka, and S. Li. 2018. Supervised transfer learning for product information question answering. In Proceedings of ICMLA 2018, pages 1109–1114.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Julian McAuley. 2016. Amazon question/answer data. https://jmcauley.ucsd.edu/data/amazon/qa/

Julian J. McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of WWW 2016, pages 625–635. ACM.

Dae Hoon Park, Hyun Duk Kim, ChengXiang Zhai, and Lifan Guo. 2015. Retrieval of relevant opinion sentences for new products. In Proceedings of SIGIR 2015, pages 393–402. ACM.

Fatemeh Pourgholamali. 2016. Mining information for the cold-item problem. In Proceedings of RecSys 2016, page 451–454. ACM.

Mengting Wan and Julian McAuley. 2016. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Proceedings of ICDM 2016, pages 489–498. IEEE.

Jianfei Yu, Minghui Qiu, Jing Jiang, Jun Huang, Shuangyong Song, Wei Chu, and Haiqing Chen. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In Proceedings of WSDM 2018, page 682–690. ACM.

Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua. 2012. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In Proceedings of EMNLP 2012, pages 391–401. ACL.

Qian Yu and Wai Lam. 2018. Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of WSDM 2018, pages 691–699. ACM.

A Supplemental Material

A.1 Amazon-PQSim Dataset

The Amazon-PQSim dataset includes question pairs, where all questions are published on the Amazon website. Each pair has a corresponding label: 1 for similar, else 0. The labels were collected via the Appen crowd-sourcing service. We took the following filtering steps (step (b) in Section 4.1) for each question:

• Removed any question with fewer than five words.

• Removed any question with more than 15 words.

• Removed any non-English question.

• Removed any question with multiple question marks (which may indicate multiple questions).

• Removed questions with rare words (any word which is not in the top 2,000 most frequent words).

A.2 Amazon-PQA Dataset

The Amazon-PQA dataset includes questions and their answers that are published on the Amazon website, along with the public product information and category (Amazon Browse Node name). The data includes the following fields:

• `question_id`

• `asin_id`

• `question_text`

• `answer_text`

• `bullet_points`

• `product_description`

• `brand_name`

• `item_name`

In addition, we augment this data with fields derived from our current work:

• `is_yes-no_question`

• `yes-no_answer` (yes, no, maybe)

Yes/No Question Classification. We followed He and Dai (2011) for detecting yes/no questions, using simple heuristics, such as checking whether the question starts with a 'be' verb (am, is, are, been, being, was, were), a modal verb (can, could, shall, should, will, would, may, might), or an auxiliary verb (do, did, does, have, had, has), and additionally ends with a question mark. We tested the classifier on McAuley's dataset (McAuley, 2016), identifying yes/no questions with 98.4% precision at 96.5% recall, while considering McAuley as ground truth.¹⁷
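The heuristic can be sketched as follows (the verb lists are copied from the description above; the tokenization details are assumptions):

```python
BE = {'am', 'is', 'are', 'been', 'being', 'was', 'were'}
MODAL = {'can', 'could', 'shall', 'should', 'will', 'would', 'may', 'might'}
AUX = {'do', 'did', 'does', 'have', 'had', 'has'}

def is_yes_no_question(text: str) -> bool:
    """Heuristic yes/no question detector following He and Dai (2011)."""
    text = text.strip()
    if not text.endswith('?'):
        return False
    tokens = text.lower().split()
    return bool(tokens) and tokens[0] in (BE | MODAL | AUX)
```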

Yes/No Answer Labeling. As described in Section A.2, we used the McAuley dataset (McAuley, 2016) to train a RoBERTa-based classifier, taking only yes/no questions. For testing, we used 5 annotators to annotate 583 question-answer pairs, randomly sampled from our raw data. The model achieved 97% and 88% precision for the yes and no labels, respectively, and a recall of 65% and 51% on the entire test set.¹⁸

Next, to determine each question's final yes/no answer, we first omitted answers classified as maybe. When a question is answered by a verified seller, we consider that answer most reliable and use it as the final label. Otherwise, we use the majority vote among the remaining answers. In our experiments, we ignore questions with an equal number of yes and no answers.
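A sketch of this aggregation rule; the input format is hypothetical.

```python
from collections import Counter

def final_answer(answers):
    """Aggregate per-answer classifications into one yes/no label.
    `answers` is a list of (label, is_verified_seller) tuples,
    with label in {'yes', 'no', 'maybe'}."""
    answers = [(l, v) for l, v in answers if l != 'maybe']  # drop 'maybe' answers
    for label, verified in answers:
        if verified:
            return label                                    # verified seller is most reliable
    counts = Counter(label for label, _ in answers)
    if counts['yes'] == counts['no']:
        return None                                         # ties are ignored
    return 'yes' if counts['yes'] > counts['no'] else 'no'
```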

Dataset Split. Each item in our dataset is a (product, question, answer) triplet. We split the labeled triplets into train (80%), dev (10%), and test (10%) sets for each category, relative to the number of products. Each product, with all its related questions, appears in only one of these sets. The statistics for this dataset are given in Table 8.
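A sketch of a product-level split that keeps all questions of a product in a single set; the field name `asin_id` follows the schema in Section A.2, and the shuffling details are assumptions.

```python
import random

def split_by_product(records, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (product, question, answer) triplets by product, 80/10/10."""
    products = sorted({r['asin_id'] for r in records})
    random.Random(seed).shuffle(products)
    n = len(products)
    test = set(products[:int(n * test_frac)])
    dev = set(products[int(n * test_frac):int(n * (test_frac + dev_frac))])
    splits = {'train': [], 'dev': [], 'test': []}
    for r in records:
        key = ('test' if r['asin_id'] in test
               else 'dev' if r['asin_id'] in dev else 'train')
        splits[key].append(r)
    return splits
```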

A.3 CPS Model Details

The CPS model has a total of 254.6M parameters. For all incorporated RoBERTa models, we use a maximum sequence length of 256, a dropout of 0.1, and a batch size of 32 for training. We applied different learning rates and numbers of epochs for each product category; the specific values we used after tuning are shown in Table 9.

¹⁷ McAuley and Yang reported identifying yes/no questions with 97% precision at 82% recall on their dataset.

¹⁸ McAuley and Yang reported 98% accuracy after keeping only the 50% of instances about which their classifier was the most confident.


Category | Products | Questions | Y/N Questions | Answers | Q. Answered Yes | Q. Answered No
Light Bars | 6,151 | 48,736 | 23,956 | 95,853 | 10,146 | 5,243
Monitors | 6,246 | 170,529 | 86,837 | 316,126 | 35,353 | 22,947
Smartwatches | 8,611 | 166,400 | 94,936 | 289,945 | 41,683 | 22,033
Receivers | 3,442 | 58,618 | 33,511 | 135,700 | 14,488 | 7,364
Backpacks | 12,096 | 68,598 | 38,914 | 138,996 | 19,902 | 6,090
Jeans | 38,008 | 61,908 | 17,518 | 129,346 | 7,708 | 5,548
Beds | 17,202 | 108,723 | 46,722 | 238,786 | 17,757 | 13,917
Home Office Desks | 6,986 | 55,303 | 23,202 | 112,958 | 9,523 | 5,971
Masks | 13,384 | 51,295 | 24,989 | 100,789 | 9,757 | 5,759
Posters & Prints | 33,803 | 53,939 | 20,737 | 99,926 | 8,171 | 5,450
Accessories | 38,825 | 238,603 | 159,272 | 438,447 | 60,990 | 23,772
Rest of 89 Categories | 1,288,754 | 8,906,362 | 4,833,639 | 18,565,933 | 2,219,022 | 1,055,816
Total | 1,473,508 | 9,989,014 | 5,404,233 | 20,662,805 | 2,454,500 | 1,179,910

Table 7: Statistics of the Amazon-PQA dataset, extracted from Amazon.com. Y/N Questions: as detected by our yes/no question detector. Answers: the total number of answers before any filtering. Q. Answered Yes (No): the number of yes/no questions with answers labeled as Yes (No). (See Section 4.2.)

Category | Train Set (P / Q / %Yes) | Dev Set (P / Q / %Yes) | Test Set (P / Q / %Yes) | Total (P / Q / %Yes)
Light bars | 2,552 / 8,675 / 68.1 | 319 / 1,080 / 68.7 | 319 / 1,296 / 69.2 | 3,190 / 11,051 / 68.3
Monitors | 3,421 / 29,886 / 63.3 | 427 / 3,890 / 64.7 | 427 / 4,260 / 63.0 | 4,275 / 38,036 / 63.4
Smartwatches | 4,128 / 34,734 / 68.5 | 516 / 3,730 / 66.4 | 516 / 3,778 / 67.8 | 5,160 / 42,242 / 68.3
Receivers | 1,725 / 11,991 / 69.2 | 215 / 1,827 / 68.1 | 215 / 2,356 / 65.7 | 2,155 / 16,174 / 68.5
Backpacks | 4,834 / 14,740 / 78.4 | 604 / 1,397 / 75.9 | 604 / 1,908 / 77.3 | 6,042 / 18,045 / 78.0
Jeans | 5,365 / 6,056 / 61.3 | 670 / 773 / 59.8 | 670 / 769 / 58.1 | 6,705 / 7,598 / 60.8
Beds | 5,912 / 16,792 / 59.1 | 739 / 2,017 / 58.3 | 739 / 2,276 / 58.3 | 7,390 / 21,085 / 58.9
Home Office Desks | 2,576 / 8,637 / 62.7 | 322 / 1,059 / 64.3 | 322 / 962 / 62.9 | 3,220 / 10,658 / 62.9
Masks | 4,332 / 8,541 / 64.8 | 541 / 1,180 / 64.0 | 541 / 1,099 / 63.1 | 5,414 / 10,820 / 64.5
Posters & Prints | 5,739 / 7,226 / 62.7 | 717 / 868 / 62.4 | 717 / 850 / 66.0 | 7,173 / 8,944 / 63.0
Accessories | 14,422 / 54,125 / 73.5 | 1,802 / 6,038 / 73.7 | 1,802 / 6,706 / 74.5 | 18,026 / 66,869 / 73.6
Total | 55,006 / 201,403 / 66.5 | 6,872 / 23,859 / 66.0 | 6,872 / 26,260 / 66.0 | 68,750 / 251,522 / 66.4

Table 8: Statistics of the yes-no question subset of the Amazon-PQA dataset, and the train/dev/test splits used in our experiments. Only the categories used in our experiments are displayed. %Yes values in the Total row are macro averages.

Category | Epochs | Learning Rate
Light Bars | 3 | 3.0E-5
Monitors | 4 | 3.0E-5
Smartwatches | 3 | 3.0E-5
Receivers | 4 | 3.0E-5
Backpacks | 4 | 3.0E-5
Jeans | 3 | 2.0E-5
Beds | 4 | 4.0E-5
Home Office Desks | 3 | 2.0E-5
Masks | 3 | 3.0E-5
Posters & Prints | 3 | 2.0E-5
Accessories | 3 | 2.0E-5

Table 9: Number of epochs and learning rates used for training the CPS model (Section 5.2) on each category.

Category | η1 | η2 | η3
Light Bars | 3 | 8 | 2
Monitors | 6 | 2 | 63
Smartwatches | 2 | 11 | 49
Receivers | 2 | 0 | 0
Backpacks | 1 | 4 | 17
Jeans | 7 | 8 | 22
Beds | 1 | 0 | 0
Home Office Desks | 4 | 1 | 38
Masks | 4 | 6 | 3
Posters & Prints | 5 | 0 | 18
Accessories | 1 | 2 | 16

Table 10: η1, η2, and η3 values used after fine-tuning on our development set (Section 5.3). Larger η values give more weight to the APC vote.

Page 3: Answering Product-Questions by Utilizing Questions from ...

244

that votes on the answer to a yesno question Theirmodel learns to identify lsquorelevantrsquo reviews basedon those that vote correctly In a following workWan and McAuley (2016) observed that questionshave multiple often divergent answers and thefull spectrum of answers should be further utilizedto train the answering system

Chen et al (2019) described a multi-task atten-tion mechanism which exploits large amounts ofQampAs and a few manually labeled reviews foranswer prediction Fan et al (2019) proposed aneural architecture directly fed by the raw text ofthe question and reviews to mark review segmentas the final answer in a reading comprehensionfashion Das et al (2019) learned an adversarialnetwork for inferring reviews which best answer aquestion or augment a given answer Deng et al(2020) incorporated opinion mining into the review-based answer generation Yu and Lam (2018) gen-erated aspect-specific representation for questionsand reviews for answer prediction for yes-no ques-tions Yu et al (2018) used transfer learning froma resource-rich source domain to a resource-poortarget domain by simultaneously learning sharedrepresentations of questions and reviews in a uni-fied framework of both domains

All this line of works assume the existence ofrich set of product reviews to be used for questionanswering This solution fails when no reviews areavailable The challenge of review generation fora given product while utilizing similar productsrsquoreviews was addressed by Park et al (2015) For agiven product they extracted useful sentences fromthe reviews of other similar products Similarly(Pourgholamali 2016) mined relevant content fora product from various content resources availablefor similar products Both works focused on the ex-traction of general useful product related informa-tion rather than answering a specific product ques-tion as in our case Second the product-similaritymethods they considered rely on product specifi-cations and description and do not depend on thequestion to be answered while our method con-siders a specific question at hand when estimatingcontextual product similarity

3 Similarity-Based Answer Prediction

In this section we introduce the Similarity-BasedAnswer-prediction (SimBA) method for predictingthe answer for a product question based on theanswers for other similar product questions We

Figure 1 Overview of SimBA answer prediction framework(1) K siblings to the product question are retrieved from corpusby AKNN (2) Siblings are filtered by the Q2Q model keepingonly twins (3) Twins are scored by the CPS model (4)A Mixture of Experts model uses these votes to predict theanswer

restrict our study to yesno questions only due totheir popularity in the PQA domain (54 on ourPQA dataset) and following common practicesin answer prediction studies (McAuley and Yang2016 Yu and Lam 2018) Figure 1 presents ourprediction framework and its main components

Formally a question-product-answer tuple isdenoted by rj = (qj pj aj) where aj isinprimeyesprimeprime noprime C = rjNj=1 is the set of N tuplesof a given product category rt = (qt pt )

4 is thetarget record of an unanswered question qt askedabout product pt We treat C as the knowledge-base we use for answering qt

Given a target record rt in order to predict itsanswer at we first retrieve a set of records fromC with the most similar questions to qt (Figure 1stage 1) We denote the retrieved records as sib-lings of rt We then filter the siblings by applyinga Question-to-Question similarity (Q2Q) modelkeeping only records with highly similar questionswhich are expected to have the same question in-tent as of qt (Figure 1 stage 2) We denote theserecords as twins of rt We then apply our Contex-tual Product Similarity (CPS) model to measurethe contextual similarity between rt and its twins(Figure 1 stage 3) The CPS similarity score isused to weight the twins by considering them asvoters applying a mixture-of-experts model overtheir answers for the final answer prediction (Fig-ure 1 stage 4) More details about the modelrsquoscomponents the training processes and other spec-ifications are described in the following

31 Sibling Retrieval

Given a target record rt and a corpus of product-question-answer records C our first goal is to re-

4The answer for the target record is unknown

245

trieve all records with a question having the sameintent as of qt As C might be very large applyinga complex neural model to measure the similarityof each question in C to qt is often infeasible Wetherefore apply a two step retrieval process In apreliminary offline step we index the records in Cby creating embedding vectors for their questionsusing a pre-trained encoder For retrieval doneboth during training and inference we similarlyembed the question qt into vector et We then usea fast Approximate K Nearest Neighbors (AKNN)search to retrieve K records with the most similarquestions based on the cosine similarity betweenet and the embedding vectors of the questions inC We denote the set of retrieved siblings of rt byS(rt)

32 Twin detectionThe retrieved sibling records are those with themost similar questions to the target question Inthe second step of the retrieval process we en-hance our record selection by applying a highlyaccurate transformer-based Question-to-Question(Q2Q) classifier (See Section 51) which we trainover our question to question similarity dataset(Section 41) The Q2Q(qt qk) classifier predictsthe similarity between a target question qt and eachof the questions qk in S(rt) A record rk is con-sidered a twin of rt if Q2Q(qt qk) gt γ where05 le γ le 10 is a hyper-parameter of the systemWe denote the set of twins of rt by T (rt)

33 Contextual Product Similarity (CPS)We consider products p1 and p2 to be contextuallysimilar with respect to a yesno question q if theanswer to q on both products is the same5 Givena pair of twin records (r1 r2) our CPS model isaims to predict the contextual similarity betweenthem ie whether their (highly similar) questionshave the same answer

Since r1 and r2 are twins their questions areexpected to have the same intent yet they mightbe phrased differently To avoid losing any infor-mation we provide both questions as input to theCPS model during training and during inferencetime

CPS Model Architecture Figure 2 depicts theCPS model for predicting the contextual similarity

5By design both products belong to the same productcategory C which prevents comparing unrelated productsFor example comparing an airhorn and a computer fan in thecontext of the question is it loud is therefore prevented

Figure 2 The Contextual Product Similarity (CPS) modelThe target question-product pair (qt pt) and the twin question-product pair (qj pj) are encoded using a transformer encoderwhile the questions attend the product text The texts of bothproducts are coupled and also encoded allowing the two prod-uct text attend each other The three output vectors are thenconcatenated and classified using an MLP classifier

between a target record rt and one of its twins -record rj For each record the question-productpair is embedded using a pre-trained transformerencoder allowing the product textual content andthe question text attend each other6

Ht = Encoder(qt pt) Hj = Encoder(qj pj)

The two models share weights to avoid over-fittingand for more efficient learning A second encoderembeds the textual content of both products encap-sulating the similarity between them

Htj = Encoder(pt pj)

Then a one hidden MLP layer takes the concatena-tion of the three embedding vectors to predict theprobability of at = aj

ψtj = CPS(rt rj) = P (at=aj |rt rj)=MLP (Ht oplusHj oplusHtj)

(1)

Another key advantage of the CPS model is itsability to be trained on a large scale without humanannotations by simply yielding the training labelsdirectly from the polarity between the answers oftwin pairs extracted from our training data For anypair of twins (ri rj)

label(ri rj) =

similar ai = aj

different ai 6= aj(2)

6The product textual content can be accumulated fromseveral resources In our experiments we restrict the productcontent to its title and bullet points

246

Question 1 Question 2 Label

Can this be used with regular light bulbs Can i put a regular light bulb in this SimilarCan i use these labels in a laser printer Can this be used in laser printer SimilarDoes the hat have an adjustable strap inside Does the hat adjust inside SimilarCan this organizer hold sleeved cards Can it hold cards with sleeves on them SimilarDoes this phone have bluetooth Does the phone have gps DifferentCan just two player play this game Whats the length of play of this game DifferentIs there a diffuser for this dryer Can this go in the dryer DifferentWhat material is the neck strap made of Does this come with a neck strap Different

Table 2 Examples from Amazon-PQSim Dataset Each example consists of a user-generated question pair and ahuman-annotated label for their similarity

34 Mixture of ExpertsA mixture of experts is a widely-used method tocombine the outputs of several classifiers by as-sociating a weighted confidence score with eachclassifier (Jacobs et al 1991) In our setting ex-perts are individual twins that lend support for oragainst a particular answer for a question Eachtwin is weighted by its contextual similarity to thetarget record rt as predicted by the CPS model

Given a target record rt the weight of each ofits twins rj isin T (rt) is determined by

λ(rj) = max(ψ2tj wmin)

where ψtj = CPS(rt rj) and 0 le wmin le 05is a lower weight-limit a hyper-parameter that wetune on the development set7

The predicted class of at is therefore derived by

Pred(at|rt) = sign

sumrjisinT (rt)

λ(rj)δ(aj)

(3)

where positivenegative Pred indicates lsquoyesrsquolsquonorsquo

respectively and δ(a) =

+1 a = lsquoyesrsquominus1 a = lsquonorsquo

Our methodology can be easily expanded to in-corporate more answer predictors (voters) of dif-ferent types into SimBA An example for such anexpansion is described at Section 53

4 Datasets

We introduce two new datasets to experiment withour answer prediction approach 1) The Ama-zon Product Question Similarity (Amazon-PQSim)dataset which is used to train our Q2Q model 2)The Amazon Product Question Answers (Amazon-PQA) dataset of product related QampAs used fortraining the SimBA model

7We tried using the CPS raw score for all twins iewmin = 0 however using a fine-tuned minimal weightyielded better results

41 Amazon-PQSim DatasetWe collected a first-of-a-kind question-to-questionsimilarity dataset of product-question pairs fromthe Amazon website (Amazon-PQSim See Table 2for examples) Unlike the Quora dataset of generalquestion pairs8 product questions are asked in thecontext of a designated product page This makesthem unique and different from questions askedin other domains For example the question Isit waterproof when appears on the Fitbit Flyerdetailed page should implicitly be interpreted as IsFitbit Flyer waterproof

The following steps were taken for the data col-lection (a) randomly sampling product-questionsfrom the Amazon website (b) filtering out someof these questions (eg non-English questions formore details see Appendix A) (c) For each ofthe remaining questions we retrieved up to threecandidate similar questions from the collectionA question is paired with the original question ifthe Jaccard similarity among them is in the rangeof [03 05] We ignore highly similar questions(gt 05) since we donrsquot want nearly verbatim pairsin our dataset as well as dissimilar pairs (lt 03)(d) Finally we used the Appen crowd-sourcing plat-form9 for manual annotation of question pairs sim-ilarity 10 Each question pair was labeled by atleast three judges and up to seven until reachingagreement of 70 or more

The above steps resulted in a nearly balanceddataset (108 positive-negative ratio) of more than180K product question pairs with judges agree-ment of 70 or more and among them about 90Kquestion pairs have perfect judges agreement (114

8httpswwwkagglecomcquora-question-pairs9httpsappencom

10As the questions are asked in context of a specific productthey are often written in an anaphoric form (eg Is it water-proof) To keep our dataset general we instructed the judgesto accept such questions as if they included the actual relatedproduct name For example the pair Is it waterproof and Isthis Fitbit waterproof were labeled as similar

247

positive-negative ratio)

42 Amazon-PQA DatasetWe collected a large corpus of product questionsand answers from the Amazon website similarto the popular Amazon QampA dataset (McAuley2016) Since our answer prediction method directlyutilizes an existing corpus of resolved questionswe aim to collect all available questions per nar-row sub-category instead of a sample of questionsacross broad categories by the popular AmazonQampA dataset For example instead of samplingfrom the broad Electronics category we collectall questions under the narrower Monitors and Re-ceivers categories

Raw Data Extraction We collected all prod-uct questions with their answers from 100 sub-categories available on the Amazon website inAugust 2020 Overall 10M questions were col-lected with 207M answers on 15M products Forfull statistics of the raw data see Table 7 in Ap-pendix A

YesNo Question Classification We followed(He and Dai 2011) for detecting YesNo questionsusing simple heuristics See Appendix A for de-tails

YesNo Answer Labeling Questions are typi-cally answered by free-text answers posted in-dependently by multiple users In order to con-vert these answers into a single yesno answerwe first classified each answer into one of threeclasses yes no and maybe and then used major-ity vote among the classified answers We used apre-trained RoBERTa-based classifier and trainedthe model on McAuleyrsquos dataset (McAuley 2016)taking only yesno questions See Appendix A fordetails

5 Experiments

We experiment with eleven product categories cov-ered by our Amazon-PQA dataset (Section 42)training a SimBA answer prediction model for eachof the categories independently Next we describethe data preparation steps for each of the SimBAcomponents

51 Data PreparationSibling Retrieval Using AKNN For each recordr isin C (C is the category dataset) we use AKNNto retrieve the top-K similar siblings fromC while

Majority Jaccard USE RoBERTa Q2Qvote similarity cosine cosine531 620 696 707 832

Table 3 Classification accuracy of question similarity mod-els

making sure that neither of them share the sameproduct with r We collect training example pairsby coupling each record r with each of its siblingsDprime(C) =

⋃riisinC(ri rj)|rj isin S(ri)

For retrieval we use Universal Sentence Encoder(USE) (Cer et al 2018) to embed each question qiinto a 512-length vector ei We use the Annoy11

python library for the implementation of efficientAKNN retrieval In all experiments for each recordwe retrieve the top-K (K = 500) similar recordsbased on the cosine-similarity between the embed-ding vectors

Twin Detection Using the Q2Q Model Foreach sibling pair (ri rj) isin Dprime(C) we use ourQ2Q model to score their question-similarity andkeep only those with Q2Q(qi qj) gt γ to yield acollection of twin pairs D(C) We use γ = 09 toensure only highly similar question pairs

For our Q2Q model we apply a standard pre-trained RoBERTa (Liu et al 2019) classifierSpecifically we use Hugging-Face base-uncasedpre-trained model12 and fine-tune13 it for the classi-fication task on our Q2Q dataset14 while splittingthe data into train dev and test sets with 80-10-10 partition respectively For γ = 05 (itsminimal value) the model achieves test accuracyof 832 with a precision of 813 and a recallof 877 When setting the twin confidence levelthreshold to γ = 09 the precision of the Q2Qmodel raises to 899 with a recall of 695

We compare the performance of the Q2Q simi-larity classifier with several unsupervised baselinesnamely (a) Jaccard similarity (b) cosine similar-ity over USE embedding and (c) cosine similarityover RoBERTa15 embedding The results are sum-marized in Table 3 showing that the Q2Q modelsignificantly outperforms these baselines

11httpsgithubcomspotifyannoy12httpsgithubcomhuggingfacetransformers13We use batch size 32 maximum sequence length of 128

learning rate 5e-5 and 3 epochs14We only used the examples with full agreement15Hugging-Face sentence-transformers roberta-large-nli-

stsb-mean-tokens model

248

Category Acc Majority ∆

Light Bars 739 611 +128Monitors 782 682 +99Smartwatches 800 656 +144Receivers 775 676 +99Backpacks 839 760 +79Jeans 713 593 +119Beds 846 720 +126Home Office Desks 739 634 +105Masks 751 669 +82Posters amp Prints 723 609 +115Accessories 791 724 +66

Macro Average 772 667 +97

Table 4 CPS model test set results on the CPS auxiliarydatasets and the majority baseline of each category

52 CPS Model

Training The CPS model predicts the contextualsimilarity between a pair of twin records In our ex-periments the textual content of a product consistsof the product title concatenated with the productbullet points separated by semicolons The ques-tion text is the original query as appeared in theAmazon PQA-dataset For the encoding modulesof the CPS model we use a standard pre-trainedRoBERTa-based model as well while using the[SEP ] token for separating the two inputs to eachencoder For training twin pairs are labeled accord-ing to their contextual similarity using Equation 2

We train fine-tune and test an independentCPS model for each category set C using D(C)Ddev(C) and Dtest(C) (details of the data splitdescribed in Appendix A) The training set D(C)is created as described in Section 51 Ddev(C)and Dtest(C) are created the same with one mod-ification ndash rather than retrieving the siblings for arecord from the dataset it belongs to the siblingsare retrieved from D(C) for both Ddev(C) andDtest(C) This represents a real-world scenariowhere existing products with their related questionsare used as a corpus for predicting the answer to aquestion about a new product Each product withall related questions appear only in one of thesesets

Evaluation We evaluate the CPS model by mea-suring the accuracy of its contextual similarity pre-diction over Dtest(C) The accuracy per categoryis presented in Table 4 The model achieves a rela-tively high accuracy with a macro average of 772over all categories presenting a significant lift of97 over the majority decision baseline This isan encouraging result considering the fact that the

answers for many questions cannot be directly in-ferred from the product textual information Weconjecture that the model is able to learn the affinitybetween different products in the context of a givenquestion for predicting their contextual similarityFor example the two backpacks Ranvoo LaptopBackpack and Swiss Gear Bungee Backpack werecorrectly classified by the CPS model as similar(ψ ge 05) in context of the question ldquoWill this fitunder a plane seatrdquo and classified as different(ψ lt 05) in context of the question ldquoDoes it havea separate laptop sleeverdquo

53 Answer Prediction MethodsWe experiment with our SimBA model and witha few baselines over the test set of all categoriesThe first one is Majority which returns the majorityanswer among all records in the category Othermethods are described next

SimBA Given a target record rt SimBA scoreseach of its twins by the CPS model and predictsthe answer for qt using Equation 3 wmin was fine-tuned on the combined dev set of all categories andwas set to 038

Question Similarity Only (QSO) We modifythe SimBA model to ignore the CPS classifica-tion score when implementing the Mixture-of-Experts model (Eq 3) by setting an equalweight of 10 to all twin votes Pred(at|rt) =

sign(sum

rjisinT (rt)δ(aj)

)

Product Similarity Only (PSO) We modify theSimBA model by setting qt and qj to empty stringsat the input of the CPS model both during train-ing and during inference forcing it to rely on theproductsrsquo textual content alone The twin retrievalprocess remains untouched

Answer Prediction Classifier (APC) We exper-iment with a direct prediction approach that onlyconsiders the product textual content and the ques-tion for answer prediction For each category Cwe fine-tune a pre-trained RoBERTa-based clas-sifier over all records rj isin C using qj and pj(separated by the [SEP ] token) as input and δ(aj)as the training label

SimBA+APC The experimental results showthat different answer-prediction methods (egSimBA vs APC) may be preferable for differentproduct categories Therefore we combine bothmethods for achieving optimal results by mixing

249

Twins Answer(Monitors) Does this require WiFi 51 91 No(Backpacks) Will it fit under a plane seat 213 90 Yes(Smartwatches) Can it measure blood sugar level 34 97 No(Smartwatches) Does it come with a charger 269 96 Yes

Table 5 Examples for popular questions with high answeragreement over different products

Figure 3 Macro-average test accuracy over all categoriesThe horizontal axis indicate the minimal number of twins andthe percentages of test questions each subset represents

the vote of APC with the twin votes using theMixture-of-Experts approach

Pred(at|rt) =sign

(η(rt)δ(αt) +

sumrjisinT (rt)

λ(rj)δ(aj))

where αt is the APC predicted answer and η(rt) =η1 η2 and η3 for |T (rt)| le 10 10 lt |T (rt)| lt 50and |T (rt)| ge 50 respectively16 All η values(η gt 0) are fine-tuned on the development set foreach category separately The values we used aredetailed in Table 10 in Appendix A

54 Answer Prediction EvaluationThe answer prediction accuracy results of all testedpredictors macro-averaged over Dtest(C) of allcategories are presented in Figure 3 We inspectthe performance of the methods on different subsetsof the test data where each subset is determinedby all records having at least x twins x isin [0130]The horizontal axis indicates the minimal numberof twins in the subset and the percentage of the dataeach subset represents For example the results atx = 0 represent the entire test set while the resultsat x = 10 represents the subset of questions withat least 10 twins account for 402 of the test set

The performance of Majority begins at 66% (the percentage of 'yes' questions in the entire population) and drops for questions with many twins. We hypothesize that obvious questions, for which the answer is the same across many products, are rarely asked, and hence have fewer twins. In contrast, informative questions, for which the answer varies across products, are frequently asked with respect to many products, and hence have many twins. Therefore, we see a drop in the accuracy of the Majority baseline as the number of twins grows.

16 We also tried a few different splits on the development set.

The accuracy of QSO is significantly higher than the majority-vote baseline. This demonstrates an interesting phenomenon in the data: similar questions tend to have the same answer over a variety of products, typically of the same type. A few examples are presented in Table 5. The QSO method successfully detects these groups of questions and predicts the majority answer for each such group. We find that the PSO method generally doesn't improve over QSO. This is somewhat surprising, as we expected that using product similarity information, such as brand, model, or key features, would increase the prediction accuracy. This demonstrates the importance of the question context, as used in SimBA, in addition to the product information alone.

Moving to SimBA, we can see a large performance improvement over the QSO and PSO methods, which we attribute directly to the CPS model. We also see a consistent improvement in accuracy with the number of twins, likely due to the larger support the model has for predicting the answer.

The APC method, despite its relative simplicity, performs very well, and greatly outperforms the majority-vote, QSO, and PSO baselines. For the segment of questions with less than 10 twins, APC outperforms the SimBA method. This segment represents roughly 60% of the questions. However, for the segment of questions with 60 or more twins, which accounts for 13.6% of the questions, the SimBA method consistently outperforms the inductive baseline by 1–2%. When inspecting the results by category, as shown in Table 6, we can see that, considering all questions with at least 1 twin, the APC method dominates in 7 out of the 11 categories, while for questions with at least 60 twins, the SimBA method dominates in 6 out of the 11 categories.

Finally, we see that the two approaches complement each other and can be effectively joined, as the SimBA+APC method outperforms both of them over all subsets.


                        Questions with 1+ Twins                  Questions with 60+ Twins
Category             % of data  SimBA   APC   SimBA+APC     % of data  SimBA   APC   SimBA+APC
Light bars              62.5    75.3   74.8     77.7            9.0    74.1   68.1     75.0
Monitors                79.2    76.0   76.4     76.5           27.0    78.5   78.5     78.5
Smartwatches            88.3    77.3   79.3     79.0           31.9    79.9   81.7     80.8
Receivers               57.5    70.1   70.3     72.0            4.8    83.2   77.9     83.2
Backpacks               74.7    80.7   82.7     82.3           21.5    82.7   83.2     82.9
Jeans                   63.3    67.4   65.9     67.4           13.4    74.8   70.9     75.7
Beds                    70.4    77.7   76.4     77.9           16.8    82.5   80.4     82.5
Home & office desks     65.0    71.8   76.2     75.8            4.7    80.0   84.4     84.4
Masks                   76.5    70.9   74.2     73.0            4.2    80.4   82.6     80.4
Posters & prints        60.1    73.4   73.4     74.4            9.3    75.9   69.6     75.9
Accessories             71.7    78.1   79.0     79.2            7.2    82.3   81.6     82.7
Macro Average           69.9    74.4   75.3     75.9           13.6    79.5   78.1     80.2

Table 6: Answer prediction accuracy by category. Left: accuracy over records with at least one twin, representing 69.9% of the records on average. Right: accuracy over records with at least 60 twins, representing 13.6% of the records. The highest accuracy between SimBA and APC is in bold.

6 Conclusions

We presented SimBA, a novel answer prediction approach in the PQA domain, which directly leverages similar questions answered with respect to other products. Our empirical evaluation shows that on some segments of questions, namely those with roughly ten or more similar questions in the corpus, our method can outperform a strong inductive method that directly utilizes the question and the textual product content. We further show that the two approaches are complementary and can be integrated to increase the overall answer prediction accuracy.

For future work, we plan to explore how SimBA can be extended and applied beyond yes-no questions, e.g., to questions with numerical answers or open-ended questions. Another interesting research direction is incorporating additional voters into the Mixture-of-Experts model, such as a review-aware answer predictor or a product-details-based predictor. Additionally, our current evaluation considered a static view of the answered product-question corpus; we plan to explore temporal aspects of our method, for example, considering a question's age, or ignoring answers of obsolete products that might be irrelevant.

Acknowledgments

Ohad Rozen would like to express his gratitude to Yanai Elazar, Vered Shwartz, and Ido Dagan for providing him valuable advice while he was conducting this research during his internship at Amazon.

References

David Carmel, Liane Lewin-Eytan, and Yoelle Maarek. 2018. Product question answering using customer generated content - research challenges. In SIGIR '18, pages 1349–1350. Association for Computing Machinery.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Long Chen, Ziyu Guan, Wei Zhao, Wanqing Zhao, Xiaopeng Wang, Zhou Zhao, and Huan Sun. 2019. Answer identification from product reviews for user questions by multi-task attentive networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 45–52.

Lei Cui, Shaohan Huang, Furu Wei, Chuanqi Tan, Chaoqun Duan, and Ming Zhou. 2017. SuperAgent: A customer service chatbot for e-commerce websites. In Proceedings of ACL 2017, System Demonstrations, pages 97–102.

Manirupa Das, Zhen Wang, Evan Jaffe, Madhuja Chattopadhyay, Eric Fosler-Lussier, and Rajiv Ramnath. 2019. Learning to answer subjective, specific product-related queries using customer reviews by adversarial domain adaptation. CoRR, abs/1910.08270.

Yang Deng, Wenxuan Zhang, and Wai Lam. 2020. Opinion-aware answer generation for review-driven question answering in e-commerce. In Proceedings of CIKM 2020, pages 255–264. Association for Computing Machinery.

Miao Fan, Chao Feng, Mingming Sun, Ping Li, and Haifeng Wang. 2019. Reading customer reviews to answer product-related questions. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 567–575. SIAM.

Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-aware answer generation in e-commerce question-answering. In Proceedings of WSDM 2019, pages 429–437. ACM.

Jing He and Decheng Dai. 2011. Summarization of yes/no questions using a feature function model. In Proceedings of Machine Learning Research, volume 20, pages 351–366, South Garden Hotels and Resorts, Taoyuan, Taiwan. PMLR.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of CIKM 2005, pages 84–90.

T. M. Lai, T. Bui, N. Lipka, and S. Li. 2018. Supervised transfer learning for product information question answering. In Proceedings of ICMLA 2018, pages 1109–1114.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Julian McAuley. 2016. Amazon question/answer data. https://jmcauley.ucsd.edu/data/amazon/qa/.

Julian J. McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of WWW 2016, pages 625–635. ACM.

Dae Hoon Park, Hyun Duk Kim, ChengXiang Zhai, and Lifan Guo. 2015. Retrieval of relevant opinion sentences for new products. In Proceedings of SIGIR 2015, pages 393–402. ACM.

Fatemeh Pourgholamali. 2016. Mining information for the cold-item problem. In Proceedings of RecSys 2016, pages 451–454. ACM.

Mengting Wan and Julian McAuley. 2016. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Proceedings of ICDM 2016, pages 489–498. IEEE.

Jianfei Yu, Minghui Qiu, Jing Jiang, Jun Huang, Shuangyong Song, Wei Chu, and Haiqing Chen. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In Proceedings of WSDM 2018, pages 682–690. ACM.

Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua. 2012. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In Proceedings of EMNLP 2012, pages 391–401. ACL.

Qian Yu and Wai Lam. 2018. Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of WSDM 2018, pages 691–699. ACM.

A Supplemental Material

A.1 Amazon-PQSim dataset

The Amazon-PQSim dataset includes question pairs, where all questions are published on the Amazon website. Each pair has a corresponding label: 1 for similar, 0 otherwise. The labels were collected via the Appen crowd-sourcing service. We took the following filtering steps (step b in Section 4.1) for each question; a code sketch follows the list:

• Removed any question with less than five words.

• Removed any question with more than 15 words.

• Removed any non-English questions.

• Removed any question with multiple question marks (may indicate multiple questions).

• Removed questions with rare words (any word which is not in the top 2,000 most frequent words).
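A minimal sketch of these filters (our rendering, not the authors' code); `vocab` is assumed to hold the top-2,000 most frequent words, and the non-English filter is elided:

```python
def keep_question(question: str, vocab: set) -> bool:
    """Return True iff the question passes the Amazon-PQSim filters above.

    Non-English filtering is omitted here; in practice it would be done
    with a language-identification tool.
    """
    words = question.lower().split()
    if not (5 <= len(words) <= 15):     # length filters
        return False
    if question.count("?") > 1:         # multiple question marks
        return False
    # rare-word filter: every token must be a frequent word
    return all(w.strip("?!.,") in vocab for w in words)
```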

A.2 Amazon-PQA dataset

The Amazon-PQA dataset includes questions and their answers that are published on the Amazon website, along with the public product information and category (Amazon Browse Node name). The data includes the following fields:

• question_id

• asin_id

• question_text

• answer_text

• bullet_points

• product_description

• brand_name

• item_name

In addition, we augment this data with fields derived from our current work:

• is_yes-no_question

• yes-no_answer (yes, no, maybe)

Yes/No Question Classification We followed He and Dai (2011) for detecting yes/no questions, using simple heuristics, such as checking whether the question starts with a 'be' verb (am, is, are, been, being, was, were), a modal verb (can, could, shall, should, will, would, may, might), or an auxiliary verb (do, did, does, have, had, has), and additionally ends with a question mark. We tested the classifier on McAuley's dataset (McAuley, 2016) and identified yes/no questions with 98.4% precision at 96.5% recall, while considering McAuley as ground truth.17
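A minimal sketch of this heuristic detector might look as follows (the function name and exact tokenization are our assumptions, not from the paper):

```python
import re

# Verb lists from the heuristic above (following He and Dai, 2011).
BE_VERBS = {"am", "is", "are", "been", "being", "was", "were"}
MODAL_VERBS = {"can", "could", "shall", "should", "will", "would", "may", "might"}
AUX_VERBS = {"do", "did", "does", "have", "had", "has"}
LEADING = BE_VERBS | MODAL_VERBS | AUX_VERBS

def is_yes_no_question(question: str) -> bool:
    """Heuristically detect yes/no questions: the question must start with a
    be/modal/auxiliary verb and end with a question mark."""
    text = question.strip().lower()
    if not text.endswith("?"):
        return False
    tokens = re.findall(r"[a-z']+", text)
    return bool(tokens) and tokens[0] in LEADING

# Example usage
print(is_yes_no_question("Does it come with a charger?"))  # True
print(is_yes_no_question("What material is the strap?"))   # False
```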

Yes/No Answer Labeling As described in Section A.2, we used the McAuley dataset (McAuley, 2016) to train a RoBERTa-based classifier, taking only yes/no questions. For testing, we used 5 annotators to annotate 583 question-answer pairs randomly sampled from our raw data. The model achieved 97% and 88% precision for the yes and no labels, respectively, and a recall of 65% and 51% on the entire test set.18

Next, to determine each question's final yes/no answer, we first omitted answers classified as maybe. When a question is answered by a verified seller, we considered it as most reliable and used it as the final label. Otherwise, we used the majority vote among the remaining answers. In our experiments, we ignore questions with an equal number of yes and no answers. A sketch of this aggregation logic follows.
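This is a minimal sketch under our own representation of the data (a list of (label, is_verified_seller) pairs), not the authors' code:

```python
from collections import Counter
from typing import List, Optional, Tuple

def final_yes_no_label(answers: List[Tuple[str, bool]]) -> Optional[str]:
    """Aggregate per-answer labels into one final yes/no label.

    answers: (label, is_verified_seller) pairs, label in {'yes','no','maybe'}.
    Returns 'yes', 'no', or None for ties (such questions are ignored).
    """
    kept = [(label, seller) for label, seller in answers if label != "maybe"]
    for label, seller in kept:
        if seller:                 # a verified-seller answer is most reliable
            return label
    counts = Counter(label for label, _ in kept)
    if counts["yes"] == counts["no"]:
        return None
    return "yes" if counts["yes"] > counts["no"] else "no"
```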

Dataset Split Each item in our dataset is a (product, question, answer) triplet. We split the labeled triplets into train (80%), dev (10%), and test (10%) sets for each category, according to the number of products. Each product, with all of its related questions, appears in only one of these sets. The statistics for this dataset are given in Table 8.
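A product-level split consistent with this description can be sketched as follows (the exact boundary arithmetic and the random seed are our choices):

```python
import random
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (product, question, answer)

def split_by_product(records: List[Triplet], seed: int = 0) -> Dict[str, List[Triplet]]:
    """80/10/10 split at the product level, so each product (with all of
    its questions) lands in exactly one of train/dev/test."""
    products = sorted({p for p, _, _ in records})
    random.Random(seed).shuffle(products)
    n = len(products)
    train = set(products[: int(0.8 * n)])
    dev = set(products[int(0.8 * n): int(0.9 * n)])
    out: Dict[str, List[Triplet]] = {"train": [], "dev": [], "test": []}
    for rec in records:
        part = "train" if rec[0] in train else ("dev" if rec[0] in dev else "test")
        out[part].append(rec)
    return out
```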

A.3 CPS Model Details

The CPS model has a total of 254.6M parameters. For all incorporated RoBERTa models, we use a maximum sequence length of 256, a dropout of 0.1, and a batch size of 32 for training. We applied different learning rates and numbers of epochs for each product category; the specific values we used after tuning are shown in Table 9.

17 McAuley and Yang reported identifying yes/no questions with 97% precision at 82% recall on their dataset.

18 McAuley and Yang reported 98% accuracy after keeping only the 50% of instances about which their classifier was the most confident.


Category                 Products    Questions   Y/N Questions      Answers   Q. Answered Yes   Q. Answered No
Light Bars                  6,151       48,736          23,956       95,853            10,146            5,243
Monitors                    6,246      170,529          86,837      316,126            35,353           22,947
Smartwatches                8,611      166,400          94,936      289,945            41,683           22,033
Receivers                   3,442       58,618          33,511      135,700            14,488            7,364
Backpacks                  12,096       68,598          38,914      138,996            19,902            6,090
Jeans                      38,008       61,908          17,518      129,346             7,708            5,548
Beds                       17,202      108,723          46,722      238,786            17,757           13,917
Home Office Desks           6,986       55,303          23,202      112,958             9,523            5,971
Masks                      13,384       51,295          24,989      100,789             9,757            5,759
Posters & Prints           33,803       53,939          20,737       99,926             8,171            5,450
Accessories                38,825      238,603         159,272      438,447            60,990           23,772
Rest of 89 Categories   1,288,754    8,906,362       4,833,639   18,565,933         2,219,022        1,055,816
Total                   1,473,508    9,989,014       5,404,233   20,662,805         2,454,500        1,179,910

Table 7: Statistics of the Amazon-PQA dataset, extracted from Amazon.com. Y/N Questions: as detected by our Yes/No Question detector. Answers: the total number of answers before any filtering. Q. Answered Yes (No): the number of yes/no questions with answers labeled as Yes (No). (See Section 4.2.)

                           Train Set                 Dev Set                 Test Set                  Total
Category               P        Q    %Yes       P        Q    %Yes      P        Q    %Yes       P         Q    %Yes
Light bars          2,552    8,675   68.1     319    1,080   68.7     319    1,296   69.2     3,190    11,051   68.3
Monitors            3,421   29,886   63.3     427    3,890   64.7     427    4,260   63.0     4,275    38,036   63.4
Smartwatches        4,128   34,734   68.5     516    3,730   66.4     516    3,778   67.8     5,160    42,242   68.3
Receivers           1,725   11,991   69.2     215    1,827   68.1     215    2,356   65.7     2,155    16,174   68.5
Backpacks           4,834   14,740   78.4     604    1,397   75.9     604    1,908   77.3     6,042    18,045   78.0
Jeans               5,365    6,056   61.3     670      773   59.8     670      769   58.1     6,705     7,598   60.8
Beds                5,912   16,792   59.1     739    2,017   58.3     739    2,276   58.3     7,390    21,085   58.9
Home Office Desks   2,576    8,637   62.7     322    1,059   64.3     322      962   62.9     3,220    10,658   62.9
Masks               4,332    8,541   64.8     541    1,180   64.0     541    1,099   63.1     5,414    10,820   64.5
Posters & Prints    5,739    7,226   62.7     717      868   62.4     717      850   66.0     7,173     8,944   63.0
Accessories        14,422   54,125   73.5   1,802    6,038   73.7   1,802    6,706   74.5    18,026    66,869   73.6
Total              55,006  201,403   66.5*  6,872   23,859   66.0*  6,872   26,260   66.0*   68,750   251,522   66.4*

Table 8: Statistics of the yes-no questions subset of the Amazon-PQA dataset and the train, dev, test splits used in our experiments. Only categories used for our experiments are displayed. *Macro average.

Category             Epochs   Learning Rate
Light Bars              3        3.0E-5
Monitors                4        3.0E-5
Smartwatches            3        3.0E-5
Receivers               4        3.0E-5
Backpacks               4        3.0E-5
Jeans                   3        2.0E-5
Beds                    4        4.0E-5
Home Office Desks       3        2.0E-5
Masks                   3        3.0E-5
Posters & Prints        3        2.0E-5
Accessories             3        2.0E-5

Table 9: Number of epochs and learning rates used for training the CPS model (Section 5.2) on each category.

Category             η1   η2   η3
Light Bars            3    8    2
Monitors              6    2   63
Smartwatches          2   11   49
Receivers             2    0    0
Backpacks             1    4   17
Jeans                 7    8   22
Beds                  1    0    0
Home Office Desks     4    1   38
Masks                 4    6    3
Posters & Prints      5    0   18
Accessories           1    2   16

Table 10: η1, η2, and η3 values used after fine-tuning on our development set (Section 5.3). Larger η values give more weight to the APC vote.


answers for many questions cannot be directly in-ferred from the product textual information Weconjecture that the model is able to learn the affinitybetween different products in the context of a givenquestion for predicting their contextual similarityFor example the two backpacks Ranvoo LaptopBackpack and Swiss Gear Bungee Backpack werecorrectly classified by the CPS model as similar(ψ ge 05) in context of the question ldquoWill this fitunder a plane seatrdquo and classified as different(ψ lt 05) in context of the question ldquoDoes it havea separate laptop sleeverdquo

53 Answer Prediction MethodsWe experiment with our SimBA model and witha few baselines over the test set of all categoriesThe first one is Majority which returns the majorityanswer among all records in the category Othermethods are described next

SimBA Given a target record rt SimBA scoreseach of its twins by the CPS model and predictsthe answer for qt using Equation 3 wmin was fine-tuned on the combined dev set of all categories andwas set to 038

Question Similarity Only (QSO) We modifythe SimBA model to ignore the CPS classifica-tion score when implementing the Mixture-of-Experts model (Eq 3) by setting an equalweight of 10 to all twin votes Pred(at|rt) =

sign(sum

rjisinT (rt)δ(aj)

)

Product Similarity Only (PSO) We modify theSimBA model by setting qt and qj to empty stringsat the input of the CPS model both during train-ing and during inference forcing it to rely on theproductsrsquo textual content alone The twin retrievalprocess remains untouched

Answer Prediction Classifier (APC) We exper-iment with a direct prediction approach that onlyconsiders the product textual content and the ques-tion for answer prediction For each category Cwe fine-tune a pre-trained RoBERTa-based clas-sifier over all records rj isin C using qj and pj(separated by the [SEP ] token) as input and δ(aj)as the training label

SimBA+APC The experimental results showthat different answer-prediction methods (egSimBA vs APC) may be preferable for differentproduct categories Therefore we combine bothmethods for achieving optimal results by mixing

249

Twins Answer(Monitors) Does this require WiFi 51 91 No(Backpacks) Will it fit under a plane seat 213 90 Yes(Smartwatches) Can it measure blood sugar level 34 97 No(Smartwatches) Does it come with a charger 269 96 Yes

Table 5 Examples for popular questions with high answeragreement over different products

Figure 3 Macro-average test accuracy over all categoriesThe horizontal axis indicate the minimal number of twins andthe percentages of test questions each subset represents

the vote of APC with the twin votes using theMixture-of-Experts approach

Pred(at|rt) =sign

(η(rt)δ(αt) +

sumrjisinT (rt)

λ(rj)δ(aj))

where αt is the APC predicted answer and η(rt) =η1 η2 and η3 for |T (rt)| le 10 10 lt |T (rt)| lt 50and |T (rt)| ge 50 respectively16 All η values(η gt 0) are fine-tuned on the development set foreach category separately The values we used aredetailed in Table 10 in Appendix A

54 Answer Prediction EvaluationThe answer prediction accuracy results of all testedpredictors macro-averaged over Dtest(C) of allcategories are presented in Figure 3 We inspectthe performance of the methods on different subsetsof the test data where each subset is determinedby all records having at least x twins x isin [0130]The horizontal axis indicates the minimal numberof twins in the subset and the percentage of the dataeach subset represents For example the results atx = 0 represent the entire test set while the resultsat x = 10 represents the subset of questions withat least 10 twins account for 402 of the test set

The performance of Majority begins with 66(the percent of lsquoyesrsquo questions in the entire popula-tion) and drops for questions with many twins We

16We also tried a few different splits on the development set

hypothesize that obvious questions for which theanswer is the same across many products are rarelyasked hence have fewer twins In contrast infor-mative questions for which the answer is variedacross products are frequently asked wrt manyproducts hence have many twins Therefore wesee a drop in accuracy of the Majority baseline asthe number of twins grows

The accuracy of QSO is significantly higher thanthe majority-vote baseline This demonstrates aninteresting phenomena in the data of similar ques-tions that tend to have the same answer over va-riety of products typically of the same type Afew examples are presented in Table 5 The QSOmethod successfully detects these groups of ques-tions and predicts the majority answer for eachsuch group We find that PSO method generallydoesnrsquot improve over QSO This is somewhat sur-prising as we expected that using product similar-ity information such as brand model or key fea-tures would increase the prediction accuracy Thisdemonstrates the importance of question-contextas used in SimBA in addition to the product infor-mation alone

Moving to SimBA we can see a large perfor-mance improvement over the QSO and PSO meth-ods which we attribute directly to the CPS modelWe also see consistent improvement in accuracywith the number of twins likely due to the largersupport the model has for predicting the answer

The APC method despite its relative simplic-ity performs very well and greatly outperformsthe majority-vote and the QSO and PSO baselinesFor the segment of questions with less than 10twins APC outperforms the SimBA method Thissegment represents roughly 60 of the questionsHowever for the segment of questions with 60 ormore twins which accounts for 136 of the ques-tions SimBA method consistently outperforms theinductive baseline by 1-2 When inspecting theresults by category as shown in Table 6 we cansee that considering all questions with at least 1twin the APC method dominates in 7 out of the11 categories while for questions with at least 60twins SimBA method dominates in 6 out of the 11categories

Finally we see that the two approaches com-pliment each other and can be effectively joinedas the SimBA+APC method outperforms both ofthem over all subsets

250

Questions with 1+ Twins Questions with 60+ Twins of data SimBA APC SimBA+APC of data SimBA APC SimBA+APC

Light bars 625 753 748 777 90 741 681 750Monitors 792 760 764 765 270 785 785 785

Smartwatches 883 773 793 790 319 799 817 808Receivers 575 701 703 720 48 832 779 832Backpacks 747 807 827 823 215 827 832 829

Jeans 633 674 659 674 134 748 709 757Beds 704 777 764 779 168 825 804 825

Home amp office desks 650 718 762 758 47 800 844 844Masks 765 709 742 730 42 804 826 804

Posters amp prints 601 734 734 744 93 759 696 759Accessories 717 781 790 792 72 823 816 827

Macro Average 699 744 753 759 136 795 781 802

Table 6 Answer prediction accuracy by category Left accuracy over records with at least one twin representing 699 of therecords on average Right accuracy over records with at least 60 twins representing 136 of the records The highest accuracybetween SimBA and APC is in bold

6 Conclusions

We presented SimBA a novel answer predictionapproach in the PQA domain which directly lever-ages similar questions answered with respect toother products Our empirical evaluation showsthat on some segments of questions namely thosewith roughly ten or more similar questions in thecorpus our method can outperform a strong induc-tive method that directly utilizes the question andthe textual product content We further show thatthe two approaches are complementary and can beintegrated to increase the overall answer predictionaccuracy

For future work we plan to explore how SimBAcan be extended and be applied beyond yes-no ques-tions eg for questions with numerical answers oropen-ended questions Another interesting researchdirection is combining additional voters to theMixture-of-Experts model such as a review-awareanswer predictor or a product details-based predic-tor Additionally our current evaluation considereda static view of the answered product-question cor-pus we plan to explore temporal aspects of ourmethod for example considering questions age orignoring answers of obsolete products that mightbe irrelevant

Acknowledgments

Ohad Rozen would like to express his gratitudeto Yanai Elazar Vered Shwartz and Ido Daganfor providing him valuable advice while he wasconducting this research during his internship atAmazon

ReferencesDavid Carmel Liane Lewin-Eytan and Yoelle Maarek

2018 Product question answering using customer

generated content - research challenges SIGIR rsquo18page 1349ndash1350 Association for Computing Ma-chinery

Daniel Cer Yinfei Yang Sheng-yi Kong Nan HuaNicole Limtiaco Rhomni St John Noah Con-stant Mario Guajardo-Cespedes Steve Yuan ChrisTar Yun-Hsuan Sung Brian Strope and RayKurzweil 2018 Universal sentence encoder CoRRabs180311175

Long Chen Ziyu Guan Wei Zhao Wanqing Zhao Xi-aopeng Wang Zhou Zhao and Huan Sun 2019 An-swer identification from product reviews for userquestions by multi-task attentive networks In Pro-ceedings of the AAAI Conference on Artificial Intel-ligence volume 33 pages 45ndash52

Lei Cui Shaohan Huang Furu Wei Chuanqi TanChaoqun Duan and Ming Zhou 2017 SuperagentA customer service chatbot for e-commerce web-sites In Proceedings of ACL 2017 System Demon-strations pages 97ndash102

Manirupa Das Zhen Wang Evan Jaffe MadhujaChattopadhyay Eric Fosler-Lussier and Rajiv Ram-nath 2019 Learning to answer subjective spe-cific product-related queries using customer re-views by adversarial domain adaptation CoRRabs191008270

Yang Deng Wenxuan Zhang and Wai Lam 2020Opinion-aware answer generation for review-drivenquestion answering in e-commerce In Proceedingsof CIKM 2020 page 255ndash264 Association for Com-puting Machinery

Miao Fan Chao Feng Mingming Sun Ping Li andHaifeng Wang 2019 Reading customer reviews toanswer product-related questions In Proceedingsof the 2019 SIAM International Conference on DataMining pages 567ndash575 SIAM

Shen Gao Zhaochun Ren Yihong Eric Zhao DongyanZhao Dawei Yin and Rui Yan 2019 Product-aware answer generation in e-commerce question-answering In Proceedings WSDM 2019 pages 429ndash437 ACM

251

Jing He and Decheng Dai 2011 Summarization ofyesno questions using a feature function model vol-ume 20 of Proceedings of Machine Learning Re-search pages 351ndash366 South Garden Hotels andResorts Taoyuan Taiwain PMLR

Robert A Jacobs Michael I Jordan Steven J Nowlanand Geoffrey E Hinton 1991 Adaptive mixtures oflocal experts Neural Comput 3(1)79ndash87

Jiwoon Jeon W Bruce Croft and Joon Ho Lee 2005Finding similar questions in large question and an-swer archives In Proceedings CIKM 2005 pages84ndash90

T M Lai T Bui N Lipka and S Li 2018 Super-vised transfer learning for product information ques-tion answering In Proceedings of ICMLA 2018pages 1109ndash1114

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Julian McAuley 2016 Amazon questionanswerdata httpsjmcauleyucsdedudataamazonqa

Julian J McAuley and Alex Yang 2016 Address-ing complex and subjective product-related querieswith customer reviews In Proceedings WWW 2016pages 625ndash635 ACM

Dae Hoon Park Hyun Duk Kim ChengXiang Zhaiand Lifan Guo 2015 Retrieval of relevant opinionsentences for new products In Proceedings SIGIR2015 pages 393ndash402 ACM

Fatemeh Pourgholamali 2016 Mining information forthe cold-item problem In Proceedings of RecSys2016 page 451ndash454 ACM

Mengting Wan and Julian McAuley 2016 Modelingambiguity subjectivity and diverging viewpoints inopinion question answering systems In Proceedingof ICDM 2016 pages 489ndash498 IEEE

Jianfei Yu Minghui Qiu Jing Jiang Jun HuangShuangyong Song Wei Chu and Haiqing Chen2018 Modelling domain relationships for transferlearning on retrieval-based question answering sys-tems in e-commerce In Proceedings of WSDM2018 page 682ndash690 ACM

Jianxing Yu Zheng-Jun Zha and Tat-Seng Chua 2012Answering opinion questions on products by exploit-ing hierarchical organization of consumer reviewsIn Proceedings of EMNLP 2012 pages 391ndash401ACL

Qian Yu and Wai Lam 2018 Review-aware answerprediction for product-related questions incorporat-ing aspects In Proceedings of WSDM 2018 pages691ndash699 ACM

252

A Supplemental Material

A1 Amazon-PQSim dataset

The Amazon-PQSim dataset includes questionpairs where all questions are published on Amazonwebsite Each pairs has a corresponding label 1 forsimilar else 0 The labels were collected via Ap-pen crowd sourcing service We took the followingfiltering steps (step b in 41) for each question

bull Removed any question with less than fivewords

bull Removed any question with more than 15words

bull Removed any none-English questions

bull Removed any question with multiple question-marks (may indicate multiple questions)

bull Removed questions with rare words (any wordwhich is not in the top 2000 most frequentwords)

A2 Amazon-PQA dataset

The Amazon-PQA dataset includes questions andtheir answers that are published on Amazon web-site along with the public product information andcategory (Amazon Browse Node name) The dataincludes the following fields

bull lsquoquestion_idlsquo

bull lsquoasin_idlsquo

bull lsquoquestion_textlsquo

bull lsquoanswer_textlsquo

bull lsquobullet_pointslsquo

bull lsquoproduct_descriptionlsquo

bull lsquobrand_namelsquo

bull lsquoitem_name lsquo

In addition we augment this data with fields de-rived from our current work

bull lsquois_yes-no_questionlsquo

bull lsquoyes-no_answerlsquo (yes no maybe)

YesNo Question Classification We followed(He and Dai 2011) for detecting YesNo questionsusing simple heuristics such as checking if thequestion starts with a Be verb (am is are beenbeing was were) Modal verb (can could shallshould will would may might) or an Auxiliaryverb (do did does have had has) and addition-ally ends with a question mark We tested theclassifier on McAuleyrsquos dataset (McAuley 2016)identified yesno questions with 984 precisionat 965 recall while considering McAuley asground truth17

YesNo Answer Labeling As described in Sec-tion A2 we used the McAuley dataset (McAuley2016) to train a RoBERTa-based classifier takingonly yesno questions For testing we used 5 an-notator to annotate 583 question-answer pairs ran-domly sampled from our raw data The modelachieved 97 and 88 precision for the yes andno labels respectively and a recall of 65 and51 on the entire test set18

Next to determine each questionrsquos final yesnoanswer we first omitted answers classified asmaybe When a question is answered by a veri-fied seller we considered it as most reliable andused it as the final label Otherwise we used themajority votes among the remaining answers Inour experiments we ignore questions with an equalnumber of yes and no answers

Dataset Split Each item in our dataset is a (prod-uct question answer) triplet We split the labeledtriplets into train (80) dev (10) and test (10)sets for each category relating to the number ofproducts Each product with all related questionsappear only in one of these sets The statistics forthis dataset are given in Table 8

A3 CPS Model DetailsThe CPS has a total of 2546M parameters Forall incorporated RoBERTa models we use a maxi-mum sequence length of 256 dropout of 01 anda 32 batch size for training We applied differ-ent learning rates and number of epochs for eachproduct-category The specific values we used aftertuning are shown in Table 9

17McAuley and Yang reported identifying yesno questionswith 97 precision at 82 recall on their dataset

18McAuley and Yang reported 98 accuracy after keepingonly the 50 of instances about which their classifier was themost confident

253

Category Products Questions YN Questions Answers Q Answered Yes Q Answered No

Light Bars 6151 48736 23956 95853 10146 5243Monitors 6246 170529 86837 316126 35353 22947Smartwatches 8611 166400 94936 289945 41683 22033Receivers 3442 58618 33511 135700 14488 7364Backpacks 12096 68598 38914 138996 19902 6090Jeans 38008 61908 17518 129346 7708 5548Beds 17202 108723 46722 238786 17757 13917Home Office Desks 6986 55303 23202 112958 9523 5971Masks 13384 51295 24989 100789 9757 5759Posters amp Prints 33803 53939 20737 99926 8171 5450Accessories 38825 238603 159272 438447 60990 23772

Rest of 89 Categories 1288754 8906362 4833639 18565933 2219022 1055816

Total 1473508 9989014 5404233 20662805 2454500 1179910

Table 7 Statistics of the Amazon-PQA dataset extracted from Amazoncom YN Questions as detected by our YesNoQuestion detector Answers is the total number of answers before any filtering Q Answers Yes (No) is the number of YesNoquestions with answers labeled as Yes (No) (See Section 42)

Train Set Dev Set Test Set TotalCategory P Q Yes P Q Yes P Q Yes P Q Yes

Light bars 2552 8675 681 319 1080 687 319 1296 692 3190 11051 683Monitors 3421 29886 633 427 3890 647 427 4260 630 4275 38036 634Smartwatches 4128 34734 685 516 3730 664 516 3778 678 5160 42242 683Receivers 1725 11991 692 215 1827 681 215 2356 657 2155 16174 685Backpacks 4834 14740 784 604 1397 759 604 1908 773 6042 18045 780Jeans 5365 6056 613 670 773 598 670 769 581 6705 7598 608Beds 5912 16792 591 739 2017 583 739 2276 583 7390 21085 589Home Office Desks 2576 8637 627 322 1059 643 322 962 629 3220 10658 629Masks 4332 8541 648 541 1180 640 541 1099 631 5414 10820 645Posters amp Prints 5739 7226 627 717 868 624 717 850 660 7173 8944 630Accessories 14422 54125 735 1802 6038 737 1802 6706 745 18026 66869 736

Total 55006 201403 665 6872 23859 660 6872 26260 660 68750 251522 664

Table 8 Statistics of the yes-no questions subset from the Amazon-PQA dataset and the train dev test splits used in ourexperiments Only categories used for our experiments are displayed Macro average

Category Epochs Learning Rate

Light Bars 3 30E-5Monitors 4 30E-5Smartwatches 3 30E-5Receivers 4 30E-5Backpacks 4 30E-5Jeans 3 20E-5Beds 4 40E-5Home Office Desks 3 20E-5Masks 3 30E-5Posters amp Prints 3 20E-5Accessories 3 20E-5

Table 9 Number of epochs and learning rates used for train-ing the CPS model (Section 52) on each category

Category η1 η2 η3

Light Bars 3 8 2Monitors 6 2 63Smartwatches 2 11 49Receivers 2 0 0Backpacks 1 4 17Jeans 7 8 22Beds 1 0 0Home Office Desks 4 1 38Masks 4 6 3Posters amp Prints 5 0 18Accessories 1 2 16

Table 10 η1 η2 and η3 values used after fine-tuning onour development set (Section 53) Larger η values givemore weight to the APC vote

Page 5: Answering Product-Questions by Utilizing Questions from ...

246

Question 1 Question 2 Label

Can this be used with regular light bulbs Can i put a regular light bulb in this SimilarCan i use these labels in a laser printer Can this be used in laser printer SimilarDoes the hat have an adjustable strap inside Does the hat adjust inside SimilarCan this organizer hold sleeved cards Can it hold cards with sleeves on them SimilarDoes this phone have bluetooth Does the phone have gps DifferentCan just two player play this game Whats the length of play of this game DifferentIs there a diffuser for this dryer Can this go in the dryer DifferentWhat material is the neck strap made of Does this come with a neck strap Different

Table 2 Examples from Amazon-PQSim Dataset Each example consists of a user-generated question pair and ahuman-annotated label for their similarity

34 Mixture of ExpertsA mixture of experts is a widely-used method tocombine the outputs of several classifiers by as-sociating a weighted confidence score with eachclassifier (Jacobs et al 1991) In our setting ex-perts are individual twins that lend support for oragainst a particular answer for a question Eachtwin is weighted by its contextual similarity to thetarget record rt as predicted by the CPS model

Given a target record rt the weight of each ofits twins rj isin T (rt) is determined by

λ(rj) = max(ψ2tj wmin)

where ψtj = CPS(rt rj) and 0 le wmin le 05is a lower weight-limit a hyper-parameter that wetune on the development set7

The predicted class of at is therefore derived by

Pred(at|rt) = sign

sumrjisinT (rt)

λ(rj)δ(aj)

(3)

where positivenegative Pred indicates lsquoyesrsquolsquonorsquo

respectively and δ(a) =

+1 a = lsquoyesrsquominus1 a = lsquonorsquo

Our methodology can be easily expanded to in-corporate more answer predictors (voters) of dif-ferent types into SimBA An example for such anexpansion is described at Section 53

4 Datasets

We introduce two new datasets to experiment withour answer prediction approach 1) The Ama-zon Product Question Similarity (Amazon-PQSim)dataset which is used to train our Q2Q model 2)The Amazon Product Question Answers (Amazon-PQA) dataset of product related QampAs used fortraining the SimBA model

7We tried using the CPS raw score for all twins iewmin = 0 however using a fine-tuned minimal weightyielded better results

41 Amazon-PQSim DatasetWe collected a first-of-a-kind question-to-questionsimilarity dataset of product-question pairs fromthe Amazon website (Amazon-PQSim See Table 2for examples) Unlike the Quora dataset of generalquestion pairs8 product questions are asked in thecontext of a designated product page This makesthem unique and different from questions askedin other domains For example the question Isit waterproof when appears on the Fitbit Flyerdetailed page should implicitly be interpreted as IsFitbit Flyer waterproof

The following steps were taken for the data col-lection (a) randomly sampling product-questionsfrom the Amazon website (b) filtering out someof these questions (eg non-English questions formore details see Appendix A) (c) For each ofthe remaining questions we retrieved up to threecandidate similar questions from the collectionA question is paired with the original question ifthe Jaccard similarity among them is in the rangeof [03 05] We ignore highly similar questions(gt 05) since we donrsquot want nearly verbatim pairsin our dataset as well as dissimilar pairs (lt 03)(d) Finally we used the Appen crowd-sourcing plat-form9 for manual annotation of question pairs sim-ilarity 10 Each question pair was labeled by atleast three judges and up to seven until reachingagreement of 70 or more

The above steps resulted in a nearly balanceddataset (108 positive-negative ratio) of more than180K product question pairs with judges agree-ment of 70 or more and among them about 90Kquestion pairs have perfect judges agreement (114

8httpswwwkagglecomcquora-question-pairs9httpsappencom

10As the questions are asked in context of a specific productthey are often written in an anaphoric form (eg Is it water-proof) To keep our dataset general we instructed the judgesto accept such questions as if they included the actual relatedproduct name For example the pair Is it waterproof and Isthis Fitbit waterproof were labeled as similar

247

positive-negative ratio)

42 Amazon-PQA DatasetWe collected a large corpus of product questionsand answers from the Amazon website similarto the popular Amazon QampA dataset (McAuley2016) Since our answer prediction method directlyutilizes an existing corpus of resolved questionswe aim to collect all available questions per nar-row sub-category instead of a sample of questionsacross broad categories by the popular AmazonQampA dataset For example instead of samplingfrom the broad Electronics category we collectall questions under the narrower Monitors and Re-ceivers categories

Raw Data Extraction We collected all prod-uct questions with their answers from 100 sub-categories available on the Amazon website inAugust 2020 Overall 10M questions were col-lected with 207M answers on 15M products Forfull statistics of the raw data see Table 7 in Ap-pendix A

YesNo Question Classification We followed(He and Dai 2011) for detecting YesNo questionsusing simple heuristics See Appendix A for de-tails

YesNo Answer Labeling Questions are typi-cally answered by free-text answers posted in-dependently by multiple users In order to con-vert these answers into a single yesno answerwe first classified each answer into one of threeclasses yes no and maybe and then used major-ity vote among the classified answers We used apre-trained RoBERTa-based classifier and trainedthe model on McAuleyrsquos dataset (McAuley 2016)taking only yesno questions See Appendix A fordetails

5 Experiments

We experiment with eleven product categories cov-ered by our Amazon-PQA dataset (Section 42)training a SimBA answer prediction model for eachof the categories independently Next we describethe data preparation steps for each of the SimBAcomponents

51 Data PreparationSibling Retrieval Using AKNN For each recordr isin C (C is the category dataset) we use AKNNto retrieve the top-K similar siblings fromC while

Majority Jaccard USE RoBERTa Q2Qvote similarity cosine cosine531 620 696 707 832

Table 3 Classification accuracy of question similarity mod-els

making sure that neither of them share the sameproduct with r We collect training example pairsby coupling each record r with each of its siblingsDprime(C) =

⋃riisinC(ri rj)|rj isin S(ri)

For retrieval we use Universal Sentence Encoder(USE) (Cer et al 2018) to embed each question qiinto a 512-length vector ei We use the Annoy11

python library for the implementation of efficientAKNN retrieval In all experiments for each recordwe retrieve the top-K (K = 500) similar recordsbased on the cosine-similarity between the embed-ding vectors

Twin Detection Using the Q2Q Model Foreach sibling pair (ri rj) isin Dprime(C) we use ourQ2Q model to score their question-similarity andkeep only those with Q2Q(qi qj) gt γ to yield acollection of twin pairs D(C) We use γ = 09 toensure only highly similar question pairs

For our Q2Q model we apply a standard pre-trained RoBERTa (Liu et al 2019) classifierSpecifically we use Hugging-Face base-uncasedpre-trained model12 and fine-tune13 it for the classi-fication task on our Q2Q dataset14 while splittingthe data into train dev and test sets with 80-10-10 partition respectively For γ = 05 (itsminimal value) the model achieves test accuracyof 832 with a precision of 813 and a recallof 877 When setting the twin confidence levelthreshold to γ = 09 the precision of the Q2Qmodel raises to 899 with a recall of 695

We compare the performance of the Q2Q simi-larity classifier with several unsupervised baselinesnamely (a) Jaccard similarity (b) cosine similar-ity over USE embedding and (c) cosine similarityover RoBERTa15 embedding The results are sum-marized in Table 3 showing that the Q2Q modelsignificantly outperforms these baselines

11httpsgithubcomspotifyannoy12httpsgithubcomhuggingfacetransformers13We use batch size 32 maximum sequence length of 128

learning rate 5e-5 and 3 epochs14We only used the examples with full agreement15Hugging-Face sentence-transformers roberta-large-nli-

stsb-mean-tokens model

248

Category Acc Majority ∆

Light Bars 739 611 +128Monitors 782 682 +99Smartwatches 800 656 +144Receivers 775 676 +99Backpacks 839 760 +79Jeans 713 593 +119Beds 846 720 +126Home Office Desks 739 634 +105Masks 751 669 +82Posters amp Prints 723 609 +115Accessories 791 724 +66

Macro Average 772 667 +97

Table 4 CPS model test set results on the CPS auxiliarydatasets and the majority baseline of each category

52 CPS Model

Training The CPS model predicts the contextualsimilarity between a pair of twin records In our ex-periments the textual content of a product consistsof the product title concatenated with the productbullet points separated by semicolons The ques-tion text is the original query as appeared in theAmazon PQA-dataset For the encoding modulesof the CPS model we use a standard pre-trainedRoBERTa-based model as well while using the[SEP ] token for separating the two inputs to eachencoder For training twin pairs are labeled accord-ing to their contextual similarity using Equation 2

We train, fine-tune, and test an independent CPS model for each category set C, using D(C), D_dev(C), and D_test(C) (details of the data split are described in Appendix A). The training set D(C) is created as described in Section 5.1. D_dev(C) and D_test(C) are created in the same way, with one modification: rather than retrieving the siblings for a record from the dataset it belongs to, the siblings are retrieved from D(C) for both D_dev(C) and D_test(C). This represents a real-world scenario where existing products, with their related questions, are used as a corpus for predicting the answer to a question about a new product. Each product, with all its related questions, appears in only one of these sets.

Evaluation. We evaluate the CPS model by measuring the accuracy of its contextual similarity prediction over D_test(C). The accuracy per category is presented in Table 4. The model achieves a relatively high accuracy, with a macro average of 77.2% over all categories, presenting a significant lift of 9.7% over the majority decision baseline. This is an encouraging result, considering the fact that the answers for many questions cannot be directly inferred from the product textual information. We conjecture that the model is able to learn the affinity between different products in the context of a given question for predicting their contextual similarity. For example, the two backpacks Ranvoo Laptop Backpack and Swiss Gear Bungee Backpack were correctly classified by the CPS model as similar (ψ ≥ 0.5) in the context of the question "Will this fit under a plane seat?", and classified as different (ψ < 0.5) in the context of the question "Does it have a separate laptop sleeve?".

5.3 Answer Prediction Methods

We experiment with our SimBA model and with a few baselines over the test set of all categories. The first one is Majority, which returns the majority answer among all records in the category. The other methods are described next.

SimBA. Given a target record r_t, SimBA scores each of its twins by the CPS model and predicts the answer for q_t using Equation 3. w_min was fine-tuned on the combined dev set of all categories and was set to 0.38.

Question Similarity Only (QSO). We modify the SimBA model to ignore the CPS classification score when implementing the Mixture-of-Experts model (Eq. 3), by setting an equal weight of 1.0 to all twin votes:

$$\mathrm{Pred}(a_t \mid r_t) = \mathrm{sign}\Big(\sum_{r_j \in T(r_t)} \delta(a_j)\Big)$$

Product Similarity Only (PSO). We modify the SimBA model by setting q_t and q_j to empty strings at the input of the CPS model, both during training and during inference, forcing it to rely on the products' textual content alone. The twin retrieval process remains untouched.

Answer Prediction Classifier (APC). We experiment with a direct prediction approach that only considers the product textual content and the question for answer prediction. For each category C, we fine-tune a pre-trained RoBERTa-based classifier over all records r_j ∈ C, using q_j and p_j (separated by the [SEP] token) as input, and δ(a_j) as the training label.

Question                                           Twins   Agreement   Answer
(Monitors) Does this require WiFi?                    51         91%       No
(Backpacks) Will it fit under a plane seat?          213         90%      Yes
(Smartwatches) Can it measure blood sugar level?      34         97%       No
(Smartwatches) Does it come with a charger?          269         96%      Yes

Table 5: Examples of popular questions with high answer agreement over different products.

Figure 3: Macro-average test accuracy over all categories. The horizontal axis indicates the minimal number of twins and the percentage of test questions each subset represents.

SimBA+APC. The experimental results show that different answer-prediction methods (e.g., SimBA vs. APC) may be preferable for different product categories. Therefore, we combine both methods, aiming for optimal results, by mixing the vote of APC with the twin votes using the Mixture-of-Experts approach:

$$\mathrm{Pred}(a_t \mid r_t) = \mathrm{sign}\Big(\eta(r_t)\,\delta(\alpha_t) + \sum_{r_j \in T(r_t)} \lambda(r_j)\,\delta(a_j)\Big)$$

where α_t is the APC-predicted answer, and η(r_t) = η_1, η_2, and η_3 for |T(r_t)| ≤ 10, 10 < |T(r_t)| < 50, and |T(r_t)| ≥ 50, respectively.^16 All η values (η > 0) are fine-tuned on the development set for each category separately. The values we used are detailed in Table 10 in Appendix A.
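In code, the combined vote can be sketched as follows. The data structures are illustrative: we assume δ maps a yes/no answer to ±1 (consistent with the sign-based prediction above), and that each twin arrives with its CPS-derived weight λ_j. Dropping the APC term and fixing all twin weights to 1 recovers the QSO predictor:

```python
def delta(answer: str) -> int:
    # Assumption: delta maps a yes/no answer to +1/-1.
    return 1 if answer == "yes" else -1

def eta_weight(num_twins: int, eta1: float, eta2: float, eta3: float) -> float:
    """Per-category APC weight, bucketed by the twin count |T(r_t)|."""
    if num_twins <= 10:
        return eta1
    return eta2 if num_twins < 50 else eta3

def predict(apc_answer: str, twin_votes, eta1, eta2, eta3) -> str:
    """twin_votes: list of (lambda_j, a_j) pairs for the twins in T(r_t)."""
    score = eta_weight(len(twin_votes), eta1, eta2, eta3) * delta(apc_answer)
    score += sum(w * delta(a) for w, a in twin_votes)
    return "yes" if score > 0 else "no"
```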

5.4 Answer Prediction Evaluation

The answer prediction accuracy results of all tested predictors, macro-averaged over D_test(C) of all categories, are presented in Figure 3. We inspect the performance of the methods on different subsets of the test data, where each subset is determined by all records having at least x twins, x ∈ [0, 130]. The horizontal axis indicates the minimal number of twins in the subset and the percentage of the data each subset represents. For example, the results at x = 0 represent the entire test set, while the results at x = 10 represent the subset of questions with at least 10 twins, which accounts for 40.2% of the test set.

The performance of Majority begins at 66% (the percentage of 'yes' questions in the entire population) and drops for questions with many twins. We hypothesize that obvious questions, for which the answer is the same across many products, are rarely asked, and hence have fewer twins. In contrast, informative questions, for which the answer varies across products, are frequently asked with respect to many products, and hence have many twins. Therefore, we see a drop in the accuracy of the Majority baseline as the number of twins grows.

^16 We also tried a few different splits on the development set.

The accuracy of QSO is significantly higher than the majority-vote baseline. This demonstrates an interesting phenomenon in the data: similar questions tend to have the same answer over a variety of products, typically of the same type. A few examples are presented in Table 5. The QSO method successfully detects these groups of questions and predicts the majority answer for each such group. We find that the PSO method generally doesn't improve over QSO. This is somewhat surprising, as we expected that using product similarity information, such as brand, model, or key features, would increase the prediction accuracy. This demonstrates the importance of the question context, as used in SimBA, in addition to the product information alone.

Moving to SimBA, we can see a large performance improvement over the QSO and PSO methods, which we attribute directly to the CPS model. We also see a consistent improvement in accuracy with the number of twins, likely due to the larger support the model has for predicting the answer.

The APC method, despite its relative simplicity, performs very well and greatly outperforms the majority-vote, QSO, and PSO baselines. For the segment of questions with fewer than 10 twins, APC outperforms the SimBA method; this segment represents roughly 60% of the questions. However, for the segment of questions with 60 or more twins, which accounts for 13.6% of the questions, the SimBA method consistently outperforms the inductive baseline by 1-2%. When inspecting the results by category, as shown in Table 6, we can see that, considering all questions with at least 1 twin, the APC method dominates in 7 out of the 11 categories, while for questions with at least 60 twins, the SimBA method dominates in 6 out of the 11 categories.

Finally, we see that the two approaches complement each other and can be effectively joined, as the SimBA+APC method outperforms both of them over all subsets.


                      Questions with 1+ Twins                 Questions with 60+ Twins
Category              % of data  SimBA  APC   SimBA+APC       % of data  SimBA  APC   SimBA+APC
Light Bars            62.5       75.3   74.8  77.7            9.0        74.1   68.1  75.0
Monitors              79.2       76.0   76.4  76.5            27.0       78.5   78.5  78.5
Smartwatches          88.3       77.3   79.3  79.0            31.9       79.9   81.7  80.8
Receivers             57.5       70.1   70.3  72.0            4.8        83.2   77.9  83.2
Backpacks             74.7       80.7   82.7  82.3            21.5       82.7   83.2  82.9
Jeans                 63.3       67.4   65.9  67.4            13.4       74.8   70.9  75.7
Beds                  70.4       77.7   76.4  77.9            16.8       82.5   80.4  82.5
Home & Office Desks   65.0       71.8   76.2  75.8            4.7        80.0   84.4  84.4
Masks                 76.5       70.9   74.2  73.0            4.2        80.4   82.6  80.4
Posters & Prints      60.1       73.4   73.4  74.4            9.3        75.9   69.6  75.9
Accessories           71.7       78.1   79.0  79.2            7.2        82.3   81.6  82.7

Macro Average         69.9       74.4   75.3  75.9            13.6       79.5   78.1  80.2

Table 6: Answer prediction accuracy by category. Left: accuracy over records with at least one twin, representing 69.9% of the records on average. Right: accuracy over records with at least 60 twins, representing 13.6% of the records. The highest accuracy between SimBA and APC is in bold.

6 Conclusions

We presented SimBA, a novel answer prediction approach in the PQA domain, which directly leverages similar questions answered with respect to other products. Our empirical evaluation shows that on some segments of questions, namely those with roughly ten or more similar questions in the corpus, our method can outperform a strong inductive method that directly utilizes the question and the textual product content. We further show that the two approaches are complementary and can be integrated to increase the overall answer prediction accuracy.

For future work, we plan to explore how SimBA can be extended and applied beyond yes-no questions, e.g., to questions with numerical answers or open-ended questions. Another interesting research direction is adding more voters to the Mixture-of-Experts model, such as a review-aware answer predictor or a product-details-based predictor. Additionally, our current evaluation considered a static view of the answered product-question corpus; we plan to explore temporal aspects of our method, for example, considering a question's age, or ignoring answers about obsolete products that might be irrelevant.

Acknowledgments

Ohad Rozen would like to express his gratitude to Yanai Elazar, Vered Shwartz, and Ido Dagan for providing him with valuable advice while he was conducting this research during his internship at Amazon.

References

David Carmel, Liane Lewin-Eytan, and Yoelle Maarek. 2018. Product question answering using customer generated content - research challenges. In SIGIR '18, pages 1349–1350. ACM.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Long Chen, Ziyu Guan, Wei Zhao, Wanqing Zhao, Xiaopeng Wang, Zhou Zhao, and Huan Sun. 2019. Answer identification from product reviews for user questions by multi-task attentive networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 45–52.

Lei Cui, Shaohan Huang, Furu Wei, Chuanqi Tan, Chaoqun Duan, and Ming Zhou. 2017. SuperAgent: A customer service chatbot for e-commerce websites. In Proceedings of ACL 2017, System Demonstrations, pages 97–102.

Manirupa Das, Zhen Wang, Evan Jaffe, Madhuja Chattopadhyay, Eric Fosler-Lussier, and Rajiv Ramnath. 2019. Learning to answer subjective, specific product-related queries using customer reviews by adversarial domain adaptation. CoRR, abs/1910.08270.

Yang Deng, Wenxuan Zhang, and Wai Lam. 2020. Opinion-aware answer generation for review-driven question answering in e-commerce. In Proceedings of CIKM 2020, pages 255–264. ACM.

Miao Fan, Chao Feng, Mingming Sun, Ping Li, and Haifeng Wang. 2019. Reading customer reviews to answer product-related questions. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 567–575. SIAM.

Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-aware answer generation in e-commerce question-answering. In Proceedings of WSDM 2019, pages 429–437. ACM.

Jing He and Decheng Dai. 2011. Summarization of yes/no questions using a feature function model. Volume 20 of Proceedings of Machine Learning Research, pages 351–366, South Garden Hotels and Resorts, Taoyuan, Taiwan. PMLR.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of CIKM 2005, pages 84–90.

T. M. Lai, T. Bui, N. Lipka, and S. Li. 2018. Supervised transfer learning for product information question answering. In Proceedings of ICMLA 2018, pages 1109–1114.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Julian McAuley. 2016. Amazon question/answer data. https://jmcauley.ucsd.edu/data/amazon/qa/.

Julian J. McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of WWW 2016, pages 625–635. ACM.

Dae Hoon Park, Hyun Duk Kim, ChengXiang Zhai, and Lifan Guo. 2015. Retrieval of relevant opinion sentences for new products. In Proceedings of SIGIR 2015, pages 393–402. ACM.

Fatemeh Pourgholamali. 2016. Mining information for the cold-item problem. In Proceedings of RecSys 2016, pages 451–454. ACM.

Mengting Wan and Julian McAuley. 2016. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Proceedings of ICDM 2016, pages 489–498. IEEE.

Jianfei Yu, Minghui Qiu, Jing Jiang, Jun Huang, Shuangyong Song, Wei Chu, and Haiqing Chen. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In Proceedings of WSDM 2018, pages 682–690. ACM.

Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua. 2012. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In Proceedings of EMNLP 2012, pages 391–401. ACL.

Qian Yu and Wai Lam. 2018. Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of WSDM 2018, pages 691–699. ACM.


A Supplemental Material

A.1 Amazon-PQSim dataset

The Amazon-PQSim dataset includes question pairs, where all questions are published on the Amazon website. Each pair has a corresponding label: 1 for similar, else 0. The labels were collected via the Appen crowd-sourcing service. We took the following filtering steps (step b in 4.1) for each question (a sketch of these filters appears after the list):

• Removed any question with less than five words.
• Removed any question with more than 15 words.
• Removed any non-English question.
• Removed any question with multiple question marks (may indicate multiple questions).
• Removed questions with rare words (any word which is not in the top 2,000 most frequent words).
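A possible implementation of these filters is sketched below; the non-English check and the common_words vocabulary are stand-ins for details the filtering description does not spell out:

```python
import re

def keep_question(question: str, common_words: set) -> bool:
    """Return True iff the question passes all PQSim filtering steps."""
    words = question.strip().rstrip("?").split()
    if len(words) < 5 or len(words) > 15:   # length filters
        return False
    if question.count("?") > 1:             # likely multiple questions
        return False
    if not question.isascii():              # crude non-English proxy (assumption)
        return False
    # Rare-word filter: every token must be in the top-2000 vocabulary.
    return all(re.sub(r"\W", "", w).lower() in common_words for w in words)
```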

A.2 Amazon-PQA dataset

The Amazon-PQA dataset includes questions and their answers that are published on the Amazon website, along with the public product information and category (Amazon Browse Node name). The data includes the following fields:

• 'question_id'
• 'asin_id'
• 'question_text'
• 'answer_text'
• 'bullet_points'
• 'product_description'
• 'brand_name'
• 'item_name'

In addition, we augment this data with fields derived from our current work:

• 'is_yes-no_question'
• 'yes-no_answer' (yes, no, maybe)

Yes/No Question Classification. We followed He and Dai (2011) for detecting yes/no questions, using simple heuristics such as checking whether the question starts with a Be verb (am, is, are, been, being, was, were), a Modal verb (can, could, shall, should, will, would, may, might), or an Auxiliary verb (do, did, does, have, had, has), and additionally ends with a question mark. We tested the classifier on McAuley's dataset (McAuley, 2016); it identified yes/no questions with 98.4% precision at 96.5% recall, while considering McAuley as ground truth.^17
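These heuristics are straightforward to implement; a sketch:

```python
# Heuristic yes/no question detector, following He and Dai (2011) as
# described above.
LEADING_VERBS = {
    "am", "is", "are", "been", "being", "was", "were",          # Be verbs
    "can", "could", "shall", "should", "will", "would",         # Modal verbs
    "may", "might",
    "do", "did", "does", "have", "had", "has",                  # Auxiliary verbs
}

def is_yes_no_question(question: str) -> bool:
    text = question.strip().lower()
    if not text.endswith("?"):
        return False
    tokens = text.split()
    return bool(tokens) and tokens[0] in LEADING_VERBS
```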

Yes/No Answer Labeling. As described in Section A.2, we used the McAuley dataset (McAuley, 2016) to train a RoBERTa-based classifier, taking only yes/no questions. For testing, we used five annotators to annotate 583 question-answer pairs randomly sampled from our raw data. The model achieved 97% and 88% precision for the yes and no labels, respectively, and a recall of 65% and 51% on the entire test set.^18

Next, to determine each question's final yes/no answer, we first omitted answers classified as maybe. When a question was answered by a verified seller, we considered that answer the most reliable and used it as the final label. Otherwise, we used the majority vote among the remaining answers. In our experiments, we ignore questions with an equal number of yes and no answers.
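This aggregation rule can be sketched as follows, assuming each answer record carries its classifier label and a verified-seller flag (both field names are illustrative):

```python
def final_label(answers):
    """answers: list of dicts such as {"label": "yes", "verified_seller": False}.
    Returns "yes", "no", or None (tie, ignored in the experiments)."""
    votes = [a for a in answers if a["label"] != "maybe"]  # drop 'maybe'
    for a in votes:
        if a.get("verified_seller"):   # a seller's answer is taken as-is
            return a["label"]
    yes = sum(1 for a in votes if a["label"] == "yes")
    no = len(votes) - yes
    if yes == no:
        return None
    return "yes" if yes > no else "no"
```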

Dataset Split. Each item in our dataset is a (product, question, answer) triplet. We split the labeled triplets into train (80%), dev (10%), and test (10%) sets for each category, according to the number of products. Each product, with all its related questions, appears in only one of these sets. The statistics for this dataset are given in Table 8.

A.3 CPS Model Details

The CPS has a total of 254.6M parameters. For all incorporated RoBERTa models, we use a maximum sequence length of 256, a dropout of 0.1, and a batch size of 32 for training. We applied different learning rates and numbers of epochs for each product category; the specific values we used after tuning are shown in Table 9.

^17 McAuley and Yang reported identifying yes/no questions with 97% precision at 82% recall on their dataset.

^18 McAuley and Yang reported 98% accuracy after keeping only the 50% of instances about which their classifier was the most confident.


Category                Products    Questions   Y/N Questions   Answers      Q. Answered Yes   Q. Answered No
Light Bars              6,151       48,736      23,956          95,853       10,146            5,243
Monitors                6,246       170,529     86,837          316,126      35,353            22,947
Smartwatches            8,611       166,400     94,936          289,945      41,683            22,033
Receivers               3,442       58,618      33,511          135,700      14,488            7,364
Backpacks               12,096      68,598      38,914          138,996      19,902            6,090
Jeans                   38,008      61,908      17,518          129,346      7,708             5,548
Beds                    17,202      108,723     46,722          238,786      17,757            13,917
Home Office Desks       6,986       55,303      23,202          112,958      9,523             5,971
Masks                   13,384      51,295      24,989          100,789      9,757             5,759
Posters & Prints        33,803      53,939      20,737          99,926       8,171             5,450
Accessories             38,825      238,603     159,272         438,447      60,990            23,772

Rest of 89 Categories   1,288,754   8,906,362   4,833,639       18,565,933   2,219,022         1,055,816

Total                   1,473,508   9,989,014   5,404,233       20,662,805   2,454,500         1,179,910

Table 7: Statistics of the Amazon-PQA dataset, extracted from Amazon.com. Y/N Questions: as detected by our yes/no question detector. Answers: the total number of answers before any filtering. Q. Answered Yes (No): the number of yes/no questions with answers labeled as Yes (No). (See Section 4.2.)

                     Train Set                 Dev Set                 Test Set                Total
Category             P       Q        %Yes    P      Q       %Yes    P      Q       %Yes    P       Q        %Yes
Light Bars           2,552   8,675    68.1    319    1,080   68.7    319    1,296   69.2    3,190   11,051   68.3
Monitors             3,421   29,886   63.3    427    3,890   64.7    427    4,260   63.0    4,275   38,036   63.4
Smartwatches         4,128   34,734   68.5    516    3,730   66.4    516    3,778   67.8    5,160   42,242   68.3
Receivers            1,725   11,991   69.2    215    1,827   68.1    215    2,356   65.7    2,155   16,174   68.5
Backpacks            4,834   14,740   78.4    604    1,397   75.9    604    1,908   77.3    6,042   18,045   78.0
Jeans                5,365   6,056    61.3    670    773     59.8    670    769     58.1    6,705   7,598    60.8
Beds                 5,912   16,792   59.1    739    2,017   58.3    739    2,276   58.3    7,390   21,085   58.9
Home Office Desks    2,576   8,637    62.7    322    1,059   64.3    322    962     62.9    3,220   10,658   62.9
Masks                4,332   8,541    64.8    541    1,180   64.0    541    1,099   63.1    5,414   10,820   64.5
Posters & Prints     5,739   7,226    62.7    717    868     62.4    717    850     66.0    7,173   8,944    63.0
Accessories          14,422  54,125   73.5    1,802  6,038   73.7    1,802  6,706   74.5    18,026  66,869   73.6

Total                55,006  201,403  66.5*   6,872  23,859  66.0*   6,872  26,260  66.0*   68,750  251,522  66.4*

Table 8: Statistics of the yes/no questions subset of the Amazon-PQA dataset and the train/dev/test splits used in our experiments. Only the categories used in our experiments are displayed. *Macro average.

Category             Epochs   Learning Rate
Light Bars           3        3.0e-5
Monitors             4        3.0e-5
Smartwatches         3        3.0e-5
Receivers            4        3.0e-5
Backpacks            4        3.0e-5
Jeans                3        2.0e-5
Beds                 4        4.0e-5
Home Office Desks    3        2.0e-5
Masks                3        3.0e-5
Posters & Prints     3        2.0e-5
Accessories          3        2.0e-5

Table 9: Number of epochs and learning rates used for training the CPS model (Section 5.2) on each category.

Category             η1   η2   η3
Light Bars           3    8    2
Monitors             6    2    63
Smartwatches         2    11   49
Receivers            2    0    0
Backpacks            1    4    17
Jeans                7    8    22
Beds                 1    0    0
Home Office Desks    4    1    38
Masks                4    6    3
Posters & Prints     5    0    18
Accessories          1    2    16

Table 10: η1, η2, and η3 values used after fine-tuning on our development set (Section 5.3). Larger η values give more weight to the APC vote.




YesNo Answer Labeling As described in Sec-tion A2 we used the McAuley dataset (McAuley2016) to train a RoBERTa-based classifier takingonly yesno questions For testing we used 5 an-notator to annotate 583 question-answer pairs ran-domly sampled from our raw data The modelachieved 97 and 88 precision for the yes andno labels respectively and a recall of 65 and51 on the entire test set18

Next to determine each questionrsquos final yesnoanswer we first omitted answers classified asmaybe When a question is answered by a veri-fied seller we considered it as most reliable andused it as the final label Otherwise we used themajority votes among the remaining answers Inour experiments we ignore questions with an equalnumber of yes and no answers

Dataset Split Each item in our dataset is a (prod-uct question answer) triplet We split the labeledtriplets into train (80) dev (10) and test (10)sets for each category relating to the number ofproducts Each product with all related questionsappear only in one of these sets The statistics forthis dataset are given in Table 8

A3 CPS Model DetailsThe CPS has a total of 2546M parameters Forall incorporated RoBERTa models we use a maxi-mum sequence length of 256 dropout of 01 anda 32 batch size for training We applied differ-ent learning rates and number of epochs for eachproduct-category The specific values we used aftertuning are shown in Table 9

17McAuley and Yang reported identifying yesno questionswith 97 precision at 82 recall on their dataset

18McAuley and Yang reported 98 accuracy after keepingonly the 50 of instances about which their classifier was themost confident

253

Category Products Questions YN Questions Answers Q Answered Yes Q Answered No

Light Bars 6151 48736 23956 95853 10146 5243Monitors 6246 170529 86837 316126 35353 22947Smartwatches 8611 166400 94936 289945 41683 22033Receivers 3442 58618 33511 135700 14488 7364Backpacks 12096 68598 38914 138996 19902 6090Jeans 38008 61908 17518 129346 7708 5548Beds 17202 108723 46722 238786 17757 13917Home Office Desks 6986 55303 23202 112958 9523 5971Masks 13384 51295 24989 100789 9757 5759Posters amp Prints 33803 53939 20737 99926 8171 5450Accessories 38825 238603 159272 438447 60990 23772

Rest of 89 Categories 1288754 8906362 4833639 18565933 2219022 1055816

Total 1473508 9989014 5404233 20662805 2454500 1179910

Table 7 Statistics of the Amazon-PQA dataset extracted from Amazoncom YN Questions as detected by our YesNoQuestion detector Answers is the total number of answers before any filtering Q Answers Yes (No) is the number of YesNoquestions with answers labeled as Yes (No) (See Section 42)

Train Set Dev Set Test Set TotalCategory P Q Yes P Q Yes P Q Yes P Q Yes

Light bars 2552 8675 681 319 1080 687 319 1296 692 3190 11051 683Monitors 3421 29886 633 427 3890 647 427 4260 630 4275 38036 634Smartwatches 4128 34734 685 516 3730 664 516 3778 678 5160 42242 683Receivers 1725 11991 692 215 1827 681 215 2356 657 2155 16174 685Backpacks 4834 14740 784 604 1397 759 604 1908 773 6042 18045 780Jeans 5365 6056 613 670 773 598 670 769 581 6705 7598 608Beds 5912 16792 591 739 2017 583 739 2276 583 7390 21085 589Home Office Desks 2576 8637 627 322 1059 643 322 962 629 3220 10658 629Masks 4332 8541 648 541 1180 640 541 1099 631 5414 10820 645Posters amp Prints 5739 7226 627 717 868 624 717 850 660 7173 8944 630Accessories 14422 54125 735 1802 6038 737 1802 6706 745 18026 66869 736

Total 55006 201403 665 6872 23859 660 6872 26260 660 68750 251522 664

Table 8 Statistics of the yes-no questions subset from the Amazon-PQA dataset and the train dev test splits used in ourexperiments Only categories used for our experiments are displayed Macro average

Category Epochs Learning Rate

Light Bars 3 30E-5Monitors 4 30E-5Smartwatches 3 30E-5Receivers 4 30E-5Backpacks 4 30E-5Jeans 3 20E-5Beds 4 40E-5Home Office Desks 3 20E-5Masks 3 30E-5Posters amp Prints 3 20E-5Accessories 3 20E-5

Table 9 Number of epochs and learning rates used for train-ing the CPS model (Section 52) on each category

Category η1 η2 η3

Light Bars 3 8 2Monitors 6 2 63Smartwatches 2 11 49Receivers 2 0 0Backpacks 1 4 17Jeans 7 8 22Beds 1 0 0Home Office Desks 4 1 38Masks 4 6 3Posters amp Prints 5 0 18Accessories 1 2 16

Table 10 η1 η2 and η3 values used after fine-tuning onour development set (Section 53) Larger η values givemore weight to the APC vote

Page 8: Answering Product-Questions by Utilizing Questions from ...

249

Question                                            #Twins   Agreement   Answer
(Monitors) Does this require WiFi?                    51        91%        No
(Backpacks) Will it fit under a plane seat?          213        90%        Yes
(Smartwatches) Can it measure blood sugar level?      34        97%        No
(Smartwatches) Does it come with a charger?          269        96%        Yes

Table 5: Examples of popular questions with high answer agreement over different products.

Figure 3: Macro-average test accuracy over all categories. The horizontal axis indicates the minimal number of twins and the percentage of test questions each subset represents.

the vote of APC with the twin votes using the Mixture-of-Experts approach:

$$\mathrm{Pred}(a_t \mid r_t) = \mathrm{sign}\Big(\eta(r_t)\,\delta(\alpha_t) + \sum_{r_j \in T(r_t)} \lambda(r_j)\,\delta(a_j)\Big)$$

where $\alpha_t$ is the APC-predicted answer, and $\eta(r_t) = \eta_1$, $\eta_2$, or $\eta_3$ for $|T(r_t)| \le 10$, $10 < |T(r_t)| < 50$, and $|T(r_t)| \ge 50$, respectively.[16] All $\eta$ values ($\eta > 0$) are fine-tuned on the development set for each category separately. The values we used are detailed in Table 10 in Appendix A.

[16] We also tried a few different splits on the development set.
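For concreteness, the following is a minimal sketch of this vote combination, assuming $\delta$ maps a yes/no answer to $\pm 1$ and that each twin arrives with a precomputed similarity weight standing in for $\lambda(r_j)$; all names and signatures here are illustrative, not the actual implementation.

```python
def delta(answer: str) -> int:
    # Assumed encoding: a "yes" answer votes +1, a "no" answer votes -1.
    return 1 if answer == "yes" else -1

def eta(num_twins: int, eta1: float, eta2: float, eta3: float) -> float:
    # Choose eta_1, eta_2, or eta_3 by the size of the twin set T(r_t),
    # per the thresholds in the equation above.
    if num_twins <= 10:
        return eta1
    if num_twins < 50:
        return eta2
    return eta3

def moe_predict(apc_answer: str, twins, eta1: float, eta2: float, eta3: float) -> str:
    """Mixture-of-Experts vote: the APC expert plus one vote per twin.

    `twins` holds (answer, weight) pairs, where the weight plays the role
    of lambda(r_j), the twin's contextual-similarity score.
    """
    score = eta(len(twins), eta1, eta2, eta3) * delta(apc_answer)
    score += sum(weight * delta(answer) for answer, weight in twins)
    return "yes" if score > 0 else "no"
```

For example, `moe_predict("yes", [("no", 0.9), ("no", 0.7)], eta1=3, eta2=8, eta3=2)` returns "yes": with only two twins, $\eta_1$ applies and the APC vote dominates the two weighted "no" votes.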

5.4 Answer Prediction Evaluation

The answer prediction accuracy results of all tested predictors, macro-averaged over $D_{\text{test}}(C)$ of all categories, are presented in Figure 3. We inspect the performance of the methods on different subsets of the test data, where each subset is determined by all records having at least $x$ twins, $x \in [0, 130]$. The horizontal axis indicates the minimal number of twins in the subset and the percentage of the data each subset represents. For example, the results at $x = 0$ represent the entire test set, while the results at $x = 10$ represent the subset of questions with at least 10 twins, which accounts for 40.2% of the test set.
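A minimal sketch of this nested-subset evaluation, assuming each test record carries its twin count alongside predicted and gold labels (the field names are hypothetical):

```python
def accuracy_by_min_twins(records, max_twins=130):
    """Accuracy over nested test subsets: records with at least x twins.

    Each record is assumed to be a dict with keys "num_twins", "pred",
    and "gold"; returns (x, accuracy, fraction_of_data) triples.
    """
    results = []
    for x in range(max_twins + 1):
        subset = [r for r in records if r["num_twins"] >= x]
        if not subset:
            break
        acc = sum(r["pred"] == r["gold"] for r in subset) / len(subset)
        results.append((x, acc, len(subset) / len(records)))
    return results
```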

The performance of Majority begins at 66% (the percentage of 'yes' questions in the entire population) and drops for questions with many twins. We hypothesize that obvious questions, for which the answer is the same across many products, are rarely asked, hence have fewer twins. In contrast, informative questions, for which the answer varies across products, are frequently asked with respect to many products, hence have many twins. Therefore, we see a drop in the accuracy of the Majority baseline as the number of twins grows.

The accuracy of QSO is significantly higher than the majority-vote baseline. This demonstrates an interesting phenomenon in the data: similar questions tend to have the same answer over a variety of products, typically of the same type. A few examples are presented in Table 5. The QSO method successfully detects these groups of questions and predicts the majority answer for each such group. We find that the PSO method generally does not improve over QSO. This is somewhat surprising, as we expected that using product-similarity information, such as brand, model, or key features, would increase the prediction accuracy. This demonstrates the importance of the question context, as used in SimBA, in addition to the product information alone.

Moving to SimBA, we can see a large performance improvement over the QSO and PSO methods, which we attribute directly to the CPS model. We also see a consistent improvement in accuracy with the number of twins, likely due to the larger support the model has for predicting the answer.

The APC method, despite its relative simplicity, performs very well and greatly outperforms the majority-vote, QSO, and PSO baselines. For the segment of questions with fewer than 10 twins, APC outperforms the SimBA method. This segment represents roughly 60% of the questions. However, for the segment of questions with 60 or more twins, which accounts for 13.6% of the questions, the SimBA method consistently outperforms the inductive baseline by 1-2%. When inspecting the results by category, as shown in Table 6, we can see that, considering all questions with at least 1 twin, the APC method dominates in 7 out of the 11 categories, while for questions with at least 60 twins, the SimBA method dominates in 6 out of the 11 categories.

Finally, we see that the two approaches complement each other and can be effectively joined, as the SimBA+APC method outperforms both of them over all subsets.


                         Questions with 1+ Twins                 Questions with 60+ Twins
Category                 % of data  SimBA  APC   SimBA+APC       % of data  SimBA  APC   SimBA+APC
Light bars                  62.5    75.3   74.8    77.7             9.0     74.1   68.1    75.0
Monitors                    79.2    76.0   76.4    76.5            27.0     78.5   78.5    78.5
Smartwatches                88.3    77.3   79.3    79.0            31.9     79.9   81.7    80.8
Receivers                   57.5    70.1   70.3    72.0             4.8     83.2   77.9    83.2
Backpacks                   74.7    80.7   82.7    82.3            21.5     82.7   83.2    82.9
Jeans                       63.3    67.4   65.9    67.4            13.4     74.8   70.9    75.7
Beds                        70.4    77.7   76.4    77.9            16.8     82.5   80.4    82.5
Home & office desks         65.0    71.8   76.2    75.8             4.7     80.0   84.4    84.4
Masks                       76.5    70.9   74.2    73.0             4.2     80.4   82.6    80.4
Posters & prints            60.1    73.4   73.4    74.4             9.3     75.9   69.6    75.9
Accessories                 71.7    78.1   79.0    79.2             7.2     82.3   81.6    82.7
Macro Average               69.9    74.4   75.3    75.9            13.6     79.5   78.1    80.2

Table 6: Answer prediction accuracy (%) by category. Left: accuracy over records with at least one twin, representing 69.9% of the records on average. Right: accuracy over records with at least 60 twins, representing 13.6% of the records. The highest accuracy between SimBA and APC is in bold.

6 Conclusions

We presented SimBA, a novel answer prediction approach in the PQA domain, which directly leverages similar questions answered with respect to other products. Our empirical evaluation shows that on some segments of questions, namely those with roughly ten or more similar questions in the corpus, our method can outperform a strong inductive method that directly utilizes the question and the textual product content. We further show that the two approaches are complementary and can be integrated to increase the overall answer prediction accuracy.

For future work, we plan to explore how SimBA can be extended and applied beyond yes-no questions, e.g., to questions with numerical answers or open-ended questions. Another interesting research direction is adding more voters to the Mixture-of-Experts model, such as a review-aware answer predictor or a product-details-based predictor. Additionally, our current evaluation considered a static view of the answered product-question corpus; we plan to explore temporal aspects of our method, for example, considering question age, or ignoring answers for obsolete products that might be irrelevant.

Acknowledgments

Ohad Rozen would like to express his gratitude to Yanai Elazar, Vered Shwartz, and Ido Dagan for providing him valuable advice while he was conducting this research during his internship at Amazon.

References

David Carmel, Liane Lewin-Eytan, and Yoelle Maarek. 2018. Product question answering using customer generated content - research challenges. In SIGIR '18, pages 1349–1350. Association for Computing Machinery.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Long Chen, Ziyu Guan, Wei Zhao, Wanqing Zhao, Xiaopeng Wang, Zhou Zhao, and Huan Sun. 2019. Answer identification from product reviews for user questions by multi-task attentive networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 45–52.

Lei Cui, Shaohan Huang, Furu Wei, Chuanqi Tan, Chaoqun Duan, and Ming Zhou. 2017. SuperAgent: A customer service chatbot for e-commerce websites. In Proceedings of ACL 2017, System Demonstrations, pages 97–102.

Manirupa Das, Zhen Wang, Evan Jaffe, Madhuja Chattopadhyay, Eric Fosler-Lussier, and Rajiv Ramnath. 2019. Learning to answer subjective, specific product-related queries using customer reviews by adversarial domain adaptation. CoRR, abs/1910.08270.

Yang Deng, Wenxuan Zhang, and Wai Lam. 2020. Opinion-aware answer generation for review-driven question answering in e-commerce. In Proceedings of CIKM 2020, pages 255–264. Association for Computing Machinery.

Miao Fan, Chao Feng, Mingming Sun, Ping Li, and Haifeng Wang. 2019. Reading customer reviews to answer product-related questions. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 567–575. SIAM.

Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-aware answer generation in e-commerce question-answering. In Proceedings of WSDM 2019, pages 429–437. ACM.

Jing He and Decheng Dai. 2011. Summarization of yes/no questions using a feature function model. In Proceedings of Machine Learning Research, volume 20, pages 351–366, South Garden Hotels and Resorts, Taoyuan, Taiwan. PMLR.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of CIKM 2005, pages 84–90.

T. M. Lai, T. Bui, N. Lipka, and S. Li. 2018. Supervised transfer learning for product information question answering. In Proceedings of ICMLA 2018, pages 1109–1114.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Julian McAuley. 2016. Amazon question/answer data. https://jmcauley.ucsd.edu/data/amazon/qa/.

Julian J. McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of WWW 2016, pages 625–635. ACM.

Dae Hoon Park, Hyun Duk Kim, ChengXiang Zhai, and Lifan Guo. 2015. Retrieval of relevant opinion sentences for new products. In Proceedings of SIGIR 2015, pages 393–402. ACM.

Fatemeh Pourgholamali. 2016. Mining information for the cold-item problem. In Proceedings of RecSys 2016, pages 451–454. ACM.

Mengting Wan and Julian McAuley. 2016. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Proceedings of ICDM 2016, pages 489–498. IEEE.

Jianfei Yu, Minghui Qiu, Jing Jiang, Jun Huang, Shuangyong Song, Wei Chu, and Haiqing Chen. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In Proceedings of WSDM 2018, pages 682–690. ACM.

Jianxing Yu, Zheng-Jun Zha, and Tat-Seng Chua. 2012. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In Proceedings of EMNLP 2012, pages 391–401. ACL.

Qian Yu and Wai Lam. 2018. Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of WSDM 2018, pages 691–699. ACM.


A Supplemental Material

A.1 Amazon-PQSim dataset

The Amazon-PQSim dataset includes question pairs, where all questions are published on the Amazon website. Each pair has a corresponding label: 1 for similar, else 0. The labels were collected via the Appen crowdsourcing service. We took the following filtering steps (step (b) in Section 4.1) for each question; a code sketch of these filters follows the list.

• Removed any question with fewer than five words.

• Removed any question with more than 15 words.

• Removed any non-English questions.

• Removed any question with multiple question marks (which may indicate multiple questions).

• Removed questions with rare words (any word which is not among the 2,000 most frequent words).
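A minimal sketch of these filters, assuming a lowercased vocabulary of the 2,000 most frequent words; the non-English screen here is a crude ASCII check standing in for whatever language filter was actually applied:

```python
def keep_question(question: str, top_words: set) -> bool:
    """Candidate filters for Amazon-PQSim (illustrative sketch)."""
    words = question.lower().split()
    if len(words) < 5 or len(words) > 15:   # length bounds
        return False
    if question.count("?") > 1:             # likely multiple questions
        return False
    if not question.isascii():              # crude non-English screen
        return False
    # Rare-word filter: every word must be among the frequent words.
    return all(w.strip("?!.,:;'\"()-") in top_words for w in words)
```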

A.2 Amazon-PQA dataset

The Amazon-PQA dataset includes questions and their answers that are published on the Amazon website, along with the public product information and category (Amazon Browse Node name). The data includes the following fields:

• `question_id`

• `asin_id`

• `question_text`

• `answer_text`

• `bullet_points`

• `product_description`

• `brand_name`

• `item_name`

In addition, we augment this data with fields derived from our current work:

• `is_yes-no_question`

• `yes-no_answer` (yes / no / maybe)

Yes/No Question Classification. We followed He and Dai (2011) for detecting yes/no questions, using simple heuristics, such as checking whether the question starts with a be verb (am, is, are, been, being, was, were), a modal verb (can, could, shall, should, will, would, may, might), or an auxiliary verb (do, did, does, have, had, has), and additionally ends with a question mark. We tested the classifier on McAuley's dataset (McAuley, 2016), and identified yes/no questions with 98.4% precision at 96.5% recall, while considering McAuley as ground truth.[17]
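A sketch of this heuristic detector; the verb lists are taken from the paragraph above, while the tokenization is illustrative:

```python
BE = {"am", "is", "are", "been", "being", "was", "were"}
MODAL = {"can", "could", "shall", "should", "will", "would", "may", "might"}
AUX = {"do", "did", "does", "have", "had", "has"}
LEADING_VERBS = BE | MODAL | AUX

def is_yes_no_question(question: str) -> bool:
    # Starts with a be/modal/auxiliary verb and ends with a question mark.
    tokens = question.lower().split()
    return bool(tokens) and tokens[0] in LEADING_VERBS \
        and question.rstrip().endswith("?")
```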

Yes/No Answer Labeling. As described in Section A.2, we used the McAuley dataset (McAuley, 2016) to train a RoBERTa-based classifier, taking only yes/no questions. For testing, we used five annotators to annotate 583 question-answer pairs randomly sampled from our raw data. The model achieved 97% and 88% precision for the yes and no labels, respectively, and a recall of 65% and 51% on the entire test set.[18]

Next, to determine each question's final yes/no answer, we first omitted answers classified as maybe. When a question is answered by a verified seller, we considered it as most reliable and used it as the final label. Otherwise, we used the majority vote among the remaining answers. In our experiments, we ignore questions with an equal number of yes and no answers.
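The labeling logic might look as follows, assuming each answer arrives as a (label, is_verified_seller) pair; the representation is hypothetical:

```python
def final_label(answers):
    """Resolve one yes/no label per question.

    `answers` is a list of (label, is_verified_seller) pairs,
    with label in {"yes", "no", "maybe"}.
    """
    answers = [(lbl, v) for lbl, v in answers if lbl != "maybe"]  # drop 'maybe'
    for lbl, verified in answers:
        if verified:                  # a verified-seller answer wins outright
            return lbl
    yes = sum(lbl == "yes" for lbl, _ in answers)
    no = sum(lbl == "no" for lbl, _ in answers)
    if yes == no:                     # ties are discarded in our experiments
        return None
    return "yes" if yes > no else "no"
```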

Dataset Split. Each item in our dataset is a (product, question, answer) triplet. We split the labeled triplets into train (80%), dev (10%), and test (10%) sets for each category, with respect to the number of products. Each product, with all of its related questions, appears in only one of these sets. The statistics for this dataset are given in Table 8.
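A sketch of such a product-level split, assuming triplets keyed by ASIN; the seeding and exact bucketing are illustrative rather than the procedure actually used:

```python
import random

def split_by_product(triplets, seed=0):
    """80/10/10 split at the product level, so each product (with all
    of its questions) lands in exactly one of train/dev/test."""
    products = sorted({asin for asin, _, _ in triplets})
    random.Random(seed).shuffle(products)
    n = len(products)
    train_p = set(products[: int(0.8 * n)])
    dev_p = set(products[int(0.8 * n): int(0.9 * n)])
    buckets = {"train": [], "dev": [], "test": []}
    for asin, question, answer in triplets:
        key = "train" if asin in train_p else "dev" if asin in dev_p else "test"
        buckets[key].append((asin, question, answer))
    return buckets
```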

A.3 CPS Model Details

The CPS has a total of 254.6M parameters. For all incorporated RoBERTa models, we use a maximum sequence length of 256, dropout of 0.1, and a batch size of 32 for training. We applied different learning rates and numbers of epochs for each product category. The specific values we used after tuning are shown in Table 9.

[17] McAuley and Yang reported identifying yes/no questions with 97% precision at 82% recall on their dataset.

[18] McAuley and Yang reported 98% accuracy after keeping only the 50% of instances about which their classifier was the most confident.


Category                 Products    Questions   Y/N Questions     Answers   Q Answered Yes   Q Answered No
Light Bars                  6,151       48,736          23,956      95,853           10,146           5,243
Monitors                    6,246      170,529          86,837     316,126           35,353          22,947
Smartwatches                8,611      166,400          94,936     289,945           41,683          22,033
Receivers                   3,442       58,618          33,511     135,700           14,488           7,364
Backpacks                  12,096       68,598          38,914     138,996           19,902           6,090
Jeans                      38,008       61,908          17,518     129,346            7,708           5,548
Beds                       17,202      108,723          46,722     238,786           17,757          13,917
Home Office Desks           6,986       55,303          23,202     112,958            9,523           5,971
Masks                      13,384       51,295          24,989     100,789            9,757           5,759
Posters & Prints           33,803       53,939          20,737      99,926            8,171           5,450
Accessories                38,825      238,603         159,272     438,447           60,990          23,772
Rest of 89 Categories   1,288,754    8,906,362       4,833,639  18,565,933        2,219,022       1,055,816
Total                   1,473,508    9,989,014       5,404,233  20,662,805        2,454,500       1,179,910

Table 7: Statistics of the Amazon-PQA dataset, extracted from Amazon.com. Y/N Questions: questions detected by our yes/no question detector. Answers: the total number of answers before any filtering. Q Answered Yes (No): the number of yes/no questions with answers labeled as Yes (No). (See Section 4.2.)

                       Train Set              Dev Set               Test Set              Total
Category               P      Q      %Yes    P     Q      %Yes    P     Q      %Yes    P      Q       %Yes
Light bars             2552   8675   68.1    319   1080   68.7    319   1296   69.2    3190   11051   68.3
Monitors               3421   29886  63.3    427   3890   64.7    427   4260   63.0    4275   38036   63.4
Smartwatches           4128   34734  68.5    516   3730   66.4    516   3778   67.8    5160   42242   68.3
Receivers              1725   11991  69.2    215   1827   68.1    215   2356   65.7    2155   16174   68.5
Backpacks              4834   14740  78.4    604   1397   75.9    604   1908   77.3    6042   18045   78.0
Jeans                  5365   6056   61.3    670   773    59.8    670   769    58.1    6705   7598    60.8
Beds                   5912   16792  59.1    739   2017   58.3    739   2276   58.3    7390   21085   58.9
Home Office Desks      2576   8637   62.7    322   1059   64.3    322   962    62.9    3220   10658   62.9
Masks                  4332   8541   64.8    541   1180   64.0    541   1099   63.1    5414   10820   64.5
Posters & Prints       5739   7226   62.7    717   868    62.4    717   850    66.0    7173   8944    63.0
Accessories            14422  54125  73.5    1802  6038   73.7    1802  6706   74.5    18026  66869   73.6
Total                  55006  201403 66.5    6872  23859  66.0    6872  26260  66.0    68750  251522  66.4

Table 8: Statistics of the yes-no questions subset of the Amazon-PQA dataset, and the train/dev/test splits used in our experiments. Only categories used for our experiments are displayed. %Yes values in the Total row are macro averages.

Category              Epochs   Learning Rate
Light Bars            3        3.0E-5
Monitors              4        3.0E-5
Smartwatches          3        3.0E-5
Receivers             4        3.0E-5
Backpacks             4        3.0E-5
Jeans                 3        2.0E-5
Beds                  4        4.0E-5
Home Office Desks     3        2.0E-5
Masks                 3        3.0E-5
Posters & Prints      3        2.0E-5
Accessories           3        2.0E-5

Table 9: Number of epochs and learning rates used for training the CPS model (Section 5.2) on each category.

Category              η1   η2   η3
Light Bars            3    8    2
Monitors              6    2    63
Smartwatches          2    11   49
Receivers             2    0    0
Backpacks             1    4    17
Jeans                 7    8    22
Beds                  1    0    0
Home Office Desks     4    1    38
Masks                 4    6    3
Posters & Prints      5    0    18
Accessories           1    2    16

Table 10: η1, η2, and η3 values used after fine-tuning on our development set (Section 5.3). Larger η values give more weight to the APC vote.
