Web-Search Ranking with Initialized Gradient Boosted Regression Trees
Ananth Mohan Zheng Chen Kilian Weinberger
mohana@wustl.edu zheng.chen@wustl.edu kilian@wustl.edu
Department of Computer Science & Engineering
Washington University in St. Louis
St. Louis, MO 63130, USA
Agenda
• Introduction
• Past Work
• Proposed Approach
• Introduce RF
• Introduce GBRT
• iGBRT
• Results for iGBRT
• Classification vs. Regression
• Statistics of the data sets
• Final Results
• Conclusion
Introduction
• Learning to Rank Challenge
• Given a query, documents have to be ranked according to their relevance to the query.
• Point-wise, lightweight approach.
• A machine learning algorithm is trained to predict the relevance from the feature vector; at test time the documents are ranked according to these predictions.
• We investigate Random Forests (RF) as a low-cost alternative to Gradient Boosted Regression Trees (GBRT). RF yields surprisingly accurate ranking results, comparable to or better than GBRT.
Introduction (Cont.)
• We combine the two algorithms by first learning a ranking function with RF and using it as initialization for GBRT.
• We refer to this setting as iGBRT.
Past Work
The past years have seen many different approaches to web-search ranking:
• Adaptations of support vector machines (Joachims, 2002; Chapelle and Keerthi, 2010)
• Neural networks (Burges et al., 2005)
• Gradient boosted regression trees (GBRT) (Zheng et al., 2007b)
• GBRT has arguably become the state-of-the-art learning paradigm (Li et al., 2007; Gao et al., 2009; Burges, 2010)
Proposed Approach
• Notation and data set
• Introduce RF.
• Introduce GBRT.
• Check the results of RF and GBRT
• Both algorithms are combined as initialized gradient boosted regression trees (iGBRT).
• Check the results with iGBRT
Notations
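We assume data of triples $D = \{(x_1, q_1, y_1), \dots, (x_n, q_n, y_n)\}$
$x_i$ = document, $q_i$ = query, $y_i$ = label
In the point-wise setting the query is dropped: $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$
$T(\cdot)$ = trained predictor
$\mathrm{Cart}(S, k, d) \approx \operatorname{argmin}_{h \in T_d} \sum_{(z_i, r_i) \in S} (h(z_i) - r_i)^2$
with $S \subseteq D$, $k < f$, $d > 0$
$T_d$ = set of all CART trees of maximum depth $d$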
Data Set
The Yahoo Learning to Rank Challenge was based on two data sets:
• Set 1 = 473134 documents
• Set 2 = 19944 documents
In addition: five folds of the Microsoft MSLR data set.
Random forest
• The fundamental concept underlying Random Forests is bagging.
• In bagging, a learning algorithm is applied multiple times to different subsets of D and the results are averaged.
• Random Forests is essentially bagging applied to CART with full depth (d = ∞), where at each split only k uniformly chosen features are evaluated to find the best splitting point.
• The construction of a single tree is independent of earlier trees, which makes Random Forests an inherently parallel algorithm.
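The bullets above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`build_tree`, `predict_tree`, `random_forest`), the bootstrap sampling with replacement, and the toy data are assumptions made for illustration. Each tree is grown to full depth on a resampled subset of D, and each split evaluates only k randomly chosen features.

```python
import random

def build_tree(X, y, k, rng):
    """Grow a full-depth regression tree; each split looks at k random features."""
    if len(set(y)) == 1:
        return y[0]                                   # pure leaf
    best = None
    for f in rng.sample(range(len(X[0])), k):
        # Dropping the largest value guarantees both sides of a split are non-empty.
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            ml = sum(y[i] for i in left) / len(left)
            mr = sum(y[i] for i in right) / len(right)
            err = (sum((y[i] - ml) ** 2 for i in left)
                   + sum((y[i] - mr) ** 2 for i in right))
            if best is None or err < best[0]:
                best = (err, f, t, left, right)
    if best is None:                                  # sampled features are constant
        return sum(y) / len(y)
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], k, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], k, rng))

def predict_tree(tree, row):
    """Walk from the root to a leaf; leaves are plain numbers."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def random_forest(X, y, n_trees=20, k=1, seed=0):
    """Average n_trees trees, each trained on a resampled copy of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]      # sample with replacement
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx], k, rng))
    return lambda row: sum(predict_tree(t, row) for t in trees) / len(trees)

# Toy data: two features, relevance labels 0 or 4.
X = [[0, 0], [1, 0], [2, 1], [3, 1]]
y = [0.0, 0.0, 4.0, 4.0]
rf = random_forest(X, y)
```

Because no tree depends on another, the loop in `random_forest` could be distributed across workers without changing the result for a fixed set of samples.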
Gradient Boosted Regression Trees
• Gradient Boosted Regression Trees is also based on tree averaging.
• GBRT sequentially adds small trees (d = 4).
• In each iteration, the new tree focuses on the documents that are responsible for the current remaining regression error.
• $T(x_i)$ = current prediction for sample $x_i$.
• We assume a continuous loss function $L(T(x_1), \dots, T(x_n))$ that reaches its minimum if $T(x_i) = y_i$ for all $x_i$.
• Throughout the paper we use the square loss: $L = \frac{1}{2} \sum_{i=1}^{n} (T(x_i) - y_i)^2$.
• Gradient step: $T(x_i) \leftarrow T(x_i) - \alpha \frac{\partial L}{\partial T(x_i)}$, where $\alpha$ is the learning rate and $L$ the square loss.
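As a small worked example of the update rule above: under the square loss the negative gradient with respect to T(x_i) is simply the residual y_i − T(x_i). The sketch below applies the step directly to the training predictions; in GBRT itself a small tree is fit to these residuals and its output is added instead. Function names and numbers are illustrative assumptions.

```python
def square_loss(pred, y):
    """L = 1/2 * sum_i (T(x_i) - y_i)^2."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, y))

def negative_gradient(pred, y):
    """-dL/dT(x_i) = y_i - T(x_i), i.e. the current residual."""
    return [t - p for p, t in zip(pred, y)]

def gradient_step(pred, y, alpha):
    """T(x_i) <- T(x_i) - alpha * dL/dT(x_i)."""
    return [p + alpha * r for p, r in zip(pred, negative_gradient(pred, y))]

y = [0.0, 2.0, 4.0]
pred = [1.0, 1.0, 1.0]
loss_before = square_loss(pred, y)          # 5.5
pred = gradient_step(pred, y, alpha=0.5)    # [0.5, 1.5, 2.5]
loss_after = square_loss(pred, y)           # 1.375
```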
GBRT vs. RF for various settings of the learning rate α
Why iGBRT
Why not GBRT only?
• In each iteration the gradient is only approximated.
• For true convergence, the learning rate α needs to be infinitesimally small,
• which requires an unrealistically large number of iterations ($M_B \gg 0$).
Why initialize with RF?
• RF is known to be very resistant towards overfitting and therefore makes a good optimization starting point.
• RF is insensitive to parameter settings and does not require additional parameter tuning.
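The idea can be sketched as ordinary gradient boosting that starts from a given predictor instead of from zero. The code below is a simplified illustration, not the authors' algorithm: it uses depth-1 trees on a single feature, and `rf_stand_in` is a hypothetical placeholder for a trained Random Forest.

```python
def fit_stump(x, r):
    """Depth-1 regression tree on one feature, fit to targets r by square loss."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def boost(x, y, init, alpha=0.1, rounds=50):
    """Gradient boosting under square loss, starting from the predictor `init`."""
    trees = []
    pred = [init(xi) for xi in x]
    for _ in range(rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]   # negative gradient
        h = fit_stump(x, residual)
        trees.append(h)
        pred = [pi + alpha * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: init(xi) + alpha * sum(h(xi) for h in trees)

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 3.0, 4.0]
rf_stand_in = lambda xi: xi                               # hypothetical RF output
gbrt = boost(x, y, init=lambda xi: 0.0, rounds=5)         # plain GBRT, starts at 0
igbrt = boost(x, y, init=rf_stand_in, rounds=5)           # initialized with "RF"
```

With a small learning rate and few rounds, the run started from the stand-in predictor ends with a much lower training loss than the run started from zero, which is the motivation given on this slide.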
Initialized Gradient Boosted Regression Trees
Results with iGBRT
Classification vs. Regression
• All our algorithms used regression to approximate the relevance of a document.
• Li et al. (2007) proposed a learning to rank paradigm that is based on classification instead of regression.
• Instead of learning a function $T(x_i) \approx y_i$, the authors utilize the fact that the original relevance scores are discrete, $y_i \in \{0, 1, 2, 3, 4\}$.
• This yields four binary classification problems indexed by $c = 1, \dots, 4$.
• The $c$-th classification problem predicts whether the document is less relevant than $c$.
• We carefully choose classifiers $T_c(\cdot)$ that return well-defined probabilities (i.e. $0 < T_c(\cdot) < 1$).
• If we define the constant functions $T_0(\cdot) = 0$ and $T_5(\cdot) = 1$, we can combine all classifiers $T_0, \dots, T_5$ to compute the probability that a document $x_i$ has a relevance of $r \in \{0, \dots, 4\}$: $P(\mathrm{rel}(x_i) = r) = P(\mathrm{rel}(x_i) < r+1) - P(\mathrm{rel}(x_i) < r) = T_{r+1}(x_i) - T_r(x_i)$
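A short sketch of how the classifier outputs combine. The input list holds hypothetical outputs T_1(x), …, T_4(x) for one document; turning the distribution into a single ranking score via the expected relevance is one natural choice and is an assumption here.

```python
def relevance_distribution(cumulative):
    """cumulative[c-1] = T_c(x), an estimate of P(rel(x) < c) for c = 1..4."""
    t = [0.0] + list(cumulative) + [1.0]           # T_0 = 0 and T_5 = 1
    return [t[r + 1] - t[r] for r in range(5)]     # P(rel(x) = r) for r = 0..4

def expected_relevance(cumulative):
    """Collapse the distribution into one score usable for ranking."""
    return sum(r * p for r, p in enumerate(relevance_distribution(cumulative)))

outputs = [0.25, 0.5, 0.75, 0.875]                 # hypothetical classifier outputs
dist = relevance_distribution(outputs)             # [0.25, 0.25, 0.25, 0.125, 0.125]
score = expected_relevance(outputs)                # 1.625
```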
Statistics of the data sets.
Performance of GBRT, RF and iGBRT with ERR
Performance of GBRT, RF and iGBRT with NDCG
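For reference, the two metrics named on these result slides can be computed as follows. This is a sketch using the commonly used definitions (gain 2^g − 1, log2 position discount for NDCG, stopping probability (2^g − 1)/2^g_max for ERR); the exact variants and cut-offs used for the reported numbers are not stated in the slides.

```python
import math

def err(grades, max_grade=4):
    """Expected Reciprocal Rank for relevance grades listed in ranked order."""
    score, p_continue = 0.0, 1.0
    for rank, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / 2 ** max_grade       # chance the user stops here
        score += p_continue * stop / rank
        p_continue *= 1 - stop
    return score

def dcg(grades, k=None):
    """Discounted cumulative gain over the first k positions (all if k is None)."""
    grades = grades[:k] if k else grades
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades, k=None):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0
```

Both functions take the true relevance grades of the documents in the order the model ranked them, so a better ranking gives a higher value.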
Conclusion
• We compared three algorithms with regression and classification settings.
• RF uses the same parameter settings throughout the paper and outperforms GBRT.
• To further refine the results of RF, we introduced iGBRT.
• We demonstrated that classification tends to be a better paradigm for web-search ranking than regression.
• iGBRT in a classification setting consistently achieves state-of-the-art performance on all publicly available web-search data sets that we are aware of.
References
1. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.
2. https://www.youtube.com/watch?v=D_2LkhMJcfY&t=223s
3. https://www.youtube.com/watch?v=DCZ3tsQIoGU&t=146s
4. http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf
5. https://www.youtube.com/watch?v=ErDgauqnTHk
(Group J)
Seminar Data Analytics I
International Masters Program in Data Analytics
University of Hildesheim
Summer Semester 2018
Famakin Olawole Taiwo
Mining Text Snippets For Images On The Web
Kannan, A., Baker, S., Ramnath, K., Fiss, J., Lin, D., Vanderwende, L., & Wang, X.J. (2014)
In the proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM
Outline
• Introduction
• Related Work
• Snippet Mining Algorithm
• Evaluation of Snippet
• Applications
• Conclusion
• References
Introduction
What is an image?
An image is a binary representation of visual information such as drawings, pictures, graphs, logos, or individual video frames.
Text mining
Text mining is the process of examining large collections of written resources to generate new information and to transform unstructured text into structured data for use in further analysis. It identifies facts, relationships, and assertions that would otherwise remain buried in the mass of textual big data.
Introduction
Making more sense of this
• Harness power of text mining
• Top k snippet algorithm (proposed)
• Gain relevant and interesting information regarding an image
Introduction
Focus
• Present and implement a text mining algorithm.
• Obtain relevant and useful text snippets regarding images on the web.
• Show applications built on the snippets this algorithm produces.
*Note that these stories are generally not contained in the image captions (which are most often just descriptive), but the captions can help identify the most interesting stories.*
Related Works
Image caption generation
• Y. Mori, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing and vector quantizing images with words.
• G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions.
• R. Mason and E. Charniak. Annotation of online shopping images without labelled training examples.
This line of work focuses on associating word tags with images.
Related Works
Document Summarization
• O. Buyukkokten, H. Garcia-Molina, and A. Paepcke. Seeing the whole in parts: text summarization for web browsing on handheld devices.
• W. T. Chuang and J. Yang. Extracting sentence segments for text summarization: a machine learning approach.
• J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction.
This line of work focuses on summarizing documents by identifying key phrases and sentences that reflect the focus of the document.
Snippet Mining Algorithm
This algorithm is based on the notion that if an image is interesting, many people will embed it and write about it on their websites, blogs, and articles.
For each image, we mine the web for all webpages containing it in order to identify text snippets that are relevant, interesting, and together form a diverse set.
This leads to clustering these images into near-duplicate groups (image sets, or duplicate image sets).
{MURL, PURL, HTML}
Snippet Mining Algorithm
Scalable Image Set Identification
The goal here is to cluster images so that each cluster consists of images that are near duplicate to each other.
To achieve this, we adopt a two-step clustering method using hashing techniques within a MapReduce framework. The method is designed:
• to cover large variation within a duplicate image cluster while minimizing false positives;
• to scale to clustering billions of images on the web.
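The slides do not say which hashes or distance thresholds are used, so the following is only a toy sketch of the general two-step pattern under assumed details: images are represented by hypothetical 16-bit integer signatures, step 1 groups them by a coarse hash prefix (the "map" side), and step 2 merges images within a bucket whose signatures differ in only a few bits (the "reduce" side).

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")

def cluster_near_duplicates(signatures, prefix_bits=8, total_bits=16, max_dist=2):
    """signatures: {image_id: int}. Returns a list of clusters of image ids."""
    # Step 1 (map): coarse bucketing by the leading bits of the signature.
    buckets = defaultdict(list)
    for image_id, sig in signatures.items():
        buckets[sig >> (total_bits - prefix_bits)].append(image_id)
    # Step 2 (reduce): within each bucket, group by small Hamming distance.
    clusters = []
    for ids in buckets.values():
        groups = []
        for image_id in ids:
            for g in groups:
                if hamming(signatures[image_id], signatures[g[0]]) <= max_dist:
                    g.append(image_id)
                    break
            else:
                groups.append([image_id])
        clusters.extend(groups)
    return clusters

signatures = {
    "a": 0b1010101000000000,
    "b": 0b1010101000000001,   # one bit away from "a"
    "c": 0b0101010111111111,   # unrelated image
}
clusters = cluster_near_duplicates(signatures)
```

The coarse first step keeps each bucket small enough for pairwise comparison, which is what makes this pattern workable at large scale.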
Snippet Mining Algorithm
Forming candidate snippets
As stated earlier, an image set is represented by {MURL, PURL, HTML}.
In addition, we parse the HTML to obtain a linear ordering of the text and image nodes (WPURL).
For each text node in WPURL, a candidate snippet is generated.
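A minimal sketch of this parsing step, using Python's standard `html.parser` module. The class and function names are made up for illustration, and treating every text node as one candidate snippet is a simplification of whatever segmentation the original system applies.

```python
from html.parser import HTMLParser

class NodeLinearizer(HTMLParser):
    """Collect text and image nodes in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip = 0                     # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            a = dict(attrs)
            self.nodes.append(("image", a.get("src") or "", a.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())      # normalize whitespace
        if text and not self._skip:
            self.nodes.append(("text", text))

def linearize(html):
    """Return the page's text and image nodes as one ordered list."""
    parser = NodeLinearizer()
    parser.feed(html)
    parser.close()
    return parser.nodes

def candidate_snippets(nodes):
    """One candidate snippet per text node."""
    return [n[1] for n in nodes if n[0] == "text"]

page = '<p>Hello <b>world</b></p><img src="a.jpg" alt="A cat"><p>After</p>'
nodes = linearize(page)
```

The image entries keep the `src` and `alt` attributes, so the position of the target image relative to each text node is available to later steps.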
Snippet Mining Algorithm
Forming candidate snippets
Representing images
For each image node that corresponds to MURL, we extract its associated alt or src text.
<MPURL, LPURL>
Snippet Mining Algorithm
Top K Snippet Selection
The objective function of the problem: given an image, the probability of a set of top snippets is the product of their relevance and their interestingness.
Snippet Mining Algorithm
Top K Snippet Selection
This term regularizes the objective function to reduce overfitting.
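The formula itself did not survive in this transcript, so the following is only an illustrative sketch of top-k selection under assumed details: each candidate is scored by relevance × interestingness, and a hypothetical penalty `lam` on word overlap with already chosen snippets stands in for the regularizer, so that the selected set stays diverse.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two snippets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def top_k_snippets(candidates, k, lam=0.5):
    """candidates: list of (text, relevance, interestingness), scores in [0, 1]."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def gain(c):
            # Penalize overlap with the most similar snippet chosen so far.
            overlap = max((jaccard(c[0], s[0]) for s in chosen), default=0.0)
            return c[1] * c[2] - lam * overlap
        best = max(pool, key=gain)
        chosen.append(best)
        pool.remove(best)
    return [c[0] for c in chosen]

candidates = [
    ("the old tower at night", 0.9, 0.8),
    ("the old tower at night photo", 0.9, 0.7),
    ("built as a gateway for a world fair", 0.6, 0.9),
]
picked = top_k_snippets(candidates, k=2)
```

In this toy input the second candidate has the second-highest raw score but is nearly identical to the first, so the greedy step prefers the third one.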
Snippet Mining Algorithm
Relevance and interestingness
Snippet Mining Algorithm
Measure of spam
When a snippet contains a lot of repeated words, it is less likely to be relevant or interesting.
Linguistic features: the interestingness of a sentence often depends on its linguistic structure. We use four linguistic features:
(1) the length of the sentence, with the intuition that longer sentences are more interesting;
(2) whether the sentence is demonstrative, i.e., begins with "this" or "these";
(3) whether the sentence is first person, i.e., begins with "I" or "we"; and
(4) whether the sentence is definitional, i.e., begins with a pronoun and then includes the word "is" or "are" afterwards.
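These features are simple enough to sketch directly. The tokenization, the pronoun list, and the exact form of the repetition measure below are assumptions made for illustration; the slides only describe the features informally.

```python
import re

DEMONSTRATIVES = {"this", "these"}
FIRST_PERSON = {"i", "we"}
# Assumed pronoun list for the "definitional" test.
PRONOUNS = {"it", "he", "she", "they", "this", "that", "these", "those"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def repetition_ratio(snippet):
    """Share of tokens that repeat an earlier token (a simple spam signal)."""
    words = tokenize(snippet)
    return 1 - len(set(words)) / len(words) if words else 0.0

def linguistic_features(sentence):
    """The four linguistic features listed above, for one sentence."""
    words = tokenize(sentence)
    first = words[0] if words else ""
    return {
        "length": len(words),
        "demonstrative": first in DEMONSTRATIVES,
        "first_person": first in FIRST_PERSON,
        "definitional": first in PRONOUNS
                        and any(w in ("is", "are") for w in words[1:]),
    }

features = linguistic_features("This is the tallest tower in the region.")
spam = repetition_ratio("buy buy buy now")      # 0.5
```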
Evaluation of Snippet Mining Algorithm
For evaluation, the results of the proposed algorithm are compared against two baseline methods:
• Query by Image and Webpage Summarization
• Im2Text using Visual Features
Reason for comparison
*No prior work on extracting a set of text snippets for an image on the web*
Evaluation of Snippet Mining
Query-by-Image and Webpage Summarization (QbI/WS)
• Finds all occurrences of an image on the web
• Applies webpage summarization to generate snippets
Specifically, we compare against the approach adopted by images.google.com.
Evaluation of Snippet Mining
Im2Text using Visual Features
• Matches the image to a database of 1 million Flickr images with captions
• Transfers the captions from its best matches
Specifically, we compare against the approach of Ordonez, Kulkarni, and Berg in "Im2Text: Describing Images Using 1 Million Captioned Photographs".
Evaluation of Snippet Mining
Dataset
Selected popular images on the web :
• The top 10,000 textual queries were run on a popular search engine.
• We picked 50 images from the top-ranking results returned.
Evaluation of Snippet Mining Algorithm (Results)
Coverage of images
• People
• Products
• Arts and Culture
• Music and Movies
• Travel
• Science
• Personal Photos
• Foreign Language
• Commercial
• Icons
Examples of common types of images for which our algorithm either finds or does not find enough high-quality text snippets.
Applications
The database of text snippets produced by this algorithm enables several possible applications.
• The snippets can be used to improve image search relevance.
• They might also be used to filter more interesting images from the less interesting ones.
Proposed Applications
• Web Image Augmentation
• Semantic Image Browsing
Applications
Web Image Augmentation:
*Bing bar Plugin*
Applications
Semantic Image Browsing:
Conclusion
1) We have presented a scalable mining algorithm to obtain a set of text snippets for images on the web.
2) The snippets can be displayed along with image search results.
3) Applications can feed off the generated snippets to enhance their functionality.
4) Snippet data can be useful for improving image search relevance.
5) The algorithm cannot return relevant snippets in languages other than English.
Future works
1) To analyse the snippets in more detail, for example by clustering, to find groups of related images. The results could be used to broaden the set of snippets and concepts associated with an image, possibly leading to a deeper understanding of the content of the images and more interesting browsing experiences.
2) This algorithm can be improved to return snippets regarding personal images of people, to aid countries that do not have a system in place (database of people).
References
Anon, 2017. linguamatics.com. [Online] Available at: https://www.linguamatics.com/what-is-text-mining-nlp-machine-learning [Accessed 22 03 2018].
Anon, 2018. computerhope.com. [Online] Available at: https://www.computerhope.com/jargon/i/image.htm [Accessed 24 01 2018].
Christopher J. O, B., G, B. & Jurisica, I., 2013. Data integration in the life sciences. Berlin: Springer.
Kannan, A. et al., 2014. Mining Text Snippets For Images On The Web.
Smart Reply: Automated Response Suggestion for Email
Authors: Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, Vivek Ramavajjala
Agenda
1. Introduction
2. Related work
3. Selecting Responses
3.1. LSTM model
3.2. Challenges
4. Response Set Generation
4.1. Semantic intent clustering
5. Suggestion Diversity
6. Results
7. Conclusions
8. References
1. Introduction
• Provide text assistance for email reply composition.
• Targeted at mobile.
• Responses can be sent on their own.
2. Related Work
• Extracting meaning from previous message.
• Generating language.
• Grammatical transformation between responses.
• Matching style/tone.
3. Model
• Sequence-to-sequence learning model.
• First proposed in the context of machine translation.
• Recurrent neural networks (encoder-decoder)
RNN (encoder-decoder)
4. Training
• Training data is a corpus of email reply pairs.
• Both encoder and decoder are trained together (end to end).
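End-to-end training of this kind is commonly written as maximizing the log-probability of each reply given the message it answers. The following is a sketch in standard sequence-to-sequence notation; the symbols o (incoming message), r = (r_1, ..., r_m) (reply tokens), D (the corpus of email reply pairs) and θ (the parameters shared by encoder and decoder) are assumptions introduced here, not notation from the slides.

$$
P(r_1,\dots,r_m \mid o;\theta) = \prod_{i=1}^{m} P(r_i \mid o, r_1,\dots,r_{i-1};\theta)
$$

$$
\theta^{*} = \arg\max_{\theta} \sum_{(o,\,r)\in D} \log P(r \mid o;\theta)
$$

Because a single objective covers both networks, gradients flow from the decoder's token predictions back through the encoder, which is what "trained together" amounts to under these assumptions.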
Top Responses.
5. Challenges.
• Response quality
• How to ensure that the individual response options are always high quality in language and content.
• Utility
• How to select multiple options to show a user so as to maximize the likelihood that one is chosen.
• Scalability
• How to efficiently process millions of messages per day while remaining within the latency requirements of an email delivery system.
• Privacy
• How to develop this system without ever inspecting the data except aggregate statistics.
6. Semantic Intent Clustering
• Partition all response messages into “semantic” clusters.
• All messages within a cluster share the same semantic meaning.
• For Example:
• “Ha ha” and “oh that’s funny!” are associated with the funny cluster.
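A minimal sketch of the partition idea, using a hand-written lookup table in place of a learned clustering. Only the "funny" pair comes from the slide; the other responses, the cluster names, and the helper `build_clusters` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical response -> cluster labels.
# Only the two "funny" entries are taken from the slide.
RESPONSE_TO_CLUSTER = {
    "Ha ha": "funny",
    "Oh that's funny!": "funny",
    "Thanks!": "thanks",
    "Thank you so much.": "thanks",
    "Sounds good.": "agree",
}


def build_clusters(response_to_cluster):
    """Group responses by label; each response lands in exactly one cluster."""
    clusters = defaultdict(list)
    for response, label in response_to_cluster.items():
        clusters[label].append(response)
    return dict(clusters)


clusters = build_clusters(RESPONSE_TO_CLUSTER)
for label, members in clusters.items():
    print(label, members)
```

Since every response maps to a single label, the clusters are disjoint and together cover the whole response set, which is what makes this a partition.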
Response Message.
Diversity.
• The LSTM first processes an incoming message and then selects the best responses.
• Responses are highly rated together.
• The job of the diversity component is to select a more varied set of suggestions.
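One plausible way to realize this step, sketched under assumptions: each candidate arrives with a model score and a semantic cluster label, and the selector keeps only the best-scoring candidate from each cluster until it has enough suggestions. The function name `select_diverse`, the scores, the cluster labels, and the limit of three suggestions are hypothetical and not taken from the slides.

```python
def select_diverse(candidates, max_suggestions=3):
    """Pick top-scoring responses, allowing at most one per semantic cluster.

    candidates: list of (response, score, cluster) tuples.
    """
    chosen = []
    used_clusters = set()
    # Visit candidates from highest to lowest score.
    for response, score, cluster in sorted(
        candidates, key=lambda c: c[1], reverse=True
    ):
        if cluster in used_clusters:
            continue  # an equivalent intent has already been selected
        chosen.append(response)
        used_clusters.add(cluster)
        if len(chosen) == max_suggestions:
            break
    return chosen


# Made-up scores and clusters, for illustration only.
candidates = [
    ("Sure, I'll be there.", 0.91, "accept"),
    ("Yes, I can make it.", 0.90, "accept"),
    ("Sorry, I can't make it.", 0.40, "decline"),
    ("What time?", 0.35, "ask_time"),
    ("Count me in!", 0.88, "accept"),
]
suggestions = select_diverse(candidates)
print(suggestions)
```

Without the cluster check, the three highest scores here would all be "accept" variants; with it, lower-scored but distinct intents are surfaced instead.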
Diversity Selection
Diversity Result.
Deployment and Coverage.
• This feature is deployed in Inbox by Gmail.
• It is used to assist with more than 10% of all mobile replies.
Conclusions.
• A sequence-to-sequence model produces plausible email replies in many common scenarios when trained on an email corpus.
• Smart Reply is deployed in Inbox by Gmail and assists with more than 10% of mobile replies.
• A novel end-to-end system for automatically generating short, complete email responses.
• The core of the system is a state-of-the-art deep LSTM model that can predict full responses, given an incoming email message.
References.