WWW2012 Tutorial on "Integrating and Ranking Aggregated Content on the Web"
Fernando Diaz, Jaime Arguello, and Milad Shokouhi

Transcript

Page 1: Www2012 tutorial content_aggregation

Integrating and Ranking Aggregated Content on the Web

Jaime Arguello (UNC Chapel Hill), Fernando Diaz (Yahoo! Labs), Milad Shokouhi (Microsoft Research)

April 17, 2012

1 / 169

Page 2: Www2012 tutorial content_aggregation

Outline

Introduction

History of Content Aggregation

Problem Definition

Sources of Evidence

Modeling

Evaluation

Special Topics in Aggregation

Future Directions

2 / 169

Page 3: Www2012 tutorial content_aggregation

Introduction

3 / 169

Page 4: Www2012 tutorial content_aggregation

Problem Definition

“Aggregating content from different sources.”

In aggregated search, content is retrieved from verticals in response to queries.

4 / 169

Page 5: Www2012 tutorial content_aggregation

Examples: Aggregated Search

5 / 169

Page 6: Www2012 tutorial content_aggregation

Examples: Personalized Search

6 / 169

Page 7: Www2012 tutorial content_aggregation

Examples: Metasearch

7 / 169

Page 8: Www2012 tutorial content_aggregation

Examples: Content Aggregation

8 / 169

Page 9: Www2012 tutorial content_aggregation

Examples: Content Aggregation

9 / 169

Page 10: Www2012 tutorial content_aggregation

Examples: Content Aggregation

10 / 169

Page 11: Www2012 tutorial content_aggregation

History of Content Aggregation

11 / 169

Page 12: Www2012 tutorial content_aggregation

Pre-Computer Content Aggregation

Were newspapers the first content aggregation media?

12 / 169

Page 13: Www2012 tutorial content_aggregation

Pre-Web Content Aggregation

Pre-Web systems were mostly used for content filtering rather than aggregation.

Related reading: Jennings and Higuchi [14]

13 / 169

Page 14: Www2012 tutorial content_aggregation

Web 1.0 Content Aggregation

Manual content aggregation dates back to the early days of the World Wide Web.

14 / 169

Page 15: Www2012 tutorial content_aggregation

Web 1.0 Content Aggregation

The beginning of automatic content aggregation and news recommendation on the web.

Related reading: Kamba et al. [15]

15 / 169

Page 16: Www2012 tutorial content_aggregation

Content Aggregation Today: Explore/Exploit

• Explore/Exploit: “What to exploit is easy, but what to explore is where the secret sauce comes in,” says Ramakrishnan.

• Clickthrough rates on Yahoo! articles improved by 160% after personalized aggregation.

• Many content aggregators, such as The New York Times, use a hybrid approach.

Related reading: Krakovsky [16]

16 / 169

Page 17: Www2012 tutorial content_aggregation

Content Aggregation Today: Real-time and Geospatial

• real-time updates
• geo-spatial signals

Related reading: Zhuang et al. [33]

17 / 169

Page 18: Www2012 tutorial content_aggregation

Content Aggregation Today: Temporal and Social

• “Semantic analysis of content is more useful when no early click-through information is available.”

• Social signals are more influential for trending topics.

Related reading: Szabo and Huberman [32]

18 / 169

Page 19: Www2012 tutorial content_aggregation

History of Aggregated Search

19 / 169

Page 20: Www2012 tutorial content_aggregation

Stone Age: Federated Search

Also known as distributed information retrieval (DIR), federated search allows users to search multiple sources (collections, databases) at once and receive merged results. Typically, in federated search:

• Collections contain text documents only.
• There is no overlap among collections.
• Still common in enterprise and digital libraries.

20 / 169

Page 21: Www2012 tutorial content_aggregation

Federated Search Architecture

21 / 169

Page 22: Www2012 tutorial content_aggregation

Example: The European Library

22 / 169

Page 23: Www2012 tutorial content_aggregation

Federated Search Environments

• Cooperative
  • The broker has comprehensive information about the contents of each collection.
  • Collection sizes and lexicon statistics are usually known to the broker.
• Uncooperative
  • Collections are not willing to publish their information.
  • The broker uses query-based sampling [7] to approximate the lexicon statistics for each collection.

23 / 169

Page 24: Www2012 tutorial content_aggregation

Federated Search Challenges

• Query translation
• Source representation
• Source selection
• Result merging

Related reading: Shokouhi and Si [25]

24 / 169

Page 25: Www2012 tutorial content_aggregation

Query Translation Example: STARTS Protocol Query

Related reading: Gravano et al. [13]

25 / 169

Page 26: Www2012 tutorial content_aggregation

Source Representation: Document Sampling

• Query-based Document Sampling (see the sketch after this slide)
  • Content-driven Sampling
    • Issue a random term to the vertical
    • Sample top results
    • Update the vertical-specific vocabulary representation
    • Sample a new term from the emerging representation
    • Repeat
  • Demand-driven Sampling
    • Sample a query from the vertical-specific query-log
    • Sample top results
    • Repeat

Related Reading: Callan and Connell [7] and Shokouhi et al. [26]

26 / 169
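A minimal sketch of the content-driven sampling loop in Python; the `search_vertical(term, k)` interface and the term-selection scheme are illustrative assumptions, not part of the original protocol description.

```python
import random
from collections import Counter

def query_based_sample(search_vertical, seed_term, n_iters=100, top_k=4):
    """Content-driven query-based sampling loop (sketch).

    search_vertical(term, k) is an assumed interface returning the
    top-k result documents (as text) for a single-term probe query.
    """
    vocab = Counter([seed_term])          # emerging vocabulary representation
    sampled = []
    for _ in range(n_iters):
        # sample the next probe term from the emerging representation
        term = random.choice(list(vocab.elements()))
        for doc in search_vertical(term, top_k):
            if doc not in sampled:
                sampled.append(doc)                 # sample top results
                vocab.update(doc.lower().split())   # update representation
    return sampled, vocab
```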

Page 27: Www2012 tutorial content_aggregation

Source Selection Example: Relevant Document Distribution Estimation

[Figure: a query is run against a centralized index of documents sampled from each vertical; the sample retrieval is used to estimate the full-dataset retrieval over the original vertical collections.]

Related Reading: Si and Callan [27]

27 / 169

Page 28: Www2012 tutorial content_aggregation

Source Selection Example: Relevant Document Distribution Estimation

• Assumption: each (predicted) relevant sampled document represents |v| / |S_v| relevant documents in the original vertical collection

ReDDE(v, q) = (1/Z) ∑_{d ∈ R_N} ( |v| / |S_v| ) × I(d ∈ v)

• Z normalizes across verticals: Z = ∑_{v ∈ V} ReDDE(v, q)

Related Reading: Si and Callan [27]

28 / 169
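A small sketch of the ReDDE score above; representing the sample retrieval as (doc, vertical) pairs is an assumption for illustration.

```python
def redde(sample_retrieval, sample_sizes, vertical_sizes, top_n=100):
    """ReDDE sketch: sample_retrieval is the ranked list of
    (doc_id, vertical) pairs retrieved from the centralized sample index;
    sample_sizes[v] = |S_v|, vertical_sizes[v] = |v|."""
    scores = {v: 0.0 for v in vertical_sizes}
    for _doc, v in sample_retrieval[:top_n]:
        # each sampled document stands in for |v| / |S_v| originals
        scores[v] += vertical_sizes[v] / sample_sizes[v]
    z = sum(scores.values()) or 1.0       # Z normalizes across verticals
    return {v: s / z for v, s in scores.items()}
```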

Page 29: Www2012 tutorial content_aggregation

Result Merging Example: Semi-supervised Learning for Merging

[Figure: a broker's central sample index overlaps with the results returned by each collection; the scores of the overlap documents on both sides are paired to train a merging function.]

• SSL [28] uses sampled data and regression for merging.
• SSL relies on overlap between the returned results and sampled documents.
• A regression function is trained on the overlap documents.

29 / 169
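A least-squares sketch of the regression step, assuming per-vertical (vertical score, centralized score) pairs for the overlap documents; the scores in the usage line are hypothetical.

```python
import numpy as np

def train_merge_map(overlap_pairs):
    """Fit central_score = a * vertical_score + b on overlap documents that
    appear both in a vertical's results and in the central sample index."""
    x = np.array([[s, 1.0] for s, _ in overlap_pairs])
    y = np.array([c for _, c in overlap_pairs])
    (a, b), *_ = np.linalg.lstsq(x, y, rcond=None)
    return lambda score: a * score + b

# usage with hypothetical overlap scores: map a vertical's scores onto the
# centralized scale so results from different verticals become comparable
to_central = train_merge_map([(0.9, 0.8), (0.6, 0.5), (0.3, 0.2)])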

Page 30: Www2012 tutorial content_aggregation

Bronze Age: Metasearch engines

Metasearch engines submit the user query to multiple search engines and merge the results. They date back to MetaCrawler (1994), which merged the results from WebCrawler, Lycos, and InfoSeek.

• In metasearch the query is often sent to all sources.
• Result merging is usually based on position and overlap.

30 / 169

Page 31: Www2012 tutorial content_aggregation

Example: MetaCrawler

31 / 169

Page 32: Www2012 tutorial content_aggregation

Iron Age: Aggregated Search

In 2000, the Korean search engine Naver introduced comprehensive search and started blending multimedia answers into its default search results. Google introduced universal search in 2007.

Motivation:

• Web data is highly heterogeneous.
• Information needs and search tasks are similarly diverse.
• Keeping a fresh index of real-time data is difficult.

32 / 169

Page 33: Www2012 tutorial content_aggregation

Related Problems: Peer-to-Peer Search

Related reading: Lu [21]

33 / 169

Page 34: Www2012 tutorial content_aggregation

Related Problems: Data Fusion

• Several rankers on the same data collection
• Each ranker is considered as a voter

#     51 voters   5 voters    23 voters   21 voters
1st   Andrew      Catherine   Brian       David
2nd   Catherine   Brian       Catherine   Catherine
3rd   Brian       David       David       Brian
4th   David       Andrew      Andrew      Andrew

Borda scores: Andrew 153, Catherine 205, Brian 151, David 91

Related reading: http://bit.ly/HPBFoG

34 / 169
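The Borda scores in the table can be reproduced directly; a quick sketch (each candidate gets n−1, n−2, ..., 0 points per ballot position):

```python
from collections import defaultdict

def borda(ballots):
    """ballots: (num_voters, best-first ranking) pairs."""
    scores = defaultdict(int)
    for voters, ranking in ballots:
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            scores[cand] += voters * (n - 1 - pos)
    return dict(scores)

print(borda([(51, ["Andrew", "Catherine", "Brian", "David"]),
             (5,  ["Catherine", "Brian", "David", "Andrew"]),
             (23, ["Brian", "Catherine", "David", "Andrew"]),
             (21, ["David", "Catherine", "Brian", "Andrew"])]))
# {'Andrew': 153, 'Catherine': 205, 'Brian': 151, 'David': 91}
```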

Page 35: Www2012 tutorial content_aggregation

Problem Definitions

35 / 169

Page 36: Www2012 tutorial content_aggregation

Content Aggregation: Examples

36 / 169

Page 37: Www2012 tutorial content_aggregation

Content Aggregation: Examples

37 / 169

Page 38: Www2012 tutorial content_aggregation

Content Aggregation: Definitions

• Core content: the content that is always presented and is the main focus of the page
• Vertical content: related content that is optional and supplements the core content

38 / 169

Page 39: Www2012 tutorial content_aggregation

Content Aggregation

[Figure: a content aggregator combines the core content with candidate vertical content V1, V2, ..., Vn.]

39 / 169

Page 40: Www2012 tutorial content_aggregation

Problem Definition

• Given a particular context, predict which verticals to present and where to present them
• A context is defined by the core content, the information request (i.e., the query), and/or the user profile

40 / 169

Page 41: Www2012 tutorial content_aggregation

Content Aggregation in Web Search

41 / 169

Page 42: Www2012 tutorial content_aggregation

Aggregated Web Search

• Integrating vertical results into the core Web results

[Figure: the core web results surrounded by candidate verticals: maps, books, news, local, images, ...]

42 / 169

Page 43: Www2012 tutorial content_aggregation

What is a Vertical?

• A specialized search service that focuses on a particular type of information need

[Figure: the core web results surrounded by candidate verticals: maps, books, news, local, images, ...]

43 / 169

Page 44: Www2012 tutorial content_aggregation

Example Verticals

[Screenshots of vertical results for the queries: "lyon news", "lyon restaurants", "lyon map"]

44 / 169

Page 45: Www2012 tutorial content_aggregation

Example Verticals

[Screenshots of vertical results for the queries: "lyon pictures", "lyon video", "french cookbooks"]

45 / 169

Page 46: Www2012 tutorial content_aggregation

Example Verticals

[Screenshots of vertical results for the queries: "lyon weather", "lyon flight", "bank of america", "bonjour in english"]

46 / 169

Page 47: Www2012 tutorial content_aggregation

What is a Vertical?

• A specialized search service
• Different verticals retrieve different types of media (e.g., images, video, news)
• Different verticals satisfy different types of information needs (e.g., purchasing a product, finding a local business, finding driving directions)

47 / 169

Page 48: Www2012 tutorial content_aggregation

Aggregated Web Search

[Screenshot of aggregated results for the query "lyon"]

48 / 169

Page 49: Www2012 tutorial content_aggregation

Aggregated Web Search

• Task: combining results from multiple specialized search services into a single presentation
• Goals
  • To provide access to various systems from a single search interface
  • To satisfy the user with the aggregated results
  • To convey how the user's goal might be better satisfied by searching a particular vertical directly (if possible)

49 / 169

Page 50: Www2012 tutorial content_aggregation

Aggregated Web Search: Motivations

• Users may not know that a particular vertical is relevant
• Users may want results from multiple verticals at once
• Users may prefer a single point of access

50 / 169

Page 51: Www2012 tutorial content_aggregation

Aggregated Web Search: Task Decomposition

• Vertical Selection
• Vertical Results Presentation

51 / 169

Page 52: Www2012 tutorial content_aggregation

Vertical Selection

• Predicting which verticals to present (if any)

[Figure: the query "lyon" being routed to candidate verticals: web, maps, books, news, local, images, ...]

52 / 169

Page 53: Www2012 tutorial content_aggregation

Vertical Selection

• Given a query and a set of verticals, predict which verticals are relevant
• In some situations, this decision must be made without issuing the query to the vertical
• Later on we'll discuss sources of pre-retrieval evidence

53 / 169

Page 54: Www2012 tutorial content_aggregation

Vertical Distribution

25K queries sampled randomly from Web traffic

• Number of queries for which each vertical was considered relevant

[Figure: bar chart over verticals (Autos, Directory, Finance, Games, Health, Jobs, Image, Local, Maps, Movies, Music, News, Reference, Shopping, Sports, TV, Travel, Video); x-axis from 0 to 5000 queries]

54 / 169

Page 55: Www2012 tutorial content_aggregation

Vertical Results Presentation

• Predicting where to present them (if at all)

[Figure: the query "lyon" with selected verticals being slotted into the web results]

55 / 169

Page 56: Www2012 tutorial content_aggregation

Vertical Results Presentation

• Given a query, a set of (selected) verticals, and a set of layout constraints, predict where to present the vertical results
• In some situations, it may be possible to suppress a predicted relevant vertical based on its results
• Later on we'll discuss sources of post-retrieval evidence

56 / 169

Page 57: Www2012 tutorial content_aggregation

Sources of Evidence

57 / 169

Page 58: Www2012 tutorial content_aggregation

Content Aggregation

• Given a particular context, predict which verticals to present and where to present them
• A context is defined by the core content, the information request (i.e., the query), and/or the user profile
• Vertical selection: predicting which verticals to present (if any)
• Vertical results presentation: predicting where to present each selected vertical (if at all)
• Different content aggregation environments may be associated with different sources of evidence

58 / 169

Page 59: Www2012 tutorial content_aggregation

Sources of Evidence

• Relationship between core content and vertical content

59 / 169

Page 60: Www2012 tutorial content_aggregation

Sources of Evidence

• Relationship between explicit request and vertical content

60 / 169

Page 61: Www2012 tutorial content_aggregation

Sources of Evidence

• Relationship between user profile and vertical content

61 / 169

Page 62: Www2012 tutorial content_aggregation

Sources of Evidence for Aggregated Web Search

62 / 169

Page 63: Www2012 tutorial content_aggregation

Sources of Evidence

[Figure: the query "lyon" routed to the web and vertical indexes (maps, books, news, local, images, ...), with evidence drawn from the query itself, query-logs, sampled corpora, and user feedback]

63 / 169

Page 64: Www2012 tutorial content_aggregation

Types of Features

• Pre-retrieval features: generated before the query is issued to the vertical
• Post-retrieval features: generated after the query is issued to the vertical and before it is presented
• Post-presentation features: generated after the vertical is presented
  • possibly available from previous impressions of the query

64 / 169

Page 65: Www2012 tutorial content_aggregation

Pre-retrieval Features

• Query features
  • the query's topical category (e.g., travel)
  • keywords and regular expressions (e.g., lyon pictures)
  • named-entity types (e.g., city name)
• Vertical corpus features
  • similarity between the query and sampled vertical results
• Vertical query-log features
  • similarity between the query and vertical query-traffic

65 / 169

Page 66: Www2012 tutorial content_aggregation

Query Features

• Derived from the query, independent of the vertical

[Figure: the query "lyon pictures" analyzed on its own, before being routed to any vertical]

66 / 169

Page 67: Www2012 tutorial content_aggregation

Query Features

• Terms appearing in the query
• Dictionary look-ups: "yhoo" → finance vertical
• Regular expressions:
  • "obama news" → news vertical
  • "ebay.com" → no vertical
• Named-entity types: "main st., pittsburgh" → maps vertical
• Query category: "lyon attractions" → travel vertical

Related Reading: Arguello et al. [4], Ponnuswami et al. [22], and Li et al. [19]

67 / 169

Page 68: Www2012 tutorial content_aggregation

Query Features

• Query-log-based co-occurrence between the query and vertical-specific keywords
• Co-occurrence can be measured using the χ² statistic
• Example image vertical keywords: photo(s), pic(s), picture(s), image(s)

Related Reading: Arguello et al. [2]

68 / 169

Page 69: Www2012 tutorial content_aggregation

Query Category Features

• Use a corpus with known document-to-category assignments (binary or soft)

69 / 169

Page 70: Www2012 tutorial content_aggregation

Query Category Features

• Query-category assignment based on the document-category assignments of the top-N results

P(c|q) = (1/Z) ∑_{d ∈ R_N} P(c|d) × score(d, q)

• Z normalizes across categories: Z = ∑_{c ∈ C} P(c|q)

Related Reading: Shen et al. [24]

70 / 169
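A sketch of the category-assignment formula above; the shapes of `results` and `p_c_given_d` are assumptions for illustration.

```python
from collections import defaultdict

def query_categories(results, p_c_given_d):
    """P(c|q) from the top-N results: results is a list of
    (doc_id, score(d, q)); p_c_given_d[doc_id] maps category -> P(c|d)."""
    p = defaultdict(float)
    for doc_id, score in results:
        for c, p_cd in p_c_given_d[doc_id].items():
            p[c] += p_cd * score
    z = sum(p.values()) or 1.0            # Z normalizes across categories
    return {c: v / z for c, v in p.items()}
```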

Page 71: Www2012 tutorial content_aggregation

Vertical Corpus Features

• Derived from (sampled) vertical documents (e.g., ReDDE)

[Figure: the query "lyon" scored against corpora sampled from each vertical (maps, books, news, local, images, ...)]

71 / 169

Page 72: Www2012 tutorial content_aggregation

Vertical Query-log Features

• Derived from queries that were issued directly to the vertical by users

[Figure: the query "lyon" compared against the query-logs of each vertical (maps, books, news, local, images, ...)]

72 / 169

Page 73: Www2012 tutorial content_aggregation

Similarity to Vertical Query-Traffic

• Similarity between the query and queries issued directly to the vertical by users

QLOG(v, q) = (1/Z) ∏_{w ∈ q} P(w | θ^qlog_v)

• Z normalizes across verticals: Z = ∑_{v ∈ V} QLOG(v, q)

Related Reading: Arguello et al. [4]

73 / 169
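A sketch of the query-log likelihood above; the additive smoothing constant `mu` is an assumption, since the slide does not specify how θ^qlog_v is smoothed.

```python
import math
from collections import Counter

def qlog(query, vertical_qlogs, mu=0.5):
    """QLOG(v, q): query likelihood under each vertical's query-log
    language model, normalized across verticals. vertical_qlogs maps
    vertical -> Counter of query-log term counts."""
    raw = {}
    for v, counts in vertical_qlogs.items():
        total, vocab = sum(counts.values()), len(counts) or 1
        logp = sum(math.log((counts[w] + mu) / (total + mu * vocab))
                   for w in query.lower().split())
        raw[v] = math.exp(logp)
    z = sum(raw.values()) or 1.0          # Z normalizes across verticals
    return {v: s / z for v, s in raw.items()}
```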

Page 74: Www2012 tutorial content_aggregation

Types of Features

• Pre-retrieval features: generated before the query is issued to the vertical
• Post-retrieval features: generated after the query is issued to the vertical and before it is presented
• Post-presentation features: generated after the vertical is presented
  • possibly available from previous impressions of the query

74 / 169

Page 75: Www2012 tutorial content_aggregation

Post-Retrieval Features

• Derived from the vertical's response to the query
• Derived from the vertical's full retrieval, or only the few results that will potentially be presented
• Results from different verticals are associated with different meta-data
  • publication date: news, blog, micro-blog, books
  • geographical proximity: news, micro-blog, local, maps
  • reviews: community Q&A, local, shopping
  • price: shopping, books, flights
• Challenge: post-retrieval features tend to be different for different verticals

75 / 169

Page 76: Www2012 tutorial content_aggregation

Post-Retrieval Features

• Number of results
• Retrieval score distribution
• Text similarity features: cross-product of
  • Extent: title, url, summary
  • Similarity measure: clarity, cosine, jaccard, query-likelihood
  • Aggregator: min, max, mean, std. dev.
• Recency features: cross-product of
  • Extent: creation date, last modified date
  • Recency measure: time difference, exp. decay
  • Aggregator: min, max, mean, std. dev.

Related Reading: Arguello et al. [2], Diaz [10]

76 / 169

Page 77: Www2012 tutorial content_aggregation

Post-Retrieval Features Example: Clarity Score

[Figure: excerpt from Cronen-Townsend et al. — tables correlating the average IDF of query terms with average precision on several TREC test collections, and a plot of average precision versus clarity score for the 100 title queries of the TREC-7 and TREC-8 ad hoc tracks.]

• Measures query ambiguity with respect to the collection.
• If the returned results are not topically similar, the performance might be poor.

clarity score = ∑_{w ∈ V} P(w|Q) log₂ ( P(w|Q) / P_coll(w) )

Related reading: Cronen-Townsend et al. [9]

77 / 169
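A sketch of the clarity computation; the uniform document weighting and the mixing weight `lam` are simplifying assumptions (the original weights documents by query likelihood).

```python
import math
from collections import Counter

def clarity(top_docs, collection_counts, lam=0.6):
    """KL divergence between a query language model estimated from the
    retrieved documents and the collection model: high = focused results."""
    coll_total = sum(collection_counts.values())
    q_model = Counter()
    for doc in top_docs:                    # doc = list of tokens
        for w, c in Counter(doc).items():
            q_model[w] += c / len(doc) / len(top_docs)
    score = 0.0
    for w, p_wq in q_model.items():
        p_coll = collection_counts.get(w, 1) / coll_total
        p = lam * p_wq + (1 - lam) * p_coll   # smooth P(w|Q)
        score += p * math.log2(p / p_coll)
    return score
```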

Page 78: Www2012 tutorial content_aggregation

Types of Features

• Pre-retrieval features: generated before the query is issued to the vertical
• Post-retrieval features: generated after the query is issued to the vertical and before it is presented
• Post-presentation features: generated after the vertical is presented
  • possibly available from previous impressions of the query

78 / 169

Page 79: Www2012 tutorial content_aggregation

Post-Presentation Features

• Derived from implicit feedback
• Click-through rates associated with query-vertical-position triplets
• Average dwell-times on vertical results
• Other feedback signals have not been explored in published work
  • Mouse movement
  • Scrolls
  • ...

Related Reading: Diaz [10], Ponnuswami et al. [22], Song et al. [29]

79 / 169

Page 80: Www2012 tutorial content_aggregation

Post-Presentation Features Example: Clickthrough

Related reading: Diaz [11]

80 / 169

Page 81: Www2012 tutorial content_aggregation

Post-Presentation Features

• Can be derived from previous impressions of the query
• However, this assumes that the query is a head query
• Post-presentation features can also be derived from similar queries
• Assumption: semantically related queries are associated with similar implicit feedback

click(v, q) = (1/Z) ∑_{q′ ∈ Q : sim(q, q′) > τ} sim(q, q′) × click(v, q′)

Related Reading: Diaz et al. [12]

81 / 169
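A sketch of the smoothed click signal above; `click` and `sim` are assumed black boxes (e.g. historical CTR tables and a similarity over query features).

```python
def propagated_click(v, q, queries, click, sim, tau=0.3):
    """click(v, q) smoothed over semantically similar queries q' with
    sim(q, q') > tau; Z is the sum of the similarity weights."""
    neighbors = [q2 for q2 in queries if sim(q, q2) > tau]
    z = sum(sim(q, q2) for q2 in neighbors) or 1.0
    return sum(sim(q, q2) * click(v, q2) for q2 in neighbors) / z
```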

Page 82: Www2012 tutorial content_aggregation

Post-Presentation Features: Nuances

• Some verticals do not require being clicked
  • weather, finance, translation, calculator
• Visually appealing verticals may exhibit a presentation bias
• A previous study found a click-through bias in favor of video results
• That is, users clicked on video results more often, irrespective of position and relevance
• Suggests the need to model feedback differently for different verticals

Related Reading: Sushmita et al. [30]

82 / 169

Page 83: Www2012 tutorial content_aggregation

Feature Importance for Vertical Selection: Which features help the most?

• Individually removing different types of features resulted in worse performance
• The most useful features corresponded to the topical categories of the query

all              0.583
no.geographical  0.577  (-1.01%)
no.redde         0.568  (-2.60%)
no.soft.redde    0.567  (-2.67%)
no.category      0.552  (-5.33%)

(feature groups: corpus, query-log, query)

Related Reading: Arguello et al. [4]

83 / 169

Page 84: Www2012 tutorial content_aggregation

Feature Importance for Vertical Selection: Which features help the most?

• Explains why performance was superior for topically-focused verticals

travel 0.842, health 0.788, music 0.772, games 0.771, autos 0.730, sports 0.726, tv 0.716, movies 0.688, finance 0.655, local 0.619, jobs 0.570, shopping 0.563, images 0.483, video 0.459, news 0.456, reference 0.348, maps 0.000, directory 0.000

Related Reading: Arguello et al. [4]

84 / 169

Page 85: Www2012 tutorial content_aggregation

Outline

Introduction

History of Content Aggregation

Problem Definition

Sources of Evidence

Modeling

Evaluation

Special Topics in Aggregation

Future Directions

85 / 169

Page 86: Www2012 tutorial content_aggregation

Modeling

86 / 169

Page 87: Www2012 tutorial content_aggregation

Content Aggregation: Tasks

• Vertical Selection: predicting which verticals to present
• Vertical Presentation: predicting where to present each vertical selected
  • May include deciding whether to suppress a selected vertical based on post-retrieval evidence

87 / 169

Page 88: Www2012 tutorial content_aggregation

Content Aggregation: Layout Assumptions

• Content aggregation requires assuming a set of layout constraints
• Example layout constraints:
  • The core content is always presented and is presented in the same position
  • If a vertical is presented, then its results must be presented together (horizontally or vertically)
  • If a vertical is presented, then a minimum and maximum number of its results must be presented
  • If a vertical is presented, it can only be presented in certain positions
  • ...

88 / 169

Page 89: Www2012 tutorial content_aggregation

Content Aggregation: Layout Assumptions

89 / 169

Page 90: Www2012 tutorial content_aggregation

Content Aggregation: Layout Assumptions

[Figure: a page layout with the core content in a fixed position and question marks over the candidate vertical slots]

90 / 169

Page 91: Www2012 tutorial content_aggregation

Aggregated Web Search: Layout Assumptions

• The core Web content (e.g., w1–10) is always presented
• If a vertical is presented, then its results must be presented together (horizontally or vertically)
• If a vertical is presented, it can only be presented in certain positions relative to the top Web results, for example:
  • above w1
  • between w3–4
  • between w6–7
  • below w10

91 / 169

Page 92: Www2012 tutorial content_aggregation

Aggregated Web Search: Layout Assumptions

• Because of these layout constraints, aggregated Web search is sometimes referred to as slotting or blending

[Figure: slots s1–s4 interleaved with web result blocks w1–3, w4–6, w7–10]

92 / 169

Page 93: Www2012 tutorial content_aggregation

Aggregated Web Search: Tasks

• Vertical Selection: predicting which vertical(s) to present
• Vertical Presentation: predicting where in the Web results to present them

93 / 169

Page 94: Www2012 tutorial content_aggregation

Aggregated Web Search: Modeling

• Use machine learning to combine different types of features
• Vertical selection and presentation may be associated with different features
  • Post-retrieval features may not be available for selection
• Gold-standard training/test data
  • Editorial vertical-relevance judgements: a human assessor determines that a particular vertical should be presented for a given query
  • User-generated clicks and skips collected by presenting the vertical (at a specific location) for all queries, or a random subset

94 / 169

Page 95: Www2012 tutorial content_aggregation

Aggregated Web Search: Challenges

• Different verticals are associated with different types of features
  • Some verticals retrieve non-textual results (no vertical-corpus features)
  • Some verticals do not have direct search capabilities (no vertical query-log data)
  • Results from different verticals are associated with different meta-data (news articles have a publication date, local results have a geographical proximity to the user)
  • A feature that is common to multiple verticals may be correlated differently with relevance (recency of results may be more predictive for the news vertical than the images vertical)

95 / 169

Page 96: Www2012 tutorial content_aggregation

Aggregated Web Search: Challenges

• Requires methods that can handle different features for different verticals
• Requires methods that can exploit a vertical-specific relation between features and relevance

96 / 169

Page 97: Www2012 tutorial content_aggregation

Vertical Selection: Classification Approach

• Learn independent vertical-specific binary classifiers
• Use binary ground-truth labels for training
• Make independent binary predictions for each vertical

97 / 169

Page 98: Www2012 tutorial content_aggregation

Classification Approach: Logistic Regression

P(v|q) = 1 / ( 1 + exp( −(w_0 + ∑_i w_i × ϕ_v(q)_i) ) )

• v = the vertical
• ϕ_v = feature generator specific to v (may include vertical-specific features)
• LibLinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/

98 / 169
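A sketch of the per-vertical classifiers using scikit-learn's liblinear backend (the slide points to LibLinear itself); the training-data layout is an assumption.

```python
from sklearn.linear_model import LogisticRegression

def train_selectors(training_data):
    """training_data: {vertical: (X, y)} with vertical-specific feature
    matrices phi_v(q) and binary relevance labels."""
    selectors = {}
    for v, (X, y) in training_data.items():
        clf = LogisticRegression(solver="liblinear", C=1.0)  # L2-regularized
        selectors[v] = clf.fit(X, y)
    return selectors

def select_verticals(selectors, features, threshold=0.5):
    """Independent binary decision per vertical; features[v] = phi_v(q)."""
    decisions = {}
    for v, clf in selectors.items():
        p = clf.predict_proba([features[v]])[0, 1]   # P(v|q)
        if p > threshold:
            decisions[v] = p
    return decisions
```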

Page 99: Www2012 tutorial content_aggregation

Classification Approach: Logistic Regression

• Advantages
  • Easy and fast to train
  • Regularization parameter balances the importance of different features (helps improve generalization performance)
  • Outputs a confidence value P(v|q), which can be treated as a hyper-parameter
• Disadvantages
  • Cannot exploit complex interactions between features

99 / 169

Page 100: Www2012 tutorial content_aggregation

Vertical Presentation: Modeling

• Slotting: assume that vertical results can only be presented in specific locations or slots

[Figure: slots s1–s4 interleaved with web result blocks w1–3, w4–6, w7–10]

100 / 169

Page 101: Www2012 tutorial content_aggregation

Vertical Presentation: Classification Approach

• Learn independent vertical-selectors using binary labels
• Present vertical v in slot s_i if P(v|q) > τ_j ∀ j ≥ i (see the sketch after this slide)
• Tune parameters τ1–4 using validation data

[Figure: slots s1–s4, each guarded by a threshold τ1–τ4, interleaved with web result blocks w1–3, w4–6, w7–10]

Related Reading: Arguello et al. [2], Ponnuswami et al. [22]

101 / 169
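A sketch of the slotting rule above; the threshold values are hypothetical and would be tuned on validation data.

```python
def slot_for(p_v_given_q, taus):
    """Return the best slot index i (0 = most prominent) such that the
    confidence clears tau_j for every j >= i, or None to suppress."""
    for i in range(len(taus)):
        if all(p_v_given_q > taus[j] for j in range(i, len(taus))):
            return i
    return None

# hypothetical thresholds for slots s1..s4
print(slot_for(0.8, [0.9, 0.7, 0.5, 0.3]))  # -> 1 (slot s2)
```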

Page 102: Www2012 tutorial content_aggregation

Vertical Presentation: Ranking Approach

[Figure: vertical blocks (travel, flights, weather, videos, images) and web result blocks (w1–3, w4–6, w7–10) ranked jointly and assigned to slots s1–s4]

102 / 169

Page 103: Www2012 tutorial content_aggregation

Vertical Presentation: Block Ranking

103 / 169

Page 104: Www2012 tutorial content_aggregation

Vertical Presentation: Learning to Rank

• Machine learning algorithms that learn to order elements based on training data
• Features derived from the element or from the query-element pair
• Point-wise methods: learn to predict an element's relevance grade independent of other elements
• Pair-wise methods: learn to predict whether one element is more relevant than another
• List-wise methods: learn to maximize a metric that evaluates the ranking as a whole (e.g., NDCG)

104 / 169

Page 105: Www2012 tutorial content_aggregation

Block Ranking: Challenges

• LTR models require using a common feature representation across all elements
• LTR models learn an element-type-agnostic relation between features and relevance
• Vertical block ranking requires methods that can handle different features for different verticals
• Vertical block ranking requires methods that can exploit a vertical-specific relation between features and relevance

105 / 169

Page 106: Www2012 tutorial content_aggregation

SVM Rank: Learning to Rank

• These requirements can be satisfied by modifying the feature representation (see the sketch after this slide):
  1. Use the union of all features
  2. If a feature is common to multiple verticals, make a vertical-specific copy
  3. Zero all vertical-specific copies that do not correspond to the vertical in question

106 / 169
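A sketch of steps 1–3 above; the feature names are hypothetical.

```python
def vertical_copies(shared_features, vertical, all_verticals):
    """Copy each shared feature into per-vertical slots and zero the
    copies that do not belong to this block's vertical."""
    vec = {}
    for name, value in shared_features.items():
        for v in all_verticals:
            vec[f"{name}:{v}"] = value if v == vertical else 0.0
    return vec

print(vertical_copies({"clarity": 1.2}, "news", ["news", "images"]))
# {'clarity:news': 1.2, 'clarity:images': 0.0}
```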

Page 107: Www2012 tutorial content_aggregation

Query Similarity

• Similar queries should have similar predictions
• Gold-standard labels for training can also be shared between similar (labeled and unlabeled) queries
• Related to semi-supervised learning

Related Reading: Li et al. [19], Chapelle et al. [8]

107 / 169

Page 108: Www2012 tutorial content_aggregation

Summary

• Vertical Selection
  • classification problem
  • can have a disjoint feature set amongst verticals
• Vertical Presentation
  • ranking problem
  • a feature set common across verticals is desirable

108 / 169

Page 109: Www2012 tutorial content_aggregation

Evaluation

109 / 169

Page 110: Www2012 tutorial content_aggregation

Vertical Selection

[Figure: two SERPs for the query "inauguration" — one with no vertical (web results only: Inauguration Day (Wikipedia), the Joint Congressional Committee on Inaugural Ceremonies, Inauguration Day 2009, and Inaugural Addresses of the Presidents), and one with a "News Results for Inauguration" block (CNN headlines) slotted above the same web results.]

110 / 169

Page 111: Www2012 tutorial content_aggregation

Evaluation Notation: Vertical Selection

Q    set of evaluation contexts (queries)
V    set of candidate verticals, e.g. {Web, news, ...}
V_q  set of verticals relevant to context q
v_q  predicted vertical for context q

111 / 169

Page 112: Www2012 tutorial content_aggregation

Vertical Selection: Accuracy

• relevance: a vertical is relevant if it satisfies some possible intent
• objective: predict the appropriate vertical when relevant; otherwise, predict no relevant vertical
• metric: accuracy

Related reading: Arguello et al. [5]

112 / 169

Page 113: Www2012 tutorial content_aggregation

Vertical Selection: Accuracy

A_q = I(v_q ∈ V_q)  if V_q ≠ ∅
A_q = I(v_q = ∅)   if V_q = ∅

113 / 169

Page 114: Www2012 tutorial content_aggregation

Vertical Selection: Utility

• relevance: a vertical is relevant if it satisfies the intent of a particular user at a particular time
• objective: predict the appropriate vertical when relevant; otherwise, predict no relevant vertical
• metric: utility of the whole-page layout

Related reading: Diaz and Arguello [12]

114 / 169

Page 115: Www2012 tutorial content_aggregation

Vertical Selection: Utility

u(v*_q, v_q) = 1  if v_q = v*_q
u(v*_q, v_q) = α  if (v*_q = Web) ∧ (v_q ≠ Web)
u(v*_q, v_q) = 0  otherwise

where 0 ≤ α ≤ 1 represents the user's discounted utility from being presented a display above the desired web results.

115 / 169
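Both selection metrics in a short sketch; encoding the empty prediction as `None` and verticals as strings is an assumption for illustration.

```python
def selection_accuracy(predicted, relevant):
    """A_q: I(v_q in V_q) when V_q is non-empty, else I(v_q is empty)."""
    return int(predicted in relevant) if relevant else int(predicted is None)

def selection_utility(desired, predicted, alpha=0.5):
    """u(v*_q, v_q): 1 on a match; alpha when the user wanted plain Web
    results but a vertical display was slotted above them; else 0."""
    if predicted == desired:
        return 1.0
    if desired == "web":       # predicted != "web" here, since no match
        return alpha
    return 0.0
```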

Page 116: Www2012 tutorial content_aggregation

Vertical Presentation: Preference

• relevance: a presentation is good if the user can easily find more relevant content before less relevant content
• objective: predict appropriate vertical preferences
• metric: similarity to the ‘optimal’ ranking

Related Reading: Diaz [11]

116 / 169

Page 117: Www2012 tutorial content_aggregation

Candidate Slots

117 / 169

Page 118: Www2012 tutorial content_aggregation

Candidate Modules: C_q

118 / 169

Page 119: Www2012 tutorial content_aggregation

Aggregation: σ_q

119 / 169

Page 120: Www2012 tutorial content_aggregation

Metric: Module Preferences

120 / 169

Page 121: Www2012 tutorial content_aggregation

Aggregation: σ*_q

121 / 169

Page 122: Www2012 tutorial content_aggregation

Metric: Evaluation

σ*_q           optimal ranking from preference judgments
σ_q            predicted ranking by the system
K(σ*_q, σ_q)   similarity between the predicted and optimal rankings, e.g. Kendall τ, Spearman ρ

Related reading: Arguello et al. [3]

122 / 169
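A sketch of K(σ*_q, σ_q) using Kendall's τ via scipy; the block names are hypothetical.

```python
from scipy.stats import kendalltau

def presentation_similarity(optimal, predicted):
    """Rank correlation between the optimal block ordering (from
    preference judgments) and the system's predicted ordering."""
    rank = {block: i for i, block in enumerate(optimal)}
    tau, _ = kendalltau([rank[b] for b in predicted],
                        range(len(predicted)))
    return tau

print(presentation_similarity(["news", "web", "images"],
                              ["web", "news", "images"]))  # ~0.33
```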

Page 123: Www2012 tutorial content_aggregation

Evaluation

• User study: small-scale laboratory experiment resulting in a deep, focused understanding of the user-level effects.
  • advantages: fine-grained analysis of system behavior, often situated in a real-world task.
  • disadvantages: expensive; difficult to reuse results on drastically different systems; synthetic environment.
• Batch study: medium-scale laboratory experiment producing data and metrics for comparing systems.
  • advantages: repeatability; many metrics.
  • disadvantages: expensive; synthetic environment.
• Production data: large-scale production experiment gathering realistic user reactions to different systems.
  • advantages: naturalistic experiment; large scale.
  • disadvantages: repeatability difficult.

123 / 169


Page 127: Www2012 tutorial content_aggregation

Batch Study

• editorial pool: sampling editors to assess relevance (e.g. in-house editorial pool, Mechanical Turk)
• query pool: sampling queries to assess performance
• editorial guidelines: defining precisely what is meant by relevance

124 / 169

Page 128: Www2012 tutorial content_aggregation

Batch Study

• vertical selection: queries labeled with all possible relevant verticals [4]
• vertical presentation: queries labeled with preferences between verticals [3]

125 / 169

Page 129: Www2012 tutorial content_aggregation

Production Data

• user pool: sampling users to assess relevance (e.g. random, stratified)
• implicit feedback: defining user interactions correlated with relevance (e.g. clicks, hovers)

126 / 169

Page 130: Www2012 tutorial content_aggregation

Production Data

• vertical selection: infer relevance from clicks on vertical displays
• vertical presentation: infer preferences from clicks on vertical displays

127 / 169

Page 131: Www2012 tutorial content_aggregation

Production Data: Vertical Selection

[Figure: SERP for the query "inauguration" with a "News Results for Inauguration" block (CNN headlines) above the web results]

• click on vertical content suggests relevance
• skip over vertical content suggests non-relevance
• click-through rate summarizes the inferred relevance of a vertical

128 / 169

Page 132: Www2012 tutorial content_aggregation

Production Data: Vertical Ranking

[Figure: the same "inauguration" SERP with a news block above the web results]

• skip over result a to click on result b implies b is preferred to a

129 / 169

Page 133: Www2012 tutorial content_aggregation

Aggregating Performance

• Traffic-Weighted Average: (1/|D|) ∑_{⟨u,t,q⟩ ∈ D} perf(u, t, q)
  • focuses evaluation on the frequent queries
• Query-Stratified Average: (1/|Q|) ∑_{q ∈ Q} (1/|D_q|) ∑_{⟨u,t,q⟩ ∈ D_q} perf(u, t, q)
  • focuses evaluation on robustness across queries
• User-Stratified Average: (1/|U|) ∑_{u ∈ U} (1/|D_u|) ∑_{⟨u,t,q⟩ ∈ D_u} perf(u, t, q)
  • focuses evaluation on robustness across users

130 / 169
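A sketch of the three averages over an impression log of (user, time, query, perf) records; the record layout is an assumption.

```python
from collections import defaultdict
from statistics import mean

def traffic_weighted(log):
    return mean(perf for _, _, _, perf in log)

def stratified(log, key):
    """Mean of per-group means; key=2 stratifies by query (robustness
    across queries), key=0 by user (robustness across users)."""
    groups = defaultdict(list)
    for record in log:
        groups[record[key]].append(record[3])
    return mean(mean(perfs) for perfs in groups.values())
```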

Page 134: Www2012 tutorial content_aggregation

Summary

• Many ways to evaluate aggregation performance
• The most important metric should be correlated with user satisfaction (e.g. whole-page relevance)
• All metrics tell you something about the aggregation system

131 / 169

Page 135: Www2012 tutorial content_aggregation

Special Topics in Aggregation

132 / 169

Page 136: Www2012 tutorial content_aggregation

Special Topics

• Dealing with non-stationary intent
• Dealing with new verticals
• Explore/exploit methods

133 / 169

Page 137: Www2012 tutorial content_aggregation

Dealing with non-stationary intent

134 / 169


Page 145: Www2012 tutorial content_aggregation

Dealing with non-stationary intent

• Vertical relevance depends on many factors.
  • core context: relevance may depend on the user's immediate intent
  • geography: relevance may depend on the user's location (e.g. country, city, neighborhood)
  • time: relevance may depend on recent events (e.g. holidays, news events)
• Most factors can be addressed with enough data or careful sampling.
  • core context: sample to include most expected contexts
  • geography: sample to include many locations
  • time: sample to include many events?

135 / 169

Page 146: Www2012 tutorial content_aggregation

Dealing with non-stationary intent

• Easy to predict appropriate news intent retrospectively
• Difficult to predict appropriate news intent online
• Possible solutions
  • Online learning of event models (e.g. statistical model of current events) [20]
  • Model with temporally-sensitive but context-independent features (e.g. ‘rate of increase in document volume’) [11]

136 / 169

Page 147: Www2012 tutorial content_aggregation

Dealing with non-stationary intent: Online Language Models

• Language model: a statistical model of text production (e.g. web documents, news articles, query logs, tweets)
  • can compute the likelihood of a model having produced the text of a particular context (e.g. query, news article)
  • conjecture: relevance is correlated with likelihood
• Online language model:
  • Topic detection and tracking (TDT): model emerging topics using clusters of documents in a news article stream [1]
  • Social media modeling: model dynamic topics in social media (e.g. Twitter) [20, 23]

• Social media modeling: model dynamic topics in socialmedia (e.g. Twitter) [20, 23].

137 / 169

Page 148: Www2012 tutorial content_aggregation

Dealing with non-stationary intent: Context-Independent Features

• Approach: instead of explicitly modeling the words associated with topics, model the topic-independent second-order effects
  • ‘how quickly is this query spiking in volume?’
  • ‘how quickly is the number of documents retrieved spiking in volume?’
• Topic-independent features generalize across events and into the future (as long as new events behave similarly to historic events)

138 / 169

Page 149: Www2012 tutorial content_aggregation

Domain Adaptation for Vertical Selection

139 / 169

Page 150: Www2012 tutorial content_aggregation

Problem

• Supervised vertical selection requires training data (e.g., vertical-relevance judgements).

[Figure: verticals with training data (web, maps, books, news, local, images, ...) and a new vertical with no training data]

140 / 169

Page 151: Www2012 tutorial content_aggregation

Problem

• A model trained to predict one vertical may not generalize to another
• Why?
  • A feature that is correlated with relevance for one vertical may be uncorrelated or negatively correlated for another (e.g., whether the query contains “news”)

141 / 169

Page 152: Www2012 tutorial content_aggregation

Task Definition

• Task: given a set of source verticals S with training data, learn a predictive model of a target vertical t associated with no training data
• Objective: maximize effectiveness on the target vertical

142 / 169

Page 153: Www2012 tutorial content_aggregation

Learning Algorithm

• Gradient Boosted Decision Trees (GBDT)

[Figure: an ensemble of decision trees summed together]

• Iteratively trains decision-tree predictors fitting the residuals of preceding trees
• Minimizes logistic loss

143 / 169

Page 154: Www2012 tutorial content_aggregation

Generic Model

• Train a model to maximize (average) performance for all verticals in S and apply the model to the target t.

[Figure: a generic model trained on the source verticals (local, images, travel, news, ...) applied to the target vertical (finance)]

144 / 169

Page 155: Www2012 tutorial content_aggregation

Generic Model

• Assumption: if the model performs well across all source verticals S, it will generalize well to the target t.

145 / 169

Page 156: Www2012 tutorial content_aggregation

Training a Generic Model

• Share a common feature representation across all verticals
• Pool together each source vertical's training data into a single training set
• Perform standard GBDT training on this training set

146 / 169

Page 157: Www2012 tutorial content_aggregation

Training a Generic Model

• Each training-set query is represented by |S| instances (one per source vertical)

[Figure: a single training query expanded into one feature vector per source vertical — local (+), travel (+), movies (+), video (−), news (−), images (−), finance (−), jobs (−)]

147 / 169

Page 158: Www2012 tutorial content_aggregation

Training a Generic Model

• Training set: union of all query/source-vertical pairs

[Figure: pooled positive and negative instances from many training queries]

148 / 169

Page 159: Www2012 tutorial content_aggregation

Training a Generic Model

• Perform standard GBDT training on this training set
• The model should automatically ignore features that are inconsistently correlated with the positive class

[Figure: pooled positive and negative instances from many training queries]

149 / 169

Page 160: Www2012 tutorial content_aggregation

Portable Feature Selection

• Goal: automatically identify features that may be uncorrelated (or negatively correlated) with relevance across verticals in S
• Method
  • Treat each feature as a single-evidence predictor
  • Measure its performance on each source vertical in S
  • Keep only the ones with the greatest (harmonic) average performance
• Assumption: if a feature is correlated with relevance (in the same direction) for all verticals in S, it will generalize well to the target t

150 / 169

Page 161: Www2012 tutorial content_aggregation

Model Adaptation

• Use the generic model's most confident predictions on the target t to “bootstrap” a vertical-specific model

[Figure: the generic model's predictions on the target vertical (finance) used to train a finance-specific model]

151 / 169

Page 162: Www2012 tutorial content_aggregation

Tree-based Domain Adaptation [Chen et al. 2008]

• TRADA: adapting a source-trained model using a small amount of target-domain training data

• A GBDT model can be fit to new data (from whatever domain) by simply appending new trees to the current ensemble

[Figure: the generic model’s tree ensemble extended with additional adaptation trees.]

152 / 169

Page 163: Www2012 tutorial content_aggregation

Training an Adapted Model

• Use a generic model to make target-vertical predictions on unlabeled data

• Consider the most confident N% of predictions as positive examples and the remaining ones as negative examples

• Adapt the generic model by appending trees that fit the residuals of the generic model on these pseudo-labels

A sketch of this adaptation loop follows.
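This is a minimal sketch under the assumptions above; X_unlabeled, top_frac, and the hyperparameters are hypothetical, and the ensemble format matches the GBDT sketch earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adapt(generic_trees, X_unlabeled, lr=0.1, top_frac=0.2,
          n_new_trees=50, depth=3):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Score the unlabeled target-vertical data with the generic ensemble.
    F = sum(lr * t.predict(X_unlabeled) for t in generic_trees)
    p = sigmoid(F)
    # Most confident N% of predictions become positives, the rest negatives.
    cutoff = np.quantile(p, 1.0 - top_frac)
    pseudo_y = (p >= cutoff).astype(float)
    new_trees = []
    for _ in range(n_new_trees):
        residuals = pseudo_y - sigmoid(F)   # fit residuals of the current ensemble
        tree = DecisionTreeRegressor(max_depth=depth).fit(X_unlabeled, residuals)
        F += lr * tree.predict(X_unlabeled)
        new_trees.append(tree)
    return generic_trees + new_trees        # adapted ensemble
```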

153 / 169

Page 164: Www2012 tutorial content_aggregation

Evaluation

[Figure: compared model variants: a generic model trained on all features; a generic model trained on only the portable features; and adapted models that append trees to the generic model using all features or only the non-portable features.]

154 / 169

Page 165: Www2012 tutorial content_aggregation

Generic Model Results (Average Precision)

vertical   generic (all feats.)   generic (only portable feats.)
finance    0.209                  0.392▲
games      0.636                  0.683
health     0.797                  0.839
jobs       0.193                  0.321▲
images     0.365                  0.390
local      0.543                  0.628▲
movies     0.294                  0.478▲
music      0.673                  0.780▲
news       0.293                  0.548▲
travel     0.571                  0.639▲
video      0.449                  0.691▲

(▲ denotes a statistically significant improvement over the all-features generic model.)

155 / 169

Page 166: Www2012 tutorial content_aggregation

Model Adaptation Results (Average Precision)

vertical   generic (only portable feats.)   trada (all feats.)   trada (only non-portable feats.)
finance    0.392                            0.328                0.407
games      0.683                            0.660                0.817▲
health     0.839                            0.813                0.868
jobs       0.321                            0.384                0.348
images     0.390                            0.370                0.499▲
local      0.628                            0.601                0.614
movies     0.478                            0.462                0.587▲
music      0.780                            0.778                0.866▲
news       0.548                            0.556                0.665▲
travel     0.639                            0.573▼               0.709▲
video      0.691                            0.648                0.722

(▲/▼ denote a statistically significant increase/decrease relative to the portable-features generic model.)

156 / 169

Page 167: Www2012 tutorial content_aggregation

Feature Portability

• Some features were more portable than others (high harmonic-average performance across source verticals)

[Figure: features sorted by h-avg performance (roughly 0.05–0.20), distinguishing query-vertical features from query features.]

157 / 169

Page 168: Www2012 tutorial content_aggregation

Feature Portability

• The most portable features correspond to unsupervised methods designed for homogeneous federated search

[Figure: the same plot, with the most portable features labeled: soft.redde [Arguello et al., 2009], redde [Si and Callan, 2003], qlog [Arguello et al., 2009], gavg [Seo and Croft, 2008].]

158 / 169

Page 169: Www2012 tutorial content_aggregation

Summary

• A generic vertical selector can be trained with some success

• Focusing on only the most portable features (identified automatically) improves performance

• Vertical-specific non-portable evidence is useful but requires training data

• It can be harnessed using pseudo-training examples from a generic model’s predictions at no additional labeling cost

159 / 169

Page 170: Www2012 tutorial content_aggregation

Explore/exploit methods

160 / 169

Page 171: Www2012 tutorial content_aggregation

Explore/Exploit

• Exploit: Choose articles/sources with the highest expected quality for short-term reward.

• Explore: Choose articles/sources with lower expected reward in order to gather information that improves long-term reward.

• Typical solution: multi-armed bandits
Related reading: Li et al. [17]

161 / 169

Page 172: Www2012 tutorial content_aggregation

Multi-armed bandit

• Problem: Each news article/source can be considered as an arm.

• Task: Present stories (pull arms) to maximize long-term reward (clicks).

Related reading: Li et al. [17, 18]

162 / 169

Page 173: Www2012 tutorial content_aggregation

Multi-armed bandit

• Naïve strategy: with probability ϵ, present a random article.

• Better strategy: sample according to confidence that the article is relevant (e.g., Exp4, Boltzmann sampling); both strategies are sketched below

• Extension: incorporating prior information (a.k.a. contextual bandits)

Related reading: Auer et al. [6], Sutton and Barto [31]
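A minimal sketch of ϵ-greedy and Boltzmann sampling over empirical click-through estimates, assuming Bernoulli click feedback; this illustrates the strategies above, not the systems in the cited papers.

```python
import math
import random

class Bandit:
    def __init__(self, n_arms):
        self.clicks = [0.0] * n_arms
        self.pulls = [1e-9] * n_arms     # tiny prior pull count avoids division by zero

    def estimate(self, a):
        return self.clicks[a] / self.pulls[a]    # empirical click-through rate

    def epsilon_greedy(self, eps=0.1):
        if random.random() < eps:                # explore: random article
            return random.randrange(len(self.pulls))
        return max(range(len(self.pulls)), key=self.estimate)   # exploit

    def boltzmann(self, temp=0.1):
        # Sample arms in proportion to exp(estimated reward / temperature).
        weights = [math.exp(self.estimate(a) / temp)
                   for a in range(len(self.pulls))]
        return random.choices(range(len(self.pulls)), weights=weights)[0]

    def update(self, a, clicked):
        self.pulls[a] += 1
        self.clicks[a] += clicked
```

In each round, one would pick an arm with epsilon_greedy() or boltzmann(), present the article, and call update() with the observed click.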

163 / 169

Page 174: Www2012 tutorial content_aggregation

Future Directions

164 / 169

Page 175: Www2012 tutorial content_aggregation

Short Term

• Diversity in vertical presentation
  • verticals can be redundant (e.g. news, blogs)
  • verticals can be complementary (e.g. news, video)
  • should evaluate whole-page relevance when aggregating

• Attention modeling
  • some verticals can visually attract a user’s attention without being relevant (e.g. graphic advertisements)
  • need a better understanding of attention in response to automatic page layout decisions

165 / 169

Page 176: Www2012 tutorial content_aggregation

Medium Term

• Relaxed layout constraints
  • the majority of work has focused on aggregation into a conservative ranked-list layout
  • can we consider arbitrary layouts?

• Detecting new verticals
  • currently the set of candidate verticals is manually curated
  • can we automatically detect that we should begin modeling a new vertical?

166 / 169

Page 177: Www2012 tutorial content_aggregation

In the Future, You Are the Core Content

Related reading: Kinect, http://www.xbox.com/en-US/kinect

167 / 169

Page 178: Www2012 tutorial content_aggregation

In the Future, You Are the Core Content

Related reading: Project Glass, http://bit.ly/HGqZcV

168 / 169

Page 179: Www2012 tutorial content_aggregation

References

169 / 169

Page 180: Www2012 tutorial content_aggregation

[1] James Allan (ed.), Topic detection and tracking: Event-based information organization, The Information Retrieval Series, vol. 12, Springer, New York, NY, USA, 2002.

[2] Jaime Arguello, Fernando Diaz, and Jamie Callan, Learning to aggregate vertical results into web search results, CIKM 2011, ACM, 2011, pp. 201–210.

[3] Jaime Arguello, Fernando Diaz, Jamie Callan, and Ben Carterette, A methodology for evaluating aggregated search results, ECIR 2011, Springer Berlin / Heidelberg, 2011, pp. 141–152.

[4] Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-François Crespo, Sources of evidence for vertical selection, SIGIR 2009, ACM, 2009, pp. 315–322.

169 / 169

Page 181: Www2012 tutorial content_aggregation

[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, The non-stochastic multi-armed bandit problem, SIAM Journal on Computing 32 (2002), no. 1, 48–77.

[7] Jamie Callan and Margaret Connell, Query-based sampling of text databases, TOIS 19 (2001), no. 2, 97–130.

[8] O. Chapelle, B. Schölkopf, and A. Zien (eds.), Semi-supervised learning, MIT Press, Cambridge, MA, 2006.

[9] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft, Predicting query performance, Proceedings of the 25th annual international ACM SIGIR conference on Research

169 / 169

Page 182: Www2012 tutorial content_aggregation

and development in information retrieval (Tampere, Finland), SIGIR ’02, ACM, 2002, pp. 299–306.

[10] Fernando Diaz, Integration of news content into web results, WSDM 2009, ACM, 2009, pp. 182–191.

[12] Fernando Diaz and Jaime Arguello, Adaptation of offline vertical selection predictions in the presence of user feedback, SIGIR 2009, 2009.

[13] Luis Gravano, Chen-Chuan K. Chang, Hector Garcia-Molina, and Andreas Paepcke, Starts: Stanford protocol proposal for internet retrieval and search, Technical Report 1997-68, Stanford InfoLab, April 1997. Previous number: SIDL-WP-1996-0043.

169 / 169

Page 183: Www2012 tutorial content_aggregation

[14] A. Jennings and H. Higuchi, A personal news service based on a user model neural network, IEICE Transactions on Information and Systems 75 (1992), no. 2, 192–209.

[15] Tomonari Kamba, Krishna Bharat, and Michael C. Albers, The Krakatoa Chronicle - an interactive, personalized newspaper on the web, Proceedings of the Fourth International World Wide Web Conference, 1995, pp. 159–170.

[16] Marina Krakovsky, All the news that’s fit for you, Communications of the ACM 54 (2011), no. 6, 20–21.

[17] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire, A contextual-bandit approach to personalized news article recommendation, Proceedings of the 19th international conference on World wide web (Raleigh, North Carolina, USA), WWW ’10, ACM, 2010, pp. 661–670.

169 / 169

Page 184: Www2012 tutorial content_aggregation

[18] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang, Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, Proceedings of the fourth ACM international conference on Web search and data mining (Hong Kong, China), WSDM ’11, ACM, 2011, pp. 297–306.

[19] Xiao Li, Ye-Yi Wang, and Alex Acero, Learning query intent from regularized click graphs, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA), SIGIR ’08, ACM, 2008, pp. 339–346.

[20] Jimmy Lin, Rion Snow, and William Morgan, Smoothing techniques for adaptive online language models: topic tracking in tweet streams, Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery

169 / 169

Page 185: Www2012 tutorial content_aggregation

and data mining (New York, NY, USA), KDD ’11, ACM, 2011, pp. 422–429.

[21] Jie Lu, Full-text federated search in peer-to-peer networks, Ph.D. thesis, Language Technologies Institute, Carnegie Mellon University, 2007.

[22] Ashok Kumar Ponnuswami, Kumaresh Pattabiraman, Qiang Wu, Ran Gilad-Bachrach, and Tapas Kanungo, On composition of a federated web search result page: Using online users to provide pairwise preference for heterogeneous verticals, WSDM 2011, ACM, 2011, pp. 715–724.

[23] Ankan Saha and Vikas Sindhwani, Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization, WSDM 2012, pp. 693–702.

169 / 169

Page 186: Www2012 tutorial content_aggregation

[24] Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen, Building bridges for web query classification, SIGIR 2006, ACM, 2006, pp. 131–138.

[25] Milad Shokouhi and Luo Si, Federated search, Foundations and Trends in Information Retrieval 5 (2011), no. 1, 1–102.

[26] Milad Shokouhi, Justin Zobel, Saied Tahaghoghi, and Falk Scholer, Using query logs to establish vocabularies in distributed information retrieval, IPM 43 (2007), no. 1, 169–180.

[27] Luo Si and Jamie Callan, Relevant document distribution estimation method for resource selection, SIGIR 2003, ACM, 2003, pp. 298–305.

169 / 169

Page 187: Www2012 tutorial content_aggregation

[28] Luo Si and Jamie Callan, A semisupervised learning method to merge search engine results, ACM Trans. Inf. Syst. 21 (2003), no. 4, 457–491.

[29] Yang Song, Nam Nguyen, Li-wei He, Scott Imig, and Robert Rounthwaite, Searchable web sites recommendation, WSDM 2011, ACM, 2011, pp. 405–414.

[30] Shanu Sushmita, Hideo Joho, Mounia Lalmas, and Robert Villa, Factors affecting click-through behavior in aggregated search interfaces, CIKM 2010, ACM, 2010, pp. 519–528.

[31] Richard Sutton and Andrew Barto, Reinforcement learning, MIT Press, 1998.

[32] Gabor Szabo and Bernardo A. Huberman, Predicting the popularity of online content, Communications of the ACM 53 (2010), no. 8, 80–88.

169 / 169

Page 188: Www2012 tutorial content_aggregation

[33] Jinfeng Zhuang, Tao Mei, Steven C.H. Hoi, Ying-Qing Xu, and Shipeng Li, When recommendation meets mobile: contextual and personalized recommendation on the go, Proceedings of the 13th international conference on Ubiquitous computing (Beijing, China), UbiComp ’11, ACM, 2011, pp. 153–162.

169 / 169