Optimized interleaving for online retrieval evaluation
Optimized Interleaving for Online Retrieval Evaluation
(Best paper at WSDM'13)
Authors: Filip Radlinski, Nick Craswell
Slides by: Han Jiang
Agenda
Basic concepts
Previous algorithms
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Basic concepts
What is interleaving?
Merge results from different retrieval algorithms.
Only a combined list is shown to the user.
The relative quality of the algorithms can be inferred with the help of clickthrough data.
[Diagram: for a query, Search Engine A and Search Engine B each produce a source list; the interleaving algorithm merges them into the interleaved list shown to the user; the user's clicks, together with the assignment and the credit function, yield the evaluation result.]
Basic concepts +
Ah, that's easy… how about: interleaving method = pick the best results from each algorithm?
Wait… how do we know whether d1 is better than d4?
OK, then toss a coin instead, and credit function = if d_i is clicked and ranked higher in ranker A, prefer A.
Urgh… when a user randomly clicks on (d1, d2, d3), A is always preferred…
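The "Urgh" above can be checked with a tiny simulation. This is a sketch, assuming the slide's running example: rankings d1–d4, a naive merged list M = (d1, d2, d3), and the naive credit rule "prefer the ranker that placed the clicked document higher". A uniformly random clicker then always prefers A:

```python
import random

# Rankings from the slide's running example
a = ["d1", "d2", "d3", "d4"]
b = ["d4", "d1", "d2", "d3"]
merged = ["d1", "d2", "d3"]  # naive "best results from each" merge, dominated by A

def naive_credit(doc):
    """Naive rule: prefer the ranker that placed the clicked doc higher."""
    return "A" if a.index(doc) < b.index(doc) else "B"

rng = random.Random(0)
wins = {"A": 0, "B": 0}
for _ in range(1000):
    wins[naive_credit(rng.choice(merged))] += 1  # click uniformly at random

# Every doc in M sits higher in A than in B, so A "wins" every trial:
print(wins)  # {'A': 1000, 'B': 0}
```

The clicks carry no preference signal at all, yet the comparison declares A the winner, which is exactly the bias the slide is pointing at.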
Basic concepts ++
So, what is a good interleaving algorithm?
Intuitively*, a good one should:
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality).
Not substantially alter the search experience.
Lead to clicks that reflect the user's preference.
[*] Joachims, Optimizing Search Engines Using Clickthrough Data, KDD'02
Agenda
Basic concepts √
Previous algorithms
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Previous Algorithms
Balanced Interleaving: toss a coin once, then pick the best items in turns.
Team Draft Interleaving: toss a coin every two picks; the winner picks its best item first.
Probabilistic Interleaving: toss a coin every time, then sample an item from the winner.
A weight function ensures that a document at a higher rank has a higher probability of being picked.
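Team Draft Interleaving can be sketched as follows (a minimal sketch; the function names are mine, and the coin sequence is passed in explicitly so the slides' coin=AA example is reproducible):

```python
def team_draft_interleave(a, b, coins):
    """Team Draft: one coin toss per round decides which ranker picks
    first; each team then appends its best not-yet-picked document."""
    merged, team_a, team_b = [], set(), set()
    for coin in coins:
        order = [(a, team_a), (b, team_b)] if coin == "A" else [(b, team_b), (a, team_a)]
        for ranker, team in order:
            if len(merged) < len(a):
                doc = next(d for d in ranker if d not in merged)
                merged.append(doc)
                team.add(doc)
    return merged, team_a, team_b

def team_draft_credit(team_a, team_b, clicks):
    """Each click credits the team whose ranker contributed the doc."""
    ca = sum(d in team_a for d in clicks)
    cb = sum(d in team_b for d in clicks)
    return "A" if ca > cb else "B" if cb > ca else "tie"
```

With A = (d1, d2, d3, d4), B = (d4, d1, d2, d3) and coins "AA", the merged list is (d1, d4, d2, d3); clicks on d1 and d3 land one on each team, giving the tie shown in the example slide.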
Previous Algorithms +
For credit functions, only documents that are clicked by users are considered.
Balanced Interleaving (coin=A)
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1, d3
Credit uses the top k = 3 of each input list (the lowest rank covering all clicked documents):
A: d1 d2 d3 → 2 clicks
B: d4 d1 d2 → 1 click
A wins
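The walk above can be sketched in a few lines (assuming Joachims' rule that credit is counted within the top k of each input, where k is the lowest rank covering all clicked documents; names are mine):

```python
def balanced_interleave(a, b, a_first):
    """One coin toss (a_first) decides the leading ranker; then take
    documents in turns, skipping any already in the merged list."""
    merged, ka, kb = [], 0, 0
    while len(merged) < len(a):
        if ka < kb or (ka == kb and a_first):
            if a[ka] not in merged:
                merged.append(a[ka])
            ka += 1
        else:
            if b[kb] not in merged:
                merged.append(b[kb])
            kb += 1
    return merged

def balanced_credit(a, b, clicks):
    """Compare click counts within the top k of each input ranking,
    where k is the lowest rank covering all clicked documents."""
    k = 1 + min(max(a.index(d) for d in clicks),
                max(b.index(d) for d in clicks))
    ca = sum(d in a[:k] for d in clicks)
    cb = sum(d in b[:k] for d in clicks)
    return "A" if ca > cb else "B" if cb > ca else "tie"
```

On the slide's example (coin=A) this yields M = (d1, d4, d2, d3); clicks on d1 and d3 give k = 3, with two clicks inside A's top 3 against one inside B's, so A wins.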
Team Draft Interleaving (coin=AA)
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1, d3
Team A contributed d1 and d2; team B contributed d4 and d3, so the two clicks split evenly:
tie
Probabilistic Interleaving (possible coins: AA, AB)
A: d1 d2 d3 d4
B: d4 d1 d2 d3
M: d1 d4 d2 d3
clicks on: d1, d3
Marginalizing over the coin sequences that could have generated M:
A wins with p=100%
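Probabilistic Interleaving itself can be sketched as below. This is a simplification of Hofmann et al.'s method: the softmax exponent τ = 3 and the helper names are assumptions, and the credit marginalization over coin sequences is omitted.

```python
import random

TAU = 3  # softmax decay: top ranks get much higher weight

def soft_pick(ranker, used, rng):
    """Sample a not-yet-used document with probability ∝ 1/rank^TAU."""
    cand = [(d, 1.0 / (r + 1) ** TAU)
            for r, d in enumerate(ranker) if d not in used]
    docs, weights = zip(*cand)
    return rng.choices(docs, weights=weights)[0]

def probabilistic_interleave(a, b, rng):
    """Toss a fair coin at every position; sample the next document
    from the winning ranker's softmax over the remaining documents."""
    merged = []
    while len(merged) < len(a):
        ranker = a if rng.random() < 0.5 else b
        merged.append(soft_pick(ranker, set(merged), rng))
    return merged
```

Because every remaining document keeps a nonzero probability at every step, any permutation of the pool can be shown, including lists worse than both inputs, which is the user-experience concern raised later in these slides.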
Agenda
Basic concepts √
Previous algorithms √
Framework: Invert Problem
Refine Problem
Theoretical benefits
Illustration
Evaluation
Discussion
Invert the problem
Why the previous algorithms are not good enough:
Balanced interleaving & Team Draft interleaving: biased
(even a random click on the documents raises up a winner)
Probabilistic interleaving: degrades the user experience
(blah… A=(d1, d2), B=(d1, d2), but M=(d2, d1))
Therefore, the problem of interleaving should be more constrained.
A good way is to start from the principles…
Again, what is a good interleaving algorithm?
Be blind to the user. Be blind to the retrieval functions.
Be robust to biases in the user's decision process (that do not relate to retrieval quality).
Not substantially alter the search experience.
Lead to clicks that reflect the user's preference.
Refine the problem
The last two principles are refined, and a sensitivity principle is added:
Not substantially alter the search experience (show one of the rankings, or a ranking "in between" the two).
Lead to clicks that reflect the user's preference:
If document d is clicked, the input ranker that ranked d higher is given more credit.
A randomly clicking user doesn't create a preference for either ranker.
Be sensitive to the input data (the fewest user queries should show a significant preference).
Refine the problem ++
Not substantially alter the search experience: show one of the rankings, or a ranking "in between" the two — e.g. A=(d1, d2), B=(d1, d2) ⇒ M=(d1, d2).
Lead to clicks that reflect the user's preference: if document d is clicked, the input ranker that ranked d higher is given more credit.
A randomly clicking user doesn't create a preference for either ranker. Formally, with L_i a possible interleaved list under the previous constraints, |L| the length of the list, k the number of clicks, and δ the score function (when > 0 it assigns credit to A, otherwise to B), the expected credit of a random clicker must vanish:
Σ_i p_i · Σ_{j=1}^{k} δ(L_i[j]) = 0, for every k = 1, …, |L|
Be sensitive to the input data (the fewest user queries should show a significant preference).
Refine the problem +++, ++++
So the constraint is the unbiasedness condition above, and the target is to maximize sensitivity,
with the probabilities p_i of showing each allowed list as the variables.
Define the credit function δ; two choices:
Linear rank difference: δ(d) = rank(d, B) − rank(d, A)
Inverse rank: δ(d) = 1/rank(d, A) − 1/rank(d, B)
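The two credit functions can be written down directly. A sketch, assuming 1-based ranks, with positive δ crediting A and negative δ crediting B:

```python
def linear_rank_diff(rank_a, rank_b):
    """δ(d) = rank_B(d) − rank_A(d): a doc ranked higher in A yields
    positive credit for A, with magnitude equal to the rank gap."""
    return rank_b - rank_a

def inverse_rank(rank_a, rank_b):
    """δ(d) = 1/rank_A(d) − 1/rank_B(d): same sign as the linear
    difference, but disagreements near the top count much more."""
    return 1.0 / rank_a - 1.0 / rank_b
```

For example, a clicked document at rank 1 in A and rank 3 in B gives linear credit 2 and inverse-rank credit 1 − 1/3 ≈ 0.67, both in A's favor; swapping the ranks flips the signs.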
Since it is an optimization problem, the existence of a solution should be guaranteed theoretically; in the paper, however, it is only guaranteed empirically.
Theoretical Benefits
PROPERTY 1: Balanced interleaving ⊆ this framework
PROPERTY 2: Team Draft interleaving ⊆ this framework
PROPERTY 3: This framework ⊆ Probabilistic interleaving
PROPERTY 4: The merged list is something "in between" the two
Theoretical Benefits +
PROPERTY 5: The breaking case in Balanced interleaving is omitted
PROPERTY 6: The insensitivity in Team Draft interleaving is improved
PROPERTY 7: Probabilistic interleaving degrades the user experience more
Illustration
L1 is unbiased towards a random user: 3·25% + (−1)·(35% + 40%) = 0
Note: the number of constraints is 5, but there are 6 unknowns, so a degree of freedom remains.
It is a maximization problem: the goal is to maximize Σ_i p_i · sensitivity(L_i).
Sensitivity is the option to pursue with the remaining freedom.
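The arithmetic on this slide is the unbiasedness constraint evaluated for one click count: the credits a random clicker accrues on each allowed list, weighted by the probabilities of showing those lists, must cancel. A quick check, using my reading of the slide's figures (credit 3 on one list, −1 on the other two):

```python
# Credits accrued by a random clicker on three allowed lists, shown
# with probabilities 25%, 35%, 40% (figures from the slide).
credits = [3, -1, -1]
probs = [0.25, 0.35, 0.40]

expected_credit = sum(p * c for p, c in zip(probs, credits))
assert abs(expected_credit) < 1e-9  # 3*0.25 - 0.35 - 0.40 = 0: unbiased
```

The optimizer's job is to choose the probabilities so that this cancellation holds for every click count, then spend the leftover degree of freedom on sensitivity.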
Agenda
Basic concepts √
Previous algorithms √
Framework √: Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √
Evaluation
Discussion
Evaluation: summary
Construct a dataset to simulate interleaving and user interaction.
Evaluate the Pearson correlation between each pair of algorithms.
Analyze cases where the algorithms disagree.
Evaluate result quality by experts.
Analyze bias and sensitivity among the algorithms.
Evaluation +: construction of the dataset
Collect all queries as well as the top-4 results from a search engine.
Since the web and the algorithm are changing, there are many distinct rankings for the same query.
For each query, make sure that there are at least 4 distinct rankings, each shown to users at least 10 times, with at least 1 click.
The most frequent ranking is regarded as A; the most dissimilar one is regarded as B.
Further filter the log, so that results produced by both Balanced interleaving and Team Draft interleaving are frequent.
Evaluation ++
Evaluation +++
Evaluation ++++: bias comparison among the different algorithms
Evaluation +++++: sensitivity comparison among the different algorithms
Agenda
Basic concepts √
Previous algorithms √
Framework √: Invert Problem √
Refine Problem √
Theoretical benefits √
Illustration √
Evaluation √
Discussion
Discussion
Contributions of this paper:
Inverts the question of obtaining interleaving algorithms into a constrained optimization problem. The solution is very intuitive and general.
Many interesting examples illustrate the breaking cases of previous approaches.
Note:
The evaluation is simulated on logs from only one search engine. For interleaving, we would expect an evaluation based on different search engines; that may be why the human evaluation result is not good across all algorithms.
Discussion +
"A and B are not shown to users as they have low sensitivity."
This is intuitive; however, it contradicts the result shown in Table 1: (a, b, c, d) has sensitivity 0.83, which is high.
Thank You!