Search as Communication: Lessons from a Personal Journey
-
Upload
daniel-tunkelang -
Category
Technology
-
view
8.688 -
download
1
description
Transcript of Search as Communication: Lessons from a Personal Journey
Search as Communica/on: Lessons from a Personal Journey
Daniel Tunkelang Head of Query Understanding, LinkedIn
These are great textbooks on informa/on retrieval.
Unfortunately, I never read them in school.
But I did study graphs and stuff.
I found myself developing a search engine.
And the next thing I knew, I was a search guy.
So what did I learn along the way?
Search isn't a ranking problem. It's a communica/on problem.
Outline
1. Lessons from Library Science 2. Adventures with InformaAon ExtracAon 3. A Moment of Clarity
1. Lessons from Library Science
InformaAon need query select from results
rank using IR model
USER:
SYSTEM: M-‐idf PageRank
A birds-‐eye view of how search engines work.
Old school search: ask a librarian.
Search lives in an informa/on-‐seeking context.
[Pirolli and Card, 2005]
vs.
Recognize ambiguity and ask for clarifica/on.
Clarify, then refine.
Computers Books
Faceted search. It’s not just for e-‐commerce.
Give users transparency, guidance, and control.
Take-‐away for search engine developers:
Act like a librarian. Communicate with your user.
2. Adventures with Informa/on Extrac/on
String matching is great but has limits.
20 20
for i in [1..n]! s ← w1 w2 … wi! if Pc(s) > 0! a ← new Segment()! a.segs ← {s}! a.prob ← Pc(s)! B[i] ← {a}! for j in [1..i-1]! for b in B[j]! s ← wj wj+1 … wi! if Pc(s) > 0! a ← new Segment()! a.segs ← b.segs U {s}! a.prob ← b.prob * Pc(s)! B[i] ← B[i] U {a}! sort B[i] by prob! truncate B[i] to size k!
People search for en//es. Recognize them!
Named en/ty recogni/on is free, as in free beer.
Problem: they process each document separately.
EnAty DetecAon System
Why not take advantage of corpus features?
Give your documents the right to vote!
Use a high-‐recall method to collect candidates. • e.g., all Atle-‐case spans of words other
than single word beginning a sentence. Process each document separately.
• Each candidate is assigned an enAty type, or no type at all.
If a candidate is mostly assigned a single enAty type, extrapolate to all its occurrences.
Looking for topics? Use idf, and its cousin ridf.
Inverse document frequency (idf) • Too low? Probably a stop word. • Too high? Could be noise. Residual inverse document frequency (ridf) • Predict idf using Poisson model. • Difference between idf and predicted idf.
“a good keyword is far from Poisson” [Church and Gale, 1995]
Terminology extrac/on? Try data recycling.
Obtain en//es by any means necessary.
Take-‐away for search engine developers:
En/ty detec/on is crucial. And it isn’t that hard.
3. A Moment of Clarity
informaAon Need query select from results
rank using IR model
USER:
SYSTEM: M-‐idf PageRank
Let’s go back to our pigeons for a moment.
What does this process look like to the system?
vs.
And here’s what it looks like to the user.
GOOD NOT SO GOOD
But can the system tell the difference?
User experience should reflect system confidence.
vs.
h^p://searchengineland.com/ge`ng-‐organized-‐paid-‐search-‐user-‐intent-‐the-‐search-‐funnel-‐116312 Derived from [Jansen et al, 2007].
Searches reflect a variety of informa/on needs.
34 34
for i in [1..n]! s ← w1 w2 … wi! if Pc(s) > 0! a ← new Segment()! a.segs ← {s}! a.prob ← Pc(s)! B[i] ← {a}! for j in [1..i-1]! for b in B[j]! s ← wj wj+1 … wi! if Pc(s) > 0! a ← new Segment()! a.segs ← b.segs U {s}! a.prob ← b.prob * Pc(s)! B[i] ← B[i] U {a}! sort B[i] by prob! truncate B[i] to size k!
We can segment informa/on need from the query.
We can learn from analyzing user behavior.
And we can look at our relevance scores.
Naviga/onal Exploratory
Claudia Hauff, Query Difficulty for Digital Libraries [2009]
There are many pre-‐ and post-‐retrieval signals.
Take-‐away for search engine developers:
Queries vary in difficulty. Recognize and adapt.
Review
1. Lessons from Library Science • Act like a librarian. Communicate with users.
2. Adventures with InformaAon ExtracAon
• EnAty detecAon is crucial. And isn’t that hard. 3. A Moment of Clarity
• Queries vary in difficulty. Recognize and adapt.
Conclusion: Read the textbooks.
But treat search as a communica/on problem.
WE’RE HIRING! hbp://data.linkedin.com/search
Contact me: [email protected]
hbp://linkedin.com/in/dtunkelang