Dynamic Collective Entity Representations for Entity Ranking

Dynamic Collective Entity Representations for Entity Ranking
David Graus, Manos Tsagkias, Wouter Weerkamp, Edgar Meij, Maarten de Rijke

Transcript of Dynamic Collective Entity Representations for Entity Ranking



2

First: entities and structure. I get to show the mandatory entity search example.

3

You are not interested in documents but in things: the person/artist Kendrick Lamar, here referred to by his former stage name.

4 Entity search?
Index = Knowledge Base (= Wikipedia)
Documents = Entities
Real-world entities have a single representation (in the KB)

So it is like web search, but the units of retrieval are real-life entities, so we can collect data about them.

5 Representation is not static
People talk about entities all the time
Associations between words and entities change over time

This is what we try to leverage in this work.

6 Example 1: News events

July 31st vs. after August 7th -> added content, new word associations.

7 Example 2: Social media chatter

*****

This looks a bit extreme because there's swearing, but there's a serious intuition here: the vocabulary gap (formal KB, informal chatter).

8 Dynamic Collective Entity Representations
Use collective intelligence: mine entity descriptions to enrich the representation.
Is like document expansion (add terms found through explicit links)
Is not query expansion (terms found through predicted links)

Our method aims to leverage this: enrich the representation and close the gap.

9 Advantages
Cheap: change the document in the index, leverage tried & tested retrieval algorithms
Free smoothing: dynamic sources (e.g., tweets) may capture newly evolving word associations (Ferguson shooting) and incorporate out-of-document terms
Move relevant documents closer to queries (= close the gap between searcher vocabulary & docs in the index)

...of collective intelligence / description sources.

10 Haven't we seen this before?
Anchors & queries in particular have been shown to improve retrieval [1]
Tweets have been shown to be similar to anchors [2]
Social tags, the same [3]
But: in batch (i.e., add data, see how it affects retrieval), and from a single source
[1] T. Westerveld, W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, URLs and anchors. TREC 2001.
[2] G. Mishne and J. Lin. Twanchor text: A preliminary study of the value of tweets as anchor text. SIGIR 2012.
[3] C.-J. Lee and W. B. Croft. Incorporating social anchors for ad hoc retrieval. OAIR 2013.

We look at a scenario where the expansions come in a streaming manner.

11 Description sources

[Slide figure: fielded document representation of the entity Anthropornis nordenskjoeldi (Nordenskjoeld's Giant Penguin). Fields and example terms: KB anchors ("emperor penguin", "Nordenskjoeld's Giant Penguin"), KB categories ("Eocene birds", "Oligocene birds", "Extinct penguins", "Oligocene extinctions", "Bird genera"), KB redirects ("Nordenskjoeld's giant penguin", "Anthropornis"), KB links ("Eocene", "Oligocene", "Animal", "Chordate", "Aves", "Sphenisciformes", "Spheniscidae", ...), web anchors ("Anthropornis nordenskjoeldi"), tags ("megafauna"), tweets, and queries ("biggest penguin", "anthropornis", "extinct penguin", "prehistoric birds").]
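As an illustration of such a fielded representation, here is a minimal Python sketch. The entity and field names follow the slide, while the dictionary layout and the `expand` helper are hypothetical, not the paper's actual data structures.

```python
# Minimal sketch of a fielded entity document: each description source
# (KB-internal and collective) becomes its own searchable field.
entity = {
    "id": "Anthropornis_nordenskjoeldi",
    "fields": {
        "kb_anchors":    ["emperor", "penguin"],
        "kb_categories": ["eocene", "birds", "extinct", "penguins"],
        "kb_redirects":  ["nordenskjoeld's", "giant", "penguin"],
        "web_anchors":   ["anthropornis", "nordenskjoeldi"],
        "tags":          ["megafauna"],
        "tweets":        [],  # dynamic: filled as new chatter streams in
        "queries":       ["biggest", "penguin"],
    },
}

def expand(entity, field, terms):
    """Dynamic expansion: append newly observed terms to one field."""
    entity["fields"][field].extend(terms)

# A new tweet mentioning the entity adds fresh vocabulary:
expand(entity, "tweets", ["prehistoric", "birds"])
```

The point of keeping sources in separate fields rather than one bag of words is that a ranker can later weight each source differently.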

12 Challenge
Heterogeneity: description sources, entities
Dynamic nature: content changes over time

You could do vanilla retrieval, but two challenges arise. First, description sources differ along several dimensions (e.g., volume, quality, novelty), and head entities are likely to receive a larger number of external descriptions than tail entities. Second, content changes over time, so expansions may accumulate and swamp the representation.

13 Method: Adaptive ranking
Supervised single-field weighting model
Features:
- Field similarity: retrieval score per field
- Field importance: length, novel terms, etc.
- Entity importance: time since last update
(Re-)learn optimal weights from clicks

Our solution is to dynamically learn how to combine fields into a single representation. Features (more detail in the paper): field similarity features (per field), i.e., query-field similarity scores; field importance features (per field) to inform the ranker of the status of the field at that time (i.e., more and novel content); and entity importance features (to favor recently updated entities). (What about the experimental setup?)
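A toy sketch of the field-weighting idea (not the paper's actual model): per-field similarity scores are combined with one weight per field, and the learning step would fit those weights from clicks. Here `field_score` is a hypothetical stand-in for a real per-field retrieval score such as BM25, and the weights are made-up values.

```python
from collections import Counter

def field_score(query_terms, field_terms):
    """Toy per-field similarity: raw term-frequency overlap
    (a stand-in for a real retrieval score, e.g., BM25)."""
    tf = Counter(field_terms)
    return float(sum(tf[t] for t in query_terms))

def score_entity(query_terms, fields, field_weights):
    """Combine per-field similarities with (learned) field weights."""
    return sum(w * field_score(query_terms, fields.get(f, []))
               for f, w in field_weights.items())

fields = {"kb_anchors": ["giant", "penguin"], "tags": ["penguin"], "tweets": []}
weights = {"kb_anchors": 2.0, "tags": 1.0, "tweets": 0.5}  # hypothetical weights
score = score_entity(["penguin"], fields, weights)  # 2.0*1 + 1.0*1 + 0.5*0 = 3.0
```

Re-learning `weights` after each batch of clicks is what makes the ranker adaptive: as a dynamic field like tweets accumulates useful terms, its weight can grow.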

14 Experimental setup
Data: MSN query log (62,841 queries + clicks on entities)

Each query is treated as a time unit. For each query:
1. Produce ranking
2. Observe click
3. Evaluate ranking (MAP/P@1)
4. Expand entities (w/ dynamic descriptions)
5. [Re-train ranker]

We took all queries that yield Wikipedia clicks, did top-k retrieval, and extracted features. This allows us to track performance over time.
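The per-query loop above can be sketched as follows. `rank`, `expand_entity`, and `retrain` are hypothetical callbacks standing in for the real components, and only P@1 is tracked to keep the sketch short.

```python
def online_evaluation(query_stream, rank, expand_entity, retrain):
    """Streaming setup: each query is one time unit. Rank, observe the
    click, evaluate, expand the clicked entity, and retrain."""
    hits, n = 0, 0
    for query, clicked_entity, new_terms in query_stream:
        ranking = rank(query)                     # 1. produce ranking
        hits += ranking[0] == clicked_entity      # 2-3. observe click, evaluate P@1
        expand_entity(clicked_entity, new_terms)  # 4. dynamic expansion
        retrain(query, clicked_entity)            # 5. re-learn field weights
        n += 1
    return hits / n  # mean P@1 over the stream

# Usage with trivial callbacks: a static ranker that always puts e1 first.
stream = [("q1", "e1", []), ("q2", "e2", [])]
p1 = online_evaluation(stream, lambda q: ["e1", "e2"],
                       lambda e, t: None, lambda q, e: None)  # -> 0.5
```

Because evaluation happens before expansion and retraining, each click is scored with the model and index state available at that moment, which is what makes performance trackable over time.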

15 Main results
Comparing effectiveness of different description sources
Comparing adaptive vs. non-adaptive ranker performance

In this talk I focus on the contribution of the sources, and on the adaptive vs. static ranker.

16 Description sources

[Slide figure: MAP against number of queries, per description source.]

1. Each source contributes to better ranking; tags/web anchors do best, and tweets are significantly better than KB-only.
2. Dynamic sources have higher learning rates (suggests that newly incoming data is successfully incorporated).
3. Tags start below web anchors but approach them; new tags improve ranking.
[NEXT] To see the effect of incoming data: feature weights.

17 Feature weights over time

[Slide figure: relative feature importance against number of queries.]

- Static sources go down, dynamic sources go up (suggests retraining is important w/ dynamic expansions).
- Tweets only marginally, but as we know KB+tweets > KB, the tweets do help.
- Not shown: static expansions stay roughly the same.
[NEXT] Increasing field weight + increased performance suggests retraining is needed; next:

18 Non-adaptive vs. adaptive ranking

1. [LEFT] Lower performance overall (more data w/o more training queries).
2. [LEFT] The dynamic sources have higher slopes, so newly incoming data does help even in the static setting.
3. [RIGHT] Same patterns, but tags + web anchors do comparatively better (because of swamping?).
[END] Higher performance: retraining increases the ranker's ability to optimally combine descriptions into a single representation.

19 In summary
Expanding entity representations with different sources enables better matching of queries to entities
As new content comes in, it is beneficial to retrain the ranker
Informing the ranker of the expansion state further improves performance

More data helps, but to benefit optimally you need to inform your ranker.

20 Thank you
(Also, thank you WSDM & SIGIR travel grants)