Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features


Mianwei Zhou, Kevin Chen-Chuan Chang
University of Illinois at Urbana-Champaign

Much of the Information Sought on the Web Nowadays Is about Entities.

The Web

A Huge Entity Database

Fans: We love George!! OMG! iPad Air is coming out~~

Business: How to improve our product quality?

Editor (TREC-KBA Task): How to help Wikipedia editors enrich Wikipedia?

Proposal: Entity-Centric Document Filtering System

Entity-Centric Document Filtering System: Automatically Identify Relevant Documents for Entities

Billions of news, blogs, forums, tweets...

Entity-Centric Document Filtering System

Interested Entities

Irrelevant Documents

Relevant Documents

INPUT: Only Entity Name is Usually Insufficient.

Michael Jordan

INPUT: Use Identification Page to Characterize the Target Entity.

Entity Identification Pages

Resolve the ambiguity problem.
Provide more information about the entity.

OUTPUT: Relevant/Irrelevant Documents for Target Entities.

Bill Gates
Relevant: Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ...
Irrelevant: Steve Jobs story is completely different from Bill Gates ...

Michael Jordan (NBA Player)
Relevant: Michael Jordan is considered by many the best basketball player in NBA history
Irrelevant: Michael Jordan is a leading researcher in machine learning and AI.

Problem: Entity-Centric Learning to Filter

Training Phase: entities with a Wiki Page and documents labeled Relevant/Irrelevant

Entity-centric Document Filter

Testing Phase: a new entity with a Wiki Page and documents whose labels are unknown (????)

How to Predict Document Relevance for an Entity Characterized by an Identification Page?

Relevance

Traditional IR models such as BM25 and language models do not work.
Designed for short queries
Entity pages contain many noisy keywords

Our Idea: Check whether the document mentions the most basic information about the entity.

Microsoft, Windows, Seattle, Philanthropist

Challenge: Learning Across Entities.

For an Entity with Labeled Documents, Learning its Important Keywords is Simple.

Relevant Document: Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ...
Irrelevant Document: Steve Jobs story is completely different from Bill Gates ...

Relevance of document d for entity e

However, Such Keyword Importance is Not Adaptable to Other Entities.

Training Entities (with Labeled Documents): Microsoft, Windows, Seattle, Philanthropist

New Entities (without Labeled Documents): NBA, Chicago Bull, MVP, UNC

Keyword Importance Transfer
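To make the transfer problem concrete, here is a minimal sketch of keyword importance learned for a single entity. The weighting scheme (relevant-document frequency minus irrelevant-document frequency) and the helper names `tokenize`, `learn_keyword_weights`, and `relevance` are assumptions for illustration, not the method from the slides. Weights learned for Bill Gates keywords score a Michael Jordan document as zero, because the two entities share no keywords.

```python
def tokenize(text):
    """Lowercase a document and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def learn_keyword_weights(labeled_docs, keywords):
    """Hypothetical weighting: P(keyword | relevant) - P(keyword | irrelevant)."""
    relevant = [tokenize(text) for text, label in labeled_docs if label == 1]
    irrelevant = [tokenize(text) for text, label in labeled_docs if label == 0]
    weights = {}
    for keyword in keywords:
        p_rel = sum(keyword in doc for doc in relevant) / max(len(relevant), 1)
        p_irr = sum(keyword in doc for doc in irrelevant) / max(len(irrelevant), 1)
        weights[keyword] = p_rel - p_irr
    return weights


def relevance(doc_text, weights):
    """Score a document as the sum of weights of the keywords it mentions."""
    tokens = tokenize(doc_text)
    return sum(weight for keyword, weight in weights.items() if keyword in tokens)


# Labeled documents for the training entity (1 = relevant, 0 = irrelevant).
gates_docs = [
    ("Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday", 1),
    ("Steve Jobs story is completely different from Bill Gates", 0),
]
gates_weights = learn_keyword_weights(
    gates_docs, ["microsoft", "windows", "seattle", "philanthropist"]
)

# A document about a new entity shares none of the learned keywords.
jordan_doc = "Michael Jordan is considered by many the best basketball player in NBA history"
jordan_score = relevance(jordan_doc, gates_weights)
```

The learned weights are tied to the literal keywords of the training entity, so they carry no signal for an entity whose important keywords are NBA, MVP, or UNC.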

Insight: Meta-feature Based Keyword Mapping

Keyword: Microsoft
Keyword: Chicago Bull

Both of them...
are mentioned a lot in their Wiki Pages.
are organizations.
appear in the info-box.
...

Similar Importance

Meta-Feature -- Features of Features: Properties that are related to keyword importance

General Meta-Feature: IDF, IsNoun, InEntity, ...

ID-Page-Related Meta-Feature
Wiki Page: InInfobox, InOpenPara, ...
Amazon Page: InSpec, InReview, ...

Solution: Boosting Mapping Model

Clustering-based Keyword Mapping

Training Phase

Microsoft, Harvard, Cascade

Hollywood, NKU, CFR, ...

..., the, is, this, a, here, as, the, ...

Testing Phase

Keywords of the new entity: NBA, UNC, Bobcats, Wiki, the, must, there, ...

NBA, UNC, Bobcats

Wiki, the, must, there, ...

Document Relevance based on Keyword Clusters

Keyword Clusters
Keyword Importance

Traditional Clustering Algorithm Might Fail

..., the, WA, for, October, programmer, consistently, MS, Oscar, actor, is, Occupation, Hollywood, screenwriter

1. Irrelevant Meta-Features might Lead to Useless Clusters
2. Different Possible Ways of Clustering. Which one is better?

BoostMapping: Boosting Effective Clusters

Microsoft, Harvard, Cascade, Hollywood, NKU, CFR, ...
..., the, is, this, a, here, as, the
Document Labels

Objective of Clustering: Boosting the Prediction Accuracy of Relevance

Only Useful Clusters are Generated.

BoostMapping: 1. Initialization: Uniform Document Importance

BoostMapping: 2. Enumerate Conditions to Generate the Most Predictive Cluster.

Achieve the Highest Prediction Accuracy
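The enumeration step can be illustrated with a small sketch. It rests on several assumptions that the slides do not spell out: a condition is a single (meta-feature, value) test, a cluster is the set of keywords passing that test, and a document is predicted relevant if it mentions at least one keyword in the cluster. The function names are hypothetical; `InInfobox` and `IsNoun` are meta-features named in the slides.

```python
def cluster_hits(doc_keywords, condition):
    """Count the keywords mentioned in a document whose meta-features satisfy the condition."""
    name, value = condition
    return sum(1 for meta in doc_keywords if meta.get(name) == value)


def weighted_accuracy(docs, labels, weights, condition):
    """Weighted accuracy of the rule 'relevant iff the document hits the cluster'."""
    correct = 0.0
    for doc, label, weight in zip(docs, labels, weights):
        prediction = 1 if cluster_hits(doc, condition) > 0 else 0
        if prediction == label:
            correct += weight
    return correct / sum(weights)


def best_condition(docs, labels, weights, candidates):
    """Enumerate candidate conditions and keep the most predictive one."""
    return max(candidates, key=lambda c: weighted_accuracy(docs, labels, weights, c))


# Toy data: each document is a list of mentioned keywords,
# and each keyword is represented only by its meta-features.
docs = [
    [{"InInfobox": True, "IsNoun": True}],
    [{"InInfobox": False, "IsNoun": True}],
    [{"InInfobox": True, "IsNoun": True}, {"InInfobox": False, "IsNoun": False}],
    [{"InInfobox": False, "IsNoun": False}],
]
labels = [1, 0, 1, 0]          # 1 = relevant, 0 = irrelevant
weights = [0.25] * len(docs)   # uniform document importance (step 1)
candidates = [("InInfobox", True), ("IsNoun", True)]

chosen = best_condition(docs, labels, weights, candidates)
```

Because each keyword is described only by meta-features, the chosen condition applies unchanged to the keywords of an entity that has no labeled documents.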

BoostMapping: 3. Update the Document Distribution

BoostMapping: 4. Generate the Next Cluster Under the Current Document Distribution
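The slides do not give the exact rule for updating the document distribution. As a rough sketch, the standard AdaBoost reweighting is assumed here: documents that the current cluster misclassifies gain weight, so the next cluster is chosen to explain them. The function name `update_distribution` is hypothetical.

```python
import math


def update_distribution(weights, labels, predictions):
    """AdaBoost-style reweighting (an assumption, not taken from the slides).

    Expects weights that sum to 1. Returns the new normalized weights and
    the vote alpha assigned to the cluster that produced the predictions.
    """
    # Weighted error of the current cluster's predictions.
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # Clamp to avoid division by zero or log(0) for perfect or useless clusters.
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)

    # Raise the weight of misclassified documents, lower the rest.
    updated = [
        w * math.exp(alpha if y != p else -alpha)
        for w, y, p in zip(weights, labels, predictions)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


labels = [1, 0, 1, 0]
predictions = [1, 1, 1, 0]   # the second document is misclassified
weights = [0.25, 0.25, 0.25, 0.25]

new_weights, alpha = update_distribution(weights, labels, predictions)
```

Under this assumed rule, the misclassified document ends up carrying half of the total weight, which pushes the next enumeration round toward a cluster that handles it correctly.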

Update the document distribution again

BoostMapping: 5. Repeat the Process Until the Prediction Accuracy Converges

Experiment

Three Datasets

TREC-KBA
29 person entities, 52,238 documents
Wikipedia pages as ID pages

Product
39 product entities, 2,398 documents
Amazon pages as ID pages

MilQuery (From Million Query Track)
143 general entities, 8,208 documents
Wikipedia pages as ID pages

Hostage Rescue

Kodak

Dinosaur

Performance Comparison with Baselines

QueryByName: Use Entity Names As Queries
QBD-TFIDF: Use TFIDF to Select Important Keywords as Queries.
VectorSim: Measure Relevance Based on Query-Document Similarity
LinearMapping: Keyword Mapping based on a Linear Function.

Thanks!
Q&A
