Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features
Transcript of Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features
Mianwei Zhou, Kevin Chen-Chuan Chang
University of Illinois at Urbana-Champaign
Much of the Information Sought on the Web Nowadays Is about Entities.
The Web
A Huge Entity Database
Fans: "We love George!!" "OMG! iPad Air is coming out~~"
Business: How to improve our products' quality?
Editor (TREC-KBA Task): How to help Wikipedia editors enrich Wikipedia?
Proposal: Entity-Centric Document Filtering System
Entity-Centric Document Filtering System: Automatically Identify Relevant Documents for Entities
Billions of news, blogs, forums, tweets...
Interested Entities
Irrelevant Documents
Relevant Documents
INPUT: The Entity Name Alone Is Usually Insufficient.
Michael Jordan
INPUT: Use an Identification Page to Characterize the Target Entity.
Entity Identification Pages
Resolve the ambiguity problem.
Provide more information about the entity.
OUTPUT: Relevant/Irrelevant Documents for Target Entities.
Bill Gates
Relevant: Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ...
Irrelevant: Steve Jobs story is completely different from Bill Gates ...
Michael Jordan (NBA Player)
Relevant: Michael Jordan is considered by many the best basketball player in NBA history.
Irrelevant: Michael Jordan is a leading researcher in machine learning and AI.
Problem: Entity-Centric Learning to Filter
Training Phase
Wiki Page: Relevant / Irrelevant
Wiki Page: Relevant / Irrelevant
Entity-centric Document Filter
Testing Phase
Wiki Page: ????
How to Predict Document Relevance for an Entity Characterized by an Identification Page?
Relevance
Traditional IR models such as BM25 and language models do not work for this task:
Designed for short queries.
Entity pages contain many noisy keywords.
Our Idea: Check whether the document mentions the most basic information about the entity.
Microsoft, Windows, Seattle, Philanthropist
Challenge: Learning Across Entities.
For an Entity with Labeled Documents, Learning Its Important Keywords Is Simple.
Relevant Document: Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ...
Irrelevant Document: Steve Jobs story is completely different from Bill Gates ...
Relevance of document d for entity e
However, Such Keyword Importance Is Not Adaptable to Other Entities.
Training Entities (with Labeled Documents): Microsoft, Windows, Seattle, Philanthropist
New Entities (without Labeled Documents): NBA, Chicago Bull, MVP, UNC
Keyword Importance Transfer
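The contrast between learning keyword importance for one entity and reusing it for another can be made concrete with a small Python sketch. The scoring rule used here (a keyword's importance is its mention rate in relevant documents minus its mention rate in irrelevant ones, and a document's relevance is the sum of the importances of the keywords it mentions) is an assumption chosen for illustration, not the formula from the talk. The function names `tokenize`, `learn_importance`, and `relevance` are likewise hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer; strips surrounding punctuation."""
    tokens = (t.strip(".,!?()").lower() for t in text.split())
    return {t for t in tokens if t}


def learn_importance(labeled_docs, keywords):
    """Assumed rule: mention rate in relevant docs minus rate in irrelevant docs."""
    relevant = [tokenize(d) for d, is_rel in labeled_docs if is_rel]
    irrelevant = [tokenize(d) for d, is_rel in labeled_docs if not is_rel]

    def rate(docs, kw):
        return sum(kw in d for d in docs) / len(docs) if docs else 0.0

    return {kw: rate(relevant, kw) - rate(irrelevant, kw) for kw in keywords}


def relevance(doc, importance):
    """Sum the learned importance of every keyword the document mentions."""
    tokens = tokenize(doc)
    return sum(w for kw, w in importance.items() if kw in tokens)


# Labeled documents for the training entity (texts taken from the slides).
gates_docs = [
    ("Bill Gates, speaking as co-founder of Microsoft, "
     "will give a talk next Tuesday ...", True),
    ("Steve Jobs story is completely different from Bill Gates ...", False),
]
gates_importance = learn_importance(
    gates_docs, ["microsoft", "windows", "seattle", "philanthropist"]
)

# The learned table only knows Bill Gates keywords, so it carries no signal
# for a document about a different entity.
jordan_doc = ("Michael Jordan is considered by many the best "
              "basketball player in NBA history")
jordan_score = relevance(jordan_doc, gates_importance)
```

In this toy setup `microsoft` receives a positive importance for Bill Gates, while the Michael Jordan document scores zero because none of the learned keywords apply to it. That gap is what a keyword mapping across entities has to bridge.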
Insight: Meta-feature Based Keyword Mapping
Keyword: Microsoft
Keyword: Chicago Bull
Both of them...
are mentioned a lot in their Wiki pages.
are organizations.
appear in the info-box.
....
Similar Importance
Meta-Feature -- Features of Features: Properties that are related to keyword importance
General Meta-Feature: IDF, IsNoun, InEntity, ...
ID-Page-Related Meta-Feature
Wiki Page: InInfobox, InOpenPara, ...
Amazon Page: InSpec, InReview, ...
Solution: Boosting Mapping Model
Clustering-based Keyword Mapping
Training Phase
Microsoft, Harvard, Cascade
Hollywood, NKU, CFR, ...
..., the, is, this, a, here, as, the, ...
Testing Phase
NBA, UNC, Bobcats, Wiki, the, must, there, ...
NBA, UNC, Bobcats
Wiki, the, must, there, ...
Document Relevance based on Keyword Clusters
Keyword Clusters
Keyword Importance
Traditional Clustering Algorithm Might Fail
...
..., the, WA, for, October, programmer, consistently, MS, Oscar, actor, is, Occupation, Hollywood, screenwriter
1. Irrelevant Meta-Features Might Lead to Useless Clusters
2. Different Possible Ways of Clustering. Which one is better?
OR?
BoostMapping: Boosting Effective Clusters
Microsoft, Harvard, Cascade
Hollywood, NKU, CFR, ...
..., the, is, this, a, here, as, the
Document Labels
Objective of Clustering: Boosting the Prediction Accuracy of Relevance
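One way to read this objective is as an AdaBoost-style loop in which each weak learner is a keyword cluster defined by a condition on meta-features. The Python sketch below is a minimal illustration under several assumptions and is not the authors' implementation: meta-features are binary, a condition is a single `(meta_feature, value)` pair, a cluster "fires" when the document mentions at least one of its entity's keywords that satisfies the condition, and the weight update is the standard AdaBoost rule. The `InInfobox` values in the toy data and the names `fires`, `weighted_error`, `boost_mapping`, and `predict` are all hypothetical.

```python
import math


def tokenize(text):
    """Lowercase whitespace tokenizer; strips surrounding punctuation."""
    tokens = (t.strip(".,!?()").lower() for t in text.split())
    return {t for t in tokens if t}


def fires(example, condition):
    """+1 if the doc mentions any keyword of its entity matching the condition."""
    name, value = condition
    hit = any(meta[name] == value and kw in example["tokens"]
              for kw, meta in example["keywords"].items())
    return 1 if hit else -1


def weighted_error(weights, examples, condition):
    """Total weight of the documents that the cluster misclassifies."""
    return sum(w for w, ex in zip(weights, examples)
               if fires(ex, condition) != ex["label"])


def boost_mapping(examples, conditions, rounds=5, eps=1e-10):
    weights = [1.0 / len(examples)] * len(examples)   # uniform start
    model = []
    for _ in range(rounds):
        # Enumerate conditions and keep the most predictive cluster.
        best = min(conditions,
                   key=lambda c: weighted_error(weights, examples, c))
        err = weighted_error(weights, examples, best)
        if err >= 0.5:          # no cluster beats chance; stop
            break
        alpha = 0.5 * math.log((1 - err + eps) / (err + eps))
        model.append((best, alpha))
        if err < eps:           # training documents already separated
            break
        # Re-weight documents so the next cluster focuses on the mistakes.
        weights = [w * math.exp(-alpha * ex["label"] * fires(ex, best))
                   for w, ex in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, example):
    score = sum(alpha * fires(example, c) for c, alpha in model)
    return 1 if score > 0 else -1


def make_example(text, keywords, label=None):
    return {"tokens": tokenize(text), "keywords": keywords, "label": label}


# Toy data; the meta-feature values are invented for illustration.
gates_kw = {"microsoft": {"InInfobox": 1}, "from": {"InInfobox": 0}}
jordan_kw = {"nba": {"InInfobox": 1}, "is": {"InInfobox": 0}}

train = [
    make_example("Bill Gates, speaking as co-founder of Microsoft, "
                 "will give a talk next Tuesday ...", gates_kw, 1),
    make_example("Steve Jobs story is completely different from "
                 "Bill Gates ...", gates_kw, -1),
]
model = boost_mapping(train, [("InInfobox", 1), ("InInfobox", 0)])

jordan_relevant = make_example(
    "Michael Jordan is considered by many the best basketball player "
    "in NBA history", jordan_kw)
jordan_irrelevant = make_example(
    "Michael Jordan is a leading researcher in machine learning and AI.",
    jordan_kw)
```

Because the learned model refers to meta-feature conditions rather than to specific keywords, it can score documents for an entity such as Michael Jordan that contributed no labeled documents, provided its keywords come with the same meta-features.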
Only Useful Clusters Are Generated.
BoostMapping: 1. Initialization: Uniform Document Importance
BoostMapping: 2. Enumerate Conditions to Generate the Most Predictive Cluster.
Achieve the Highest Prediction Accuracy
BoostMapping: 3. Update the Document Distribution
BoostMapping: 4. Generate the Next Cluster Under the Current Document Distribution
Update the document distribution again
BoostMapping: 5. Repeat the Process Until the Prediction Accuracy Converges
Experiment
Three Datasets
TREC-KBA: 29 person entities, 52,238 documents. Wikipedia pages as ID pages.
Product: 39 product entities, 2,398 documents. Amazon pages as ID pages.
MilQuery (from Million Query Track): 143 general entities, 8,208 documents. Wikipedia pages as ID pages.
HostageRescue
Kodak
Dinosaur
Performance Comparison with Baselines
QueryByName: Use Entity Names as Queries.
QBD-TFIDF: Use TFIDF to Select Important Keywords as Queries.
VectorSim: Measure Relevance Based on Query-Document Similarity.
LinearMapping: Keyword Mapping Based on a Linear Function.
Thanks!
Q&A