Learning to Link with Wikipedia
David Milne and Ian H. Witten
Department of Computer Science, University of Waikato
CIKM 2008 (Best Paper Award)
Presented by Dongjoo Lee, IDS Lab., CSE, SNU
Copyright 2009 by CEBT
Introduction
Wikification
Find the significant topics in a document and link them to the corresponding Wikipedia articles.
IDS Lab. 2009 Spring Seminar
Related Work
Automatic linking without restricting the destination documents
– Smart-Tag Service (Microsoft), AutoLink (Google)
– Many were concerned that pages were being “surreptitiously” modified for commercial purposes
– Automatic linking is most successful when restricted to safe domains such as cinema (Drenner et al. 2006)
Using Wikipedia as a destination for links
– Wikify (Mihalcea and Csomai, 2007)
– Detection involves identifying the terms and phrases from which links should be made.
– Disambiguation ensures that the detected phrases link to the appropriate article.
Topic indexing
– Identifying the most significant topics: those the document was written about (Maron, 1977; Medelyan et al., 2008)
Learning to Link with Wikipedia
Learning to disambiguate links
Learning to detect links
Wikification in the wild
Examples and implications
Conclusions
Learning to disambiguate links - commonness
Disambiguation balances the commonness of a sense with its relatedness to the surrounding context
Commonness (prior probability): how often a Wikipedia article is used as the link destination for a given term within Wikipedia
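A minimal sketch of how commonness could be estimated from link counts. The function name `build_commonness`, the `(anchor, destination)` input format, and the toy counts are assumptions made for illustration; they are not taken from the paper or from real Wikipedia data.

```python
from collections import Counter, defaultdict

def build_commonness(links):
    """Estimate the prior probability of each sense of an anchor term.

    links: iterable of (anchor_text, destination_article) pairs, one per
    link occurrence (hypothetical input format).
    Returns {anchor: {destination: fraction of links with that destination}}.
    """
    counts = defaultdict(Counter)
    for anchor, destination in links:
        counts[anchor.lower()][destination] += 1

    priors = {}
    for anchor, dest_counts in counts.items():
        total = sum(dest_counts.values())
        priors[anchor] = {d: n / total for d, n in dest_counts.items()}
    return priors

# Toy link occurrences, invented for illustration only.
toy_links = [("tree", "Tree")] * 3 + [("tree", "Tree (data structure)")]
priors = build_commonness(toy_links)
print(priors["tree"])
```

Commonness alone would always pick the most frequent sense, which is why it is balanced against relatedness to the context.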
Learning to disambiguate links - relatedness
Compare each possible sense with its surrounding context
Words that make up the context may themselves be ambiguous
Use only unambiguous context terms, i.e. those that have a single sense
– e.g. algorithm, uninformed search, LIFO stack
The problem reduces to selecting the sense article that has the most in common with all of the context articles
$$
\mathrm{relatedness}(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))}
$$
a, b: the two articles of interest
A, B: the sets of all articles that link to a and b, respectively
W: the set of all articles in Wikipedia
Some context terms are better than others
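The link-based measure over A, B, and W can be sketched in a few lines. As written, the expression is 0 when the two in-link sets coincide and grows as they overlap less, so the sketch treats it as a distance and picks the sense with the smallest value. The function names, the unweighted mean over context articles, the handling of an empty overlap, and the toy in-link sets are all assumptions for illustration, not the authors' implementation (which, as the slide notes, does not treat all context terms equally).

```python
import math

def link_distance(links_a, links_b, total_articles):
    """Evaluate the slide's formula for two sets of in-linking articles.

    Assumes total_articles is larger than either set, so the
    denominator is positive.
    """
    a, b = set(links_a), set(links_b)
    overlap = len(a & b)
    if overlap == 0:
        # log(0) is undefined; treat as maximally distant (assumption).
        return math.inf
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    return (math.log(larger) - math.log(overlap)) / (
        math.log(total_articles) - math.log(smaller)
    )

def pick_sense(candidates, context, inlinks, total_articles):
    """Choose the candidate with the smallest mean distance to the context."""
    def mean_distance(sense):
        distances = [
            link_distance(inlinks[sense], inlinks[c], total_articles)
            for c in context
        ]
        return sum(distances) / len(distances)
    return min(candidates, key=mean_distance)

# Toy in-link sets (article IDs invented for illustration).
inlinks = {
    "Tree": {3, 7, 8, 9},
    "Tree (data structure)": {1, 2, 3, 4},
    "Algorithm": {1, 2, 3, 5},
    "LIFO stack": {2, 3, 4, 6},
}
best = pick_sense(
    ["Tree", "Tree (data structure)"],
    ["Algorithm", "LIFO stack"],
    inlinks,
    total_articles=100,
)
print(best)
```

With these toy sets, the data-structure sense shares three in-links with each context article while the plant sense shares only one, so the former is selected.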
Training – Configuration – Test
[Figure: the data is split into a training set (500), a configuration set (500), and a test set (100), used for the training, configuration, and test stages respectively]
Find an optimal classifier and variables
Evaluation measures: precision, recall, f-measure
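For reference, the three evaluation measures can be computed from a set of predicted links and a set of gold-standard links as follows. This is a generic sketch of the standard definitions; the function name and the toy link sets are assumptions, and the f-measure shown is the balanced (F1) variant.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, f-measure) for two collections of links."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy example: 4 predicted links, 3 gold links, 2 in common.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r, f)
```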
Learning to disambiguate links – configuration and attribute selection
Identify the most suitable classification algorithm
Set the minimum probability a sense must have to be considered by the algorithm
– Discarding unlikely senses reduces the time required to compare relatedness between the context and the candidate senses
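The minimum-probability setting amounts to pruning the candidate senses before any relatedness comparison is made. A minimal sketch, assuming the prior probabilities are already available as a dictionary; the threshold value, the sense names, and the probabilities below are invented for illustration.

```python
def prune_senses(sense_priors, min_probability):
    """Keep only the senses whose prior probability meets the threshold."""
    return {
        sense: prior
        for sense, prior in sense_priors.items()
        if prior >= min_probability
    }

# Toy priors for one term (values invented, not real Wikipedia statistics).
sense_priors = {
    "Tree": 0.93,
    "Tree (data structure)": 0.05,
    "Tree (made-up rare sense)": 0.005,
}
kept = prune_senses(sense_priors, min_probability=0.02)  # hypothetical threshold
print(sorted(kept))
```

A higher threshold means fewer relatedness comparisons per term, at the cost of never being able to select a rare sense.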
Learning to disambiguate links - evaluation
Learning to detect links
Naïve approach (Mihalcea and Csomai, 2007)
If the probability that a word or phrase is used as a link to an article exceeds a certain threshold, a link is attached to it
Presented approach
Machine learning link detector that uses various features
– Link probability
– Relatedness
– Disambiguation confidence
– Generality: the minimum depth at which the topic is located in Wikipedia’s category tree
– Location and spread: first occurrence, last occurrence, and spread (the distance between them)
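Two of the ideas above are easy to sketch: the naïve threshold on link probability, and the location and spread features. The function names, the count-based definition of link probability, the normalisation of positions by document length, the threshold, and the toy counts are assumptions for illustration rather than the paper's exact feature definitions.

```python
def link_probability(times_linked, times_seen):
    """Fraction of a phrase's occurrences in which it is used as a link."""
    return times_linked / times_seen if times_seen else 0.0

def naive_detect(phrase_counts, threshold):
    """Naive detector: link every phrase whose link probability exceeds the threshold."""
    return [
        phrase
        for phrase, (linked, seen) in phrase_counts.items()
        if link_probability(linked, seen) > threshold
    ]

def location_features(text, phrase):
    """First occurrence, last occurrence, and spread, as fractions of document length."""
    lowered, target = text.lower(), phrase.lower()
    positions = []
    start = lowered.find(target)
    while start != -1:
        positions.append(start / len(lowered))
        start = lowered.find(target, start + 1)
    if not positions:
        return None
    first, last = positions[0], positions[-1]
    return {"first_occurrence": first, "last_occurrence": last, "spread": last - first}

# Toy counts of (times linked, times seen); invented for illustration.
counts = {"depth-first search": (80, 100), "the": (1, 100000)}
detected = naive_detect(counts, threshold=0.5)  # hypothetical threshold
features = location_features("tree search uses a tree", "tree")
print(detected, features)
```

In the learned detector, values like these would be supplied to a classifier as features alongside relatedness, disambiguation confidence, and generality, instead of being compared against a single fixed threshold.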
Learning to detect links (cont’d)
Learning to detect links - training and configuration, and evaluation
Wikification in the wild
Experimental data
A subset of 50 documents from the AQUAINT corpus
Participants and tasks
Mechanical Turk (Barr and Cabrera, 2006)
– A crowdsourcing service hosted by Amazon that provides a way for human judgment to be easily incorporated into software applications
Results
Examples and implications
Conclusion
The present paper’s contribution is a proven method of extracting key concepts from plain text that has been evaluated against an extensive body of human performance
Discussion
Well written
Clear motivation and contribution
Clear presentation of the method used to accomplish the goal
But few new ideas
– Mostly a combination of existing features that are frequently used in text classification and similar tasks