Collective annotation of wikipedia entities in web text
![Page 1: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/1.jpg)
COLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT
- Presented by Avinash S Bharadwaj (1000663882)
![Page 2: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/2.jpg)
ABSTRACT
The aim of the paper:
Annotate open-domain, unstructured web text with uniquely identified entities from a social medium such as Wikipedia.
Use the annotations for search and mining tasks.
![Page 3: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/3.jpg)
WHAT IS ENTITY DISAMBIGUATION?
An entity is something that is real and has a distinct existence.
Wikipedia articles can be considered entities.
Entity disambiguation is the task of resolving the correspondence between mentions of entities in natural language and real-world entities.
In this paper, disambiguation is carried out between mentions in web pages and Wikipedia articles.
![Page 4: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/4.jpg)
ENTITY DISAMBIGUATION EXAMPLE
![Page 5: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/5.jpg)
PREVIOUS WORK IN DISAMBIGUATION
SemTag:
First web-scale disambiguation system.
Annotated about 250 million web pages with IDs from the Stanford TAP.
Preferred high precision over recall, with an average of two annotations per page.
Wikify!:
Performed both keyword extraction and disambiguation.
Could not achieve collective disambiguation across spots.
Milne and Witten (M&W):
A form of collective disambiguation that gives better results than Wikify!.
M&W achieves an F1 measure of 0.83, compared with 0.53 for Wikify!.
Cucerzan’s algorithm:
Each entity is represented as a high-dimensional feature vector.
Annotates sparingly: about 4.5% of all possible tokens are annotated.
![Page 6: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/6.jpg)
TERMINOLOGIES
Spot: an occurrence of text on a page that can possibly be linked to a Wikipedia article.
Attachment: a possible entity in Wikipedia to which a spot can be linked.
Annotation: the process of making an attachment to a spot on a page.
Gamma list: the list of all possible annotations.
![Page 7: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/7.jpg)
TERMINOLOGIES ILLUSTRATED
Spots, Attachment, Gamma list
![Page 8: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/8.jpg)
COLLECTIVE ENTITY DISAMBIGUATION
Sometimes disambiguation cannot be carried out using a single spot on a page.
Multiple spots on the page are required to disambiguate an entity.
All spots in an article are assumed to be related.
![Page 9: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/9.jpg)
COLLECTIVE ENTITY DISAMBIGUATION EXAMPLE
![Page 10: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/10.jpg)
CALCULATING RELATEDNESS BETWEEN WIKIPEDIA ENTITIES
Relatedness between two entities is defined as r(γ, γ’) = g(γ) · g(γ’).
Cucerzan’s proposal defines relatedness between entities using the cosine measure.
Milne et al.’s proposal: c = number of Wikipedia pages; g(γ)[p] = 1 if page p links to page γ, 0 otherwise.
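A minimal Python sketch of the link-based vectors described on this slide: g(γ) is built as an in-link indicator vector, and relatedness is computed both as the plain dot product and as a cosine. The toy link graph and the helper names (`inlink_vector`, `dot`, `cosine`) are assumptions made for illustration and do not come from the paper.

```python
import math

def inlink_vector(target, links, pages):
    """Indicator vector g(target): entry is 1 if page p links to target, else 0."""
    return [1 if target in links.get(p, set()) else 0 for p in pages]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Hypothetical toy link graph: page -> set of pages it links to.
links = {
    "NBA": {"Michael Jordan", "Chicago Bulls"},
    "Basketball": {"Michael Jordan", "Chicago Bulls"},
    "Illinois": {"Chicago Bulls"},
    "Machine learning": {"Michael I. Jordan"},
}
pages = sorted(links)  # fixed page order so all vectors share the same indexing

g_mj = inlink_vector("Michael Jordan", links, pages)
g_bulls = inlink_vector("Chicago Bulls", links, pages)
g_prof = inlink_vector("Michael I. Jordan", links, pages)

print(dot(g_mj, g_bulls), dot(g_mj, g_prof), cosine(g_mj, g_bulls))
```

In this toy graph the basketball player and the Chicago Bulls share two in-linking pages, so their dot product is 2, while the player and the professor share none, so theirs is 0.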
![Page 11: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/11.jpg)
CONTRIBUTIONS OF THIS PAPER
The paper poses entity disambiguation as an optimization problem.
The paper provides a single optimization objective, solved in two ways:
Using integer linear programs
Using heuristics for approximate solutions
The paper also describes rich node features with systematic learning.
The paper also describes a back-off strategy for controlled annotation.
![Page 12: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/12.jpg)
MODELING COMPATIBILITY BETWEEN WIKIPEDIA ARTICLES
Entities are modeled using a feature vector f_s(γ).
The feature vector expresses local textual compatibility between (the context of) spot s and candidate label γ.
Components of the feature vector:
Spot side: context of the spot
Wikipedia side: snippet, full text, anchor text, anchor text with context
Similarity measures: dot product, cosine similarity, Jaccard similarity
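The three similarity measures listed above can be sketched over simple bag-of-words counts. This is an illustrative Python sketch only: the whitespace tokenisation, the example strings, and the `feature_vector` layout (one entry per Wikipedia-side field and measure) are assumptions, not the features used in the paper.

```python
import math
from collections import Counter

def bag(text):
    """Bag of words: lower-cased whitespace tokens with counts."""
    return Counter(text.lower().split())

def dot_product(a, b):
    """Sum of count products over shared tokens (missing tokens count as 0)."""
    return sum(count * b[tok] for tok, count in a.items())

def cosine_similarity(a, b):
    """Dot product normalised by both vector lengths."""
    norm = math.sqrt(dot_product(a, a) * dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Shared distinct tokens divided by all distinct tokens."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def feature_vector(spot_context, wiki_fields):
    """Hypothetical f_s(gamma): three measures for each Wikipedia-side field."""
    features = []
    for field_text in wiki_fields.values():
        field_bag = bag(field_text)
        features += [
            dot_product(spot_context, field_bag),
            cosine_similarity(spot_context, field_bag),
            jaccard_similarity(spot_context, field_bag),
        ]
    return features

spot_context = bag("jordan scored forty points for the bulls")
snippet = bag("american basketball player for the chicago bulls")

print(dot_product(spot_context, snippet))
print(cosine_similarity(spot_context, snippet))
print(jaccard_similarity(spot_context, snippet))
```

The two example strings share three tokens out of seven distinct tokens each, so the dot product is 3, the cosine is 3/7, and the Jaccard similarity is 3/11.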
![Page 13: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/13.jpg)
METHODS FOR EVALUATING THE MODEL
The authors score a candidate labelling in two ways: node score and clique score.
Node score:
Defined as a function of the feature vector f_s(γ) and w, where w is a weight vector learned with a linear adaptation of RankSVM.
Clique score:
Uses the relatedness measure of Milne and Witten.
Total objective: combines the node scores and the clique scores.
![Page 14: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/14.jpg)
BACK-OFF METHOD
Not all spots in a web page may be tagged.
A special tag “NA” is used for spots that cannot be tagged.
Spots marked “NA” do not contribute to the clique potential.
A factor called “RNA” controls the aggressiveness of the tagging algorithm.
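One way to picture how node scores, clique scores, and the NA back-off fit into a single objective is the Python sketch below. It is a sketch under stated assumptions, not the paper's formulation: the objective is taken to be an unnormalised sum, the RNA factor is modelled as a flat reward per NA spot (so a larger value makes tagging less aggressive), and all scores and names are invented for illustration.

```python
from itertools import combinations

NA = "NA"  # special label for spots left untagged

def objective(assignment, node_score, relatedness, rna):
    """Score one labelling; assignment maps spot -> entity or NA."""
    tagged = {s: e for s, e in assignment.items() if e != NA}
    # Node part: local compatibility of each tagged spot with its entity.
    node_total = sum(node_score(s, e) for s, e in tagged.items())
    # Clique part: pairwise relatedness among chosen entities; NA spots are skipped.
    clique_total = sum(relatedness(a, b) for a, b in combinations(tagged.values(), 2))
    # Back-off part (assumption): flat reward for every spot left as NA.
    na_total = rna * (len(assignment) - len(tagged))
    return node_total + clique_total + na_total

# Hypothetical toy scores.
NODE = {("Jordan", "Michael Jordan"): 0.6, ("Bulls", "Chicago Bulls"): 0.7}
REL = {frozenset({"Michael Jordan", "Chicago Bulls"}): 0.8}

def node_score(spot, entity):
    return NODE[(spot, entity)]

def relatedness(a, b):
    return REL.get(frozenset({a, b}), 0.0)

both_tagged = {"Jordan": "Michael Jordan", "Bulls": "Chicago Bulls"}
one_backed_off = {"Jordan": NA, "Bulls": "Chicago Bulls"}

print(objective(both_tagged, node_score, relatedness, rna=0.3))
print(objective(one_backed_off, node_score, relatedness, rna=0.3))
```

With these toy numbers, tagging both spots scores about 2.1 (0.6 + 0.7 + 0.8), while backing off on one spot scores about 1.0 (0.7 + 0.3), so the fully tagged labelling is preferred.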
![Page 15: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/15.jpg)
IMPLEMENTATION
Integer linear program (ILP) based formulation:
Casting as a 0/1 integer linear program
Relaxing it to an LP
Simpler heuristics:
Hill climbing for optimization
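The hill-climbing heuristic can be illustrated with a small greedy local search in Python. This sketch covers only the heuristic route, not the ILP or its LP relaxation. The starting point (every spot set to NA), the one-spot-at-a-time move, and the toy scoring function are assumptions for illustration, not the authors' exact procedure.

```python
from itertools import combinations

NA = "NA"

def hill_climb(candidates, score):
    """Greedy local search: relabel one spot at a time while the score improves."""
    assignment = {spot: NA for spot in candidates}  # start with nothing tagged
    best = score(assignment)
    improved = True
    while improved:
        improved = False
        for spot, options in candidates.items():
            for label in list(options) + [NA]:
                if label == assignment[spot]:
                    continue
                trial = {**assignment, spot: label}
                value = score(trial)
                if value > best:  # accept only strict improvements, so the loop terminates
                    assignment, best, improved = trial, value, True
    return assignment, best

# Hypothetical toy scores (node compatibility, pairwise relatedness, NA reward).
NODE = {
    ("Jordan", "Michael Jordan"): 0.6,
    ("Jordan", "Jordan (country)"): 0.5,
    ("Bulls", "Chicago Bulls"): 0.7,
    ("Bulls", "Bull"): 0.4,
}
REL = {frozenset({"Michael Jordan", "Chicago Bulls"}): 0.8}
RNA = 0.3

def toy_score(assignment):
    tagged = {s: e for s, e in assignment.items() if e != NA}
    node = sum(NODE[(s, e)] for s, e in tagged.items())
    clique = sum(REL.get(frozenset((a, b)), 0.0)
                 for a, b in combinations(tagged.values(), 2))
    return node + clique + RNA * (len(assignment) - len(tagged))

candidates = {
    "Jordan": ["Michael Jordan", "Jordan (country)"],
    "Bulls": ["Chicago Bulls", "Bull"],
}
labels, value = hill_climb(candidates, toy_score)
print(labels, value)
```

On this toy input the search settles on Michael Jordan and Chicago Bulls with a score of about 2.1. Like any hill climber, it can stop at a local optimum on harder inputs.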
![Page 16: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/16.jpg)
EVALUATING THE ALGORITHM
Evaluation measures used:
Precision: number of spots tagged correctly out of the total number of spots tagged.
Recall: number of spots tagged correctly out of the total number of spots in the ground truth.
F1: the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall).
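The three measures can be computed directly from a predicted labelling and a ground-truth labelling. The Python sketch below assumes both are dictionaries from spot to entity, with untagged spots simply absent. The example spots and entities are made up.

```python
def precision_recall_f1(predicted, truth):
    """Return (precision, recall, F1) for spot -> entity dictionaries."""
    correct = sum(1 for spot, entity in predicted.items()
                  if truth.get(spot) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical example: 4 ground-truth spots, 3 tagged, 2 of them correctly.
truth = {
    "Jordan": "Michael Jordan",
    "Bulls": "Chicago Bulls",
    "NBA": "National Basketball Association",
    "Chicago": "Chicago",
}
predicted = {
    "Jordan": "Michael Jordan",
    "Bulls": "Chicago Bulls",
    "Chicago": "Chicago (band)",
}

p, r, f1 = precision_recall_f1(predicted, truth)
print(p, r, f1)
```

Here precision is 2/3, recall is 2/4, and F1 is 4/7 (about 0.57).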
![Page 17: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/17.jpg)
DATASETS USED FOR EVALUATION
The authors use two datasets:
Web pages crawled and stored in the IITB dataset.
Publicly available data from Cucerzan’s experiments (CZ).
![Page 18: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/18.jpg)
EXPERIMENTAL RESULTS
![Page 19: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/19.jpg)
NAMED ENTITY DISAMBIGUATION IN WIKIPEDIA
The name ambiguity problem has created demand for efficient, high-quality disambiguation methods.
This is not a trivial task: the application must be able to decide whether a group of name occurrences refers to the same entity.
Traditional methods of named entity disambiguation use the bag-of-words (BOW) model.
![Page 20: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/20.jpg)
WIKIPEDIA AS A SEMANTIC NETWORK
Wikipedia is an open database covering most useful topics.
The title of a Wikipedia article describes the content of the article.
Titles may sometimes be noisy; such titles are filtered using rules from Hu et al.
![Page 21: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/21.jpg)
SEMANTIC RELATIONS BETWEEN WIKIPEDIA CONCEPTS
Wikipedia contains rich relation structures within its pages.
Relatedness is represented by links between Wikipedia pages.
![Page 22: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/22.jpg)
WORKING OF NAMED ENTITY DISAMBIGUATION USING WIKIPEDIA
A vector is used to represent each Wikipedia entity.
Similarity between vectors is measured to disambiguate named entities.
![Page 23: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/23.jpg)
MEASURING SIMILARITY BETWEEN TWO WIKIPEDIA ENTITIES
The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia.
The algorithm works in three steps, described as follows.
![Page 24: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/24.jpg)
STEP 1
To measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined.
Semantic relations between articles are used to align the concepts.
![Page 25: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/25.jpg)
STEP 2
Compute the semantic relatedness from one concept vector representation to another.
Using the alignments from the previous step, SR(MJ1→MJ2) is computed as (0.42×0.47×0.54 + 0.54×0.51×0.66 + 0.51×0.51×0.65)/(0.42×0.47 + 0.54×0.51 + 0.51×0.51) = 0.62, and
SR(MJ2→MJ1) is computed as (0.47×0.42×0.54 + 0.52×0.54×0.58 + 0.52×0.51×0.60 + 0.51×0.54×0.66)/(0.47×0.42 + 0.52×0.54 + 0.52×0.51 + 0.51×0.54) = 0.60.
![Page 26: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/26.jpg)
STEP 3
Compute the similarity between two concept vector representations.
SIM(MJ1, MJ2) is computed as (0.60 + 0.62)/2 = 0.61, SIM(MJ2, MJ3) as 0.10, and SIM(MJ1, MJ3) as 0.0.
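The arithmetic in Steps 2 and 3 can be checked with a short Python sketch. It reads each product in the slide's formula as (weight of a concept in the source vector) × (weight of its aligned concept in the target vector) × (relatedness of the two concepts). That reading of the three factors and the function names are assumptions; the numbers are the ones shown on the slides.

```python
def semantic_relatedness(alignments):
    """Weighted average of pairwise relatedness.

    Each alignment is (source_weight, target_weight, relatedness).
    """
    numerator = sum(ws * wt * rel for ws, wt, rel in alignments)
    denominator = sum(ws * wt for ws, wt, _ in alignments)
    return numerator / denominator if denominator else 0.0

def similarity(sr_ab, sr_ba):
    """Symmetric similarity: mean of the two directed relatedness values."""
    return (sr_ab + sr_ba) / 2

# Triples copied from the slide's worked example.
mj1_to_mj2 = [(0.42, 0.47, 0.54), (0.54, 0.51, 0.66), (0.51, 0.51, 0.65)]
mj2_to_mj1 = [(0.47, 0.42, 0.54), (0.52, 0.54, 0.58),
              (0.52, 0.51, 0.60), (0.51, 0.54, 0.66)]

sr_12 = semantic_relatedness(mj1_to_mj2)
sr_21 = semantic_relatedness(mj2_to_mj1)
sim_12 = similarity(sr_12, sr_21)

print(round(sr_12, 2), round(sr_21, 2), round(sim_12, 2))
```

Rounded to two decimals, this reproduces the slide values: 0.62, 0.60, and 0.61.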
![Page 27: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/27.jpg)
RESULTS
![Page 28: Collective annotation of wikipedia entities in web text](https://reader035.fdocuments.net/reader035/viewer/2022070423/56816766550346895ddc4a7d/html5/thumbnails/28.jpg)
QUESTIONS ????