COLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT
- Presented by Avinash S Bharadwaj (1000663882)
ABSTRACT
The aim of the paper: annotation of open-domain unstructured web text with uniquely identified entities from a social medium like Wikipedia.
The annotations are used for search and mining tasks.
WHAT IS ENTITY DISAMBIGUATION?
An entity is something that is real and has a distinct existence.
Wikipedia articles can be considered as entities.
Entity disambiguation is the art of resolving correspondence between mentions of entities in natural language and real world entities.
In this paper, disambiguation is carried out between spots in web pages and Wikipedia articles.
PREVIOUS WORK IN DISAMBIGUATION
SemTag:
  First web-scale disambiguation system.
  Annotated about 250 million web pages with IDs from the Stanford TAP.
  Preferred high precision over recall, with an average of two annotations per page.
Wikify!:
  Performed both keyword extraction and disambiguation.
  Could not achieve collective disambiguation across spots.
Milne and Witten (M&W):
  A form of collective disambiguation that gives better results than Wikify!.
  M&W achieves an F1 measure of 0.83, whereas Wikify! achieves an F1 measure of 0.53.
Cucerzan’s algorithm:
  Each entity is represented as a high-dimensional feature vector.
  Annotates sparingly: about 4.5% of all possible tokens are annotated.
TERMINOLOGIES
Spot: an occurrence of text on a page that can possibly be linked to a Wikipedia article.
Attachment: a possible entity in Wikipedia to which a spot can be linked.
Annotation: the process of making an attachment to a spot on a page.
Gamma list: the list of all possible annotations.
COLLECTIVE ENTITY DISAMBIGUATION
Sometimes disambiguation cannot be carried out using a single spot on a page.
Multiple spots on a page are required to disambiguate an entity.
All spots in an article are considered to be related.
CALCULATING RELATEDNESS BETWEEN WIKIPEDIA ENTITIES
Relatedness between two entities is defined as r(γ, γ’) = g(γ) · g(γ’).
Cucerzan’s proposal defines relatedness between entities using the cosine measure.
Milne et al.’s proposal uses c = number of Wikipedia pages and g(γ)[p] = 1 if page p links to page γ, 0 otherwise.
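The relatedness definitions on this slide can be sketched in code. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, each entity is represented by the set of pages that link to it (the nonzero positions of g(γ)), and the Milne and Witten measure is written in its commonly cited form, 1 - (log max - log shared) / (log c - log min), which is an assumption because the slide's own formula did not survive.

```python
import math

def dot_relatedness(inlinks_a, inlinks_b):
    """g(a) . g(b) for 0/1 in-link indicator vectors: the number of shared in-links."""
    return len(inlinks_a & inlinks_b)

def mw_relatedness(inlinks_a, inlinks_b, c):
    """Milne-Witten style relatedness (assumed form); c = number of Wikipedia pages."""
    shared = len(inlinks_a & inlinks_b)
    if shared == 0:
        return 0.0  # no common in-links: treat as unrelated
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    if c <= small:
        raise ValueError("c must exceed the size of the smaller in-link set")
    distance = (math.log(big) - math.log(shared)) / (math.log(c) - math.log(small))
    return max(0.0, 1.0 - distance)

# Toy in-link sets (page IDs are made up for illustration).
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(dot_relatedness(a, b), mw_relatedness(a, b, c=1000))
```

Two entities with identical in-link sets get relatedness 1.0 under this form, and entities with no shared in-links get 0.0.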
CONTRIBUTIONS OF THIS PAPER
The paper poses entity disambiguation as an optimization problem.
The paper provides a single optimization objective, solved in two ways:
  Using integer linear programs
  Using heuristics for approximate solutions
The paper also describes rich node features with systematic learning.
The paper also describes a back-off strategy for controlled annotations.
MODELING COMPATIBILITY BETWEEN WIKIPEDIA ARTICLES
Entities are modeled using a feature vector fs(γ).
The feature vector expresses local textual compatibility between (the context of) spot s and candidate label γ.
Components of the feature vector:
  Spot side: context of the spot
  Wikipedia side: snippet, full text, anchor text, anchor text with context
  Similarity measures: dot product, cosine similarity, Jaccard similarity
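The three similarity measures can be illustrated on bag-of-words representations. The sketch below is an assumption about how such features might be assembled, not the paper's code: whitespace tokenization, the helper names, and the layout of the resulting vector (three similarities per Wikipedia-side field) are all hypothetical.

```python
import math
from collections import Counter

def bag(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())

def dot(a, b):
    return sum(a[t] * b[t] for t in a.keys() & b.keys())

def cosine(a, b):
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def jaccard(a, b):
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

def feature_vector(spot_context, wiki_fields):
    """One (dot, cosine, Jaccard) triple per Wikipedia-side field, e.g. snippet or anchor text."""
    ctx = bag(spot_context)
    feats = []
    for field in wiki_fields:
        f = bag(field)
        feats.extend([dot(ctx, f), cosine(ctx, f), jaccard(ctx, f)])
    return feats

fs = feature_vector("michael jordan played basketball",
                    ["jordan basketball player", "machine learning professor"])
print(fs)
```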
METHODS FOR EVALUATING THE MODEL
The authors evaluate a candidate annotation in two ways: node score and clique score.
Node score: a function of the feature vector fs(γ) and a weight vector w, where w is learned from a training set by a linear adaptation of rank SVM.
Clique score: uses the relatedness measure of Milne and Witten.
Total objective: combines the node scores with the clique score.
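How the two scores might combine into one objective can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: the node score is taken to be the linear form w · fs(γ), the node and clique terms are each averaged before being added, and all names (node_score, total_objective, the toy data) are hypothetical.

```python
from itertools import combinations

def node_score(w, f):
    """Assumed linear node score: weight vector w dotted with feature vector f."""
    return sum(wi * fi for wi, fi in zip(w, f))

def total_objective(labels, features, w, relatedness):
    """labels: dict spot -> chosen entity
    features: dict (spot, entity) -> feature vector
    relatedness: function (entity, entity) -> float
    """
    spots = list(labels)
    # Average node score over spots (normalization is an assumption).
    node = sum(node_score(w, features[(s, labels[s])]) for s in spots)
    node /= max(len(spots), 1)
    # Average pairwise relatedness over all pairs of chosen entities.
    pairs = list(combinations(spots, 2))
    clique = sum(relatedness(labels[s], labels[t]) for s, t in pairs)
    clique /= max(len(pairs), 1)
    return node + clique

labels = {"s1": "A", "s2": "B"}
features = {("s1", "A"): [1.0, 0.0], ("s2", "B"): [0.0, 2.0]}
print(total_objective(labels, features, [0.5, 0.25], lambda a, b: 0.2))
```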
BACK-OFF METHOD
Not all spots in a web page need to be tagged.
A special tag “NA” is used for spots that cannot be tagged.
Spots in the web page marked “NA” do not contribute to the clique potential.
A factor called “RNA” controls the aggressiveness of the tagging algorithm.
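A simplified view of the back-off decision, ignoring the clique term: a spot keeps its best candidate only if that candidate's node score beats the RNA value, and otherwise falls back to “NA”. This local-only rule is an assumption made for illustration, and the function name and scores are hypothetical.

```python
def label_with_backoff(candidate_scores, rna):
    """candidate_scores: dict entity -> node score for a single spot.
    Returns the best entity, or "NA" when no candidate beats rna.
    """
    if not candidate_scores:
        return "NA"
    best = max(candidate_scores, key=candidate_scores.get)
    return best if candidate_scores[best] > rna else "NA"

scores = {"Michael Jordan (basketball)": 0.8, "Michael I. Jordan": 0.3}
print(label_with_backoff(scores, rna=0.5))  # keeps the basketball player
print(label_with_backoff(scores, rna=0.9))  # backs off to NA
```

In this sketch, raising RNA leaves more spots untagged, which trades recall for precision.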
IMPLEMENTATION
Integer linear program (ILP) based formulation:
  Casting as a 0/1 integer linear program
  Relaxing it to an LP
Simpler heuristics:
  Hill climbing for optimization
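The hill-climbing heuristic can be sketched as a greedy local search: start from the locally best label for each spot, then repeatedly change one spot's label whenever that raises the objective. The code below is a minimal sketch under assumptions, not the authors' algorithm: the objective is an unnormalized sum of node scores and pairwise relatedness, “NA” handling is omitted, and every name and number is hypothetical.

```python
def hill_climb(candidates, node, relatedness, max_rounds=100):
    """candidates: dict spot -> non-empty list of candidate entities
    node: dict (spot, entity) -> node score
    relatedness: function (entity, entity) -> float
    """
    def objective(labels):
        spots = list(labels)
        total = sum(node[(s, labels[s])] for s in spots)
        for i, s in enumerate(spots):
            for t in spots[i + 1:]:
                total += relatedness(labels[s], labels[t])
        return total

    # Initial assignment: best node score per spot, ignoring other spots.
    labels = {s: max(c, key=lambda e: node[(s, e)]) for s, c in candidates.items()}
    best = objective(labels)
    for _ in range(max_rounds):
        improved = False
        for s, cands in candidates.items():
            for e in cands:
                if e == labels[s]:
                    continue
                trial = dict(labels)
                trial[s] = e
                score = objective(trial)
                if score > best:  # accept any single-spot change that helps
                    labels, best, improved = trial, score, True
        if not improved:
            break  # local optimum reached
    return labels, best

candidates = {"Jordan": ["Jordan (country)", "Michael Jordan"],
              "Bulls": ["Chicago Bulls", "Bull"]}
node = {("Jordan", "Jordan (country)"): 0.6, ("Jordan", "Michael Jordan"): 0.5,
        ("Bulls", "Chicago Bulls"): 0.5, ("Bulls", "Bull"): 0.4}

def rel(a, b):
    return 0.9 if {a, b} == {"Michael Jordan", "Chicago Bulls"} else 0.0

print(hill_climb(candidates, node, rel))
```

In the toy data, the spot "Jordan" locally prefers the country, but the relatedness to "Chicago Bulls" moves it to the basketball player, which is the effect collective disambiguation is meant to capture.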
EVALUATING THE ALGORITHM
Evaluation measures used:
Precision: number of spots tagged correctly out of the total number of spots tagged.
Recall: number of spots tagged correctly out of the total number of spots in the ground truth.
F1: the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall).
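The three measures follow directly from these definitions. In this small sketch, the dictionary representation of annotations (spot mapped to entity, with untagged spots simply absent) and the function name are assumptions made for illustration.

```python
def precision_recall_f1(predicted, truth):
    """predicted, truth: dict spot -> entity; untagged spots are absent from predicted."""
    correct = sum(1 for s, e in predicted.items() if truth.get(s) == e)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

truth = {"a": "A", "b": "B", "c": "C", "d": "D", "e": "E", "f": "F"}
predicted = {"a": "A", "b": "B", "c": "C", "d": "X"}  # 4 tagged, 3 correct
print(precision_recall_f1(predicted, truth))
```

A tagger that annotates sparingly, as in this toy case, can have high precision (3 of 4) while recall stays low (3 of 6).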
DATASETS USED FOR EVALUATION
The authors use web pages crawled and stored in the IITB database.
Publicly available data from Cucerzan’s experiments (CZ)
NAMED ENTITY DISAMBIGUATION IN WIKIPEDIA
The name ambiguity problem has created a demand for efficient, high-quality disambiguation methods.
This is not a trivial task: the application must be able to decide whether a group of name occurrences belongs to the same entity.
Traditional methods of named entity disambiguation use the Bag of Words (BOW) model.
WIKIPEDIA AS A SEMANTIC NETWORK
Wikipedia is an open database covering most of the useful topics in the world.
The title of a Wikipedia article describes the content of the article.
Titles may sometimes be noisy. Noisy titles are filtered using rules from Hu et al.
SEMANTIC RELATIONS BETWEEN WIKIPEDIA CONCEPTS
Wikipedia pages contain rich relational structure.
The relatedness is represented by links between the Wikipedia pages.
WORKING OF NAMED ENTITY DISAMBIGUATION USING WIKIPEDIA
Each Wikipedia entity is represented as a vector.
The similarity between vectors is measured to disambiguate named entities.
MEASURING SIMILARITY BETWEEN TWO WIKIPEDIA ENTITIES
The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia.
The algorithm works in three steps, described as follows.
STEP 1
To measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined.
Semantic relations between articles are used to match the articles.
STEP 2
Compute the semantic relatedness from one concept vector representation to another.
Using the alignments from the previous step:
SR(MJ1→MJ2) = (0.42×0.47×0.54 + 0.54×0.51×0.66 + 0.51×0.51×0.65) / (0.42×0.47 + 0.54×0.51 + 0.51×0.51) = 0.62
SR(MJ2→MJ1) = (0.47×0.42×0.54 + 0.52×0.54×0.58 + 0.52×0.51×0.60 + 0.51×0.54×0.66) / (0.47×0.42 + 0.52×0.54 + 0.52×0.51 + 0.51×0.54) = 0.60
STEP 3
Compute the similarity between two concept vector representations.
Similarity SIM(MJ1, MJ2) is computed as (0.60 + 0.62)/2 = 0.61, SIM(MJ2, MJ3) is computed as 0.10 and SIM(MJ1, MJ3) is computed as 0.0.
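The arithmetic in Steps 2 and 3 can be checked with a few lines of code. The reading of each term is an assumption: every aligned pair is taken as a triple (weight in the source vector, weight in the target vector, relatedness of the two concepts), SR is the weight-product-weighted average of the relatedness values, and SIM is the mean of the two directions. The function name is hypothetical; the numbers are the ones on the slide.

```python
def semantic_relatedness(aligned):
    """aligned: list of (w_source, w_target, relatedness) triples for matched concepts."""
    num = sum(ws * wt * sr for ws, wt, sr in aligned)
    den = sum(ws * wt for ws, wt, _ in aligned)
    return num / den if den else 0.0

mj1_to_mj2 = [(0.42, 0.47, 0.54), (0.54, 0.51, 0.66), (0.51, 0.51, 0.65)]
mj2_to_mj1 = [(0.47, 0.42, 0.54), (0.52, 0.54, 0.58),
              (0.52, 0.51, 0.60), (0.51, 0.54, 0.66)]

sr_12 = semantic_relatedness(mj1_to_mj2)
sr_21 = semantic_relatedness(mj2_to_mj1)
sim = (sr_12 + sr_21) / 2  # Step 3: symmetric average of both directions

print(f"{sr_12:.2f} {sr_21:.2f} {sim:.2f}")  # prints: 0.62 0.60 0.61
```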