Collective annotation of wikipedia entities in web text

COLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT

- Presented by Avinash S Bharadwaj (1000663882)

ABSTRACT The aim of the paper

Annotation of open-domain, unstructured web text with uniquely identified entities from a social-media resource such as Wikipedia.

Use of annotations for search and mining tasks

WHAT IS ENTITY DISAMBIGUATION?

An entity is something that is real and has a distinct existence.

Wikipedia articles can be considered as entities.

Entity disambiguation is the task of resolving the correspondence between mentions of entities in natural language and real-world entities.

In this paper, disambiguation is carried out between annotated mentions in web pages and Wikipedia articles.

ENTITY DISAMBIGUATION EXAMPLE

PREVIOUS WORK IN DISAMBIGUATION

SemTag:

First web-scale disambiguation system.

Annotated about 250 million web pages with IDs from the Stanford TAP.

SemTag preferred high precision over recall, with an average of two annotations per page.

Wikify!:

Wikify! performed both keyword extraction and disambiguation.

Wikify! could not achieve collective disambiguation across spots.

Milne and Witten (M&W):

A form of collective disambiguation that gives better results than Wikify!

M&W achieves an F1 measure of 0.83, unlike Wikify!, which has an F1 measure of 0.53.

Cucerzan’s algorithm:

Each entity is represented as a high-dimensional feature vector.

Cucerzan annotates sparingly: about 4.5% of all possible tokens are annotated.

TERMINOLOGIES

Spots: occurrences of text on a page that can possibly be linked to a Wikipedia article

Attachment: possible entities in Wikipedia to which a spot can be linked

Annotation: process of making an attachment to spots on a page

Gamma list: list of all possible annotations

TERMINOLOGIES ILLUSTRATED

(Figure labels: Spots, Attachment, Gamma list)

COLLECTIVE ENTITY DISAMBIGUATION

Sometimes disambiguation cannot be carried out using a single spot on a page.

Multiple spots on a page are required to disambiguate an entity.

All spots in an article are considered to be related.

COLLECTIVE ENTITY DISAMBIGUATION EXAMPLE

CALCULATING RELATEDNESS BETWEEN WIKIPEDIA ENTITIES

Relatedness between two entities is defined as r(γ, γ’) = g(γ) · g(γ’).

Cucerzan’s proposal defined relatedness between entities based on the cosine measure.

Milne et al.’s proposal: c = number of Wikipedia pages; g(γ)[p] = 1 if page p links to page γ, 0 otherwise.
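A link-based relatedness of this kind can be sketched in a few lines. The formula used here is the commonly cited form of the Milne and Witten measure, which is an assumption since the slide does not show it. The names `mw_relatedness`, `inlinks_a`, `inlinks_b` and `total_pages` (the slide's c) are hypothetical.

```python
import math

def mw_relatedness(inlinks_a, inlinks_b, total_pages):
    """Link-based relatedness in the style of Milne and Witten (assumed form).

    inlinks_a, inlinks_b: sets of page ids p with g(entity)[p] = 1
    total_pages: c, the number of Wikipedia pages
    """
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0  # no shared in-links: treat the entities as unrelated
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    denominator = math.log(total_pages) - math.log(smaller)
    if denominator == 0:
        return 1.0  # degenerate case: every page links to both entities
    distance = (math.log(larger) - math.log(len(common))) / denominator
    return max(0.0, 1.0 - distance)
```

Two entities whose in-link sets are identical get a relatedness of 1.0, and entities with no shared in-links get 0.0.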

CONTRIBUTIONS OF THIS PAPER

The paper proposes posing entity disambiguation as an optimization problem.

The paper provides a single optimization objective, solved either using integer linear programs or using heuristics for approximate solutions.

The paper also describes rich node features with systematic learning.

The paper also describes a back-off strategy for controlled annotations.

MODELING COMPATIBILITY BETWEEN WIKIPEDIA ARTICLES

Entities are modeled using a feature vector defined as fs(γ).

The feature vector expresses local textual compatibility between (context of) spot s and candidate label γ.

Components of the feature vector:

Spot side: context of the spot

Wikipedia side: snippet, full text, anchor text, anchor text with context

Similarity measures: dot product, cosine similarity, Jaccard similarity
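The three similarity measures can be illustrated on bag-of-words counts. This is a minimal sketch: the example strings and the names `spot_context` and `anchor_text` are made up for illustration, and the paper's actual tokenization and weighting are not shown in the slides.

```python
import math
from collections import Counter

def dot(a, b):
    # Sum of products over terms present in both bags
    return sum(a[t] * b[t] for t in a if t in b)

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def jaccard(a, b):
    # Set overlap of the vocabularies, ignoring term counts
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

# Spot side: words around the spot. Wikipedia side: e.g. anchor text.
spot_context = Counter("jordan scored for the bulls".split())
anchor_text = Counter("chicago bulls guard jordan".split())

# One (spot side, Wikipedia side) pairing yields three components of fs(γ)
features = [
    dot(spot_context, anchor_text),
    cosine(spot_context, anchor_text),
    jaccard(spot_context, anchor_text),
]
```

Repeating this for each Wikipedia-side field (snippet, full text, anchor text, anchor text with context) would give one group of components per field.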

METHODS FOR EVALUATING THE MODEL

The authors use two ways of evaluating the model: node score and clique score.

Node score: defined by a function of the feature vector and W, where W is a weight vector learned from training data using a linear adaptation of rank SVM.

Clique score: uses the relatedness measure of Milne and Witten.

Total objective: combines the node scores and the clique score.
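How node and clique scores might combine into one objective can be sketched as follows. The exact formula is not preserved in the slides, so this assumes the node score is a dot product of a weight vector with the feature vector, and that the total is the mean node score plus the mean pairwise relatedness. All names (`node_score`, `total_objective`, `assignment`) are hypothetical.

```python
from itertools import combinations

def node_score(w, features):
    # Assumed form: linear score of the feature vector under weights w
    return sum(wi * fi for wi, fi in zip(w, features))

def total_objective(assignment, w, features, relatedness):
    """assignment: dict spot -> entity
    features: dict (spot, entity) -> feature list
    relatedness: callable(entity, entity) -> float
    """
    spots = list(assignment)
    if not spots:
        return 0.0
    node = sum(node_score(w, features[(s, assignment[s])]) for s in spots)
    node /= len(spots)
    pairs = list(combinations(spots, 2))
    clique = 0.0
    if pairs:
        clique = sum(relatedness(assignment[a], assignment[b]) for a, b in pairs)
        clique /= len(pairs)
    return node + clique
```

Averaging both terms keeps pages with many spots from being dominated by the quadratic number of pairs. That normalization is a design choice of this sketch.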

BACK-OFF METHOD

Not all spots in a web page may be tagged.

Uses a special tag “NA” for spots that can’t be tagged.

Spots in the web page marked “NA” will not contribute to the clique potential.

A factor called “RNA” defines the aggressiveness of the tagging algorithm.
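The effect of the back-off factor can be shown with a simplified, per-spot rule. This is a sketch under the assumption that RNA acts as a score threshold below which a spot is left as "NA". In the paper the decision is made jointly with the other spots, so this only illustrates the trade-off. `tag_with_backoff` and `candidate_scores` are hypothetical names.

```python
NA = "NA"

def tag_with_backoff(candidate_scores, rna):
    """candidate_scores: dict entity -> node score for a single spot.

    Returns the best-scoring entity, or NA when no candidate beats rna.
    """
    if not candidate_scores:
        return NA
    best = max(candidate_scores, key=candidate_scores.get)
    return best if candidate_scores[best] > rna else NA

scores = {"Michael Jordan (basketball)": 0.7, "Michael I. Jordan": 0.4}
cautious = tag_with_backoff(scores, rna=0.8)    # high RNA: fewer annotations
aggressive = tag_with_backoff(scores, rna=0.5)  # low RNA: more annotations
```

Raising RNA trades recall for precision, since fewer spots are tagged.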

IMPLEMENTATION

Integer linear program (ILP) based formulation: casting as a 0/1 integer linear program, then relaxing it to an LP

Simpler heuristics: hill climbing for optimization
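The hill-climbing heuristic can be sketched generically: start from some labeling, change one spot at a time, and keep any change that raises the objective. This is an illustrative local search, not the authors' exact procedure. `hill_climb`, `candidates` and `score` are hypothetical names, and every spot is assumed to have at least one candidate label.

```python
def hill_climb(candidates, score, max_rounds=100):
    """candidates: dict spot -> non-empty list of candidate labels
    score: callable(assignment dict) -> float, the objective to maximize
    """
    # Start from the first candidate of every spot
    assignment = {spot: labels[0] for spot, labels in candidates.items()}
    best = score(assignment)
    for _ in range(max_rounds):
        improved = False
        for spot, labels in candidates.items():
            for label in labels:
                if label == assignment[spot]:
                    continue
                trial = dict(assignment)
                trial[spot] = label
                value = score(trial)
                if value > best:  # accept only strict improvements
                    assignment, best, improved = trial, value, True
        if not improved:
            break  # local optimum: no single change helps
    return assignment, best
```

The result is a local optimum only, which is the price paid for avoiding the cost of solving the ILP exactly.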

EVALUATING THE ALGORITHM

Evaluation measures used:

Precision: number of spots tagged correctly out of the total number of spots tagged

Recall: number of spots tagged correctly out of the total number of spots in the ground truth

F1: F1 = 2 × Precision × Recall / (Precision + Recall)
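The three measures can be computed directly from the definitions above. In this sketch, annotations are assumed to be stored as dictionaries from spot to entity. `precision_recall_f1`, `predicted` and `truth` are hypothetical names.

```python
def precision_recall_f1(predicted, truth):
    """predicted: dict spot -> entity for the spots the system tagged
    truth: dict spot -> entity for the spots in the ground truth
    """
    correct = sum(1 for spot, entity in predicted.items()
                  if truth.get(spot) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

predicted = {"s1": "A", "s2": "B", "s3": "C"}
truth = {"s1": "A", "s2": "X", "s3": "C", "s4": "D"}
p, r, f1 = precision_recall_f1(predicted, truth)
```

Here two of three tagged spots are correct and two of four ground-truth spots are recovered.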

DATASETS USED FOR EVALUATION

The authors use web pages crawled and stored in the IITB database.

Publicly available data from Cucerzan’s experiments (CZ)

EXPERIMENTAL RESULTS

NAMED ENTITY DISAMBIGUATION IN WIKIPEDIA

The name ambiguity problem has resulted in a demand for efficient, high-quality disambiguation methods.

This is not a trivial task: the application should be capable of deciding whether a group of name occurrences belongs to the same entity.

Traditional methods of named entity disambiguation use the bag-of-words (BOW) method.

WIKIPEDIA AS A SEMANTIC NETWORK

Wikipedia is an open database covering most of the useful topics in the world.

The title of a Wikipedia article describes the content within the article.

The title may sometimes be noisy. Noisy titles are filtered using rules from Hu et al.

SEMANTIC RELATIONS BETWEEN WIKIPEDIA CONCEPTS

Wikipedia contains rich relation structures within each page.

The relatedness is represented by links between the Wikipedia pages.

WORKING OF NAMED ENTITY DISAMBIGUATION USING WIKIPEDIA

Uses vectors to represent a Wikipedia entity.

Similarity between each pair of vectors is measured for named entity disambiguation.

MEASURING SIMILARITY BETWEEN TWO WIKIPEDIA ENTITIES

The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia.

The algorithm works in three steps, described as follows.

STEP 1

In order to measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined.

Semantic relations between articles are used to match the articles.

STEP 2

Compute the semantic relatedness from one concept vector representation to another.

Using the alignments shown in the previous step, SR(MJ1→MJ2) is computed as

(0.42×0.47×0.54 + 0.54×0.51×0.66 + 0.51×0.51×0.65) / (0.42×0.47 + 0.54×0.51 + 0.51×0.51) = 0.62,

and SR(MJ2→MJ1) is computed as

(0.47×0.42×0.54 + 0.52×0.54×0.58 + 0.52×0.51×0.60 + 0.51×0.54×0.66) / (0.47×0.42 + 0.52×0.54 + 0.52×0.51 + 0.51×0.54) = 0.60.
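The arithmetic above is a weighted average and can be checked in a few lines. The sketch reads each triple as (weight of the concept in the source vector, weight of the aligned concept in the target vector, relatedness of the two concepts). That reading is inferred from the shape of the numbers and is an assumption. `directed_relatedness` is a hypothetical name.

```python
def directed_relatedness(alignments):
    """alignments: list of (source_weight, target_weight, relatedness).

    Returns the relatedness values averaged with weights
    source_weight * target_weight.
    """
    numerator = sum(ws * wt * sr for ws, wt, sr in alignments)
    denominator = sum(ws * wt for ws, wt, _ in alignments)
    return numerator / denominator if denominator else 0.0

mj1_to_mj2 = [(0.42, 0.47, 0.54), (0.54, 0.51, 0.66), (0.51, 0.51, 0.65)]
mj2_to_mj1 = [(0.47, 0.42, 0.54), (0.52, 0.54, 0.58),
              (0.52, 0.51, 0.60), (0.51, 0.54, 0.66)]

sr_12 = directed_relatedness(mj1_to_mj2)  # about 0.62
sr_21 = directed_relatedness(mj2_to_mj1)  # about 0.60
```

The two directions differ because each concept is aligned to its best match in the other vector, so the sets of aligned pairs are not the same.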

STEP 3

Compute the similarity between two concept vector representations.

Similarity SIM(MJ1, MJ2) is computed as (0.60 + 0.62)/2 = 0.61, SIM(MJ2, MJ3) is computed as 0.10, and SIM(MJ1, MJ3) is computed as 0.0.

RESULTS

QUESTIONS ????
