Wikipedia Knowledge Extraction. Pronoun Resolution module Infobox extraction SRL parsing ...

Post on 12-Jan-2016


Wikipedia Knowledge Extraction

◦ Pronoun Resolution module
◦ Infobox extraction
◦ SRL parsing
◦ Improved refinement
◦ Clustering
◦ Hadoop compatibility

“His mother wanted him to get a good education so she sent him to live with his grandparents in Honolulu, HI” (Barack Obama)


Current solution: replace pronouns with article title (very primitive)

Target solution:
◦ No existing system resolves pronouns reliably in the general case
◦ Use an existing system that is usually correct?
◦ Simple rules for common patterns?
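The current solution can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the pronoun tables and the function name `replace_pronouns` are assumptions made for the example.

```python
import re

# Hypothetical pronoun tables (assumption): every listed pronoun is treated
# as referring to the article subject, which is what makes this primitive.
SUBJECT_OBJECT = {"he", "him", "she", "her", "it", "they", "them"}
POSSESSIVE = {"his", "hers", "its", "their"}


def replace_pronouns(sentence: str, title: str) -> str:
    """Replace every pronoun in `sentence` with the article title."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        lower = word.lower()
        if lower in POSSESSIVE:
            return title + "'s"
        if lower in SUBJECT_OBJECT:
            return title
        return word

    # Visit each alphabetic token and swap it if it is a known pronoun.
    return re.sub(r"[A-Za-z]+", swap, sentence)


sentence = ("His mother wanted him to get a good education so she sent him "
            "to live with his grandparents in Honolulu, HI")
print(replace_pronouns(sentence, "Barack Obama"))
```

On the example sentence, "she" (the mother) is also replaced with "Barack Obama", which shows why this approach is considered very primitive.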

Convert information into simple sentences:
◦ Joe Biden is Barack Obama’s Vice President
◦ Barack Obama is preceded by George W. Bush

Use the type of phrase (Noun Phrase, Verb Phrase) to determine which sentence to form.
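One way to picture the phrase-type rule is the sketch below. The `PHRASE_TYPE` table is a hand-written stand-in (an assumption) for whatever tagger or chunker would classify the field name, and `infobox_sentence` is a hypothetical helper name.

```python
# Hypothetical lookup from infobox field name to phrase type (assumption).
# A real system would likely obtain this from a part-of-speech tagger.
PHRASE_TYPE = {
    "vice president": "NP",  # noun phrase
    "preceded by": "VP",     # verb phrase
}


def infobox_sentence(title: str, field: str, value: str) -> str:
    """Turn one infobox (field, value) pair into a simple sentence."""
    kind = PHRASE_TYPE.get(field.lower())
    if kind == "NP":
        # Noun-phrase field: the value holds a role belonging to the subject.
        return f"{value} is {title}'s {field}"
    if kind == "VP":
        # Verb-phrase field: the subject stands in that relation to the value.
        return f"{title} is {field} {value}"
    raise ValueError(f"unknown phrase type for field: {field!r}")


print(infobox_sentence("Barack Obama", "Vice President", "Joe Biden"))
print(infobox_sentence("Barack Obama", "preceded by", "George W. Bush"))
```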

Read papers from Turing Center (University of Washington)

SRL parsing performs a deep analysis on each sentence.

E.g. “Yoshi has a long tongue which he uses to grab enemies and eat them.”
◦ has (A0: Yoshi, A1: long tongue)
◦ use (A0: Yoshi, A1: long tongue, A2: grab enemies and eat them)

Use SRL parsing to improve quality and representation of knowledge.

Problem: speed and complexity
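The frames from the Yoshi example could be represented roughly as follows. This is only a sketch: the `Frame` class and `to_triple` helper are hypothetical names, and the frames are written out by hand rather than produced by an actual SRL parser.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Frame:
    """One predicate with its labelled arguments (A0, A1, A2, ...)."""
    predicate: str
    args: Dict[str, str]


def to_triple(frame: Frame) -> Optional[Tuple[str, str, str]]:
    """Reduce a frame to (subject, verb, object); None if A0 or A1 is missing."""
    a0 = frame.args.get("A0")
    a1 = frame.args.get("A1")
    if a0 is None or a1 is None:
        return None
    return (a0, frame.predicate, a1)


# Hand-written frames copied from the slide example (not parser output).
frames = [
    Frame("has", {"A0": "Yoshi", "A1": "long tongue"}),
    Frame("use", {"A0": "Yoshi", "A1": "long tongue",
                  "A2": "grab enemies and eat them"}),
]

for frame in frames:
    print(to_triple(frame), frame.args.get("A2"))
```

Keeping the extra roles (such as A2) next to the basic triple is what would let the richer representation coexist with simple Subject, Verb, Object tuples.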

Current system has Subject, Object, Verb tuples

Problem: hard to define which words to incorporate in each phrase

E.g. “The dog (Canis lupus familiaris)”, “is”, “a mammal from the family Canidae”
◦ The dog? dog? The dog (Canis lupus familiaris)?
◦ a mammal? a mammal from the family Canidae?

Possible solutions:
◦ Different levels of information?
◦ Simple rules based on part-of-speech tags?
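The "different levels of information" idea might look like the sketch below, which returns each phrase at several levels of detail. The word lists stand in for real part-of-speech tags, and `phrase_levels` is a hypothetical name; both are assumptions for illustration.

```python
import re
from typing import List

# Tiny word lists standing in for real part-of-speech tags (assumption).
DETERMINERS = {"the", "a", "an"}
PREPOSITIONS = {"from", "of", "in", "on", "with"}


def phrase_levels(phrase: str) -> List[str]:
    """Return the phrase from most detailed to least detailed."""
    full = " ".join(phrase.split())
    # Level 2: drop parenthetical material such as a scientific name.
    no_paren = " ".join(re.sub(r"\([^)]*\)", " ", full).split())
    # Level 3: cut everything from the first preposition onward.
    words = no_paren.split()
    for i, word in enumerate(words):
        if i > 0 and word.lower() in PREPOSITIONS:
            words = words[:i]
            break
    core = " ".join(words)
    # Level 4: drop a leading determiner.
    if words and words[0].lower() in DETERMINERS:
        words = words[1:]
    head = " ".join(words)
    levels: List[str] = []
    for candidate in (full, no_paren, core, head):
        if candidate and candidate not in levels:
            levels.append(candidate)
    return levels


print(phrase_levels("The dog ( Canis lupus familiaris )"))
print(phrase_levels("a mammal from the family Canidae"))
```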

Idea: Determine whether two separate mentions point to the same concept
◦ ‘The dog’, ‘a dog’, ‘dogs’
◦ ‘Cats’, ‘C.A.T.S’, ‘CAT Scan’
◦ ‘President Obama’, ‘President Barack Obama’

Possible solutions:
◦ Feature-based classification
◦ Self-organizing map
◦ Terms associated
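As a rough illustration of the feature-based option, the sketch below computes a few pairwise features for two mentions and applies a hand-written rule in place of a trained classifier. The features, the naive plural stripping, and the decision rule are all assumptions chosen for the example.

```python
from typing import Dict, List

STOPWORDS = {"the", "a", "an"}


def is_acronym(word: str) -> bool:
    """True for tokens such as 'CAT' or 'C.A.T.S'."""
    letters = word.replace(".", "")
    return len(letters) > 1 and letters.isupper()


def normalize(mention: str) -> List[str]:
    """Lowercase, drop determiners, and naively strip a plural 's'."""
    tokens = []
    for word in mention.split():
        token = word.replace(".", "").lower()
        if token in STOPWORDS:
            continue
        if len(token) > 3 and token.endswith("s") and not is_acronym(word):
            token = token[:-1]  # naive singularization, wrong for many words
        tokens.append(token)
    return tokens


def features(a: str, b: str) -> Dict[str, bool]:
    ta, tb = normalize(a), normalize(b)
    has_acronym_a = any(is_acronym(w) for w in a.split())
    has_acronym_b = any(is_acronym(w) for w in b.split())
    return {
        "same_tokens": ta == tb,
        "subset": set(ta) <= set(tb) or set(tb) <= set(ta),
        "acronym_mismatch": has_acronym_a != has_acronym_b,
    }


def same_concept(a: str, b: str) -> bool:
    """Hand-written rule standing in for a trained classifier (assumption)."""
    f = features(a, b)
    return not f["acronym_mismatch"] and (f["same_tokens"] or f["subset"])


print(same_concept("The dog", "dogs"))
print(same_concept("Cats", "CAT Scan"))
print(same_concept("President Obama", "President Barack Obama"))
```

In a learned setting, the dictionary returned by `features` would be the input to the classifier, and the rule in `same_concept` would be replaced by its prediction.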

Need to ensure the system can scale for the move to regular Wikipedia

Hadoop is an open-source implementation of the MapReduce programming model

MapReduce parallelizes a computation by splitting its input across several machines (map) and then combining the partial results by key (reduce)
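The map, shuffle, and reduce steps can be imitated in a single Python process to show the data flow. This sketch does not use Hadoop or any Hadoop API; the function names and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, List, Tuple


def map_phase(sentence: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in sentence.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def run(documents: List[str]) -> Dict[str, int]:
    # On a cluster, each map_phase call could run on a different machine.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


print(run(["the dog is a mammal", "the cat is a mammal"]))
```

Because each `map_phase` call depends only on its own record, and each `reduce_phase` call only on one key, the work can be spread over many machines without changing the result.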